The core idea is to use a regular expression to extract the directory and file names from a page's HTML, then apply the same extraction to each sub-directory, searching recursively. In the end, every content file on the site is downloaded.
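The extraction-and-classify step can be sketched in isolation first as a pure function, which makes it easy to test against a sample listing. This is a Python 3 sketch; the function name, the skip rules for navigation links, and the sample HTML are illustrative assumptions, not part of the original script.

```python
import re

def classify_links(html, base_url):
    """Split <a href="..."> targets on a directory-listing page into
    sub-directories (trailing '/') and plain files.

    Returns (dirs, files) as lists of absolute URLs. Navigation links
    (parent directory, sort-order query strings) are skipped so a
    recursive crawl cannot loop back upward.
    """
    dirs, files = [], []
    for href in re.findall(r'<a href="([^"]+)">', html):
        if href.startswith(('/', '?', '..')):
            continue  # parent-directory or sort links: would cause loops
        if href.endswith('/'):
            dirs.append(base_url + href)   # sub-directory to recurse into
        else:
            files.append(base_url + href)  # file to download
    return dirs, files

listing = '<a href="../">Parent</a> <a href="lib/">lib/</a> <a href="app.js">app.js</a>'
print(classify_links(listing, 'http://example.com/js/'))
# → (['http://example.com/js/lib/'], ['http://example.com/js/app.js'])
```

The `[^"]+` pattern stops at the closing quote, which avoids the over-greedy matching a plain `.*` would produce when several links share one line.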
# -*- coding: utf-8 -*-
# Python 2 script: recursively crawl an auto-indexed directory listing
# and download every file it links to.
import urllib
import re
import os

path = []  # collected absolute URLs of files to download

def extract(url):
    """Recursively collect file URLs from a directory-listing page."""
    content = urllib.urlopen(url).read()
    # Match the target of every <a href="..."> link on the page.
    reg = r'<a href="([^"]+)">'
    url_re = re.compile(reg)
    url_lst = re.findall(url_re, content)
    for lst in url_lst:
        # Skip parent-directory and sort-order links to avoid
        # recursing back up the tree forever.
        if lst.startswith('/') or lst.startswith('?') or lst.startswith('..'):
            continue
        if lst.endswith('/'):       # a sub-directory: recurse into it
            extract(url + lst)
        else:                       # a file: remember its full URL
            path.append(url + lst)

print "downloading with urllib"
url = 'http://139.196.233.65/js/'
extract(url)

filePath = 'E:/6-學(xué)習(xí)文檔/91-JS/Download/js'
filePath = unicode(filePath, 'utf8')  # Windows path with non-ASCII characters
for p in path:
    fileTitle = p.split('/js')[-1]    # path of the file relative to /js/
    file = filePath + fileTitle
    dir = os.path.dirname(file)
    if not os.path.exists(dir):       # create the local directory tree first
        os.makedirs(dir)
    urllib.urlretrieve(p, file)       # fetch the file to disk
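The script above targets Python 2. On Python 3 the same recursive crawl can be sketched with `urllib.request`; here the `fetch` parameter is an assumption added for testability (it lets the crawl run against canned pages instead of the network), and `mirror` is an illustrative helper, not part of the original.

```python
import os
import re
import urllib.request

def crawl(url, fetch=lambda u: urllib.request.urlopen(u).read().decode('utf-8', 'replace')):
    """Recursively collect file URLs below `url` (a directory listing).

    `fetch` maps a URL to its HTML text; injectable so the crawl can be
    exercised without network access.
    """
    found = []
    for href in re.findall(r'<a href="([^"]+)">', fetch(url)):
        if href.startswith(('/', '?', '..')):
            continue  # skip parent-directory and sort links
        if href.endswith('/'):
            found.extend(crawl(url + href, fetch))  # descend into sub-directory
        else:
            found.append(url + href)
    return found

def mirror(urls, base_url, dest):
    """Download each URL under `dest`, recreating the remote tree."""
    for u in urls:
        rel = u[len(base_url):]                      # path relative to the root
        target = os.path.join(dest, *rel.split('/'))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        urllib.request.urlretrieve(u, target)
```

A typical run would be `mirror(crawl(root), root, 'Download/js')`; because `crawl` returns plain URL strings, the two steps can also be inspected separately.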