Web Scraping with Python (2nd Edition)
Sample website: http://example.python-scraping.com
Companion resources: https://www.epubit.com/
Chapter 1: Introduction to Web Scraping
1.1 When are web crawlers useful?
- For gathering data from the web in bulk, in a structured format (possible by hand in theory, but automation saves time and effort)
1.2 Is web scraping legal?
- Usually fine when the scraped data is for personal use and falls within fair use under copyright law
1.3 Python 3
- Tools (a minimal environment-setup sketch follows this list):
  - Anaconda
  - virtualenvwrapper (https://virtualenvwrapper.readthedocs.io/en/latest)
  - conda (https://conda.io/docs/intro.html)
- Python version: Python 3.4+
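A minimal sketch of creating an isolated environment with the tools above; the environment name wswp is only an example, not taken from the book:
$ conda create --name wswp python=3.6
$ conda activate wswp
# or, using the standard-library venv module:
$ python3 -m venv wswp
$ source wswp/bin/activate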
1.4 Background research
- Research tools:
  - robots.txt
  - sitemap
  - Google search -> WHOIS
1.4.1 Checking robots.txt
- Shows the crawling restrictions for the current site
- Can reveal clues about the site's structure
- See http://robotstxt.org for details (a parsing sketch follows this list)
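A minimal sketch of checking those restrictions programmatically with the standard-library urllib.robotparser; the user agent string 'wswp' is borrowed from the download examples later in these notes:
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://example.python-scraping.com/robots.txt')
rp.read()
# check whether our user agent may fetch a given URL
print(rp.can_fetch('wswp', 'http://example.python-scraping.com/places/default/view/Afghanistan-1'))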
1.4.2 Checking the sitemap
- Helps a crawler locate the site's newest content without crawling every page
- The sitemap standard is defined at http://www.sitemap.org/protocol.html
1.4.3 Estimating the size of a website
- The size of the target site affects how we crawl it (an efficiency question)
- Tool: https://www.google.com/advanced_search
- Adding a URL path after the domain filters the results to only part of the site (example queries below)
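The advanced-search form above boils down to Google's site: operator; for example (the /places path comes from the example site's URLs shown later in these notes):
  - site:example.python-scraping.com (rough count of indexed pages for the whole domain)
  - site:example.python-scraping.com/places (estimate restricted to the /places section)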
1.4.4 Identifying the technologies used by a website
- The detectem module (pip install detectem)
- Setup:
  - Install Docker (http://www.docker.com/products/overview)
  - bash: $ docker pull scrapinghub/splash
  - bash: $ pip install detectem
  - A Python virtual environment (https://docs.python.org/3/library/venv.html)
  - or a conda environment (https://conda.io/docs/using/envs.html)
  - See the project's README (https://github.com/spectresearch/detectem)
$ det http://example.python-scraping.com
'''
[{'name': 'jquery', 'version': '1.11.0'},
{'name': 'modernizr', 'version': '2.7.1'},
{'name': 'nginx', 'version': '1.12.2'}]
'''
$ docker pull wappalyzer/cli
$ docker run wappalyzer/cli http://example.python-scraping.com
1.4.5 Finding the owner of a website
- Use the WHOIS protocol to query the registered owner of the site's domain
- Python has a library that wraps this protocol (https://pypi.python.org/pypi/python-whois)
- Install: pip install python-whois
import whois
print(whois.whois('example.python-scraping.com'))
1.5 Writing your first web crawler
- Crawling: downloading the web pages that contain the data of interest
- There are many ways to crawl; the most suitable choice depends on the structure of the target site
- Three common approaches to crawling a site:
  - Crawling the sitemap
  - Iterating through the pages using database IDs
  - Following page links
1.5.1 Scraping versus crawling
- Scraping: targets a specific website and extracts specified information from it
- Crawling: built in a generic way, targeting a set of top-level domains or the whole web. It can be used to gather more specific information, but more commonly it crawls broadly, collecting small, general pieces of information from many sites or pages and following links on to further pages.
1.5.2 Downloading a web page
1.5.2.1 Retrying downloads
- Temporary errors are common when downloading:
  - Server overload (503 Service Unavailable): wait briefly, then retry the download
  - Page not found (404 Not Found)
  - Problems with the request (4xx): retrying will not help
  - Problems on the server side (5xx), such as the 503 overload above: worth retrying
1.5.2.2 Setting a user agent
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

# user_agent='wswp' sets the default user agent
def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    # set the user agent header
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html
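A quick usage check of the function above, using the notes' example site (output will vary):
html = download('http://example.python-scraping.com')
# prints: Downloading: http://example.python-scraping.com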
1.5.3 Sitemap crawler
- Use a regular expression to pull the URLs out of the <loc> tags in the sitemap referenced by robots.txt
# URL downloading library
import urllib.request
# regular expression library
import re
# download error classes
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here

test_url = 'http://example.python-scraping.com/sitemap.xml'
crawl_sitemap(test_url)
'''
Downloading: http://example.python-scraping.com/sitemap.xml
Downloading: http://example.python-scraping.com/places/default/view/Afghanistan-1
Downloading: http://example.python-scraping.com/places/default/view/Aland-Islands-2
Downloading: http://example.python-scraping.com/places/default/view/Albania-3
Downloading: http://example.python-scraping.com/places/default/view/Algeria-4
Downloading: http://example.python-scraping.com/places/default/view/American-Samoa-5
Downloading: http://example.python-scraping.com/places/default/view/Andorra-6
Downloading: http://example.python-scraping.com/places/default/view/Angola-7
Downloading: http://example.python-scraping.com/places/default/view/Anguilla-8
Downloading: http://example.python-scraping.com/places/default/view/Antarctica-9
Downloading: http://example.python-scraping.com/places/default/view/Antigua-and-Barbuda-10
Downloading: http://example.python-scraping.com/places/default/view/Argentina-11
...
'''
1.5.4 ID iteration crawler
import itertools
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

def crawl_site(url, max_errors=5):
    # iterate through numeric page IDs until max_errors consecutive failures
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # reached the maximum number of consecutive errors, so exit
                break
        else:
            # success - can scrape the result
            num_errors = 0

test_url2 = 'http://example.python-scraping.com/view/-'
# note from the original author: this call still has an issue and needs debugging
crawl_site(test_url2)
1.5.5 Link crawler
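A minimal sketch of the idea, reusing the download function defined above: download a page, collect its links with a regular expression, and follow those that match a pattern. The link-extraction regex and the '/(index|view)/' pattern are assumptions chosen for the example site, not necessarily the book's exact code:
import re
from urllib.parse import urljoin

def get_links(html):
    # extract the href value of every <a> tag on the page
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    return webpage_regex.findall(html)

def link_crawler(start_url, link_regex):
    # follow links that match link_regex, avoiding repeat visits
    crawl_queue = [start_url]
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue
        for link in get_links(html):
            if re.search(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)

link_crawler('http://example.python-scraping.com', '/(index|view)/')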
1.5.6 Using the requests library
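A minimal sketch of the download helper rewritten on top of requests (pip install requests); the status-code handling mirrors the urllib version above and is an assumption rather than the book's verbatim code:
import requests

def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.status_code)
            html = None
            if num_retries > 0 and 500 <= resp.status_code < 600:
                # retry 5xx HTTP errors
                return download(url, num_retries - 1)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html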
1.6 Chapter summary
Source: https://www.cnblogs.com/Mario-mj/p/11756363.html