Web Scraping with Python (2nd Edition)
Sample website: http://example.python-scraping.com
Companion resources: https://www.epubit.com/
Chapter 1: Introduction to Web Scraping
1.1 When are web crawlers useful?
- For gathering data from the web in bulk, in a structured format (possible by hand in theory, but automation saves time and effort)
1.2 Is web scraping legal?
- Usually fine when the scraped data is for personal use and falls within fair use under copyright law
1.3 Python 3
- Tools (a minimal environment-setup sketch follows this list):
  - Anaconda
  - virtualenvwrapper (https://virtualenvwrapper.readthedocs.io/en/latest)
  - conda (https://conda.io/docs/intro.html)
- Python version: Python 3.4+
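A minimal sketch of creating an isolated environment with the tools above; the environment name wswp is only an example, not taken from the book:
$ conda create --name wswp python=3.6
$ conda activate wswp
# or, using the standard-library venv module:
$ python3 -m venv wswp
$ source wswp/bin/activate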
1.4 Background research
- Research tools:
  - robots.txt
  - sitemap
  - Google search -> WHOIS
1.4.1 Checking robots.txt
- Shows the crawling restrictions for the current site
- Can reveal clues about the site's structure
- See http://robotstxt.org for details (a parsing sketch follows this list)
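A minimal sketch of checking those restrictions programmatically with the standard-library urllib.robotparser; the user agent string 'wswp' is borrowed from the download examples later in these notes:
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://example.python-scraping.com/robots.txt')
rp.read()
# check whether our user agent may fetch a given URL
print(rp.can_fetch('wswp', 'http://example.python-scraping.com/places/default/view/Afghanistan-1'))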
1.4.2 Checking the sitemap
- Helps a crawler locate the site's newest content without crawling every page
- The sitemap standard is defined at http://www.sitemap.org/protocol.html
1.4.3 Estimating the size of a website
- The size of the target site affects how we crawl it (an efficiency question)
- Tool: https://www.google.com/advanced_search
- Adding a URL path after the domain filters the results to only part of the site (example queries below)
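The advanced-search form above boils down to Google's site: operator; for example (the /places path comes from the example site's URLs shown later in these notes):
  - site:example.python-scraping.com (rough count of indexed pages for the whole domain)
  - site:example.python-scraping.com/places (estimate restricted to the /places section)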
1.4.4 Identifying the technologies used by a website
- The detectem module (pip install detectem)
- Setup:
  - Install Docker (http://www.docker.com/products/overview)
  - bash: $ docker pull scrapinghub/splash
  - bash: $ pip install detectem
  - A Python virtual environment (https://docs.python.org/3/library/venv.html)
  - or a conda environment (https://conda.io/docs/using/envs.html)
  - See the project's README (https://github.com/spectresearch/detectem)
$ det http://example.python-scraping.com
'''
[{'name': 'jquery', 'version': '1.11.0'},
{'name': 'modernizr', 'version': '2.7.1'},
{'name': 'nginx', 'version': '1.12.2'}]
'''
$ docker pull wappalyzer/cli
$ docker run wappalyzer/cli http://example.python-scraping.com
1.4.5 Finding the owner of a website
- Use the WHOIS protocol to query the registered owner of the site's domain
- Python has a library that wraps this protocol (https://pypi.python.org/pypi/python-whois)
- Install: pip install python-whois
import whois
print(whois.whois('example.python-scraping.com'))
1.5 Writing your first web crawler
- Crawling: downloading the web pages that contain the data of interest
- There are many ways to crawl; the most suitable choice depends on the structure of the target site
- Three common approaches to crawling a site:
  - Crawling the sitemap
  - Iterating through the pages using database IDs
  - Following page links
1.5.1 Scraping versus crawling
- Scraping: targets a specific website and extracts specified information from it
- Crawling: built in a generic way, targeting a set of top-level domains or the whole web. It can be used to gather more specific information, but more commonly it crawls broadly, collecting small, general pieces of information from many sites or pages and following links on to further pages.
1.5.2 Downloading a web page
1.5.2.1 Retrying downloads
- Temporary errors are common when downloading:
  - Server overload (503 Service Unavailable): wait briefly, then retry the download
  - Page not found (404 Not Found)
  - Problems with the request (4xx): retrying will not help
  - Problems on the server side (5xx), such as the 503 overload above: worth retrying
1.5.2.2 Setting a user agent
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

# user_agent='wswp' sets the default user agent
def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    # set the user agent header
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html
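A quick usage check of the function above, using the notes' example site (output will vary):
html = download('http://example.python-scraping.com')
# prints: Downloading: http://example.python-scraping.com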
1.5.3 Sitemap crawler
- Use a regular expression to pull the URLs out of the <loc> tags in the sitemap referenced by robots.txt
# URL downloading library
import urllib.request
# regular expression library
import re
# download error classes
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here

test_url = 'http://example.python-scraping.com/sitemap.xml'
crawl_sitemap(test_url)
'''
Downloading: http://example.python-scraping.com/sitemap.xml
Downloading: http://example.python-scraping.com/places/default/view/Afghanistan-1
Downloading: http://example.python-scraping.com/places/default/view/Aland-Islands-2
Downloading: http://example.python-scraping.com/places/default/view/Albania-3
Downloading: http://example.python-scraping.com/places/default/view/Algeria-4
Downloading: http://example.python-scraping.com/places/default/view/American-Samoa-5
Downloading: http://example.python-scraping.com/places/default/view/Andorra-6
Downloading: http://example.python-scraping.com/places/default/view/Angola-7
Downloading: http://example.python-scraping.com/places/default/view/Anguilla-8
Downloading: http://example.python-scraping.com/places/default/view/Antarctica-9
Downloading: http://example.python-scraping.com/places/default/view/Antigua-and-Barbuda-10
Downloading: http://example.python-scraping.com/places/default/view/Argentina-11
...
'''
1.5.4 ID iteration crawler
import itertools
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

def crawl_site(url, max_errors=5):
    # iterate through numeric page IDs until max_errors consecutive failures
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # reached the maximum number of consecutive errors, so exit
                break
        else:
            # success - can scrape the result
            num_errors = 0

test_url2 = 'http://example.python-scraping.com/view/-'
# note from the original author: this call still has an issue and needs debugging
crawl_site(test_url2)
1.5.5 Link crawler
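A minimal sketch of the idea, reusing the download function defined above: download a page, collect its links with a regular expression, and follow those that match a pattern. The link-extraction regex and the '/(index|view)/' pattern are assumptions chosen for the example site, not necessarily the book's exact code:
import re
from urllib.parse import urljoin

def get_links(html):
    # extract the href value of every <a> tag on the page
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    return webpage_regex.findall(html)

def link_crawler(start_url, link_regex):
    # follow links that match link_regex, avoiding repeat visits
    crawl_queue = [start_url]
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue
        for link in get_links(html):
            if re.search(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)

link_crawler('http://example.python-scraping.com', '/(index|view)/')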
1.5.6 Using the requests library
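A minimal sketch of the download helper rewritten on top of requests (pip install requests); the status-code handling mirrors the urllib version above and is an assumption rather than the book's verbatim code:
import requests

def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.status_code)
            html = None
            if num_retries > 0 and 500 <= resp.status_code < 600:
                # retry 5xx HTTP errors
                return download(url, num_retries - 1)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html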
1.6 Chapter summary
Source: https://www.cnblogs.com/Mario-mj/p/11756363.html