Crawlers: Anti-Crawling and Anti-Anti-Crawling (12)

Date: 2020-07-21 13:02:45

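Concepts:

Crawler: a program that fetches another party's data in bulk

Anti-crawling: techniques a site uses to keep others from crawling it

Anti-anti-crawling: techniques a crawler uses to get around anti-crawling measures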

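Goals of anti-crawling:

1) Stop crude, brute-force crawlers written by beginners
2) Stop runaway crawlers, that is, crawlers that were abandoned but never shut down
3) Protect important data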

Common anti-crawling strategies:

User-Agent filtering
Per-IP request rate limiting
Requiring login
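To make the second strategy concrete, here is a minimal sketch of per-IP rate limiting with a sliding window. It is an illustration only, not taken from the original post: the class name IpRateLimiter, its parameters, and the example IP address are all assumptions, and a real site would more likely enforce this in the web server or a shared cache.

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Hypothetical helper: allow at most max_requests per IP per window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        # `now` can be injected for testing; otherwise use a monotonic clock.
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the server would reject the request
        hits.append(now)
        return True


# Three requests are allowed per 10 seconds; the fourth is rejected,
# and a request after the window has passed is allowed again.
limiter = IpRateLimiter(max_requests=3, window_seconds=10)
results = [limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 3, 12)]
print(results)
```

A crawler that spreads its requests over many proxy IPs stays under such a per-IP limit, which is why IP proxies are a common countermeasure.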

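Approach: a crawler such as Bytespider identifies itself in its User-Agent header, so an Nginx rule can match that marker and block the rogue crawler by returning a 403 error directly.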

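1. In the /etc/nginx/conf.d directory (the site configuration path may differ depending on how Nginx was installed), create a configuration file named deny_agent.config with the content below. Because the if directive is only valid inside a server or location block, include this file from within the site's server block.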

# Block Scrapy, curl, and HttpClient (case-insensitive match)
if ($http_user_agent ~* (Scrapy|Curl|HttpClient))
{
    return 403;
}

# Block known bad User-Agents and empty ones (case-sensitive match).
# Keep the pattern on one line: a line break inside the quotes becomes part of the regex.
if ($http_user_agent ~ "Bytespider|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|FlightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" )
{
    return 403;
}

# Reject any request method other than GET, HEAD, or POST
if ($request_method !~ ^(GET|HEAD|POST)$)
{
    return 403;
}
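Before deploying rules like these, it can help to check which User-Agent strings they would catch. The sketch below mirrors the first two rules in Python. It is an approximation under stated assumptions: the function name is_blocked is hypothetical, the bot list is shortened for readability, and Python's re module stands in for the PCRE engine Nginx uses (the two agree on simple patterns like these).

```python
import re

# Case-insensitive rule, mirroring `~*` in the first Nginx block.
TOOL_PATTERN = re.compile(r"(Scrapy|Curl|HttpClient)", re.IGNORECASE)

# Case-sensitive rule, mirroring `~` in the second block (shortened list).
# `^$` matches an empty User-Agent.
BOT_PATTERN = re.compile(r"Bytespider|AhrefsBot|MJ12bot|Python-urllib|^$")


def is_blocked(user_agent):
    """Return True if the mirrored rules would answer 403 for this UA."""
    return bool(TOOL_PATTERN.search(user_agent) or BOT_PATTERN.search(user_agent))


for ua in [
    "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)",
    "curl/7.68.0",
    "",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]:
    print(repr(ua), is_blocked(ua))
```

Note that the second rule is case-sensitive, so a client sending "bytespider" in lowercase would not be matched by it.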

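UAs to watch for:

FeedDemon             content scraping
BOT/0.1 (BOT for JCE) SQL injection
CrawlDaddy            SQL injection
Java                  content scraping
Jullo                 content scraping
Feedly                content scraping
UniversalFeedParser   content scraping
ApacheBench           CC attack tool
Swiftbot              useless crawler
YandexBot             useless crawler
AhrefsBot             useless crawler
YisouSpider           useless crawler (now owned by UC's Shenma search; this spider can be allowed)
JikeSpider            useless crawler
MJ12bot               useless crawler
ZmEu                  phpMyAdmin vulnerability scanning
WinHttp               scraping, CC attacks
EasouSpider           useless crawler
HttpClient            TCP attacks
Microsoft URL Control scanning
YYSpider              useless crawler
jaunty                WordPress brute-force scanner
oBot                  useless crawler
Python-urllib         content scraping
Indy Library          scanning
FlightDeckReports Bot useless crawler
Linguee Bot           useless crawler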


Using Abuyun to get around IP limits

# Bypass IP-based anti-crawling by going through an IP proxy
import requests
from fake_useragent import UserAgent
from scrapy import Selector


def get_html(url):
    print("Downloading url: {}".format(url))
    # Proxy server
    proxyHost = "http-dyn.abuyun.com"
    proxyPort = "9020"
    # Proxy tunnel credentials (replace with your own)
    proxyUser = "H58G6G30137G865D"
    proxyPass = "043F1F63DA9899C8"

    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }

    proxies = {
        "http": proxyMeta,
        "https": proxyMeta,
    }
    # Pick one random User-Agent and send exactly the one that is printed
    ua = UserAgent()
    headers = {
        "User-Agent": ua.random
    }
    print(headers["User-Agent"])

    # The timeout keeps a dead proxy from hanging the crawl
    resp = requests.get(url, proxies=proxies, headers=headers, timeout=10)
    return resp

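# 1. Randomly picked IPs may repeat.
# 2. Too many people share the same proxy IPs.
# Question: why do proxies work, and under what conditions is an IP proxy effective?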

if __name__ == "__main__":
    for i in range(1, 30):
        job_list_url = "https://www.lagou.com/zhaopin/Python/{}/?filterOption={}".format(i, i)
        job_list_res = get_html(job_list_url)
        job_list_html = job_list_res.content.decode("utf8")
        sel = Selector(text=job_list_html)
        all_lis = sel.xpath(
            "//div[@id='s_position_list']//ul[@class='item_con_list']/li//div[@class='position']//a[1]/@href").extract()
        for url in all_lis:
            success = False
            # Retry until the job page has been downloaded and parsed
            while not success:
                try:
                    job_res = get_html(url)
                    job_html = job_res.content.decode("utf8")
                    job_sel = Selector(text=job_html)
                    print(job_html)
                    print(job_sel.xpath("//div[@class='job-name']//span[1]/text()").extract()[0])
                    success = True
                except Exception as e:
                    print("Download failed: {}".format(e))
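
The retry loop in the script keeps trying the same URL until it succeeds, which can spin forever if a page is gone or the proxy account is exhausted. One possible alternative is a bounded retry, sketched below. The helper fetch_with_retry, its parameters, and the fake fetcher used for the demo are all hypothetical and not part of the original script.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, delay_seconds=0.0):
    """Call fetch(url) until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as e:
            last_error = e
            print("attempt {} failed for {}: {}".format(attempt, url, e))
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # pause before the next attempt
    raise RuntimeError(
        "giving up on {} after {} attempts".format(url, max_attempts)
    ) from last_error


# Demo with a fake fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("proxy refused connection")
    return "<html>ok</html>"


page = fetch_with_retry(flaky_fetch, "https://example.com/job/1")
print(page)
```

In the crawler, get_html could be passed as the fetch argument, with a non-zero delay_seconds so that a retry is more likely to go out through a different proxy IP.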

Original: https://www.cnblogs.com/topass123/p/13342040.html