
Avoiding Duplicate Crawling with a Custom Scrapy Middleware in Python

python · 搞代码 · 2022-01-09

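This article shows, by example, how to write a custom Scrapy middleware in Python that avoids crawling the same item pages more than once. It is shared here for reference. The code is as follows: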

# Written against an older Scrapy API (scrapy.log, BaseItem);
# later Scrapy releases use Python's standard logging module instead.
from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from myproject.items import MyItem


class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they
    were already visited before.

    The requests to be filtered have a meta['filter_visited']
    flag enabled and optionally define an id to use
    for identifying them, which defaults to the request fingerprint,
    although you'd want to use the item id,
    if you already have it beforehand, to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        # Visited ids are stored in spider.context. If the spider has no
        # `context` attribute, the fallback dict is not kept between calls.
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                # Only requests that opt in via meta are checked.
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                # An item was scraped: remember the page that produced it.
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                # Replace the repeated request with a placeholder item.
                ret.append(MyItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        # Use the explicit id from meta, else fall back to the fingerprint.
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
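To make the filtering logic easier to follow without a running Scrapy project, here is a minimal standard-library sketch of the same idea. It is not the middleware itself: `FakeRequest`, `visited_id`, and `filter_output` are hypothetical stand-ins, plain dicts play the role of items, and a SHA-1 hash of the URL is assumed as a crude substitute for Scrapy's request fingerprint.

```python
from dataclasses import dataclass, field
from hashlib import sha1

FILTER_VISITED = "filter_visited"
VISITED_ID = "visited_id"


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request (url and meta only)."""
    url: str
    meta: dict = field(default_factory=dict)


def visited_id(request):
    # Prefer an explicit id from meta; otherwise hash the URL
    # (an assumption, standing in for the real request fingerprint).
    return request.meta.get(VISITED_ID) or sha1(request.url.encode("utf-8")).hexdigest()


def filter_output(response_request, result, visited_ids):
    """Mimic process_spider_output: swap flagged, already-visited requests
    for an 'old' placeholder and record the page that produced each item."""
    kept = []
    for x in result:
        if isinstance(x, FakeRequest):
            if FILTER_VISITED in x.meta and visited_id(x) in visited_ids:
                # Already seen: emit a placeholder instead of re-crawling.
                kept.append({"visit_id": visited_id(x), "visit_status": "old"})
                continue
        elif isinstance(x, dict):
            vid = visited_id(response_request)
            visited_ids[vid] = True
            x["visit_id"] = vid
            x["visit_status"] = "new"
        kept.append(x)
    return kept


visited = {}
item_url = "http://example.com/item/1"

# First pass: the item page yields an item, so its id is recorded.
page = FakeRequest(item_url, meta={FILTER_VISITED: True})
first = filter_output(page, [{"name": "widget"}], visited)

# Second pass: a listing page yields a request for the same item page.
listing = FakeRequest("http://example.com/list")
repeat = FakeRequest(item_url, meta={FILTER_VISITED: True})
second = filter_output(listing, [repeat], visited)

print(first[0]["visit_status"], second[0]["visit_status"])  # prints "new old"
```

In a real project, a spider middleware like this would be enabled through the `SPIDER_MIDDLEWARES` setting, and a request opts in to filtering by carrying `meta={'filter_visited': True}`. Requests without that flag pass through untouched.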

Hopefully this article is helpful for your own Python programming.

