Scrapy Data Flow
Published: 2019-06-11

The data flow in Scrapy is controlled by the execution engine, and goes like this:

1. The Engine gets the initial Requests to crawl from the Spider.
2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
3. The Scheduler returns the next Requests to the Engine.
4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see
process_request()).
5. Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the
Engine, passing through the Downloader Middlewares (see process_response()).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing
through the Spider Middleware (see process_spider_input()).
7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine,
passing through the Spider Middleware (see process_spider_output()).
8. The Engine sends processed items to the Item Pipelines, then sends processed Requests to the Scheduler and asks
for possible next Requests to crawl.
9. The process repeats (from step 3) until there are no more requests from the Scheduler.
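
The loop described in the steps above can be sketched as a toy model in plain Python. This is not Scrapy code and does not use Scrapy's API: the classes Request, Response, Scheduler, Downloader, Spider and Pipeline, the FAKE_WEB dictionary, and the run_engine function are all hypothetical stand-ins invented for illustration. Both middleware layers are left out, and everything runs synchronously, whereas the real engine is asynchronous.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    url: str


@dataclass(frozen=True)
class Response:
    url: str
    body: str


# Hypothetical in-memory "web": url -> (page text, outgoing links).
FAKE_WEB = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["c"]),
    "c": ("page C", []),
}


class Scheduler:
    """FIFO queue that ignores URLs it has already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, request):
        if request.url not in self.seen:
            self.seen.add(request.url)
            self.queue.append(request)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


class Downloader:
    def fetch(self, request):
        body, _links = FAKE_WEB[request.url]
        return Response(url=request.url, body=body)


class Spider:
    start_urls = ["a"]

    def start_requests(self):
        return [Request(url) for url in self.start_urls]

    def parse(self, response):
        # One item per page, then one new Request per outgoing link.
        yield {"url": response.url, "text": response.body}
        for link in FAKE_WEB[response.url][1]:
            yield Request(link)


class Pipeline:
    def __init__(self):
        self.items = []

    def process_item(self, item):
        self.items.append(item)


def run_engine(spider, scheduler, downloader, pipeline):
    # Steps 1-2: take the initial Requests from the Spider and schedule them.
    for request in spider.start_requests():
        scheduler.enqueue(request)
    while True:
        request = scheduler.next_request()       # step 3
        if request is None:                      # step 9: nothing left to crawl
            break
        response = downloader.fetch(request)     # steps 4-5 (no middlewares here)
        for result in spider.parse(response):    # steps 6-7
            if isinstance(result, Request):
                scheduler.enqueue(result)        # step 8: new Requests
            else:
                pipeline.process_item(result)    # step 8: scraped items
    return pipeline.items


items = run_engine(Spider(), Scheduler(), Downloader(), Pipeline())
print([item["url"] for item in items])  # prints ['a', 'b', 'c']
```

The point of the sketch is that the Spider never talks to the Downloader or the Scheduler directly; every Request, Response and item passes through the engine loop, which is why the middleware hooks named in the steps can sit on each of those hand-offs.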

Reposted from: https://www.cnblogs.com/wuhua1/p/8409913.html