Commit 23c872f0c8416b115d073c908915720cd6d5d7ff
Ticket #1
Fix: Do not post content from cached pages.
Scrapy maintains an HTTP cache, so it knows which pages it has crawled
previously. The `Response` object has a `flags` attribute, which is a list
of flags such as 'cached' and 'redirected'.
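The check described above can be sketched in isolation as follows. This is a minimal illustration, not the project's actual spider: `is_cached` and the dictionary item are hypothetical, and `SimpleNamespace` objects stand in for Scrapy's `Response` so the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace


def is_cached(response):
    """Return True if the response carries the 'cached' flag."""
    # getattr with a default keeps this safe for objects without `flags`.
    return 'cached' in getattr(response, 'flags', ())


def parse_start(response):
    """Hypothetical callback: yield no items for cached pages."""
    if is_cached(response):
        return  # page was served from the cache; do not post it again
    yield {'url': response.url}


# Stand-ins for a Scrapy Response (assumption: only `url` and `flags` are used).
fresh = SimpleNamespace(url='http://example.com/a', flags=[])
cached = SimpleNamespace(url='http://example.com/b', flags=['cached'])

fresh_items = list(parse_start(fresh))
cached_items = list(parse_start(cached))
```

Returning early for cached responses means the rest of the callback never runs for them, so no later statement depends on objects that were only created for fresh pages.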
Comments:
                 callback='parse_start'),)
     
         def parse_start(self, response):
    -        xpath = Selector()
    -        loader = ItemLoader(item=PostscraperItem(), response=response)
    +        if 'cached' not in response.flags:
    +            xpath = Selector()
    +            loader = ItemLoader(item=PostscraperItem(), response=response)
     
             loader.add_xpath('content', '//div[@class="report"]/p/text()')
             loader.add_xpath('audio',