Commit 890f5ac40e64fd9b28a5fc3ede0509fee7abbe57

  • avatar
  • arvind
  • Sun Mar 30 19:30:13 IST 2014
Ticket #2: Disable http_caching. The index page was also being cached, which
means that whenever the spider ran it got only a cached version of the page,
and hence would not pick up new posts.
Setting the caching policy to be RFC 2616 compliant does not help: the
pages being served by the web server do not carry any cache-control
directives.

Fix: Use the anydbm module to implement an equivalent of caching: maintain a db
of URLs which have already been crawled and posted, and do not process those
URLs again.

TODO: Find a more idiomatic way of doing this. This can move to sweets, maybe.
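The fix described above can be sketched outside Scrapy. In Python 3 the anydbm module became dbm, so a minimal, hypothetical version of the URL-cache check (the function name seen_before and the example URLs are mine, not the commit's) looks like:

```python
import dbm  # Python 3 successor of the anydbm module used in this commit


def seen_before(db_path, url):
    """Return True if url was already recorded; otherwise record it and return False."""
    # 'c' opens the database read/write, creating the file if it does not
    # exist -- the same mode as anydbm.open('urlCache', 'c') in the diff below.
    with dbm.open(db_path, 'c') as db:
        key = url.encode('utf-8')  # dbm keys and values must be bytes in Python 3
        if key in db:
            return True
        db[key] = b'True'
        return False


# A repeated URL is flagged on the second call:
print(seen_before('urlCache', 'http://cgnetswara.org/index.php'))  # False (fresh db)
print(seen_before('urlCache', 'http://cgnetswara.org/index.php'))  # True
```

Unlike Scrapy's HTTP cache, this only remembers which URLs were handled; it never serves stale page bodies, which is exactly what the ticket needed.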
.gitignore
(2 / 0)
  
 include/
 local/
 lib/
+build/
 *.db
 *.pid
 conf.py
+urlCache
  

 # Crawl responsibly by identifying yourself (and your website) on the user-agent
 #USER_AGENT = 'postScraper (+http://www.yourdomain.com)'
-HTTPCACHE_ENABLED = True
+# HTTPCACHE_ENABLED = True
+#HTTPCACHE_POLICY = 'scrapy.contrib.httpcache.RFC2616Policy'
+# SPIDER_MIDDLEWARES = {
+#     'postScraper.middlewares.deltafetch.DeltaFetch': 100,
+# }
+
+# DELTAFETCH_ENABLED = True
+# DOTSCRAPY_ENABLED = True
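If the more idiomatic route from the TODO is taken later, the commented-out settings above would simply be switched on. A sketch of that hypothetical follow-up, assuming a deltafetch middleware is available at postScraper.middlewares.deltafetch (config fragment, not part of this commit):

```python
# settings.py -- hypothetical follow-up, not part of this commit
SPIDER_MIDDLEWARES = {
    'postScraper.middlewares.deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True   # skip requests whose items were already scraped
DOTSCRAPY_ENABLED = True    # persist the .scrapy state directory between runs
```

That would replace the hand-rolled urlCache db with middleware-level deduplication.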
  
 from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
 from scrapy.selector import Selector
 from scrapy.contrib.loader import ItemLoader
-
 from postScraper.items import PostscraperItem

 import facebook
 import conf
+import anydbm


 class SwaraSpider(CrawlSpider):
                  callback='parse_start'),)

     def parse_start(self, response):
-        if 'cached' not in response.flags:
+        db = anydbm.open('urlCache', 'c')
+        if response.url not in db.keys():
             xpath = Selector()
             loader = ItemLoader(item=PostscraperItem(), response=response)

                 description=content[0]['content'].encode('utf8'),
                 message="#CGNetSwara http://cgnetswara.org/" +
                         content[1]['audio'])
+            print str(response.url)
+            print type(response.url)
+            db[response.url] = str(True)
+            db.close()
+        else:
+            print "Not posting content from " + response.url
+            db.close()