Spiders#
Spiders are the business end of okami. Their main function is to provide the URL parsing rules and the page content parsing for a particular website.
Spiders are fed Task and Response objects after HttpMiddleware and Downloader have finished processing the Request object for a page. The Response object contains the complete HTTP response.
If needed, spiders can also handle authentication and session state for HTTP negotiation with the website.
The full scraping cycle is described in the process section of the architecture page.
Notes#
- Spiders should have a unique class property name.
- Keep your spiders in a Python package. You can have multiple packages. Define them in the settings module:
  SPIDERS=["path.to.package.spiders"]
- Okami finds and loads spiders from these packages using the property name.
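The uniqueness requirement on name can be pictured with a small registry. This is only a sketch of the idea, not okami's actual loader: SpiderBase, ShopSpider, BlogSpider and build_registry are hypothetical names introduced for illustration.

```python
class SpiderBase:
    """Stand-in for okami's Spider base class (hypothetical, for illustration)."""

    name = None


class ShopSpider(SpiderBase):
    name = "shop.example"


class BlogSpider(SpiderBase):
    name = "blog.example"


def build_registry(spider_classes):
    """Map each spider's name to its class, rejecting missing or duplicate names."""
    registry = {}
    for cls in spider_classes:
        if not cls.name:
            raise ValueError(f"{cls.__name__} does not define a name")
        if cls.name in registry:
            raise ValueError(f"duplicate spider name: {cls.name!r}")
        registry[cls.name] = cls
    return registry


# Looking a spider up by name only works if every name is unique.
registry = build_registry([ShopSpider, BlogSpider])
```

If two spiders shared a name, a lookup by name would be ambiguous, which is why this sketch raises an error in that case.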
Development#
Make sure everything is properly configured. During development you can test-run your spider using the command below:
okami process spider-name url
This runs the spider named spider-name against the page at url. The output should be a JSON representation of a list of Item objects.
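To give a rough idea of what a JSON representation of a list of items could look like, here is a minimal sketch. It assumes each item exposes a to_dict method, as the example Item at the end of this page does. DemoItem and items_to_json are hypothetical names, and the exact output format of okami itself may differ.

```python
import json


class DemoItem:
    """Hypothetical item type with a to_dict() method (not part of okami)."""

    def __init__(self, iid, name, price):
        self.iid = iid
        self.name = name
        self.price = price

    def to_dict(self):
        return dict(iid=self.iid, name=self.name, price=self.price)


def items_to_json(items):
    """Serialize a list of items to a JSON array of objects."""
    return json.dumps([item.to_dict() for item in items], indent=2)


output = items_to_json([DemoItem(1, "Kettle", 24.5), DemoItem(2, "Teapot", 18.0)])
print(output)
```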
Details#
Below are required and optional implementation details for every spider.
Required#
- Spider.name is required and should be unique.
- Spider.urls dictionary defines the rules used by okami to parse a list of valid URLs from page content for further website processing.
  - Spider.urls.start - URLs used as starting URLs for processing the website
  - Spider.urls.allow - allowed URLs for further website processing
  - Spider.urls.avoid - URLs that are avoided during website processing
- Spider.items method receives a Task object, processes page content and returns a list of Item objects.
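One way to picture how allow and avoid rules interact is as a filter over the links found on a page. The sketch below assumes the href values have already been extracted by the rules. select_urls is a hypothetical helper, and how okami actually combines the two rule sets is an assumption here, not something this page specifies.

```python
from urllib.parse import urljoin


def select_urls(base_url, allowed, avoided):
    """Resolve hrefs against base_url and drop those matched by avoid rules.

    Assumption: avoided links take precedence over allowed ones.
    """
    skip = {urljoin(base_url, href) for href in avoided}
    selected = []
    for href in allowed:
        url = urljoin(base_url, href)
        # Keep first-seen order and skip duplicates and avoided URLs.
        if url not in skip and url not in selected:
            selected.append(url)
    return selected


base = "http://localhost:8000/"
allowed = ["/products/1/", "/about/", "/products/2/", "/products/1/"]
avoided = ["/about/"]
urls = select_urls(base, allowed, avoided)
```

Here the /about/ link is dropped because an avoid rule matched it, and the repeated product link is kept only once.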
Optional#
- Spider.tasks method is optionally used in case Spider.urls does not get all URLs. The method receives a Task object, processes page content and returns a list of Task objects with URLs and, optionally, data for further processing.
- Spider.session method is optionally used to handle authentication etc.
- Spider.request method is optionally used to define a dictionary of extra arguments passed into the Request object, which the Downloader uses to create an HTTP request and download a page.
- Spider.delta method is optionally used to provide a custom delta key in case delta scraping mode is enabled.
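This page does not say what a delta key looks like or how okami calls Spider.delta. One plausible reading is that the key identifies the content of a page or item, so that unchanged content can be recognised on a later run. Under that assumption, a key could be derived from a stable hash of the relevant data. delta_key below is a hypothetical helper, not okami's signature.

```python
import hashlib
import json


def delta_key(data):
    """Return a stable SHA-256 hex digest for a JSON-serializable dict.

    Keys are sorted so that the same data always yields the same digest,
    regardless of the order in which the dict was built.
    """
    payload = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = delta_key({"iid": 1, "price": 24.5})
same = delta_key({"price": 24.5, "iid": 1})      # same data, different key order
changed = delta_key({"iid": 1, "price": 19.0})   # the price changed
```

In this sketch, first and same are equal while changed differs, which is the property a change-detection key needs.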
Example#
Below is an example Spider implementation.
```python
import lxml.html

# Import path inferred from the okami.Spider references on this page.
from okami import Spider


class Example(Spider):
    """
    An example Spider implementation.
    """

    name = "example.com"
    urls = dict(
        start=["http://localhost:8000/"],
        # XPath rules selecting links that should be followed
        allow=[
            "//nav//a/@href",
            "//div[@id='product-list']//div//a/@href",
        ],
        # XPath rules selecting links that should be skipped
        avoid=[
            "//a[contains(@href, '/about/')]/@href",
            "//a[contains(@href, '/sale/')]/@href",
        ],
    )

    async def items(self, task, response):
        items = []
        document = lxml.html.document_fromstring(html=response.text)
        products = document.xpath("//div[@class='product']")
        for product in products:
            # Each xpath() call returns a list; take the first match
            iid = int(product.xpath(".//@product-id")[0])
            name = product.xpath(".//h2/text()")[0]
            desc = product.xpath(".//p/text()")[0]
            category = product.xpath(".//span/text()")[0]
            price = float(product.xpath(".//em/text()")[0])
            images = product.xpath(".//div//img/@src")
            item = Product(
                iid=iid,
                url=str(response.url),
                name=name,
                category=category,
                desc=desc,
                price=price,
                images=images,
            )
            items.append(item)
        return items
```
And an example Item implementation.
```python
from okami import Item  # assumed import path, adjust to your okami version


class Product(Item):
    """
    An example Item object implementation. You will of course have your own.
    """

    def __init__(self, iid, url, name, category, desc, price, images=None):
        self.iid = iid
        self.url = url
        self.name = name
        self.category = category
        self.desc = desc
        self.price = price
        self.images = images or []

    def to_dict(self):
        return dict(
            iid=self.iid,
            url=self.url,
            name=self.name,
            category=self.category,
            desc=self.desc,
            price=self.price,
            images=self.images,
        )
```