How can I build a crawling robot?
When is a crawler the right choice?
If you can answer yes to all of the following questions, the crawler might be a better choice than a scraper:
- Is the navigation in the page available through <a> tags?
- Is all of the information you want on a single type of page, like a product details page or product listing?
Building a crawler
Crawlers are defined using page processors rather than the scraper's steps. A page processor is a set of rules directing the crawler's behavior. Without page processors, a crawler simply visits all pages on the Web site while extracting absolutely no information. You can define as many page processors as you want, with each page processor consisting of two configuration sections: conditions and actions.
Conditions determine whether the page processor is executed for a given Web page. Every page processor is checked against every page the crawler visits, and when a processor matches, its configured actions are applied. You can add any number of conditions to a page processor to make sure it matches only the exact pages you want. All conditions defined for a page processor must evaluate to true before its configured actions are applied to the given Web page.
Actions define the crawler’s activities upon visiting a given Web page, such as extracting certain information or adding discovered URLs to the crawl list.
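The condition-and-action model can be pictured with a small Python sketch. This is only a conceptual illustration, not how the crawler is actually implemented: the names Page, PageProcessor, applies_to and run_processors are hypothetical, and the sample condition and action are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Page:
    url: str
    html: str


# A condition inspects a page and answers yes or no.
Condition = Callable[[Page], bool]
# An action does something with a page, here by filling in an output row.
Action = Callable[[Page, Dict[str, str]], None]


@dataclass
class PageProcessor:
    conditions: List[Condition] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)

    def applies_to(self, page: Page) -> bool:
        # Every condition must be true. In this sketch an empty
        # condition list matches every page.
        return all(condition(page) for condition in self.conditions)


def run_processors(processors: List[PageProcessor], page: Page) -> Dict[str, str]:
    """Check every processor against the page and run the actions of those that match."""
    row: Dict[str, str] = {}
    for processor in processors:
        if processor.applies_to(page):
            for action in processor.actions:
                action(page, row)
    return row


def url_is_product_page(page: Page) -> bool:
    return "/product/" in page.url


def record_url(page: Page, row: Dict[str, str]) -> None:
    row["url"] = page.url


processors = [PageProcessor([url_is_product_page], [record_url])]
product_row = run_processors(processors, Page("https://example.com/product/42", "<html></html>"))
other_row = run_processors(processors, Page("https://example.com/about", "<html></html>"))
```

In this sketch the product page yields a row containing its URL, while the other page is visited but yields an empty row because no processor matched it.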
How does the crawler determine which pages to go to?
By default, the crawler visits every page that is referenced by an <a> tag anywhere on the page and is within the same domain as the starting URL. We refer to the list of pages that the crawler will visit as the crawl list.
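The default behavior can be approximated with Python's standard library. The sketch below is an assumption-laden illustration: LinkCollector and default_links are hypothetical names, and "same domain" is interpreted here as an exact host match, which may differ from the boundary the crawler really applies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def default_links(start_url, page_url, html):
    """Return the links on a page that share the starting URL's host."""
    collector = LinkCollector()
    collector.feed(html)
    start_host = urlparse(start_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        # Assumption: "same domain" means an identical host name.
        if urlparse(absolute).netloc == start_host and absolute not in links:
            links.append(absolute)
    return links


sample_html = (
    '<a href="/products">Products</a>'
    '<a href="https://other.example.org/x">Elsewhere</a>'
    '<a href="about.html">About</a>'
)
found = default_links("https://example.com/", "https://example.com/index.html", sample_html)
```

Here the link to other.example.org is dropped, and the two remaining links are resolved to absolute URLs before they would be added to the crawl list.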
You can control how the crawler works through a Web site by creating a page processor with the condition Match every page and the action Don't follow any links on page. This alone will direct the crawler to visit only the first page at the starting URL.
You can then add another page processor matching the first page and choose the action Add url attribute to crawl list, including CSS selectors that match all <a> tags for the links you want the crawler to visit. This lets you determine exactly which links on which page types the crawler processes, which can dramatically speed up execution and avoid extracting garbage data. Note that manual additions to the crawl list like this ignore both the default within the same subdomain boundary and the Don't follow any links on page action.
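The effect of combining these two page processors can be simulated on a tiny in-memory site. Everything in this sketch is hypothetical: the SITE data, the crawl function and its parameters are invented for illustration, and a plain CSS class name stands in for a real CSS selector.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page,
# given as (css_class, target_url) pairs.
SITE = {
    "https://example.com/": [
        ("product", "https://example.com/p/1"),
        ("product", "https://example.com/p/2"),
        ("footer", "https://example.com/terms"),
    ],
    "https://example.com/p/1": [("footer", "https://example.com/terms")],
    "https://example.com/p/2": [("footer", "https://example.com/terms")],
    "https://example.com/terms": [],
}


def crawl(start_url, follow_default, wanted_class=None, add_links_on=None):
    """Visit pages breadth-first and return them in visiting order.

    follow_default=False mimics "Don't follow any links on page".
    wanted_class and add_links_on mimic a second page processor that
    manually adds selected links from one page to the crawl list.
    """
    crawl_list = deque([start_url])
    visited = []
    while crawl_list:
        url = crawl_list.popleft()
        if url in visited:
            continue
        visited.append(url)
        for css_class, target in SITE.get(url, []):
            manually_added = url == add_links_on and css_class == wanted_class
            if follow_default or manually_added:
                crawl_list.append(target)
    return visited


start = "https://example.com/"
everything = crawl(start, follow_default=True)
first_page_only = crawl(start, follow_default=False)
products_only = crawl(start, follow_default=False,
                      wanted_class="product", add_links_on=start)
```

In this simulation the default crawl reaches all four pages, disabling link following leaves only the starting page, and adding the product links back by hand reaches the starting page plus the two product pages while skipping the terms page.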
See the short animation below for an example crawler configuration.
Whether your robot crawls or scrapes, it must have pre-defined output fields in which to store extracted data.
In the example above, we've defined two output fields: url and title. These names are applied to the fields, or columns (comparable to spreadsheet columns), that store the data extracted from Web sites. Output fields must be configured before defining page processors.
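One way to picture output fields is as the fixed column headers of a table that every extracted row must fit into. The sketch below uses Python's csv module with the url and title field names from the example; the write_rows helper and the sample row are hypothetical and do not reflect the robot's actual storage format.

```python
import csv
import io

# Output fields are declared up front, before any extraction rules exist.
OUTPUT_FIELDS = ["url", "title"]


def write_rows(rows):
    """Render extracted rows as CSV text using the pre-defined columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=OUTPUT_FIELDS, lineterminator="\n")
    writer.writeheader()
    # A row containing a field that was never declared raises ValueError.
    writer.writerows(rows)
    return buffer.getvalue()


table = write_rows([{"url": "https://example.com/p/1", "title": "Widget"}])
```

Declaring the columns first means a row with an undeclared field is rejected rather than silently creating a new column, which loosely mirrors why output fields come before page processors.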