Recursive crawlingΒΆ

SOSSE can recursively crawl all pages it finds, or the recursion depth can be limited when crawling large websites or public sites.

No limit recursionΒΆ

Recursing with no limit is achieved by using a policy with Recursion set to Crawl all pages (the default).

For example, a full domain can be crawled with 2 policies:

  • A policy for the domain with a URL regex that matches the domain, and Recursion set to Crawl all pages

  • A default policy (with the URL regex set to .*) with a Recursion set to Never crawl
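The policy lookup described above can be sketched as a regex match against an ordered list of policies. The data layout and field names below are illustrative assumptions, not SOSSE's actual data model:

```python
import re

# Hypothetical sketch of regex-based policy matching. The dictionaries stand
# in for the two policies of the example: one for the target domain, and a
# default catch-all policy that never recurses.
POLICIES = [
    {"url_regex": r"^https://example\.com/", "recursion": "crawl_all"},
    {"url_regex": r".*", "recursion": "never"},  # default policy
]

def matching_policy(url):
    """Return the first policy whose URL regex matches the given URL."""
    for policy in POLICIES:
        if re.match(policy["url_regex"], url):
            return policy
    return None

print(matching_policy("https://example.com/about")["recursion"])  # crawl_all
print(matching_policy("https://other.org/")["recursion"])         # never
```

With this setup, links inside the domain keep being queued, while links leading anywhere else fall through to the default policy and are never crawled.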

Limited recursionΒΆ

Crawling pages up to a certain depth can be achieved by setting Recursion to Depending on depth and setting the Recursion depth when queueing the initial URL.

[Screenshot: crawl_on_depth_add.png]
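The depth-limited behaviour can be illustrated with a small breadth-first crawl sketch. The link graph and function below are illustrative assumptions (a real crawler fetches each page and extracts its links); the sketch only shows how a remaining-depth counter stops recursion:

```python
from collections import deque

# Toy link graph standing in for fetched pages and their outgoing links.
LINKS = {
    "start": ["a", "b"],
    "a": ["c"],
    "c": ["d"],
    "b": [],
    "d": [],
}

def crawl(initial_url, recursion_depth):
    """Breadth-first crawl that stops queueing links once the remaining
    depth reaches 0, mimicking a Depending on depth policy (sketch only)."""
    queue = deque([(initial_url, recursion_depth)])
    seen = {initial_url}
    crawled = []
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth == 0:
            continue  # depth exhausted: index this page but queue no links
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth - 1))
    return crawled

print(crawl("start", 2))  # ['start', 'a', 'b', 'c'] — 'd' is 3 levels deep
```

Each queued link inherits the parent's remaining depth minus one, so a Recursion depth of 2 on the initial URL reaches pages at most two links away.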

Partial limited recursionΒΆ

A mixed approach is also possible: set Recursion to Depending on depth in one policy, and in another policy set it to Crawl all pages with a positive Recursion depth.

For example, one could crawl all of Wikipedia, and crawl external links up to 2 levels, with the following policies:

  • A policy for Wikipedia, with Recursion set to Crawl all pages and a Recursion depth of 2:

[Screenshot: policy_all.png]
  • A default policy with Recursion set to Depending on depth:

[Screenshot: policy_on_depth.png]
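The combined behaviour of these two policies can be sketched as follows. All URLs, the link graph, and the policy test are illustrative assumptions: `wiki:` names stand in for Wikipedia pages matched by the first policy, `ext:` names for external pages falling through to the default policy:

```python
import re
from collections import deque

# Toy link graph: wiki pages link to each other and to an external chain.
LINKS = {
    "wiki:Main": ["wiki:Page1", "ext:site1"],
    "wiki:Page1": ["wiki:Page2"],
    "wiki:Page2": [],
    "ext:site1": ["ext:site2"],
    "ext:site2": ["ext:site3"],
    "ext:site3": [],
}

WIKI_RE = re.compile(r"^wiki:")
EXTERNAL_DEPTH = 2  # Recursion depth set on the Wikipedia policy

def crawl(start):
    """Crawl wiki pages without limit; follow external links for at most
    EXTERNAL_DEPTH levels (sketch of the two-policy setup, not SOSSE code)."""
    queue = deque([(start, None)])  # None = unlimited (Crawl all pages)
    seen = {start}
    crawled = []
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if WIKI_RE.match(link):
                next_depth = None  # wiki policy: keep crawling all pages
            elif depth is None:
                # Leaving wiki: the first external page consumes one level.
                next_depth = EXTERNAL_DEPTH - 1
            else:
                next_depth = depth - 1
            if next_depth is None or next_depth >= 0:
                seen.add(link)
                queue.append((link, next_depth))
    return crawled

result = crawl("wiki:Main")
# 'ext:site1' and 'ext:site2' are crawled (2 external levels);
# 'ext:site3' is one level too deep and is skipped.
```

Wiki pages always reset to unlimited recursion, while each hop through external pages decrements the counter inherited from the Wikipedia policy's Recursion depth.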