Recursive crawling¶
SOSSE can recursively crawl all pages it finds, or the recursion depth can be limited when crawling large websites or public sites.
Unlimited recursion¶
Recursing with no limit is achieved by using a policy with Recursion set to Crawl all pages (the default).
For example, a full domain can be crawled with 2 policies:
A policy for the domain with a URL regex that matches the domain, and Recursion set to Crawl all pages
A default policy (with the URL regex set to .*) with Recursion set to Never crawl
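The way these two policies combine can be sketched in Python. This is only an illustration of policy selection by URL regex — the domain, the policy list, the first-match ordering, and the pick_policy helper are all hypothetical, not SOSSE's actual implementation:

```python
import re

# Hypothetical policies mirroring the example above: a policy for the
# domain that crawls everything, and a catch-all default that never recurses.
policies = [
    {"url_regex": r"^https://example\.com/", "recursion": "Crawl all pages"},
    {"url_regex": r".*", "recursion": "Never crawl"},
]

def pick_policy(url):
    # Return the first policy whose URL regex matches the URL
    # (first-match ordering is assumed here for illustration).
    for policy in policies:
        if re.match(policy["url_regex"], url):
            return policy

print(pick_policy("https://example.com/about")["recursion"])  # Crawl all pages
print(pick_policy("https://other.org/page")["recursion"])     # Never crawl
```

Under this sketch, every page inside the domain is crawled recursively, while links leading outside the domain fall through to the catch-all policy and are not recursed into.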
Limited recursion¶
Crawling pages up to a certain depth can be achieved by setting Recursion to Depending on depth and setting the Recursion depth when queueing the initial URL.
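The effect of a Recursion depth can be sketched as a breadth-first crawl that carries a depth counter — a simplified illustration only, not SOSSE's crawler (the crawl function and the toy link graph are made up for this example):

```python
from collections import deque

def crawl(start_url, max_depth, get_links):
    # Breadth-first crawl that stops following links beyond max_depth.
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # the page itself is indexed, but its links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Toy link graph: a -> b -> c -> d
links = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(crawl("a", 2, links.get))  # ['a', 'b', 'c'] — 'd' is beyond depth 2
```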
Partial limited recursion¶
A mixed approach is also possible, by setting Recursion to Depending on depth in one policy, and setting it to Crawl all pages with a positive Recursion depth in another.
For example, one could crawl all of Wikipedia and crawl external links up to 2 levels deep with the following policies:
A policy for Wikipedia, with a Recursion depth of 2:
A default policy with Recursion set to Depending on depth:
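One way to picture the combined behavior is a helper that decides the remaining depth of each queued link. This is a hypothetical sketch of the semantics described above — the next_depth function and the Wikipedia regex are illustrative, not SOSSE code:

```python
import re

def next_depth(url, remaining_depth):
    # Hypothetical helper: pages matching the Wikipedia policy are always
    # crawled ("Crawl all pages" keeps the depth budget intact), while
    # pages matching the default policy ("Depending on depth") consume
    # one level of the remaining depth, and are skipped when it runs out.
    if re.match(r"^https://en\.wikipedia\.org/", url):
        return remaining_depth
    if remaining_depth > 0:
        return remaining_depth - 1
    return None  # not crawled

print(next_depth("https://en.wikipedia.org/wiki/Python", 2))  # 2
print(next_depth("https://www.python.org/", 2))               # 1
print(next_depth("https://www.python.org/", 0))               # None
```

In this sketch, Wikipedia pages never exhaust the depth budget, so the whole site is crawled, while each hop through an external page decrements the budget until external crawling stops after 2 levels.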