Crawl Policies#

Policy matching#

Crawl policies define which pages are indexed and how they are indexed. The policy list can be reached by clicking Crawl policies from the Administration interface.

../_images/crawl_policy_list.png

When the crawler indexes a page or evaluates a link to queue it, it finds the best matching policy to determine how to handle the link. The policy with the longest matching URL regex is selected. As a last resort, the default .* policy is selected.

You can see which policy would match by typing a URL in the search bar of the Crawl policies page, or in the Crawl a new URL page (see Adding URLs to the crawl queue).
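As a rough illustration, the selection rule can be sketched as follows (a minimal sketch with made-up policies; the actual data model and implementation may differ):

import re

# Hypothetical, simplified policies as (url_regex, name) pairs -- purely
# illustrative, not the real configuration format.
POLICIES = [
    (r"https://example\.com/blog/.*", "blog policy"),
    (r"https://example\.com/.*", "site policy"),
    (r".*", "default policy"),
]

def best_policy(url):
    # Keep the policies whose regex matches the URL, then pick the one with
    # the longest regex: the default ".*" is the shortest, so it is only
    # selected when nothing more specific matches.
    matching = [(regex, name) for regex, name in POLICIES if re.match(regex, url)]
    return max(matching, key=lambda p: len(p[0]))[1]

print(best_policy("https://example.com/blog/post"))  # blog policy
print(best_policy("https://other.example.org/"))     # default policy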

Indexing decision#

../_images/crawl_policy_decision.png

URL regexp#

The regexp matched against URLs to determine whether this policy applies. The default .* policy’s regexp cannot be modified.

Documents#

Shows the URLs in the database that match the regexp.

Condition, Crawl depth#

Condition and Crawl depth parameters define which links to recurse into.

Condition can be one of:

  • Crawl all pages: URLs matching the policy will be crawled

  • Depending on depth: URLs matching the policy are crawled depending on the recursion level (see Recursive crawling)

  • Never crawl: URLs matching the policy are not crawled unless they are queued manually (in this case, no recursion occurs)

Crawl depth is only relevant when the Condition is Crawl all pages and defines the recursion depth for links outside the policy. See Recursive crawling for more details.
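As a rough sketch of this decision (simplified; the condition names are spelled out as strings here and the depth bookkeeping is handled by the crawler itself):

# Simplified sketch of the crawl decision for a link, assuming the matching
# policy's Condition and the remaining recursion depth are already known.
def should_crawl(condition, remaining_depth):
    if condition == "Crawl all pages":
        return True
    if condition == "Depending on depth":
        return remaining_depth > 0
    # "Never crawl": only crawled when queued manually, without recursion.
    return False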

Mimetype regex#

The mimetype of a page must match this regexp for the page to be crawled.
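For example, a policy restricted to HTML and plain text pages could use a pattern like the following (an illustrative regexp, not a default value):

import re

MIMETYPE_REGEX = r"text/(html|plain)"

print(bool(re.match(MIMETYPE_REGEX, "text/html")))        # True
print(bool(re.match(MIMETYPE_REGEX, "application/pdf")))  # False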

Index URL parameters#

When enabled, URLs are stored with their URL parameters. Otherwise, URL parameters are removed before indexing. This can be useful when some parameters are random, or only change sorting or filtering, …
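Roughly, disabling this option amounts to dropping the query string before the URL is stored (a sketch using the Python standard library, not the actual crawler code):

from urllib.parse import urlsplit, urlunsplit

def strip_url_parameters(url):
    # Drop the query string, keep everything else as-is.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))

print(strip_url_parameters("https://example.com/list?sort=date&page=2"))
# https://example.com/list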

Browser#

../_images/crawl_policy_browser.png

Default browse mode#

Can be one of:

  • Detect: the first time a domain is accessed, it is crawled with both Chromium and Python Requests. If the text content varies, the website is assumed to be dynamic and Chromium will be used for subsequent crawling of pages in this domain. If the text content is the same, Python Requests will be used since it is faster (a simplified sketch of this heuristic follows the list).

  • Chromium: Chromium is used.

  • Python Requests: Python Requests is used.
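The heuristic behind Detect can be summarized as follows (a conceptual sketch; the actual comparison logic may differ):

def choose_browse_mode(text_from_requests, text_from_chromium):
    # If both methods see the same text, the page does not need JavaScript
    # rendering and the faster method is kept for the whole domain.
    if text_from_requests == text_from_chromium:
        return "Python Requests"
    return "Chromium"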

Take screenshots#

Enables taking screenshots of pages for offline use.

Screenshot format#

Format of the screenshot images, either JPG or PNG.

Script#

JavaScript code to be executed in the context of the web pages when they have finished loading. This can be used to handle authentication, validate forms, remove headers, …

For example, the following script could be used to click on a GDPR compliance “I agree” button:

// Text of the button to click, and an XPath expression matching any element
// with exactly that text.
const BUTTON_TEXT = "I agree";
const XPATH_PATTERN = `//*[text()="${BUTTON_TEXT}"]`;

// Evaluate the XPath expression and click the first matching element, if any.
const result = document.evaluate(XPATH_PATTERN, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);

if (result && result.singleNodeValue) {
    result.singleNodeValue.click();
}

If the script triggers an error, further processing of the page is aborted and the error message is stored in the document’s error field. A tool such as Tampermonkey can be useful to debug these kinds of scripts.

Updates#

../_images/crawl_policy_updates.png

Crawl frequency, Recrawl dt#

How often pages should be reindexed:

  • Once: pages are not recrawled.

  • Constant: pages are recrawled every Recrawl dt min.

  • Adaptive: pages are recrawled more often when they change. The interval between recrawls starts at Recrawl dt min. Then, each time the page is recrawled, the interval is multiplied by 2 if the content is unchanged, or divided by 2 otherwise. The interval stays bounded between Recrawl dt min and Recrawl dt max (see the sketch after this list).
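The adaptive interval can be sketched as follows (a simplified illustration; dt_min and dt_max stand for the Recrawl dt min and Recrawl dt max settings):

def next_recrawl_interval(current, content_changed, dt_min, dt_max):
    # Shrink the interval when the content changed, grow it otherwise,
    # and keep it within [dt_min, dt_max].
    interval = current / 2 if content_changed else current * 2
    return max(dt_min, min(dt_max, interval))

print(next_recrawl_interval(60, content_changed=False, dt_min=60, dt_max=1440))  # 120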

Hash mode#

Defines how changes between recrawls are detected:

  • Hash raw content: raw text content is compared.

  • Normalize numbers before: numbers are replaced by 0s before comparing; this can be useful to ignore counters, clock changes, … (see the sketch below)
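For illustration, the normalization can be thought of as replacing every digit with 0 before hashing (a sketch; the hash algorithm shown here is an arbitrary choice, not necessarily the one used by the crawler):

import hashlib
import re

def content_hash(text, normalize_numbers=False):
    if normalize_numbers:
        # Replace every digit with 0 so counters and timestamps do not
        # make the content look different.
        text = re.sub(r"[0-9]", "0", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

print(content_hash("42 users online", normalize_numbers=True) ==
      content_hash("57 users online", normalize_numbers=True))  # True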

Authentication#

See Authentication handling for general guidelines on authentication.

../_images/crawl_policy_auth.png

Login URL#

If crawling a page matching the policy gets redirected to the Login URL, the crawler will attempt to authenticate using the parameters defined below.

Form selector#

CSS selector pointing to the authentication <form> element.

Authentication fields#

This defines the <input> fields to fill in the form. The fields are matched by their name attribute and filled with the given value. (Hidden fields, such as CSRF prevention fields, are automatically populated by the crawler.)
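As a rough illustration of how these settings fit together (a hedged sketch with hypothetical values, using Python Requests and BeautifulSoup rather than the actual crawler code):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical values standing in for the policy settings.
LOGIN_URL = "https://example.com/login"                   # Login URL
FORM_SELECTOR = "#login-form"                             # Form selector
AUTH_FIELDS = {"username": "bob", "password": "secret"}   # Authentication fields

session = requests.Session()
page = session.get(LOGIN_URL)
form = BeautifulSoup(page.text, "html.parser").select_one(FORM_SELECTOR)

# Start from the form's own <input> values so hidden fields (e.g. CSRF tokens)
# are preserved, then fill in the fields defined in the policy.
data = {i.get("name"): i.get("value", "") for i in form.find_all("input") if i.get("name")}
data.update(AUTH_FIELDS)

session.post(urljoin(page.url, form.get("action") or ""), data=data)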