Types of Archives

Sosse can create different types of snapshots of the pages it crawls. These snapshots can be browsed offline, as described below.

By default, archived pages can be accessed through the archive link shown in the search results (see Search results).

🔖 HTML Archive

By default, the crawlers store HTML pages and the files they depend on (such as images and stylesheets). This behavior can be controlled in the ⚡ Collection.

It is also possible to use a browser to take the snapshot, in which case the snapshot is taken after the page is rendered (following JavaScript execution).

All archived HTML pages can be cleared with the clear_html_archive management command.

📷 Screenshots Archive

The crawlers can take screenshots of the pages they browse. Pages saved this way also store link information and can be browsed offline. Screenshots can be enabled in the ⚡ Collection.

✏ Text Archive

The text content of all crawled pages is stored. This text archive retains link information and can be used to navigate to other archived pages. The text archive is created for all indexed documents.

🏠 Browsable Home

The entry points of crawled websites (the URLs that were manually queued) are displayed on the homepage, which makes archived websites easy to navigate. The websites displayed can be customized using the show on homepage option of documents.

The homepage can be configured to show only the search bar by disabling the browsable home option.

Online Detection

When the online_search_redirect option is set, a search is redirected to the search engine defined by online_search_redirect when Sosse is online, and is handled by Sosse itself when it is offline. Searching locally or online can be forced from the User profile.
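
The decision described above can be illustrated with a minimal sketch. This is not Sosse's implementation: the function name `resolve_search_url`, its parameters, the `{query}` placeholder, and the local `/?q=` path are all hypothetical and only model the documented behavior (redirect when online, search locally when offline, with an optional forced mode from the user profile).

```python
from typing import Optional
from urllib.parse import quote_plus


def resolve_search_url(
    query: str,
    online: bool,
    redirect_template: Optional[str],
    forced: Optional[str] = None,
) -> str:
    """Return the URL a search should be sent to (illustrative only).

    query             -- the search terms typed by the user
    online            -- result of the online check (assumed to be done elsewhere)
    redirect_template -- hypothetical URL template of the external search
                         engine, e.g. "https://example.org/?q={query}";
                         None or empty means no redirect is configured
    forced            -- None, "local" or "online" (models the user profile
                         setting that forces one mode)
    """
    encoded = quote_plus(query)
    local_url = "/?q=" + encoded  # hypothetical local search path

    # Without a configured redirect, every search stays local.
    if not redirect_template:
        return local_url

    # A forced local search wins over the online check.
    if forced == "local":
        return local_url

    # Redirect when forced online, or when online and nothing is forced.
    if forced == "online" or (forced is None and online):
        return redirect_template.format(query=encoded)

    return local_url
```

For example, with a template of `https://example.org/?q={query}`, an online search for `a b` would resolve to the external engine, while the same search made offline would resolve to the local path.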