Configuration file reference

SOSSE can be configured through the configuration file /etc/sosse/sosse.conf. Configuration variables are grouped in 3 sections, depending on which component they affect. Modifying any of these options requires restarting the crawlers or the web interface.
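
As a rough sketch of the overall layout, the file could look like the skeleton below. The section names and option values come from this page; the `key = value` syntax and the use of `#` comment lines are assumptions based on the usual INI conventions.

```ini
# /etc/sosse/sosse.conf (illustrative skeleton, not a complete file)

[common]
# Options shared by the web interface and the crawlers
db_name = sosse

[webserver]
# Options used only by the web interface
default_page_size = 20

[crawler]
# Options used only by the crawlers
user_agent = SOSSE
```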

[common] section

This section describes options common to the web interface and the crawlers.

secret_key

Default: CHANGE ME

Run sosse-admin generate_secret to create a new one.

See https://docs.djangoproject.com/en/3.2/ref/settings/#secret-key

Warning

Keep the secret key used in production secret!


debug

Default: False

Debug mode.

Warning

Don’t run with debug turned on in production!


db_name

Default: sosse

PostgreSQL database name.


db_user

Default: sosse

PostgreSQL username.


db_pass

Default: CHANGE ME

PostgreSQL password.


db_host

Default: 127.0.0.1

PostgreSQL hostname or IP address.


db_port

Default: 5432

PostgreSQL port.
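
A minimal sketch of a [common] section, assuming INI `key = value` syntax. The database values are the defaults listed above; the two values in angle brackets are placeholders to replace, not literal settings.

```ini
[common]
# Placeholder: paste the value printed by `sosse-admin generate_secret`
secret_key = <generated secret>
debug = False

# PostgreSQL connection (defaults from this page, except the password)
db_name = sosse
db_user = sosse
# Placeholder: replace with the real database password
db_pass = <database password>
db_host = 127.0.0.1
db_port = 5432
```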


[webserver] section

This section describes options dedicated to the web interface.


atom_access_token

Default: <empty>

When anonymous searches are disabled, a token can be used to access Atom feeds without authenticating. The token can be passed in HTTP requests as a URL parameter, for example ?token=<Atom access token>. Setting an empty string disables token access.
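
For illustration, a token could be set as shown below. The token value is hypothetical; a long random string of your own should be used instead.

```ini
[webserver]
# Hypothetical token, shown only to illustrate the syntax
atom_access_token = 3f9c2a7e51b84d0c
```

With this value, an Atom feed URL would carry the parameter ?token=3f9c2a7e51b84d0c.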


search_shortcut_char

Default: !

Special character used as search shortcut.


default_search_redirect

Default: <empty>

Default search engine to use. Leave empty to use SOSSE by default; otherwise, use the search engine's “Short name”.

Warning

This field is case sensitive.


online_search_redirect

Default: <empty>

Search engine to use when the connectivity check succeeds (see online_check_url). Leave empty to use the default_search_redirect; otherwise, use the search engine's “Short name”.

Warning

This field is case sensitive.


online_check_url

Default: https://google.com/

URL used to define online or offline mode.


online_check_timeout

Default: 1.0

Timeout in seconds used to define online or offline mode.


online_check_cache

Default: 10

The online check is performed once every online_check_cache requests. The special value once can be used to run the check only once, when the first request is made. 0 can be used to disable caching.

Note

The cache is kept per uWSGI worker, and only for as long as that worker is alive. So even with a value of once, a new check request is made every time a new worker is spawned.
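
The online/offline options are meant to be used together. A possible combination is sketched below, assuming a search engine with the hypothetical short name DuckDuckGo has been defined. With these values, searches would be expected to go to that engine while the check URL is reachable, and to fall back to SOSSE otherwise.

```ini
[webserver]
# Empty: SOSSE itself is the default search engine
default_search_redirect =
# Hypothetical short name; it must match the configured engine exactly (case sensitive)
online_search_redirect = DuckDuckGo

# Connectivity check (defaults from this page)
online_check_url = https://google.com/
online_check_timeout = 1.0
# Re-run the check every 10 requests; `once` and 0 are the other documented values
online_check_cache = 10
```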


sosse_shortcut

Default: <empty>

When default_search_redirect is not empty, this defines which shortcut searches SOSSE.
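
A sketch of how the shortcut options might fit together, assuming the option this description depends on is default_search_redirect. The engine short name and the shortcut value are hypothetical, and this page does not state whether the prefix character belongs in the value, so it is left out here.

```ini
[webserver]
search_shortcut_char = !
# Hypothetical short name of a configured search engine
default_search_redirect = DuckDuckGo
# Hypothetical shortcut used to search SOSSE instead of the default engine
sosse_shortcut = s
```

Under these assumptions, a query such as `!s some words` would be expected to search SOSSE rather than the default engine.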


allowed_host

Default: *

FQDN of the webserver, “*” for any.

See https://docs.djangoproject.com/en/3.2/ref/settings/#allowed-hosts
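
For example, to accept requests for a single host name only (the name below is hypothetical):

```ini
[webserver]
# Hypothetical FQDN; replace with the name clients use to reach the server
allowed_host = search.example.org
```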


static_url

Default: /static/


static_root

Default: /var/lib/sosse/static/


screenshots_url

Default: /screenshots/


screenshots_dir

Default: /var/lib/sosse/screenshots/


html_snapshot_url

Default: /snap/

URL path to HTML snapshots.

Danger

This value is hardcoded inside stored HTML snapshots. If you modify it, any HTML page previously stored as a snapshot will need to be crawled again in order to update internal links.


html_snapshot_dir

Default: /var/lib/sosse/html/


use_i18n

Default: True

See https://docs.djangoproject.com/en/3.2/ref/settings/#use-i18n


use_l10n

Default: True

See https://docs.djangoproject.com/en/3.2/ref/settings/#use-l10n


language_code

Default: en-us

See https://docs.djangoproject.com/en/3.2/ref/settings/#language-code


datetime_format

Default: N j, Y, P

See https://docs.djangoproject.com/en/3.2/ref/settings/#datetime-format


use_tz

Default: True

See https://docs.djangoproject.com/en/3.2/ref/settings/#use-tz


timezone

Default: UTC

See https://docs.djangoproject.com/en/3.2/ref/settings/#time-zone
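
As an illustration of the localization options above, the fragment below switches the interface to a French locale and Paris time. The values are chosen only as an example and follow the formats described in the linked Django settings.

```ini
[webserver]
use_i18n = True
use_l10n = True
# Language code in the format Django expects
language_code = fr
use_tz = True
# Time zone name from the tz database
timezone = Europe/Paris
```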


default_page_size

Default: 20

Default result count returned.


max_page_size

Default: 200

Maximum user-defined result count.


data_upload_max_memory_size

Default: 2621440

See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-memory-size


data_upload_max_number_fields

Default: 1000

See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-number-fields


atom_feed_size

Default: 200

Number of results returned by Atom feeds.


exclude_not_indexed

Default: True

Exclude pages that are queued for indexing but not yet indexed from search results.


exclude_redirect

Default: True

Exclude page redirections from search results.


cache_follows_redirect

Default: True

Accessing the cached page of a redirection URL automatically follows the redirection.


admin_page_size

Default: 100

Number of items per list in the administration pages.


search_strip

Default: <empty>

Removes this string from search queries.


crawl_status_autorefresh

Default: 5

Delay between automatic refreshes of the crawl status page (in seconds).


browsable_home

Default: True

Display entry point documents on the homepage.




[crawler] section

This section describes options dedicated to the crawlers.

crawler_count

Default: <empty>

Number of crawlers running concurrently (defaults to the number of CPUs available).


proxy

Default: <empty>

URL of the HTTP proxy server to use. Example: http://192.168.0.1:8080/


user_agent

Default: SOSSE

User agent sent by crawlers.


requests_timeout

Default: 10

Timeout in seconds when retrieving pages with Requests (no timeout if 0).
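
The basic crawler options could be combined as in the sketch below. The proxy address is the example from the description above; the crawler count, user agent string and timeout are hypothetical values picked for illustration.

```ini
[crawler]
# Run 4 crawlers instead of one per available CPU
crawler_count = 4
# Example proxy address from the option description
proxy = http://192.168.0.1:8080/
# Hypothetical user agent string
user_agent = SOSSE-example
# Allow slow servers 30 seconds; 0 would disable the timeout
requests_timeout = 30
```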


fail_over_lang

Default: english

Language used to parse web pages when the original language could not be detected.


hashing_algo

Default: md5

Hashing algorithm used to determine whether the content of a page has changed.


screenshots_size

Default: 1920x1080

Resolution of the browser used to take screenshots.


default_browser

Default: chromium

Defines which browser to use by default when browsing mode is auto-detected (can be either “firefox” or “chromium”).


chromium_options

Default: --enable-precise-memory-info --disable-default-apps --incognito --headless

Options passed to Chromium’s command line. You may need to add --no-sandbox to run the crawler as root, or --disable-dev-shm-usage to run in a virtualized container.
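
For instance, a crawler running as root inside a container might append both optional flags to the default options. This is a sketch: whether either flag is needed depends on the environment.

```ini
[crawler]
# Default options, followed by the two optional flags mentioned above
chromium_options = --enable-precise-memory-info --disable-default-apps --incognito --headless --no-sandbox --disable-dev-shm-usage
```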


firefox_options

Default: --headless

Options passed to Firefox’s command line.


js_stable_time

Default: 0.1

When loading a page in a browser, wait js_stable_time seconds before checking that the DOM has stayed unchanged.


js_stable_retry

Default: 100

Check at most js_stable_retry times for the page to stay unchanged before processing.
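
The two js_stable options work as a pair. Assuming each check is preceded by one wait of js_stable_time seconds, the longest wait before processing is roughly js_stable_time × js_stable_retry, which is 0.1 × 100 = 10 seconds with the defaults. The hypothetical values below keep the same bound (0.5 × 20 = 10 seconds) while checking the DOM less often.

```ini
[crawler]
# Wait 0.5 s between DOM checks
js_stable_time = 0.5
# Stop waiting after 20 checks
js_stable_retry = 20
```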


tmp_dl_dir

Default: /var/lib/sosse/downloads

Base directory where files are temporarily downloaded.


dl_check_time

Default: 0.1

Download detection will check every dl_check_time seconds for a started download.


dl_check_retry

Default: 2

Download detection will retry dl_check_retry times for a started download.
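
If downloads are slow to start, the detection window can be widened. The sketch below uses hypothetical values for the interval and the retry count; the directory is the default from this page.

```ini
[crawler]
tmp_dl_dir = /var/lib/sosse/downloads
# Look for a started download every 0.5 s
dl_check_time = 0.5
# Retry up to 4 times before concluding no download started
dl_check_retry = 4
```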


max_file_size

Default: 500

Maximum file size to index (in kB).


max_html_asset_size

Default: 5000

Maximum file size of HTML assets (CSS, images, etc.) to download (in kB).


max_redirects

Default: 5

Maximum number of redirects before aborting. (This is accurate only when using Requests; some redirects may be missed on Chromium.)


browser_idle_exit_time

Default: 5

Close the browser when the crawler is idle for browser_idle_exit_time seconds.


browser_crash_sleep

Default: 1.0

Sleep browser_crash_sleep seconds before retrying after the browser crashed.


browser_crash_retry

Default: 1

Retry indexing the page browser_crash_retry times when the browser crashes.


css_parser

Default: internal

Choose which CSS parser implementation to use. May be either internal or cssutils. You may want to change this option when HTML snapshots have broken styles.
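
For example, to try the alternative parser when snapshots render with broken styles:

```ini
[crawler]
# Switch from the default `internal` parser to cssutils
css_parser = cssutils
```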