Configuration file reference

SOSSE can be configured through the configuration file /etc/sosse/sosse.conf. Configuration variables are grouped in 3 sections, depending on which component they affect. Modifying any of these options requires restarting the crawlers or the web interface.
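As a rough illustration of the layout, the sketch below parses a sample file with Python's standard configparser module. It assumes the file uses INI syntax with the three section names documented on this page. The sample values and the SAMPLE_CONF name are made up for the example, and this is not necessarily how SOSSE itself loads its configuration.

```python
import configparser

# Hypothetical excerpt of /etc/sosse/sosse.conf, assuming INI syntax.
SAMPLE_CONF = """
[common]
db_name = sosse
db_port = 5432

[webserver]
anonymous_search = False

[crawler]
user_agent = SOSSE
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONF)

# Each option lives in the section of the component it affects.
sections = parser.sections()
db_port = parser.getint("common", "db_port")
anonymous_search = parser.getboolean("webserver", "anonymous_search")
```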

[common] section

This section describes options common to the web interface and the crawlers.

secret_key

Default: CHANGE ME

SECURITY WARNING: keep the secret key used in production secret! Run sosse-admin generate_secret to create a new one.

See https://docs.djangoproject.com/en/3.2/ref/settings/#secret-key


debug

Default: False

SECURITY WARNING: don’t run with debug turned on in production!


db_name

Default: sosse

PostgreSQL database name.


db_user

Default: sosse

PostgreSQL username.


db_pass

Default: CHANGE ME

PostgreSQL password.


db_host

Default: 127.0.0.1

PostgreSQL hostname or IP address.


db_port

Default: 5432

PostgreSQL port.
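
The db_* options correspond naturally to a Django database definition. The following is a minimal sketch of such a mapping, assuming the options arrive as a plain dictionary. The build_databases helper is hypothetical and the mapping is an assumption about how these values are consumed, not SOSSE's actual code.

```python
def build_databases(common):
    """Map [common] db_* options to a Django DATABASES setting (assumed mapping)."""
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            # Fall back to the documented defaults when an option is missing.
            "NAME": common.get("db_name", "sosse"),
            "USER": common.get("db_user", "sosse"),
            "PASSWORD": common.get("db_pass", "CHANGE ME"),
            "HOST": common.get("db_host", "127.0.0.1"),
            "PORT": common.get("db_port", "5432"),
        }
    }


# Example: only the host and password are overridden.
databases = build_databases({"db_host": "db.example.org", "db_pass": "s3cret"})
```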


[webserver] section

This section describes options dedicated to the web interface.

anonymous_search

Default: False

Allow anonymous users (users not logged in) to run searches.


atom_access_token

Default: <empty>

When anonymous searches are disabled, a token can be used to access Atom feeds without authenticating. The token can be passed to HTTP requests as a URL parameter, for example ?token=<Atom access token>. Setting an empty string disables token access.
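
A feed reader would typically need the token appended to the feed URL. The sketch below builds such a URL with the standard library. Only the token parameter comes from this page. The host name, the /atom/ path and the q parameter are placeholders chosen for the example.

```python
from urllib.parse import urlencode


def atom_feed_url(base_url, query, token):
    """Build a feed URL, adding ?token=... only when a token is configured."""
    params = {"q": query}  # "q" is a hypothetical search parameter name
    if token:
        params["token"] = token
    return f"{base_url}?{urlencode(params)}"


# Hypothetical host and path; replace with the real feed address.
url = atom_feed_url("https://sosse.example.org/atom/", "linux", "abc123")
```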


search_shortcut_char

Default: !

Special character used as search shortcut.


default_search_redirect

Default: <empty>

Default search engine to use. Leave empty to use SOSSE by default; otherwise, set it to the search engine's “Short name” (case sensitive).


sosse_shortcut

Default: <empty>

If default_search_redirect is not empty, this defines which shortcut searches SOSSE.


allowed_host

Default: *

FQDN of the webserver, “*” for any.

See https://docs.djangoproject.com/en/3.2/ref/settings/#allowed-hosts


static_url

Default: /static/


static_root

Default: /var/lib/sosse/static/


screenshots_url

Default: /screenshots/


screenshots_dir

Default: /var/lib/sosse/screenshots/


use_i18n

Default: True

See https://docs.djangoproject.com/en/3.2/ref/settings/#use-i18n


use_l10n

Default: True

See https://docs.djangoproject.com/en/3.2/ref/settings/#use-l10n


language_code

Default: en-us

See https://docs.djangoproject.com/en/3.2/ref/settings/#language-code


datetime_format

Default: N j, Y, P

See https://docs.djangoproject.com/en/3.2/ref/settings/#datetime-format


use_tz

Default: True

See https://docs.djangoproject.com/en/3.2/ref/settings/#use-tz


timezone

Default: UTC

See https://docs.djangoproject.com/en/3.2/ref/settings/#time-zone


default_page_size

Default: 20

Default result count returned.


max_page_size

Default: 200

Maximum user-defined result count.
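
The interaction between default_page_size and max_page_size can be pictured as a simple clamp. This is a sketch under the assumption that out-of-range requests are capped rather than rejected. The effective_page_size helper is hypothetical.

```python
def effective_page_size(requested, default_page_size=20, max_page_size=200):
    """Return the result count to use for one search request."""
    if requested is None:
        # The user did not ask for a specific size.
        return default_page_size
    # Assumption: values above the maximum are capped, values below 1 are raised to 1.
    return max(1, min(requested, max_page_size))
```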


data_upload_max_memory_size

Default: 2621440

See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-memory-size


data_upload_max_number_fields

Default: 1000

See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-number-fields


atom_feed_size

Default: 200

Size of Atom feeds.


exclude_not_indexed

Default: True

Exclude pages that are queued for indexing but not yet indexed from search results.


exclude_redirect

Default: True

Exclude page redirections from search results.


cache_follows_redirect

Default: True

Accessing the cached page of a URL that redirected automatically follows the redirection.


admin_page_size

Default: 100

Number of items per list in the administration pages.


search_strip

Default: <empty>

Removes this string from search queries.


crawl_status_autorefresh

Default: 5

Delay between automatic refreshes of the crawl status page (in seconds).


browsable_home

Default: False

Display entry point documents on the homepage.


[crawler] section

This section describes options dedicated to the crawlers.

crawler_count

Default: <empty>

Number of crawlers running concurrently (defaults to the number of CPUs available).
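
The empty default can be resolved at startup along these lines. This is a minimal sketch. The resolve_crawler_count helper is hypothetical, and falling back to 1 when the CPU count cannot be determined is an assumption.

```python
import os


def resolve_crawler_count(raw):
    """Turn the raw crawler_count option into a number of crawlers."""
    raw = (raw or "").strip()
    if raw:
        return int(raw)
    # os.cpu_count() may return None on some platforms.
    return os.cpu_count() or 1
```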


proxy

Default: <empty>

URL of the HTTP proxy server to use.


user_agent

Default: SOSSE

User agent used by crawlers.


requests_timeout

Default: 10

Timeout when retrieving pages with Requests (no timeout if 0).


fail_over_lang

Default: english

Language used to parse web pages when the original language could not be detected.


hashing_algo

Default: md5

Hashing algorithm used to determine whether the content of a page has changed.
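
Change detection by hashing can be sketched with Python's hashlib, which accepts the algorithm by name. The content_hash and has_changed helpers are hypothetical, and the exact bytes that SOSSE hashes are not specified here.

```python
import hashlib


def content_hash(content, hashing_algo="md5"):
    """Hex digest of the page content using the configured algorithm."""
    return hashlib.new(hashing_algo, content).hexdigest()


def has_changed(previous_hash, content, hashing_algo="md5"):
    """A page counts as changed when its new digest differs from the stored one."""
    return content_hash(content, hashing_algo) != previous_hash
```

Any algorithm name accepted by hashlib.new, such as sha256, can be used in the same way.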


screenshots_size

Default: 1920x1080

Resolution of the browser used to take screenshots.


browser_options

Default: --enable-precise-memory-info --disable-default-apps --incognito --headless

Options passed to Chromium’s command line. You may need to add --no-sandbox to run the crawler as root, or --disable-dev-shm-usage to run in a virtualized container.
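
To show how browser_options and screenshots_size might end up on a command line, the sketch below splits the option string with shlex and adds a window size. The chromium_args helper is hypothetical. --window-size is a Chromium switch, but whether SOSSE sets the resolution this way is an assumption.

```python
import shlex


def chromium_args(browser_options, screenshots_size):
    """Build a Chromium argument list from the two crawler options."""
    # "1920x1080" -> (1920, 1080)
    width, height = (int(part) for part in screenshots_size.lower().split("x"))
    args = shlex.split(browser_options)
    args.append(f"--window-size={width},{height}")
    return args


# Example: the crawler runs as root, so --no-sandbox is added.
args = chromium_args("--incognito --headless --no-sandbox", "1920x1080")
```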


js_stable_time

Default: 0.1

When loading a page in a browser, wait js_stable_time seconds before checking that the DOM has stayed unchanged.


js_stable_retry

Default: 100

Check at most js_stable_retry times for the page to stay unchanged before processing.
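
Taken together, js_stable_time and js_stable_retry describe a polling loop. The following is a sketch of that idea, not the crawler's actual implementation. The get_dom callable is a hypothetical stand-in for reading the page's DOM from the browser.

```python
import time


def wait_until_stable(get_dom, js_stable_time=0.1, js_stable_retry=100, sleep=time.sleep):
    """Return True once two consecutive DOM snapshots match, False if retries run out."""
    previous = get_dom()
    for _ in range(js_stable_retry):
        sleep(js_stable_time)
        current = get_dom()
        if current == previous:
            return True
        # The page is still changing; compare against the newest snapshot next time.
        previous = current
    return False
```

With the defaults, a page that never settles is given up on after roughly 100 x 0.1 = 10 seconds of waiting.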


tmp_dl_dir

Default: /var/lib/sosse/downloads

Base directory where files are temporarily downloaded.


dl_check_time

Default: 0.1

Download detection checks every dl_check_time seconds for a started download.


dl_check_retry

Default: 2

Download detection will retry dl_check_retry times for a started download.


max_file_size

Default: 500

Maximum file size to index (in kb).


max_redirects

Default: 5

Maximum number of redirects before aborting. This is accurate only when using Requests; some redirects may be missed with Chromium.


browser_idle_exit_time

Default: 5

Close the browser when the crawler is idle for browser_idle_exit_time seconds.


browser_crash_sleep

Default: 1.0

Sleep browser_crash_sleep seconds before retrying after the browser crashed.


browser_crash_retry

Default: 1

Retry browser_crash_retry times to index the page on browser crashes.
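
The two crash-related options can be read as a retry loop around page indexing. The sketch below assumes that browser_crash_retry counts retries in addition to the first attempt. The BrowserCrash exception and the index_page callable are hypothetical names used only for this example.

```python
import time


class BrowserCrash(Exception):
    """Hypothetical exception raised when the browser dies while indexing."""


def index_with_retry(index_page, url, browser_crash_retry=1,
                     browser_crash_sleep=1.0, sleep=time.sleep):
    """Index a page, retrying after a browser crash."""
    attempts = 1 + browser_crash_retry  # first attempt plus the configured retries
    for attempt in range(attempts):
        try:
            return index_page(url)
        except BrowserCrash:
            if attempt == attempts - 1:
                raise  # retries exhausted, report the crash
            sleep(browser_crash_sleep)
```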