Configuration file reference

SOSSE can be configured through the configuration file /etc/sosse/sosse.conf. Configuration variables are grouped in 3 sections, depending on which component they affect. Modifying any of these options requires restarting the crawlers or the web interface.

Note

Configuration options can also be set using environment variables by prefixing them with SOSSE_. For example, the proxy option of the crawler can be set by setting the SOSSE_PROXY environment variable. Environment variable options have higher precedence than options from the configuration file.
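
The precedence rule can be illustrated with a minimal Python sketch. It is not SOSSE's actual loader: the resolve_option helper is hypothetical, the file is assumed to be readable by Python's configparser, and the environment variable name is assumed to be SOSSE_ followed by the upper-cased option name, as in the SOSSE_PROXY example above.

```python
import configparser
import os

# Illustrative configuration content (assumed INI-style syntax).
CONF_TEXT = """
[crawler]
proxy = http://192.168.0.1:8080/
"""


def resolve_option(parser, section, option, environ=None):
    """Hypothetical helper: environment variable first, then configuration file."""
    environ = os.environ if environ is None else environ
    env_name = "SOSSE_" + option.upper()  # assumed naming rule
    if env_name in environ:
        return environ[env_name]
    return parser.get(section, option, fallback=None)


parser = configparser.ConfigParser()
parser.read_string(CONF_TEXT)

# Without the environment variable, the value from the file is used.
print(resolve_option(parser, "crawler", "proxy", environ={}))
# With SOSSE_PROXY set, the environment variable takes precedence.
print(resolve_option(parser, "crawler", "proxy",
                     environ={"SOSSE_PROXY": "http://10.0.0.1:3128/"}))
```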

[common] section

This section describes options common to the web interface and the crawlers.

secret_key

Default: CHANGE ME

Run sosse-admin generate_secret to create a new one.

See https://docs.djangoproject.com/en/3.2/ref/settings/#secret-key

Warning

Keep the secret key used in production secret!


debug

Default: False

Debug mode.

Warning

Don’t run with debug turned on in production!


db_name

Default: sosse

PostgreSQL database name.


db_user

Default: sosse

PostgreSQL username.


db_pass

Default: CHANGE ME

PostgreSQL password.


db_host

Default: 127.0.0.1

PostgreSQL hostname or IP address.


db_port

Default: 5432

PostgreSQL port.


[webserver] section

This section describes options dedicated to the web interface.


search_shortcut_char

Default: !

Special character used as search shortcut.


default_search_redirect

Default: <empty>

Default search engine to use. Leave empty to use SOSSE by default; otherwise, use the “Short name” of the search engine.

Warning

This field is case sensitive.


online_search_redirect

Default: <empty>

Search engine to use when the connectivity check succeeds (see online_check_url). Leave empty to use default_search_redirect; otherwise, use the “Short name” of the search engine.

Warning

This field is case sensitive.
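
How the two redirect options combine can be sketched as follows. This is only a reading of the descriptions above, not SOSSE's code: the pick_search_redirect function is hypothetical, None stands for "search with SOSSE itself", and the engine name DuckDuckGo is just an example of a “Short name”.

```python
def pick_search_redirect(online, default_search_redirect="", online_search_redirect=""):
    """Hypothetical helper: return the engine "Short name" to redirect to, or None for SOSSE."""
    # The online engine only applies when the connectivity check succeeded.
    if online and online_search_redirect:
        return online_search_redirect
    # Otherwise fall back to the default engine, if one is configured.
    if default_search_redirect:
        return default_search_redirect
    # Both options empty (or offline with no default): SOSSE handles the search.
    return None


# Online: the external engine is used. Offline: SOSSE answers the query.
print(pick_search_redirect(True, online_search_redirect="DuckDuckGo"))
print(pick_search_redirect(False, online_search_redirect="DuckDuckGo"))
```

With this combination, leaving default_search_redirect empty and setting only online_search_redirect sends searches to the external engine when the network is reachable, and to SOSSE otherwise.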


online_check_url

Default: https://google.com/

URL requested to determine online or offline mode.


online_check_timeout

Default: 1.0

Timeout, in seconds, of the connectivity check used to determine online or offline mode.


online_check_cache

Default: 10

The online check is done once every online_check_cache requests. The special value once can be used to run the check only once, when the first request is done. 0 can be used to disable caching.

Note

The cache is effective on a per-worker basis, and only as long as the uWSGI worker is alive. So even with a value of once, a new check will be done every time a new worker is spawned.
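
The caching behaviour can be sketched in Python as below. This is an illustration of the rules stated above under stated assumptions, not SOSSE's implementation: the OnlineCheckCache class is hypothetical, and one instance is assumed to live in each worker process.

```python
class OnlineCheckCache:
    """Hypothetical per-worker cache of the connectivity check result."""

    def __init__(self, cache_setting, check):
        self.cache_setting = cache_setting  # an integer, or the string "once"
        self.check = check                  # callable returning True when online
        self.cached = None                  # None until the first check has run
        self.requests_since_check = 0

    def is_online(self):
        if self.cache_setting == 0:
            # Caching disabled: run the check on every request.
            return self.check()
        if self.cached is None:
            stale = True                    # first request in this worker
        elif self.cache_setting == "once":
            stale = False                   # never re-check in this worker
        else:
            stale = self.requests_since_check >= self.cache_setting
        if stale:
            self.cached = self.check()
            self.requests_since_check = 0
        self.requests_since_check += 1
        return self.cached


checks = []
cache = OnlineCheckCache(10, lambda: checks.append("check") or True)
for _ in range(25):
    cache.is_online()
print(len(checks))  # 3: the check ran on requests 1, 11 and 21
```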


sosse_shortcut

Default: <empty>

In case default_search_redirect is not empty, this defines which shortcut searches SOSSE.


allowed_host

Default: *

FQDN of the webserver, “*” for any.

See https://docs.djangoproject.com/en/3.2/ref/settings/#allowed-hosts


static_url

Default: /static/


static_root

Default: /var/lib/sosse/static/


screenshots_url

Default: /screenshots/


screenshots_dir

Default: /var/lib/sosse/screenshots/


html_snapshot_url

Default: /snap/

URL path to HTML snapshots.

Danger

This value is hardcoded inside stored HTML snapshots. If you modify it, any HTML page previously stored as a snapshot will need to be crawled again in order to update internal links.


html_snapshot_dir

Default: /var/lib/sosse/html/


use_i18n

Default: True

See https://docs.djangoproject.com/en/3.2/ref/settings/#use-i18n


use_l10n

Default: True

See https://docs.djangoproject.com/en/3.2/ref/settings/#use-l10n


language_code

Default: en-us

See https://docs.djangoproject.com/en/3.2/ref/settings/#language-code


datetime_format

Default: N j, Y, P

See https://docs.djangoproject.com/en/3.2/ref/settings/#datetime-format


use_tz

Default: True

See https://docs.djangoproject.com/en/3.2/ref/settings/#use-tz


timezone

Default: UTC

See https://docs.djangoproject.com/en/3.2/ref/settings/#time-zone


default_page_size

Default: 20

Default result count returned.


max_page_size

Default: 200

Maximum user-defined result count.


data_upload_max_memory_size

Default: 2621440

See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-memory-size


data_upload_max_number_fields

Default: 1000

See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-number-fields


atom_access_token

Default: <empty>

When anonymous searches are disabled, a token can be used to access Atom feeds without authenticating. The token can be passed in HTTP requests as a URL parameter, for example ?token=<Atom access token>. Setting an empty string disables token access.
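
As an illustration, the token parameter can be appended to a feed URL with Python's standard library. The with_atom_token helper, the host sosse.example, the /atom/ path and the token value are all hypothetical; only the token parameter name comes from the description above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_atom_token(feed_url, token):
    """Hypothetical helper: add token=<token> while keeping existing query parameters."""
    parts = urlsplit(feed_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("token", token))
    return urlunsplit(parts._replace(query=urlencode(query)))


# The feed URL and the token are placeholders for illustration.
print(with_atom_token("http://sosse.example/atom/?q=linux", "s3cret"))
```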


atom_feed_size

Default: 200

Number of results returned by Atom feeds.


atom_cached_bin_passthrough

Default: True

Cached links from the Atom feed to binary files return the binary files instead of the related metadata cached page (http://x.x.x.x/html/<url>).


exclude_not_indexed

Default: True

Exclude pages queued for indexing but not yet indexed from search results.


exclude_redirect

Default: True

Exclude page redirections from search results.


cache_follows_redirect

Default: True

Accessing the cached page of a redirection URL automatically follows the redirection.


admin_page_size

Default: 100

Number of items per list in the administration pages.


search_strip

Default: <empty>

Removes this string from search queries.


crawl_status_autorefresh

Default: 5

Delay between automatic refreshes of the crawl status page (in seconds).


browsable_home

Default: True

Display entry point documents on the homepage.




[crawler] section

This section describes options dedicated to the crawlers.

crawler_count

Default: <empty>

Number of crawlers running concurrently (defaults to the number of CPUs available).


proxy

Default: <empty>

URL of the HTTP proxy server to use. Example: http://192.168.0.1:8080/


user_agent

Default: SOSSE

User agent sent by crawlers.


fake_user_agent_browser

Default: <empty>

Use a preset UA from the fake-useragent library. The UA will be selected among the provided browsers, specified as a comma-separated list of values among: chrome, edge, firefox, safari.

Note

To enable fake-useragent, the user_agent option must be set to empty.


fake_user_agent_os

Default: <empty>

Use a preset UA from the fake-useragent library. The UA will be selected among the provided operating systems, specified as a comma-separated list of values among: windows, linux, macos.

Note

To enable fake-useragent, the user_agent option must be set to empty.


fake_user_agent_platform

Default: <empty>

Use a preset UA from the fake-useragent library. The UA will be selected among the provided platforms, specified as a comma-separated list of values among: pc, mobile, tablet.

Note

To enable fake-useragent, the user_agent option must be set to empty.


requests_timeout

Default: 10

Timeout in seconds when retrieving pages with Requests (no timeout if 0).


fail_over_lang

Default: english

Language used to parse web pages when the original language could not be detected.


hashing_algo

Default: md5

Hashing algorithm used to determine whether the content of a page has changed.


screenshots_size

Default: 1920x1080

Resolution of the browser used to take screenshots.


default_browser

Default: chromium

Defines which browser to use by default when browsing mode is auto-detected (can be either “firefox” or “chromium”).


chromium_options

Default: --enable-precise-memory-info --disable-default-apps --headless

Options passed to Chromium’s command line. You may need to add --no-sandbox to run the crawler as root, or --disable-dev-shm-usage to run in a virtualized container.


firefox_options

Default: --headless

Options passed to Firefox’s command line.


js_stable_time

Default: 0.1

When loading a page in a browser, wait js_stable_time seconds before checking that the DOM stays unchanged.


js_stable_retry

Default: 100

Check at most js_stable_retry times for the page to stay unchanged before processing.
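
The interaction between js_stable_time and js_stable_retry can be sketched as a polling loop. This is one plausible reading of the two descriptions, not the crawler's actual code: the wait_for_stable_dom function and its get_dom callable (standing in for "read the current DOM from the browser") are hypothetical.

```python
import time


def wait_for_stable_dom(get_dom, js_stable_time=0.1, js_stable_retry=100, sleep=time.sleep):
    """Hypothetical helper: poll get_dom() until two consecutive reads match.

    Returns (dom, stable), where stable is False if the retries ran out.
    """
    previous = get_dom()
    for _ in range(js_stable_retry):
        sleep(js_stable_time)      # give scripts time to modify the page
        current = get_dom()
        if current == previous:
            return current, True   # unchanged since the last read
        previous = current
    return previous, False         # still changing after js_stable_retry checks


# Simulated page whose DOM changes twice and then settles.
states = iter(["<p>loading</p>", "<p>1 result</p>", "<p>2 results</p>", "<p>2 results</p>"])
dom, stable = wait_for_stable_dom(lambda: next(states), js_stable_time=0.01)
print(dom, stable)
```

Under this reading, with the default values a page is given at most about 10 seconds (100 checks at 0.1 second intervals) to stop changing.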


tmp_dl_dir

Default: /var/lib/sosse/downloads

Base directory where files are temporarily downloaded.


browser_config_dir

Default: /var/lib/sosse/browser_config

Base directory where browser configuration files and profiles are stored.


dl_check_time

Default: 0.1

Download detection will check every dl_check_time seconds for a started download.


dl_check_retry

Default: 2

Download detection will retry dl_check_retry times for a started download.


max_file_size

Default: 500

Maximum file size to index (in kB).


max_html_asset_size

Default: 5000

Maximum file size of HTML assets (CSS, images, etc.) to download (in kB).


max_redirects

Default: 5

Maximum number of redirects before aborting. (This is accurate only when using Requests; some redirects may be missed with Chromium.)


browser_idle_exit_time

Default: 5

Close the browser when the crawler is idle for browser_idle_exit_time seconds.


browser_crash_sleep

Default: 1.0

Sleep browser_crash_sleep seconds before retrying after the browser crashed.


browser_crash_retry

Default: 1

Retry indexing the page browser_crash_retry times when the browser crashes.
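
A minimal sketch of how browser_crash_sleep and browser_crash_retry could fit together is shown below. It assumes one initial attempt followed by browser_crash_retry retries; the BrowserCrashed exception, the index_with_retry function and the index_page callable are hypothetical names, not part of SOSSE.

```python
import time


class BrowserCrashed(Exception):
    """Hypothetical exception standing in for a browser crash."""


def index_with_retry(index_page, url, browser_crash_retry=1, browser_crash_sleep=1.0,
                     sleep=time.sleep):
    """Hypothetical helper: one initial attempt, then browser_crash_retry retries."""
    attempts = 1 + browser_crash_retry
    for attempt in range(attempts):
        try:
            return index_page(url)
        except BrowserCrashed:
            if attempt == attempts - 1:
                raise                   # retries exhausted, give up on this page
            sleep(browser_crash_sleep)  # wait before restarting the attempt


# Simulated indexer that crashes on the first call and succeeds on the second.
calls = []

def flaky_index(url):
    calls.append(url)
    if len(calls) == 1:
        raise BrowserCrashed()
    return "indexed " + url

print(index_with_retry(flaky_index, "http://example.com/", browser_crash_sleep=0))
```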


css_parser

Default: internal

Choose which CSS parser implementation to use. May be either internal or cssutils. You may want to change this option when HTML snapshots have broken styles.