Configuration file reference
SOSSE can be configured through the configuration file /etc/sosse/sosse.conf. Configuration variables are grouped in 3 sections, depending on which component they affect. Modifying any of these options requires restarting the crawlers or the web interface.
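As a rough illustration, the overall layout of the file could look like the sketch below. It assumes an INI-style syntax with `#` comment lines, which this page does not spell out; only the three section names are taken from this reference.

```ini
# /etc/sosse/sosse.conf (layout sketch, syntax assumed to be INI-style)

[common]
# options shared by the web interface and the crawlers

[webserver]
# options for the web interface

[crawler]
# options for the crawlers
```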
[common] section
This section describes options common to the web interface and the crawlers.
- secret_key
Default: CHANGE ME
Run sosse-admin generate_secret to create a new one. See https://docs.djangoproject.com/en/3.2/ref/settings/#secret-key
Warning
Keep the secret key used in production secret!
- debug
Default: False
Debug mode.
Warning
Don’t run with debug turned on in production!
- db_name
Default: sosse
PostgreSQL database name.
- db_user
Default: sosse
PostgreSQL username.
- db_pass
Default: CHANGE ME
PostgreSQL password.
- db_host
Default: 127.0.0.1
PostgreSQL hostname or IP address.
- db_port
Default: 5432
PostgreSQL port.
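A minimal sketch of a [common] section, assuming a key=value syntax (not shown on this page). The values in angle brackets are placeholders to replace, not real values; the other values are the documented defaults.

```ini
[common]
# Placeholder: paste the output of "sosse-admin generate_secret" here
secret_key=<generated secret>
# Keep debug off in production
debug=False
db_name=sosse
db_user=sosse
# Placeholder: the PostgreSQL password chosen for db_user
db_pass=<database password>
db_host=127.0.0.1
db_port=5432
```

Both secret_key and db_pass default to CHANGE ME, so they are the two values to replace before going to production.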
[webserver] section
This section describes options dedicated to the web interface.
- anonymous_search
Default: False
Anonymous users (users not logged in) can do searches.
- atom_access_token
Default: <empty>
When anonymous searches are disabled, a token can be used to access Atom feeds without authenticating. The token can be passed in HTTP requests as a URL parameter, for example ?token=<Atom access token>. Setting an empty string disables token access.
- search_shortcut_char
Default: !
Special character used as search shortcut.
- default_search_redirect
Default: <empty>
Default search engine to use. Leave empty to use SOSSE by default; otherwise, use the search engine’s “Short name”.
Warning
This field is case sensitive.
- online_search_redirect
Default: <empty>
Search engine to use when the connectivity check succeeds (see online_check_url). Leave empty to use default_search_redirect; otherwise, use the search engine’s “Short name”.
Warning
This field is case sensitive.
- online_check_url
Default: https://google.com/
URL requested by the connectivity check to determine online or offline mode.
- online_check_timeout
Default: 1.0
Timeout in seconds of the connectivity check used to determine online or offline mode.
- online_check_cache
Default: 10
The online check is done once every online_check_cache requests. The special value “once” can be used to run the check only once, when the first request is done. 0 can be used to disable caching.
Note
The cache is kept per uWSGI worker, and only as long as the uWSGI worker is alive. So even with a value of “once”, a new check will be done every time a new worker is spawned.
- sosse_shortcut
Default: <empty>
When default_search_redirect is not empty, this defines which shortcut searches SOSSE.
- allowed_host
Default: *
FQDN of the webserver, “*” for any.
See https://docs.djangoproject.com/en/3.2/ref/settings/#allowed-hosts
- static_url
Default: /static/
- static_root
Default: /var/lib/sosse/static/
- screenshots_url
Default: /screenshots/
- screenshots_dir
Default: /var/lib/sosse/screenshots/
- html_snapshot_url
Default: /snap/
URL path to HTML snapshots.
Danger
This value is hardcoded inside stored HTML snapshots. If you modify it, any HTML page previously stored as a snapshot will need to be crawled again in order to update internal links.
- html_snapshot_dir
Default: /var/lib/sosse/html/
- use_i18n
Default: True
See https://docs.djangoproject.com/en/3.2/ref/settings/#use-i18n
- use_l10n
Default: True
See https://docs.djangoproject.com/en/3.2/ref/settings/#use-l10n
- language_code
Default: en-us
See https://docs.djangoproject.com/en/3.2/ref/settings/#language-code
- datetime_format
Default: N j, Y, P
See https://docs.djangoproject.com/en/3.2/ref/settings/#datetime-format
- use_tz
Default: True
See https://docs.djangoproject.com/en/3.2/ref/settings/#use-tz
- timezone
Default: UTC
See https://docs.djangoproject.com/en/3.2/ref/settings/#time-zone
- default_page_size
Default: 20
Default result count returned.
- max_page_size
Default: 200
Maximum user-defined result count.
- data_upload_max_memory_size
Default: 2621440
See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-memory-size
- data_upload_max_number_fields
Default: 1000
See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-number-fields
- atom_feed_size
Default: 200
Number of results returned by Atom feeds.
- exclude_not_indexed
Default: True
Exclude pages queued for indexing but not yet indexed from search results.
- exclude_redirect
Default: True
Exclude page redirections from search results.
- cache_follows_redirect
Default: True
Accessing the cached page of a redirection URL automatically follows the redirection.
- admin_page_size
Default: 100
Number of items per list in the administration pages.
- search_strip
Default: <empty>
Removes this string from search queries.
- crawl_status_autorefresh
Default: 5
Delay between automatic refreshes of the crawl status page (in seconds).
- browsable_home
Default: True
Display entry point documents on the homepage.
- links_no_referrer
Default: True
Omit the referrer header when accessing external links.
- links_new_tab
Default: False
Open external links in a new tab.
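The search redirection and connectivity check options above work together. The sketch below assumes a key=value syntax and uses placeholder values in angle brackets. With default_search_redirect left empty, searches would presumably stay on SOSSE while offline and go to the named engine once the connectivity check succeeds.

```ini
[webserver]
anonymous_search=False
# Placeholder token; Atom feeds can then be requested with ?token=<Atom access token>
atom_access_token=<Atom access token>
# Empty: SOSSE is the default search engine
default_search_redirect=
# Placeholder: case-sensitive "Short name" of the engine to use when online
online_search_redirect=<search engine Short name>
# Connectivity check: URL, timeout in seconds, and how often the result is refreshed
online_check_url=https://google.com/
online_check_timeout=1.0
online_check_cache=10
```

Since the cache is kept per uWSGI worker, the connectivity check may run more often than online_check_cache alone suggests when several workers are serving requests.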
[crawler] section
This section describes options dedicated to the crawlers.
- crawler_count
Default: <empty>
Number of crawlers running concurrently (defaults to the number of CPUs available).
- proxy
Default: <empty>
URL of the HTTP proxy server to use. Example: http://192.168.0.1:8080/
- user_agent
Default: SOSSE
User agent sent by crawlers.
- requests_timeout
Default: 10
Timeout in seconds when retrieving pages with Requests (no timeout if 0).
- fail_over_lang
Default: english
Language used to parse web pages when the original language could not be detected.
- hashing_algo
Default: md5
Hashing algorithm used to determine whether the content of a page has changed.
- screenshots_size
Default: 1920x1080
Resolution of the browser used to take screenshots.
- default_browser
Default: chromium
Defines which browser to use by default when browsing mode is auto-detected (can be either “firefox” or “chromium”).
- chromium_options
Default: --enable-precise-memory-info --disable-default-apps --incognito --headless
Options passed to Chromium’s command line. You may need to add --no-sandbox to run the crawler as root, or --disable-dev-shm-usage to run in a virtualized container.
- firefox_options
Default: --headless
Options passed to Firefox’s command line.
- js_stable_time
Default: 0.1
When loading a page in a browser, wait js_stable_time seconds before checking that the DOM stays unchanged.
- js_stable_retry
Default: 100
Check at most js_stable_retry times for the page to stay unchanged before processing.
- tmp_dl_dir
Default: /var/lib/sosse/downloads
Base directory where files are temporarily downloaded.
- dl_check_time
Default: 0.1
Download detection will check every dl_check_time seconds for a started download.
- dl_check_retry
Default: 2
Download detection will retry dl_check_retry times for a started download.
- max_file_size
Default: 500
Maximum file size to index (in kB).
- max_html_asset_size
Default: 5000
Maximum file size of HTML assets (CSS, images, etc.) to download (in kB).
- max_redirects
Default: 5
Maximum number of redirects before aborting (this is accurate only when using Requests; some redirects may be missed with Chromium).
- browser_idle_exit_time
Default: 5
Close the browser when the crawler is idle for browser_idle_exit_time seconds.
- browser_crash_sleep
Default: 1.0
Sleep browser_crash_sleep seconds before retrying after the browser crashed.
- browser_crash_retry
Default: 1
Retry browser_crash_retry times to index the page on browser crashes.
- css_parser
Default: internal
Choose which CSS parser implementation to use. May be one of “internal” or “cssutils”. You may want to change this option when HTML snapshots have broken styles.
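As an illustration of the browser-related options above, here is a sketch of a [crawler] section for a crawler running as root inside a container. It assumes a key=value syntax; the Chromium value is the documented default with the two flags mentioned above appended.

```ini
[crawler]
default_browser=chromium
# Documented default, plus --no-sandbox (running as root)
# and --disable-dev-shm-usage (virtualized container)
chromium_options=--enable-precise-memory-info --disable-default-apps --incognito --headless --no-sandbox --disable-dev-shm-usage
# Wait 0.1 s between DOM stability checks, and check at most 100 times
js_stable_time=0.1
js_stable_retry=100
# Switch parser if HTML snapshots show broken styles
css_parser=cssutils
```

If the stability checks are spaced js_stable_time apart, these defaults would bound the wait at roughly 0.1 × 100 = 10 seconds per page, not counting the time taken by the checks themselves.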