Configuration file reference
SOSSE can be configured through the configuration file /etc/sosse/sosse.conf. Configuration variables are grouped in 3 sections, depending on which component they affect. Modifying any of these options requires restarting the crawlers or the web interface.
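As a rough sketch of the overall layout, assuming the file uses the usual INI syntax (section headers in brackets, `key=value` lines, `#` comments), the three sections could be arranged as follows. The comments are illustrative only.

```ini
# /etc/sosse/sosse.conf (illustrative skeleton)

[common]
# options shared by the web interface and the crawlers

[webserver]
# options dedicated to the web interface

[crawler]
# options dedicated to the crawlers
```

Options left out of the file presumably fall back to the defaults listed in this reference.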
[common] section
This section describes options common to the web interface and the crawlers.
- secret_key
Default: CHANGE ME
Run sosse-admin generate_secret to create a new one.
See https://docs.djangoproject.com/en/3.2/ref/settings/#secret-key
Warning
Keep the secret key used in production secret!
- debug
Default: False
Debug mode.
Warning
Don't run with debug turned on in production!
- db_name
Default: sosse
PostgreSQL database name.
- db_user
Default: sosse
PostgreSQL username.
- db_pass
Default: CHANGE ME
PostgreSQL password.
- db_host
Default: 127.0.0.1
PostgreSQL hostname or IP address.
- db_port
Default: 5432
PostgreSQL port.
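The options above might be combined as in the following sketch. It assumes INI-style `key=value` syntax; the values in angle brackets are placeholders to replace, and the remaining values are the documented defaults.

```ini
[common]
# Placeholder: paste the output of `sosse-admin generate_secret` here
secret_key=<generated secret>
# Keep debug disabled in production
debug=False
# PostgreSQL connection settings (documented defaults, except the password)
db_name=sosse
db_user=sosse
db_pass=<database password>
db_host=127.0.0.1
db_port=5432
```

Since this section is shared, a change here would require restarting both the crawlers and the web interface.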
[webserver] section
This section describes options dedicated to the web interface.
- anonymous_search
Default: False
Anonymous users (users not logged in) can do searches.
- atom_access_token
Default: <empty>
When anonymous searches are disabled, a token can be used to access Atom feeds without authenticating. The token can be passed in HTTP requests as a URL parameter, for example ?token=<Atom access token>. Setting an empty string disables token access.
- search_shortcut_char
Default: !
Special character used as search shortcut.
- default_search_redirect
Default: <empty>
Default search engine to use. Leave empty to use SOSSE by default; otherwise, use the search engine's "Short name".
Warning
This field is case sensitive.
- sosse_shortcut
Default: <empty>
When default_search_redirect is not empty, this defines which shortcut searches SOSSE.
- allowed_host
Default: *
FQDN of the webserver, "*" for any.
See https://docs.djangoproject.com/en/3.2/ref/settings/#allowed-hosts
- static_url
Default: /static/
- static_root
Default: /var/lib/sosse/static/
- screenshots_url
Default: /screenshots/
- screenshots_dir
Default: /var/lib/sosse/screenshots/
- html_snapshot_url
Default: /snap/
URL path to HTML snapshots.
Danger
This value is hardcoded inside stored HTML snapshots. If you modify it, any HTML page previously stored as a snapshot will need to be crawled again in order to update internal links.
- html_snapshot_dir
Default: /var/lib/sosse/html/
- use_i18n
Default: True
See https://docs.djangoproject.com/en/3.2/ref/settings/#use-i18n
- use_l10n
Default: True
See https://docs.djangoproject.com/en/3.2/ref/settings/#use-l10n
- language_code
Default: en-us
See https://docs.djangoproject.com/en/3.2/ref/settings/#language-code
- datetime_format
Default: N j, Y, P
See https://docs.djangoproject.com/en/3.2/ref/settings/#datetime-format
- use_tz
Default: True
See https://docs.djangoproject.com/en/3.2/ref/settings/#use-tz
- timezone
Default: UTC
See https://docs.djangoproject.com/en/3.2/ref/settings/#time-zone
- default_page_size
Default: 20
Default result count returned.
- max_page_size
Default: 200
Maximum user-defined result count.
- data_upload_max_memory_size
Default: 2621440
See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-memory-size
- data_upload_max_number_fields
Default: 1000
See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-number-fields
- atom_feed_size
Default: 200
Number of results returned by Atom feeds.
- exclude_not_indexed
Default: True
Exclude pages queued for indexing but not yet indexed from search results.
- exclude_redirect
Default: True
Exclude page redirections from search results.
- cache_follows_redirect
Default: True
Accessing the cached page of a redirection URL automatically follows the redirection.
- admin_page_size
Default: 100
Number of items per list in the administration pages.
- search_strip
Default: <empty>
Removes this string from search queries.
- crawl_status_autorefresh
Default: 5
Delay between crawl status page autorefresh (in seconds).
- browsable_home
Default: False
Display entry point documents on the homepage.
- links_no_referrer
Default: True
Omit the referrer header when accessing external links.
- links_new_tab
Default: False
Open external links in a new tab.
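As an illustration of how some of these options fit together, the sketch below keeps anonymous search disabled while allowing Atom feeds to be fetched with a token. It assumes INI-style `key=value` syntax; search.example.com is a hypothetical hostname and the token is a placeholder.

```ini
[webserver]
# Only logged-in users can search
anonymous_search=False
# Feeds can then be requested with ?token=<Atom access token>
atom_access_token=<Atom access token>
# Hypothetical hostname; the default "*" accepts any host
allowed_host=search.example.com
# Result paging (documented defaults)
default_page_size=20
max_page_size=200
atom_feed_size=200
# Link behaviour for external results
links_no_referrer=True
links_new_tab=False
```

Leaving atom_access_token empty would disable token access, so feeds would then require authentication while anonymous search is off.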
[crawler] section
This section describes options dedicated to the crawlers.
- crawler_count
Default: <empty>
Number of crawlers running concurrently (defaults to the number of CPUs available).
- proxy
Default: <empty>
URL of the HTTP proxy server to use. Example: http://192.168.0.1:8080/
- user_agent
Default: SOSSE
User agent sent by crawlers.
- requests_timeout
Default: 10
Timeout in seconds when retrieving pages with Requests (no timeout if 0).
- fail_over_lang
Default: english
Language used to parse web pages when the original language could not be detected.
- hashing_algo
Default: md5
Hashing algorithm used to determine whether the content of a page has changed.
- screenshots_size
Default: 1920x1080
Resolution of the browser used to take screenshots.
- default_browser
Default: chromium
Defines which browser to use by default when browsing mode is auto-detected (can be either "firefox" or "chromium").
- chromium_options
Default: --enable-precise-memory-info --disable-default-apps --incognito --headless
Options passed to Chromium's command line. You may need to add --no-sandbox to run the crawler as root, or --disable-dev-shm-usage to run in a virtualized container.
- firefox_options
Default: --headless
Options passed to Firefox's command line.
- js_stable_time
Default: 0.1
When loading a page in a browser, wait
js_stable_time
seconds before checking the DOM stays unchanged.
- js_stable_retry
Default: 100
Check at most
js_stable_retry
times for the page to stay unchanged before processing.
- tmp_dl_dir
Default: /var/lib/sosse/downloads
Base directory where files are temporarily downloaded.
- dl_check_time
Default: 0.1
Download detection will check every dl_check_time seconds for a started download.
- dl_check_retry
Default: 2
Download detection will retry
dl_check_retry
times for a started download.
- max_file_size
Default: 500
Maximum file size to index (in kB).
- max_html_asset_size
Default: 5000
Maximum file size of html assets (css, images, etc.) to download (in kB).
- max_redirects
Default: 5
Maximum number of redirects before aborting (this is accurate when using Requests only; some redirects may be missed on Chromium).
- browser_idle_exit_time
Default: 5
Close the browser when the crawler is idle for
browser_idle_exit_time
seconds.
- browser_crash_sleep
Default: 1.0
Sleep browser_crash_sleep seconds before retrying after the browser crashed.
- browser_crash_retry
Default: 1
Retry
browser_crash_retry
times to index the page on browser crashes.
- css_parser
Default: internal
Choose which CSS parser implementation to use. May be one of internal or cssutils. You may want to change this option when HTML snapshots have broken styles.
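A crawler section for a containerized deployment behind a proxy might look like the sketch below. It assumes INI-style `key=value` syntax; the crawler count is an arbitrary illustrative value, the proxy address is the example given above, and the Chromium options are the documented defaults with --disable-dev-shm-usage appended.

```ini
[crawler]
# Illustrative value; leave empty to use the number of CPUs available
crawler_count=4
proxy=http://192.168.0.1:8080/
requests_timeout=10
default_browser=chromium
# Documented defaults plus the flag suggested for virtualized containers
chromium_options=--enable-precise-memory-info --disable-default-apps --incognito --headless --disable-dev-shm-usage
# DOM stability check (documented defaults)
js_stable_time=0.1
js_stable_retry=100
# Size limits in kB (documented defaults)
max_file_size=500
max_html_asset_size=5000
css_parser=internal
```

If each stability check waits js_stable_time seconds, as the descriptions suggest, the defaults would bound the wait at roughly 0.1 × 100 = 10 seconds per page.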