[common] section
This section describes options common to the web interface and the crawlers.
- secret_key
Default: CHANGE ME
Run sosse-admin generate_secret to create a new one.
See https://docs.djangoproject.com/en/3.2/ref/settings/#secret-key
Warning
Keep the secret key used in production secret!
- debug
Default: False
Debug mode.
Warning
Don’t run with debug turned on in production!
- db_name
Default: sosse
PostgreSQL database name.
- db_user
Default: sosse
PostgreSQL username.
- db_pass
Default: CHANGE ME
PostgreSQL password.
- db_host
Default: 127.0.0.1
PostgreSQL hostname or IP address.
- db_port
Default: 5432
PostgreSQL port.
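As an illustration, the options above could be combined into a [common] block like the one below. This is only a sketch: it assumes the configuration file uses INI syntax with # comment lines, and every value shown is the documented default rather than a recommendation.

```ini
[common]
# Replace with the output of: sosse-admin generate_secret
secret_key = CHANGE ME
# Keep debug disabled in production.
debug = False
# PostgreSQL connection settings (documented defaults).
db_name = sosse
db_user = sosse
# Replace with the real database password.
db_pass = CHANGE ME
db_host = 127.0.0.1
db_port = 5432
```

The two CHANGE ME placeholders are the values that need replacing before a production deployment.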
[webserver] section
This section describes options dedicated to the web interface.
- anonymous_search
Default: False
Anonymous users (users not logged in) can do searches.
- search_shortcut_char
Default: !
Special character used as search shortcut.
- default_search_redirect
Default: <empty>
Default search engine to use. Leave empty to use Sosse; otherwise, set it to the search engine’s “Short name”.
Warning
This field is case sensitive.
- online_search_redirect
Default: <empty>
Search engine to use when the connectivity check succeeds (see online_check_url). Leave empty to use default_search_redirect; otherwise, set it to the search engine’s “Short name”.
Warning
This field is case sensitive.
- online_check_url
Default: https://google.com/
URL requested to determine whether Sosse is in online or offline mode.
- online_check_timeout
Default: 1.0
Timeout in seconds of the request used to determine online or offline mode.
- online_check_cache
Default: 10
The online check is done once every online_check_cache requests. The special value once can be used to run the check only once, when the first request is done. 0 can be used to disable caching.
Note
The cache is effective on a per-worker basis, and only as long as the uWSGI worker is alive. So even with a value of once, a new check is done every time a new worker is spawned.
- sosse_shortcut
Default: <empty>
When default_search_redirect is not empty, this defines which shortcut searches Sosse.
- allowed_host
Default: *
FQDN of the webserver, “*” for any.
See https://docs.djangoproject.com/en/3.2/ref/settings/#allowed-hosts
- static_url
Default: /static/
- static_root
Default: /var/lib/sosse/static/
- screenshots_url
Default: /screenshots/
- screenshots_dir
Default: /var/lib/sosse/screenshots/
- scripts_dir
Default: /var/lib/sosse/scripts/
- html_snapshot_url
Default: /snap/
URL path to HTML snapshots.
Danger
This value is hardcoded inside stored HTML snapshots. If you modify it, any HTML page previously stored as a snapshot will need to be crawled again in order to update internal links.
- html_snapshot_dir
Default: /var/lib/sosse/html/
- use_i18n
Default: True
See https://docs.djangoproject.com/en/3.2/ref/settings/#use-i18n
- use_l10n
Default: True
See https://docs.djangoproject.com/en/3.2/ref/settings/#use-l10n
- language_code
Default: en-us
See https://docs.djangoproject.com/en/3.2/ref/settings/#language-code
- datetime_format
Default: N j, Y, P
See https://docs.djangoproject.com/en/3.2/ref/settings/#datetime-format
- use_tz
Default: True
See https://docs.djangoproject.com/en/3.2/ref/settings/#use-tz
- timezone
Default: UTC
See https://docs.djangoproject.com/en/3.2/ref/settings/#time-zone
- default_page_size
Default: 20
Default result count returned.
- max_page_size
Default: 200
Maximum user-defined result count.
- data_upload_max_memory_size
Default: 2621440
See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-memory-size
- data_upload_max_number_fields
Default: 1000
See https://docs.djangoproject.com/en/3.2/ref/settings/#data-upload-max-number-fields
- atom_access_token
Default: <empty>
When anonymous searches are disabled, a token can be used to access Atom feeds without authenticating. The token can be passed to HTTP requests as a URL parameter, for example ?token=<Atom access token>. Setting an empty string disables token access.
- atom_feed_size
Default: 200
Number of results returned by Atom feeds.
- atom_archive_bin_passthrough
Default: True
Archive links from the Atom feed to binary files return the binary files themselves instead of the related metadata archive page.
- csv_export
Default: True
Enable CSV export.
- csv_export_size
Default: 200
Number of results returned by CSV export.
- exclude_not_indexed
Default: True
Exclude pages queued for indexing but not yet indexed from search results.
- exclude_redirect
Default: True
Exclude page redirections from search results.
- archive_follows_redirect
Default: True
Accessing the archive page of a redirection URL automatically follows the redirection.
- admin_page_size
Default: 100
Number of items per list in the administration pages.
- search_strip
Default: <empty>
Removes this string from search queries.
- crawl_status_autorefresh
Default: 5
Delay in seconds between automatic refreshes of crawl information in the Crawl queue and Crawlers pages.
- browsable_home
Default: True
Display entry point documents on the homepage.
- links_no_referrer
Default: True
Omit the referrer header when accessing external links.
- links_new_tab
Default: False
Open external links in a new tab.
- home_search_history_size
Default: 3
Number of recent searches displayed on the homepage.
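To show how the search redirection options fit together, here is a hedged sketch of a [webserver] block. It assumes INI syntax with # comment lines, and the engine name ExampleEngine is hypothetical: it stands for the “Short name” of a search engine already defined in Sosse.

```ini
[webserver]
# Allow users who are not logged in to search.
anonymous_search = True
# Left empty: Sosse itself answers searches when no redirect applies.
default_search_redirect =
# Hypothetical "Short name" (case sensitive), used when the online check succeeds.
online_search_redirect = ExampleEngine
# Connectivity check settings (documented defaults, except the cache).
online_check_url = https://google.com/
online_check_timeout = 1.0
# Run the check only once per uWSGI worker.
online_check_cache = once
```

With these values, searches would go to the external engine whenever the connectivity check succeeds and fall back to Sosse otherwise, following the option descriptions above.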
[crawler] section
This section describes options dedicated to the crawlers.
- crawler_count
Default: <empty>
Number of crawlers running concurrently (defaults to the number of CPUs available divided by 2).
- proxy
Default: <empty>
URL of the HTTP proxy server to use. Example: http://192.168.0.1:8080/
- user_agent
Default: Sosse
User agent sent by crawlers.
- fake_user_agent_browser
Default: <empty>
Use a preset UA using the fake-useragent library. The UA will be selected among the provided browsers, specified as a comma-separated list of values among: chrome, edge, firefox, safari.
Note
To enable fake-useragent, the user_agent option must be set to empty.
- fake_user_agent_os
Default: <empty>
Use a preset UA using the fake-useragent library. The UA will be selected among the provided operating systems, specified as a comma-separated list of values among: windows, linux, macos.
Note
To enable fake-useragent, the user_agent option must be set to empty.
- fake_user_agent_platform
Default: <empty>
Use a preset UA using the fake-useragent library. The UA will be selected among the provided platforms, specified as a comma-separated list of values among: pc, mobile, tablet.
Note
To enable fake-useragent, the user_agent option must be set to empty.
- requests_timeout
Default: 10
Timeout in seconds when retrieving pages with Requests (no timeout if 0).
- fail_over_lang
Default: english
Language used to parse web pages when the original language could not be detected.
- hashing_algo
Default: md5
Hashing algorithm used to determine whether the content of a page has changed.
- screenshots_size
Default: 1920x1080
Resolution of the browser used to take screenshots.
- default_browser
Default: chromium
Defines which browser to use by default when browsing mode is auto-detected (can be either “firefox” or “chromium”).
- chromium_options
Default: --enable-precise-memory-info --disable-default-apps --headless
Options passed to Chromium’s command line. You may need to add --no-sandbox to run the crawler as root, or --disable-dev-shm-usage to run in a virtualized container.
- firefox_options
Default: --headless
Options passed to Firefox’s command line.
- js_stable_time
Default: 0.1
When loading a page in a browser, wait js_stable_time seconds before checking that the DOM stays unchanged.
- js_stable_retry
Default: 100
Check at most js_stable_retry times for the page to stay unchanged before processing.
- tmp_dl_dir
Default: /var/lib/sosse/downloads
Base directory where files are temporarily downloaded.
- browser_config_dir
Default: /var/lib/sosse/browser_config
Base directory where browser configuration files and profiles are stored.
- dl_check_time
Default: 0.1
Download detection will check every dl_check_time seconds for a started download.
- dl_check_retry
Default: 2
Download detection will retry dl_check_retry times for a started download.
- max_file_size
Default: 1000000
Maximum file size to index (in kB).
- max_html_asset_size
Default: 50000
Maximum file size of HTML assets (CSS, images, etc.) to download (in kB).
- max_redirects
Default: 5
Maximum number of redirects before aborting (this is accurate only when using Requests; some redirects may be missed with Chromium).
- browser_idle_exit_time
Default: 5
Close the browser when the crawler is idle for browser_idle_exit_time seconds.
- browser_crash_sleep
Default: 1.0
Sleep browser_crash_sleep seconds before retrying after the browser crashed.
- browser_crash_retry
Default: 1
Retry browser_crash_retry times to index the page on browser crashes.
- css_parser
Default: internal
Choose which CSS parser implementation to use. May be one of internal or cssutils. You may want to change this option when HTML snapshots have broken styles.
- worker_crash_retry
Default: 1
Retry worker_crash_retry times to index the page on worker crashes.
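The fake-useragent and browser options above could be combined as in the following sketch. It assumes INI syntax with # comment lines and that list values are written without spaces after the commas; the choice of browsers, operating system, and platform is only an example.

```ini
[crawler]
# Must be left empty for the fake-useragent options to take effect.
user_agent =
# Pick a preset UA among these browsers, on this OS and platform.
fake_user_agent_browser = chrome,firefox
fake_user_agent_os = linux
fake_user_agent_platform = pc
# Documented defaults plus the flag suggested for virtualized containers.
chromium_options = --enable-precise-memory-info --disable-default-apps --headless --disable-dev-shm-usage
firefox_options = --headless
```

Add --no-sandbox to chromium_options only if the crawler has to run as root, as noted in the option description.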