File Downloads¶

SOSSE allows for the automation of file downloads from websites. The example below demonstrates how to download new eBooks daily from Project Gutenberg.

Project Gutenberg is a digital library offering over 75,000 free eBooks, including many classic literary works.

Note

Project Gutenberg provides several methods for retrieving its content if you wish to download it. See the Offline Catalogs and Feeds for more information. If you wish to download the full database, there are more appropriate methods than crawling, such as the Mirroring How-To. 🐞

Creating Crawl Policies¶

Crawl policies are essential for controlling how SOSSE accesses and downloads content from websites. For more details, see the Crawl Policies documentation.

For Project Gutenberg, we will create two policies:

RSS Feed Policy This policy reads the RSS feed of Project Gutenberg daily to monitor new content. Configure it as follows:

In the ⚡ Crawl tab, use a regular expression ^http://www.gutenberg.org/cache/epub/feeds/today.rss$ to target the daily RSS feed.
In the 🔖 Archive tab, disable Archive content (as we don’t need to archive the original feed).
In the 🕑 Recurrence tab, set Crawl frequency to Constant time and clear the Recrawl dt max field.

Book Download Policy This policy downloads the reference page and the EPUB version of each new book identified in the RSS feed. Configure it as follows:

In the ⚡ Crawl tab, use the regular expressions:
- ^https://www.gutenberg.org/ebooks/[0-9]+$ for the reference page.
- ^https://www.gutenberg.org/ebooks/[0-9]+.epub3.images$ and ^https://www.gutenberg.org/cache/epub/.*epub$ for the EPUB files.
In the 🕑 Recurrence tab, set Crawl frequency to Once (as reference pages and books do not need updates after initial download). Additionally, clear both the Recrawl dt min and Recrawl dt max fields.

../_images/guide_download_crawl_policies.png

Start Crawling¶

To start crawling, go to the Crawl a new URL page and enter the URL od the RSS feed: http://www.gutenberg.org/cache/epub/feeds/today.rss.

Check the parameters, then click Confirm. Once confirmed, you will be able to see the crawl queue retrieving the files from the feed.

../_images/guide_download_crawl_queue.png

View the Library¶

To view all the books indexed from the RSS feed, go to the homepage and unfold the Params section. We can execute a query to fetch all the pages linked within the RSS feed, with the following parameters:

Sort: First crawled descending.
Search options: - Acion: Keep - Field: Linked by URL - Operator: Equal to - Value: https://www.gutenberg.org/cache/epub/feeds/today.rss

This will display all the books that were loaded from the RSS feed.

../_images/guide_download_view_library.png

Each link will point to the archived page containing information about the book:

Following the link, you will be able to download the book:

Additional Options¶

You may also be interested in the max_file_size (defaults to 5 MB) and max_html_asset_size (defaults to 50 MB) configuration options, which control the size limits of files being downloaded.

Additionally, you can use the atom feed feature to create an Atom feed that points to the downloaded EPUB files, which could be useful for integrating with an EPUB reader or sharing updates.