File Downloads

Sosse can automate file downloads from websites. The example below shows how to download new eBooks daily from Project Gutenberg.

Project Gutenberg is a digital library offering over 75,000 free eBooks, including many classic literary works.

Note

Project Gutenberg provides several methods for retrieving its content. See the Offline Catalogs and Feeds for more information. If you wish to download the full database, there are more appropriate methods than crawling, such as the Mirroring How-To.

Collections Setup

Collections are essential for controlling how Sosse accesses and downloads content from websites. For more details, see the Collections documentation.

Project Gutenberg Collection

  • In the ⚑ Crawl tab, set Unlimited depth URL regex:

    ^http://www.gutenberg.org/cache/epub/feeds/today.rss$
    ^https://www.gutenberg.org/ebooks/[0-9]+$
    ^https://www.gutenberg.org/ebooks/[0-9]+.epub3.images$
    ^https://www.gutenberg.org/cache/epub/.*epub$
    
  • In the 🔖 Archive tab, ensure Archive content is enabled to download the EPUB files.

  • In the 🕑 Recurrence tab, set Crawl frequency to Once (as reference pages and books do not need updates after initial download). Additionally, clear both the Recrawl dt min and Recrawl dt max fields.

../_images/guide_download_collections.png
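
The URL patterns listed for the Crawl tab can be sanity-checked outside Sosse before saving the collection. The following is a minimal sketch using Python's standard `re` module; it assumes the patterns behave like ordinary Python regular expressions, which may differ in detail from how Sosse evaluates them. The sample URLs, including the book number 1342 and the EPUB file name, are placeholders chosen for illustration.

```python
import re

# Patterns copied verbatim from the "Unlimited depth URL regex" field.
PATTERNS = [
    r"^http://www.gutenberg.org/cache/epub/feeds/today.rss$",
    r"^https://www.gutenberg.org/ebooks/[0-9]+$",
    r"^https://www.gutenberg.org/ebooks/[0-9]+.epub3.images$",
    r"^https://www.gutenberg.org/cache/epub/.*epub$",
]
COMPILED = [re.compile(p) for p in PATTERNS]


def is_allowed(url: str) -> bool:
    """Return True if at least one pattern matches the URL."""
    return any(p.search(url) for p in COMPILED)


# Placeholder URLs (assumptions, not taken from the guide).
SAMPLES = [
    "http://www.gutenberg.org/cache/epub/feeds/today.rss",
    "https://www.gutenberg.org/ebooks/1342",
    "https://www.gutenberg.org/ebooks/1342.epub3.images",
    "https://www.gutenberg.org/cache/epub/1342/pg1342-images-3.epub",
    "https://www.gutenberg.org/ebooks/author/68",
    "https://www.gutenberg.org/ebooks/1342.txt.utf-8",
]

for url in SAMPLES:
    label = "match   " if is_allowed(url) else "no match"
    print(label, url)
```

Note that the dots in these patterns are unescaped, so each one matches any single character rather than only a literal dot. This is harmless for this example, but a stricter pattern would write `\.` instead.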

Start Crawling

To start crawling, go to the Crawl a new URL page and enter the URL of the RSS feed: http://www.gutenberg.org/cache/epub/feeds/today.rss.

Check the parameters, then click Add to Crawl Queue. Once confirmed, you will be able to see the crawl queue retrieving the files from the feed.

../_images/guide_download_crawl_queue.png
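
To get a feel for what the crawler finds in an RSS feed, the sketch below extracts the item links from a small RSS 2.0 document with Python's standard library. The feed content is invented for illustration (titles and book numbers are placeholders), and the code does not reproduce Sosse's own feed handling.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; the real today.rss will differ in content.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Placeholder Book One</title>
      <link>https://www.gutenberg.org/ebooks/11111</link>
    </item>
    <item>
      <title>Placeholder Book Two</title>
      <link>https://www.gutenberg.org/ebooks/22222</link>
    </item>
  </channel>
</rss>
"""


def item_links(rss_text: str) -> list[str]:
    """Return the <link> value of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


for link in item_links(SAMPLE_FEED):
    print(link)
```

Links shaped like these book pages are the ones the second URL pattern of the collection is written to match.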

View the Library

To view all the books indexed from the RSS feed, go to the homepage and unfold the Params section. We can execute a query to fetch all the pages linked within the RSS feed, with the following parameters:

  • Sort: First crawled descending.

  • Search options:

    • Action: Keep

    • Field: Linked by URL

    • Operator: Equal to

    • Value: https://www.gutenberg.org/cache/epub/feeds/today.rss

This will display all the books that were loaded from the RSS feed.

../_images/guide_download_view_library.png
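
The search parameters above amount to a filter followed by a sort: keep the pages linked by the feed URL, then order them by first crawl date, newest first. The sketch below expresses that logic over an in-memory list. The `Page` class and its fields are hypothetical and exist only for this illustration; they are not Sosse's data model, and the dates and book numbers are placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime

FEED = "https://www.gutenberg.org/cache/epub/feeds/today.rss"


@dataclass
class Page:
    """Hypothetical stand-in for an indexed page (not Sosse's schema)."""
    url: str
    first_crawled: datetime
    linked_by: set[str] = field(default_factory=set)


def books_from_feed(pages: list[Page], feed_url: str = FEED) -> list[Page]:
    """Keep pages linked by feed_url, sorted by first crawl, newest first."""
    kept = [p for p in pages if feed_url in p.linked_by]
    return sorted(kept, key=lambda p: p.first_crawled, reverse=True)


# Placeholder data for illustration only.
pages = [
    Page("https://www.gutenberg.org/ebooks/11111",
         datetime(2024, 1, 1, 8, 0), {FEED}),
    Page("https://www.gutenberg.org/ebooks/22222",
         datetime(2024, 1, 2, 8, 0), {FEED}),
    Page("https://www.gutenberg.org/ebooks/11111.epub3.images",
         datetime(2024, 1, 1, 8, 5),
         {"https://www.gutenberg.org/ebooks/11111"}),
]

for page in books_from_feed(pages):
    print(page.first_crawled.date(), page.url)
```

The EPUB download page is excluded here because it is linked from the book page rather than from the feed itself, which is why the query returns one entry per book.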

Each link will point to the archived page containing information about the book:

Book Information Page

Following the link, you will be able to download the book:

EPUB Download Page

Additional Options

You may want to use the Atom feed feature to create a feed that points to the downloaded EPUB files, which can be useful for integrating with an EPUB reader or sharing updates.