File Downloads
Sosse can automate file downloads from websites. The example below demonstrates how to download new eBooks daily from Project Gutenberg.
Project Gutenberg is a digital library offering over 75,000 free eBooks, including many classic literary works.
Note
Project Gutenberg provides several methods for retrieving its content if you wish to download it. See the Offline Catalogs and Feeds for more information. If you wish to download the full database, there are more appropriate methods than crawling, such as the Mirroring How-To.
Collections Setup
Collections are essential for controlling how Sosse accesses and downloads content from websites. For more details, see the Collections documentation.
Project Gutenberg Collection
In the Crawl tab, set Unlimited depth URL regex to:
^http://www.gutenberg.org/cache/epub/feeds/today.rss$
^https://www.gutenberg.org/ebooks/[0-9]+$
^https://www.gutenberg.org/ebooks/[0-9]+.epub3.images$
^https://www.gutenberg.org/cache/epub/.*epub$
In the Archive tab, ensure Archive content is enabled to download the EPUB files.
In the Recurrence tab, set Crawl frequency to Once (as reference pages and books do not need updates after initial download). Additionally, clear both the Recrawl dt min and Recrawl dt max fields.
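Before saving the collection, it can help to check which URLs the four patterns would accept. The following is a minimal sketch using Python's standard `re` module. It assumes that each line is treated as a separate pattern and that a URL is accepted when any pattern matches; Sosse's exact matching semantics are not confirmed here. The helper name `is_allowed` and the sample URLs are hypothetical and chosen only for illustration.

```python
import re

# Patterns copied verbatim from the Crawl tab setting, one per line.
PATTERNS = [
    r"^http://www.gutenberg.org/cache/epub/feeds/today.rss$",
    r"^https://www.gutenberg.org/ebooks/[0-9]+$",
    r"^https://www.gutenberg.org/ebooks/[0-9]+.epub3.images$",
    r"^https://www.gutenberg.org/cache/epub/.*epub$",
]
COMPILED = [re.compile(p) for p in PATTERNS]


def is_allowed(url: str) -> bool:
    """Return True if at least one pattern matches the URL.

    Assumption: a URL is crawled when any single pattern matches.
    """
    return any(p.search(url) for p in COMPILED)


# Hypothetical sample URLs, used only to exercise the patterns.
SAMPLES = [
    "http://www.gutenberg.org/cache/epub/feeds/today.rss",
    "https://www.gutenberg.org/ebooks/1342",
    "https://www.gutenberg.org/ebooks/1342.epub3.images",
    "https://www.gutenberg.org/cache/epub/1342/pg1342-images-3.epub",
    "https://www.gutenberg.org/ebooks/1342.txt.utf-8",
    "https://www.gutenberg.org/ebooks/author/68",
]

for url in SAMPLES:
    print("allowed" if is_allowed(url) else "skipped", url)
```

Under these assumptions the first four sample URLs are accepted and the last two are skipped, which keeps the crawl limited to the feed, the book pages, and the EPUB files.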
Start Crawling
To start crawling, go to the Crawl a new URL page and enter the URL of the RSS feed:
http://www.gutenberg.org/cache/epub/feeds/today.rss.
Check the parameters, then click Add to Crawl Queue. Once confirmed, you will be able to see the crawl queue
retrieving the files from the feed.
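To get a feel for what the crawler follows from the feed, the sketch below extracts the item links from an RSS document with Python's standard `xml.etree.ElementTree`. It assumes the feed uses the usual RSS 2.0 layout (`channel` containing `item` elements, each with a `link`); the embedded sample feed, its titles, and its book numbers are made up for illustration and are not real feed content. The function name `item_links` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample in RSS 2.0 layout; not actual Project Gutenberg data.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://www.gutenberg.org</link>
    <item>
      <title>Example Book One</title>
      <link>https://www.gutenberg.org/ebooks/11111</link>
    </item>
    <item>
      <title>Example Book Two</title>
      <link>https://www.gutenberg.org/ebooks/22222</link>
    </item>
  </channel>
</rss>
"""


def item_links(rss_text: str) -> list[str]:
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:  # skip items without a link element
            links.append(link.strip())
    return links


for link in item_links(SAMPLE_RSS):
    print(link)
```

Only the links inside `item` elements are returned; the channel-level `link` is ignored, which mirrors the idea that each feed entry leads to one book page.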
View the Library
To view all the books indexed from the RSS feed, go to the homepage and unfold the Params section. We can
execute a query to fetch all the pages linked within the RSS feed, with the following parameters:
Sort: First crawled descending.
Search options:
Action: Keep
Field: Linked by URL
Operator: Equal to
Value: https://www.gutenberg.org/cache/epub/feeds/today.rss
This will display all the books that were loaded from the RSS feed.
Each link points to the archived page containing information about the book.
Following that link, you can download the book.
Additional Options
You may want to use the Atom feed feature to create an Atom feed that points to the downloaded EPUB files, which could be useful for integrating with an EPUB reader or sharing updates.