Atom and RSS feeds
SOSSE can crawl Atom and RSS feeds. This is useful for websites that are updated often, since the crawler can follow new entries and skip pages that are already indexed. A syndication feed is only indexed if it is queued explicitly.
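As an illustration of the idea above, here is a minimal sketch of reading entry URLs from an Atom or RSS 2.0 document and keeping only those not yet indexed. It is not SOSSE's implementation: the function names `extract_feed_links` and `links_to_crawl` and the `already_indexed` set are hypothetical, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Atom 1.0 documents.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_feed_links(xml_text):
    """Return the entry URLs of an Atom or RSS 2.0 document, in feed order."""
    root = ET.fromstring(xml_text)
    links = []
    if root.tag == ATOM_NS + "feed":
        for entry in root.iter(ATOM_NS + "entry"):
            for link in entry.findall(ATOM_NS + "link"):
                # An Atom link without a rel attribute counts as "alternate".
                if link.get("rel", "alternate") == "alternate" and link.get("href"):
                    links.append(link.get("href"))
                    break  # keep one URL per entry
    else:
        # RSS 2.0 layout: <rss><channel><item><link>URL</link></item>...
        for item in root.iter("item"):
            url = (item.findtext("link") or "").strip()
            if url:
                links.append(url)
    return links


def links_to_crawl(xml_text, already_indexed):
    """Keep only the feed URLs absent from the (hypothetical) index set."""
    return [u for u in extract_feed_links(xml_text) if u not in already_indexed]
```

For a feed with entries `a` and `b`, passing a set that already contains the URL of `a` would leave only `b` to crawl.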
Note
The SOSSE crawler does not recurse into feeds declared in the <head> element of webpages. To crawl a feed, the URL of the XML feed must be added to the crawl queue manually.
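Since feeds declared in a page are not followed automatically, it can help to list them before queuing them by hand. The sketch below is a hypothetical helper, not part of SOSSE: it collects `<link rel="alternate">` tags with an Atom or RSS MIME type and resolves them against an assumed `base_url`. For brevity it scans the whole document rather than only the `<head>` element.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types conventionally used to advertise syndication feeds.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkParser(HTMLParser):
    """Collect the absolute URLs of feeds advertised by <link> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def find_feeds(html_text, base_url):
    """Return the feed URLs declared in html_text, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html_text)
    return parser.feeds
```

Each URL returned by `find_feeds` could then be added to the crawl queue manually, as described in the note.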
Caching for news aggregators
By crawling syndication feeds, SOSSE can serve as an offline cache for news aggregator software. After the XML feed is indexed, the cached pages from the feed can be registered in the aggregator using the Atom feed generated by SOSSE. To do so, run a search with the following parameters:
Leave the keyword field empty
Set a search parameter to "Keep", "Linked by url", "Equal to", and use the URL of the XML feed as the value
Sort results by "First crawled descending"