Atom and RSS feeds

SOSSE can crawl Atom and RSS feeds. This is useful for websites that are updated often, because pages that are already indexed can be skipped. A syndication feed is only indexed when it is queued explicitly.
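As an illustration of the idea, the sketch below reads the entry links from an Atom or RSS document and keeps only those that are not yet known. It is not SOSSE's actual implementation: the function names `feed_links` and `new_links`, the `already_indexed` set, and the sample feeds are assumptions made for this example, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom 1.0 documents.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def feed_links(xml_text):
    """Return the page URLs listed in an Atom or RSS 2.0 document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    links = []
    if root.tag == ATOM_NS + "feed":
        # Atom: each <entry> carries <link href="..."> elements.
        for entry in root.iter(ATOM_NS + "entry"):
            for link in entry.findall(ATOM_NS + "link"):
                # A missing rel attribute is treated as "alternate".
                if link.get("rel", "alternate") == "alternate" and link.get("href"):
                    links.append(link.get("href"))
                    break
    else:
        # RSS: each <item> carries the URL as the text of <link>.
        for item in root.iter("item"):
            link = item.findtext("link")
            if link:
                links.append(link.strip())
    return links


def new_links(xml_text, already_indexed):
    """Keep only the feed links that are not in the already_indexed set."""
    return [url for url in feed_links(xml_text) if url not in already_indexed]


# Illustrative sample feeds, not taken from a real website.
RSS = """<rss version="2.0"><channel><title>Example</title>
<item><title>A</title><link>https://example.org/a</link></item>
<item><title>B</title><link>https://example.org/b</link></item>
</channel></rss>"""

ATOM = """<feed xmlns="http://www.w3.org/2005/Atom"><title>Example</title>
<entry><title>C</title><link href="https://example.org/c"/></entry>
</feed>"""

print(new_links(RSS, {"https://example.org/a"}))
print(feed_links(ATOM))
```

Because a feed lists only recent entries, comparing its links against the known pages is much cheaper than re-crawling the whole site to find what changed.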

Note

The SOSSE crawler does not recurse into feeds declared in the <head> element of web pages. To crawl a feed, the URL of the XML feed must be added to the crawl queue manually.
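Since feed URLs have to be queued by hand, it can help to list the feeds a page declares before adding them. The following is a minimal sketch using only the Python standard library. It is independent of SOSSE: the class `FeedLinkParser`, the function `find_feeds`, and the sample HTML are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to declare Atom and RSS feeds.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs declared with <link rel="alternate"> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # For simplicity, every <link> element is inspected, not only those in <head>.
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel = attributes.get("rel", "").lower().split()
        mime = attributes.get("type", "").lower().strip()
        href = attributes.get("href", "")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative links against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def find_feeds(html, base_url):
    """Return the absolute feed URLs declared in an HTML document (hypothetical helper)."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


# Illustrative page, not taken from a real website.
SAMPLE = """<html><head>
<link rel="alternate" type="application/rss+xml" href="/rss.xml">
<link rel="alternate" type="application/atom+xml" href="https://example.org/atom.xml">
<link rel="stylesheet" href="/style.css">
</head><body></body></html>"""

print(find_feeds(SAMPLE, "https://example.org/blog/"))
```

The URLs printed by this sketch are the ones that would then be added to the crawl queue manually.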

Caching for news aggregators

By crawling syndication feeds, SOSSE can be used as an offline cache for news aggregator software. After the XML feed is indexed, the cached pages from the feed can be registered in the aggregator using the Atom feed generated by SOSSE. To do this, use the following search parameters:

  • Leave the keyword field empty

  • Set a search parameter to Keep Linked by url Equal to, and use the URL of the XML feed as the value

  • Sort results by First crawled descending

../_images/syndication_feed.png