Atom and RSS feeds
SOSSE can crawl Atom and RSS feeds. This is useful for crawling websites that are updated often while skipping pages that are already indexed. To index a syndication feed, it must be queued explicitly.
Note
The SOSSE crawler does not recurse into feeds declared in the <head> element of webpages. To crawl a feed, the URL of the XML feed must be added to the crawl queue manually.
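Because feeds declared in a page's <head> are not followed, their URLs have to be found before they can be queued. The following is a minimal sketch of how that lookup could be done outside SOSSE, using only the Python standard library. The names `FeedLinkParser` and `find_feeds`, and the example URLs, are hypothetical and not part of SOSSE.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to declare Atom and RSS feeds.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs declared with <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        href = attr.get("href", "")
        if "alternate" in rels and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the URL of the page.
            self.feeds.append(urljoin(self.base_url, href))


def find_feeds(html, base_url):
    """Return the absolute feed URLs declared in an HTML document."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


# Hypothetical page content, for illustration only.
page = """<html><head>
<link rel="alternate" type="application/rss+xml" href="/rss.xml">
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/atom+xml" href="https://example.org/atom.xml">
</head><body></body></html>"""

for url in find_feeds(page, "https://example.org/blog/"):
    print(url)
```

Each URL printed this way is a candidate to add to the crawl queue manually, as described in the note above.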
Caching for news aggregators
By crawling syndication feeds, SOSSE can be used as an offline archive for news aggregator software. After the XML feed is indexed, archived pages from the feed can be registered in the aggregator using the Atom feed generated by SOSSE. This can be done using the following search parameters:
Leave the keyword field empty
Set a search parameter to Keep, Linked by url, Equal to, and use the URL of the XML feed as the value
Sort results by First crawled descending