RSS In The Wild
I recently came across Kagi Small Web after I started using Kagi as my primary search engine. From their launch blog post:
“small web” typically refers to the non-commercial part of the web, crafted by individuals to express themselves or share knowledge without seeking any financial gain. This concept often evokes nostalgia for the early, less commercialized days of the web, before the ad-supported business model took over the internet
To be included on the list you have to meet certain criteria, one of which is to have an RSS/Atom feed of the content. When I created the RSS feed for this site I searched for best practices: RSS vs Atom, which Content-Type header to use, and so on.
So when I discovered that the list of sites is available on GitHub at kagisearch/smallweb I wondered what conclusions everyone else came to...
Scraping
I wanted to scrape both the HTTP headers and body for all the feed URLs. I threw together a quick bash script which ran the curl requests in parallel.
I chose not to follow any redirects, forced all URLs to use HTTPS and set a hard timeout of 3 seconds.
Of the 14,513 URLs, 12,929 returned HTTP 200; all other responses were discarded. The data was then passed through some gnarly grep one-liners to produce the graphs below.
A total of 3,024 MB was downloaded.
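The scraping step described above might look something like the sketch below. It is not the original script: the function names (`force_https`, `fetch_one`, `status_of`), the checksum-based file naming and the concurrency level are all assumptions, and only the `headers/` and `bodies/` directory names come from the commands in this post.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names and the file naming scheme are assumptions.

# Drop any existing scheme and prefix https:// so every URL uses HTTPS.
force_https() {
  printf '%s\n' "$1" | sed -E 's#^[a-zA-Z]+://##; s#^#https://#'
}

# Fetch one feed. --location is deliberately absent, so redirects are not
# followed; --max-time 3 is the hard 3 second timeout.
fetch_one() {
  local url name
  url=$(force_https "$1")
  # Name the output files after a checksum of the URL (assumed convention).
  name=$(printf '%s' "$url" | cksum | cut -d' ' -f1)
  curl --silent --max-time 3 \
       --dump-header "headers/$name" --output "bodies/$name" "$url"
}

# Print the status code from the first line of a saved header file,
# e.g. "HTTP/2 200" gives "200". Used to discard non-200 responses.
status_of() {
  head -n 1 "$1" | tr -d '\r' | awk '{print $2}'
}

# Example run, 32 requests at a time (the concurrency level is arbitrary):
#   mkdir -p headers bodies
#   export -f force_https fetch_one
#   xargs -P 32 -I{} bash -c 'fetch_one "$1"' _ {} < smallweb.txt
```

Because redirects are not followed, each header file holds a single response, so checking its first line is enough to decide whether to keep it.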
RSS vs Atom
grep --no-filename -EiRo '<(rss|feed)' bodies/ | sort | uniq -c | sort -rn
Content-Type Header
There are many different content types which can be declared for a feed. Which is the most common?
grep --no-filename -EizR "http[0-9\/\.]+ 200" headers | tr '\0' '\n' | tr -d '\r' | grep --no-filename -Eio '^Content-Type:\s[^;]+' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 10
Charset
Not all sites included a charset for the feed, but when they did, what was it set to?
grep --no-filename -EizR "http[0-9\/\.]+ 200" headers | tr '\0' '\n' | grep -Eio '^Content-Type:.+$' | grep -Eio 'charset=[^;]+$' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 10
Path
What is the path to the feed?
Trailing slashes were stripped before aggregation.
grep -Eoi '\.[a-z]+/.+' smallweb.txt | grep -Eio '/.+' | sed 's/\/$//' | sort | uniq -c | sort -rn | head -n 10
gTLD Domain Choice
grep -Eoi '\.[^/]{2,7}/' smallweb.txt | sed 's/\/$//' | sort | uniq -c | sort -rn | head -n 10
Web Server
grep --no-filename -EiRo '^server:.+$' headers/ | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 10
Categories
I parsed out all the <category term=""/> nodes from the RSS feeds.
The number of categories per feed, case-insensitive.
Most common categories across all feeds. Case-insensitive, each category was counted only once per feed.
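The category counts could be produced in the same grep-and-sort style as the rest of this post. The sketch below is an assumption, not the script actually used: it only looks at the `term` attribute, lowercases each value, and de-duplicates per file so that a category counts once per feed. `top_categories` is a hypothetical name; `bodies/` is the directory used in the earlier commands.

```shell
#!/usr/bin/env bash
# Sketch only: counts how many feeds use each category term.

top_categories() {
  local dir="$1" f
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    # 1. Pull out each <category ... term="..."> opening (non-empty terms).
    # 2. Keep just the quoted value at the end of the match.
    # 3. Lowercase it, then de-duplicate within this one feed.
    grep -Eio '<category[^>]*term="[^"]+"' "$f" \
      | grep -Eo '"[^"]+"$' \
      | tr -d '"' \
      | tr '[:upper:]' '[:lower:]' \
      | sort -u
  done | sort | uniq -c | sort -rn
}

# Example: the ten most common categories across all downloaded feeds.
#   top_categories bodies | head -n 10
```

Running `sort -u` inside the loop is what makes each category count once per feed; the final `sort | uniq -c` then counts feeds rather than individual entries.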