RSS In The Wild

9 MAY 2025

HTTP

RSS

small web

I recently came across Kagi Small Web after I started using Kagi as my primary search engine. From their launch blog post:

“small web” typically refers to the non-commercial part of the web, crafted by individuals to express themselves or share knowledge without seeking any financial gain. This concept often evokes nostalgia for the early, less commercialized days of the web, before the ad-supported business model took over the internet

To be included on the list you have to meet certain criteria, one of which is to have an RSS/Atom feed of the content. When I created the RSS feed for this site I searched for best practices: RSS vs Atom, which Content-Type header to use, and so on.

So when I discovered that the list of sites is available on GitHub at kagisearch/smallweb, I wondered what conclusions everyone else came to...

Scraping

I wanted to scrape both the HTTP headers and body for all the feed URLs. I threw together a quick bash script which ran the curl requests in parallel.

I chose not to follow any redirects, forced all URLs to use HTTPS and set a hard timeout of 3 seconds.
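A minimal sketch of what such a fetch could look like, under the three constraints above. This is not the original script: the function names, the checksum-based file naming and the concurrency of 32 are all assumptions made for illustration.

```shell
# Hypothetical sketch, not the original script.

# Rewrite a URL so it always uses the https:// scheme.
to_https() {
  printf '%s\n' "$1" | sed -E 's#^[a-zA-Z]+://##; s#^#https://#'
}

# Fetch one feed. Headers go to headers/<checksum>, the body to bodies/<checksum>.
# There is no -L flag, so redirects are not followed.
# --max-time caps the whole request at 3 seconds.
fetch_one() {
  local url name
  url=$(to_https "$1")
  # Assumed naming scheme: a CRC of the URL (collisions are possible but rare).
  name=$(printf '%s' "$url" | cksum | cut -d' ' -f1)
  curl --silent --max-time 3 \
    --dump-header "headers/$name" --output "bodies/$name" "$url"
}

# Run up to 32 fetches at a time over a file with one URL per line.
fetch_all() {
  mkdir -p headers bodies
  export -f to_https fetch_one
  xargs -P 32 -I{} bash -c 'fetch_one "$1"' _ {} < "$1"
}
```

With these definitions loaded in bash, `fetch_all smallweb.txt` would fill `headers/` and `bodies/` with one file per URL.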

Of the 14,513 URLs, 12,929 returned HTTP 200; all other responses were discarded. The data was then passed through some gnarly grep one-liners to produce the graphs below.

A total of 3,024 MB was downloaded.

RSS vs Atom

grep --no-filename -EiRo '<(rss|feed)' bodies/ | sort | uniq -c | sort -rn
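The one-liner counts every match across all bodies. A stricter variant, sketched here as an assumption rather than the method used for the graph, classifies each file once by the first `<rss` or `<feed` tag it contains. The function name `feed_kind` is hypothetical.

```shell
# Hypothetical helper: classify one downloaded body as rss, atom or unknown.
feed_kind() {
  local tag
  # Take the first <rss or <feed tag, requiring whitespace, '>' or end of line
  # after the name so that tags such as <feedburner:info are not counted.
  tag=$(grep -Eio '<(rss|feed)([[:space:]>]|$)' "$1" | head -n 1 | tr '[:upper:]' '[:lower:]')
  case "$tag" in
    '<rss'*)  echo rss ;;
    '<feed'*) echo atom ;;
    *)        echo unknown ;;
  esac
}
```

Running it over the download directory, for example `for f in bodies/*; do feed_kind "$f"; done | sort | uniq -c | sort -rn`, would give one count per feed rather than one per matching tag.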

Content-Type Header

Many different content types can be declared for a feed. Which is most common?

grep --no-filename -EizR "http[0-9\/\.]+ 200" headers | tr '\0' '\n' | grep --no-filename -Eio '^Content-Type:\s[^;]+$' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 10

Charset

Not all sites included a charset for the feed, but when they did, what was it set to?

grep --no-filename -EizR "http[0-9\/\.]+ 200" headers | tr '\0' '\n' | grep -Eio '^Content-Type:.+$' | grep -Eio 'charset=[^;]+$' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 10
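Both the Content-Type and Charset counts come down to splitting a single header line into a media type and an optional charset parameter. The sketch below shows that split in isolation. The function name `parse_content_type` and the `-` placeholder for a missing charset are assumptions, not part of the one-liners above.

```shell
# Hypothetical helper: print "<media type><TAB><charset>" for one header line.
# A missing charset is printed as "-".
parse_content_type() {
  local value type charset
  # Strip the carriage return curl leaves on header lines, lowercase everything,
  # then drop the header name.
  value=$(printf '%s' "$1" | tr -d '\r' | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/^content-type:[[:space:]]*//')
  # The media type is everything before the first ';', minus trailing whitespace.
  type=$(printf '%s' "${value%%;*}" | sed -E 's/[[:space:]]+$//')
  # The charset value may be quoted; capture it without the quotes.
  charset=$(printf '%s' "$value" | sed -nE 's/.*charset="?([^";[:space:]]+).*/\1/p')
  printf '%s\t%s\n' "$type" "${charset:--}"
}
```

Unlike the Content-Type one-liner, whose pattern only matches lines without a `;`, this keeps the media type of headers that also declare a charset.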

Path

What is the path to the feed?

Trailing slashes were stripped before aggregation.

grep -Eoi '\.[a-z]+/.+' smallweb.txt | grep -Eio '/.+' | sed 's/\/$//' | sort | uniq -c | sort -rn | head -n 10
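The same extraction can be written per URL with plain parameter expansion, which may be easier to follow than the regex. This is only a sketch: `feed_path` is a hypothetical name, and returning `/` for a URL with no path is a choice made here (the one-liner skips such URLs).

```shell
# Hypothetical helper: print the path of a feed URL without a trailing slash.
feed_path() {
  local rest path
  rest=${1#*://}               # drop the scheme
  case "$rest" in
    */*) path=/${rest#*/} ;;   # keep everything from the first slash onward
    *)   path=/ ;;             # bare host, no path at all
  esac
  path=${path%/}               # strip one trailing slash
  printf '%s\n' "${path:-/}"   # an empty result means the root path
}
```

For example, `while read -r url; do feed_path "$url"; done < smallweb.txt | sort | uniq -c | sort -rn | head -n 10` would produce a comparable top ten.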

gTLD Domain Choice

grep -Eoi '\.[^/]{2,7}/' smallweb.txt | sed 's/\/$//' | sort | uniq -c | sort -rn | head -n 10
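For comparison, here is a sketch that takes only the final label of the hostname. It is an assumption about how one might count top-level domains, not the approach behind the graph: it reports `.uk` for a `.co.uk` site, where the regex above can return a longer suffix. The name `url_tld` is hypothetical.

```shell
# Hypothetical helper: print the last label of a URL's hostname, lowercased.
url_tld() {
  local host
  host=${1#*://}      # drop the scheme
  host=${host%%/*}    # drop the path
  host=${host%%:*}    # drop any port
  printf '.%s\n' "${host##*.}" | tr '[:upper:]' '[:lower:]'
}
```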

Web Server

grep --no-filename -EiRo '^server:.+$' headers/ | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 10

M0UNTAIN 0F C0DE
