Generate RSS feeds with grep(1), sed(1), and Awk(1)

sedachv · on May 22, 2021

I strongly recommend adding a schema validator to anything that generates XML. ATOM¹ has a nice schema available² that you can use at the end to check the whole thing (I use xmllint³, since it is in a lot of package repositories). Another nice thing about ATOM compared to RSS is that it has the xml:base attribute, which means you do not need to rewrite relative URLs into absolute ones. You can use recode's⁴ XML-standalone/r0 charset to do the XML escaping correctly. And of course you should be using tidy⁵ to make sure the HTML is correct before putting it into the feed or publishing it on the web. It only takes about 6 lines of shell code using the mentioned tools to generate the ATOM feed for my blog.⁶

¹ https://www.rfc-editor.org/rfc/rfc4287.txt

² https://gist.github.com/tjdett/4617547

³ http://xmlsoft.org/xmllint.html

⁴ https://github.com/rrthomas/recode/

⁵ https://www.html-tidy.org/

⁶ https://oneofus.la/have-emacs-will-hack/files/current/Makefi...
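
For readers who want to see the escaping step in isolation, here is a minimal sketch that swaps in sed for recode. It is an illustration under stated assumptions, not the Makefile linked above: the function name `xml_escape` is hypothetical, and it covers only the four characters shown.

```shell
#!/bin/sh
# xml_escape: hypothetical helper (not from the linked Makefile).
# Reads text on stdin and writes it with XML special characters
# replaced by entities. The ampersand is handled first; otherwise the
# '&' introduced by the later substitutions would be escaped again.
xml_escape() {
    sed -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&quot;/g'
}

# Example: escape a snippet of HTML before embedding it in a feed entry.
printf '%s\n' '<pre>$ tr -cd " -~" < /dev/urandom</pre>' | xml_escape
```

Validating the finished feed could then look something like `xmllint --noout --schema atom.xsd feed.xml`, assuming the schema was saved locally as `atom.xsd` and the feed as `feed.xml` (both file names are placeholders).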

jedimastert · on May 22, 2021

Awk is really such an interesting tool that I don't see people using enough. I could see how awk would be perfect for this.

falsaberN1 · on May 22, 2021

I use awk a lot myself. It's gotten to the point where I enjoy using it, which might be a disease of some sort. But it's still been highly productive for parsing text streams or doing things like adding proper visual feedback to plaintext tools, pretty progress bars for wget/curl download jobs, monitoring scripts, and generally anything related to extracting bits of data or pretty-printing.

I particularly enjoy using gawk (mostly for the more usable match() and sed-like in-file editing), but will do fine with a busybox awk or so as well. Something about the language just clicks with my way of doing things. It's like the perfect middle ground between a shell script and starting a full Python (or equivalent) script.
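
To give a flavor of the progress-bar use case, here is a small sketch in portable awk. It assumes the input has already been reduced to one integer percentage per line (extracting that from wget or curl output is left out), and the function name `progress_bar` is made up for illustration.

```shell
#!/bin/sh
# progress_bar: hypothetical helper. Reads one integer percentage
# (0-100) per line on stdin and prints a fixed-width text bar for each.
progress_bar() {
    awk -v width=20 '
    {
        pct = $1 + 0                 # force numeric interpretation
        if (pct < 0) pct = 0         # clamp out-of-range values
        if (pct > 100) pct = 100
        filled = int(pct * width / 100)
        bar = ""
        for (i = 0; i < width; i++) {
            c = (i < filled) ? "#" : "-"
            bar = bar c
        }
        printf "[%s] %3d%%\n", bar, pct
    }'
}

# Example: render three sample values.
printf '%s\n' 0 50 100 | progress_bar
```

For a bar that redraws in place on a terminal, the trailing `\n` in the printf format could be replaced with `\r`.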

nmz · on May 22, 2021

The more awk I write, the more it starts to look like apl/j.

And that's a good thing.

ape4 · on May 22, 2021

It's named after the last names of the creators: Alfred Aho, Peter Weinberger, and Brian Kernighan.

We're lucky C isn't named K&R.

midasuni · on May 22, 2021

Must have been AWKward to decide the best order of the initials.

turndown · on May 23, 2021

It was decided that other options were WAK.

snypher · on May 23, 2021

KAW, a corvid interpreter.

justaj · on May 23, 2021

> that I don't see people using enough.

I'm betting that's because of its steep learning curve.

miles · on May 22, 2021

Strangely, the site itself does not appear to currently offer an RSS feed. I was hoping to subscribe for updates to the article collection: https://www.romanzolotarev.com

rany_ · on May 22, 2021

His RSS feed is https://www.romanzolotarev.com/rss.xml

miles · on May 22, 2021

Thanks very much! Strange it is not displayed anywhere on the page or even in the source code.

vbezhenar · on May 22, 2021

Probably it's not ready for consumption yet.

axiolite · on May 22, 2021

Thanks for the link. Seems his scripts need more work because lots of HTML tags are being incorrectly escaped.

e.g.: <pre> $ tr -cd ' -~' < /dev/urandom | fold -w 20 | head -n 1 a(k#$(K ?I?d!^NM^(5x $ </pre>

k1m · on May 22, 2021

This is very cool. I think the intention is to build RSS feeds for sites you're building yourself. I work on an RSS generator that works along similar lines, but intended to be used with public pages, and the RSS items are selected with CSS selectors you enter into the form: https://createfeed.fivefilters.org

jefftk · on May 23, 2021

It's cool to see this written in shell in a "that's clever" kind of way, but it's still not a good idea. Shell is good at gluing together calls to standard tools, and for interactive use, but anything this complex is a much better fit for something like Python.

axiolite points out some apparent escaping bugs (https://news.ycombinator.com/item?id=27250215), which would be much easier to catch/fix in a language that was less of a Turing tarpit.

thxg · on May 22, 2021

I looked for such a tool a couple of years ago and did not find any. As other comments indicate, this one may not yet be fully general-purpose.

Could anyone here point me to an alternative? Something that generates RSS or Atom from HTML files and can be called from the command line. Essentially, something like pandoc, but for RSS.

pedro1976 · on May 23, 2021

I wrote a general purpose tool for html to rss mapping, maybe what you are looking for. You can run it as a docker container. https://github.com/damoeb/rss-proxy

podiki · on May 23, 2021

Not exactly what you are looking for, but probably the core HTML-->RSS is in rss-bridge: https://github.com/RSS-Bridge/rss-bridge

podiki · on May 23, 2021

Nice tool! A related one that is helpful for getting RSS feeds from websites (with e.g. full article content) is rss-bridge: https://github.com/RSS-Bridge/rss-bridge