I strongly recommend adding a schema validator to anything that generates XML. Atom¹ has a nice schema available² that you can use at the end to check the whole thing (I use xmllint³, since it is in a lot of package repositories). Another nice thing about Atom compared to RSS is that it supports the xml:base attribute, which means you do not need to rewrite relative URLs into absolute ones. You can use recode's⁴ XML-standalone/r0 charset to do the XML escaping correctly. And of course you should be using tidy⁵ to make sure the HTML is correct before putting it into the feed or publishing it on the web. It only takes about six lines of shell code using the mentioned tools to generate the Atom feed for my blog.⁶
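To make the xml:base and escaping points concrete, here is a minimal sketch in Python rather than the shell pipeline described above. It uses only the standard library's xml.etree.ElementTree; the function name build_feed, the example.org URLs, and the sample entry are all made up for illustration. The output would still need to go through a schema validator as suggested above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

ATOM = "http://www.w3.org/2005/Atom"
XML = "http://www.w3.org/XML/1998/namespace"


def build_feed(base_url, title, author, updated, entries):
    """Build an Atom feed string (hypothetical helper, for illustration only).

    entries: iterable of (title, relative_href, updated, html_body) tuples.
    """
    # Serialize Atom elements without a prefix (default namespace).
    ET.register_namespace("", ATOM)

    # xml:base on the root lets the relative links below stay relative.
    feed = ET.Element(f"{{{ATOM}}}feed", {f"{{{XML}}}base": base_url})
    ET.SubElement(feed, f"{{{ATOM}}}title").text = title
    ET.SubElement(feed, f"{{{ATOM}}}id").text = base_url
    ET.SubElement(feed, f"{{{ATOM}}}updated").text = updated
    author_el = ET.SubElement(feed, f"{{{ATOM}}}author")
    ET.SubElement(author_el, f"{{{ATOM}}}name").text = author

    for entry_title, href, entry_updated, html_body in entries:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}title").text = entry_title
        # atom:id has to be an absolute IRI, so resolve it explicitly.
        ET.SubElement(entry, f"{{{ATOM}}}id").text = urljoin(base_url, href)
        ET.SubElement(entry, f"{{{ATOM}}}updated").text = entry_updated
        ET.SubElement(entry, f"{{{ATOM}}}link", {"href": href})
        content = ET.SubElement(entry, f"{{{ATOM}}}content", {"type": "html"})
        # Assigning .text leaves the escaping of <, > and & to the serializer.
        content.text = html_body

    return ET.tostring(feed, encoding="unicode")


if __name__ == "__main__":
    print(build_feed(
        "https://example.org/blog/",          # assumed site root
        "Example blog",
        "Example Author",
        "2021-05-23T00:00:00Z",
        [("Shell & feeds", "posts/a.html", "2021-05-23T00:00:00Z",
          '<p>See <a href="posts/b.html">this</a>.</p>')],
    ))
```

Letting the serializer handle escaping plays the same role here that recode does in the shell version: the HTML body is never concatenated into the XML by hand.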
I use awk a lot myself. It's gotten to the point where I enjoy using it, which might be a disease of some sort.
But it's still been highly productive for parsing text streams and for things like adding proper visual feedback to plaintext tools: pretty progress bars for wget/curl download jobs, monitoring scripts, and generally anything related to extracting bits of data or pretty-printing.
I particularly enjoy using gawk (mostly for the more usable match() and sed-like in-place editing), but will do fine with a busybox awk or similar as well. Something about the language just clicks with my way of doing things. It's like the perfect middle ground between a shell script and starting a full Python (or equivalent) script.
Strangely, the site itself does not currently appear to offer an RSS feed. I was hoping to subscribe for updates to the article collection: https://www.romanzolotarev.com
This is very cool. I think the intention is to build RSS feeds for sites you're building yourself. I work on an RSS generator that works along similar lines, but it is intended for public pages, and the RSS items are selected with CSS selectors you enter into a form: https://createfeed.fivefilters.org
It's cool to see this written in shell in a "that's clever" kind of way, but it's still not a good idea. Shell is good at gluing together calls to standard tools, and for interactive use, but anything this complex is a much better fit for something like Python.
axiolite points out some apparent escaping bugs (https://news.ycombinator.com/item?id=27250215), which would be much easier to catch/fix in a language that was less of a Turing tarpit.
I looked for such a tool a couple of years ago and did not find any. As other comments indicate, this one may not yet be fully general-purpose.
Could anyone here point me to an alternative? Something that generates RSS or Atom from HTML files and can be called from the command line. Essentially, something like pandoc, but for RSS.
I wrote a general-purpose tool for HTML-to-RSS mapping, which may be what you are looking for. You can run it as a Docker container. https://github.com/damoeb/rss-proxy
Nice tool! A related one that is helpful for getting RSS feeds from websites (with e.g. full article content) is rss-bridge: https://github.com/RSS-Bridge/rss-bridge
¹ https://www.rfc-editor.org/rfc/rfc4287.txt
² https://gist.github.com/tjdett/4617547
³ http://xmlsoft.org/xmllint.html
⁴ https://github.com/rrthomas/recode/
⁵ https://www.html-tidy.org/
⁶ https://oneofus.la/have-emacs-will-hack/files/current/Makefi...