Generate RSS feeds with grep(1), sed(1), and Awk(1) (romanzolotarev.com)
96 points by Tomte on May 22, 2021 | 21 comments


I strongly recommend adding a schema validator to anything that generates XML. Atom¹ has a nice schema available² that you can use at the end to check the whole thing (I use xmllint³, since it is in a lot of package repositories). Another nice thing about Atom compared to RSS is that it has the xml:base attribute, which means you do not need to rewrite relative URLs into absolute ones. You can use recode's⁴ XML-standalone/r0 charset to do the XML escaping correctly. And of course you should be using tidy⁵ to make sure the HTML is correct before putting it into the feed or publishing it on the web. It only takes about 6 lines of shell code using the mentioned tools to generate the Atom feed for my blog.⁶

¹ https://www.rfc-editor.org/rfc/rfc4287.txt

² https://gist.github.com/tjdett/4617547

³ http://xmlsoft.org/xmllint.html

⁴ https://github.com/rrthomas/recode/

⁵ https://www.html-tidy.org/

⁶ https://oneofus.la/have-emacs-will-hack/files/current/Makefi...
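
To illustrate the escaping step mentioned above, here is a minimal sketch that does it with plain sed instead of recode. This is a swapped-in technique, not the commenter's actual setup; the function name xml_escape is hypothetical, and it only handles the five predefined XML entities.

```shell
# Hypothetical helper: escape the five XML special characters read from stdin.
# The ampersand is replaced first; otherwise the entities produced by the
# later substitutions would themselves be escaped a second time.
xml_escape() {
    sed -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&quot;/g' \
        -e "s/'/\&apos;/g"
}

# Example: escape an HTML fragment before embedding it in a feed element.
printf '%s\n' '<b>tr -cd & "fold"</b>' | xml_escape
```

The finished feed can then be checked with xmllint, which has a --schema option for W3C XML Schema files and a --relaxng option for RELAX NG files; which one applies depends on the format of the schema you downloaded.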


Awk is really such an interesting tool that I don't see people using enough. I could see how awk would be perfect for this.


I use awk a lot myself. It's gotten to the point where I enjoy using it, which might be a disease of some sort. But it's still been highly productive for parsing text streams or for things like adding proper visual feedback to plaintext tools, pretty progress bars for wget/curl download jobs, monitoring scripts, and generally anything related to extracting bits of data or pretty-printing.

I particularly enjoy using gawk (mostly for the more usable match() and sed-like in-file editing), but will do fine with a BusyBox awk or similar as well. Something about the language just clicks with my way of doing things. It's like the perfect middle ground between a shell script and starting a full Python (or equivalent) script.
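
As a rough illustration of the "pretty progress bar" use case, here is a minimal awk sketch that turns a stream of percentages into text bars. The function name progress_bar and the 20-cell width are assumptions made for the example; it is not taken from the commenter's scripts and does not parse real wget/curl output.

```shell
# Hypothetical helper: read percentages (0-100), one per line, from stdin
# and print a fixed-width text bar for each. The width of 20 is arbitrary.
progress_bar() {
    awk -v width=20 '{
        pct = $1 + 0                      # force numeric interpretation
        if (pct < 0) pct = 0              # clamp out-of-range input
        if (pct > 100) pct = 100
        filled = int(pct * width / 100)   # number of filled cells
        bar = ""
        for (i = 0; i < width; i++) bar = bar (i < filled ? "#" : "-")
        printf "[%s] %3d%%\n", bar, pct
    }'
}

# Example: three snapshots of a download.
printf '%s\n' 0 50 100 | progress_bar
```

For an in-place bar in a terminal, the trailing "\n" in the printf could be swapped for "\r" so each update overwrites the previous one; the newline version is used here so the output is easy to inspect.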


The more awk I write, the more it starts to look like apl/j.

And that's a good thing.


It's named after the last names of the creators: Alfred Aho, Peter Weinberger, and Brian Kernighan.

We're lucky C isn't named K&R


Must have been AWKward to decide the best order of the initials.


It was decided that other options were WAK.


Kwa?


KAW, a corvid interpreter.


> that I don't see people using enough.

I'm betting that's because of its steep learning curve.


Strangely, the site itself does not appear to currently offer an RSS feed. I was hoping to subscribe for updates to the article collection: https://www.romanzolotarev.com



Thanks very much! Strange it is not displayed anywhere on the page or even in the source code.


Probably it's not ready for consumption yet.


Thanks for the link. Seems his scripts need more work because lots of HTML tags are being incorrectly escaped.

e.g.: <pre> $ <b>tr -cd ' -~' < /dev/urandom |</b> <i><b>fold -w 20 | head -n 1</b></i> a(k#$(K ?I?d!^NM^(5x $ </pre>


This is very cool. I think the intention is to build RSS feeds for sites you're building yourself. I work on an RSS generator that works along similar lines, but intended to be used with public pages, and the RSS items are selected with CSS selectors you enter into the form: https://createfeed.fivefilters.org


It's cool to see this written in shell in a "that's clever" kind of way, but it's still not a good idea. Shell is good at gluing together calls to standard tools, and for interactive use, but anything this complex is a much better fit for something like Python.

axiolite points out some apparent escaping bugs (https://news.ycombinator.com/item?id=27250215), which would be much easier to catch/fix in a language that was less of a Turing tarpit.


I looked for such a tool a couple of years ago and did not find any. As other comments indicate, this one may not yet be fully general-purpose.

Could anyone here point me to an alternative? Something that generates RSS or Atom from HTML files and can be called from the command line. Essentially, something like pandoc, but for RSS.


I wrote a general-purpose tool for HTML-to-RSS mapping, which may be what you are looking for. You can run it as a Docker container. https://github.com/damoeb/rss-proxy


Not exactly what you are looking for, but probably the core HTML-->RSS is in rss-bridge: https://github.com/RSS-Bridge/rss-bridge


Nice tool! A related one that is helpful for getting RSS feeds from websites (with e.g. full article content) is rss-bridge: https://github.com/RSS-Bridge/rss-bridge



