Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Your reply doesn't address why you do the things I highlighted in my original comment. For example, in the middle of your generic table parser you have code specific to Berlios making it non-generic. That's there because you haven't actually parsed the HTML you are trying to hack around it with regexps.

In your post you say:

So instead, I walk through the page looking for anything that looks like a hotlink wrapping literal text of the form #[0-9]. Actually, that oversimplifies; I look for a hotlink with a specific pattern of URL in a hotlink that I know points into the site bugtracker.

I don't have any problem with you using regexps to identify the particular fragments you are looking for, my criticism is that you use them for HTML parsing. For example, in your code you use a regexp like this

  <A HREF="/bugs/\?func=detailbug&bug_id=([0-9]+)&group_id=%s">
This mixes structure and meaning to use your terms. You assume that there's an HREF attribute after the A. Your code is brittle when it comes to a change in the HTML (e.g. suppose the page author adds a CLASS= attribute).

This would not happen if you parsed the HTML into a DOM tree and then did queries against it. You could quickly extract all the <A> tags in the path with a //A query (or even all those that have an HREF) and get the actual HREF robustly. Or you could not use XPath and use a parser that does calls backs with robust lists of attributes.

Doing that would be both robust against changes in page structure, and robust against changes in the attributes or placement of attributes in the <A>.

PS Is your name calling really necessary? In your post you refer to me as a 'snotty kid' and a 'fool'

PPS Also you say:

If I did what my unwise critic says is right, I’d take a parse tree of the entire page and then write a path query down to the table cell and the embedded link.

I never said that, I said that you could use XPath queries to extract the data. I never implied that you use a complete rooted path to get where you want in the document. You've twisted what I actually said to fit into your 'structure and meaning' blog post.



>PS Is your name calling really necessary? In your post you refer to me as a 'snotty kid' and a 'fool'

Well, you said his code looks like a hack job by a poor programmer. That's fairly insulting too.

About the actual mechanics of screen scraping, I guess your only issue is that he could have used an html parser to sanitize the code first before querying it.


his code looks like a hack job by a poor programmer

Well, it does. Specifying case insensitive match and then handling both cases yourself reeks of someone who's never parsed anything before. The second case won't even do anything, it's already been done! And even if the language or the libraries worked that way, what about mixed case! Let alone parsing HTML with regexps in the first place. Python has BeautifulSoup for not-well-formed documents.

It's reasonable from that code snippet to infer the author understands neither the language he's using nor the problem domain he's working in.


It's reasonable from that code snippet to infer the author understands neither the language he's using nor the problem domain he's working in.

Well, does the program actually work as intended? If so, that's a pretty good indication that the author does understand the language and the problem domain, even if the code isn't pristine.


Errr, no, it's not. There are other reasonable inferences: that the programmer isn't concerned about efficiency of the running code but instead with his efficiency in coding; that the programmer originally had the code doing one thing, but changed it to do another and hasn't (yet) noticed that it's now redundant; that it's not wrong code, but instead redundant code so qwitcherbellyaching and submit a patch.

I parse HTML with regexps when I know it's been machine-generated with a certain structure and I'm only interested in pulling out the bits using re.search(r'blah(%s)blah') that got put in there with a printf "blah$(x)blah".

You're going to be tarring a lot of not-incorrect code with that brush of yours.


he obviously hates dealing with regular expressions and I do too. It's only natural.


So he used double the number of regexps he needed?


double the expressions but they're simpler to understand perhaps.


No, the only reason to a) specifically add the option for case insensitive matching and then b) copy-paste the same regexp just with the case changed is if you really don't understand what it means.


The difference between "fool" and "looks like it's a hack job by a poor programmer" is that one critiques the actual work, and one assaults the person. Insulting or not, critiques of work improve discourse and name calling usually doesn't.


esr calls him a fool for his position that meaning can be derived from structure.

I fail to see the difference between that and calling someone a poor programmer based on 1000 lines of prototype code.


What you're failing to see is the meaning of the phrase "looks like it was written by a poor programmer". That's a comment on the code only, given that it's possible for a fantastic programmer to write code like that. It's not, however, the same as "calling someone a poor programmer".

Likewise, someone looking at an abstract painting might say "that looks like it was painted by a 5 year old", which doesn't mean they're calling the actual painter a 5 year old. It's commentary on the work.


This is bazaar-style development in action: do a crap initial job and troll up helpful comments from people like you.


Except it doesn't appear to be. The original author has responded to my criticism of his code by calling me a 'fool' and 'snotty kid', ignoring the actual criticisms that I made and that writing a blog post explaining his superiority.

Or is that what you meant by 'bazaar-style development'?


exactly so.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: