Hacker News
An effective way for GOOG to punish scrapers?
4 points by jonprins on Jan 6, 2011 | hide | past | favorite | 7 comments
As was suggested elsewhere, there's blacklisting if you're logged in. But that's ripe for abuse.

It would take more horsepower, but Google has plenty of that. I'm sure someone at Google has thought of this, is thinking about implementing it, or has dismissed it as impossible or ineffective, but I wanted to throw it out there to see what HN thinks.

Determine the canonical source. In this case, a Stack Overflow post. Each site that scrapes content from the Stack Overflow post increases the rank of that Stack Overflow post.
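The proposal above could be sketched roughly as follows. This is a hypothetical, heavily simplified model, not Google's actual algorithm: it assumes the earliest-indexed copy of identical content is the canonical source, and credits each duplicate's rank signal back to it. All names and the first-seen heuristic are illustrative assumptions.

```python
from datetime import datetime

def canonical_boost(copies):
    """copies: list of (url, first_indexed) pairs for identical content.

    Returns (canonical_url, boost), where the canonical URL is the
    earliest-indexed copy and the boost grows by one unit per duplicate.
    """
    canonical = min(copies, key=lambda c: c[1])[0]
    boost = len(copies) - 1  # every scraped copy adds rank to the original
    return canonical, boost

copies = [
    ("stackoverflow.com/q/123", datetime(2011, 1, 2)),
    ("scraper-a.example/123",   datetime(2011, 1, 3)),
    ("scraper-b.example/123",   datetime(2011, 1, 5)),
]
print(canonical_boost(copies))  # ('stackoverflow.com/q/123', 2)
```

The weak point, as the replies below note, is the first-seen heuristic: whoever Google indexes first wins, which is exactly what makes getting the canonical source wrong so costly.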

On one hand, it would increase the rank of original content that gets recycled and spread across the 'net: reblogged Tumblr posts, retweeted tweets.

On the other hand, it puts another tool in the hands of black hats.

Thoughts?



Determine the canonical source.

Well, that's the hard bit, isn't it? What are the consequences of getting it wrong? If Google bans my site from their index because it thinks I stole my content from a scraper, that's going to be hard to take.


Not really. Sure, if they auto-ban you it sucks, but how hard is it to post something, inform Google of it, and wait X hours for it to show up on some autoscraper site? (Of course, this requires Google's cooperation.)

Edit: it seems like a good solution from Google's POV would be to inform people of their impending ban and give them the chance to defend themselves by posting original content that then gets scraped elsewhere.


Your edit is interesting. Google seems to actively reject the idea of doing things that aren't automatic. That's why you hear horror stories of people getting blocked from AdWords (frequently with Google taking back their balance, too). Until they were legally threatened, Google didn't have great tools for YouTube copyright notices, either.

They are a company that does not like to do things that require two-way communication, because (in my view) it doesn't scale. My guess is that they are focused on revenue per employee, and adding a call center would decrease that significantly.


Surely it's not that hard for a scraper to post something original, inform Google, wait X hours, then plant it on Stack Overflow or wherever?


If the parasite kills its host...


A parasite that kills its only host didn't have enough hosts.


Not all scraping is bad: some scrapers provide extra services, such as search.

Google itself is a scraper.



