This is the sad conclusion of the next part. JA4 is a great supplement, it can squeeze some additional info, but for a motivated attacker it can be avoided.
Now the question of how motivated are noisy AI scrapers is still open. Even a solution that cuts down 50 percent of the dumbest scraping attempts will still provide much needed relief to a struggling site.
I'm curious, which site struggles are you envisaging? In my exp, JA4 is used as a hammer for which the nail must be found; simpler solutions oftentimes work better.
I think we agree that JA4 is situational. It really saved me when investigating a credential stuffing attack - random logins with random chance of success spread into many ASNs, all had the same fingerprint.
From my experience, there are all kinds of levels of bots. Add them all together and they can produce a ridiculous load on a site (especially a fragile one that you have to secure anyway). So I look at the volume, trying to block anything stupid I can get away with.
It is a game of whack-a-mole. It also can cut down the overall traffic to a fraction of the original, which has tangible infra costs benefits.
And yes, captcha works better in a lot of cases. Fortunately I'm not selling JA4, I'm just curious.
And yes, IP rate limits and ASN checks work really well in plenty cases. Side note: I got a high-throughput free offline asn-checker too! https://blog.miloslavhomer.cz/asn-check/
I agree JA4 is situational; but the # of use cases is smaller than most people think. Like you said, Captcha works better; would've stopped the credential stuffing. Managed DDoS services (Cloudflare et al) + rate limits are better at DDoS.
Cool ASN project, but doesn't IPInfo already offer this for free: https://ipinfo.io/lite ?
Back in the day I couldn't find a downloadable DB for offline checks, which is very much needed when looking at approx 10k different IPs. Even with an offline DB I might need to create this tree structure so that I can process the data fast.
Nice. I was more curious of the clients using HTTP/2.0 HTTP Protocol, what percentage of them is JA4 detecting as bots that spoof all the other headers a browser sends? That is the missing piece in my blog write-up as I don't do SSL fingerprinting. I am trying to see what percentage are getting through my very crude methods.
ok, so I've parsed some logs. I do see the ALPNs pointing to http2, but I don't capture all of the headers. The only thing I capture is the user-agent, which is the major spoof anyway.
Now, to differentiate between spoofed and non-spoofed header, I need to check the "valid" JA4 signature for a given browser and then proclaim that the rest of them are wrong. The "valid" JA4 signature can be observed, but I've found that sometimes browsers tweak their handshake a bit, so it's not 100% consistent.
The JA4 DB was recently taken down, I've requested full access, but no response (as expected). There might be some issues in getting those valid headers for the browsers, the hardware and software varies a lot (PC, Mac, Android, Iphone of all kinds of versions and browsers).
I was hoping for a quick win to share, but it doesn't seem like so and I'll have to do it properly. That should be my next post on JA4.
As a quick note, approx 30% of traffic claims to use http2 and approx 60% of that traffic has a non-bot user-agent (you know, along the lines of "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/149.0.7827.102 Safari/537.36"). I suspect majority of those are spoofed as I know how many readers I have on my blog.
In this article I take a look at the technical properties of Encrypted Client hello as well as some scenarios that are not really covered by the threat model proposed.
I argue that to get any tangible benefit you have to use the big providers, which places trust into entities that are behaving less trustworthy by the hour.
This is a really good write-up. I can't really think of anything important to add.
I don't know if it is worth adding but there is one small piece that can be kept at home rather than depending on Cloudflare or Google though by itself is rather moot but I will mention it anyway.
If using Unbound DNS [0] at home as a DNS resolver one can enable DoH if Unbound was compiled using --with-libnghttp2 thus allowing an HTTPS listener and enabling ECH tested / verified on [1]. I realize its just one tiny piece of the puzzle but we can take away the logging of DNS queries away from the big providers. If people do not trust their home ISP they can put Unbound on a VM or physical server somewhere else. I only mention this because I know some people run PiHole and other security distros on their WiFi or Firewall hardware at home.
Documentation [2][3]
I am half tempted to put a DoH listener out there for anyone to experiment with and see what kind of abuse it gets.
Then in browsers / devices I set a custom DoH endpoint of https://dohint.mydomain.tld/dns-query and uses the same key/cert I used in the past for DNS over TLS (DoT) which is still listening on TCP port 853
Have you tried putting this behind a reverse proxy? This gives us a lot of features like rate-limiting and it should work well since it is https after all.
I put Unbound directly on the web to play with for now, having some quirks with haproxy. It has an hourly cron job that pre-caches the Cloudflare Top 20000 or so .com .net .org .is domains and some domains I use.
Do you have any recommendations for effective self-learning paths? I have murky old foundations in all three fields (took first year linear algebra and a variety of logic courses) so am not starting from nothing but the few times I’ve tried to jump back in I always get a bit bogged down and can’t keep with it.
Gilbert Strang has an excellent free (IIRC) linear algebra course, accompanied by his excellent textbook. Kahn Academy has some surprisingly good linear algebra lectures as well, if something isn't clicking.
Pretty well if you consider the "bio" label, which is a set of practices not using all of the tech. They can ask for and usually get higher prices for the products.
Granted, it's more about chemicals than tractors, but still quite close to the spirit of the comments. Bio approach sacrifices some tech advances.
But, there are? I can host a repo on GitHub, Codeberg and self host it too. Then I need to watch over main to keep it consistent between those. After that's established, I can do updates from wherever. Link'em in the README.
There are distributed forges? Yes, git is distributed, but often everything around it isn't. The case parent is trying to make, is that the rest ("federated forges") should also be distributed, not just git.
> what is the missing killer feature that needs federation?
Issues, pull requests, collaboration/permissions/access, "staring"/"favoriting", etc.
I think ultimately the goal is that people can run their own forges, yet still collaborate on repositories hosted in other forges, leveraging your existing authentication so you no longer need to sign up individually for each forge.
Now the question of how motivated are noisy AI scrapers is still open. Even a solution that cuts down 50 percent of the dumbest scraping attempts will still provide much needed relief to a struggling site.
reply