> The call to page.evaluate just hangs, and the browser dies silently. browser.close() is never reached, which can cause memory leaks over time.
Not just memory leaks. Since a couple of months ago, if you use Chrome via Playwright etc. on macOS, it will deposit a copy of Chrome (more than 1GB) into /private/var/folders/kd/<...>/X/com.google.Chrome.code_sign_clone/, and if you exit without a clean browser.close(), the copy of Chrome will remain there. I noticed after it ate up ~50GB in two days. No idea what the point of this code sign clone thing is, but I had to add --disable-features=MacAppCodeSignClone to all my invocations to prevent it, which is super annoying.
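A minimal defensive sketch of the workaround (assuming Playwright for Node in TypeScript; the URL, timeouts, and function name are placeholders I picked, not anything prescribed): pass the flag at launch, guard page.evaluate with a timeout, and close the browser in a finally block so the clone is cleaned up even when something hangs or throws.

    // Sketch only: guards against both the evaluate hang and the leftover code-sign clone.
    import { chromium } from 'playwright';

    async function fetchTitle(url: string): Promise<string | null> {
      const browser = await chromium.launch({
        // Stops Chrome from creating the >1GB clone on macOS in the first place.
        args: ['--disable-features=MacAppCodeSignClone'],
      });
      try {
        const page = await browser.newPage();
        await page.goto(url, { timeout: 30_000 });
        // Race the evaluate against a timer so a hung call can't block forever.
        const title = await Promise.race([
          page.evaluate(() => document.title),
          new Promise<never>((_, reject) =>
            setTimeout(() => reject(new Error('evaluate timed out')), 15_000)
          ),
        ]);
        return title;
      } catch (err) {
        console.error('scrape failed:', err);
        return null;
      } finally {
        // Reached even if evaluate hangs or throws, so the clone gets removed.
        await browser.close();
      }
    }

    fetchTitle('https://example.com').then((t) => console.log('title:', t));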
That's an open bug at the minute, but the one saving grace is that they're APFS clones, so they don't actually consume disk space.
Interesting, IIRC I did free up quite a bit of disk space when I removed all the clones, but I also deleted a lot of other stuff that time so I could be mistaken. du(1) being unaware of APFS clones makes it hard to tell.
Checking https://issues.chromium.org/issues/340836884, I’m mildly surprised to find the report just under a year old, with no attention at all (bar a me-too comment after four months), despite having been filed with priority P1, which I understand is supposed to mean “aim to fix it within 30 days”. If it continues to get no attention, I’m curious if it’ll get bumped automatically in five days’ time when it hits one year, given that they do something like that with P2 and P3 bugs, shifting status to Available or something, can’t quite remember.
I say only “mildly”, because my experience on Chromium bugs (ones I’ve filed myself, or ones I’ve encountered that others have filed) has never been very good. I’ve found Firefox much better about fixing bugs.
I guess it depends on what kind of bug it is; this one took 25 years to fix: https://news.ycombinator.com/item?id=40431444
To be fair that bug was only P3.
Previously on HN: Detecting Noise in Canvas Fingerprinting https://news.ycombinator.com/item?id=43170079
The reception was not really positive at the time, for the obvious reason.
I, for one, find it hilarious that "headless browsers" are even required. JavaScript interpreters serving webpages is just another amusing bit of serendipity. "Version-less HTML" hahaha
They exist because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property.
Headless browsers exist because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property?
If we ask the creators of headless chrome or selenium why they created them, would they say "because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property"?
In Google Chrome, at least, I tried an infinite loop modifying document.title and it froze pages in other tabs as well. I'm not at my computer right now to try it again, though.
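For the curious, the loop was roughly this (a from-memory sketch, not re-verified just now); same-site tabs often share a renderer process, which would explain other tabs locking up too:

    // Paste into a page's devtools console: a tight loop that never yields,
    // so the renderer's main thread stays pegged while the title keeps changing.
    let n = 0;
    while (true) {
      document.title = `spin ${n++}`;
    }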
The intention is to crash bots' browsers, not users' browsers
Please point me to this 100% correct bot detection system with zero false positives.
If you are scraping forbidden data in my robots.txt, I don't give a damn. I am gonna mess with your bots however I like, and I'm willing to go as far as it takes to teach you a lesson about respecting my robots.txt.
And what if someone is the victim of someone else putting a scraper on their machine, one that scrapes your site but does not damage their machine, and then you damage their machine? I mean, this is sort of the premise of a bot network, but evidently you can crash people's machines because they had the bad luck to be infected by someone else's attack?
I don't think this is as legally impervious an argument as you think it is.
on edit: I don't know if this is a widespread thing, but if I were trying to scrape for illicit reasons I would definitely start thinking about scraping from machines I did not own, and if I can think of it, surely there must be someone doing it now.
Malware installation is something completely different from segfaulting an .exe file that is running the scraper process.
If illegal scraping behavior is expected behavior of the machine, then what the machine is doing is already covered by the Computer Fraud Act.
Not my problem. The problem will be for the malware creator. Twice.
If you are crashing some browser that is fetching a directory disallowed in robots.txt, it is not your fault.
It’s generally illegal to booby-trap your own home. (If you’re not familiar with this, read up on it, the reasons can be quite thought-provoking.) Similar principles may apply here.
> If you’re not familiar with this, read up on it, the reasons can be quite thought-provoking
Are the reasons relevant to headless web browsers?
here's a potentially relevant example
https://news.ycombinator.com/item?id=43947910
Some, definitely not. Others, quite possibly.
Because people may be hurt.
Which people may be hurt by crashing the machine where the bot is running?
If that's the case, what do we do about websites and apps that disable your back button (the phone's own one) or your right-click capabilities (desktop browser), when that functionality disabling is not present in the ToS or even presented to you upon visiting the site or using the app?
Then maybe we need laws about crashing my server by crawling it 163,000 times per minute nonstop, ignoring robots.txt? Until then, no pity for the bots.
Running a bot farm?
Relevant plug: At Herd we offer a browser automation and orchestration framework that uses real browsers and thus sidesteps several of these issues[0]. The API is Puppeteer-like but doesn't use it; we built the entire framework[1] from scratch.
If you're wondering about the emphasis on MCPs, Herd is a generalist automation framework with a bespoke package format, trails[2], that supports MCP and REST out of the box.
0: https://herd.garden
1: https://herd.garden/docs/reference
2: https://herd.garden/docs/trails-automations
---
EDIT: I understand not everyone likes a shameless plug in another thread. The intention behind it, however, is also informative: not every browser automation strategy is subject to the issues in TFA.
The title does say crashing Chromium bots, yet our approach creates "Chromium bots" that do not crash under this premise, providing a useful counter-example.
How do you deal with the usual CF, akamai and other fingerprinting and blocking you? Or is that the customer's job to figure out?
Thank you for the question! It depends on the scale you're operating at.
1. For individual use (or company use where each user is on their own device), the traffic is typically drowned out in regular user activity since we use the same browser; no particular measure is needed, it just works. We have options for power users.
2. For large-scale use, we offer tailored solutions depending on the anti-bot measures encountered. Part of it is to emulate #1.
3. We don't deal with "blackhat bots", so we don't offer support for working around legitimate anti-bot measures aimed at, e.g., social spambots.
If you don't put significant effort into it, any headless browser from cloud IP ranges will be banned by large parts of the internet. This isn't just about spam bots, you can't even read news articles in many cases. You will have some competition from residential proxies and other custom automation solutions that take care of all of that for their customers.
Thanks, that's so true! We learned this the hard way building Monitoro[0] and large data scraping pipelines in the past, so we had the opportunity to build up the required muscle.
One thing to note: there are different "tiers" of websites, each requiring different counter-measures. Not everyone is pursuing the high-competition websites, and most importantly, as we learned, in several cases scraping is fully consensual or within the rights of the user. For example:
* Many of our users scrape their own websites to send notifications to their Discord community. It's a super easy way to create alerts without code.
* Sometimes users are locked in by their own providers; for example, some companies have years of job-posting information in their ATS that they cannot get out. We do help with that.
* Public data websites that are underutilized precisely because the data is difficult to access. We help make that data operational and actionable. For example, we had a sailor set up alerts on buoys to stay safe in high waters. A random example[1]
0: https://monitoro.co
1: https://wavenet.cefas.co.uk/details/312/EXT
Guess we gotta find a way to crash these bots too. :D
We have a similar solution at metalsecurity.io :) handling large-scale automation for enterprise use cases, bypassing antibots
That's super cool, thank you for sharing! It's based on Playwright though, right? Can you verify whether the approach you are using is also subject to the bug in TFA?
My original point was not necessarily about bypassing anti-bot protections, but rather to offer a different branch of browser automation independent of incumbent solutions such as Puppeteer, Selenium and others, which we believe are not made for this purpose and have many limitations, as TFA mentions, requiring way too many workarounds, as your solution illustrates.
We fix leaks and bugs in automation frameworks, so we don't have that problem. The problem with using the user's browser, as you do, is that you will burn the user's fingerprint, depending on scale.
Thanks for sharing your experience! I'm quite opinionated on this topic so buckle up :D
We avoided the fork-and-patch route because it's both labor-intensive for a limited return on investment and a constant game of catching up. Updating the forked framework is challenging in its own right, let alone porting existing customer payloads to newer versions, which de facto locks you to older versions. I did maintain a custom fork at a previous workplace that was similar in scope to Browserless[0], and I can tell you it was a real pain.
Developing your own framework (besides satisfying the obvious NIH itch) allows you to precisely control your exposure (reduce the attack surface) from a security perspective, and protects your customers from upstream decisions such as deprecations or major changes that might not be aligned with their requirements. I also have enough experience in this space to know exactly what we need to implement and the capabilities we want to enable. No bloat (yet).
> you will burn the user's fingerprint depending on scale
It's relative to your activity. See my other comment about scale and use cases: for personal-device usage this is not an issue in practice, and users can automate several websites[1] using their personal agents without worrying about it. For more involved scenarios we have appropriate strategies that avoid this issue.
> we fix leaks and bugs of automation frameworks
Sounds interesting! I'd love to read a write up, or PRs if you have contributed something upstream.
0: https://www.browserless.io/
1: https://herd.garden/trails
Sounds good. As you can probably imagine, I also come from a lot of experience in this space :) But fair enough, everyone has their own opinion on what is more or less painful to implement and maintain, and on the associated pros and cons. We're tailored to very specific use cases that require scale and speed, so the route we took makes the most sense for us. I obviously can't share details of our implementation, as it'd expose our evasions. And that is the exact problem with open-source alternatives like camoufox and the now-defunct puppeteer-stealth.