> The call to page.evaluate just hangs, and the browser dies silently. browser.close() is never reached, which can cause memory leaks over time.
Not just memory leaks. Since a couple of months ago, if you use Chrome via Playwright etc. on macOS, it will deposit a copy of Chrome (more than 1GB) into /private/var/folders/kd/<...>/X/com.google.Chrome.code_sign_clone/, and if you exit without a clean browser.close(), the copy of Chrome will remain there. I noticed after it ate up ~50GB in two days. No idea what the point of this code sign clone thing is, but I had to add --disable-features=MacAppCodeSignClone to all my invocations to prevent it, which is super annoying.
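A minimal defensive sketch of the workaround (assuming Playwright for Node in TypeScript; the URL, timeouts, and function name are placeholders I picked, not anything prescribed): pass the flag at launch, guard page.evaluate with a timeout, and close the browser in a finally block so the clone is cleaned up even when something hangs or throws.

    // Sketch only: guards against both the evaluate hang and the leftover code-sign clone.
    import { chromium } from 'playwright';

    async function fetchTitle(url: string): Promise<string | null> {
      const browser = await chromium.launch({
        // Stops Chrome from creating the >1GB clone on macOS in the first place.
        args: ['--disable-features=MacAppCodeSignClone'],
      });
      try {
        const page = await browser.newPage();
        await page.goto(url, { timeout: 30_000 });
        // Race the evaluate against a timer so a hung call can't block forever.
        const title = await Promise.race([
          page.evaluate(() => document.title),
          new Promise<never>((_, reject) =>
            setTimeout(() => reject(new Error('evaluate timed out')), 15_000)
          ),
        ]);
        return title;
      } catch (err) {
        console.error('scrape failed:', err);
        return null;
      } finally {
        // Reached even if evaluate hangs or throws, so the clone gets removed.
        await browser.close();
      }
    }

    fetchTitle('https://example.com').then((t) => console.log('title:', t));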
That's an open bug at the minute, but the one saving grace is that they're APFS clones, so they don't actually consume disk space.
Interesting, IIRC I did free up quite a bit of disk space when I removed all the clones, but I also deleted a lot of other stuff that time so I could be mistaken. du(1) being unaware of APFS clones makes it hard to tell.
Checking https://issues.chromium.org/issues/340836884, I’m mildly surprised to find the report just under a year old, with no attention at all (bar a me-too comment after four months), despite having been filed with priority P1, which I understand is supposed to mean “aim to fix it within 30 days”. If it continues to get no attention, I’m curious if it’ll get bumped automatically in five days’ time when it hits one year, given that they do something like that with P2 and P3 bugs, shifting status to Available or something, can’t quite remember.
I say only “mildly”, because my experience on Chromium bugs (ones I’ve filed myself, or ones I’ve encountered that others have filed) has never been very good. I’ve found Firefox much better about fixing bugs.
I guess it depends on what kind of bug it is; this one took 25 years to fix: https://news.ycombinator.com/item?id=40431444
To be fair that bug was only P3.
Previously on HN: Detecting Noise in Canvas Fingerprinting https://news.ycombinator.com/item?id=43170079
The reception was not really positive at the time, for the obvious reason.
I, for one, find it hilarious that "headless browsers" are even required. JavaScript interpreters serving webpages is just another amusing bit of serendipity. "Version-less HTML" hahaha
They exist because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property.
Headless browsers exist because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property?
If we ask the creators of headless chrome or selenium why they created them, would they say "because adtech providers and CDNs punish legitimate users who don't execute untrusted code on their property"?
In Google Chrome, at least, I tried an infinite loop modifying document.title and it froze pages in other tabs as well. I'm not at my computer right now to try it again, though.
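For the curious, the loop was roughly this (a from-memory sketch, not re-verified just now); same-site tabs often share a renderer process, which would explain other tabs locking up too:

    // Paste into a page's devtools console: a tight loop that never yields,
    // so the renderer's main thread stays pegged while the title keeps changing.
    let n = 0;
    while (true) {
      document.title = `spin ${n++}`;
    }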
The intention is to crash bots' browsers, not users' browsers
Please point me to this 100% correct bot detection system with zero false positives.
If you are scraping forbidden data in my robots.txt, I don't give a damn. I am gonna mess with your bots however I like, and I'm willing to go as far as it takes to teach you a lesson about respecting my robots.txt.
And what if someone is the victim of someone else putting a scraper on their machine, one that scrapes your site but does not damage their machine, and then you damage their machine? I mean, this is sort of the premise of a bot network, but evidently you can crash people's machines because they had the bad luck to be infected by someone else's attack?
I don't think this is as legally impervious an argument as you think it is.
on edit: I don't know if this is a widespread thing, but if I were trying to scrape for illicit reasons I would definitely start thinking about scraping from machines I did not own, and if I can think of it, surely there must be someone doing it now.
Malware installation is something completely different from segfaulting an .exe file that is running the scraper process.
If illegal scraping behavior is expected behavior of the machine, then what the machine is doing is already covered by the Computer Fraud Act.
Not my problem. The problem will be for the malware creator. Twice.
If you are crashing some browser that is fetching a directory disallowed in robots.txt, it is not your fault.
It’s generally illegal to booby-trap your own home. (If you’re not familiar with this, read up on it, the reasons can be quite thought-provoking.) Similar principles may apply here.
> If you’re not familiar with this, read up on it, the reasons can be quite thought-provoking
Are the reasons relevant to headless web browsers?
here's a potentially relevant example
https://news.ycombinator.com/item?id=43947910
Some, definitely not. Others, quite possibly.
Because people may be hurt.
Which people may be hurt by crashing the machine where the bot is running?
If that's the case, what do we do about websites and apps that disable your back button (the phone's own one) or your right-click capabilities (desktop browser), when that functionality disabling is not present in the ToS or even presented to you upon visiting the site or using the app?
Then maybe we need laws about crashing my server by crawling it 163,000 times per minute nonstop, ignoring robots.txt? Until then, no pity for the bots.
Running a bot farm?
Relevant plug: At Herd we offer a browser automation and orchestration framework that uses real browsers and thus sidesteps several of these issues[0]. The API is Puppeteer-like but doesn't use it; we built the entire framework[1] from scratch.
If you're wondering about the emphasis on MCPs, Herd is a generalist automation framework with a bespoke package format, trails[2], that supports MCP and REST out of the box.
0: https://herd.garden
1: https://herd.garden/docs/reference
2: https://herd.garden/docs/trails-automations
---
EDIT: I understand not everyone likes a shameless plug in another thread. The intention behind it, however, is also informative: not every browser automation strategy is subject to the issues in TFA.
The title does say crashing Chromium bots, yet our approach creates "Chromium bots" that do not crash under this premise, providing a useful counter-example.
How do you deal with the usual CF, akamai and other fingerprinting and blocking you? Or is that the customer's job to figure out?
Thank you for the question! It depends on the scale you're operating at.
1. For individual use (or company use where each user is on their own device), the traffic is typically drowned out in regular user activity since we use the same browser; no particular measure is needed, it just works. We have options for power users.
2. For large-scale use, we offer tailored solutions depending on the anti-bot measures encountered. Part of it is to emulate #1.
3. We don't deal with "blackhat bots", so we don't offer support for working around legitimate anti-bot measures aimed at, e.g., social spambots.
If you don't put significant effort into it, any headless browser from cloud IP ranges will be banned by large parts of the internet. This isn't just about spam bots, you can't even read news articles in many cases. You will have some competition from residential proxies and other custom automation solutions that take care of all of that for their customers.
Thanks, that's so true! We learned this the hard way building Monitoro[0] and large data scraping pipelines in the past, so we had the opportunity to build up the required muscle.
One thing to note: there are different "tiers" of websites, each requiring different counter-measures. Not everyone is pursuing the high-competition websites, and most importantly, as we learned, in several cases scraping is fully consensual or within the rights of the user. For example:
* Many of our users scrape their own websites to send notifications to their Discord community. It's a super easy way to create alerts without code.
* Sometimes users are locked in by their own providers; for example, some companies have years of job-posting information in their ATS that they cannot get out. We do help with that.
* Public data websites that are underutilized precisely because the data is difficult to access. We help make that data operational and actionable. For example, we had a sailor set up alerts on buoys to stay safe in high waters. A random example[1]
0: https://monitoro.co
1: https://wavenet.cefas.co.uk/details/312/EXT
Guess we gotta find a way to crash these bots too. :D
We have a similar solution at metalsecurity.io :) handling large-scale automation for enterprise use cases, bypassing antibots
That's super cool, thank you for sharing! It's based on Playwright though, right? Can you verify whether the approach you are using is also subject to the bug in TFA?
My original point was not necessarily about bypassing anti-bot protections, but rather to offer a different branch of browser automation independent of incumbent solutions such as Puppeteer, Selenium and others, which we believe are not made for this purpose and have many limitations, as TFA mentions, requiring way too many workarounds, as your solution illustrates.
We fix leaks and bugs in automation frameworks, so we don't have that problem. The problem with using the user's browser, as you do, is that you will burn the user's fingerprint, depending on scale.
Thanks for sharing your experience! I'm quite opinionated on this topic so buckle up :D
We avoided the fork-and-patch route because it's both labor-intensive for a limited return on investment and a constant game of catching up. Updating the forked framework is challenging in its own right, let alone porting existing customer payloads to newer versions, which de facto locks you to older versions. I did maintain a custom fork at a previous workplace that was similar in scope to Browserless[0], and I can tell you it was a real pain.
Developing your own framework (besides satisfying the obvious NIH itch) allows you to precisely control your exposure (reduce the attack surface) from a security perspective, and protects your customers from upstream decisions such as deprecations or major changes that might not be aligned with their requirements. I also have enough experience in this space to know exactly what we need to implement and the capabilities we want to enable. No bloat (yet).
> you will burn the user's fingerprint depending on scale
It's relative to your activity. See my other comment about scale and use cases: for personal-device usage this is not an issue in practice, and users can automate several websites[1] using their personal agents without worrying about it. For more involved scenarios we have appropriate strategies that avoid this issue.
> we fix leaks and bugs of automation frameworks
Sounds interesting! I'd love to read a write up, or PRs if you have contributed something upstream.
0: https://www.browserless.io/
1: https://herd.garden/trails
Sounds good. As you can probably imagine, I also come from a lot of experience in this space :) But fair enough, everyone has their own opinion on what is more or less painful to implement and maintain, and on the associated pros and cons. We're tailored to very specific use cases that require scale and speed, so the route we took makes the most sense for us. I obviously can't share details of our implementation, as it'd expose our evasions. And that is the exact problem with open-source alternatives like camoufox and the now-defunct puppeteer-stealth.