We have recently come under a particularly persistent comment spam attack on the Vamsoft Community Forums. Like every other site on the Internet, our forums get spammed occasionally by spambots and humans alike, so in a way, this was an unremarkable event.
What was remarkable, though, is that our spammer managed to defeat our JavaScript-based spambot detection, a variant of the industry-standard detection technique called a honeypot. Only humans are expected to get past it, but our attacker didn’t exactly behave like a human.
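To give a rough idea of how such a honeypot works (this is only a sketch, not our actual implementation, and the field name and selector are made up): the page’s JavaScript adds a hidden token to the form after it loads, so a bot that simply POSTs the fields it scraped from the HTML never runs the script, submits no token, and is rejected on the server side.

// Sketch only: JavaScript inserts a hidden token into the reply form.
// A client that never executes JavaScript posts the form without it.
document.addEventListener('DOMContentLoaded', function () {
    var token = document.createElement('input');
    token.type = 'hidden';
    token.name = 'js_token';               // hypothetical field name
    token.value = 'added-by-javascript';   // the server checks for this value
    document.querySelector('form.reply-form').appendChild(token);
});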
Eventually, we decided to investigate our spammer, and what we found is a little-known secret of how automated, headless web browsers like PhantomJS are abused by spammers for their purposes. This sort of attack dates back to at least 2013, yet little information about it is available on the Internet.
To help raise awareness among the web developer and administrator communities, we have published a case study on our investigation and contributed a few ideas for detecting this new kind of spambot.
Case study: Headless browser use in web forum spam (809kB PDF)
Instead of inserting an extra hidden field in JavaScript, you could use a hidden field with a default value and change that value when a visible field gets focus. This way, only visitors who actually click into a field would pass the spambot challenge.
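Just to sketch what I mean (the element ids and the expected value are made up):

// The hidden field starts with a default the server rejects; focusing the
// visible field swaps in the value the server accepts.
document.addEventListener('DOMContentLoaded', function () {
    var visible = document.getElementById('message');    // visible text field
    var hidden  = document.getElementById('hp_check');   // hidden field, starts as "default"
    visible.addEventListener('focus', function () {
        hidden.value = 'focused';   // only posts carrying this value are accepted
    });
});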
I use this technique, along with other tricks, and so far it’s working pretty well. What do you think?
@Frato: That could definitely work (indeed, it is a variant of the interaction-based prevention we suggested in the case study).
It may be defeated by a simulation of user interaction (and it would make some sense for spambots to simulate interaction). Technically, it could be done, because PhantomJS enables running JavaScript in the context of the web page (see http://phantomjs.org/api/webpage/method/evaluate.html). Although browsers limit sending keys, etc. for security reasons, there’s always the possibility of modifying WebKit for a custom PhantomJS build in case the attacker needs more freedom.
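For illustration only (the URL and field id below are made up), this is roughly how a PhantomJS script could trigger the focus handler from inside the page:

// page.evaluate() runs the function in the page's own context, so the bot
// can fire the same focus handler a real click would, setting the hidden
// field before the form is submitted.
var page = require('webpage').create();
page.open('http://forum.example.com/newtopic', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    page.evaluate(function () {
        document.getElementById('message').focus();   // fires the focus event
    });
    // ... fill in the rest of the form and submit it here ...
    phantom.exit();
});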
Are there any abnormalities in the header field values other than user agent? And how about cookies, are they treated correctly?
I think there’s a risk in running a full browser, even in a sandbox: it can be exploited just like the browser a human uses, can’t it? So have they put any limitations at all on how much this headless browser can do?
This is kind of new to me, so I’m just digging for information. I’m used to the old kind of bots, which can always be detected by, for example, an outdated user agent, not accepting gzip (which every browser does, for good reasons), or not understanding HTML entities. Most of my techniques will definitely be defeated by a full-browser bot.
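The sort of check I have in mind looks roughly like this (just a sketch with made-up patterns), and a full-browser bot would sail right through it:

// Flag requests with a missing or ancient user agent, or no gzip support.
function looksLikeOldSchoolBot(headers) {
    var ua  = headers['user-agent'] || '';
    var enc = headers['accept-encoding'] || '';
    return ua === '' ||
           /MSIE [1-6]\./.test(ua) ||     // outdated or crudely faked browser
           enc.indexOf('gzip') === -1;    // every real browser accepts gzip
}

// e.g. a PhantomJS-style user agent passes this check (and the UA string
// can be spoofed anyway), so this returns false:
looksLikeOldSchoolBot({
    'user-agent': 'Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1',
    'accept-encoding': 'gzip, deflate'
});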