Captcha recognition experiment

Just stumbled upon the article “Using AI to beat CAPTCHA and post comment spam”. There are a number of projects and articles about breaking CAPTCHAs, but this one strikes me most because of:

* speed of development, and
* techniques used.

First, some introduction. Casey (the author) visited a blog hosting site. For some reason, he disliked the spam bot protection and decided to demonstrate that CAPTCHA provided a false sense of security. Why? I have no idea. The most reasonable explanation is hidden in the article itself:

When i looked at it [CAPTCHA], I was pretty certain I could apply the AI techniques I’ve been teaching myself to beat it. So that is my 2nd reason for writing this article… Because I could.

Good programmers love challenges, and breaking CAPTCHAs is a good challenge. Meanwhile, here is an example of what he was going to break:

I’d say it’s not the weakest CAPTCHA. I’d even say it’s above the average web CAPTCHA. That makes the author’s experience even more interesting.

Speed of development: the CAPTCHA was broken in about a weekend, starting from scratch. The author had no prior experience, and the whole work was a trial-and-error experiment.

The funny thing is that the final code doesn’t use any rocket science. Yes, the author tried to use a neural network, but gave up:

If I used a small number of input patterns (e.g. less than 100 characters), then the neural net would train successfully, but it did not perform well at runtime for recognizing the skewed characters. If I used a large number of input patterns (e.g. 250 characters), then the neural net would not converge during training. I.e. it would not successfully match all of the input data, even though the total error kept dropping.

The final solution was:

* As one of the means to describe a character, use the number of its endpoints.
* Use the collection of feature vectors directly, without the neural network.
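
To make these two ideas concrete, here is a minimal sketch of my own. It is not Casey’s code. It assumes each character has already been cut out and thinned to one-pixel-wide strokes, and it stores a glyph as rows of '#' and '.' characters, a format I made up for the example. An endpoint is taken to be an ink pixel with exactly one ink neighbour. The second feature (the ink pixel count), the squared-distance comparison, and all the names are illustrative assumptions too.

```python
from typing import List, Tuple

Glyph = List[str]           # rows of '#' (ink) and '.' (background); made-up format
Features = Tuple[int, int]  # (endpoint count, ink pixel count)


def count_endpoints(glyph: Glyph) -> int:
    """Count ink pixels that have exactly one ink neighbour (8-connected).

    Assumes strokes are one pixel wide and all rows have the same length.
    """
    height, width = len(glyph), len(glyph[0])
    endpoints = 0
    for y in range(height):
        for x in range(width):
            if glyph[y][x] != "#":
                continue
            neighbours = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and glyph[ny][nx] == "#":
                        neighbours += 1
            if neighbours == 1:
                endpoints += 1
    return endpoints


def features(glyph: Glyph) -> Features:
    """Describe a glyph by its endpoint count plus a second, hypothetical feature."""
    ink = sum(row.count("#") for row in glyph)
    return (count_endpoints(glyph), ink)


def classify(glyph: Glyph, samples: List[Tuple[Features, str]]) -> str:
    """Return the label of the stored feature vector closest to the glyph's."""
    target = features(glyph)

    def squared_distance(sample: Tuple[Features, str]) -> int:
        return sum((a - b) ** 2 for a, b in zip(sample[0], target))

    return min(samples, key=squared_distance)[1]


# A tiny hand-drawn set of labelled characters (illustrative data only).
LETTER_L = ["#...", "#...", "#...", "####"]
LETTER_T = ["#####", "..#..", "..#..", "..#.."]
LETTER_O = [".##.", "#..#", "#..#", ".##."]
SAMPLES = [
    (features(glyph), label)
    for glyph, label in ((LETTER_L, "L"), (LETTER_T, "T"), (LETTER_O, "O"))
]

# A slightly skewed "L" that is not in the labelled set.
SKEWED_L = ["#...", "#...", ".#..", ".###"]
print(classify(SKEWED_L, SAMPLES))  # prints L
```

The point of the sketch is that there is no training step at all: the labelled feature vectors are simply kept, and an unknown character gets the label of its closest match. In this toy data the skewed L still has two endpoints, which is what keeps it closest to the stored L.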

Surprisingly, it worked. Even without tuning, the whole CAPTCHA was guessed correctly in more than half of the cases. This leads to the following conclusion.

In short: a casual programmer broke a not-so-bad CAPTCHA in a weekend, and that was nearly 3 years ago. It’s hard to imagine what experienced, real spammers are capable of…