Breaking Captcha: Validation Accuracy vs Test Accuracy

This is the first part of the series “How to make a simple captcha breaker”. Let’s call this one “How not to make a simple captcha breaker”.

Jan 23, 2019

I have recently been exploring Convolutional Neural Networks, or as LeCun would call them, ConvNets (no, not CNN; there is a different story about this altogether). After getting a brief idea from Jeremy Howard’s Fast.ai course and Deeplearning.ai’s Deep Learning Specialization course, yesterday I thought about making a simple captcha breaker to scrape my university results. One as simple as the one shown below.

Captcha that my university uses.

My plan was pretty simple and was divided into three steps. First, make a training dataset. Second, train a model on the dataset. Finally, pull the image from the website and predict the captcha output.

Make Training Dataset

I made images using Pillow, each containing one character in the ‘Arial italics’ font, because it was the closest one to my test dataset.

Training dataset for 1. Similarly, I created datasets for the numbers 0 to 9.

from PIL import Image, ImageDraw, ImageFont
import os
import shutil
import random

ROOT_PATH = 'dataset/'
PATH = 'dataset_num/'
if os.path.isdir(ROOT_PATH):
    shutil.rmtree(ROOT_PATH)

# Class folders 1 to 10; digit 0 is saved in folder '10' (see make_captcha).
for i in range(1, 11):
    os.makedirs(ROOT_PATH + PATH + str(i))

num_start = 0
num_end = 10


def make_captcha():
    # Load the font once instead of once per image.
    fnt = ImageFont.truetype('fonts/arial_italics.ttf', 16)
    for num in range(num_start, num_end):
        # Digit 0 goes into folder '10'; every other digit uses its own name.
        folder = '10' if num == 0 else str(num)
        for i in range(1000):
            # 25x25 white canvas with the digit drawn at a random offset.
            img = Image.new('RGB', (25, 25), color=(255, 255, 255))
            d = ImageDraw.Draw(img)
            d.text((random.random()*10, random.random()*10),
                   str(num),
                   font=fnt,
                   fill=(54, 90, 64))
            img.save(f'{ROOT_PATH}{PATH}{folder}/{i}.jpg')


if __name__ == '__main__':
    make_captcha()

Train Model on Dataset

Since the data does not vary much and we know the train and test data are very similar, the training part was fairly simple.

I will share the code for training the model in the second part of this series. The important thing to note is that after 5 epochs, I was getting 100% training accuracy and 100% validation accuracy, which is pretty good, and I was really happy.

Predict Captcha from Website

After the training part, it was time to test the model. For this, I gathered a test image from my university’s portal and predicted the outcome. The plan was:

Grab the image and slice it into 6 parts, each containing one number.
Then, predict the 6 slices individually, using OpenCV contours.
Finally, concatenate the predictions.
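
The three steps above can be sketched roughly as follows. This is only a minimal illustration, not the actual script: the image is assumed to be a grayscale grid stored as a list of rows, plain equal-width slicing stands in for the OpenCV contour step, and `predict_label` is a hypothetical stand-in for the trained model. The only detail taken from the dataset script is that digit 0 lives in folder '10', so label 10 is mapped back to '0' (assuming the class labels follow the folder names).

```python
from typing import Callable, List

Pixels = List[List[int]]  # grayscale image as rows of pixel values


def slice_captcha(image: Pixels, n_slices: int = 6) -> List[Pixels]:
    """Cut the image into n_slices vertical strips of (nearly) equal width."""
    width = len(image[0])
    # Column boundaries; integer arithmetic spreads any remainder across strips.
    bounds = [i * width // n_slices for i in range(n_slices + 1)]
    return [[row[lo:hi] for row in image] for lo, hi in zip(bounds, bounds[1:])]


def label_to_char(label: int) -> str:
    """Map a class label to a character (assumption: label 10 means digit 0)."""
    return '0' if label == 10 else str(label)


def read_captcha(image: Pixels, predict_label: Callable[[Pixels], int]) -> str:
    """Predict every slice on its own and concatenate the results."""
    return ''.join(label_to_char(predict_label(s)) for s in slice_captcha(image))
```

With a real model, `predict_label` would wrap whatever prediction call the training framework provides.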

I ran the script to predict the outcome and voilà! The result was embarrassing. The model was able to predict only one number correctly. Now was the time to debug it and see where I went wrong. When I had a look at the number slices on which predictions were being made, I understood where the problem was.

While slicing the image, some noise was introduced from the neighbouring numbers. Here is an example.

Notice the noise (white pixels) in each slice

Conclusion

Here is the learning:

Don’t get too excited even if your model gives 100% accuracy on your training and validation datasets.
Validation and test datasets should come from the same source (I failed Andrew Ng). In this case, the training dataset should also come from the same data source, as the number of test cases is limited.
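
To make the second point concrete, here is a minimal sketch of carving a validation set and a test set out of one pool of samples, so that both come from the same source. It assumes you have already collected and labelled some real captcha slices; the function name, the 50/50 default split, and the fixed seed are all illustrative choices, not something from my actual setup.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar('T')


def split_same_source(samples: Sequence[T], val_fraction: float = 0.5,
                      seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle samples from ONE source and split them into (validation, test)."""
    shuffled = list(samples)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    cut = int(len(shuffled) * val_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Because both halves are drawn from the same shuffled pool, a high validation score is much more likely to carry over to the test set.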