How we built a font recognition engine

Yes, we at pixolution are naturally more into images than fonts. But we had an exciting project with the Portland-based software vendor Extensis, which specializes in font asset management and brand management.

In this particular case, Extensis asked us to train an AI model capable of classifying fonts. The idea was a service where users could upload an image containing text, select one or two words, and get the font type identified. The process of creating that model was so intriguing and challenging that it would be a shame not to share it with you.

What we did

We trained a neural net capable of classifying the font type and font variant used in an input image, out of 370 learned font types. The model only needs one or two words as input and is language-independent, as well as independent of the specific word, background, color, and size.

Additionally, we had to implement a preprocessing pipeline to normalize user input before identifying the font.

23 million images as training data

As always when training AI models, the most important ingredient is good training data. Creating a database large enough to train a deep convolutional network from scratch was therefore the most extensive task. In general, 90% of the work in developing an AI model goes into creating and improving the training data. Since we could generate our training data instead of collecting it, we were able to train a model from scratch.

At first, we collected English and German word lists from the internet, totaling 450,000 words. We implemented a Python script that randomly selects one or two words from this merged word list and renders them into an image file. The idea of using real words instead of random letters was to reflect the different probabilities of occurrence, and thus the importance of certain letters, in real-world texts. Nevertheless, we inserted some random letter sequences to prevent the model from learning letter combinations.
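As an illustration, a rendering script along these lines could look like the minimal Pillow sketch below. The canvas size, glyph size, and the rate at which random letters are injected are assumptions, since the post does not state concrete parameters.

```python
import random
import string
from PIL import Image, ImageDraw, ImageFont

def render_sample(words, font_path, glyph_size=64, canvas=(512, 128)):
    # Occasionally inject random letters so the model cannot simply
    # memorize letter combinations (the injection rate is an assumption).
    if random.random() < 0.1:
        text = "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
    else:
        text = " ".join(random.sample(words, k=random.randint(1, 2)))
    # Every font is rendered at the same nominal glyph size, so the
    # network cannot learn to distinguish fonts by size.
    font = ImageFont.truetype(font_path, size=glyph_size)
    img = Image.new("L", canvas, color=255)
    ImageDraw.Draw(img).text((10, (canvas[1] - glyph_size) // 2), text, font=font, fill=0)
    return img
```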

For each font we generated 3,000 images with random words, and for each of these images we generated 20 augmented versions. That adds up to 60,000 training samples per font and about 23 million images altogether.

It was important that all words were rendered at the same pixel height so that the network would not mistakenly learn to distinguish fonts by size.

Next, we augmented those rendered images with different modifications to make the model more robust against noise and distracting visual elements like background colors, shadows, overlaps, and crops. A great framework for augmenting images is imgaug. We used it to augment each sample 20 times.
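A plausible imgaug chain for this step might look like the sketch below; the concrete operators and their parameter ranges are assumptions, chosen to mimic the noise, color, and crop distortions mentioned above.

```python
import numpy as np
import imgaug.augmenters as iaa

# Apply one to three random distortions per image (operator choice and
# parameter ranges are assumptions, not the exact project settings).
augmenter = iaa.Sequential([
    iaa.SomeOf((1, 3), [
        iaa.GaussianBlur(sigma=(0.0, 1.5)),
        iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255)),
        iaa.Multiply((0.7, 1.3)),          # brightness shifts
        iaa.LinearContrast((0.75, 1.25)),
        iaa.Affine(rotate=(-3, 3), shear=(-5, 5)),
        iaa.Crop(percent=(0, 0.08)),
    ]),
])

def augment_sample(image, n=20):
    """Return n augmented variants of one rendered sample (uint8 array)."""
    batch = np.stack([image] * n)
    return augmenter(images=batch)
```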

Some random samples used as training input.

Once we had the training data, we moved on to training a GoogLeNet model. The test phase was very pleasing: the validation set accuracy was more than 98%. Not bad, right?
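The post does not name a training framework, so the following sketch assumes PyTorch with the GoogLeNet implementation from torchvision; the optimizer settings and the hypothetical train_loader are placeholders.

```python
import torch
import torchvision

# GoogLeNet with a 370-way output layer, trained from scratch
# (framework choice and hyperparameters are assumptions).
model = torchvision.models.googlenet(num_classes=370, aux_logits=False, init_weights=True)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# train_loader is assumed to yield batches of normalized 256x256 images
# (replicated to 3 channels) and their font-class labels.
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```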

Normalizing user input

Users of the font recognition service will upload an image and select an area of one or two words. This subimage is then classified by the AI model. Of course, real users will not provide perfect input data. They may provide several lines of text with different font and image sizes. To avoid accuracy loss we implemented a preprocessing pipeline to normalize the input.

First, we apply an erode kernel to create a mask of connected regions that roughly represent lines of text. We identify the largest region and calculate its bounding box. If such a region is found, we crop the input image to that bounding box, removing extra lines and offsets and leaving a single line of text.

Processing steps to normalize an input image
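A minimal OpenCV sketch of this cropping step could look as follows, assuming dark text on a light background; the kernel shape and Otsu thresholding are assumptions.

```python
import cv2

def crop_to_text_line(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Erode with a wide rectangular kernel: dark glyphs grow and merge
    # into blobs that roughly represent lines of text.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 5))
    eroded = cv2.erode(gray, kernel)
    # Binarize (inverted) so the text blobs become white foreground.
    _, mask = cv2.threshold(eroded, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return image  # no region found, keep the input unchanged
    # Crop to the bounding box of the largest region.
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return image[y:y + h, x:x + w]
```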

We then apply CLAHE (contrast-limited adaptive histogram equalization) to auto-level the contrast of the color channels. After sharpening the image, we scale it to a fixed height while preserving the aspect ratio. We crop the width to a fixed value and center the result on a square canvas (256x256 px). The image is now normalized and serves as input for the classification model.
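Sketched with OpenCV, these remaining normalization steps might look like this; the CLAHE parameters, the sharpening kernel, and the intermediate line height are assumptions.

```python
import cv2
import numpy as np

def normalize_line(image, target=256, line_height=64):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # CLAHE: contrast-limited adaptive histogram equalization.
    gray = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(gray)
    # Sharpen with a simple 3x3 kernel.
    sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    gray = cv2.filter2D(gray, -1, sharpen)
    # Scale to a fixed height while preserving the aspect ratio.
    h, w = gray.shape
    gray = cv2.resize(gray, (max(1, int(w * line_height / h)), line_height))
    # Crop over-long lines to a fixed width, then center on a square canvas.
    gray = gray[:, :target]
    canvas = np.full((target, target), 255, dtype=np.uint8)
    x = (target - gray.shape[1]) // 2
    y = (target - line_height) // 2
    canvas[y:y + line_height, x:x + gray.shape[1]] = gray
    return canvas
```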

Summary

We are proud to say that we achieved all our goals for the project with Extensis, and we will definitely continue our work in the field of font type recognition.

There are several things we can improve. Currently, the system cannot deal with rotated text, and the model input should be changed so that we no longer need square images. We've learned a lot and look forward to scaling the system to identify hundreds of thousands of fonts.

If you also have a need for a customized AI model, just get in touch. We are curious where our journey will take us next time.

