Printed vs. handwritten text lines – automatically separated

The Transkribus team collaborates with the Pattern Recognition team at the University of Erlangen-Nürnberg (also a member of READ-COOP SCE), and our colleagues were kind enough to run an interesting experiment: training their classifier to discriminate printed and handwritten text lines automatically. There are two main use cases: (1) improving recognition results when specific HTR models are applied to specific script types. However, our experience is that the HTR engines can usually deal rather well with a large number of scripts internally, so the actual benefit may not be as high as one might expect. (2) finding handwritten lines in printed books. For example, if famous persons made notes in their private books, the tool described below will find them with amazing accuracy!

The following text was provided by Matthias Seuret and Vincent Christlein from the Pattern Recognition team and slightly adapted for this post:

The difficulty in classifying text lines as printed or handwritten does not lie so much in the usage of convolutional neural networks (CNNs) or the design of their architecture, but in the acquisition and preparation of the data. Indeed, modern artificial neural networks (ANNs) are now able to deal with highly complex data (such as ImageNet, which includes 90 different dog breeds to discriminate), and for a large variety of tasks, presenting enough examples to the ANN is sufficient to make it reach a fair accuracy.

It is necessary to note that ANNs (and other artificial intelligence systems) are strongly biased by the data used to train them. Because of this, the training data should be chosen carefully to ensure that the easiest way to classify the images correctly is to actually solve the task. For example, if all (or most) handwritten text lines are on yellowish paper, while printed material is on white paper, then the ANN will simply learn to separate yellow from white, and will answer that any text line printed on yellowish paper is handwritten. Of course, an ANN can learn various other undesired data properties, such as the image resolution and quality, the texture of the paper, or the colour or contrast of the ink. Thus, it is of the utmost importance to use training data as similar as possible to what the ANN will have to deal with.

The system we developed for this task is based on the (printed) font group classifier developed for the OCR-D project (http://www.ocr-d.de/). It consists of a DenseNet-121 wrapped in some utility classes, and has been adapted for the binary classification of handwritten and printed text. The DenseNet-121 is a convolutional neural network with 121 layers, most of them arranged in four densely connected blocks. It nevertheless has a relatively small number of parameters for a network of its size, and thus requires less training data than architectures with more parameters.

[Figure: Machine learning schema for printed vs. handwritten text lines]

Text lines are pre-processed in two ways. First, all of them are resized to a height of 150 pixels, with their aspect ratio preserved. This helps the ANN, as it does not have to learn to deal with a large diversity of text sizes. Second, data augmentation methods are applied to the training images. This means that small modifications, such as shearing or hue changes, are applied to the training images every time they are shown to the neural network during training. The goal is to make the network learn to ignore these variations and perform well on unseen data.
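The two pre-processing steps can be sketched as follows with Pillow; the exact augmentation parameters (shear range, colour jitter strength) are illustrative assumptions, not the authors' values:

```python
# Sketch of the pre-processing described above:
# (1) resize each line image to a height of 150 px, keeping aspect ratio;
# (2) apply light random augmentation (shear, colour) at training time.
import random
from PIL import Image, ImageEnhance

TARGET_HEIGHT = 150

def resize_to_height(img, height=TARGET_HEIGHT):
    """Resize so the height is 150 px while preserving aspect ratio."""
    w, h = img.size
    new_w = max(1, round(w * height / h))
    return img.resize((new_w, height), Image.BILINEAR)

def augment(img):
    """Random shear and colour jitter, re-drawn every time a training
    sample is shown to the network (illustrative parameter ranges)."""
    shear = random.uniform(-0.1, 0.1)
    w, h = img.size
    img = img.transform((w, h), Image.AFFINE,
                        (1, shear, 0, 0, 1, 0), Image.BILINEAR)
    return ImageEnhance.Color(img).enhance(random.uniform(0.8, 1.2))

line = Image.new("RGB", (1200, 60), "white")  # dummy 1200x60 text line
resized = resize_to_height(line)
print(resized.size)  # (3000, 150)
```

Because augmentation is re-sampled on every pass, the network never sees exactly the same image twice during training.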

We trained our network on text lines from two different sources. Approximately 40,000 printed text lines were extracted automatically from the dataset presented in “Dataset of Pages from Early Printed Books with Multiple Font Groups” (https://doi.org/10.5281/zenodo.3366685), and 9,577 handwritten samples were provided by READ. In addition, 1,562 text lines from each class were used for testing purposes – none of them came from a page used in the training data. While our network reached a classification accuracy of 97.5% on the test data, one has to keep in mind that this holds true only for this specific data. The source code of our method and the trained CNN, as well as code allowing anybody to easily retrain the CNN on their own data, is available at the following address: https://github.com/seuretm/printed-vs-handwritten
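For clarity, the reported 97.5% is plain classification accuracy over the 2 × 1,562 held-out lines, i.e. the fraction of test lines assigned the correct class. A minimal computation (with toy labels, not the authors' data):

```python
# Toy accuracy computation: 0 = printed, 1 = handwritten.
def accuracy(predictions, labels):
    """Fraction of correctly classified text lines."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

preds = [0, 0, 1, 1, 1, 0, 1, 0]  # hypothetical classifier output
truth = [0, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground truth
print(accuracy(preds, truth))  # 0.875
```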

Note: If you are interested in creating training data for this purpose in Transkribus, you can use the “Structural tagging” feature and mark lines as “handwritten” or “printed” in your documents. The actual classifier needs to run outside of Transkribus; however, if there is strong support from the user community, we will be happy to include the tool in the Transkribus platform as well.
