+ Sharing data for Handwritten Text Recognition

At the READ project we are committed to sharing data and working collaboratively to improve the recognition of handwritten historical documents.

With this in mind, two of the project’s research groups have uploaded data relating to recent computer science competitions in handwriting and document layout analysis.

Data from the Pattern Recognition and Human Language Technology group at the Universitat Politècnica de València and the Computational Intelligence Lab (CITLab) at the University of Rostock:

Check out the ScriptNet-READ community on Zenodo for more of the data that READ project researchers are using for their experiments.

+ A new video tutorial helps with segmentation in Transkribus

Segmentation is a crucial stage of working with Handwritten Text Recognition in our Transkribus platform.  Digitised images of historical documents must be segmented into text regions, lines and baselines before they can be transcribed manually or automatically.  Segmentation can be performed automatically by the software to a high level of accuracy.  For more complex documents, users may then need to make some manual corrections – moving or deleting baselines for example.

If you’re new to segmentation in Transkribus, we have a new video tutorial which will help you get started.

You can find out more about working with Transkribus in our How to Guides.

+ Recognising printed Indian texts with Transkribus

Yes, you read that correctly – our Transkribus platform can indeed recognise printed Indian texts.

Conventional OCR software usually struggles to decipher the complexities of South Asian scripts.  Two projects have recently been working with nineteenth-century printed texts in Transkribus with the hope of getting better results.  Using images and transcripts from a collection, Transkribus users can train a model to recognise printed text of any type.

First of all, the British Library’s Two Centuries of Indian Print project is creating a digitised collection of works published in South Asia in the eighteenth and nineteenth centuries.  The project team trained a text recognition model in Transkribus with 50 pages (containing 5,700 words) of digitised images and transcripts from Bengali books.  The resulting model can produce transcripts of pages from the collection with an average Character Error Rate of 21%.  Although this is a relatively high error rate, the team are planning to retrain the model by creating more pages of training data and focusing on improving the recognition of elements of the Bengali characters which were sometimes missed by the software.
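For readers unfamiliar with the metric, Character Error Rate is commonly defined as the number of character insertions, deletions and substitutions needed to turn the automatic transcript into the correct one, divided by the length of the correct transcript. The following is a minimal sketch of that common definition in Python; the function names are illustrative and this is not necessarily how Transkribus computes its figures.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions between two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a Character Error Rate of 21% means roughly 21 character edits are needed for every 100 characters of the correct transcript.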

The Naval Kishore Press was a nineteenth-century publishing house which brought works on various subjects to market in Hindi, Urdu, Arabic, Persian and Sanskrit. Part of its output is held by the library of the South Asia Institute (SAI) at Heidelberg University.  The South Asia Institute library and Heidelberg University Library are collaborating on the Naval Kishore Press – digital project, working to produce digitised and machine-readable text for a selection of texts published by this press.  The project team used 200 pages of images and transcripts to train a model in Transkribus to recognise Hindi and Sanskrit text.  This model can produce transcripts of the collection with a Character Error Rate of around 5%.  Fully searchable images and transcripts from the collection are now available to consult, download and annotate on Heidelberg University library’s online catalogue.

Read more:

+ Transkribus How to Guides now available in German (and French)

Many new users are registering for a Transkribus account every day and our How to Guides are there to help everyone get to grips with Handwritten Text Recognition technology for historical documents.

All of our How to Guides are now available in English and German.

Our introductory guide, ‘How to use Transkribus in 10 steps’ is also available in French.

You can find all of our How to Guides on the Transkribus wiki.

Our thanks go to Régis Schlagdenhaffen for the French translation. 

+ Preserving our cultural heritage with a smartphone

The READ project is a big proponent of digitisation on demand using smartphones.

A typical mobile phone camera can capture relatively high-quality images of historical documents, which can then be used for preservation, research and even as training data for Automated Text Recognition using our Transkribus platform.

The Computer Vision Lab at the Technical University of Vienna (one of the READ project partners) have created the ScanTent device and the DocScan mobile app to make it easier for people to digitise documents in this way.

The ScanTents

We were happy to receive a positive enquiry about these tools, highlighting their potential to capture unique records that might otherwise be lost.

Stefan Krüger from Germany got in touch after he had digitised his grandfather’s dissertation using his mobile phone and used Transkribus to recognise the text with OCR.  Herbert Rechner completed his dissertation in 1927, just before the rise of the Nazis, on the radical topic of ‘the sexual causes of offences’.  Although Stefan was never able to meet his grandfather, he is interested in researching his history and is hopeful that Transkribus might be able to help recognise personal handwritten papers one day.

Stefan wrote…

‘After a long search I found the 90 years old dissertation of my grandfather in the German National Library in Leipzig and (in bowing to the performance of my ancestor) digitally reproduced the work. The Transkribus project helped me a lot with its outstanding recognition rate.

I photographed the booklet (about 100 pages) freehand with glass plate and smartphone (CamScanner) and re-set it in InDesign after text recognition.

With this work it became clear to me that we are experiencing a scientific break: everything that is not digitally available in scientific literature will disappear in the cognition-sinking. It is simply no longer taken into account in the scientific knowledge and research process. In the case of topics relating to electronics, space travel and other “more modern” developments, this may be easy to accept.

With all historically relevant things, however, this is painful.

That’s why I find your low-level effort with high-tech solutions very interesting. I would like to test your tent and the app. My thought is that actually (at least) everyone who has enjoyed an academic education should participate in the digital processing of his work and other literature. If you could make such a crowd thing out of it, then a big stock of literature could actually be worked on. So I am happy to participate in your developments in this sense.

With cordial greetings

Stefan Krüger’

Translated from German with www.DeepL.com/Translator

Thank you to Stefan for this feedback, which shows how Transkribus can help individuals to digitise and recognise exceptional historical documents.

A page from Herbert Rechner’s dissertation, digitised with a smartphone. Image credit: Stefan Krüger

If you would like to try digitising documents with a mobile phone, the DocScan app is available to download now free of charge (Android only). The ScanTent is still in development and units should be available for sale over the next few months.

Find out more:

+ Searching the Spanish Golden Age with Keyword Spotting

Sixteenth- and seventeenth-century Spain saw a surge in theatrical activity, with thousands of productions staged. This period has become known as the Spanish Golden Age.  Thanks to a new prototype web tool, anyone can now search through 40,000 images from a significant digitised collection of manuscripts relating to this period of Spanish history.  This tool uses cutting-edge Keyword Spotting technology, allowing users to search images which have never before been transcribed.

This tool is a collaboration between the Pattern Recognition and Human Language Technology research centre at the Universitat Politècnica de València (one of the READ partners), the National Library of Spain and the PROLOPE research group (both READ MOU partners).

The PRHLT research centre has treated these manuscripts with advanced text recognition and probabilistic word indexing technology.  This sophisticated form of searching is often called Keyword Spotting. It is more powerful than a conventional full-text search because it uses statistical models trained for text recognition to search through probability values assigned to character sequences (words), considering most possible readings of each word on a page.
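The idea of searching through probabilities rather than a single fixed transcript can be illustrated with a small sketch. It assumes a toy index in which each page carries several candidate readings of a word, each with a probability. The data layout, the function names `build_index` and `search`, and the threshold value are all hypothetical, and this is not the PRHLT implementation.

```python
from collections import defaultdict


def build_index(hypotheses):
    """Build a toy probabilistic word index.

    `hypotheses` is an iterable of (page_id, word, probability) tuples,
    where each tuple is one candidate reading produced by a recogniser.
    """
    index = defaultdict(dict)
    for page_id, word, probability in hypotheses:
        key = word.lower()
        # Keep the most confident reading of this word on each page.
        best_so_far = index[key].get(page_id, 0.0)
        index[key][page_id] = max(best_so_far, probability)
    return index


def search(index, query, threshold=0.5):
    """Return (page_id, probability) pairs at or above the threshold,
    most confident first."""
    hits = index.get(query.lower(), {})
    matches = [(page, prob) for page, prob in hits.items() if prob >= threshold]
    return sorted(matches, key=lambda match: match[1], reverse=True)
```

Lowering the threshold returns more pages at the cost of more false positives, a trade-off that a search over one fixed transcript per page cannot offer.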

Keyword Spotting for the word ‘Madrid’.

The 40,000 pages currently available for searching represent about half of the collection.  More documents from the collection will be processed in this way if further funding can be found.

The release of this Keyword Spotting tool coincides with a new exhibition at the National Library of Spain all about the Spanish Golden Age which runs until March 2019.  The exhibition will combine original manuscripts with digital displays.  The PRHLT team have created an online quiz (in Spanish) for the exhibition which asks users to work with the Keyword Spotting tool to find out which words appear frequently or in combination.

If you are interested in Keyword Spotting, check out other tools constructed by the PRHLT team relating to:

+ Recognising eighteenth-century legal records at Middle Temple

The Honourable Society of the Middle Temple is one of four Inns of Court: prestigious professional associations for barristers working in England.

The archive and library of Middle Temple holds records of the Inn from the early sixteenth century onwards.  The most significant series of these documents are being digitised and made available online.

Middle Temple began exploring Transkribus tentatively in 2016.  The Inn first signed a Memorandum of Understanding with the READ project and then started to explore the possibilities of training Handwritten Text Recognition (HTR) models to recognise documents in their collections.

After discussions about the best documents to start with, they settled on digitised manuscript records of Middle Temple’s governing body or Parliament.  These records dated from 1762-1775 and were written in several very similar hands.

A selection of 101 bifolio pages were uploaded to Transkribus and transcribed by the Transkribus team.  David Woolley QC, a bencher at Middle Temple, then took care of proof-reading and correcting each page to ensure that the transcriptions were as accurate as possible.

These images and transcripts (around 80,000 transcribed words) became training data for generating an HTR model.  Data from the pre-existing ‘English Writing M1’ model was also included as part of the training process as a ‘base model’.  The ‘English Writing M1’ model is trained to recognise the writing of the English philosopher Jeremy Bentham (1748 – 1832) and his secretaries – it is freely available to all Transkribus users for their experiments.

The resulting HTR model can produce transcripts of images from the test set with a very low Character Error Rate of 3.31%.  This is an amazing result!  Automated transcripts with such a low error rate immediately become a useful research resource.

Automated transcription of a page from the Middle Temple records.

The team at Middle Temple also created a dictionary based on one of their ‘Bench Books’ which lists recurring names, abbreviations and unusual terms. This dictionary should hopefully improve the quality of the recognition.

Middle Temple is now exploring ways to build on this first great achievement, by making these transcripts available to researchers in a searchable database.

Thanks to Lesley Whitelaw, Barnaby Bryan and David Woolley at Middle Temple and Stuart Dunn at King’s College London for this collaboration.

+ Working with (Early Modern) Dutch script? Join a new Transkribus working group!

by Annemieke Romein, University of Ghent

(Dutch language version below)

Throughout the Early Modern era much was written in the Dutch language, not just in the Low Countries – but in former colonies, among certain religious groups within Northern America, and in Hansa cities as well. An Early Modern Gothic script was widely used, though it had some varieties depending on its contexts, aim, and type. First experiments with documents from Belgium (Ghent, in Flanders) have demonstrated that the Dutch language can be recognised by Handwritten Text Recognition (HTR) models with a good level of accuracy.

The next step is to combine different examples of Early Modern Dutch texts in order to build and improve generic models for the recognition of various types of documents. Dr. Annemieke Romein (Erasmus University Rotterdam/ Ghent University), Dr. Jetze Touber, and Koen Verstraeten have initiated the ‘Early Modern Dutch’ working group – where all Transkribus users can work together towards the aim of the improved recognition of the Dutch language. Scroll down to find out more about joining the working group and its aims.

The process of combining training data of different Early Modern Dutch documents has already started at Ghent University. Various researchers at the Institute for Early Modern History and the Ghent Center for Digital Humanities are bringing materials together in order to train an HTR model. However, within a multidisciplinary group such as this, we have quickly realised that there are various types of texts as well as periods within the early modern period to deal with. Sixteenth-century handwriting is different from that of a century later, even when little changed in terms of content; likewise, texts written with a political-institutional or legal background will differ tremendously from diaries, letters and academic texts. Nonetheless, each of these types of texts can train the recognition of the text as well as of the handwriting. How smart computers can be made within such a context is yet to be discovered.

In order to streamline this endeavour, three Ghent-based historians are working together and will be coordinating/ training different language models, hopefully leading to one final model for the Dutch language (depending on the amount of training material).

Dr. Annemieke Romein: 16th, 17th, 18th century; Political-institutional/ legal texts (incl. requests, letters of statesmen).
Dr. Jetze Touber: 16th, 17th, 18th century; Cultural texts (diaries, letters); Scholarly, academic and religious texts.
Koen Verstraeten: 19th century; Cultural texts (diaries, letters); Scholarly and academic texts.

The ‘Early Modern Dutch’ working group is looking for further examples of documents written in Dutch from the 16th, 17th and 18th centuries. You can help us add to the collection – all that is needed are images (preferably around 300 dpi) and transcriptions.

You can:

  • share existing training data that you have already prepared in Transkribus (duplicate it to the folder we will invite you to).
  • prepare new images and transcripts in Transkribus in the ‘Early Modern Dutch’ collection
  • send over files containing images and transcripts which can be matched automatically and converted into training data using the Text2image tool.

Please do indicate what type of textual material you are sharing, so that we have an overview and can start training models as soon as possible.

To join the working group and get access to the ‘Early Modern Dutch’ collection in Transkribus, contact the group at: TranskribusEMDutch@gmail.com.

The ‘Early Modern Dutch’ working group aims to demonstrate that training-based algorithms like Handwritten Text Recognition need significant input from many stakeholders – they can only be improved by cooperation and sharing!

————————————————————————————————————

Do you work with Early Modern Dutch texts (± 1500-1900)? Join the Transkribus working group!

Many texts were written in the Dutch language, not only in the Low Countries themselves but also in former colonies, among religious groups in North America, and in the Hanseatic cities. The Early Modern Gothic script was widely used, although it varies with the context, purpose and type of text. First experiments with documents show that the Dutch language can be recognised by automated text recognition (HTR) models and that good results can be achieved through training.

The next step is to combine different examples of Dutch texts, in an attempt to build general language models that can analyse and recognise different types of documents. Dr. Annemieke Romein (Erasmus University Rotterdam/ Ghent University – IEMH), Dr. Jetze Touber (UGent – IEMH), and Koen Verstraeten (UGent archive) are taking the initiative to start an ‘Early Modern Dutch’ working group. The focus is on the period 1500-1900, but material from other periods is also welcome. In this group, Transkribus users can work together to improve the recognition of Dutch-language texts. Read on to find out more about joining this group and its aims.

The process of combining training material from different early modern texts has been under way for some time. At UGent, various researchers from the Institute for Early Modern History and the Ghent Center for Digital Humanities are uploading their materials to Transkribus. Existing transcriptions are matched to photographs using Text2Image, and the resulting data is used to train models. This work is currently in full swing. We quickly realised that there are different types of texts, as well as different periods across which gradual changes occur. Transkribus can be trained on all kinds of texts, but this requires a great deal of training material: more than a single researcher can collect. Hence this call for participation.

Transkribus is (for the time being) a free program that can be used to train servers in Innsbruck to recognise handwriting (and also printed text) by means of Handwritten Text Recognition (HTR). At least 75 pages of transcribed text are needed to recognise a hand well, but that applies to a single author. The more material is uploaded, the more universal, and so the more widely applicable, the model becomes. Archives, libraries and heritage institutions, and certainly individual researchers too, are urgently requested to share their material covering the 16th to the 19th century.

Three Ghent researchers are involved in coordinating the Dutch-language model and will run tests to train a model (or models) that is as accurate as possible. These researchers are working on, respectively:

Dr. Annemieke Romein: 16th, 17th, 18th, 19th century; Political-institutional/ legal texts (incl. requests, letters of statesmen).
Dr. Jetze Touber: 16th, 17th, 18th century; Cultural texts (diaries, letters); Scholarly, academic and religious texts.
Koen Verstraeten: 19th century; Cultural texts (diaries, letters); Scholarly, academic and religious texts.

If you would like to make material available and take part in this working group, please contact us at TranskribusEMDutch@gmail.com. It helps if you indicate what type of texts are involved, so that we have an idea of which models we can use them in.

Frequently asked questions:

  • Images and transcriptions that you place on the Transkribus server (directly via the program, or via the Text2image tool) remain private: only you have access to them.
  • You can choose to share (duplicate) certain documents to the Early Modern Dutch group. The sole purpose of this group is to train language models so that Dutch can be recognised more quickly. Participants in this group can see each other’s texts.
  • So it is your choice which documents you share with us! The more material reaches us, the easier it becomes to train language models.
  • You have material (photographs and transcriptions) but do not use Transkribus yet? No problem. If you create an account and contact Transkribus (email@transkribus.eu), they can help you through the process. You can make the material available in various ways and Transkribus will match the images to the transcriptions. (This service is free until May 2019.)
  • You are an institution and wonder what the benefit is for you? First of all, your material helps to train a language model that can help a great many researchers and institutions (including your own) to recognise handwriting more quickly. You can also use the material that you make available, in your own Transkribus account, to create searchable PDFs. You then have an image of the source material with the transcription in the background (or, if you prefer, placed underneath it); you can use this to make the material available to your audience. So it is also an attractive way of presenting it!
  • Costs? Transkribus is free until June 2019. It is currently funded by European research funding (the READ project). After June, “READ-COOP” will start, in which individual users will continue to use the platform free of charge but ‘large users’ such as institutions will be asked for a contribution. Exactly how high these costs will be is not yet known, but it has been stressed that they will not be too high, since it is recognised that institutions usually cannot spend much money on this. BUT: for now the service is free, so you can get the “searchable PDFs” in return, and you can always stop using the program after June!

+ From Foucault to our future – Transkribus User Conference 2018

On 8-9 November 2018, Vienna was overtaken by more than 100 Transkribus users keen to share their experiences and learn about the latest advances in Handwritten Text Recognition.

The Transkribus User Conference was hosted at the Technical University of Vienna for the second time.  Although the skies were less sunny than last year, the conference programme was just as packed with user case studies, demonstrations of new tools and technological insights.

We got going before the conference even started with a Scanathon event at the Austrian Academy of Sciences on 7 November.  We invited participants to try digitising documents with their mobile phone using our DocScan app and ScanTent device.  After train delays in Austria, we breathed a sigh of relief when the ScanTents arrived and we could start testing and receiving feedback on our new prototypes!

Participants at the Vienna Scanathon on 7 November 2018.  Image credit: Elena Muehlbauer

Coming back to the conference, user stories were a big highlight of the event.  We heard how researchers and archivists from around Europe are using our Transkribus platform to recognise a variety of writing including the papers of the French philosopher Michel Foucault, early modern signatures and initials and sixteenth-century Polish tax registers.

We also heard about one of the first projects using Transkribus for crowdsourcing.  Crowd leert computer lezen (or Crowd teaches the computer how to read) at Amsterdam City Archives has connected Transkribus to the VeleHanden crowdsourcing platform, allowing volunteers to produce transcriptions that can then be automatically used as training data for recognising notarial documents.

The conference was also an important showcase for ground-breaking advances in text recognition technology including:

  • HTR+ – a faster and more accurate form of Handwritten Text Recognition using TensorFlow
  • Keyword Spotting – a sophisticated form of keyword searching using the power of Handwritten Text Recognition technology
  • Table Processing – applying automated layout analysis and templates to tabular documents

Gundram Leifert from CITLab, University of Rostock presents HTR+. Image credit: Elena Muehlbauer

We looked towards our future too with a presentation on the new READ-COOP.  The READ project will come to an end in July 2019, from which point Transkribus services will be provided as part of this new cooperative.  A freemium service model is planned where basic Transkribus functions will remain free to all and more intensive users (research projects, archives etc.) will be liable for some charges  – more details coming soon!

Finally, we hope that the conference was a unique opportunity for our users to speak directly to computer scientists and developers about their requirements and their research. It certainly looked like lots of nice connections were being made!

Discussion in full flow at Transkribus User Conference 2018. Image credit: Louise Seaward.

Huge thanks go to all our presenters and participants.  We also need to thank the Computer Vision Lab at the Technical University of Vienna for hosting the conference and our other READ project colleagues who helped with organisational details.  We are looking forward to next year’s conference already!

+ New Transkribus transcription conventions now available

We know that handwritten historical documents are often complex in their structure and content.

Transkribus users usually need to transcribe at least 75 pages of handwritten material in our platform in order to create training data for Handwritten Text Recognition (HTR).  We have produced a new How to Guide covering our transcription conventions to make this process easier.

It answers some of the common queries around the transcription of elements like punctuation, abbreviations, diacritics, strikethrough text, names and much more.

You can find further guidance on working with Transkribus in the other How to Guides on the Transkribus wiki.  And if you can’t find an answer to your query, we are always happy to hear from you by email (email@transkribus.eu).