+ Latest success story! Medieval Handwriting and Handwritten Text Recognition

Two partners in the READ project network have now successfully trained a new model to recognise Gothic handwriting!  The State Archives of Zurich (READ project partner) and the University of Zurich (READ project Memorandum of Understanding partner) have collaborated on the automatic recognition of a collection of medieval charters.

In 1336 a cartulary was written in Königsfelden, close to the city of Brugg (which is now part of Switzerland).  Königsfelden abbey was a well-endowed institution with close ties to the dukes of Habsburg.  The charters of the institution were copied in neat and regular handwriting onto roughly 260 parchment pages. The cartulary is available online via e-codices.

Image of the cartulary of Königsfelden.  Aarau, Staatsarchiv Aargau, AA/0428, f. 1r [http://www.e-codices.unifr.ch/en/list/one/saa/0428]

At the University of Zurich, there is an ongoing project to create a digital scholarly edition of the charters of Königsfelden abbey.  The cartulary is an important source for early writing practices and has already been partially transcribed. The project team have been using our Transkribus platform to produce their transcriptions and they used these transcripts to train and test a Handwritten Text Recognition (HTR) model.

The model was trained on transcripts of around 26,000 words from the charters.  These documents are written in a regular script with evenly ruled lines, which helps the technology to process the pages more easily.  The HTR model is able to automatically produce transcripts of documents in the collection with an astonishing Character Error Rate (CER) of 10%.
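For readers who want to check a CER figure on their own material, here is a minimal sketch.  It assumes the common definition of CER (the edit distance between a reference transcript and the computer-generated one, divided by the length of the reference) and counts Unicode code points as characters.  The function names are our own illustration and not part of Transkribus, which may compute its figures differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented example strings: one wrong character in a ten-character line.
    print(character_error_rate("abcdefghij", "abcdefghiX"))
```

Under this definition, a CER of 10% means that about one character in ten would need to be corrected by hand.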

Transkribus has been able to deal with some of the intricacies common to medieval documents.  Thanks to the integration of Unicode, superscripts on letters, such as uͤ, can also be recognised by the HTR. Don’t expect this recognition to work perfectly: the signs are sometimes so small that even expert paleographers debate their meaning!

Furthermore, one of the main problems with pre-modern handwriting could be partially dealt with: abbreviations were indicated during transcription by using combining diacritics such as ‘ ̄ ’ (U+0305 combining overline) or by entering the correct signs from Unicode.

Screenshot from Transkribus showing the computer-generated transcript of a cartulary document

Since the transcripts provided as training data were consistent, the automatic recognition of abbreviations (or rather the correct transcription using abbreviation signs) could be achieved in some cases. In order to produce easily legible transcriptions or even scholarly editions, these signs can be searched for and replaced in Transkribus or in another editor at a later stage.
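The search-and-replace step described above can also be scripted once transcripts have been exported as plain text.  The following is a rough sketch only: the expansion table is invented for illustration and does not reflect the editorial conventions of the Königsfelden project, and the function names are hypothetical.

```python
import unicodedata

COMBINING_OVERLINE = "\u0305"

# Hypothetical expansion table, for illustration only.  A real edition
# would define its own, probably context-dependent, rules.
EXPANSIONS = {
    "a" + COMBINING_OVERLINE: "am",
    "e" + COMBINING_OVERLINE: "em",
    "u" + COMBINING_OVERLINE: "um",
}


def expand_abbreviations(text: str, expansions: dict = EXPANSIONS) -> str:
    """Replace each 'letter + abbreviation sign' pair with its expansion."""
    # NFD keeps base letters and combining signs as separate code points,
    # so the pairs in the table can be matched reliably.
    decomposed = unicodedata.normalize("NFD", text)
    for sign, expansion in expansions.items():
        decomposed = decomposed.replace(sign, expansion)
    return unicodedata.normalize("NFC", decomposed)


def remaining_combining_marks(text: str) -> list:
    """List the combining marks still present, e.g. to review by hand."""
    decomposed = unicodedata.normalize("NFD", text)
    marks = {ch for ch in decomposed if unicodedata.combining(ch)}
    return sorted(f"U+{ord(ch):04X}" for ch in marks)


if __name__ == "__main__":
    line = "actu\u0305 fu\u0364r"  # invented sample, not from the cartulary
    print(expand_abbreviations(line))
    print(remaining_combining_marks(expand_abbreviations(line)))
```

Superscript letters such as the one in uͤ are deliberately left untouched here, since they are part of the spelling rather than abbreviation signs.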

For two reasons, it was decided not to integrate dictionaries to try to enhance the accuracy of the model.  First, medieval texts tend to be full of variants: the same word can occur in the same text with several different spellings.  Second, in the cartulary, as in other medieval documents, Latin and the vernacular (in this case middle German) are mixed.  Despite the lack of a dictionary, the HTR model was still able to recognise these documents at a high level of accuracy.

In the future, we hope to be able to create general models that can be applied to regular handwriting as found in medieval books and charters.  All that is needed is a large amount of training data from different medieval documents.  So, come join us and start to train your own HTR model!

By Tobias Hodel, University of Zurich.

+ Date for your diary! The first Transkribus User Conference comes to Vienna in November 2017

We’re delighted to announce that we will be organising a dedicated conference for Transkribus users this November.

The Transkribus User Conference will take place at the Technical University Vienna on 2-3 November 2017.  It will be a forum for new and more experienced users of our platform to find out more about the capabilities of Transkribus and the latest research into Automated Text Recognition for print and handwriting.

You can expect suggestions for best practice when working with Transkribus, presentations on the accuracy of Automated Text Recognition and demonstrations of new tools like our e-learning app for reading historical documents and DocScan, our mobile app for digitising historical papers with a mobile phone.  We will also hear use cases and results from archivists and researchers who have been working with Transkribus intensively.  Finally, the conference will be a valuable opportunity for us to hear from our users – we need your feedback on the Transkribus infrastructure and our aim of revolutionizing access to historical collections.

Registration details and a full programme will be announced soon – watch this space!

+ Coming soon! Teach yourself to read historical handwriting with our e-learning app

At the READ project, we are dedicated to using new technologies to make historical documents more accessible.  Our latest forthcoming tool is an important part of this mission. Transkribus Learn, our free e-learning app, will allow users to train themselves to decipher any sort of historical handwriting.  It will be particularly useful for students who are just beginning to work with historical material but could be beneficial to anyone who wants to get to grips with a certain script.  Try it out!

The e-learning app presents selected lines from a manuscript one by one and asks users to transcribe a certain word.  Users can practise transcribing as many words as they desire and can then move on to test what they have learnt.  The tool keeps a record of how many words have been transcribed correctly so the user can get an idea of their progress.  The tool is quick and easy to use – you can transcribe a vast number of words once you get going.  It also works on mobile phones for any keen users who might like to brush up on their transcription skills on their commute!

The e-learning app is connected to the Handwritten Text Recognition technology in our Transkribus platform.  Computer-generated transcripts are compared to the suggested words submitted by users.  Once users have worked with Transkribus to train a model to process a set of documents, those documents can be freely included in the e-learning app.
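To give a feel for the comparison and record-keeping described above, here is a small sketch.  It is not the app's actual code: the class and function names are hypothetical, and ignoring upper and lower case when comparing words is our own assumption.

```python
import unicodedata
from dataclasses import dataclass


def normalise(word: str) -> str:
    """Trim whitespace, unify the Unicode form and ignore case.
    (Ignoring case is an assumption made for this sketch.)"""
    return unicodedata.normalize("NFC", word.strip()).casefold()


@dataclass
class PracticeSession:
    """Keeps a running record of a user's attempts."""
    attempted: int = 0
    correct: int = 0

    def submit(self, expected_word: str, user_word: str) -> bool:
        """Compare the user's word with the computer-generated transcript."""
        is_correct = normalise(expected_word) == normalise(user_word)
        self.attempted += 1
        self.correct += int(is_correct)
        return is_correct

    @property
    def accuracy(self) -> float:
        """Share of words transcribed correctly so far."""
        return self.correct / self.attempted if self.attempted else 0.0


if __name__ == "__main__":
    session = PracticeSession()
    session.submit("Königsfelden", " königsfelden ")
    session.submit("abbas", "abbatis")
    print(session.correct, "of", session.attempted, "correct")
```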

We are still working on our prototype but the e-learning app will be released later in 2017.  It will represent a welcome service for anyone who wants to become more familiar with historical handwriting.  The e-learning app could also be offered to users in crowdsourcing initiatives – volunteers could practise transcribing and gain confidence before they start contributing to a project.

More updates on the app will be coming soon and we look forward to your feedback!

+ Algorithms, models and medieval documents – join us at the International Medieval Congress 2017

We are already getting excited for one of Europe’s biggest history conferences!  The International Medieval Congress attracts medieval scholars from around the globe, who will be presenting their research this year across 238 sessions.  The READ project will be presenting a panel and a workshop to spread the good word about our handwritten text recognition technology.  And we are looking forward to the famous IMC disco too!

The International Medieval Congress takes place at the University of Leeds on 3-6 July 2017.  The READ project will be presenting on Monday 3 July, at 11:15 in session no. 139.  The details are as follows:

The Digital Scribe: Handwritten Text Recognition (HTR) of Medieval Documents

From Memoria to the Memory of the Turning Points of Life: Matricula-online and HTR

Elena Muehlbauer (Passau Diocesan Archives)

Transkribus and the Archives of a Brigittine Monastery – Making Digital Editions of Naantali documents

Maria Kallio (National Archives Finland)

Sending 15th Century Missives Through Algorithms: Testing and Evaluating HTR with 2,200 Documents

Tobias Hodel  (State Archives Zurich)

You can see a couple of examples of the documents that our panel will be discussing below. How does Handwritten Text Recognition technology cope with the writing of these medieval scribes?

Copy of privileges, orders, seasonal contributions and records pertaining to the cloister holdings of Königsfelden. Compiled at the time of Queen Agnes of Hungary (ca. 1281-1364).  [Aarau, Staatsarchiv Aargau, AA/0428, f. 1r – Cartulary I of Königsfelden.  Image from e-codices]

The READ project will also be hosting a separate workshop at the University of Leeds on the morning of Wednesday 5 July 2017.  This is open to all – medieval scholars and beyond!

We will give an overview of the latest advances in Handwritten Text Recognition technology and show participants how they can work with our Transkribus tool to train a computer to automatically process a set of documents of any language, date, style or format.  To register for the workshop or for more information, please email Tobias Hodel.

+ DATeCH Conference – learn about Handwritten Text Recognition at our workshop

The DATeCH International Conference is fast approaching on 1-2 June 2017 in Göttingen.  The conference is a forum for innovative work on the creation, use and transformation of digitised historical documents.

If you are planning on attending the conference, you might be interested in our pre-conference workshop on 31 May.

We will be giving participants an overview of READ project technology and showing them how to apply handwritten text recognition to their own documents.  The workshop will be led by Tobias Hodel (State Archives Zurich), with support from researchers at the Computer Vision Lab, Technical University Vienna and the Computational Intelligence Technology Lab, University of Rostock.

On the DATeCH website you can consult the agenda of the workshop and find more information on registration.  If you need any further details, please email Tobias Hodel.

+ Machine Reading the Archive in Cambridge

It was a sunny Tuesday morning when the READ project made it to the Centre for Research in the Arts, Social Sciences and Humanities (CRASSH) at the University of Cambridge for our latest workshop.  Louise Seaward (Bentham Project, University College London) and Sebastian Colutto (University of Innsbruck) delivered a presentation and workshop on automated text recognition for handwritten and printed text.

The Mathematical Bridge at Queens’ College, University of Cambridge [Image by Louise Seaward]

Whilst Sebastian gave a technical overview of how our Transkribus platform can be used for automated text recognition, Louise explained the potential benefits of the automatic transcription and searching of documents from the perspective of a historian.  The team then delivered a hands-on workshop where staff and students from the university were able to get to grips with Transkribus.  Participants learnt how computers can be trained to recognise handwriting and how accurate this recognition can be.  There was also much interest in new methods for the automated recognition of printed text, which can produce even better results than Optical Character Recognition (OCR)!

Sebastian Colutto delivers a Transkribus workshop at the University of Cambridge [Image by Louise Seaward]

The event was part of ‘Machine Reading the Archive’, a training and development programme for digital methods organised by Cambridge Digital Humanities Network, Cambridge Big Data and the Cambridge Digital History Programme.  The READ project looks forward to contributing to the programme again in the future!

+ Meet the READ project partners – Sofia Ares Oliveira

What’s your name?

Sofia Ares Oliveira.

Where do you work?

Digital Humanities Laboratory at Ecole Polytechnique Fédérale de Lausanne (EPFL).

Tell us a bit about your background…

I studied Electrical Engineering at EPFL and specialised in Information Technology, where I explored several signal processing topics, from acoustics to biomedical signals to images. I started working at the DHLab on cadaster map images and since then I have been working on the several thousand historical documents from Venice that have been digitized and are waiting to be processed.

What is your role in the READ project?

At EPFL we are responsible for the Large Scale Demonstrator, the Venice Time Machine, which aims at building a multidimensional model of Venice and its evolution covering a period of more than 1000 years. I am mainly in charge of integrating and implementing computer vision and image processing tools for handwritten text documents and cadaster maps.

What is top of your to-do list at the moment?

Finalising the release of the Cython bindings for the line segmentation tools in Transkribus, so that other READ partners can use them from Python.

What do you like best about working on READ?

Working with people coming from different fields and countries, and the ‘product-oriented’ vision of the project, with direct feedback from users.

If you could do another job for just one day, what would it be?

Astronaut, a nice combination of scientist, engineer and explorer!

What can you see out of the window of your office? 

Thanks Sofia! 

+ A new model for Humanities research – collaboration with HumaReC

HumaReC is a new research platform developed by the Swiss Institute of Bioinformatics.  It is part of a project to investigate the digital production and publication of Humanities data using an edition of a New Testament manuscript as a test case.

HumaReC is aiming to establish a new model of Humanities research which allows for the continuous publication and analysis of data through transcriptions, blogs, a discussion forum and research publications. If you want to find out more, Claire Clivaz from HumaReC has recently written a blog post where she talks more about this idea of curating data in the Digital Humanities.

HumaReC are working with the READ project to train Handwritten Text Recognition technology to recognise the Arabic, Latin and Greek writing from the New Testament manuscript.  It is a particular challenge to process these three languages in one document collection!  They are also experimenting with ways to link files in our Transkribus platform to those which appear in the image viewer on their website.  You can take a look at their software on their GitHub page.

+ Transkribus in 10 steps?! Find out how in our new video…

Are you interested in using Transkribus for Handwritten Text Recognition?  If you have a couple of minutes, you can get an overview of the process in our new video.  ‘How to use Transkribus – in 10 steps’ was put together by Elena Mühlbauer from Passau Diocesan Archives, who are one of the READ project partners.

You can find a more detailed version of this How to Guide, along with other instructional papers, on the Transkribus wiki.  

Or if you’re in the mood for more videos, the Transkribus YouTube channel has a growing playlist of video presentations relating to the READ project.

You could try ‘Handwritten Text Recognition: Key Concepts’ by Roger Labahn (University of Rostock)

or ‘Automated Writer Identification and its Use Cases for Archival Documents’ by Stefan Fiel (Technical University Vienna).  

Happy watching!  

+ Welcoming The British Library to the READ project network!

We are very happy to welcome The British Library into the READ project network as a Memorandum of Understanding partner.  The British Library collection is vast, containing more than 150 million items including a copy of Magna Carta and papers written by The Beatles.

Cooperation between READ project partners and The British Library has been developing across the past few years and the library is now working with Transkribus to train a Handwritten Text Recognition model to recognise colonial records from the nineteenth century.  We look forward to seeing the results soon!

The British Library joins the national libraries of Spain, France and Norway and many other archives, libraries and institutions who have signed a Memorandum of Understanding with the READ project.  If you are interested in becoming part of our network, send us an email to find out more!