+ Trolls and water spirits – transcribing Swedish folklore records with Handwritten Text Recognition

It’s time to hear about some remarkable new results with Handwritten Text Recognition (HTR) technology – this time from the Institute for Language and Folklore in Sweden.

The Institute holds a collection of more than 30,000 pages of folklore records written by the Swedish folklorist Carl-Martin Bergstrand between the 1920s and the 1960s. Dr Fredrik Skott, an associate professor and research archivist at the Institute, has helped to train an HTR model to automatically transcribe these documents.

Dr Skott used our Transkribus platform to transcribe around 20,000 words from pages which were written by Bergstrand in the early 1930s.  A couple of example pages can be seen below, which contain Bergstrand’s records of an interview with August Svensson (b. 1842) where Svensson talked about water spirits and trolls.

Transcripts and images of these documents were processed by CITlab HTR – a form of HTR technology which uses Neural Networks to recognise handwriting.  The resulting HTR model can automatically produce transcripts of pages written by Bergstrand with an average Character Error Rate (CER) of 7.0%.  When a dictionary is integrated into the recognition process, the CER can be as low as 5.5%.
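
For readers new to the metric: CER is conventionally the number of single-character edits (insertions, deletions and substitutions) needed to turn the automatic transcript into a correct reference transcript, divided by the length of the reference. On that reading, a CER of 7.0% means roughly seven character errors per hundred characters. The following is a minimal Python sketch of that standard definition, not a description of how Transkribus itself evaluates models; the function names and the sample strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between the two transcripts, relative to the
    length of the reference (the manually checked transcript)."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented sample line, not taken from the Bergstrand records.
    reference = "trollen bodde i berget"
    hypothesis = "trolen bodde i bergat"
    print(f"CER: {character_error_rate(reference, hypothesis):.1%}")
```

Because the rate is measured per character rather than per word, a single wrong letter in a long word counts as one error, not as a wholly wrong word.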

Dr Skott is excited about the possibilities: ‘Previously, I always thought that future generations would have difficulty reading the folklore collections. Now I know that they will find it easier to read the text than the present generation does. In short, the results of our tests with Transkribus are amazing. After manually transcribing just 150 pages, our HTR model now reads the folklore records better than many of our visitors do’.

The Institute for Language and Folklore is now working with these transcriptions to produce a digital map of myths and legends that they plan to launch in autumn 2017.

+ Next week! Panel on Handwritten Text Recognition for Medieval Documents at the International Medieval Congress

The excitement is building – it’s nearly time for this year’s International Medieval Congress in Leeds. The READ project will be presenting a panel on the morning of Monday 3 July to show that yes, Handwritten Text Recognition can even work on medieval documents! Scroll down for full details of the panel, including abstracts of the papers.

We are also hosting a separate workshop at the University of Leeds on Wednesday 5 July for anyone interested in learning more about the technology – please email Tobias Hodel for details.

Details of the panel:

Monday 3 July, 11:15am, Session no. 139. The Digital Scribe: Handwritten Text Recognition (HTR) of Medieval Documents

Abstracts of the papers:

Elena Muehlbauer (Passau Diocesan Archives), From Tables to Transkribus. From information to knowledge. Working with parish registers. [Change to the scheduled programme] 

The Diocesan Archives of Passau preserve more than 800,000 pages of parish registers. Those pages tell their readers about the important stages in life – birth, marriage and death – of Catholics all over Bavaria and Austria. Those facts are highly revealing for genealogists but also for social historians who wish to understand the development of modern life. With technology available in the Transkribus platform, we are now able to gain access to a selection of registers that are written in a very specific way: tables and forms given to the priests by the newly founded state. We are currently working on an engine that will extract information out of words automatically. With the help of Transkribus, data transforms into information – and from information into knowledge.

Maria Kallio (National Archives of Finland), Transkribus and the Archives of a Brigittine Monastery: Making Digital Editions of Naantali Documents

In the summer of 2016 the National Archives of Finland started a project to make new editions of medieval charters originating from the Brigittine Monastery of Naantali. The goal of the project was to make new editions of 136 documents and publish them in digital form in the Diplomatarium Fennicum database. Because several researchers were working on the project, there was a serious need for a flexible platform on which co-operation would be easy to implement. Since an advanced transcription undertaken in Transkribus can be used as the basis for a digital edition, the project chose to work with this platform. The presentation describes the workflow and project results, without forgetting the challenges and insights that arose during the project.

Tobias Hodel (State Archives of Zurich), Sending 15th-Century Missives through Algorithms: Testing and Evaluating HTR with 2,200 Documents

Is it possible to teach algorithms to read medieval handwriting? Does it make sense to have the material prepared by students, learning to read Gothic writing at the same time? Those two simple questions lay the groundwork for a discussion of how and whether handwritten text recognition and teaching of the Middle Ages can be intertwined.

The material to address the tasks consists of 2,200 missives from Thun, a small town in Switzerland. 120 documents were transcribed and used for training. In the process three difficulties were identified: different and changing hands, difficult layout structures, and abbreviations. These difficulties are typical for such an endeavor. Unfortunately the results of the recognition are insufficient and can only be used cautiously by scholars. The ‘small’ amount of material for training is a reason for the poor levels of recognition. Using language models, the results can be improved, although crucial parts such as names and verbs still remain only partially identifiable. At the same time the combination of teaching and the use of cutting-edge technological tools proved engaging. The students involved were highly motivated and welcomed the possibility to take part in a digital research endeavor.

+ Venice Time Machine and READ – new article in Nature journal

Last week’s issue of Nature carried a fascinating article on the work of the Venice Time Machine project at the Digital Humanities Lab, Ecole Polytechnique Fédérale de Lausanne (EPFL). This initiative is one of the READ project partners and it is working to digitize, annotate and index a huge cache of documents from 1000 years of Venetian history.

The article explores the Venice Time Machine as a large-scale collaboration between archives, historians and digital humanities scholars who are applying digital techniques to process these records – including building a huge digital scanner that can digitise images that are as large as 4×7 metres!  It also sheds light on how Handwritten Text Recognition technology from the READ project is being used to enable the processing and searching of handwritten text.  The Venice Time Machine hopes to take us back to the past by reconstructing the events and networks of this enchanting city.

Article: Abbott, Alison, ‘The ‘time machine’ reconstructing ancient Venice’s social networks’, Nature, 14 June 2017

+ Heading to the IMC 2017? Come to our workshop on Handwritten Text Recognition!

If you’re headed to the International Medieval Congress this year, you might be interested to know that we are running a parallel workshop on Handwritten Text Recognition at the University of Leeds.

On the morning of Wednesday 5th July, we will give an overview of the latest advances in this technology and show participants how they can work with our Transkribus tool to train a computer to automatically process a set of documents of any language, date, style or format.  You can consult the programme for more information.  To register, please email Tobias Hodel.

Look out for us at the IMC too! The READ project is also presenting a panel on Monday 3 July, where we will discuss how we have started to apply Handwritten Text Recognition to different sorts of early documents, from 15th-century missives in early modern German to sacramental registers and Brigittine charters.

+ Handwritten Text Recognition success with Italian documents from Archivio Storico Ricordi

The Archivio Storico Ricordi is one of the most important private music collections in the world and it has started to work with Handwritten Text Recognition (HTR) to process some of its treasures.  Founded in Milan in 1808, the Casa Ricordi publishing house contains a wealth of letters and scores from noted composers like Verdi and Puccini.

Screenshot of a letter from Giulio Ricordi and its HTR transcription in Transkribus

The archive submitted around 88,000 words of transcribed material written by Giulio Ricordi, the general manager of the publishing house in the late nineteenth century. This training data was used to generate a model that can produce automatic transcriptions of pages with an impressive Character Error Rate (CER) of 12.3%.

Our example document shows the results on a sample page from the collection – take a look to see how much the computer gets right!

+ Latest success story! Medieval Handwriting and Handwritten Text Recognition

Two partners in the READ project network have now successfully trained a new model to recognise Gothic handwriting!  The State Archives of Zurich (READ project partner) and the University of Zurich (READ project Memorandum of Understanding partner) have collaborated on the automatic recognition of a collection of medieval charters.

In 1336 a cartulary was written in Königsfelden, close to the city of Brugg (which is now part of Switzerland). Königsfelden abbey was a well-endowed institution with close ties to the dukes of Habsburg. In neat and regular handwriting, the charters of the institution were copied onto roughly 260 parchment pages. The cartulary is available online via e-codices.

Image of the cartulary of Königsfelden.  Aarau, Staatsarchiv Aargau, AA/0428, f. 1r [http://www.e-codices.unifr.ch/en/list/one/saa/0428]

At the University of Zurich, there is an ongoing project to create a digital scholarly edition of the charters of Königsfelden abbey.  The cartulary is an important source for early writing practices and has already been partially transcribed. The project team have been using our Transkribus platform to produce their transcriptions and they used these transcripts to train and test a Handwritten Text Recognition (HTR) model.

The model was trained on transcripts of around 26,000 words from the charters. These documents are written in a regular script with evenly ruled lines, which helps the technology to process the pages more easily. The HTR model is able to automatically produce transcripts of documents in the collection with an astonishing Character Error Rate (CER) of 10%.

Transkribus has been able to deal with some of the intricacies common to medieval documents. Thanks to the integration of Unicode, superscripts on letters, such as uͤ, can also be recognized by the HTR. Don’t expect this recognition to work perfectly, though: the signs are sometimes so small that even expert paleographers debate their meaning!

Furthermore, one of the main problems regarding pre-modern handwriting could partially be dealt with: abbreviations were indicated in the process of transcription by using combining diacritics such as ‘ ̄ ’ (U+0305 combining overline) or by entering the correct signs from Unicode.
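
For anyone curious about what such signs look like to a computer, here is a small Python sketch. It uses the standard library’s `unicodedata` module to list the code points behind a transcribed character. We are assuming here that the superscript e is encoded as U+0364; the helper name `describe` is our own.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


if __name__ == "__main__":
    # Assumed encoding: base letter u followed by a combining superscript e.
    u_with_superscript_e = "u\u0364"
    # An n carrying the combining overline mentioned above (U+0305).
    n_with_overline = "n\u0305"

    for sample in (u_with_superscript_e, n_with_overline):
        for code_point, name in describe(sample):
            print(code_point, name)
        print()
```

What appears on the page as one letter is stored as a base letter followed by a separate combining mark, which is why a transcript has to use these marks consistently if the HTR model is to learn them.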

Screenshot from Transkribus showing the computer-generated transcript of a cartulary document

Since the transcripts provided as training data were consistent, the automatic recognition of abbreviations (or rather the correct transcription using abbreviation signs) could in some cases be achieved. In order to produce easily legible transcriptions or even scholarly editions, these signs can be searched and replaced in Transkribus or in another editor at a later stage.
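
Outside Transkribus, a later search-and-replace pass of this kind could look like the Python sketch below. The expansion table is purely hypothetical and far simpler than real scribal practice, where the same sign can stand for different letters depending on the word; an actual edition would need an expert-curated list and manual checking.

```python
import unicodedata

COMBINING_OVERLINE = "\u0305"

# Hypothetical expansion table, invented for illustration only.
EXPANSIONS = {
    "e" + COMBINING_OVERLINE: "en",
    "u" + COMBINING_OVERLINE: "um",
}


def expand_abbreviations(text: str, expansions: dict[str, str] = EXPANSIONS) -> str:
    """Replace each abbreviated form with its expansion.

    The text is decomposed (NFD) first so that base letters and
    combining marks are separate code points, then recomposed (NFC).
    """
    text = unicodedata.normalize("NFD", text)
    for abbreviated, expanded in expansions.items():
        text = text.replace(abbreviated, expanded)
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    # Invented sample words, not taken from the cartulary.
    print(expand_abbreviations("gebe\u0305"))
    print(expand_abbreviations("actu\u0305"))
```

Keeping the abbreviation signs in the training transcripts and expanding them only afterwards means the model learns what is actually on the page, while the readable version is derived in a separate step that can be revised.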

For two reasons, it was decided not to integrate dictionaries to try to enhance the accuracy of the model. First, medieval texts tend to be full of variants: the same word can occur in the same text with several different spellings. Second, in the cartulary, as in other medieval documents, Latin and the vernacular (in this case middle German) are mixed. Despite the lack of a dictionary, the HTR model was still able to recognise these documents at a high level of accuracy.

In the future, we hope to be able to create general models that can be applied to regular handwriting as found in medieval books and charters.  All that is needed is a large amount of training data from different medieval documents.  So, come join us and start to train your own HTR model!

By Tobias Hodel, University of Zurich.

+ Date for your diary! The first Transkribus User Conference comes to Vienna in November 2017

We’re delighted to announce that we will be organising a dedicated conference for Transkribus users this November.

The Transkribus User Conference will take place at the Technical University Vienna on 2-3 November 2017.  It will be a forum for new and more experienced users of our platform to find out more about the capabilities of Transkribus and the latest research into Automated Text Recognition for print and handwriting.

You can expect suggestions of best practice for working with Transkribus, presentations on the accuracy of Automated Text Recognition and demonstrations of new tools like our e-learning app for reading historical documents and DocScan, our mobile app for digitising historical papers with a mobile phone. We will also hear use cases and results from archivists and researchers who have been working with Transkribus intensively. Finally, the conference will be a valuable opportunity for us to hear from our users – we need your feedback on the Transkribus infrastructure and our aim of revolutionizing access to historical collections.

Registration details and a full programme will be announced soon – watch this space!

+ Coming soon! Teach yourself to read historical handwriting with our e-learning app

At the READ project, we are dedicated to using new technologies to make historical documents more accessible. Our latest forthcoming tool is an important part of this mission. Transkribus Learn, our free e-learning app, will allow users to train themselves to decipher any sort of historical handwriting. It will be particularly useful for students who are just beginning to work with historical material but could be beneficial to anyone who wants to get to grips with a certain script. Try it out!

The e-learning app presents selected lines from a manuscript one by one and asks users to transcribe a certain word. Users can practise transcribing as many words as they like, and can then move on to test what they have learnt. The tool keeps a record of how many words have been transcribed correctly so that users can get an idea of their progress. The tool is quick and easy to use – you can transcribe a vast number of words once you get going. It also works on mobile phones for any keen users who might like to brush up on their transcription skills on their commute!

The e-learning app is connected to the Handwritten Text Recognition technology in our Transkribus platform. Computer-generated transcripts are compared to the suggested words submitted by users. Once users have worked with Transkribus to train a model to process a set of documents, those documents can be freely included in the e-learning app.

We are still working on our prototype but the e-learning app will be released later in 2017.  It will represent a welcome service for anyone who wants to become more familiar with historical handwriting.  The e-learning app could also be offered to users in crowdsourcing initiatives – volunteers could practice transcribing and gain confidence before they start contributing to a project.

More updates on the app will be coming soon and we look forward to your feedback!

+ Algorithms, models and medieval documents – join us at the International Medieval Congress 2017

We are already getting excited for one of Europe’s biggest history conferences!  The International Medieval Congress attracts medieval scholars from around the globe, who will be presenting their research this year across 238 sessions.  The READ project will be presenting a panel and a workshop to spread the good word about our handwritten text recognition technology.  And we are looking forward to the famous IMC disco too!

The International Medieval Congress takes place at the University of Leeds on 3-6 July 2017.  The READ project will be presenting on Monday 3 July, at 11:15 in session no. 139.  The details are as follows:

The Digital Scribe: Handwritten Text Recognition (HTR) of Medieval Documents

From Memoria to the Memory of the Turning Points of Life: Matricula-online and HTR

Elena Muehlbauer (Passau Diocesan Archives)

Transkribus and the Archives of a Brigittine Monastery – Making Digital Editions of Naantali documents

Maria Kallio (National Archives Finland)

Sending 15th Century Missives Through Algorithms: Testing and Evaluating HTR with 2,200 Documents

Tobias Hodel (State Archives Zurich)

You can see a couple of examples of the documents that our panel will be discussing below. How does Handwritten Text Recognition technology cope with the writing of these medieval scribes?

Copy of privileges, orders, seasonal contributions and records pertaining to the cloister holdings of Königsfelden. Compiled at the time of Queen Agnes of Hungary (ca. 1281-1346). [Aarau, Staatsarchiv Aargau, AA/0428, f. 1r – Cartulary I of Königsfelden. Image from e-codices]

The READ project will also be hosting a separate workshop at the University of Leeds on the morning of Wednesday 5 July 2017.  This is open to all – medieval scholars and beyond!

We will give an overview of the latest advances in Handwritten Text Recognition technology and show participants how they can work with our Transkribus tool to train a computer to automatically process a set of documents of any language, date, style or format.  To register for the workshop or for more information, please email Tobias Hodel.

+ DATeCH Conference – learn about Handwritten Text Recognition at our workshop

The DATeCH International Conference is fast approaching on 1-2 June 2017 in Göttingen.  The conference is a forum for innovative work on the creation, use and transformation of digitised historical documents.

If you are planning on attending the conference, you might be interested in our pre-conference workshop on 31 May.

We will be giving participants an overview of READ project technology and showing them how to apply handwritten text recognition to their own documents.  The workshop will be led by Tobias Hodel (State Archives Zurich), with support from researchers at the Computer Vision Lab, Technical University Vienna and the Computational Intelligence Technology Lab, University of Rostock.

At the DATeCH website you can consult the agenda of the workshop and find more information on registration.  If you need any further details, please email Tobias Hodel.