+ Registration now open! Transkribus User Conference 2018

Registration for the 2018 Transkribus User Conference in Vienna is now open!

The conference will take place for the second time at the Technical University Vienna, right in the heart of Vienna, on 8-9 November 2018.

Registration for the Transkribus User Conference is currently at full capacity.  Please contact Tamara Terbul (Tamara.Terbul@uibk.ac.at) to join the waiting list and we will let you know if there are spaces available.

The conference registration fee is 50 EUR for regular participants and 25 EUR for students.

Please be aware that conference places are limited and granted on a first come, first served basis.

Building on the success of last year’s conference, this year’s programme will offer opportunities for Transkribus users to find out about the latest technological developments, experiment with new features, hear from users who have been working intensively with the platform and ask questions about how Automated Text Recognition could work on different kinds of documents.

Everyone is welcome – from Transkribus newbies to more experienced users.  And if you came along last year, there will be much new content to enjoy.

Some highlights of the programme include:

  • READ-COOP – hear about the future incarnation of the READ project which will promote collaborative working to preserve and enhance digital cultural heritage.
  • Transkribus in Practice – hear how users have been working with the platform to process documents of varying dates, languages and styles.
  • Transkribus workshop – a chance for new and more experienced users to learn how to work with the platform and ask questions of our developers.
  • READ technology showcase – presentations from computer scientists on the technology behind Transkribus.
  • Digitisation on demand in archives – a presentation of DocScan and the ScanTent, tools which help users to digitise historical documents using their mobile phone.

Conference participants who will arrive early in Vienna might also be interested in attending a pre-conference Scanathon at the Austrian Academy of Sciences on the afternoon of 7 November, where they will be able to try out these tools in an archival environment.

If you have any questions about the conference, please contact Tamara Terbul (Tamara.Terbul@uibk.ac.at).

We look forward to meeting you all in Vienna!

+ Unleashing the Transkribus API

by David Brown and Stephen Crane, Trinity College Dublin

On 30 June 1922, at the outset of the Irish Civil War, a cataclysmic explosion and fire destroyed the Public Record Office of Ireland at the Four Courts, Dublin. Flames and heat consumed seven centuries of Ireland’s recorded history, stored in a magnificent six-storey Victorian repository known as the Record Treasury. On the centenary of the 1922 blaze, the Beyond 2022 project at Trinity College Dublin will unveil Ireland’s Virtual Record Treasury: a digital reconstruction of the Public Record Office of Ireland building and its collections.

Large parts of these collections were copied prior to the fire: the work of antiquarians, historians and publicly funded projects that intended to publish the most historically significant parts of the collection as printed source material for scholars. For various reasons, only a small proportion of what were huge transcription projects was ever published, but copies survive in manuscript, running to millions of pages of handwritten text. The transcriptions were made between the seventeenth and nineteenth centuries in the trained secretarial hand of the times. Most projects were entrusted to a single transcriber, usually an expert in a particular field, and some individuals transcribed up to 25,000 pages over a period of many years. With so many examples of very large quantities of text produced by a single hand, the Irish Record Office transcriptions might as well have been prepared with Transkribus in mind.

19th Century transcription of late 16th Century patent roll by the Irish Record Commission for the unpublished ‘Acta Regia’. Courtesy of the Russell Library, Maynooth University: Renehan Collection, Vol. 3, p. 14.

The collections reflect the cataloguing arrangements in the original record office, and the largest sets of copies deal with topics central to the study of Irish history: the Elizabethan conquest and administration, the Plantation of Ulster, the Cromwellian occupation of Ireland, the Williamite wars and the breaking up of the great landed estates in the nineteenth century. All areas of history are covered in these transcripts, however, and the material includes early census-type records, trade, legal judgements and a wide range of smaller thematic collections related to specific towns and cities. The digitisation is most advanced for the Cromwellian period, 1650-1659, and the scale of documents recovered surpasses that which has survived for most parts of England.

Transkribus works very well on large, relatively uniform collections such as these. Several HTR models have been trained on 15,000 words each, beginning with the nineteenth-century hands and achieving, in some cases, a Character Error Rate (CER) of less than 2%! As the number of trained models increased, a separate project emerged to investigate whether the existing models could be used to partially recognise a sample from the next set of documents, and so speed up the process of creating each subsequent set of ground truth. It was decided to create a single-page ground truth for each new example, and to compare this with the text automatically generated by each model in the project in order to find the best one to work with.
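The comparison described above can be sketched in a few lines. What follows is a minimal illustration in Python, not the project's own code: it assumes that the ground truth and each model's output for the same page are already available as plain strings, and it computes CER as the Levenshtein edit distance divided by the length of the ground truth. The function names, model names and sample strings are all hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    # previous[j] holds the distance between reference[:i-1] and hypothesis[:j].
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate of a hypothesis against a ground-truth reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def best_model(ground_truth, outputs):
    """Return (model name, CER) for the output closest to the ground truth."""
    scores = {name: cer(ground_truth, text) for name, text in outputs.items()}
    name = min(scores, key=scores.get)
    return name, scores[name]


# Hypothetical single-page ground truth and two hypothetical model outputs.
ground_truth = "letters patent"
outputs = {"model_a": "letters patant", "model_b": "lettrs patnt"}
winner, score = best_model(ground_truth, outputs)
print(winner, round(score, 3))
```

In practice the ground truth would be a full transcribed page rather than two words, but the ranking logic is the same.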

Transkribus comprises a cross-platform client GUI which is downloaded and executed on users’ local machines, whether Windows, Mac or Linux. This GUI communicates with a remote server over the Web. The server allows users to manage collections of documents, train HTR models and run models against document collections, all in response to user requests through the GUI.

Unusually, the Transkribus project has separately published an open-source client library which the GUI uses to make requests to the server. As part of a summer project we decided to use this library as the basis for a scripting language, allowing us to write mini-programs (scripts) that automate common tasks separately from the GUI while using the same back-end services as the GUI.

The client library as shipped is written in the Java programming language, which runs on a virtual machine known as the JVM, and which enables the client to be cross-platform. We decided to base our scripting language on Clojure, an idiomatic modern Lisp which also runs in the JVM and provides excellent Java interoperability.

Our scripting language, which we call Transkript, is also published as open-source, on GitHub. It does not implement all of the underlying API, just enough to enable a couple of small scripting applications: eval-models and run-ocr.

The first script compares multiple trained models associated with a collection, using the first page of a specified document. Using the GUI this would be a laborious affair since running each model takes some time. A user can run our script and return later to browse the results.

The second script is used to upload a folder of images representing pages of a typewritten document, run OCR on it, and download the text output of the OCR process.

The power of our approach is that each of these scripts took only a couple of hours to write and test, and the core of each of them is about a dozen lines of fluent code, which is quite comprehensible, even to relatively non-technical users. The scripting language does not add any new functionality to Transkribus, but enables dramatically increased productivity through the batch processing of large numbers of jobs. There are multiple additional scripts that can be employed, for example to HTR documents automatically once the most appropriate model has been identified by the eval-models script.

+ Transkribus – The Best Idea to Procrastinate I’ve Ever Had

Stefan Karcher, a graduate student at Heidelberg University, has written a fascinating blog post explaining how he has been using Transkribus to process nineteenth-century German sermons.

Karcher took the opportunity to train his own Automated Text Recognition models.  He used around 30,000 transcribed words of training data to generate a model that can produce transcripts of his documents with a Character Error Rate of 8-10%.  The blog post notes that these transcripts are a useful and efficient basis for his research and includes a description of how these automated transcripts can be analysed with Voyant Tools.

Do you want to train your own Automated Text Recognition model?

+ Eighteenth-century medical casebooks – transcribed with Transkribus!

William Hey (1736 – 1819) was an English surgeon who worked at Leeds General Infirmary, served as mayor of Leeds and as president of the Leeds Philosophical and Literary Society.

The team in Special Collections at the library of the University of Leeds (one of the READ project MOU partners) are interested in creating digital transcriptions of the writings of this notable local figure.

They have transcribed around 15,000 words from Hey’s medical casebooks in our Transkribus platform and used this data to train two Automated Text Recognition models to recognise Hey’s writing.

The first model was trained solely on the Hey papers; the second model included the pre-existing ‘English Writing M1’ model as part of the training process.  The ‘English Writing M1’ model is trained to recognise the writing of the English philosopher Jeremy Bentham (1748 – 1832) and his secretaries – it is freely available to all Transkribus users for their experiments.

The results were very good, reflecting both the relative simplicity of Hey’s handwriting and the amount of training data for eighteenth-century English writing that has already been submitted in Transkribus by various other research and archival teams.

The best results for the automated recognition of Hey’s writing came with the latter model – it can produce transcripts of papers written by Hey with a Character Error Rate (CER) of just 8%.  This means that more than 90% of the characters are transcribed correctly by the software – and this is a very good starting point for manually correcting and improving the quality of these transcripts with a view to making them available to archival users. The Special Collections team also hope to improve the accuracy of their model by transcribing more words of training data.
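To make the arithmetic explicit: character accuracy is one minus the CER, so a CER of 8% corresponds to 92% of characters being transcribed correctly. The small Python sketch below is illustrative only. The function names and the 1,500-character page length are assumptions, not figures from the Leeds project, and the correction count is a rough estimate that treats every character error as one manual fix.

```python
def character_accuracy(cer):
    """Share of characters transcribed correctly, for a CER between 0 and 1."""
    if not 0.0 <= cer <= 1.0:
        raise ValueError("this sketch assumes a CER between 0 and 1")
    return 1.0 - cer


def expected_corrections(total_characters, cer):
    """Rough number of characters a human corrector would need to fix."""
    return round(total_characters * cer)


# Assumed example: a page of 1,500 characters recognised at a CER of 8%.
print(f"{character_accuracy(0.08):.0%} of characters correct")
print(expected_corrections(1500, 0.08), "characters to correct")
```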

To find out how to prepare training data for Automated Text Recognition and train your own model in Transkribus, take a look at our How to Guides:

+ Reading Admiral de Ruyter’s journal – using existing transcripts to train Automated Text Recognition

Nicoline van der Sijs is part of a team of researchers working at the Meertens Institute in the Netherlands (one of the READ MOU partners).  The team has trained an Automated Text Recognition model to process the handwriting of Michiel de Ruyter, a Dutch admiral from the seventeenth century.

The model was trained with around 20,000 words of existing transcribed material from de Ruyter’s journals (see below for an example of his tricky handwriting!).  These transcriptions were matched automatically to corresponding digitised images of de Ruyter’s handwriting using Text2Img matching technology developed by the CITlab team at the University of Rostock (one of the READ project partners).

The resulting model is capable of recognising de Ruyter’s handwriting with a Character Error Rate (CER) of around 10%, which is a remarkable result for such a complex hand.

Image from the De Ruyter collection from the National Archives of the Netherlands, NL HaNA 1.10.72 20 0004

Professor van der Sijs and her colleagues are planning to use these transcriptions to compile an online corpus of de Ruyter’s writings for general access and scholarly linguistic analysis.

Researchers at the Meertens Institute are also interested in replicating these exciting results with other collections where existing transcriptions are already available, thanks to the hard work of volunteer transcribers.  The Stichting Vrijwilligersnet Nederlandse Taal (SVNT) is a network of about 100 volunteers who have been transcribing historic Bibles for more than ten years.  Other material transcribed by volunteers includes sailing letters from the seventeenth and eighteenth centuries and seventeenth-century printed newspapers.  The transcriptions that these volunteers have produced can be fed into our cutting-edge technology and used as training data for Automated Text Recognition.

  • Do you have existing transcriptions that you have produced or collected as part of a research project?
  • Send them to us and we can process them and train a model to recognise the writing in your documents!
  • To find out more about working with existing transcripts, consult our How to Guide or contact us.

+ Working with tables in Transkribus? Help has arrived!

Record books, registers, accounts – these are just a few of the hundreds of types of archival document that can be laid out in tables and forms.  Although the human eye can easily spot the patterns in these kinds of documents, they often present a challenge for Automated Text Recognition technology.

If you are trying to process tabular documents in Transkribus, we have an updated How to Guide which can help:

The guide explains how to mark up tables using segmentation tools and then export the transcriptions of these tables into Excel.

The new section of the guide focuses on the semi-automatic processing of tables, describing how users can create a table template that can be applied to multiple pages that possess a similar layout.  This new functionality should hopefully make it simpler and quicker to create training data for Automated Text Recognition from documents laid out in tables – good luck!

Image: UCL Special Collections, Bentham Papers, box i, fol. 631.

+ New resources for German-speaking Transkribus users

A quick announcement about some new German language resources for users of our Transkribus platform.

You can now find the following German language How to Guides on the Transkribus wiki:

Transkribus beginners can also watch a German language webinar to start learning the basics:

+ Transkribus recognises early modern German correspondence

The Gender History research group at the University of Jena (Thuringia, Germany) have been experimenting with Transkribus as part of a digital edition project on the correspondence of the eighteenth-century regent, Erdmuthe Benigna von Reuß-Ebersdorf (1670-1732).

Early Modern scripts are very challenging for Automated Text Recognition technology because letters tend to be closely intertwined, abbreviations occur quite often and the spelling of words is not standardized.  As the example below suggests, Erdmuthe’s writing is not easy to follow!  She had a unique writing style and often broke words into separate parts.

Sample page of a letter (Source: Landesarchiv Thüringen – Staatsarchiv Greiz, Paragiatsherrschaft Köstritz, From IV 15, fol. 56r. All rights reserved)

In order to train a model to recognise Erdmuthe’s writing, the Gender History research team used about 250 pages of existing transcripts that had been produced in the course of their work on the digital edition.  They also used these same transcripts to create a dictionary of Erdmuthe’s vocabulary that can be integrated into the recognition process.

The resulting model is capable of producing automated transcripts of Erdmuthe’s writing with a Character Error Rate (CER) of below 9%.  When a dictionary is included in the recognition process, the errors are reduced still further.
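As a rough illustration of the dictionary step, a word list can be derived from existing transcripts simply by counting word forms. The Python sketch below is only an assumption about how such a list might be built. It is not the Jena team's method, the sample phrases are invented, and the result is a plain frequency list rather than any particular Transkribus dictionary format.

```python
import re
from collections import Counter

# Runs of Unicode letters (so ß and umlauts are kept), excluding digits and "_".
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def build_vocabulary(transcripts):
    """Count every word form in the transcripts, most frequent first."""
    counts = Counter()
    for text in transcripts:
        counts.update(WORD_PATTERN.findall(text))
    return counts.most_common()


# Invented sample lines; real input would be the project's own transcripts.
sample = ["Gott sey Danck", "Gott sei Dank"]
for word, frequency in build_vocabulary(sample):
    print(word, frequency)
```

Because early modern spelling is not standardised, variant spellings are deliberately kept as separate entries and the original capitalisation is preserved.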

Martin Prell from the project team has elaborated on this experiment in a report (in German).  He covers the experience of preparing training data for text recognition and working directly with Transkribus.  If you are thinking about using Transkribus for your own project, this very instructive paper could help!

+ Working with Gothic script? Join a new Transkribus working group!

Gothic scripts from the Middle Ages can be found in archives and libraries all over Europe.  The script was widely used for hundreds of years, and not only in expensive decorated books.  First experiments with documents from Switzerland and Germany have demonstrated that Gothic script can be recognised by Automated Text Recognition models with good levels of accuracy (see an example from the cartulary of the Königsfelden abbey).

The next step is to combine different examples of Gothic scripts in order to build and improve generic models for the recognition of this kind of document.  Dr Tobias Hodel (State Archives of Zurich, University of Zurich) has set up the ‘Gothic Hands’ working group – where all Transkribus users can work together towards the aim of the improved recognition of Gothic material.  Scroll down to find out more about joining the working group and its aims.

St. Gallen, Stiftsbibliothek, Cod. Sang. 857, p. 124.  The St. Gall Nibelung manuscript B with the Nibelungenlied (The Song of the Nibelungs) and “Klage” (lament), “Parzival” and “Willehalm” by Wolfram von Eschenbach, and Stricker’s “Karl der Grosse” (Charlemagne) (https://www.e-codices.ch/en/list/one/csg/0857).

The process of combining training data of different Gothic documents has already been started as part of a collaboration between various digital editions projects (Parzival edition, Königsfelden edition and the Ortsbürgerarchiv St. Gallen, Konventsbuch St. Katharinental edition).  The resulting model (Comb_Gothic_Bookwriting) is now available to all Transkribus users – if you work with Gothic script, try it out!  The model can already transcribe Gothic documents with a Character Error Rate of less than 10%.  But this could be just the beginning!

The ‘Gothic Hands’ working group is looking for further examples of documents written in Gothic scripts from the 13th, 14th, and 15th centuries.  You can help us add to the collection – all that is needed are images and transcriptions. You can:

  • share existing training data that you have already prepared in Transkribus
  • prepare new images and transcripts in Transkribus in the ‘Gothic Hands’ collection
  • send over files containing images and transcripts which can be matched automatically and converted into training data using our Text2image tool.

To join the working group and get access to the ‘Gothic Hands’ collection in Transkribus, contact Tobias Hodel (tobias.hodel@hist.uzh.ch).

The ‘Gothic Hands’ working group aims to demonstrate that training-based algorithms like Automated Text Recognition need significant input from many stakeholders – they can only be improved by cooperation and sharing!  This way of thinking aligns perfectly with the future of the READ project.  After the end of our European Union funding in June 2019, READ will become a European Cooperative Society (SCE) run for the benefit of its members.  More information about this new direction will be shared soon on the READ website and during the upcoming Transkribus User Conference in Vienna this November.