+ Sharing data with Transkribus – Transcribimus and minutes of Vancouver City Council

We can all agree that it’s nice to share – and in the READ project, sharing data brings direct benefits for the Handwritten Text Recognition technology in our Transkribus platform. As with any machine-learning system, the more images and transcripts that are submitted to us as training data, the stronger the Handwritten Text Recognition technology becomes. Images and transcripts are not publicly shared, but they contribute to a general improvement in the technology behind the scenes.

Transcribimus is a community project based in Vancouver, Canada, with a sizeable collection of transcripts which they will be using to train a Handwritten Text Recognition model.

Transcribimus began when Sam Sullivan, former mayor of Vancouver, started to research the City Council minutes from the late nineteenth century with a view to exploring the achievements of Vancouver’s second mayor, David Oppenheimer. Sam’s physical limitations prevented him from visiting the archives as often as he would have liked, so he formed a partnership with Margaret Sutherland, a local retiree with experience of genealogy and reading old handwriting. Margaret began transcribing and digitising the minutes for Sam and was gradually joined by other volunteer transcribers, including Christopher Stephenson, a graduate student in Library and Archival Studies who provided a great deal of assistance. Transcribimus eventually became an online platform where more than 20 volunteers have transcribed some 3,500 pages of handwritten minutes.

Image from the City Council Minutes. City of Vancouver Archives, VMA 23-5 page 214. Image credit: Margaret Sutherland.

These transcriptions are already freely available on the Transcribimus website.  The City of Vancouver Archives will ultimately display the images and transcripts on their website too.

The vast majority of the minutes are written in one hand, so these images and transcripts will likely feed into a strong Handwritten Text Recognition model that produces useful transcripts of the collection. Transcribimus volunteers could then check and correct any errors in these automated transcripts – and the transcription of the City Council minutes should hopefully be realised more quickly!

  • Do you have existing transcriptions that you have produced or collected as part of a research project?  Ideally 500 pages or more…
  • Send them to us and we can process them and train a model to recognise the writing in your documents!
  • To find out more about working with existing transcripts, consult our How to Guide or contact us.

+ Learn more about Transkribus in Zagreb

Join us for an event in the Croatian capital of Zagreb on Thursday 18th October.

The event is hosted by ICARUS Croatia and the Faculty of Philosophy at the University of Zagreb.

There will be a morning of lectures from READ project researchers based at the University of Innsbruck and University College London, which will explain the workings of Transkribus and the possibilities of Handwritten Text Recognition for different kinds of historical documents.

There are also limited spaces for a Transkribus workshop, where participants will be able to learn tips and tricks for working directly with the platform.

To enquire about attending the Zagreb event, please email Vlatka Lemić (vlemic@arhiv.hr).

+ Join us for Vienna Scanathon at the Austrian Academy of Sciences

Digitising historical documents? There’s an app for that!

Join us in Vienna for our next Scanathon event, hosted by the Austrian Academy of Sciences and the Austrian Centre for Digital Humanities.

Screenshot of DocScan app

Participants will have the opportunity to test out the DocScan mobile app and the ScanTent device, new tools which facilitate the digitisation of historical documents with a mobile phone.

The event will take place on the afternoon of Wednesday 7 November. Attendance is free and open to all – but registration is required.

The Scanathon could be an ideal pre-conference activity for anyone attending the 2018 Transkribus User Conference, also in Vienna.

We ask that attendees bring their smartphone to the event so they can work with the tools. The DocScan app is currently only available on Android phones.

Participants are also invited to bring their own documents to digitise during the event.

DocScan and the ScanTent are being developed by one of the READ project partners, the Computer Vision Lab at the Technical University of Vienna.

+ Registration now open! Transkribus User Conference 2018

Registration for the 2018 Transkribus User Conference in Vienna is now open!

The conference will take place for the second time at the Technical University of Vienna, right in the heart of the city, on 8–9 November 2018.

Registration for the Transkribus User Conference is currently at full capacity.  Please contact Tamara Terbul (Tamara.Terbul@uibk.ac.at) to join the waiting list and we will let you know if there are spaces available.

The conference registration fee is 50 EUR for regular participants and 25 EUR for students.

Please be aware that conference places are limited and granted on a first-come, first-served basis.

Building on the success of last year’s conference, this year’s programme will offer opportunities for Transkribus users to find out about the latest technological developments, experiment with new features, hear from users who have been working intensively with the platform and ask questions about how Automated Text Recognition could work on different kinds of documents.

Everyone is welcome – from Transkribus newbies to more experienced users.  And if you came along last year, there will be much new content to enjoy.

Some highlights of the programme include:

  • READ-COOP – hear about the future incarnation of the READ project, which will promote collaborative working to preserve and enhance digital cultural heritage.
  • Transkribus in Practice – hear how users have been working with the platform to process documents of varying dates, languages and styles.
  • Transkribus workshop – a chance for new and more experienced users to learn how to work with the platform and ask questions of our developers.
  • READ technology showcase – presentations from computer scientists on the technology behind Transkribus.
  • Digitisation on demand in archives – a presentation of DocScan and the ScanTent, tools which help users to digitise historical documents using their mobile phone.

Conference participants who arrive early in Vienna might also be interested in attending a pre-conference Scanathon at the Austrian Academy of Sciences on the afternoon of 7 November, where they will be able to try out these tools in an archival environment.


If you have any questions about the conference, please contact Tamara Terbul (Tamara.Terbul@uibk.ac.at).

We look forward to meeting you all in Vienna!

+ Unleashing the Transkribus API

by David Brown and Stephen Crane, Trinity College Dublin

On 30 June 1922, at the outset of the Irish Civil War, a cataclysmic explosion and fire destroyed the Public Record Office of Ireland at the Four Courts, Dublin. Flames and heat consumed seven centuries of Ireland’s recorded history, stored in a magnificent six-storey Victorian repository known as the Record Treasury. On the centenary of the 1922 blaze, the Beyond 2022 project at Trinity College Dublin will unveil Ireland’s Virtual Record Treasury​—a digital reconstruction of the Public Record Office of Ireland building and its collections.

Large parts of these collections were copied prior to the fire: the work of antiquarians, historians and publicly funded projects that intended to publish the most historically significant parts of the collection as printed source material for scholars. For various reasons, only a small proportion of these huge transcription projects was ever published, but copies survive in manuscript, running to millions of pages of handwritten text. The transcriptions were made between the seventeenth and nineteenth centuries in the trained secretarial hand of the times. Most projects were entrusted to a single transcriber, usually an expert in a particular field, and some individuals transcribed up to 25,000 pages over a period of many years. With so many examples of very large quantities of text produced by a single hand, the Irish Record Office transcriptions might as well have been prepared with Transkribus in mind.

19th Century transcription of late 16th Century patent roll by the Irish Record Commission for the unpublished ‘Acta Regia’. Courtesy of the Russell Library, Maynooth University: Renehan Collection, Vol. 3, p. 14.

The collections reflect the cataloguing arrangements in the original record office and the largest sets of copies deal with topics central to the study of Irish history: The Elizabethan conquest and Administration, the Plantation of Ulster, the Cromwellian occupation of Ireland, the Williamite wars and the breaking up of the great landed estates in the nineteenth century. All areas of history are covered in these transcripts, however, and the material includes early census-type records, trade, legal judgements and a wide range of smaller thematic collections related to specific towns and cities. The digitisation is most advanced for the Cromwellian period, 1650-1659, and the scale of documents recovered surpasses that which has survived for most parts of England.

Transkribus works very well on large, relatively uniform collections such as these. Several HTR models have been trained, each on around 15,000 words of ground truth, beginning with the nineteenth-century hands and achieving, in some cases, a Character Error Rate (CER) of less than 2%! As the number of trained models increased, a separate project emerged to investigate whether the existing models could be used to partially recognise a sample from the next set of documents, and so speed up the creation of each subsequent set of ground truth. The approach was to create a single page of ground truth for each new example and compare it with the text automatically generated by each model in the project, in order to find the best one to work with.
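The comparison step above comes down to the Character Error Rate: the edit distance between an automated transcript and the ground truth, divided by the length of the ground truth. A minimal sketch in Python of how that ranking works (the sample texts and model names are invented for illustration, not taken from the project):

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(ground_truth, transcript):
    """Character Error Rate: edit distance normalised by reference length."""
    return edit_distance(ground_truth, transcript) / len(ground_truth)

# Invented sample: one ground-truth line against two models' outputs.
truth = "the said lands and tenements"
outputs = {"model_A": "the saide lands and tenement",
           "model_B": "tle sad lnds and tenemets"}
best = min(outputs, key=lambda m: cer(truth, outputs[m]))  # lowest CER wins
```

A CER below 2%, as achieved on the nineteenth-century hands, means fewer than two wrong characters per hundred.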

Transkribus comprises a cross-platform client GUI which is downloaded and executed on users’ local machines (Windows, Mac or Linux). This GUI communicates with a remote server over the Web. The server allows users to manage collections of documents, train HTR models and run models against document collections, all in response to user requests through the GUI.

Unusually, the Transkribus project has separately published an open-source client library which the GUI uses to make requests to the server. As part of a summer project, we decided to use this library as the basis for a scripting language, allowing us to write mini-programs (scripts) that automate common tasks separately from the GUI, while using the same back-end services.

The client library as shipped is written in the Java programming language, which runs on a virtual machine known as the JVM, and which enables the client to be cross-platform. We decided to base our scripting language on Clojure, an idiomatic modern Lisp which also runs in the JVM and provides excellent Java interoperability.

Our scripting language, which we call Transkript, is also published as open source, on GitHub. It does not implement all of the underlying API – just enough to enable a couple of small scripting applications: eval-models and run-ocr.

The first script compares multiple trained models associated with a collection, using the first page of a specified document. Using the GUI this would be a laborious affair since running each model takes some time. A user can run our script and return later to browse the results.

The second script is used to upload a folder of images representing pages of a typewritten document, run OCR on it, and download the text output of the OCR process.

The power of our approach is that each of these scripts took only a couple of hours to write and test, and the core of each is about a dozen lines of fluent code, which is quite comprehensible even to relatively non-technical users. The scripting language does not add any new functionality to Transkribus, but enables dramatically increased productivity through the batch processing of large numbers of jobs. There are multiple additional scripts that can be employed, for example to run HTR on documents automatically once the most appropriate model has been identified by the eval-models script.
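The shape of the eval-models script can be sketched in a few lines. This is not the actual Transkript code (which is Clojure), and the `submit_job` and `get_result` names below are invented stand-ins for the client-library calls, but it shows the batch pattern: queue one recognition job per model, collect the transcripts, and rank them against a single page of ground truth.

```python
from difflib import SequenceMatcher

def eval_models(submit_job, get_result, models, ground_truth):
    """Queue one recognition job per model, then rank the resulting
    transcripts by similarity to a single page of ground truth."""
    jobs = {m: submit_job(m) for m in models}       # fire off all jobs first
    scores = {m: SequenceMatcher(None, ground_truth, get_result(j)).ratio()
              for m, j in jobs.items()}
    return max(scores, key=scores.get)              # best-matching model

# Invented stubs simulating the server round trip, for illustration only.
canned = {"job-1": "Council met on Monday", "job-2": "Councill mett on Munday"}
submit = {"hand_A": "job-1", "hand_B": "job-2"}.get
best = eval_models(submit, canned.get, ["hand_A", "hand_B"],
                   "Council met on Monday")
```

Because the jobs are all submitted before any results are fetched, the slow model runs happen server-side in a batch and the user can, as described above, simply return later to browse the results.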

+ Transkribus – The Best Idea to Procrastinate I’ve Ever Had

Stefan Karcher, a graduate student at Heidelberg University, has written a fascinating blog post explaining how he has been using Transkribus to process nineteenth-century German sermons.

Karcher took the opportunity to train his own Automated Text Recognition models. He used around 30,000 transcribed words of training data to generate a model that can produce transcripts of his documents with a Character Error Rate of 8-10%. The blog post notes that these transcripts are a useful and efficient basis for his research and includes a description of how these automated transcripts can be analysed with Voyant Tools.

Do you want to train your own Automated Text Recognition model?

+ Eighteenth-century medical casebooks – transcribed with Transkribus!

William Hey (1736–1819) was an English surgeon who worked at Leeds General Infirmary, served as mayor of Leeds, and was president of the Leeds Philosophical and Literary Society.

The team in Special Collections at the library of the University of Leeds (one of the READ project MOU partners) are interested in creating digital transcriptions of the writings of this notable local figure.

They have transcribed around 15,000 words from Hey’s medical casebooks in our Transkribus platform and used this data to train two Automated Text Recognition models to recognise Hey’s writing.

The first model was trained solely on the Hey papers; the second incorporated the pre-existing ‘English Writing M1’ model as part of the training process. The ‘English Writing M1’ model is trained to recognise the writing of the English philosopher Jeremy Bentham (1748–1832) and his secretaries – it is freely available to all Transkribus users for their experiments.

The results were very good, reflecting both the relative simplicity of Hey’s handwriting and the amount of training data for eighteenth-century English writing that has already been submitted in Transkribus by various other research and archival teams.

The best results for the automated recognition of Hey’s writing came with the latter model – it can produce transcripts of papers written by Hey with a Character Error Rate (CER) of just 8%. This means that 92% of the characters are transcribed correctly by the software – a very good starting point for manually correcting and improving the quality of these transcripts, with a view to making them available to archival users. The Special Collections team also hope to improve the accuracy of their model by transcribing more training data.

To find out how to prepare training data for Automated Text Recognition and train your own model in Transkribus, take a look at our How to Guides:

Further information:

+ Reading admiral de Ruyter’s journal – using existing transcripts to train Automated Text Recognition

Nicoline van der Sijs is part of a team of researchers working at the Meertens Institute in the Netherlands (one of the READ MOU partners).  The team has trained an Automated Text Recognition model to process the handwriting of Michiel de Ruyter, a Dutch admiral from the seventeenth century.

The model was trained with around 20,000 words of existing transcribed material from de Ruyter’s journals (see below for an example of his tricky handwriting!).  These transcriptions were matched automatically to corresponding digitised images of de Ruyter’s handwriting using Text2Img matching technology developed by the CITlab team at the University of Rostock (one of the READ project partners).

The resulting model is capable of recognising de Ruyter’s handwriting with a Character Error Rate (CER) of around 10%, which is a remarkable result for such a complex hand.

Image from the De Ruyter collection from the National Archives of the Netherlands, NL HaNA 1.10.72 20 0004

Professor van der Sijs and her colleagues are planning to use these transcriptions to compile an online corpus of de Ruyter’s writings for general access and scholarly linguistic analysis.

Researchers at the Meertens Institute are also interested in replicating these exciting results with other collections where existing transcriptions are already available, thanks to the hard work of volunteer transcribers.  The Stichting Vrijwilligersnet Nederlandse Taal (SVNT) is a network of about 100 volunteers who have been transcribing historic Bibles for more than ten years.  Other material transcribed by volunteers includes sailing letters from the seventeenth and eighteenth centuries and seventeenth-century printed newspapers.  The transcriptions that these volunteers have produced can be fed into our cutting-edge technology and used as training data for Automated Text Recognition.

  • Do you have existing transcriptions that you have produced or collected as part of a research project?
  • Send them to us and we can process them and train a model to recognise the writing in your documents!
  • To find out more about working with existing transcripts, consult our How to Guide or contact us.

+ Working with tables in Transkribus? Help has arrived!

Record books, registers, accounts – these are just a few of the hundreds of archival documents that can be laid out in tables and forms.  Although the human eye can easily spot the patterns in these kinds of documents, they often present a challenge for Automated Text Recognition technology.

If you are trying to process tabular documents in Transkribus, we have an updated How to Guide which can help:

The guide explains how to mark up tables using segmentation tools and then export the transcriptions of these tables into Excel.

The new section of the guide focuses on the semi-automatic processing of tables, describing how users can create a table template that can be applied to multiple pages with a similar layout. This new functionality should make it simpler and quicker to create training data for Automated Text Recognition from documents laid out in tables – good luck!

Image: UCL Special Collections, Bentham Papers, box i, fol. 631.