+ Transkribus on Euronews TV

Check us out – we’re on TV again!  EuroNews TV, a leading 24-hour information channel, has produced a short documentary film featuring READ which sheds light on the latest research in Handwritten Text Recognition.

The film is a co-production between EuroNews and the European Commission.  It is being aired in 10 languages on the award-winning Futuris programme on European science, research and innovation and should hopefully be seen by 430 million households in 130 countries!

+ Experiments with Transkribus and early printed text

We love hearing what our users have been getting up to with our Transkribus platform for Handwritten Text Recognition.

Annika Rockenberger from the National Library of Norway has written a blog about her experiments with Transkribus as part of her work on a digital edition of the writings of the German journalist, historian and poet Georg Greflinger (1620-1677).

Annika is working with early printed text which cannot be adequately recognised with OCR.  She explains that Transkribus users can train a model to recognise this kind of printed text, with around 5000 words of transcribed material.

Unfortunately, in this case, digitised images from tightly bound books have made it difficult for the programme to detect the location of text on a page.  Annika hopes to continue her experiments with Transkribus at a later date with better quality images.  Read more on the Greflinger Digital Edition blog.

+ Searching Jeremy Bentham’s manuscripts with Keyword Spotting

The Bentham Project has been experimenting with the Handwritten Text Recognition (HTR) of Bentham’s manuscripts for the past five years, first as a partner in the tranScriptorium project and now as part of READ.

Read about their progress with HTR and our Transkribus platform in blog posts from June 2017 and February 2018.

Keyword Spotting

The results have thus far been impressive, especially considering the immense difficulty of Bentham’s own handwriting.  But automated transcription is not yet at a point where it is sufficiently accurate to be used by Bentham Project researchers as a basis for scholarly editing.

However, the current state of the technology is strong enough for keyword searching!  And thanks to a collaboration with the PRHLT research center at the Universitat Politècnica de València (another partner in the READ project), there are some exciting new results to report.  It is now possible to search over 90,000 digital images of the central collections of Bentham’s manuscripts, which are held at Special Collections University College London and The British Library.

A Keyword Spotting search for the word ‘pleasure’

Appeal for volunteers!

A Google sheet has been prepared with some suggested search terms in 5 different spreadsheet tabs (Bentham’s neologisms, concepts, people, places and other).  The Bentham Project is appealing for people to record their searches online, using the suggested search terms and some new ones too.  Some of the results will be shared at the upcoming Transkribus User Conference in November.

Background

The PRHLT team have processed the Bentham papers with cutting-edge HTR and probabilistic word indexing technologies. This sophisticated form of searching is often called Keyword Spotting. It is more powerful than a conventional full-text search because it uses statistical models trained for text recognition to search through probability values assigned to character sequences (words), considering most possible readings of each word on a page.
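The idea of searching over probabilities rather than a single fixed transcript can be illustrated with a minimal Python sketch. This is a toy illustration only, not the PRHLT implementation: the page names, candidate readings, probabilities and the 0.3 threshold are all invented, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy data: for each word image on a page, the recogniser keeps several
# candidate readings, each with a probability (all values are invented).
PAGE_HYPOTHESES = {
    "page_001": [
        {"pleasure": 0.82, "measure": 0.11, "pleasant": 0.04},
        {"pain": 0.91, "gain": 0.06},
    ],
    "page_002": [
        {"treasure": 0.55, "pleasure": 0.38},
    ],
    "page_003": [
        {"happiness": 0.97},
    ],
}


def build_index(page_hypotheses):
    """Map every candidate reading to the places where it might occur."""
    index = defaultdict(list)
    for page, words in page_hypotheses.items():
        for position, readings in enumerate(words):
            for word, prob in readings.items():
                index[word].append((page, position, prob))
    return index


def search(index, query, threshold=0.3):
    """Return spots whose probability meets the threshold, best first."""
    hits = [hit for hit in index.get(query.lower(), []) if hit[2] >= threshold]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


index = build_index(PAGE_HYPOTHESES)
hits = search(index, "pleasure")
for page, position, prob in hits:
    print(f"{page} word {position}: p={prob:.2f}")
```

In this sketch, page_002 is still found for ‘pleasure’ even though the most likely reading of that word image is ‘treasure’; a search over a single best-guess transcript would have missed it. Raising the threshold returns fewer, more reliable spots.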

The result is that this vast collection of Bentham’s papers can be efficiently searched, including those papers that have not yet been transcribed! The accuracy rates are impressive. The spots suggest around 84-94% character accuracy (a Character Error Rate of 6-16%) when compared with manual transcriptions of Bentham’s manuscripts. More precisely, laboratory tests show that average word search precision ranges from 79% to 94%. This means that, out of 100 search results, on average between 6 and 21 may turn out not to be the word searched for. The accuracy of spotted words depends on the difficulty of Bentham’s handwriting – although it is possible to find useful results even in Bentham’s scrawl! There could be as many as 25 million words waiting to be found.
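For readers who want to see how a precision figure translates into wrong results, here is a small Python sketch. The list of checked search results is hypothetical and the function names are our own; only the 79% and 94% figures come from the tests reported above.

```python
def precision(judgements):
    """Fraction of returned spots that really are the word searched for."""
    if not judgements:
        raise ValueError("no search results to evaluate")
    return sum(judgements) / len(judgements)


def expected_misses(precision_value, n_results=100):
    """Number of results, out of n_results, expected to be wrong."""
    return round((1 - precision_value) * n_results)


# Hypothetical manual check of 50 spots: 47 correct, 3 wrong.
checked = [True] * 47 + [False] * 3
p = precision(checked)

print(f"precision of the checked sample: {p:.2f}")
print("misses per 100 results at 94% precision:", expected_misses(0.94))
print("misses per 100 results at 79% precision:", expected_misses(0.79))
```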

A search for the word ‘happiness’ uncovers Bentham’s most famous phrase, written in his own hand.

Use cases

This fantastic site will be invaluable to anyone interested in Bentham’s philosophy.  It will help Bentham Project researchers to find previously unknown references in pages that have not yet been transcribed.  It will allow researchers to quickly investigate Bentham’s concepts and correspondents.  It should also help volunteer transcribers in the Transcribe Bentham initiative to find interesting material to transcribe.

This interface is a prototype beta version.  In the future, there are plans to increase the power of this research tool by connecting it to other digital resources, allowing users to quickly search the manuscripts at the UCL library repository, the Bentham papers database and the Transcribe Bentham Transcription Desk, and linking these images to rich existing metadata.

Feedback on this new search functionality is welcomed at: transcribe.bentham@ucl.ac.uk

Similar Keyword Spotting technology (based on research by the CITlab team at the University of Rostock, another one of the READ project partners) is currently available to all users of the Transkribus platform.  Find out more about how to get started with Keyword Spotting.

+ New to Transkribus? Master the platform in just 10 steps

Maybe you’ve just discovered Transkribus and are feeling a bit overwhelmed?  Our updated video should help you get to grips with working with our Handwritten Text Recognition technology – in just 10 steps (and under 4 minutes!).

You can find more detailed information about working with our platform in our How to Guides.

While you’re on the Transkribus YouTube channel, check out our other videos too – including presentations from Transkribus users at the 2017 Transkribus User Conference.

+ Sharing data with Transkribus – Transcribimus and minutes of Vancouver City Council

We can all agree that it’s nice to share – and in the READ project, sharing data brings direct benefits for the Handwritten Text Recognition technology in our Transkribus platform.  According to principles of machine learning, the more images and transcripts that are submitted to us as training data, the stronger the Handwritten Text Recognition technology can become.  Images and transcripts are not publicly shared but they contribute to a general improvement in the technology behind the scenes.

Transcribimus is a community project based in Vancouver, Canada, with a sizeable collection of transcripts which they will be using to train a Handwritten Text Recognition model.

Transcribimus all started when Sam Sullivan, former mayor of Vancouver, started to research the City Council minutes from the late nineteenth century with a view to exploring the achievements of Vancouver’s second mayor, David Oppenheimer.  Sam’s physical limitations prevented him from visiting the archives as often as he would have liked.  So he formed a partnership with Margaret Sutherland, a local retiree who had experience of genealogy and reading old handwriting.  Margaret began transcribing and digitising the minutes for Sam and was gradually joined by other volunteer transcribers including Christopher Stephenson, a graduate student in Library and Archival studies who provided lots of assistance.  Transcribimus eventually became an online platform where more than 20 volunteers have transcribed some 3,500 pages of handwritten minutes.

Image from the City Council Minutes. City of Vancouver Archives, VMA 23-5 page 214. Image credit: Margaret Sutherland.

These transcriptions are already freely available on the Transcribimus website.  The City of Vancouver Archives will ultimately display the images and transcripts on their website too.

The vast majority of the minutes are written in one hand, so these images and transcripts will likely feed into a strong Handwritten Text Recognition model that produces useful transcripts of the collection. Transcribimus volunteers could then check and correct any errors in these automated transcripts – and the transcription of the City Council minutes should hopefully be realised more quickly!

  • Do you have existing transcriptions that you have produced or collected as part of a research project?  Ideally 500 pages or more…
  • Send them to us and we can process them and train a model to recognise the writing in your documents!
  • To find out more about working with existing transcripts, consult our How to Guide or contact us.

+ Learn more about Transkribus in Zagreb

Join us for an event in the Croatian capital of Zagreb on Thursday 18th October.

The event is hosted by ICARUS Croatia and the Faculty of Philosophy at the University of Zagreb.

There will be a morning of lectures from READ project researchers based at the University of Innsbruck and University College London, which will explain the workings of Transkribus and the possibilities of Handwritten Text Recognition for different kinds of historical documents.

There are also limited spaces for a Transkribus workshop, where participants will be able to learn tips and tricks for working directly with the platform.

To enquire about attending the Zagreb event, please email Vlatka Lemić (vlemic@arhiv.hr).

+ Join us for Vienna Scanathon at the Austrian Academy of Sciences

Digitising historical documents? There’s an app for that!

Join us in Vienna for our next Scanathon event, hosted by the Austrian Academy of Sciences and the Austrian Centre for Digital Humanities.

Screenshot of DocScan app

Participants will have the opportunity to test out the DocScan mobile app and the ScanTent device, new tools which facilitate the digitisation of historical documents with a mobile phone.

The event will take place on the afternoon of Wednesday 7 November.  Attendance is free and open to all – but registration is required.

The Scanathon could be an ideal pre-conference activity for anyone attending the 2018 Transkribus User Conference, also in Vienna.

We ask that attendees bring their smartphone to the event so they can work with the tools. The DocScan app is currently only available on Android phones.

Participants are also invited to bring their own documents to digitise during the event.

DocScan and the ScanTent are being developed by one of the READ project partners, the Computer Vision Lab at the Technical University of Vienna.

+ Registration now open! Transkribus User Conference 2018

Registration for the 2018 Transkribus User Conference in Vienna is now open!

The conference will take place for the second time at the Technical University Vienna, right in the heart of Vienna, on 8-9 November 2018.

Registration for the Transkribus User Conference is currently at full capacity.  Please contact Tamara Terbul (Tamara.Terbul@uibk.ac.at) to join the waiting list and we will let you know if there are spaces available.

The conference registration fee is 50 EUR for regular participants and 25 EUR for students.

Please be aware that conference places are limited and granted on a first-come, first-served basis.

Building on the success of last year’s conference, this year’s programme will offer opportunities for Transkribus users to find out about the latest technological developments, experiment with new features, hear from users who have been working intensively with the platform and ask questions about how Automated Text Recognition could work on different kinds of documents.

Everyone is welcome – from Transkribus newbies to more experienced users.  And if you came along last year, there will be much new content to enjoy.

Some highlights of the programme include:

  • READ-COOP – hear about the future incarnation of the READ project which will promote collaborative working to preserve and enhance digital cultural heritage.
  • Transkribus in Practice – hear how users have been working with the platform to process documents of varying dates, languages and styles
  • Transkribus workshop – a chance for new and more experienced users to learn how to work with the platform and ask questions of our developers.
  • READ technology showcase – presentations from computer scientists on the technology behind Transkribus
  • Digitisation on demand in archives – a presentation of DocScan and the ScanTent, tools which help users to digitise historical documents using their mobile phone.

Conference participants who will arrive early in Vienna might also be interested in attending a pre-conference Scanathon at the Austrian Academy of Sciences on the afternoon of 7 November, where they will be able to try out these tools in an archival environment.

If you have any questions about the conference, please contact Tamara Terbul (Tamara.Terbul@uibk.ac.at).

We look forward to meeting you all in Vienna!

+ Unleashing the Transkribus API

by David Brown and Stephen Crane, Trinity College Dublin

On 30 June 1922, at the outset of the Irish Civil War, a cataclysmic explosion and fire destroyed the Public Record Office of Ireland at the Four Courts, Dublin. Flames and heat consumed seven centuries of Ireland’s recorded history, stored in a magnificent six-storey Victorian repository known as the Record Treasury. On the centenary of the 1922 blaze, the Beyond 2022 project at Trinity College Dublin will unveil Ireland’s Virtual Record Treasury, a digital reconstruction of the Public Record Office of Ireland building and its collections.

Large parts of these collections were copied prior to the fire: the work of antiquarians, historians and publicly funded projects that intended to publish the most historically significant parts of the collection as printed source material for scholars. For various reasons, only a small proportion of what were huge transcription projects was ever published, but copies survive in manuscript running to millions of pages of handwritten text. The transcriptions were made between the seventeenth and nineteenth centuries in the trained secretarial hand of the times. Most projects were entrusted to a single transcriber, usually an expert in a particular field, and some individuals transcribed up to 25,000 pages over a period of many years. With so many examples of very large quantities of text produced by a single hand, the Irish Record Office transcriptions might as well have been prepared with Transkribus in mind.

19th Century transcription of late 16th Century patent roll by the Irish Record Commission for the unpublished ‘Acta Regia’. Courtesy of the Russell Library, Maynooth University: Renehan Collection, Vol. 3, p. 14.

The collections reflect the cataloguing arrangements in the original record office, and the largest sets of copies deal with topics central to the study of Irish history: the Elizabethan conquest and administration, the Plantation of Ulster, the Cromwellian occupation of Ireland, the Williamite wars and the breaking up of the great landed estates in the nineteenth century. All areas of history are covered in these transcripts, however, and the material includes early census-type records, trade, legal judgements and a wide range of smaller thematic collections related to specific towns and cities. The digitisation is most advanced for the Cromwellian period, 1650-1659, and the scale of documents recovered surpasses that which has survived for most parts of England.

Transkribus works very well on large, relatively uniform collections such as these. Several HTR models have been trained, each on around 15,000 words of transcribed text, beginning with the nineteenth-century hands and achieving, in some cases, a Character Error Rate (CER) of less than 2%! As the number of trained models increased, a separate project emerged to investigate whether the existing models could be used to partially recognise a sample from the next set of documents, and so speed up the process of creating each subsequent set of ground truth. It was decided to create a single page of ground truth for each new example, and to compare this with the text automatically generated by each model in the project to find the best one to work with.
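The comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's own code: it takes CER to be the character-level edit distance divided by the length of the ground truth, and the sample line, model names and model outputs are all invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance relative to reference length."""
    if not reference:
        raise ValueError("reference transcript is empty")
    return edit_distance(reference, hypothesis) / len(reference)


def best_model(ground_truth, outputs):
    """Return the model name with the lowest CER, plus all the scores."""
    scores = {name: cer(ground_truth, text) for name, text in outputs.items()}
    return min(scores, key=scores.get), scores


# Invented example: one line of ground truth and three model outputs.
ground_truth = "granted to the said John"
outputs = {
    "hand_a": "granted to the said Iohn",
    "hand_b": "gramted to the sayd Iohn",
    "hand_c": "granted the said John",
}

best, scores = best_model(ground_truth, outputs)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: CER {score:.1%}")
print("best model:", best)
```

In practice the ground truth would be a full transcribed page rather than a single line, and the outputs would come from running each trained model on that page.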

Transkribus comprises a cross-platform client GUI which is downloaded and executed on users’ local machines, whether Windows, Mac or Linux. This GUI communicates with a remote server over the Web. The server allows users to manage collections of documents, train HTR models and run models against document collections, all in response to user requests made through the GUI.

Unusually, the Transkribus project has separately published an open-source client library which the GUI uses to make requests to the server. As part of a summer project we decided to use this library as the basis for a scripting language, allowing us to write mini-programs (scripts) that automate common tasks separately from the GUI while using the same back-end services.

The client library as shipped is written in the Java programming language, which runs on a virtual machine known as the JVM and which enables the client to be cross-platform. We decided to base our scripting language on Clojure, an idiomatic modern Lisp which also runs on the JVM and provides excellent Java interoperability.

Our scripting language, which we call Transkript, is also published as open source on GitHub. It does not implement all of the underlying API, just enough to enable a couple of small scripting applications: eval-models and run-ocr.

The first script compares multiple trained models associated with a collection, using the first page of a specified document. Using the GUI this would be a laborious affair since running each model takes some time. A user can run our script and return later to browse the results.

The second script is used to upload a folder of images representing pages of a typewritten document, run OCR on it, and download the text output of the OCR process.

The power of our approach is that each of these scripts took only a couple of hours to write and test, and the core of each of them is about a dozen lines of fluent code, which is quite comprehensible, even to relatively non-technical users. The scripting language does not add any new functionality to Transkribus, but enables dramatically increased productivity through the batch processing of large numbers of jobs. Further scripts can be written in the same way, for example to run HTR on documents automatically once the most appropriate model has been identified by the eval-models script.

+ Transkribus – The Best Idea to Procrastinate I’ve Ever Had

Stefan Karcher, a graduate student at Heidelberg University, has written a fascinating blog post explaining how he has been using Transkribus to process nineteenth-century German sermons.

Karcher took the opportunity to train his own Automated Text Recognition models.  He used around 30,000 transcribed words of training data to generate a model that can produce transcripts of his documents with a Character Error Rate of 8-10%.  The blog post notes that these transcripts are a useful and efficient basis for his research and includes a description of how these automated transcripts can be analysed with Voyant Tools.

Do you want to train your own Automated Text Recognition model?