+ Update on table processing

Back in April we appealed for help in generating a new data set that could be used to improve the automated layout analysis of historical documents set out in tables.  We asked, and you answered!

Thanks to submissions from our network, READ researchers at the Computer Vision Lab at the Technical University of Vienna, Naver Labs Europe and the Passau Diocesan Archives have been compiling a sizeable collection of images of historical documents containing tables.

We now have a total of around 1,500 images from 25 contributors all around the world.  The delivered sources show a great variety of tables from hand-drawn accounting books to stock exchange lists and train timetables, from record books to prisoner lists, simple tabular prints in books, production censuses and many, many more.

READ researchers are preparing the data set as the basis for a computer science research competition in early 2019 (more details coming soon!).  This collection will be used to evaluate different approaches to the automated recognition of tables.

There is still a lot for us to learn about what constitutes a table.  Working with this heterogeneous data should help us to move beyond the specifics and come up with some generic guidelines and techniques for processing these kinds of pages.

We are very thankful to our network for delivering such a variety of tabular data and we look forward to sharing our next progress report!

Screenshot of 1937 Irish Census in Transkribus.  Image courtesy of National University of Ireland, Galway.

+ More than 15,000 Transkribus users!

Drumroll please!  Transkribus now has more than 15,000 users!  Our users are based mainly in Europe but also extend into Africa, Australia, America and other parts of the globe.

This expansion of our user base is a significant achievement for the READ project.  Back when the project started in January 2016, there were only 2,828 registered Transkribus users.  A broad user network is very important for us.  By working with an enormous variety of documents provided by different researchers, projects and institutions, we are developing robust Handwritten Text Recognition technology that can cope with all sorts of scripts.

So we look forward to collaborating with lots more new users in 2019 and beyond!  And if you haven’t tried out Transkribus yet, why not have a go?

+ Transkribus on Euronews TV

Check us out – we’re on TV again!  EuroNews TV, a leading 24-hour information channel, has produced a short documentary film featuring READ which sheds light on the latest research in Handwritten Text Recognition.

The film is a co-production between EuroNews and the European Commission.  It is being aired in 10 languages on the award-winning Futuris programme on European science, research and innovation and should hopefully be seen by 430 million households in 130 countries!

+ Experiments with Transkribus and early printed text

We love hearing what our users have been getting up to with our Transkribus platform for Handwritten Text Recognition.

Annika Rockenberger from the National Library of Norway has written a blog about her experiments with Transkribus as part of her work on a digital edition of the writings of the German journalist, historian and poet Georg Greflinger (1620-1677).

Annika is working with early printed text which cannot be adequately recognised with OCR.  She explains that Transkribus users can train a model to recognise this kind of printed text, with around 5000 words of transcribed material.

Unfortunately in this case, digitised images from tightly bound books have made it difficult for the programme to detect the location of text on a page.  Annika hopes to continue her experiments with Transkribus at a later date with better quality images.  Read more on the Greflinger Digital Edition blog.

+ Searching Jeremy Bentham’s manuscripts with Keyword Spotting

The Bentham Project has been experimenting with the Handwritten Text Recognition (HTR) of Bentham’s manuscripts for the past five years, first as a partner in the tranScriptorium project and now as part of READ.

Read about their progress with HTR and our Transkribus platform in blog posts from June 2017 and February 2018.

Keyword Spotting

The results have thus far been impressive, especially considering the immense difficulty of Bentham’s own handwriting.  But automated transcription is not yet at a point where it is sufficiently accurate to be used by Bentham Project researchers as a basis for scholarly editing.

However, the current state of the technology is strong enough for keyword searching!  And thanks to a collaboration with the PRHLT research center at the Universitat Politècnica de València (another partner in the READ project), there are some exciting new results to report.  It is now possible to search over 90,000 digital images of the central collections of Bentham’s manuscripts, which are held at Special Collections University College London and The British Library.

A Keyword Spotting search for the word ‘pleasure’

Appeal for volunteers!

A Google sheet has been prepared with some suggested search terms in 5 different spreadsheet tabs (Bentham’s neologisms, concepts, people, places and other).  The Bentham Project is appealing for people to record their searches online, using the suggested search terms and some new ones too.  Some of the results will be shared at the upcoming Transkribus User Conference in November.

Background

The PRHLT team have processed the Bentham papers with cutting-edge HTR and probabilistic word indexing technologies. This sophisticated form of searching is often called Keyword Spotting. It is more powerful than a conventional full-text search because it uses statistical models trained for text recognition to search through probability values assigned to character sequences (words), considering most possible readings of each word on a page.
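To make the contrast with a conventional full-text search concrete, here is a minimal sketch in Python.  It is not the PRHLT implementation: the index layout, the page names, the probabilities and the threshold are all invented for illustration.  It only shows the general idea of searching over several possible readings of each word rather than a single transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spot:
    page: str
    position: int
    word: str
    probability: float


# Hypothetical toy index: for every word position on a page we keep several
# candidate readings, each with the probability assigned by a recognition model.
INDEX = {
    "page_001": [
        {"pleasure": 0.55, "measure": 0.40, "pressure": 0.05},
        {"and": 0.90, "an": 0.10},
        {"pain": 0.70, "fair": 0.30},
    ],
    "page_002": [
        {"treasure": 0.60, "pleasure": 0.35, "pleasant": 0.05},
    ],
}


def keyword_spot(index, query, threshold=0.3):
    """Return every position where `query` is a plausible reading.

    The threshold is an assumed tuning knob: lowering it finds more
    occurrences but also lets in more false hits.
    """
    query = query.lower()
    spots = []
    for page, positions in index.items():
        for pos, candidates in enumerate(positions):
            probability = candidates.get(query, 0.0)
            if probability >= threshold:
                spots.append(Spot(page, pos, query, probability))
    # Most confident spots first.
    return sorted(spots, key=lambda s: s.probability, reverse=True)


def full_text_search(index, query):
    """Conventional search: only the single most likely reading is kept."""
    query = query.lower()
    hits = []
    for page, positions in index.items():
        for pos, candidates in enumerate(positions):
            best_reading = max(candidates, key=candidates.get)
            if best_reading == query:
                hits.append((page, pos))
    return hits


if __name__ == "__main__":
    print(keyword_spot(INDEX, "pleasure"))
    print(full_text_search(INDEX, "pleasure"))
```

In this toy data the word on page_002 would be transcribed as ‘treasure’, so a full-text search for ‘pleasure’ misses it, while the probabilistic search still reports it as a lower-confidence spot.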

The result is that this vast collection of Bentham’s papers can be efficiently searched, including those papers that have not yet been transcribed! The accuracy rates are impressive. The spots suggest around 84-94% accuracy (6-16% Character Error Rate) when compared with manual transcriptions of Bentham’s manuscripts. More precisely speaking, laboratory tests show that the word average search precision ranges from 79% to 94%. This means that, out of 100 average search results, as few as 6 (and at most around 21) may fail to actually be the words searched for. The accuracy of spotted words depends on the difficulty of Bentham’s handwriting – although it is possible to find useful results in Bentham’s scrawl! There could be as many as 25 million words waiting to be found.
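For readers who would like to see how figures of this kind are usually worked out, the short Python sketch below uses the standard definitions: Character Error Rate is the edit distance between an automated and a manual transcript divided by the length of the manual one, and precision is the share of search results that are correct.  The function names and example strings are our own illustration, not the evaluation code used in the laboratory tests.

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / length of the manual (reference) transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def expected_false_hits(precision, n_results=100):
    """How many of n_results are expected to be wrong at a given precision."""
    return round((1 - precision) * n_results)


if __name__ == "__main__":
    # Illustrative strings only: one missing letter in a nine-letter word.
    cer = character_error_rate("happiness", "hapiness")
    print(f"CER: {cer:.1%}, accuracy: {1 - cer:.1%}")
    print(expected_false_hits(0.94), expected_false_hits(0.79))
```

At the two ends of the reported precision range, 94% and 79%, this gives 6 and 21 expected false hits per 100 results respectively.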

A search for the word ‘happiness’ uncovers Bentham’s most famous phrase, written in his own hand.

Use cases

This fantastic site will be invaluable to anyone interested in Bentham’s philosophy.  It will help Bentham Project researchers to find previously unknown references in pages that have not yet been transcribed.  It will allow researchers to quickly investigate Bentham’s concepts and correspondents.  It should also help volunteer transcribers in the Transcribe Bentham initiative to find interesting material to transcribe.

This interface is a prototype beta version.  In the future, there are plans to increase the power of this research tool by connecting it to other digital resources, allowing users to quickly search the manuscripts at the UCL library repository, the Bentham papers database and the Transcribe Bentham Transcription Desk and linking these images to rich existing metadata.

Feedback on this new search functionality is welcomed at: transcribe.bentham@ucl.ac.uk

Similar Keyword Spotting technology (based on research by the CITlab team at the University of Rostock, another one of the READ project partners) is currently available to all users of the Transkribus platform.  Find out more about how to get started with Keyword Spotting.

+ New to Transkribus? Master the platform in just 10 steps

Maybe you’ve just discovered Transkribus and are feeling a bit overwhelmed?  Our updated video should help you get to grips with working with our Handwritten Text Recognition technology – in just 10 steps (and under 4 minutes!).

You can find more detailed information about working with our platform in our How to Guides.

While you’re on the Transkribus YouTube channel, check out our other videos too – including presentations from Transkribus users at the 2017 Transkribus User Conference.

+ Sharing data with Transkribus – Transcribimus and minutes of Vancouver City Council

We can all agree that it’s nice to share – and in the READ project, sharing data brings direct benefits for the Handwritten Text Recognition technology in our Transkribus platform.  According to principles of machine learning, the more images and transcripts that are submitted to us as training data, the stronger the Handwritten Text Recognition technology can become.  Images and transcripts are not publicly shared but they contribute to a general improvement in the technology behind the scenes.

Transcribimus is a community project based in Vancouver, Canada with a sizeable collection of transcripts which they will be using to train a Handwritten Text Recognition model.

Transcribimus all started when Sam Sullivan, former mayor of Vancouver, started to research the City Council minutes from the late nineteenth century with a view to exploring the achievements of Vancouver’s second mayor, David Oppenheimer.  Sam’s physical limitations prevented him from visiting the archives as often as he would have liked.  So he formed a partnership with Margaret Sutherland, a local retiree who had experience of genealogy and reading old handwriting.  Margaret began transcribing and digitising the minutes for Sam and was gradually joined by other volunteer transcribers including Christopher Stephenson, a graduate student in Library and Archival studies who provided lots of assistance.  Transcribimus eventually became an online platform where more than 20 volunteers have transcribed some 3,500 pages of handwritten minutes.

Image from the City Council Minutes. City of Vancouver Archives, VMA 23-5 page 214. Image credit: Margaret Sutherland.

These transcriptions are already freely available on the Transcribimus website.  The City of Vancouver Archives will ultimately display the images and transcripts on their website too.

The vast majority of the minutes are written in one hand, so these images and transcripts will likely feed into a strong Handwritten Text Recognition model that produces useful transcripts of the collection. Transcribimus volunteers could then check and correct any errors in these automated transcripts – and the transcription of the City Council minutes should hopefully be realised more quickly!

  • Do you have existing transcriptions that you have produced or collected as part of a research project?  Ideally 500 pages or more…
  • Send them to us and we can process them and train a model to recognise the writing in your documents!
  • To find out more about working with existing transcripts, consult our How to Guide or contact us.

+ Learn more about Transkribus in Zagreb

Join us for an event in the Croatian capital of Zagreb on Thursday 18th October.

The event is hosted by ICARUS Croatia and the Faculty of Philosophy at the University of Zagreb.

There will be a morning of lectures from READ project researchers based at the University of Innsbruck and University College London, which will explain the workings of Transkribus and the possibilities of Handwritten Text Recognition for different kinds of historical documents.

There are also limited spaces for a Transkribus workshop, where participants will be able to learn tips and tricks for working directly with the platform.

To enquire about attending the Zagreb event, please email Vlatka Lemić (vlemic@arhiv.hr).

+ Join us for Vienna Scanathon at the Austrian Academy of Sciences

Digitising historical documents? There’s an app for that!

Join us in Vienna for our next Scanathon event, hosted by the Austrian Academy of Sciences and the Austrian Centre for Digital Humanities.

Screenshot of DocScan app

Participants will have the opportunity to test out the DocScan mobile app and the ScanTent device, new tools which facilitate the digitisation of historical documents with a mobile phone.

The event will take place on the afternoon of Wednesday 7 November.  Attendance is free and open to all – but registration is required.

The Scanathon could be an ideal pre-conference activity for anyone attending the 2018 Transkribus User Conference, also in Vienna.

We ask that attendees bring their smartphone to the event so they can work with the tools. The DocScan app is currently only available on Android phones.

Participants are also invited to bring their own documents to digitise during the event.

DocScan and the ScanTent are being developed by one of the READ project partners, the Computer Vision Lab at the Technical University of Vienna.

+ Registration now open! Transkribus User Conference 2018

Registration for the 2018 Transkribus User Conference in Vienna is now open!

The conference will take place for the second time at the Technical University Vienna, right in the heart of Vienna, on 8-9 November 2018.

Registration for the Transkribus User Conference is currently at full capacity.  Please contact Tamara Terbul (Tamara.Terbul@uibk.ac.at) to join the waiting list and we will let you know if there are spaces available.

The conference registration fee is 50 EUR for regular participants and 25 EUR for students.

Please be aware that conference places are limited and granted on a first come, first served basis.

Building on the success of last year’s conference, this year’s programme will offer opportunities for Transkribus users to find out about the latest technological developments, experiment with new features, hear from users who have been working intensively with the platform and ask questions about how Automated Text Recognition could work on different kinds of documents.

Everyone is welcome – from Transkribus newbies to more experienced users.  And if you came along last year, there will be much new content to enjoy.

Some highlights of the programme include:

  • READ-COOP – hear about the future incarnation of the READ project which will promote collaborative working to preserve and enhance digital cultural heritage.
  • Transkribus in Practice – hear how users have been working with the platform to process documents of varying dates, languages and styles
  • Transkribus workshop – a chance for new and more experienced users to learn how to work with the platform and ask questions of our developers.
  • READ technology showcase – presentations from computer scientists on the technology behind Transkribus
  • Digitisation on demand in archives – a presentation of DocScan and the ScanTent, tools which help users to digitise historical documents using their mobile phone.

Conference participants who will arrive early in Vienna might also be interested in attending a pre-conference Scanathon at the Austrian Academy of Sciences on the afternoon of 7 November, where they will be able to try out these tools in an archival environment.

If you have any questions about the conference, please contact Tamara Terbul (Tamara.Terbul@uibk.ac.at).

We look forward to meeting you all in Vienna!