+ ScanTent makes it to Mali, West Africa!

Prototypes of the ScanTent, our device for digitising documents with a mobile phone, have been popping up all over Europe over the past year. And in December 2018, the first ScanTent made it to Africa!

Dr Vincent Hiribarren (King’s College London) took the tent to the town of Kita in western Mali to try it out before using the professional equipment (cameras, tripod, scanner) provided by the Endangered Archives Programme project called ‘Recovering the rich local history of Kita (Mali) through the salvaging of its archival heritage’.  This grant is held and directed by Dr Marie Rodet (SOAS, University of London).

The Endangered Archives Programme at The British Library awards annual grants to preserve archival material that is at risk of destruction or neglect. This funding means that endangered archival collections can be transferred to new homes, digitised and deposited at local institutions and in The British Library.

The ScanTent is a portable piece of equipment which holds a user’s phone in place above a historical document, providing a consistent source of light and leaving users with their hands free to turn pages or move documents around.  The advantages of the ScanTent become even greater when it is used in conjunction with our DocScan mobile app.  DocScan automatically detects the page area of a document and provides real-time feedback on image quality.  It also has an auto-shoot feature which will take a photo every time a page is turned.  Transkribus users can upload images to the platform directly from DocScan and these images can then be used for training an Automated Text Recognition model.

Dr Hiribarren installed DocScan on his phone in advance of his trip and was then able to set up the ScanTent quickly on location in Kita and start scanning! This experiment really shows that these tools have huge potential to open up access to unique collections of historical material all over the globe.

DocScan is available now, free of charge.  The ScanTent is still in development and units will be available for sale and hire later in 2019.

Testing out the ScanTent in Kita (Mali). Image credit: Vincent Hiribarren.

Find out more:

+ Latest Transkribus video tutorials

We’re celebrating the New Year with the release of several video tutorials designed to help new users navigate our Transkribus platform.

If you have a few minutes, you can get a nice overview of the stages needed to automatically recognise and search handwritten and printed historical texts in Transkribus.

How to use Transkribus in 10 steps

Segmentation

Training a Handwritten Text Recognition model

Using Handwritten Text Recognition models

Keyword Spotting

If you need extra help with Transkribus, please check out our detailed How to Guides.

+ Meet the READ project partners: Johanna Walcher

What’s your name?

Johanna Walcher

Where do you work?

The University of Innsbruck.

Tell us a bit about your background…

I did a bachelor’s degree in Transcultural Communication for Italian and Russian Language. Currently I am working towards my master’s degree in Media Studies and also study Philosophy. In my leisure time I love doing sports, preferably in the mountains. Hiking, running, mountain biking, skiing and yoga are great! But I can also sit still and spend a whole day reading books and newspapers. Whenever I can I travel the world and spend as much time as I can with my friends and family.

What is your role in the READ project?

In the READ project I am responsible for the Transkribus How to Guides, which should make life easier for users. I also run Transkribus webinars where I present the READ project and show interested people how to use the platform. As a member of the Dissemination Working Group, I am involved in spreading the word about the READ project.

What is at the top of your to-do list at the moment?

My current project is recording short screencast videos, which explain the different steps of the Transkribus workflow. The first ones will be up on our YouTube channel soon.

What do you like best about working on READ?

All the interesting projects around READ and the inspiring people I work with.

If you could do another job for just one day, what would it be?

Good question! There are too many jobs in the world I would love to try and it would take me too much time to choose. Working hours are expensive, so I would rather save READ some money and do my actual job, which I like a lot! 🙂

What can you see out of the window of your office?

Thanks Johanna!

+ Sharing data for Handwritten Text Recognition

At the READ project we are committed to sharing data and working collaboratively to improve the recognition of handwritten historical documents.

With this in mind, two of the project’s research groups have uploaded data relating to recent computer science competitions in handwriting and document layout analysis.

Data from the Pattern Recognition and Human Language Technology group at the Universitat Politècnica de València and the Computational Intelligence Lab (CITLab) at the University of Rostock:

Check out the ScriptNet-READ community on Zenodo for more of the data that READ project researchers are using for their experiments.

+ A new video tutorial helps with segmentation in Transkribus

Segmentation is a crucial stage of working with Handwritten Text Recognition in our Transkribus platform.  Digitised images of historical documents must be segmented into text regions, lines and baselines before they can be transcribed manually or automatically.  Segmentation can be performed automatically by the software to a high level of accuracy.  For more complex documents, users may then need to make some manual corrections – moving or deleting baselines for example.

If you’re new to segmentation in Transkribus, we have a new video tutorial which will help you get started.

You can find out more about working with Transkribus in our How to Guides.

+ Recognising printed Asian texts with Transkribus

Yes, you read that correctly – our Transkribus platform can indeed recognise printed Indian texts.

Conventional OCR software usually struggles to decipher the complexities of South Asian scripts.  Two projects have recently been working with nineteenth-century printed texts in Transkribus with the hope of getting better results.  Using images and transcripts from a collection, Transkribus users can train a model to recognise printed text of any type.

First of all, The British Library’s Two Centuries of Indian Print project is creating a digitised collection of works published in South Asia in the eighteenth and nineteenth centuries. The project team trained a text recognition model in Transkribus with 50 pages (containing 5,700 words) of digitised images and transcripts from Bengali books. The resulting model can produce transcripts of pages from the collection with an average Character Error Rate of 21%. Although this is a relatively high error rate, the team are planning to retrain the model by creating more pages of training data and focusing on improving the recognition of elements of the Bengali characters which were sometimes missed by the software.

The Naval Kishore Press was a nineteenth-century publishing house which brought works on various subjects to market in Hindi, Urdu, Arabic, Persian and Sanskrit. Part of its output is held by the library of the South Asia Institute (SAI) at Heidelberg University. The South Asia Institute library and Heidelberg University Library are collaborating on the Naval Kishore Press – digital project, working to produce digitised and machine-readable text for a selection of texts published by this press. The project team used 200 pages of images and transcripts to train a model in Transkribus to recognise Hindi and Sanskrit text. This model can produce transcripts of the collection with a Character Error Rate of around 5%. Fully searchable images and transcripts from the collection are now available to consult, download and annotate on Heidelberg University Library’s online catalogue.
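For readers who are wondering what a Character Error Rate actually measures, here is a minimal sketch. It assumes the usual definition: the number of character insertions, deletions and substitutions needed to turn the automatic transcript into the correct one, divided by the length of the correct transcript. The function names and the sample strings are made up for illustration, and this is not necessarily how Transkribus calculates the figure internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the correct (reference) text."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character and one missing character.
reference = "Naval Kishore Press"
hypothesis = "Naval Kishorc Pres"
print(f"{character_error_rate(reference, hypothesis):.1%}")  # prints 10.5%
```

Note that this sketch counts Unicode code points. In scripts such as Bengali or Devanagari a single visible character is often made up of several code points, so a real evaluation may count characters differently.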

Read more:

+ Transkribus How to Guides now available in German (and French)

Many new users are registering for a Transkribus account every day and our How to Guides are there to help everyone get to grips with Handwritten Text Recognition technology for historical documents.

All of our How to Guides are now available in English and German.

Our introductory guide, ‘How to use Transkribus in 10 steps’ is also available in French.

You can find all of our How to Guides on the Transkribus wiki.

Our thanks go to Régis Schlagdenhaffen for the French translation. 

+ Preserving our cultural heritage with a smartphone

The READ project is a big proponent of digitisation on demand using smartphones.

A typical mobile phone camera can capture relatively high-quality images of historical documents, which can then be used for preservation, research and even as training data for Automated Text Recognition using our Transkribus platform.

The Computer Vision Lab at the Technical University of Vienna (one of the READ project partners) have created the ScanTent device and the DocScan mobile app to make it easier for people to digitise documents in this way.

The ScanTents

We were happy to receive a positive enquiry about these tools, highlighting their potential to capture unique records that might otherwise be lost.

Stefan Krüger from Germany got in touch after he had digitised his grandfather’s dissertation using his mobile phone and used Transkribus to recognise the text with OCR. Herbert Rechner completed his dissertation in 1927, just before the rise of the Nazis, on the radical topic of ‘the sexual causes of offences’. Although Stefan was never able to meet his grandfather, he is interested in researching his history and is hopeful that Transkribus might be able to help recognise personal handwritten papers one day.

Stefan wrote…

‘After a long search I found the 90 years old dissertation of my grandfather in the German National Library in Leipzig and (in bowing to the performance of my ancestor) digitally reproduced the work. The Transkribus project helped me a lot with its outstanding recognition rate.

I photographed the booklet (about 100 pages) freehand with glass plate and smartphone (CamScanner) and re-set it in InDesign after text recognition.

With this work it became clear to me that we are experiencing a scientific break: everything that is not digitally available in scientific literature will disappear in the cognition-sinking. It is simply no longer taken into account in the scientific knowledge and research process. In the case of topics relating to electronics, space travel and other “more modern” developments, this may be easy to accept.

With all historically relevant things, however, this is painful.

That’s why I find your low-level effort with high-tech solutions very interesting. I would like to test your tent and the app. My thought is that actually (at least) everyone who has enjoyed an academic education should participate in the digital processing of his work and other literature. If you could make such a crowd thing out of it, then a big stock of literature could actually be worked on. So I am happy to participate in your developments in this sense.

With cordial greetings

Stefan Krüger’

Translated from German with www.DeepL.com/Translator

Thank you to Stefan for this feedback, which shows how Transkribus can help individuals to digitise and recognise exceptional historical documents.

A page from Herbert Rechner’s dissertation, digitised with a smartphone. Image credit: Stefan Krüger

If you would like to try digitising documents with a mobile phone, the DocScan app is available to download now free of charge (Android only). The ScanTent is still in development and units should be available for sale over the next few months.

Find out more:

+ Searching the Spanish Golden Age with Keyword Spotting

In sixteenth- and seventeenth-century Spain, there was a surge of thousands of theatrical productions. This period has become known as the Spanish Golden Age. Thanks to a new prototype web tool, anyone can now search through 40,000 images from a significant digitised collection of manuscripts relating to this period of Spanish history. This tool uses cutting-edge Keyword Spotting technology, allowing users to search images which have never before been transcribed.

This tool is a collaboration between the Pattern Recognition and Human Language Technology research centre at the Universitat Politècnica de València (one of the READ partners), the National Library of Spain and the PROLOPE research group (both READ MOU partners).

The PRHLT research centre has treated these manuscripts with advanced text recognition and probabilistic word indexing technology.  This sophisticated form of searching is often called Keyword Spotting. It is more powerful than a conventional full-text search because it uses statistical models trained for text recognition to search through probability values assigned to character sequences (words), considering most possible readings of each word on a page.
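To give a rough idea of how searching by probability differs from an ordinary full-text search, here is a small toy sketch. It is not the PRHLT implementation: the class name `ProbabilisticIndex`, the page identifiers and all of the probability values are invented for illustration. The idea shown is simply that each page stores several possible readings of a word, each with a probability, and a search returns every page where the query is a sufficiently likely reading.

```python
from collections import defaultdict


class ProbabilisticIndex:
    """Toy index: each word maps to the pages where it may appear, with a probability."""

    def __init__(self):
        # word -> list of (page_id, probability that the word appears on that page)
        self._entries = defaultdict(list)

    def add(self, page_id: str, word: str, probability: float) -> None:
        self._entries[word.lower()].append((page_id, probability))

    def search(self, query: str, min_confidence: float = 0.5):
        """Return (page_id, probability) hits at or above the threshold, best first."""
        hits = [
            (page_id, probability)
            for page_id, probability in self._entries.get(query.lower(), [])
            if probability >= min_confidence
        ]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical data: the recogniser is unsure how to read some words,
# so alternative readings are stored side by side.
index = ProbabilisticIndex()
index.add("page-001", "Madrid", 0.82)
index.add("page-001", "Madre", 0.11)
index.add("page-002", "Madrid", 0.35)
index.add("page-003", "Madrid", 0.64)

print(index.search("Madrid"))                      # confident hits only
print(index.search("Madrid", min_confidence=0.3))  # lower threshold, more hits
```

Lowering the threshold finds more pages at the cost of more false alarms. This trade-off is what lets users search images that have never been transcribed.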

Keyword Spotting for the word ‘Madrid’.

The 40,000 pages currently available for searching represent about half of the collection. More documents from the collection will be processed in this way if further funding can be found.

The release of this Keyword Spotting tool coincides with a new exhibition at the National Library of Spain all about the Spanish Golden Age, which runs until March 2019. The exhibition will combine original manuscripts with digital displays. The PRHLT team have created an online quiz (in Spanish) for the exhibition which asks users to work with the Keyword Spotting tool to find out which words appear frequently or in combination.

If you are interested in Keyword Spotting, check out other tools constructed by the PRHLT team relating to:

+ Recognising eighteenth-century legal records at Middle Temple

The Honourable Society of the Middle Temple is one of four Inns of Court: prestigious professional associations for barristers working in England.

The archive and library of Middle Temple holds records of the Inn from the early sixteenth century onwards.  The most significant series of these documents are being digitised and made available online.

Middle Temple began exploring Transkribus tentatively in 2016.  The Inn first signed a Memorandum of Understanding with the READ project and then started to explore the possibilities of training Handwritten Text Recognition (HTR) models to recognise documents in their collections.

After discussions about the best documents to start with, they settled on digitised manuscript records of Middle Temple’s governing body or Parliament.  These records dated from 1762-1775 and were written in several very similar hands.

A selection of 101 bifolio pages were uploaded to Transkribus and transcribed by the Transkribus team.  David Woolley QC, a bencher at Middle Temple, then took care of proof-reading and correcting each page to ensure that the transcriptions were as accurate as possible.

These images and transcripts (around 80,000 transcribed words) became training data for generating an HTR model. Data from the pre-existing ‘English Writing M1’ model was also included as part of the training process as a ‘base model’. The ‘English Writing M1’ model is trained to recognise the writing of the English philosopher Jeremy Bentham (1748 – 1832) and his secretaries – it is freely available to all Transkribus users for their experiments.

The resulting HTR model can produce transcripts of images from the test set with a very low Character Error Rate of 3.31%.  This is an amazing result!  Automated transcripts with such a low error rate immediately become a useful research resource.

Automated transcription of a page from the Middle Temple records.

The team at Middle Temple also created a dictionary based on one of their ‘Bench Books’ which lists recurring names, abbreviations and unusual terms. This dictionary should hopefully improve the quality of the recognition.

Middle Temple is now exploring ways to build on this first great achievement, by making these transcripts available to researchers in a searchable database.

Thanks to Lesley Whitelaw, Barnaby Bryan and David Woolley at Middle Temple and Stuart Dunn at King’s College London for this collaboration.