+ Medievalists! Share data with our working group to improve Handwritten Text Recognition

With thousands of Transkribus users working all over the world, there is huge potential for collaborative work on the automated recognition of historical documents.

Dr Tobias Hodel (State Archives of Zurich, University of Zurich) has set up the ‘Gothic Hands’ working group with this in mind, hoping to improve recognition of medieval Gothic script. The ‘Comb_Gothic_Bookwriting’ model has been trained on different sets of medieval scripts and is already available to all Transkribus users. In the best cases, it can produce automated transcripts with a Character Error Rate of less than 10%.

We are looking for more users to join this working group and share images and transcripts of Gothic scripts written between the 11th and 15th centuries. The current model has been trained primarily on German language material, so we are especially keen to receive documents written in Latin.

The latest contributor to the working group is Digital Statius: The Achilleid, a project which is producing a digital edition of the Achilleid, an unfinished epic poem in Latin in which the poet Statius (later 1st c. AD) narrates the childhood of Achilles and the hero’s stay on the island of Scyros. This text was part of the school curriculum in the Middle Ages, before losing its status as a classic. The project, funded by the Swiss National Science Foundation (SNSF) and based at the University of Geneva, aims to produce a new critical edition of the Achilleid, fully and exclusively digital, which takes into account the complete manuscript tradition of the poem (224 manuscripts, c. 8000 images). The open access digital critical edition will include a new text, a full interactive apparatus criticus, comparative visualization of numerous readings, comments, translations, links to other tools and/or platforms, and images of as many manuscripts as possible.

If you work with Gothic script, you can join the team behind the Achilleid edition and many others by becoming part of the ‘Gothic Hands’ working group.

To participate in the group, you can:

  • share existing training data that you have already prepared in Transkribus
  • prepare new images and transcripts in Transkribus in the ‘Gothic Hands’ collection
  • send over files containing images and transcripts which can be matched automatically and converted into training data

Please contact Tobias Hodel (tobias.hodel@hist.uzh.ch) with any questions about the group.

Working together gives us a great chance to transcribe and search medieval documents more efficiently!

+ READ on the move to READ-COOP

The READ consortium together with several other institutions is currently preparing the foundation of a legal entity (working title: READ-COOP) which will serve as the basis for sustaining and further developing the Transkribus platform and related services.

The governance model will be based on the EU directive for European Cooperative Societies (SCE). Though the SCE will be set up according to EU law, it will also be open to members outside of the European Community.

An SCE

  • is a legal entity that allows its members to carry out common activities, while preserving their independence
  • has the principal objective of satisfying its members’ needs and not the return of capital investment
  • allows members to benefit proportionally to their profit and not to their capital contribution.


+ Podcast with READ project coordinator

Günter Mühlberger, coordinator of READ and head of the Digital Humanities Research Center at the University of Innsbruck, has recently been interviewed on a new podcast (in German).

The interview was recorded by the NewsEye project which, like READ, is funded by the European Union’s Horizon 2020 scheme. NewsEye aims to use digital tools to provide enhanced access to digitised historical newspapers, and the project will build upon READ’s existing achievements relating to the automated recognition of printed text.

+ Learn how to add structural tags to documents in Transkribus

We have another new How to Guide for users of our Transkribus platform.  This time we’re showing you how to enrich documents with structural tags like ‘paragraph’, ‘heading’, or ‘footer’.

In the near future, it will be possible to train models to automatically recognise the structure of historical documents.  Adding structural tags creates training data for this process.  If you work with this feature, there is no need to tag every element of your documents – just focus on marking up the sections that are of interest to you.

If you have any questions about structural tags, the Transkribus team are here to help (email@transkribus.eu).

+ ScanTent makes it to Mali, West Africa!

Prototypes of the ScanTent, our device for digitising documents with a mobile phone, have been popping up all over Europe over the past year. And in December 2018, the first ScanTent made it to Africa!

Dr Vincent Hiribarren (King’s College London) took the tent to the town of Kita in western Mali to try it out before using the professional equipment (cameras, tripod, scanner) provided by the Endangered Archives Programme project called ‘Recovering the rich local history of Kita (Mali) through the salvaging of its archival heritage’.  This grant is held and directed by Dr Marie Rodet (SOAS, University of London).

The Endangered Archives Programme at The British Library awards annual grants to preserve archival material that is at risk of destruction or neglect. This funding means that endangered archival collections can be transferred to new homes, digitised and deposited at local institutions and in The British Library.

The ScanTent is a portable piece of equipment which holds a user’s phone in place above a historical document, providing a consistent source of light and leaving users with their hands free to turn pages or move documents around.  The advantages of the ScanTent become even greater when it is used in conjunction with our DocScan mobile app.  DocScan automatically detects the page area of a document and provides real-time feedback on image quality.  It also has an auto-shoot feature which will take a photo every time a page is turned.  Transkribus users can upload images to the platform directly from DocScan and these images can then be used for training an Automated Text Recognition model.

Dr Hiribarren installed DocScan on his phone in advance of his trip and was then able to set up the ScanTent quickly on location in Kita and start scanning! This experiment really shows that these tools have huge potential to open up access to unique collections of historical material all over the globe.

DocScan is available now, free of charge.  The ScanTent is still in development and units will be available for sale and hire later in 2019.

Testing out the ScanTent in Kita (Mali). Image credit: Vincent Hiribarren.


+ Latest Transkribus video tutorials

We’re celebrating the New Year with the release of several video tutorials designed to help new users navigate our Transkribus platform.

If you have a few minutes, you can get a nice overview of the stages needed to automatically recognise and search handwritten and printed historical texts in Transkribus.

How to use Transkribus in 10 steps

Segmentation

Training a Handwritten Text Recognition model

Using Handwritten Text Recognition models

Keyword Spotting

If you need extra help with Transkribus, please check out our detailed How to Guides.

+ Meet the READ project partners: Johanna Walcher

What’s your name?

Johanna Walcher

Where do you work?

The University of Innsbruck.

Tell us a bit about your background…

I did a bachelor’s degree in Transcultural Communication for Italian and Russian Language. Currently I am working towards my master’s degree in Media Studies and also study Philosophy. In my leisure time I love doing sports, preferably in the mountains. Hiking, running, mountain biking, skiing and yoga are great! But I can also sit still and spend a whole day reading books and newspapers. Whenever I can I travel the world and spend as much time as I can with my friends and family.

What is your role in the READ project?

In the READ project I am responsible for the Transkribus How to Guides, which should make life easier for users. I also run Transkribus webinars where I present the READ project and show interested people how to use the platform. As a member of the Dissemination Working Group, I am involved in spreading the word about the READ project.

What is at the top of your to-do list at the moment?

My current project is recording short screencast videos, which explain the different steps of the Transkribus workflow. The first ones will be up on our YouTube channel soon.

What do you like best about working on READ?

All the interesting projects around READ and the inspiring people I work with.

If you could do another job for just one day, what would it be?

Good question! There are too many jobs in the world I would love to try and it would take me too much time to choose. Working hours are expensive, so I would rather save READ some money and do my actual job, which I like a lot! 🙂

What can you see out of the window of your office?

[Photo: the view from the office window]

Thanks Johanna!

+ Sharing data for Handwritten Text Recognition

At the READ project we are committed to sharing data and working collaboratively to improve the recognition of handwritten historical documents.

With this in mind, two of the project’s research groups have uploaded data relating to recent computer science competitions in handwriting and document layout analysis.

Data from the Pattern Recognition and Human Language Technology group at the Universitat Politècnica de València and the Computational Intelligence Lab (CITLab) at the University of Rostock:

Check out the ScriptNet-READ community on Zenodo for more of the data that READ project researchers are using for their experiments.

+ A new video tutorial helps with segmentation in Transkribus

Segmentation is a crucial stage of working with Handwritten Text Recognition in our Transkribus platform.  Digitised images of historical documents must be segmented into text regions, lines and baselines before they can be transcribed manually or automatically.  Segmentation can be performed automatically by the software to a high level of accuracy.  For more complex documents, users may then need to make some manual corrections – moving or deleting baselines for example.

If you’re new to segmentation in Transkribus, we have a new video tutorial which will help you get started.

You can find out more about working with Transkribus in our How to Guides.

+ Recognising printed Asian texts with Transkribus

Yes, you read that correctly – our Transkribus platform can indeed recognise printed Indian texts.

Conventional OCR software usually struggles to decipher the complexities of South Asian scripts.  Two projects have recently been working with nineteenth-century printed texts in Transkribus with the hope of getting better results.  Using images and transcripts from a collection, Transkribus users can train a model to recognise printed text of any type.

First of all, The British Library’s Two Centuries of Indian Print project is creating a digitised collection of works published in South Asia in the eighteenth and nineteenth centuries. The project team trained a text recognition model in Transkribus with 50 pages (containing 5,700 words) of digitised images and transcripts from Bengali books. The resulting model can produce transcripts of pages from the collection with an average Character Error Rate of 21%. Although this is a relatively high error rate, the team are planning to retrain the model by creating more pages of training data and focusing on improving the recognition of elements of the Bengali characters which were sometimes missed by the software.

The Naval Kishore Press was a nineteenth-century publishing house which brought works on various subjects to market in Hindi, Urdu, Arabic, Persian and Sanskrit. Part of its output is held by the library of the South Asia Institute (SAI) at Heidelberg University. The South Asia Institute library and Heidelberg University Library are collaborating on the Naval Kishore Press – digital project, working to produce digitised and machine-readable text for a selection of texts published by this press. The project team used 200 pages of images and transcripts to train a model in Transkribus to recognise Hindi and Sanskrit text. This model can produce transcripts of the collection with a Character Error Rate of around 5%. Fully searchable images and transcripts from the collection are now available to consult, download and annotate on Heidelberg University library’s online catalogue.
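
For readers wondering what the error rates quoted above measure: a Character Error Rate (CER) is commonly defined as the number of single-character insertions, deletions and substitutions needed to turn the automated transcript into the correct one, divided by the length of the correct transcript. The short Python sketch below illustrates that common definition only. It is not Transkribus code, the function names are our own, and the platform may normalise or count characters differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the prefix of a seen so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the correct transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Illustrative strings only, not taken from either project's data.
reference = "historical documents"   # correct transcript, 20 characters
hypothesis = "histcrical docunents"  # automated transcript with 2 wrong characters
print(f"{character_error_rate(reference, hypothesis):.0%}")  # prints 10%
```

On this definition, a CER of 21% corresponds to roughly one character in five needing correction, and a CER of around 5% to roughly one in twenty.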
