The READ project will be there to tell an audience of digital humanities specialists how they can use Transkribus to apply Handwritten Text Recognition technology to historical documents. Dr Günter Mühlberger, coordinator of the READ project, will be giving a demo of Transkribus on 6 December.
We thought it was about time that we got to know the people working on the READ project a little better! We are armed with a list of questions that we’ll be asking the computer scientists, archivists, historians and researchers working on READ over the coming months to find out more about their research. Read on for our first interview…
I am a senior scientist at the Computer Vision Lab. My research interests relate to Cultural Heritage applications of Document Image Analysis. I finished my PhD in 2014, where I worked on Document Image Analysis Preprocessing of Low-Quality and Sparsely Inscribed Documents. I have recently worked on the multispectral acquisition of ancient documents as part of a project to process and analyse the Sinaitic Glagolitic Sacramentary Fragments, two medieval Slavonic manuscripts which were discovered in a monastery in Egypt in 1975. In my spare time I like to go skiing, rowing and watch TV series.
What is your role in the READ project?
Layout Analysis of documents, with a special focus on Form Classification.
What is top of your to-do list at the moment?
I am working on Form Classification and preparing for ScriptNet, the READ programme of competitions in Handwritten Text Recognition and Document Image Analysis.
What do you like best about working on READ?
The challenge of working with a large number of different documents and the interdisciplinarity of the project.
If you could do another job for just one day, what would it be?
Helicopter pilot 🙂
What can you see out of the window of your office?
Melina has been using Transkribus to create training data for a Handwritten Text Recognition (HTR) model that can provide automatic transcripts of the correspondence of the Brothers Grimm. This work was undertaken as part of a pilot project called Tracing Authorship in Noise (TrAIN), which is analysing how page noise affects the accuracy of HTR and OCR. This is part of the wider eTRAP project on the reuse of electronic texts. We are happy to have this feedback and we look forward to seeing the results of the HTR!
One of the oldest university libraries in Germany is working with some of the newest technology! Greifswald University Library and Archives have been cooperating with the Transkribus team since September 2015 and now have some exciting results to share.
Around 800 pages of documents and transcripts from the University Archives have been uploaded to Transkribus. These documents were a collection of minutes from meetings of the Konzil, the central administrative body of Greifswald University. These pages were written by three professional writers in Kurrent script, between the years 1775 and 1840. The Transkribus team used these documents to generate a Handwritten Text Recognition (HTR) model capable of automatically reading documents in the Konzil collection.
Greifswald University Library has been able to integrate the HTR technology from Transkribus directly into its digital library system (Digitale Bibliothek Mecklenburg-Vorpommern). This innovation was realised using the Open Source Goobi software provided by Intranda. Library users are now able to conduct keyword searches in a sample of handwritten material from the Konzil collection. You can see the full-text search in action in this example query, where the system has searched for the word ‘Greifswald’. Why not try searching for yourself?
This is a first for the READ project and an important milestone in our mission to disseminate HTR technology. We are grateful to Greifswald University Library and Archives for showing that it is possible to provide HTR technology directly to users in order to facilitate research. Over the next few weeks, Greifswald University Library will be importing 100 more volumes from the Konzil collection into Transkribus to allow for more comprehensive searching of the collection.
A reminder that a full-text search function is also now available in the latest version of the Transkribus platform. Once you have trained an HTR model for your manuscript collection, you will be able to conduct a full-text search of your documents.
The event was designed to showcase the latest digital research in the fields of humanities and natural sciences. There were presentations from some of the READ partners and we also heard from other researchers around the UK, who discussed the opportunities and challenges of working with digital tools.
The conference was held at the Linnean Society, which is the oldest surviving natural history society in the world. It was founded in 1788 by the botanist James Edward Smith and is named after the Swedish naturalist Carl Linnaeus. The Society has held a collection of Linnaeus’ writings since 1829. Charles Darwin was a fellow of the Society, and his theory of evolution by natural selection was first presented publicly at a Linnean Society meeting in 1858, in papers read jointly with those of Alfred Russel Wallace. What an impressive place to open up our Digital Toolbox!
Networking in the Linnean Society Library [Image by Louise Seaward]
We were lucky enough to hear a keynote lecture from Professor Melissa Terras (UCL Centre for Digital Humanities) on the Transcribe Bentham crowdsourcing initiative. Professor Terras described how the phenomenal efforts of volunteer transcribers are contributing to the scholarly edition of the Collected Works of the British philosopher Jeremy Bentham. She also looked to the future, explaining that volunteer submissions are now being used as training data for Handwritten Text Recognition engines! For the rest of the morning, we heard from two more of the READ partners. Dr Roger Labahn (University of Rostock) and Dr Günter Mühlberger (University of Innsbruck and coordinator of the READ project) explained the theory and practice of using Transkribus to conduct searches of handwritten historical documents.
The afternoon was dedicated to the latest digital projects in the humanities and natural sciences. We heard about techniques of text mining, digitisation, optical character recognition, metadata organisation and crowdsourcing. Videos of the presentations will be available soon but in the meantime, you can consult the full conference programme to find out more.
Getting ready for the next presentations in the Linnean Society Meeting Room [Image by Louise Seaward]
Over 70 people attended the event, from archivists, curators and librarians, to researchers, project managers and computer experts. Our attendees helped to get the conference hashtag ‘#digtoolbox’ trending on Twitter for the London area and lots of connections were made, both in person and online. The READ project is committed to open access research and open source tools – so we will continue sharing the contents of our Digital Toolbox!
On 27 September 2016, the Friedrich-Schiller-Universität (FSU) Jena hosted a workshop on ‘Automatic Text and Structure Recognition as Elementary Technologies for Digital Humanities’. Thirty-two attendees from FSU and from nearby archives and libraries accepted the invitation from Andreas Christoph and Barbara Aehnlich and met for an intense day in Jena, Germany – filled with plenary lectures and a hands-on Transkribus workshop.
Old meets new – picturesque scene set for the Transkribus day in Jena (Image by Eva Lang)
The program included talks by Eva Lang (Passau Diocesan Archives) on ‘From church registry books to databases – Digitization Strategies in libraries and archives’, Günter Mühlberger (University of Innsbruck and READ project coordinator) on ‘Transkribus. A virtual research platform for automatic text recognition in printed and hand-written documents’, Florian Kleber (Computer Vision Lab, Vienna University of Technology) on ‘No text and hand-writing recognition without layout analysis’ and Raphael Unterweger (Innsbruck University Innovations) on ‘Structured data and document recognition with Rule-Appler and Structify’.
After a short lunch break, the group reconvened for a hands-on workshop, where Günter Mühlberger, assisted by Eva Lang, demonstrated the current state of the Transkribus software, which now also features a table editor and a user-friendly tagging system. By the end of the long day, participants had been able to upload their own documents, transcribe their first test project and gain a better understanding of the technologies behind handwritten text recognition.
Günter Mühlberger demonstrates the power of Transkribus (Image by Eva Lang)
First, a Transkribus workshop will take place at the University of Zürich in October 2016. Due to high levels of interest, the workshop will be held twice – on 20 and 21 October. The Transkribus software, as well as other tools being developed by the READ project, will be presented. Participants will have the chance to work with their own documents and see what Transkribus is capable of regarding transcriptions and editions.
Second, on 10 November 2016, eight speakers will introduce tools used for digital scholarly editing to a public audience at the University of Zürich. A keynote speech will be delivered by Tobias Grüning from the University of Rostock who will explain how recurrent neural networks are being used in READ for Handwritten Text Recognition (HTR). There will also be talks about other digital editing tools and projects including ChartEX, histhub, corpus corporum, e-Manuscripta and TUSTEP.
Both events are open to all. More details can be found in the full programme. For more information and to register, please contact Tobias Hodel.
Passau Diocesan Archives took on the task of hosting the latest READ project meeting between 20 and 22 September 2016. Over 30 individuals from the 14 READ project partners met together in the pretty town of Passau in Southern Germany to discuss the current and future progress of our research into the Handwritten Text Recognition (HTR) of historical documents.
The first part of our meeting took the form of a public symposium before an audience of German archivists and researchers. To see what went on, check out this short video from Passau Diocesan Archives where staff from the archive talk about the conference and their participation in READ. The video is in German but non-speakers can scroll down to find out more about the symposium and the READ project meeting.
As the keynote speaker, Gerhard Fürmetz, director of the Bavarian State Archives, used his presentation to show how digitisation has changed the way archives work.
We then heard from Dr Herbert Wurster, director of Passau Diocesan Archives. His archive has a large digitised collection of handwritten sacramental registers and is working with READ to facilitate the searching of these records for information relating to personal names and births, marriages and deaths.
Next up, the coordinator of the READ project, Dr Günter Mühlberger from the Digitisation and Digital Preservation Group (DEA) at the University of Innsbruck, presented on READ and the Transkribus transcription platform. His talk described how archives can access Transkribus and outlined the way in which Handwritten Text Recognition engines produce automatic transcriptions of handwritten material.
Finally, Dr Florian Kleber from the Computer Vision Lab at Vienna University of Technology demonstrated how the READ project is working with large and varied datasets of transcribed historical material. Dr Kleber also explained that research competitions play a vital role in enabling computer scientists to evaluate and improve the effectiveness of their tools.
Once we had said goodbye to everyone at the symposium, the READ project meeting could begin. During the first sessions, each project partner was given the opportunity to share news about their major achievements, possible setbacks and next steps. We met for dinner to bring our first day to a close and were welcomed to Passau with a short speech from the Very Reverend Klaus Metzl, Vicar General.
There are always a lot of laptops in the room at a READ meeting! [Image by Elena Mühlbauer]
The next day we formed working groups to tackle questions surrounding technical issues, research competitions and dissemination of the project outputs. Groups deliberated the best means of producing training data for HTR engines and ways to improve the accuracy of keyword searches of handwritten material. The development of new tools was also discussed, including a Table Recognition tool which will make it easier for Transkribus users to transcribe text in tables.
In the evening we were treated to a walking tour of Passau and a look behind the scenes at the new building of the Diocesan Archives. It has been built on stilts to shield precious documents from any flooding from the town’s three rivers and the floor has been painted a liturgical shade of purple!
Passau Diocesan Archives [Image by Laurent Bolli]
Sunset in Passau [Image by Laurent Bolli]
We concluded proceedings on our last day with some SWOT analysis – what are the strengths and weaknesses of the READ project? What might be its opportunities and threats? After some fruitful discussion, there was just enough time for a group picture before we parted ways!
The READ project [Image by Elena Mühlbauer]
Thank you Passau! If you want to find out more about our meeting, take a look back at our Twitter feed. We look forward to continuing the discussions at our next project meeting in Brussels!
This year’s DocEng symposium was organised by one of the READ partners, the Computer Vision Lab at Vienna University of Technology. Between 13 and 16 September, academic and industrial researchers were welcomed to Vienna, where they had the opportunity to hear about the latest research on document engineering and participate in workshops on topics such as Table Modelling and Future Publishing Formats.
One of the keynote lectures was delivered by Dr Günter Mühlberger, the project coordinator of READ. Dr Mühlberger’s talk was entitled ‘Research Infrastructures, or how Document Engineering, Cultural Heritage, and Digital Humanities can go together’. Dr Mühlberger wrote his PhD thesis on Johann Wolfgang von Goethe and has always been interested in integrating digital technologies into the humanities. His paper described the Transkribus research infrastructure which is being developed by the READ project. The talk showed an audience of specialised computer scientists that the technologies of Handwritten Text Recognition, Automatic Writer Identification and Keyword Spotting are hugely relevant to the humanities sector because they can improve access to historical records.