+ Machine Reading the Archive in Cambridge

It was a sunny Tuesday morning when the READ project made it to the Centre for Research in the Arts, Social Sciences and Humanities (CRASSH) at the University of Cambridge for our latest workshop.  Louise Seaward (Bentham Project, University College London) and Sebastian Colutto (University of Innsbruck) delivered a presentation and workshop on automated text recognition for handwritten and printed text.

The Mathematical Bridge at Queens’ College, University of Cambridge [Image by Louise Seaward]

Whilst Sebastian gave a technical overview of how our Transkribus platform can be used for automated text recognition, Louise explained the potential benefits of the automatic transcription and searching of documents from the perspective of a historian.  The team then delivered a hands-on workshop where staff and students from the university were able to get to grips with Transkribus.  Participants learnt how computers can be trained to recognise handwriting and how accurate this recognition can be.  There was also much interest in new methods for the automated recognition of printed text, which can produce even better results than Optical Character Recognition (OCR)!

Sebastian Colutto delivers a Transkribus workshop at the University of Cambridge [Image by Louise Seaward]

The event was part of ‘Machine Reading the Archive’, a training and development programme for digital methods organised by Cambridge Digital Humanities Network, Cambridge Big Data and the Cambridge Digital History Programme.  The READ project looks forward to contributing to the programme again in the future!

+ Meet the READ project partners – Sofia Ares Oliveira

What’s your name?

Sofia Ares Oliveira.

Where do you work?

Digital Humanities Laboratory at Ecole Polytechnique Fédérale de Lausanne (EPFL).

Tell us a bit about your background…

I studied Electrical Engineering at EPFL and specialised in Information Technology, where I explored several signal processing topics, from acoustics to biomedical signals to images. I started working at the DHLab on images of cadastral maps and since then I have been working on the several thousand historical documents from Venice that have been digitised and are waiting to be processed.

What is your role in the READ project?

At EPFL we are responsible for the Large Scale Demonstrator, the Venice Time Machine, which aims at building a multidimensional model of Venice and its evolution covering a period of more than 1000 years. I am mainly in charge of integrating and implementing computer vision and image processing tools for handwritten text documents and cadaster maps.

What is top of your to-do list at the moment?

Finalising the release of the Cython bindings for the line segmentation tools in Transkribus, so that other READ partners can use them from Python.

What do you like best about working on READ?

Working with people coming from different fields and countries, and the ‘product-oriented’ vision of the project, with direct feedback from users.

If you could do another job for just one day, what would it be?

Astronaut, a nice combination of scientist, engineer and explorer!

What can you see out of the window of your office? 

Thanks Sofia! 

+ A new model for Humanities research – collaboration with HumaReC

HumaReC is a new research platform developed by the Swiss Institute of Bioinformatics.  It is part of a project to investigate the digital production and publication of Humanities data using an edition of a New Testament manuscript as a test case.

HumaReC is aiming to establish a new model of Humanities research which allows for the continuous publication and analysis of data through transcriptions, blogs, a discussion forum and research publications. If you want to find out more, Claire Clivaz from HumaReC has recently written a blog post where she talks more about this idea of curating data in the Digital Humanities.

HumaReC are working with the READ project to train Handwritten Text Recognition technology to recognise the Arabic, Latin and Greek writing from the New Testament manuscript.  It is a particular challenge to process these three languages in one document collection!  They are also experimenting with ways to link files in our Transkribus platform to those which appear in the image viewer on their website.  You can take a look at their software on their GitHub page.

+ Transkribus in 10 steps?! Find out how in our new video…

Are you interested in using Transkribus for Handwritten Text Recognition?  If you have a couple of minutes, you can get an overview of the process in our new video.  How to use Transkribus – in 10 steps was put together by Elena Mühlbauer from Passau Diocesan Archives, who are one of the READ project partners.

You can find a more detailed version of this How to Guide, along with other instructional papers, on the Transkribus wiki.  

Or if you’re in the mood for more videos, the Transkribus YouTube channel has a growing playlist of video presentations relating to the READ project.

You could try ‘Handwritten Text Recognition: Key Concepts’ by Roger Labahn (University of Rostock)

or ‘Automated Writer Identification and its Use Cases for Archival Documents’ by Stefan Fiel (Technical University Vienna).  

Happy watching!  

+ Welcoming The British Library to the READ project network!

We are very happy to welcome The British Library into the READ project network as a Memorandum of Understanding partner.  The British Library collection is vast, containing more than 150 million items including a copy of Magna Carta and papers written by The Beatles.

Cooperation between READ project partners and The British Library has been developing across the past few years and the library is now working with Transkribus to train a Handwritten Text Recognition model to recognise colonial records from the nineteenth century.  We look forward to seeing the results soon!

The British Library joins the national libraries of Spain, France and Norway and many other archives, libraries and institutions who have signed a Memorandum of Understanding with the READ project.  If you are interested in becoming part of our network, send us an email to find out more!

+ Meet the READ project partners – Eva Lang

What’s your name?

Eva Lang.

Where do you work?

Passau Diocesan Archives.

Tell us a bit about your background…

I hold a Computer Science Diploma (equiv. MSc) from the University of Passau, where I worked in industrial research mainly for the automotive and textile industries. I joined the team at Passau Diocesan Archives for the READ project, focusing on technical processes to help archival users do their research. Besides my work at the archives, I also work as a city guide, mainly for English-speaking visitors to our beautiful city of Passau. In my leisure time, I enjoy sports, arts and playing the piano.

What is your role in the READ project?

My role within the READ project is to apply the Handwritten Text Recognition technology to our very special historical documents.  Large parts of our images show tables and forms written in many different hands, so this is a distinct challenge within the project.

What is top of your to-do list at the moment?

Digesting the results of the recent READ project review meeting with the European Commission in Brussels and improving the way in which users can use our Transkribus tool to process documents which are structured in tables and forms.

What do you like best about working on READ?

The interdisciplinary character of the project, bridging the historical, archival and computer worlds and working with partners from all over Europe.

If you could do another job for just one day, what would it be?

Work as a confectioner designing and decorating beautiful-looking cakes and pastries.

What can you see out of the window of your office? 

Here we can see the four Saints (left-to-right): Severin, Valentin, Maximilian and Stephan.  The Diocesan seminaries are named after Maximilian, Valentin (historic) and Stephan (still in existence today). Severin lived and preached in Passau and the oldest church in the town (going back to around the year 470) is named after him.

Thanks Eva! 

+ Gothenburg calling! Report from Digital Humanities in the Nordic Countries conference

The READ project visited Sweden last week for the second Digital Humanities in the Nordic Countries conference.  The conference was hosted by the University of Gothenburg and organised by the Digital Humanities in the Nordic Countries association, which was founded in 2015.

University of Gothenburg [Image by Louise Seaward]

The conference kicked off with a morning session of workshops where attendees could get to grips with new software, tools and techniques.  Maria Kallio from the National Archives of Finland and Louise Seaward from the Bentham Project at University College London delivered a workshop during this session to demonstrate the Transkribus platform for Handwritten Text Recognition.

Maria Kallio from the National Archives of Finland teaches Transkribus [Image by Louise Seaward]

Around 15 participants were introduced to the READ project’s aim of transforming access to historical documents.  Working on their laptops, they learnt how to use Transkribus to produce training data for Handwritten Text Recognition.  Representatives from all of the Nordic countries took part and there was much interest in using Handwritten Text Recognition for all sorts of languages, from people working in archives, libraries and universities.

Once the workshop was over, we were able to enjoy the rest of the conference!  It was a packed few days with around 200 participants and nearly 60 presentations, plus keynotes, workshops and a poster slam.  We found particular inspiration in the panel on crowdsourcing and collaboration.  We heard how the Arthur Prior project at the University of Copenhagen has been recruiting academics to transcribe papers written by Arthur Prior, the philosopher and founder of temporal logic.  We also saw how the Latvian Folklore Archives experienced huge success with a well-publicised crowdsourcing campaign targeted primarily towards school children, which resulted in the transcription of nearly 15,000 pages in only 71 days!  The READ project will be following these projects with interest as we continue to develop a new open source crowdsourcing platform, where users can transcribe documents with the assistance of Handwritten Text Recognition technology.

Feskekôrka Fish Market in Gothenburg [Image by Louise Seaward]

You can catch up on some of the conference goings-on over on Twitter.  We are already looking forward to the 2018 conference in Helsinki, which the National Archives of Finland will be helping to organise!

+ READ presents at Digital Humanities conference – report from DHd 2017

In February 2017, the READ project was present at DHd 2017 in Bern, Switzerland.  This annual conference for the digital humanities in the German-speaking world brings together scholars and scientists working at the intersection of humanities and digital technologies. Alongside presentations of new approaches and of technological and scholarly developments, the focus was on sustainability.  Keynotes and slides of the talks are now available online.

In a workshop organised by Tobias Hodel (State Archives of the Canton of Zurich), READ introduced its technologies and approaches to stakeholders, mainly archival institutions with vast holdings and projects dealing with digital editions and transcriptions.  Besides the presentation of Transkribus, our main tool for transcription and Handwritten Text Recognition, the soon-to-be-implemented Writer Identification technology was introduced by Stefan Fiel of the Technical University of Vienna.

Stefan Fiel (Technical University Vienna) demonstrating Writer Identification [Image by Tobias Hodel]

The READ project also organised a panel, bringing together four research infrastructures that deal with the production of text from handwritten sources. Besides the already well-known TextGrid and Transcribo, the Dutch endeavour MONK presented its work with layout analysis and text recognition. Technicalities, longevity and the aims of the different projects were discussed. The virtual research environments for each project are directed at different groups of scholars and researchers. For those interested in finding out more, we have contributed to a DHd blog post comparing the different processes and capabilities of these projects.

READ and Transkribus were mentioned not only in our panel and workshop but also in talks and presentations by other projects which work regularly with our tools and algorithms (see e.g. the slides of Eva Fasshauer).  In conclusion, we detect not only an interest from the scholarly community but also a broader commitment to invest time in using our applications.

+ Meet the READ project partners – Hervé Déjean

What’s your name?

Hervé Déjean.

Where do you work?

Xerox Research Centre Europe.

Tell us a bit about your background…

I hold a PhD in Natural Language Processing, and for 10 years I’ve been working on Document Layout Analysis and Information Extraction.

What is your role in the READ project?

We are in charge of the Document Understanding part of the project.  We work on techniques to extract human-understandable information from historical documents and codify it into a machine-readable form.  If you’re interested in finding out more, I recently wrote a blog post about some of our research.

What is top of your to-do list at the moment?

I am preparing a demo of Information Extraction for the READ project’s upcoming review meeting with the European Commission.

What do you like best about working on READ?

Learning how to read old manuscripts!

If you could do another job for just one day, what would it be?

Street sweeper (that way I could meet Beppo!)

What can you see out of the window of your office? 

On a sunny day…

And in the snow…

Thanks Hervé! 

+ Coming soon – new DocScan app to help users digitise historical documents!

More and more archival holdings are being digitised.  But there are still thousands of document collections that exist only in manuscript form.  This means that interested readers must visit the archive in person to take pictures of and transcribe the documents they are interested in.

The READ project is seeking to make this process easier with a new digitisation service.  The Computer Vision Lab at Technical University Vienna is developing DocScan, an Open Source Android mobile app that allows archival users to take high-quality images of historical documents.

Screenshot of Transkribus DocScan

DocScan automatically detects the page area of a document and provides real-time feedback on the quality of the image according to factors like perspective, sharpness and light.  This allows users to take high-quality images that can be used for Handwritten Text Recognition in Transkribus, or simply for future research. The DocScan app will be connected to Transkribus so users can upload their images directly to our cloud.
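For readers curious how a quality check of this kind can work, here is a minimal Python sketch.  It is not DocScan's code and makes no claim about how the app is implemented.  It assumes the photo has already been converted to a grid of grayscale values (0 to 255) and uses the variance of a Laplacian filter, a common focus heuristic, together with average brightness.  The function names and both thresholds are hypothetical and chosen only for illustration.

```python
from statistics import mean, pvariance


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over the interior pixels.

    `gray` is a list of equal-length rows of grayscale values (0-255)
    and must be at least 3x3. Sharp images contain strong edges, so the
    variance is high; blurred or featureless images give values near zero.
    """
    responses = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def quality_feedback(gray, min_sharpness=100.0, min_brightness=60.0):
    """Return a list of warnings; an empty list means the shot looks usable.

    Both thresholds are illustrative guesses, not values used by DocScan.
    """
    warnings = []
    if laplacian_variance(gray) < min_sharpness:
        warnings.append("image looks blurred")
    if mean(v for row in gray for v in row) < min_brightness:
        warnings.append("image looks too dark")
    return warnings


# A high-contrast checkerboard passes; a uniformly dark frame fails both checks.
checkerboard = [[((x + y) % 2) * 255 for x in range(5)] for y in range(5)]
dark_frame = [[10] * 5 for _ in range(5)]
print(quality_feedback(checkerboard))
print(quality_feedback(dark_frame))
```

A real app would run checks like these on every camera frame and would also need to detect the page outline to judge perspective, which this sketch leaves out.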

The Computer Vision Lab are also working on a prototype of a ScanTent. This is a piece of equipment designed to hold a mobile phone in a stable position in order to produce a more standardised shot.  This could be particularly handy for scanning bound volumes, where two hands are sometimes needed to keep the pages in place.

DocScan and the ScanTent can also be of use to archives, as they could enable institutions to build up a collection of user-generated content.  QR code recognition or similar technology could be employed to ensure that images are organised correctly within an archive’s digital collections.

If you are interested in finding out more, you can read our reports:

Günter Mühlberger (University of Innsbruck), Markus Diem, Stefan Fiel and Florian Kleber (all at the Computer Vision Lab, Technical University Vienna), D5.14 ScanREAD.

Günter Mühlberger (University of Innsbruck), Markus Diem, Fabian Hollaus, Stefan Fiel and Florian Kleber (all at the Computer Vision Lab, Technical University Vienna), D81. Open Innovation Forum.

You can also take a look at the back-end of the DocScan app on our GitHub page.

We will be partnering with several archives to test out these two products and we plan to organise a ‘scanathon’ to see how quickly users can produce good quality digital images.  Stay tuned to hear more about the development and testing of the app!