+ Meet the READ project partners – Eva Lang

What’s your name?

Eva Lang.

Where do you work?

Passau Diocesan Archives.

Tell us a bit about your background…

I hold a Computer Science Diploma (equiv. MSc) from the University of Passau, where I worked in industrial research, mainly for the automotive and textile industries. I joined the team at Passau Diocesan Archives for the READ project, focusing on technical processes that help archival users with their research. Besides my work at the archives, I also work as a city guide, mainly for English-speaking visitors to our beautiful city of Passau. In my leisure time, I enjoy sports, arts and playing the piano.

What is your role in the READ project?

My role within the READ project is to apply Handwritten Text Recognition technology to our very special historical documents.  Large parts of our images show tables and forms written in many different hands, so this is a distinct challenge within the project.

What is top of your to-do list at the moment?

Digesting the results of the recent READ project review meeting with the European Commission in Brussels and improving the way in which users can use our Transkribus tool to process documents which are structured in tables and forms.

What do you like best about working on READ?

The interdisciplinary character of the project, bridging the historical, archival and computer worlds and working with partners from all over Europe.

If you could do another job for just one day, what would it be?

Work as a confectioner designing and decorating beautiful-looking cakes and pastries.

What can you see out of the window of your office? 

Here we can see the four Saints (left to right): Severin, Valentin, Maximilian and Stephan.  The Diocesan seminaries are named after Maximilian, Valentin (historic) and Stephan (still in operation today). Severin lived and preached in Passau, and the oldest church in the town (going back to around the year 470) is named after him.

Thanks Eva! 

+ Gothenburg calling! Report from Digital Humanities in the Nordic Countries conference

The READ project visited Sweden last week for the second Digital Humanities in the Nordic Countries conference.  The conference was hosted by the University of Gothenburg and organised by the Digital Humanities in the Nordic Countries association, which was founded in 2015.

University of Gothenburg [Image by Louise Seaward]

The conference kicked off with a morning session of workshops where attendees could get to grips with new software, tools and techniques.  Maria Kallio from the National Archives of Finland and Louise Seaward from the Bentham Project at University College London delivered a workshop during this session to demonstrate the Transkribus platform for Handwritten Text Recognition.

Maria Kallio from the National Archives of Finland teaches Transkribus [Image by Louise Seaward]

Around 15 participants were introduced to the READ project’s aim of transforming access to historical documents.  Working on their laptops, they learnt how to use Transkribus to produce training data for Handwritten Text Recognition.  Representatives from all of the Nordic countries took part, and people working in archives, libraries and universities showed much interest in using Handwritten Text Recognition for all sorts of languages.

Once the workshop was over, we were able to enjoy the rest of the conference!  It was a packed few days with around 200 participants and nearly 60 presentations, plus keynotes, workshops and a poster slam.  We found particular inspiration in the panel on crowdsourcing and collaboration.  We heard how the Arthur Prior project at the University of Copenhagen has been recruiting academics to transcribe papers written by Arthur Prior, the philosopher and founder of temporal logic.  We also saw how the Latvian Folklore Archives experienced huge success with a well-publicised crowdsourcing campaign targeted primarily towards school children, which resulted in the transcription of nearly 15,000 pages in only 71 days!  The READ project will be following these projects with interest as we continue to develop a new open source crowdsourcing platform, where users can transcribe documents with the assistance of Handwritten Text Recognition technology.

Feskekôrka Fish Market in Gothenburg [Image by Louise Seaward]

You can catch up on some of the conference goings-on over on Twitter.  We are already looking forward to the 2018 conference in Helsinki, which the National Archives of Finland will be helping to organise!

+ READ presents at Digital Humanities conference – report from DHd 2017

In February 2017, the READ project was present at DHd 2017 in Bern, Switzerland.  This annual conference on digital humanities in the German-speaking world brings together scholars and scientists working at the intersection of humanities and digital technologies. In addition to presentations of new approaches and of technological and scholarly developments, the focus was on sustainability.  Keynotes and slides of the talks are now available online.

In a workshop organised by Tobias Hodel (State Archives of the Canton of Zurich), READ brought its technologies and approaches closer to stakeholders, mainly archival institutions with vast holdings and projects dealing with digital editions and transcriptions.  Besides a presentation of Transkribus, our main tool for transcription and Handwritten Text Recognition, Stefan Fiel of the Technical University Vienna introduced the soon-to-be-implemented Writer Identification technology.

Stefan Fiel (Technical University Vienna) demonstrating Writer Identification [Image by Tobias Hodel]

The READ project also organised a panel, bringing together four research infrastructures that deal with the production of text from handwritten sources. Besides the already well-known Textgrid and Transcribo, the Dutch endeavour MONK presented its work with layout analysis and text recognition. Technicalities, longevity and the aims of the different projects were discussed. The virtual research environments of the projects are aimed at different groups of scholars and researchers. For those interested in finding out more, we have contributed to a DHd blog post comparing the different processes and capabilities of these projects.

READ and Transkribus were mentioned not only in our panel and workshop but also in talks and presentations by other projects that work regularly with our tools and algorithms (see e.g. the slides of Eva Fasshauer).  In conclusion, we detect not only interest from the scholarly community but also a broader commitment to invest time in using our applications.

+ Meet the READ project partners – Hervé Déjean

What’s your name?

Hervé Déjean.

Where do you work?

Xerox Research Centre Europe.

Tell us a bit about your background…

I hold a PhD in Natural Language Processing, and for 10 years I’ve been working on Document Layout Analysis and Information Extraction.

What is your role in the READ project?

We are in charge of the Document Understanding part of the project.  We work on techniques to extract human-understandable information from historical documents and codify it into a machine-readable form.  If you’re interested in finding out more, I recently wrote a blog post about some of our research.

What is top of your to-do list at the moment?

I am preparing a demo of Information Extraction for the READ project’s upcoming review meeting with the European Commission.

What do you like best about working on READ?

Learning how to read old manuscripts!

If you could do another job for just one day, what would it be?

Street sweeper (that way I could meet Beppo!)

What can you see out of the window of your office? 

On a sunny day…

And in the snow…

Thanks Hervé! 

+ Coming soon – new DocScan app to help users digitise historical documents!

More and more archival holdings are being digitised.  But there are still thousands of document collections that exist only in manuscript form.  This means that interested readers must visit the archive in person to take pictures of and transcribe the documents they are interested in.

The READ project is seeking to make this process easier with a new digitisation service.  The Computer Vision Lab at Technical University Vienna is developing DocScan, an Open Source Android mobile app that allows archival users to take high-quality images of historical documents.

Screenshot of Transkribus DocScan

DocScan automatically detects the page area of a document and provides real-time feedback on the quality of the image according to factors like perspective, sharpness and light.  This allows users to take high-quality images that can be used for Handwritten Text Recognition in Transkribus, or simply for future research. The DocScan app will be connected to Transkribus so users can upload their images directly to our cloud.
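
To give a feel for what such quality feedback might involve, here is a minimal Python sketch of two of the checks mentioned above, sharpness and light. It is not DocScan's actual code: the function names, the thresholds and the choice of a Laplacian-variance sharpness measure are all assumptions made for illustration, and perspective is left out entirely. The image is assumed to be a grid of grayscale values from 0 to 255, at least 3 pixels wide and high.

```python
from statistics import mean, pvariance


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; higher values mean sharper edges."""
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def quality_feedback(gray, min_sharpness=100.0,
                     min_brightness=60.0, max_brightness=220.0):
    """Return a list of warnings, or ["ok"] if no check fails.

    The threshold defaults are illustrative guesses, not tuned values.
    """
    brightness = mean(value for row in gray for value in row)
    messages = []
    if laplacian_variance(gray) < min_sharpness:
        messages.append("image looks blurred")
    if brightness < min_brightness:
        messages.append("image is too dark")
    elif brightness > max_brightness:
        messages.append("image is too bright")
    return messages or ["ok"]
```

A real app would run checks like these on every camera frame and would also need to detect the page outline, which this sketch does not attempt.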

The Computer Vision Lab is also working on a prototype of a ScanTent. This is a piece of equipment designed to hold a mobile phone in a stable position in order to produce a more standardised shot.  This could be particularly handy for scanning bound volumes, where two hands are sometimes needed to keep the pages in place.

DocScan and the ScanTent can also be of use to archives, as they could enable institutions to build up a collection of user-generated content.  QR code recognition or similar technology could be employed to ensure that images are organised correctly within an archive’s digital collections.

If you are interested in finding out more, you can read our reports:

Günter Mühlberger (University of Innsbruck), Markus Diem, Stefan Fiel and Florian Kleber (all at the Computer Vision Lab, Technical University Vienna), D5.14 ScanREAD.

Günter Mühlberger (University of Innsbruck), Markus Diem, Fabian Hollaus, Stefan Fiel and Florian Kleber (all at the Computer Vision Lab, Technical University Vienna), D81. Open Innovation Forum.

You can also take a look at the back-end of the DocScan app on our GitHub page.

We will be partnering with several archives to test out these two products and we plan to organise a ‘scanathon’ to see how quickly users can produce good quality digital images.  Stay tuned to hear more about the development and testing of the app!

+ Georgian Papers Programme working with Transkribus

The Georgian Papers Programme is an exciting collaboration between King’s College London and the Royal Collection Trust, along with US partners the Omohundro Institute of Early American History & Culture and William & Mary.

The project is cataloguing, digitising and making available manuscripts relating to the reign of the British King George III (1760-1820).  In doing so, it aims to enhance public understanding of the monarchy during an important period of British history.

The Georgian Papers Programme has begun to work with Transkribus in order to train Handwritten Text Recognition engines to process some of these papers.  Justin Clement from the Transcription team has produced a Prezi demonstration explaining how the Georgian papers are being transcribed in our Transkribus platform.  Transcripts created in Transkribus can be shared with users and also used as training data for Handwritten Text Recognition technology.

Georgian Papers in Transkribus – Image from Prezi by Justin Clement

This training data will ultimately make it possible for computers to automatically transcribe and search these documents, thereby making them much more accessible.  We look forward to seeing how this project develops!

+ It’s competition time! Launch of new ScriptNet platform

For computer scientists, competitions are one of the most effective means of improving their research and technology.  With this in mind, the READ project is pleased to announce the launch of ScriptNet!  ScriptNet is a new platform of competitions related to Handwritten Text Recognition and Document Image Analysis.

ScriptNet

Projects and research groups can create and customise new competitions on the site.  Groups and individual researchers can register and participate in already active competitions.  Competitions on Handwritten Text Recognition, Baseline Detection, Keyword Spotting and Writer Identification have already been announced and are open for participation.

All participants work with the same standardised sets of documents, in the hope of producing the most accurate results.  Competitions are often linked to major conferences such as the International Conference on Document Analysis and Recognition (ICDAR).

When different researchers work on the same problem in this way, the quality of the results is likely to be enhanced.  Competitions therefore allow computers to process handwritten historical documents more accurately and form a vital part of the READ project’s mission to make these documents more accessible.
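
As an illustration of how accuracy on a shared dataset can be compared, the sketch below computes a character error rate (CER): the number of character edits needed to turn a system's output into the reference transcript, divided by the length of the reference. CER is a common measure in Handwritten Text Recognition, but whether a given ScriptNet competition scores entries this way is an assumption here, and the function names are made up for this example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because every participant is scored against the same reference transcripts, entries can be ranked simply by sorting them on this value, with lower being better.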

The ScriptNet platform is available in English, French and Greek.  All competition datasets will be assigned a Digital Object Identifier (DOI) and uploaded to the Zenodo repository.

For more information about ScriptNet, take a look at the report written by Giorgos Sfikas, Basilis Gatos (both at National Center for Scientific Research “Demokritos”) and Verónica Romero Gómez (Universitat Politècnica de València).

+ Read our 2016 reports!

The READ project receives funding from the European Commission’s Horizon 2020 programme.  As a condition of this funding, we are required to produce reports where we explain all the work that has been undertaken across each year of the project.

Our first set of reports (known as ‘deliverables’) has just been submitted and is now available on our Publications page.  They should make interesting reading for anyone keen to know more about the potential of Handwritten Text Recognition.

Have a look through the titles and see which area of our work most interests you.  You can read about the progress of our research on Layout Analysis, Document Understanding, Image Enhancement and Writer Identification.  There is also information on the new tools we are developing for crowdsourcing, e-learning and document scanning.  You can also find out about the different historical collections that we are working with.

We will be blogging more about some of our 2016 milestones over the next few weeks.  If you’re looking for more of an overview, why not read our blog post where we recapped the highlights from the project’s first year!

+ New article in Update magazine for Library professionals

The Bentham Project at University College London and The Linnean Society have forged a fruitful collaboration as part of the READ project.

The two institutions have worked to create training data for Handwritten Text Recognition on their respective collections and also organised a successful conference in London in October 2016.

Networking at the Digital Toolbox conference, October 2016 [Image by Louise Seaward]

Louise Seaward (Bentham Project) and Elaine Charwat (formerly of The Linnean Society) have written about their work with READ in the following article in Update, the magazine of the Chartered Institute of Library and Information Professionals (CILIP).

Louise Seaward and Elaine Charwat, ‘If you teach a computer to READ’, CILIP Update, December/January 2016/17.

+ Finding patterns in eighteenth-century weddings – new blog from Xerox

Xerox Research Centre Europe is one of the READ research partners, with responsibility for Document Understanding.  Document Understanding is a crucial part of the process of training computers to recognise historical documents, as Hervé Déjean from the Xerox team explains in this blog.

Document Understanding involves analysing the layout of a document in order to extract human understandable information about its content. Hervé’s blog presents a useful overview of the concept and offers specific details about how this method can be applied to historical documents.

Image from Passau Diocesan Archives

Hervé describes how he has been using Sequential Pattern Mining Techniques on eighteenth-century wedding registers provided by Passau Diocesan Archives, another partner in the READ project.  Document Understanding helps to ensure that we can group information from a document into a meaningful sequence – in this case, ensuring the right groom is matched with the right bride on the right day!
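
The idea of grouping entries into a meaningful sequence can be sketched in a few lines of Python. This is a deliberately simplified stand-in for Sequential Pattern Mining, not the method used by the Xerox team: it only counts contiguous runs of labels to find the most frequent one, then uses that run to split a page into records. The labels ("date", "groom", "bride", "note") and both function names are hypothetical.

```python
from collections import Counter


def most_frequent_pattern(labels, length):
    """Return the most common contiguous run of `length` labels."""
    windows = [tuple(labels[i:i + length])
               for i in range(len(labels) - length + 1)]
    return Counter(windows).most_common(1)[0][0]


def group_records(items, pattern):
    """Group (label, text) items into records where the labels match `pattern`."""
    records, i, size = [], 0, len(pattern)
    while i <= len(items) - size:
        chunk = items[i:i + size]
        if tuple(label for label, _ in chunk) == pattern:
            records.append({label: text for label, text in chunk})
            i += size
        else:
            i += 1  # skip an item that does not start a record, e.g. a marginal note
    return records


# Made-up page content, for illustration only.
page = [
    ("date", "date 1"), ("groom", "groom 1"), ("bride", "bride 1"),
    ("note", "marginal note"),
    ("date", "date 2"), ("groom", "groom 2"), ("bride", "bride 2"),
]
pattern = most_frequent_pattern([label for label, _ in page], length=3)
weddings = group_records(page, pattern)
```

Here the repeating run is date, groom, bride, so each groom ends up in the same record as the bride and date that follow the same pattern, and the stray note is skipped.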