+ Georgian Papers Programme working with Transkribus

The Georgian Papers Programme is an exciting collaboration between King’s College London and the Royal Collection Trust, along with US partners Omohundro Institute of Early American History & Culture and William & Mary.

The project is cataloguing, digitising and making available manuscripts relating to the reign of the British King George III (1760-1820).  In doing so, it aims to enhance public understanding of the monarchy during an important period of British history.

The Georgian Papers Programme has begun to work with Transkribus in order to train Handwritten Text Recognition engines to process some of these papers.  Justin Clement from the Transcription team has produced a Prezi demonstration explaining how the Georgian papers are being transcribed in our Transkribus platform.  Transcripts created in Transkribus can be shared with users and also used as training data for Handwritten Text Recognition technology.

Georgian Papers in Transkribus – Image from Prezi by Justin Clement

This training data will ultimately make it possible for computers to automatically transcribe and search these documents, thereby making them much more accessible.  We look forward to seeing how this project develops!

+ It’s competition time! Launch of new ScriptNet platform

For computer scientists, competitions are one of the most effective means of improving their research and technology.  With this in mind, the READ project is pleased to announce the launch of ScriptNet!  ScriptNet is a new platform of competitions related to Handwritten Text Recognition and Document Image Analysis.

Projects and research groups can create and customise new competitions on the site.  Groups and individual researchers can register and participate in already active competitions.  Competitions on Handwritten Text Recognition, Baseline Detection, Keyword Spotting and Writer Identification have already been announced and are open for participation.

All participants work with the same standardised sets of documents, in the hope of producing the most accurate results.  Competitions are often linked to major conferences such as the International Conference on Document Analysis and Recognition (ICDAR).

When different researchers work on the same problem in this way, the quality of the results is likely to be enhanced.  Competitions therefore allow computers to process handwritten historical documents more accurately and form a vital part of the READ project’s mission to make these documents more accessible.
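As a rough illustration of how results on a shared dataset might be compared, the sketch below ranks some submissions by character error rate against a single reference transcript.  This is only an assumption for illustration: the post does not say which metrics ScriptNet uses, and the team names and transcripts here are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical data: one reference line and three invented submissions.
reference = "the quick brown fox"
submissions = {
    "team_a": "the quick brown fox",
    "team_b": "the quiek brovn fox",
    "team_c": "teh quick brown f0x",
}

scores = {team: character_error_rate(reference, text)
          for team, text in submissions.items()}
ranking = sorted(scores, key=scores.get)  # lowest error rate first
```

Because every submission is scored against the same reference, the error rates are directly comparable, which is the point of using standardised document sets.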

The ScriptNet platform is available in English, French and Greek.  All competition datasets will be assigned a Digital Object Identifier (DOI) and uploaded to the Zenodo repository.

For more information about ScriptNet, take a look at the report written by Giorgos Sfikas, Basilis Gatos (both at National Center for Scientific Research “Demokritos”) and Verónica Romero Gómez (Universitat Politècnica de València).

+ Read our 2016 reports!

The READ project receives funding from the European Commission’s Horizon 2020 programme.  As a condition of this funding, we are required to produce reports where we explain all the work that has been undertaken across each year of the project.

Our first set of reports (known as ‘deliverables’) has just been submitted and is now available on our Publications page.  The reports should make interesting reading for anyone keen to know more about the potential of Handwritten Text Recognition.

Have a look through the titles and see which area of our work most interests you.  You can read about the progress of our research on Layout Analysis, Document Understanding, Image Enhancement and Writer Identification.  There is also information on the new tools we are developing for crowdsourcing, e-learning and document scanning.  You can also find out about the different historical collections that we are working with.

We will be blogging more about some of our 2016 milestones over the next few weeks.  If you’re looking for more of an overview, why not read our blog post where we recapped the highlights from the project’s first year!

+ New article in Update magazine for Library professionals

The Bentham Project at University College London and The Linnean Society have forged a fruitful collaboration as part of the READ project.

The two institutions have worked to create training data for Handwritten Text Recognition on their respective collections and also organised a successful conference in London in October 2016.

Networking at the Digital Toolbox conference, October 2016 [Image by Louise Seaward]

Louise Seaward (Bentham Project) and Elaine Charwat (formerly of The Linnean Society) have written about their work with READ in the following article in Update, the magazine of the Chartered Institute of Library and Information Professionals (CILIP).

Louise Seaward and Elaine Charwat, ‘If you teach a computer to READ’, CILIP Update, December/January 2016/17.

+ Finding patterns in eighteenth-century weddings – new blog from Xerox

Xerox Research Centre Europe is one of the READ research partners, with responsibility for Document Understanding.  Document Understanding is a crucial part of the process of training computers to recognise historical documents, as Hervé Déjean from the Xerox team explains in this blog.

Document Understanding involves analysing the layout of a document in order to extract human understandable information about its content. Hervé’s blog presents a useful overview of the concept and offers specific details about how this method can be applied to historical documents.

Image from Passau Diocesan Archives

Hervé describes how he has been using Sequential Pattern Mining Techniques on eighteenth-century wedding registers provided by Passau Diocesan Archives, another partner in the READ project.  Document Understanding helps to ensure that we can group information from a document into a meaningful sequence – in this case, ensuring the right groom is matched with the right bride on the right day!
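To give a feel for the idea, here is a minimal sketch, not the Xerox team's actual method.  It is a much simplified stand-in for sequential pattern mining: it counts contiguous runs of line labels, takes the most frequent run as the record structure, and groups lines into records accordingly.  The labels, names and dates are invented for illustration and are not taken from the Passau registers.

```python
from collections import Counter


def frequent_ngrams(labels, n, min_support):
    """Count contiguous label sequences of length n; keep the frequent ones."""
    counts = Counter(tuple(labels[i:i + n])
                     for i in range(len(labels) - n + 1))
    return {pattern: c for pattern, c in counts.items() if c >= min_support}


def group_records(lines, pattern):
    """Group (label, text) lines into records wherever the pattern occurs."""
    records, i, n = [], 0, len(pattern)
    while i + n <= len(lines):
        window = lines[i:i + n]
        if tuple(label for label, _ in window) == pattern:
            records.append({label: text for label, text in window})
            i += n
        else:
            i += 1  # skip a line that does not fit the pattern
    return records


# Hypothetical, hand-labelled lines from a register page (invented data).
lines = [
    ("date", "12 May"), ("groom", "Johann Huber"), ("bride", "Maria Bauer"),
    ("date", "19 May"), ("groom", "Georg Mayr"), ("bride", "Anna Wimmer"),
    ("note", "illegible"),
    ("date", "2 June"), ("groom", "Josef Lang"), ("bride", "Theresia Koch"),
]

labels = [label for label, _ in lines]
patterns = frequent_ngrams(labels, n=3, min_support=3)
best = max(patterns, key=patterns.get)  # most frequent label sequence
records = group_records(lines, best)
```

Here the repeated date, groom, bride sequence is discovered from the data instead of being hard-coded, and the stray note line is skipped without breaking the records on either side of it.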

+ A new Transkribus User Report

Chiara Petrolini, a post-doctoral fellow at the German Historical Institute in Rome (DHI), recently spent a few days with the Transkribus team at the University of Innsbruck.

She has kindly written a User Report about her experience of working with Transkribus so far.

Dr Petrolini is an early modern scholar, working on a project about the court librarian Sebastian Tengnagel and the Imperial Library in Vienna.  She has begun transcribing Tengnagel’s seventeenth-century correspondence, with a view to training the Handwritten Text Recognition engine to recognise this handwriting.  She is also finding Transkribus a useful transcription tool for documents written in more than one language, as scholars with different skills can work on the same document from different locations.

This new project will help us to spread the word about Transkribus in Italy – we will be coming there for a workshop soon!

+ Looking back on 2016…

January is always a time for reflection and at the READ project, we have a lot to reflect on!  We’ve been busy over the past 12 months in our mission to use new technologies to make historical documents more accessible.  We wanted to give a quick recap of our major milestones and our future plans.

Research

Our research teams have been exploring the fields of Handwritten Text Recognition, Layout Analysis, Document Understanding, Writer Identification, Language Models and more.  Some of these technologies are already available in our Transkribus tool and more will be integrated over the coming months.  Towards the end of 2016 we also started to prepare for the launch of our ScriptNet platform, a new collection of research competitions where computer scientists will experiment with huge amounts of data to improve their technologies.

Discussion topics at one of the READ project meetings [Image by Louise Seaward]

Services

The Transkribus tool has been maintained and improved across the year.  Over 2000 new users registered for a Transkribus account in 2016 and they are now able to access new features such as full-text search and a table editing tool.  We have also developed How to Guides to help people navigate the platform.

We are working with partners inside and outside of the project to develop bespoke Handwritten Text Recognition models capable of transcribing and searching specific collections of documents.  Our most successful models so far relate to eighteenth- and nineteenth-century German and English handwriting.  But we are working with many more languages, styles and time-frames – watch this space!

Demonstrating the Table Editing Tool in Transkribus [Image by Louise Seaward]

Dissemination

Dissemination is a key part of READ – we want to raise awareness about the technology that we are developing and ensure that it is used by the people who need it.

We have helped to organise four conferences in Germany, Austria and the United Kingdom for collection holders, researchers and computer scientists.  We have also been travelling a lot – delivering 30 Transkribus workshops (at last count!) in different European cities.  In these workshops, we teach people how to use Transkribus and explain the potential of Handwritten Text Recognition.  If you are interested in organising a workshop at your institution, just send us an email!

READ project members taking a break from their computers at a meeting in Passau, Germany [Image by Louise Seaward]

In terms of our research outputs, we are working to ensure that our project publications are Open Access, our research tools are Open Source via Github and our published research data is being made available in Zenodo.

We have had fun spreading the word about Transkribus on Twitter and will be branching out to YouTube and Facebook this year.

Collaboration

Our network grew steadily across 2016.  Over 30 institutions have now signed a Memorandum of Understanding with READ, which brings them into the project network.  To give just a couple of examples, we are working with the Belgrade University Library on training computers to understand Cyrillic text and receiving advice from the Institute for Documentology and Editing on the role of Transkribus in digital scholarly editing.

Cyrillic document from the Belgrade University Library.

What’s next?

All this work will continue into 2017 but there will also be some exciting new developments.

The project technologies are beginning to be integrated into new web tools which will be made available via the Transkribus website.  An e-learning module, a platform for crowdsourced transcription and a mobile app for scanning documents are all in the works.  We are also developing our business plan to ensure that we can sustain the services provided by Transkribus far into the future.

Want to find out more?

You can find more detailed summaries of the work that READ has completed in these different areas by taking a look at the latest reports (deliverables) that we have submitted to the European Commission.

+ Watch presentations from our ‘Digital Toolbox’ Conference

On 10 October 2016, we asked researchers, archivists and curators to discuss ‘What should be in your Digital Toolbox?’ at our conference in London.  This event was organised by the Linnean Society (part of the READ MOU network) and the Bentham Project at University College London (one of the READ partners).  Videos and slides of the speakers’ presentations are now available.

Networking in the Linnean Society Library [Image by Louise Seaward]

There was a great exchange of ideas on the day, both in person and on Twitter, about the best means of extracting data from complex handwritten and printed records.  You can now get a flavour of what went on through the videos and slides below.

Professor Melissa Terras (University College London), If you teach a computer to READ: Transcribe Bentham, Transkribus, and Handwriting Technology Recognition 

Dr Günter Mühlberger (University of Innsbruck), Transkribus as a Toolkit for text Recognition, Transcription and Information Extraction 

Dr Roger Labahn (University of Rostock), Key concepts of Handwritten Text Recognition

Dr Mia Ridge (The British Library), The Art of Work in the Age of Mechanical Reproduction

Professor James Loxley (University of Edinburgh), Lines of Enquiry: Reordering Edinburgh’s Literary History

Dr Elspeth Haston (Royal Botanic Garden Edinburgh), Automating Label Data Capture from Natural History Specimens

Alison Harding and Lisa Cardy (Natural History Museum/Biodiversity Heritage Library), Unlocking Biodiversity Data @ The Biodiversity Heritage Library 

Dr Victoria Van Hyning (University of Oxford/Zooniverse), Metadata Extraction and Full Text Transcription on the Zooniverse Platform

+ Meet the READ project partners – Max Bryan

What’s your name?

Max Bryan.

Where do you work?

The Department for Natural Language Processing at Leipzig University.

Tell us a bit about your background…

My main research interests lie in everything that has to do with neural networks.  I first became interested in this subject at Hamburg University, where I wrote my Master’s thesis on different learning strategies.  In my free time, I like to paint or cook with my Chinese friends.

What is your role in the READ project?

Our group is responsible for creating various language resource tools to be integrated into Handwritten Text Recognition models.  We are also sharing our knowledge of language models with the READ project partners.

What is top of your to-do list at the moment?

Using dictionaries to create various formats for the experiments and training a language model that learns to recognize abbreviations.

What do you like best about working on READ?

Working with people that do very similar things but come from different directions and thus have different views.

If you could do another job for just one day, what would it be?

Pilot or train conductor.

What can you see out of the window of your office? 

[Image: Leipzig]

Thanks Max! 

+ What’s that written in the margin? Handwritten Text Recognition, Marginalia and John Stuart Mill

Some people are horrified by the thought of writing notes on the pages of books.  But for the English philosopher John Stuart Mill (1806–1873), marginal notes were a useful way to record his thoughts and observations as he read.

Mill’s collection of books is now in the possession of Somerville College at the University of Oxford.  The John Stuart Mill Collection holds more than 1500 books once owned by Mill.  Many of these texts contain annotations and markings made by Mill.

The John Stuart Mill Collection, Somerville College, University of Oxford [Image by Louise Seaward]

Somerville College, in collaboration with the University of Alabama, is currently undertaking a project to digitise and categorise these marginalia.  These partners have now begun to work with Transkribus, with a view to applying Handwritten Text Recognition to Mill’s scribblings.

READ partners from Xerox Research Centre Europe and the Computer Vision Lab at Vienna Technical University are working with hundreds of images from the Mill collection.  They aim to use Document Understanding to distinguish between the printed and handwritten text on the pages of these books, and to use Handwritten Text Recognition to transcribe the comments which Mill wrote in the margins.  Transcripts of the Mill marginalia would be an invaluable resource to scholars and would complement the forthcoming Mill Marginalia database.

This is an exciting experiment for the READ project, as the methods and results of this endeavour could be applicable to other collections where marginal annotations appear on printed texts.  Many other writers, including Oscar Wilde and Mark Twain, were habitual annotators and technology from the READ project could help us to understand how they read, processed and understood books and articles.