+ Working with (Early Modern) Dutch script? Join a new Transkribus working group!

by Annemieke Romein, University of Ghent

(Dutch language version below)

Throughout the Early Modern era much was written in the Dutch language, not just in the Low Countries but also in former colonies, among certain religious groups in North America, and in Hanseatic cities. An Early Modern Gothic script was widely used, though it varied with the context, purpose and type of text. First experiments with documents from Belgium (Ghent, in Flanders) have demonstrated that the Dutch language can be recognised by Handwritten Text Recognition (HTR) models with a good level of accuracy.

The next step is to combine different examples of Early Modern Dutch texts in order to build and improve generic models for the recognition of various types of documents. Dr. Annemieke Romein (Erasmus University Rotterdam/ Ghent University), Dr. Jetze Touber, and Koen Verstraeten have initiated the ‘Early Modern Dutch’ working group – where all Transkribus users can work together towards the aim of the improved recognition of the Dutch language. Scroll down to find out more about joining the working group and its aims.

The process of combining training data from different Early Modern Dutch documents has already started at Ghent University. Various researchers at the Institute for Early Modern History and the Ghent Center for Digital Humanities are bringing materials together in order to train an HTR model. However, within a multidisciplinary group such as this, we quickly realised that we have to deal with various types of texts as well as various periods within the Early Modern era. Sixteenth-century handwriting differs from that of a century later, even where the content changed little; likewise, texts written in a political-institutional or legal setting differ tremendously from diaries, letters and academic texts. Nonetheless, each of these types of texts can help to train the recognition of both the language and the handwriting. How smart computers can be made in such a context is yet to be discovered.

In order to streamline this endeavour, three Ghent-based historians are working together to coordinate and train different language models, hopefully leading to one final model for the Dutch language (depending on the amount of training material).

Dr. Annemieke Romein: 16th, 17th, 18th century; political-institutional/legal texts (incl. requests, letters of statesmen).
Dr. Jetze Touber: 16th, 17th, 18th century; cultural texts (diaries, letters); scholarly, academic and religious texts.
Koen Verstraeten: 19th century; cultural texts (diaries, letters); scholarly and academic texts.

The ‘Early Modern Dutch’ working group is looking for further examples of documents written in Dutch from the 16th, 17th and 18th centuries. You can help us add to the collection – all that is needed are images (preferably around 300 dpi) and transcriptions.

You can:

  • share existing training data that you have already prepared in Transkribus (duplicate it to the folder we will invite you to);
  • prepare new images and transcripts in Transkribus in the ‘Early Modern Dutch’ collection;
  • send over files containing images and transcripts, which can be matched automatically and converted into training data using the Text2image tool.

Please do indicate what type of textual material you are sharing, so that we have an overview and can start training models as soon as possible.

To join the working group and get access to the ‘Early Modern Dutch’ collection in Transkribus, contact the group at: TranskribusEMDutch@gmail.com.

The ‘Early Modern Dutch’ working group aims to demonstrate that training-based algorithms like Handwritten Text Recognition need significant input from many stakeholders – they can only be improved by cooperation and sharing!

————————————————————————————————————

Working with Early Modern Dutch texts (c. 1500-1900)? Join the Transkribus working group!

Many texts were written in the Dutch language, not only in the Low Countries themselves but also in former colonies, among religious groups in North America, and in the Hanseatic cities. The Early Modern Gothic script was widely used, although variations occur depending on the context, purpose and type of text. First experiments with documents show that the Dutch language can be recognised by Automatic Text Recognition (OCR) models, and that good results can be achieved through training.

The next step is to combine different examples of Dutch texts in an attempt to build general language models that can analyse and recognise various types of documents. Dr. Annemieke Romein (Erasmus University Rotterdam/Ghent University, IEMH), Dr. Jetze Touber (UGent, IEMH) and Koen Verstraeten (UGent archive) are taking the initiative to start an ‘Early Modern Dutch’ working group. The focus is on the period 1500-1900, but material from other periods is also welcome. In this group, Transkribus users can work together to improve the recognition of Dutch-language texts. Do read on to find out more about taking part in this group and its aims.

The process of combining training material from different Early Modern texts has been under way for some time. At UGent, various researchers from the Institute for Early Modern History and the Ghent Center for Digital Humanities are uploading their materials to Transkribus. Existing transcriptions are linked to photographs via Text2Image and used to train the computers. This work is currently in full swing. We quickly realised that there are different types of texts, as well as different periods across which gradual changes occur. All kinds of texts can be trained in Transkribus, but this requires a great deal of training material: more than a single researcher can collect. Hence this call for participation.

Transkribus is (for the time being) a free programme that can be used to train servers in Innsbruck to recognise handwriting (and print as well) by means of Handwritten Text Recognition (HTR). At least 75 pages of transcribed text are needed to recognise a hand well, but that applies to a single author. The more material is uploaded, the more universal and widely applicable the model becomes. Archives, libraries and heritage institutions, and certainly individual researchers too, are urgently requested to share their material covering the 16th to 19th centuries.

Three Ghent researchers are involved in coordinating the Dutch-language model and will run tests in order to train a model (or models) that is as accurate as possible. These researchers are concerned with, respectively:

Dr. Annemieke Romein: 16th, 17th, 18th, 19th century; political-institutional/legal texts (incl. requests, letters of statesmen).
Dr. Jetze Touber: 16th, 17th, 18th century; cultural texts (diaries, letters); scholarly, academic and religious texts.
Koen Verstraeten: 19th century; cultural texts (diaries, letters); scholarly, academic and religious texts.

If you would like to make material available and take part in this working group, please contact us at TranskribusEMDutch@gmail.com. It is helpful if you indicate what type of texts are involved, so that we have an idea of which models we can use them for.

Frequently asked questions:

  • Images and transcriptions that you place on the Transkribus server (directly through the programme or via the Text2image tool) remain private: only you have access to them.
  • You can choose to share (duplicate) certain documents to the Early Modern Dutch group. The sole purpose of this group is to train language models so that Dutch can be recognised more quickly. Participants in this group can see one another’s texts.
  • So it is your choice which documents you share with us! The more material reaches us, the easier it becomes to train language models.
  • You have material (photographs and transcriptions) but do not yet use Transkribus? No problem. Once you create an account and contact Transkribus (email@transkribus.eu), they can help you through the process. You can make the material available in various ways, and Transkribus will link the images to the transcriptions. (This service is free until May 2019.)
  • You are an institution and wonder what the benefit is for you? First and foremost, it trains a language model that can help a great many researchers and institutions (including your own) to recognise handwriting more quickly. You can also use the material you make available, in your own Transkribus account, to create searchable PDFs. You then have an image of the source material with the transcription in the background (or, if you prefer, also placed beneath it), which you can use to make the material available to your audience. So it is also a fine way of presenting it!
  • Costs? Transkribus is free until June 2019. It is currently funded by European research grants (the READ project). After June, “READ-COOP” will start, in which individual users will continue to use the platform free of charge but ‘large users’ such as institutions will be asked for a contribution. Exactly how high these costs will be is not yet known, but it has been stressed that they will not be too high, since there is an awareness that institutions usually cannot spend much money on this. BUT: for the time being the service is free, so you can receive the “searchable PDFs” in return, and you can always stop using the programme after June!

+ From Foucault to our future – Transkribus User Conference 2018

On 8-9 November 2018, Vienna was taken over by more than 100 Transkribus users keen to share their experiences and learn about the latest advances in Handwritten Text Recognition.

The Transkribus User Conference was hosted at the Technical University of Vienna for the second time.  Although the skies were less sunny than last year, the conference programme was just as packed with user case studies, demonstrations of new tools and technological insights.

We got going before the conference even started with a Scanathon event at the Austrian Academy of Sciences on 7 November.  We invited participants to try digitising documents with their mobile phone using our DocScan app and ScanTent device.  After train delays in Austria, we breathed a sigh of relief when the ScanTents arrived and we could start testing and receiving feedback on our new prototypes!

Participants at the Vienna Scanathon on 7 November 2018.  Image credit: Elena Muehlbauer

Coming back to the conference, user stories were a big highlight of the event.  We heard how researchers and archivists from around Europe are using our Transkribus platform to recognise a variety of writing, including the papers of the French philosopher Michel Foucault, early modern signatures and initials, and sixteenth-century Polish tax registers.

We also heard about one of the first projects using Transkribus for crowdsourcing.  Crowd leert computer lezen (or Crowd teaches the computer how to read) at Amsterdam City Archives has connected Transkribus to the VeleHanden crowdsourcing platform, allowing volunteers to produce transcriptions that can then be automatically used as training data for recognising notarial documents.

The conference was also an important showcase for ground-breaking advances in text recognition technology including:

  • HTR+ – a faster and more accurate form of Handwritten Text Recognition using TensorFlow
  • Keyword Spotting – a sophisticated form of keyword searching using the power of Handwritten Text Recognition technology
  • Table Processing – applying automated layout analysis and templates to tabular documents

Gundram Leifert from CITlab, University of Rostock presents HTR+. Image credit: Elena Muehlbauer

We looked towards our future too with a presentation on the new READ-COOP.  The READ project will come to an end in July 2019, from which point Transkribus services will be provided as part of this new cooperative.  A freemium service model is planned where basic Transkribus functions will remain free to all and more intensive users (research projects, archives etc.) will be liable for some charges  – more details coming soon!

Finally, we hope that the conference was a unique opportunity for our users to speak directly to computer scientists and developers about their requirements and their research. It certainly looked like lots of nice connections were being made!

Discussion in full flow at Transkribus User Conference 2018. Image credit: Louise Seaward.

Huge thanks go to all our presenters and participants.  We also need to thank the Computer Vision Lab at the Technical University of Vienna for hosting the conference and our other READ project colleagues who helped with organisational details.  We are looking forward to next year’s conference already!

+ New Transkribus transcription conventions now available

We know that handwritten historical documents are often complex in their structure and content.

Transkribus users usually need to transcribe at least 75 pages of handwritten material in our platform in order to create training data for Handwritten Text Recognition (HTR).  We have produced a new How to Guide covering our transcription conventions to make this process easier.

It answers some of the common queries around the transcription of elements like punctuation, abbreviations, diacritics, strikethrough text, names and much more.

You can find further guidance on working with Transkribus in the other How to Guides on the Transkribus wiki.  And if you can’t find an answer to your query, we are always happy to hear from you by email (email@transkribus.eu).

+ Wandering around baroque Naples – The Pandetta project by ilCartastorie.

by Sergio Riolo, ilCartastorie

The Historical Archives of the Banco di Napoli is one of the most important archives in the world. It holds documentation belonging to the eight ancient Neapolitan banks, which were operational between 1539 and 1640, and then were merged to create the Banco delle Due Sicilie (1809) and, after the political unification of Italy, the Banco di Napoli (1861). The Fondazione Banco di Napoli and its museum-foundation ilCartastorie are the keepers of this huge treasure, which fills three hundred rooms in Palazzo Ricca, at the centre of the city of Naples. All this documentation features remarkably homogeneous handwriting due to the schools of writing existing in each bank over the centuries.

To preserve its archive and make it more visible through new media, ilCartastorie has started a programme of digitisation using the Transkribus platform. Through this programme, all the names of bank clients from 1573 to 1600, for each bank existing at that time, will be made more accessible and searchable.

The whole archive, from 1539 to 1900, contains more than three thousand client ledgers, called ‘pandettas’, containing an estimated total of seventeen million names. It is an astonishingly well-organised and preserved database of people and organisations which is highly important for scholars, researchers, genealogists, and citizens.

The Foundation and its museum started their path towards the horizon of mass digitisation and Handwritten Text Recognition (HTR), choosing a specific segment in the four-century long timeline of this documentation, from the starting point of the first bank to the dawn of the seventeenth century, for a total of two hundred and forty thousand names split into sixty-three archival units.

A team of six people is now working with Transkribus on this data accessibility project. We have already made a first trial run, training an HTR model based on ten thousand words, including names, surnames and account numbers. This first ‘beta’ model achieved a satisfactory Character Error Rate (CER) of 13% within one month, and it is now helping us to deal with the other pandettas, speeding up transcription and therefore reducing the amount of time needed to complete the work.
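For readers unfamiliar with the metric: Character Error Rate is conventionally the character-level edit distance between the model output and a reference transcription, divided by the length of the reference. The following is a minimal Python sketch under that standard definition. The function names and the sample name are invented for illustration; they are not taken from the Pandetta project or from Transkribus, whose own evaluation tooling may differ in details such as whitespace handling.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical ledger entry: manual transcription vs. model output.
reference = "Giovanni Battista"
hypothesis = "Giovani Batista"
print(f"CER: {character_error_rate(reference, hypothesis):.1%}")
```

Read this way, a CER of 13% means that roughly 13 characters in every 100 of the reference still need correcting by hand.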

The first pandetta from the Banco di Ave Gratia Plena, with its three thousand names, was finished last week and the second is proceeding well. We hope to complete all four of the client ledgers written in this handwriting and then to proceed with a second model, in order to deal with the rest of Ave Gratia Plena’s ledgers dating up to the 1600s, before the end of January 2019.

A second phase of the project will connect the names in the ‘pandetta’ with the precious reasons for payment recorded in another kind of document. It is our hope that you will be able to discover the daily business and the economic life of thousands of citizens in baroque Naples.

+ Update on table processing

Back in April we appealed for help in generating a new data set that could be used to improve the automated layout analysis of historical documents set out in tables.  We asked, and you answered!

Thanks to submissions from our network, READ researchers at the Computer Vision Lab at the Technical University of Vienna, Naver Labs Europe and the Passau Diocesan Archives have been compiling a sizeable collection of images of historical documents containing tables.

We now have a total of around 1,500 images from 25 contributors all around the world.  The delivered sources show a great variety of tables, from hand-drawn accounting books to stock exchange lists and train timetables, from record books to prisoner lists, simple tabular prints in books, production censuses and many, many more.

READ researchers are preparing the data set as the basis for a computer science research competition in early 2019 (more details coming soon!).  This collection will be used to evaluate different approaches to the automated recognition of tables.

There is still a lot for us to learn about what constitutes a table.  Working with this heterogeneous data should help us to move beyond the specifics and come up with some generic guidelines and techniques for processing these kinds of pages.

We are very thankful to our network for delivering such a variety of tabular data and we look forward to sharing our next progress report!

Screenshot of 1937 Irish Census in Transkribus.  Image courtesy of National University of Ireland, Galway.

+ More than 15,000 Transkribus users!

Drumroll please!  Transkribus now has more than 15,000 users!  Our users are based mainly in Europe but also extend into Africa, Australia, America and other parts of the globe.

This expansion of our user-base is a significant achievement for the READ project.  Back when the project started in January 2016, there were only 2828 registered Transkribus users.   And a broad user network is very important for us.  By working with an enormous variety of documents provided by different researchers, projects and institutions, we are developing robust Handwritten Text Recognition technology that can cope with all sorts of scripts.

So we look forward to collaborating with lots more new users in 2019 and beyond!  And if you haven’t tried out Transkribus yet, why not have a go?

+ Transkribus on Euronews TV

Check us out – we’re on TV again!  EuroNews TV, a leading 24-hour information channel, has produced a short documentary film featuring READ which sheds light on the latest research in Handwritten Text Recognition.

The film is a co-production between EuroNews and the European Commission.  It is being aired in 10 languages on the award-winning Futuris programme on European science, research and innovation and should hopefully be seen by 430 million households in 130 countries!

+ Experiments with Transkribus and early printed text

We love hearing what our users have been getting up to with our Transkribus platform for Handwritten Text Recognition.

Annika Rockenberger from the National Library of Norway has written a blog about her experiments with Transkribus as part of her work on a digital edition of the writings of the German journalist, historian and poet Georg Greflinger (1620-1677).

Annika is working with early printed text which cannot be adequately recognised with OCR.  She explains that Transkribus users can train a model to recognise this kind of printed text, with around 5000 words of transcribed material.

Unfortunately in this case, digitised images from tightly bound books have made it difficult for the programme to detect the location of text on a page.  Annika hopes to continue her experiments with Transkribus at a later date with better quality images.  Read more on the Greflinger Digital Edition blog.

+ Searching Jeremy Bentham’s manuscripts with Keyword Spotting

The Bentham Project has been experimenting with the Handwritten Text Recognition (HTR) of Bentham’s manuscripts for the past five years, first as a partner in the tranScriptorium project and now as part of READ.

Read about their progress with HTR and our Transkribus platform in blog posts from June 2017 and  February 2018.

Keyword Spotting

The results have thus far been impressive, especially considering the immense difficulty of Bentham’s own handwriting.  But automated transcription is not yet at a point where it is sufficiently accurate to be used by Bentham Project researchers as a basis for scholarly editing.

However, the current state of the technology is strong enough for keyword searching!  And thanks to a collaboration with the PRHLT research center at the Universitat Politècnica de València (another partner in the READ project), there are some exciting new results to report.  It is now possible to search over 90,000 digital images of the central collections of Bentham’s manuscripts, which are held at Special Collections, University College London, and The British Library.

A Keyword Spotting search for the word ‘pleasure’

Appeal for volunteers!

A Google sheet has been prepared with some suggested search terms in 5 different spreadsheet tabs (Bentham’s neologisms, concepts, people, places and other).  The Bentham Project is appealing for people to record their searches online, using the suggested search terms and some new ones too.  Some of the results will be shared at the upcoming Transkribus User Conference in November.

Background

The PRHLT team have processed the Bentham papers with cutting-edge HTR and probabilistic word indexing technologies. This sophisticated form of searching is often called Keyword Spotting. It is more powerful than a conventional full-text search because it uses statistical models trained for text recognition to search through probability values assigned to character sequences (words), considering most possible readings of each word on a page.
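To make the contrast with a conventional full-text search concrete, here is a toy Python sketch of searching a probabilistic word index. It assumes a much simplified structure in which each word position on a page stores several candidate readings with a probability each. The index contents, the names `INDEX`, `Spot` and `keyword_spot`, and the threshold values are all hypothetical; this is not the PRHLT implementation, which works on far richer statistical models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spot:
    page: str
    position: int
    probability: float


# Toy index (invented data): page -> word positions -> {candidate reading: probability}
INDEX = {
    "page_001": [
        {"the": 0.97, "tho": 0.03},
        {"greatest": 0.81, "greater": 0.15},
        {"happiness": 0.62, "happens": 0.30},
    ],
    "page_002": [
        {"pleasure": 0.55, "pleasant": 0.40},
        {"harshness": 0.70, "happiness": 0.12},
    ],
}


def keyword_spot(index, query, threshold=0.5):
    """Return every position where the query is a candidate reading with
    probability >= threshold, most confident spots first."""
    query = query.lower()
    spots = [
        Spot(page, position, readings[query])
        for page, words in index.items()
        for position, readings in enumerate(words)
        if readings.get(query, 0.0) >= threshold
    ]
    return sorted(spots, key=lambda spot: spot.probability, reverse=True)


# A low threshold also surfaces the unlikely reading on page_002.
for spot in keyword_spot(INDEX, "happiness", threshold=0.1):
    print(spot)
```

In this sketch, a search over only the single most likely transcript would miss the second page, because its best reading there is a different word. Lowering the threshold finds more genuine occurrences at the cost of more false alarms, which is the trade-off behind the precision figures quoted for this kind of search.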

The result is that this vast collection of Bentham’s papers can be efficiently searched, including those papers that have not yet been transcribed! The accuracy rates are impressive. The spots suggest around 84-94% accuracy (6-16% Character Error Rate) when compared with manual transcriptions of Bentham’s manuscripts. More precisely, laboratory tests show that the average search precision for a word ranges from 79% to 94%. This means that, out of 100 average search results, as few as 6 (and at most about 21) may fail to actually be the words searched for. The accuracy of spotted words depends on the difficulty of Bentham’s handwriting – although it is possible to find useful results in Bentham’s scrawl! There could be as many as 25 million words waiting to be found.

A search for the word ‘happiness’ uncovers Bentham’s most famous phrase, written in his own hand.

Use cases

This fantastic site will be invaluable to anyone interested in Bentham’s philosophy.  It will help Bentham Project researchers to find previously unknown references in pages that have not yet been transcribed.  It will allow researchers to quickly investigate Bentham’s concepts and correspondents.  It should also help volunteer transcribers in the Transcribe Bentham initiative to find interesting material to transcribe.

This interface is a prototype beta version.  In the future, there are plans to increase the power of this research tool by connecting it to other digital resources, allowing users to quickly search the manuscripts at the UCL library repository, the Bentham papers database and the Transcribe Bentham Transcription Desk, and linking these images to rich existing metadata.

Feedback on this new search functionality is welcomed at: transcribe.bentham@ucl.ac.uk

Similar Keyword Spotting technology (based on research by the CITlab team at the University of Rostock, another one of the READ project partners) is currently available to all users of the Transkribus platform.  Find out more about how to get started with Keyword Spotting.

+ New to Transkribus? Master the platform in just 10 steps

Maybe you’ve just discovered Transkribus and are feeling a bit overwhelmed?  Our updated video should help you get to grips with working with our Handwritten Text Recognition technology – in just 10 steps (and under 4 minutes!).

You can find more detailed information about working with our platform in our How to Guides.

While you’re on the Transkribus YouTube channel, check out our other videos too – including presentations from Transkribus users at the 2017 Transkribus User Conference.