+ Wandering around baroque Naples – The Pandetta project by ilCartastorie

by Sergio Riolo, il Cartastorie 

The Historical Archives of the Banco di Napoli are among the most important archives in the world. They hold the documentation of the eight ancient Neapolitan public banks, which were established between 1539 and 1640 and were later merged to create the Banco delle Due Sicilie (1809) and, after the political unification of Italy, the Banco di Napoli (1861). The Fondazione Banco di Napoli and its museum-foundation ilCartastorie are the keepers of this huge treasure, which fills three hundred rooms in Palazzo Ricca, in the centre of Naples. The documentation features remarkably homogeneous handwriting, thanks to the schools of writing that existed in each bank over the centuries.

To preserve its archive and make it more visible through new media, ilCartastorie started a digitisation programme using the Transkribus platform, through which all the names of bank clients from 1573 to 1600, for each bank operating at that time, will be made more accessible and searchable.

The whole archive, from 1539 to 1900, contains more than three thousand client ledgers, called ‘pandettas’, containing an estimated total of seventeen million names. It is an astonishingly well-organised and preserved database of people and organisations which is highly important for scholars, researchers, genealogists, and citizens.

The Foundation and its museum began their move towards mass digitisation and Handwritten Text Recognition (HTR) by choosing a specific segment of the four-century timeline of this documentation: from the founding of the first bank to the dawn of the seventeenth century, a total of two hundred and forty thousand names split across sixty-three archival units.

A team of six people is now working with Transkribus on this data accessibility project. We have already completed a first trial run, training an HTR model on ten thousand words, including names, surnames and account numbers. This first ‘beta’ model reached a satisfactory Character Error Rate (CER) of 13% within one month, and it is now helping us to deal with the other pandettas, accelerating transcription and therefore reducing the time needed to complete the work.
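For readers less familiar with the metric, CER is simply the character-level edit distance between the automated transcript and a manually corrected reference, divided by the length of the reference. The short sketch below is a minimal illustration of how such a figure can be computed – it is not the Transkribus implementation, and the example names are invented:

def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def character_error_rate(hypothesis: str, reference: str) -> float:
    """CER = edit distance / length of the reference transcript."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

# Invented example: two missed characters in a 25-character name -> prints 0.08,
# i.e. an 8% CER. A 13% CER means roughly 13 wrong characters per 100.
print(character_error_rate("Giovani Batista de Rosa", "Giovanni Battista de Rosa"))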

The first pandetta from the Banco di Ave Gratia Plena, with its three thousand names, was finished last week and the second is proceeding well. We hope to complete all four of the client ledgers written in this hand and then proceed with a second model, in order to deal with the rest of Ave Gratia Plena’s ledgers dating up to the 1600s, before the end of January 2019.

A second phase of the project will connect the names in the pandettas with the valuable reasons for payment recorded in another type of document. It is our hope that you will then be able to discover the daily business and economic life of thousands of citizens in baroque Naples.

+ Update on table processing

Back in April we appealed for help in generating a new data set that could be used to improve the automated layout analysis of historical documents set out in tables.  We asked, and you answered!

Thanks to submissions from our network, READ researchers at the Computer Vision Lab at the Technical University of Vienna, Naver Labs Europe and the Passau Diocesan Archives have been compiling a sizeable collection of images of historical documents containing tables.

We now have a total of around 1,500 images from 25 contributors all around the world.  The submitted sources show a great variety of tables, from hand-drawn accounting books to stock exchange lists and train timetables, from record books to prisoner lists, simple tabular prints in books, production censuses and many, many more.

READ researchers are preparing the data set as the basis for a computer science research competition in early 2019 (more details coming soon!).  This collection will be used to evaluate different approaches to the automated recognition of tables.

There is still a lot for us to learn about what constitutes a table.  Working with this heterogeneous data should help us to move beyond the specifics and come up with some generic guidelines and techniques for processing these kinds of pages.

We are very thankful to our network for delivering such a variety of tabular data and we look forward to sharing our next progress report!

Screenshot of 1937 Irish Census in Transkribus.  Image courtesy of National University of Ireland, Galway.

+ More than 15,000 Transkribus users!

Drumroll please!  Transkribus now has more than 15,000 users!  Our users are based mainly in Europe but also extend into Africa, Australia, America and other parts of the globe.

This expansion of our user-base is a significant achievement for the READ project.  Back when the project started in January 2016, there were only 2828 registered Transkribus users.   And a broad user network is very important for us.  By working with an enormous variety of documents provided by different researchers, projects and institutions, we are developing robust Handwritten Text Recognition technology that can cope with all sorts of scripts.

So we look forward to collaborating with lots more new users in 2019 and beyond!  And if you haven’t tried out Transkribus yet, why not have a go?

+ Transkribus on Euronews TV

Check us out – we’re on TV again!  EuroNews TV, a leading 24-hour information channel, has produced a short documentary film featuring READ which sheds light on the latest research in Handwritten Text Recognition.

The film is a co-production between EuroNews and the European Commission.  It is being aired in 10 languages on the award-winning Futuris programme on European science, research and innovation and should hopefully be seen by 430 million households in 130 countries!

+ Experiments with Transkribus and early printed text

We love hearing what our users have been getting up to with our Transkribus platform for Handwritten Text Recognition.

Annika Rockenberger from the National Library of Norway has written a blog about her experiments with Transkribus as part of her work on a digital edition of the writings of the German journalist, historian and poet Georg Greflinger (1620-1677).

Annika is working with early printed text which cannot be adequately recognised with OCR.  She explains that Transkribus users can train a model to recognise this kind of printed text, with around 5000 words of transcribed material.

Unfortunately, in this case digitised images from tightly bound books have made it difficult for the program to detect the location of text on a page.  Annika hopes to continue her experiments with Transkribus at a later date with better quality images.  Read more on the Greflinger Digital Edition blog.

+ Searching Jeremy Bentham’s manuscripts with Keyword Spotting

The Bentham Project has been experimenting with the Handwritten Text Recognition (HTR) of Bentham’s manuscripts for the past five years, first as a partner in the tranScriptorium project and now as part of READ.

Read about their progress with HTR and our Transkribus platform in blog posts from June 2017 and  February 2018.

Keyword Spotting

The results have thus far been impressive, especially considering the immense difficulty of Bentham’s own handwriting.  But automated transcription is not yet at a point where it is sufficiently accurate to be used by Bentham Project researchers as a basis for scholarly editing.

However, the current state of the technology is strong enough for keyword searching!  And thanks to a collaboration with the PRHLT research centre at the Universitat Politècnica de València (another partner in the READ project), there are some exciting new results to report.  It is now possible to search over 90,000 digital images of the central collections of Bentham’s manuscripts, which are held at University College London Special Collections and the British Library.

A Keyword Spotting search for the word ‘pleasure’

Appeal for volunteers!

A Google sheet has been prepared with some suggested search terms in 5 different spreadsheet tabs (Bentham’s neologisms, concepts, people, places and other).  The Bentham Project is appealing for people to record their searches online, using the suggested search terms and some new ones too.  Some of the results will be shared at the upcoming Transkribus User Conference in November.

Background

The PRHLT team have processed the Bentham papers with cutting-edge HTR and probabilistic word indexing technologies. This sophisticated form of searching is often called Keyword Spotting. It is more powerful than a conventional full-text search because it uses the statistical models trained for text recognition to search through probability values assigned to character sequences (words), taking into account the most plausible readings of each word on a page.

The result is that this vast collection of Bentham’s papers can be efficiently searched, including those papers that have not yet been transcribed! The accuracy rates are impressive. The spots suggest around 84-94% accuracy (6-16% Character Error Rate) when compared with manual transcriptions of Bentham’s manuscripts. More precisely, laboratory tests show that average search precision per word ranges from 79% to 94%. This means that, out of 100 search results, on average somewhere between 6 and 21 may fail to actually be the word searched for. The accuracy of spotted words depends on the difficulty of Bentham’s handwriting – although it is possible to find useful results even in Bentham’s scrawl! There could be as many as 25 million words waiting to be found.
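To make the idea more concrete, here is a minimal, purely illustrative sketch of how a probabilistic word index can be queried. The page identifiers, candidate readings, probabilities and the keyword_spot function are all invented for illustration and do not reflect the PRHLT or Transkribus interfaces:

from dataclasses import dataclass

@dataclass
class Spot:
    page_id: str
    position: int      # word position on the page (hypothetical index)
    confidence: float  # probability that this position reads as the query

# Hypothetical index: each word position on a page keeps several candidate
# readings with probabilities, rather than a single fixed transcript.
index = {
    "example-page-001": [
        {"pleasure": 0.91, "pressure": 0.05},
        {"pain": 0.88, "gain": 0.07},
    ],
    "example-page-002": [
        {"happiness": 0.64, "happened": 0.21},
    ],
}

def keyword_spot(query: str, threshold: float = 0.3) -> list[Spot]:
    """Return every word position whose probability of reading as the query
    exceeds the threshold, ranked from most to least confident."""
    hits = []
    for page_id, words in index.items():
        for pos, candidates in enumerate(words):
            p = candidates.get(query, 0.0)
            if p >= threshold:
                hits.append(Spot(page_id, pos, p))
    return sorted(hits, key=lambda s: s.confidence, reverse=True)

for spot in keyword_spot("pleasure"):
    print(spot)

Because the index stores confidence scores for many plausible readings of each word, even pages that have never been transcribed manually can be retrieved in this way.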

A search for the word ‘happiness’ uncovers Bentham’s most famous phrase, written in his own hand.

Use cases

This fantastic site will be invaluable to anyone interested in Bentham’s philosophy.  It will help Bentham Project researchers to find previously unknown references in pages that have not yet been transcribed.  It will allow researchers to quickly investigate Bentham’s concepts and correspondents.  It should also help volunteer transcribers in the Transcribe Bentham initiative to find interesting material to transcribe.

This interface is a prototype beta version.  In the future, there are plans to increase the power of this research tool by connecting it to other digital resources, allowing users to quickly search the manuscripts at the UCL library repository, the Bentham papers database and the Transcribe Bentham Transcription Desk, and linking these images to rich existing metadata.

Feedback on this new search functionality is welcomed at: transcribe.bentham@ucl.ac.uk

Similar Keyword Spotting technology (based on research by the CITlab team at the University of Rostock, another one of the READ project partners) is currently available to all users of the Transkribus platform.  Find out more about how to get started with Keyword Spotting.

+ New to Transkribus? Master the platform in just 10 steps

Maybe you’ve just discovered Transkribus and are feeling a bit overwhelmed?  Our updated video should help you get to grips with working with our Handwritten Text Recognition technology – in just 10 steps (and under 4 minutes!).

You can find more detailed information about working with our platform in our How to Guides.

While you’re on the Transkribus YouTube channel, check out our other videos too – including presentations from Transkribus users at the 2017 Transkribus User Conference.

+ Sharing data with Transkribus – Transcribimus and minutes of Vancouver City Council

We can all agree that it’s nice to share – and in the READ project, sharing data brings direct benefits for the Handwritten Text Recognition technology in our Transkribus platform.  According to principles of machine learning, the more images and transcripts that are submitted to us as training data, the stronger the Handwritten Text Recognition technology can become.  Images and transcripts are not publicly shared but they contribute to a general improvement in the technology behind the scenes.

Transcribimus is a community project based in Vancouver, Canada, with a sizeable collection of transcripts which it will be using to train a Handwritten Text Recognition model.

Transcribimus all started when Sam Sullivan, former mayor of Vancouver, started to research the City Council minutes from the late nineteenth century with a view to exploring the achievements of Vancouver’s second mayor, David Oppenheimer.  Sam’s physical limitations prevented him from visiting the archives as often as he would have liked.  So he formed a partnership with Margaret Sutherland, a local retiree who had experience of genealogy and reading old handwriting.  Margaret began transcribing and digitising the minutes for Sam and was gradually joined by other volunteer transcribers including Christopher Stephenson, a graduate student in Library and Archival studies who provided lots of assistance.  Transcribimus eventually became an online platform where more than 20 volunteers have transcribed some 3,500 pages of handwritten minutes.

Image from the City Council Minutes. City of Vancouver Archives, VMA 23-5 page 214. Image credit: Margaret Sutherland.

These transcriptions are already freely available on the Transcribimus website.  The City of Vancouver Archives will ultimately display the images and transcripts on their website too.

The vast majority of the minutes are written in one hand, so these images and transcripts will likely feed into a strong Handwritten Text Recognition model that produces useful transcripts of the collection. Transcribimus volunteers could then check and correct any errors in these automated transcripts – and the transcription of the City Council minutes should hopefully be realised more quickly!

  • Do you have existing transcriptions that you have produced or collected as part of a research project?  Ideally 500 pages or more…
  • Send them to us and we can process them and train a model to recognise the writing in your documents!
  • To find out more about working with existing transcripts, consult our How to Guide or contact us.

+ Learn more about Transkribus in Zagreb

Join us for an event in the Croatian capital of Zagreb on Thursday 18th October.

The event is hosted by ICARUS Croatia and the Faculty of Philosophy at the University of Zagreb.

There will be a morning of lectures from READ project researchers based at the University of Innsbruck and University College London, which will explain the workings of Transkribus and the possibilities of Handwritten Text Recognition for different kinds of historical documents.

There are also limited spaces for a Transkribus workshop, where participants will be able to learn tips and tricks for working directly with the platform.

To enquire about attending the Zagreb event, please email Vlatka Lemić (vlemic@arhiv.hr).

+ Join us for Vienna Scanathon at the Austrian Academy of Sciences

Digitising historical documents? There’s an app for that!

Join us in Vienna for our next Scanathon event, hosted by the Austrian Academy of Sciences and the Austrian Centre for Digital Humanities.

Screenshot of DocScan app

Participants will have the opportunity to test out the DocScan mobile app and the ScanTent device, new tools which facilitate the digitisation of historical documents with a mobile phone.

The event will take place on the afternoon of Wednesday 7 November.  Attendance is free and open to all – but registration is required:

The Scanathon could be an ideal pre-conference activity for anyone attending the 2018 Transkribus User Conference, also in Vienna.

We ask that attendees bring their smartphone to the event so they can work with the tools. The DocScan app is currently only available on Android phones.

Participants are also invited to bring their own documents to digitise during the event.

DocScan and the ScanTent are being developed by one of the READ project partners, the Computer Vision Lab at the Technical University of Vienna.