ATAPY converts the entire Danish Classic Literature Canon into XML
Det Kongelige Bibliotek (The Royal Library), Copenhagen, and Arkiv for Dansk Litteratur (Archive for Danish Literature)
The Royal Danish Library in Copenhagen has the largest book collection in Northern Europe and strives to facilitate access to its resources using advanced technologies. As part of this effort, the Library launched an ambitious project, "Danish Literature Archive", to convert the entire Danish literary canon (the works of 70 selected Danish authors from the XIth to the early XXth century) into machine-readable text, specifically XML, so that the works would be available on the web.
The large number of books, their diverse content (verse, prose, pictures, tables, notes and comments) and the preservation requirements for layout and typesetting made this an especially challenging job. To make it possible, the contractor would have to possess seemingly incompatible qualities. On the one hand, the company had to be competent with modern Optical Character Recognition packages, proficient in XML coding and capable of designing specialized software instruments to facilitate the conversion process. This required high IT qualifications and extensive hands-on experience in data capture technologies. On the other hand, almost all real-life mass data input projects still involve a lot of manual labor. No matter how accurate an OCR system is, it will make mistakes - especially when working with such difficult material as old books with complex layouts. Another issue with the Library’s material was that full automation of XML coding was not possible because of the diversity of attributes. The contractor had to be able to provide many qualified operators at a reasonable price or the project cost would exceed the financial capability of any library.
The Library's IT staff looked for a partner outside the EU to solve these problems. They were attracted to Russia because it is the home of ABBYY FineReader, the world-renowned OCR system. Following a trial process lasting several months, the Library selected ATAPY Software, a leading developer of custom OCR solutions based on FineReader technologies and an experienced media service provider. The pilot projects demonstrated that ATAPY combined high IT professionalism with access to an extensive pool of qualified multilingual operators.
The book conversion process was organized into three key phases:
1. Reading scanned images into text format.
The Library provided ATAPY with scanned pages in TIFF format. The quality of the images was remarkably good, which contributed significantly to the efficiency of the remaining stages. ABBYY FineReader automatically analyzed the images, segmenting them to distinguish text from pictures and to reveal table structure. Layout operators reviewed the segmentation results. Pages were then recognized using FineReader's omnifont capabilities, augmented with many font-specific patterns, which raised the recognition quality for most of the old books. Finally, a group of operators proofread the OCR results. Special attention was given to non-Danish texts, as some of them could not be OCRed at all (Ancient Greek, Hebrew, etc.).
2. Preparation of initial XML documents.
Verified text was exported to the Microsoft® Word format. XML operators, armed with an arsenal of custom tools and macro programs, used Microsoft® Word as the environment for adding XML tags. This was a meticulous task, since the full list of tags contained over 50 entries and only half of them could be identified automatically. The remaining half had to be spotted and marked manually by operators working through the Danish text.
3. Assembly of book XML files.
Once markup was finished, XML specialists assembled the books, adding supplementary "entire-book" tags and bibliographic information.
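The assembly phase can be illustrated with a minimal Python sketch. It is not ATAPY's actual tooling, and the project's tag set is not documented here: the element names (`book`, `bibl`, `body`, `chapter`), the function `assemble_book` and the sample title and author are all hypothetical placeholders. The sketch only shows the general idea of wrapping already-tagged fragments in a book-level element that carries bibliographic information.

```python
import xml.etree.ElementTree as ET


def assemble_book(title, author, year, fragments):
    """Wrap tagged fragments in a book-level element with bibliographic data.

    All element names are hypothetical; the real project used its own tag set.
    """
    book = ET.Element("book")

    # Bibliographic information added at the "entire-book" level.
    bibl = ET.SubElement(book, "bibl")
    ET.SubElement(bibl, "title").text = title
    ET.SubElement(bibl, "author").text = author
    ET.SubElement(bibl, "date").text = str(year)

    # Each fragment is an XML string produced by the markup phase.
    # ET.fromstring raises ParseError if a fragment is not well-formed,
    # so markup mistakes surface before the book file is written.
    body = ET.SubElement(book, "body")
    for fragment in fragments:
        body.append(ET.fromstring(fragment))
    return book


# Placeholder fragments standing in for the output of the markup phase.
fragments = [
    "<chapter n='1'><p>Første kapitel.</p></chapter>",
    "<chapter n='2'><lg><l>En verslinje</l></lg></chapter>",
]
book = assemble_book("Eksempel", "Forfatter", 1850, fragments)
xml_text = ET.tostring(book, encoding="unicode")
```

Parsing every fragment before appending it is a deliberate choice in this sketch: a single unclosed tag from the manual markup step then stops the assembly of that book instead of producing a malformed file.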
Being both a software company and a media service company, ATAPY was able to assign experienced customization engineers and develop project-specific utilities for every conversion phase. This allowed ATAPY to reduce processing time by 10 to 20% as the project evolved and to pass the savings on to the client. The project was successfully completed and all the books are available online at http://www.adl.dk.
After years in media services, ATAPY has built its expertise by working with texts of widely varying layouts, structures and languages. These have included library cards, encyclopedia articles, magazine publications, rarities dating back to the XIXth century and other materials representing all genres and formats. In addition to the Royal Danish Library, ATAPY's Media Service clients include Springer Publishing house (Germany), University of Innsbruck (Austria), J.B. Metzler Verlag (Germany), EasyData B.V. (Netherlands), Consodata (France) and PRNet (Turkey), among other institutions and companies. ATAPY uses a data capture process that relies on both IT infrastructure and human resources. High-speed, high-quality processing of multi-language material, client communication in four languages and very affordable pricing are ATAPY's trademarks, applied to every contract, big or small.
"Working with ATAPY has been a pleasure. We were impressed with the degree of attention paid to producing the best possible text of the works and the accuracy of the results."
Virginia Laursen, Webmaster
Royal Danish Library
About the Royal Danish Library
The Royal Library in Copenhagen is the national library of Denmark and the largest library in the Nordic countries. It contains numerous historical treasures, and all works printed in Denmark since the XVIIth century are available there. Thanks to past donations, the library also houses most of the known Danish printed works since the first Danish book was printed in 1482.
More information about ATAPY Media Services: