ATAPY Software - OCR, Document Imaging, Document Management, Data Capture, Data Conversion
Services and Solutions for Document Management

The Royal Danish Library in Copenhagen and Arkiv for Dansk Litteratur

ATAPY Software converted the entire Danish Classic Literature Canon into XML


‘Working with ATAPY has been a pleasure. We have been impressed with the high level of concern for producing the best possible text of the works and the accuracy of the results.’

Virginia Laursen, Webmaster
Royal Danish Library
Copenhagen, Denmark

The Royal Danish Library in Copenhagen has the largest book collection in Northern Europe and strives to facilitate access to its resources by using advanced technologies. It undertook an ambitious project named ‘Arkiv for Dansk Litteratur’, aimed at converting the whole Danish literary canon (the works of 70 carefully selected Danish authors from the 11th to the early 20th century) into machine-readable text in XML format and making it available on the Web.

The great number of books, their diverse contents (verse, prose, pictures, tables, notes and comments), and the requirement to preserve layout and typesetting made this project a very special task. In order to succeed, the contractor had to possess some seemingly incompatible qualities. On the one hand, it had to be highly competent in modern Optical Character Recognition technologies, proficient in XML coding, and capable of designing specialized software tools that facilitate the conversion process. This requires strong IT qualifications and extensive hands-on experience in automated data input. On the other hand, almost all real-life mass data input projects still entail a lot of manual labor. No matter how accurate an OCR system is, it will make mistakes, especially on material as difficult as old books with complex layouts. Moreover, in most cases XML coding cannot be fully automated because of the diversity of the attributes. Therefore, the contractor had to be able to offer many qualified operators at a reasonable cost, or the project's price tag would exceed the budget of any library.

The Library's IT staff approached this problem by searching for a partner outside the EU. Their attention was drawn to Russia, the home of the world-renowned OCR system ABBYY FineReader. Following months of trials, the Library settled on ATAPY Software, a leading developer of custom OCR solutions based on FineReader technologies and an experienced media service provider. The pilot projects demonstrated that ATAPY combined high IT professionalism with access to an extensive pool of qualified multilingual human resources.

The book conversion process consisted of three main phases:

1. Converting scanned images into text. The Library provided ATAPY with scanned pages in TIFF format. The quality of the images was remarkably good, which contributed significantly to the efficiency of the remaining stages. ABBYY FineReader automatically analyzed (‘segmented’) the images to distinguish text from pictures and to detect table structure. Segmentation results were reviewed by layout operators. After that, pages were recognized using FineReader’s omnifont capabilities, augmented with many font-specific patterns that raised the recognition quality for the majority of old books. Then a group of operators thoroughly proofread the OCR results. Special attention was paid to non-Danish inclusions, some of which could not be OCRed at all (Ancient Greek, Hebrew, etc.).

2. Preparing the initial XML document. The verified text was exported to Microsoft® Word format. A group of XML operators, equipped with customized tools and macro programs, used Word as an environment for adding XML tags. This was a very challenging task, since the full list of tags contained over 50 entries and only half of them could be identified automatically. The remaining half had to be spotted and marked manually, and all of this had to be done on Danish-language text.

3. Assembling the XML file. Once markup was finished, the XML specialists assembled the books, adding supplementary ‘entire-book’ tags and bibliographic information.
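The second and third phases can be illustrated with a minimal Python sketch. It is not ATAPY's actual tooling, which was built around Microsoft Word macros. It only shows the general idea of tagging what simple rules can recognize, flagging the rest for a human operator, and then wrapping the result in book-level tags with bibliographic information. The tag names (book, bibl, head, p), the needs_review attribute, the heading rule and the sample text are all hypothetical placeholders, not the project's real tag set or data.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule: a line such as "I." or "Chapter 3" is treated as a heading.
HEADING = re.compile(r"^(?:[IVXLC]+\.|Chapter \d+)$")


def tag_lines(lines):
    """Turn proofread text lines into XML elements using a simple rule."""
    elements = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if HEADING.match(text):
            el = ET.Element("head")
        else:
            # No rule matched: emit a plain paragraph and flag it so that
            # an operator can check whether a more specific tag applies.
            el = ET.Element("p", {"needs_review": "yes"})
        el.text = text
        elements.append(el)
    return elements


def assemble_book(title, author, year, elements):
    """Wrap tagged content in book-level tags plus bibliographic data."""
    book = ET.Element("book")
    bibl = ET.SubElement(book, "bibl")
    ET.SubElement(bibl, "title").text = title
    ET.SubElement(bibl, "author").text = author
    ET.SubElement(bibl, "date").text = str(year)
    body = ET.SubElement(book, "body")
    body.extend(elements)
    return book


# Placeholder input standing in for proofread OCR output.
lines = ["I.", "Første afsnit.", "", "Andet afsnit."]
book = assemble_book("Example Title", "Example Author", 1835, tag_lines(lines))
xml_text = ET.tostring(book, encoding="unicode")
print(xml_text)
```

In a real project the rule set would be far larger, and the flagged elements would be routed to operators instead of being left in the output.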

Being a software company as well as a media service company allowed ATAPY to assign experienced customization engineers to design and improve project-specific utilities for every conversion phase. As the project progressed, this allowed ATAPY to reduce processing time by a further 10 to 20% and pass the savings on to the client. All finished books are currently available online at http://www.adl.dk.

After two years of working in the area of media service, ATAPY became a true expert in this field, having dealt with texts of different layouts, structures and languages. Those texts include library cards, encyclopedia articles, magazine publications, rarities that date back to the 19th century, and other materials of all genres and formats. In addition to the Royal Danish Library, ATAPY’s media service clients include Springer-Verlag (Germany), University of Innsbruck (Austria), J. B. Metzler Verlag (Germany), EasyData B. V. (Netherlands), Consodata (marketing research company; France), PRNet (media clipping company; Turkey), and other institutions and companies. As a result, ATAPY possesses a processing system that is highly effective both technically and logistically. High-speed, high-quality multi-language processing and client communication in 4 languages, at a very affordable price, are ATAPY’s trademarks, which it continues to exhibit in every contract, no matter how big or small.

About the Royal Danish Library

The Royal Library in Copenhagen is the national library of Denmark and the largest library in the Nordic countries.

It contains numerous historical treasures; all works that have been printed in Denmark since the 17th century are deposited there. Thanks to extensive donations in the past, the library holds nearly all known Danish printed works back to the first Danish book, printed in 1482.

ATAPY Software

All rights reserved © 2001-2010