FineClip 1.0: The Intelligent Solution for Press Clipping
FineClip digitizes press clippings in a single operator-assisted process. It accepts a variety of source formats, including graphic image formats, image-only PDF, and text PDF.
The Application features automatic segmentation of pages into separate articles, preservation of an article’s logical structure, and automated text extraction.
FineClip makes it possible to create digital copies of articles, with metadata, in a searchable format that preserves the structure and content of the original source.
It was designed specifically for media monitoring and information agencies.
FineClip reduces the time it takes to digitize articles through its automatic analysis, segmentation, and recognition functions.
Convenient, functional tool
FineClip has a simple, user-friendly interface that provides the operator with a wide range of options for editing articles and providing metadata.
High-quality data output
FineClip produces digital copies of articles in the most widely used formats (PDF, HTML, XML) while retaining the logical structure and content of the source. The output can then be searched for the required information according to set parameters.
The Application captures the full text of articles through:
- OCR of scanned printed material in any accepted graphic format, including image-only PDF;
- Extraction of the text layer from text PDF files.
The Application analyzes the page layout and the logical structure of articles and:
- Automatically divides the page contents into three types of blocks: text blocks, picture blocks, and table blocks. The block type determines how the block is processed: text blocks are recognized (for text PDF, the text is extracted), picture blocks are kept as is, and table blocks are recognized (for text PDF, the text is extracted) and presented as a table with the table structure and layout fully preserved. The operator can later re-define any block type manually if it was detected incorrectly;
- Automatically detects article titles (defined by their font characteristics);
- Segments the page into separate articles based on customizable markers for the beginning and ending of the article.
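The layout analysis described above can be illustrated with a minimal Python sketch. It is not FineClip's implementation: the `Block` fields, the `is_title` rule (a large, bold text block), and the choice of a title as the only "beginning of article" marker are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class BlockType(Enum):
    TEXT = "text"        # recognized by OCR or extracted from the PDF text layer
    PICTURE = "picture"  # kept as is
    TABLE = "table"      # recognized and presented as a table


@dataclass
class Block:
    kind: BlockType
    content: str = ""      # recognized or extracted text; empty for pictures
    font_size: float = 0.0
    bold: bool = False


def is_title(block, min_size=14.0):
    """Hypothetical title rule based on font characteristics."""
    return block.kind is BlockType.TEXT and block.bold and block.font_size >= min_size


def segment_into_articles(blocks):
    """Group blocks into articles, starting a new article at every title.

    Blocks that appear before the first title form an article of their own.
    """
    articles = []
    for block in blocks:
        if is_title(block) or not articles:
            articles.append([])
        articles[-1].append(block)
    return articles
```

In this sketch the title threshold `min_size` plays the role of a customizable marker; a fuller version would also accept a marker for the end of an article.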
The press clipping operator works in a user-friendly interface that shows the source image of the article on the newspaper page and the recognized article on adjacent, synchronized screens. The operator loads an article that has already been automatically segmented and recognized.
The user interface allows the operator to:
- Configure the general processing settings:
  - recognition language;
  - title format;
  - export format;
- View page segmentation results (all articles and blocks belonging to each article);
- Delete articles from the selection;
- Add or delete blocks, change the order of blocks in the article or move blocks between articles if there has been a segmentation error;
- Create a new article by adding blocks from other articles (merging several articles together);
- Re-recognize an article (or any element of it) with different processing settings;
- Re-define the block type (for example, if a text block was marked as a picture) and re-recognize the block;
- Zoom in and out on specific objects on the page;
- Manually draw a block of the required type on a page (if the system repeatedly fails to segment the article correctly) and re-recognize the blocks;
- Manually correct errors in recognized text;
- Export clipping results.
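The block-editing operations in the list above (moving a block between articles, merging articles) can be pictured as simple list manipulations. The sketch below is an illustration only: it assumes each article is a Python list of blocks, and the function names `move_block` and `merge_articles` are hypothetical, not part of FineClip.

```python
def move_block(articles, src, block_index, dst, position=None):
    """Move one block from article `src` to article `dst`.

    By default the block is appended; `position` places it elsewhere,
    which also covers changing the order of blocks.
    """
    block = articles[src].pop(block_index)
    target = articles[dst]
    target.insert(len(target) if position is None else position, block)


def merge_articles(articles, first, second):
    """Replace two articles with a new one holding the blocks of both."""
    if first == second:
        raise ValueError("cannot merge an article with itself")
    merged = articles[first] + articles[second]
    # Delete the higher index first so the lower one stays valid.
    for index in sorted((first, second), reverse=True):
        del articles[index]
    articles.append(merged)
    return merged
```

For example, if a paragraph was attached to the wrong article during segmentation, `move_block(articles, 0, 2, 1)` moves the third block of the first article to the end of the second.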
For each article, the operator manually specifies the output metadata: source metadata (news source, date) and article metadata (title).
Export of the articles
The Application exports articles to all supported file formats, including the most popular: text-under-image PDF, HTML, and XML. In all cases, the export results fully preserve (or reconstruct) the layout and the logical structure of the source article.
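As an illustration of what an exported article with metadata might look like, the following sketch builds a small XML document with Python's standard library. The element names (`article`, `metadata`, `title`, `source`, `date`, `body`, `p`) are assumptions chosen for the example; FineClip's actual export schema is not described here.

```python
import xml.etree.ElementTree as ET


def article_to_xml(title, source, date, paragraphs):
    """Serialize one article and its metadata to an XML string.

    The schema is hypothetical and only shows the idea of keeping
    metadata and logical structure together in the output.
    """
    article = ET.Element("article")
    meta = ET.SubElement(article, "metadata")
    ET.SubElement(meta, "title").text = title
    ET.SubElement(meta, "source").text = source
    ET.SubElement(meta, "date").text = date
    body = ET.SubElement(article, "body")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    return ET.tostring(article, encoding="unicode")
```

Keeping the title, news source, and date in separate elements is what would let a downstream system search articles "according to set parameters".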
Search and keyword highlighting
If necessary, FineClip can search an article by keyword and highlight the search results. To perform a keyword search, the operator loads a text file containing a list of keywords before processing an article. The article is then automatically analyzed, and the keywords are highlighted in both the operator interface and the export results. The option to add a clipping repository is also available. This makes it possible to perform a quick article search by title or issue, a full-text keyword search across the whole repository, or even an advanced semantic search with the help of ABBYY Compreno technology (licensed separately).
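Keyword highlighting of this kind can be sketched in a few lines of Python. This is a simplified illustration, not FineClip's algorithm: it assumes the keyword file holds one keyword per line, matches whole words case-insensitively, and marks hits with `[[` and `]]`, all of which are choices made for the example.

```python
import re


def load_keywords(file_text):
    """Parse keyword file contents: one keyword per line, blank lines ignored."""
    return [line.strip() for line in file_text.splitlines() if line.strip()]


def highlight(article_text, keywords, left="[[", right="]]"):
    """Wrap every whole-word, case-insensitive keyword match in markers."""
    if not keywords:
        return article_text
    # Try longer keywords first so they win over shorter ones they contain.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: left + m.group(0) + right, article_text)
```

With the keywords `bank` and `rates`, the text "The bank raised rates. Banking news." becomes "The [[bank]] raised [[rates]]. Banking news."; "Banking" is left alone because only whole words match.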