Last modified: June 08, 2010
The toolkit defines a number of free functions that are not image methods. They are defined in ocr_toolkit.py and can be imported in a Python script with
from gamera.toolkits.ocr.ocr_toolkit import *
While the class Page splits the image into Textline objects and possibly classifies the characters, it does not generate an output string. For this purpose, you can use the function textline_to_string.
Returns a unicode string of the text in the given Textline.
Signature:
textline_to_string (textline, heuristic_rules="roman", extra_chars_dict={})
with
- textline:
- A Textline object containing the glyphs. The glyphs must already be classified.
- heuristic_rules:
Depending on the alphabet, some characters can look very similar and need further heuristic rules for disambiguation. An example is the apostrophe and the comma, which have the same shape and differ only in their position relative to the baseline.
When set to "roman", several rules specific to Latin alphabets are applied.
- extra_chars_dict:
- A dictionary of additional translations of classnames to character codes. This is necessary when you use class names that are not unicode names. Will be passed to return_char.
As this function uses return_char, the class names of the glyphs in textline must correspond to unicode character names, as described in the documentation of return_char.
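The apostrophe/comma case mentioned above can be illustrated with a minimal sketch. This is not the rule set that textline_to_string actually applies for "roman"; the function name comma_or_apostrophe, its parameters, and the "more than one glyph height above the baseline" criterion are assumptions chosen for illustration. Coordinates follow the usual image convention in which y grows downward.

```python
def comma_or_apostrophe(top, bottom, baseline):
    """Guess whether a comma-shaped glyph is a comma or an apostrophe.

    Hypothetical helper, not part of the toolkit. top and bottom are the
    first and last pixel rows of the glyph, baseline is the row of the
    text line's baseline; y grows downward.
    """
    height = bottom - top + 1
    # Assumed rule: an apostrophe hangs well above the baseline, whereas a
    # comma touches or crosses it.
    if baseline - bottom > height:
        return "apostrophe"
    return "comma"


# Baseline at row 100: one glyph crossing it, one hanging far above it.
low_glyph = comma_or_apostrophe(top=95, bottom=104, baseline=100)
high_glyph = comma_or_apostrophe(top=70, bottom=79, baseline=100)
```

A real implementation would also have to cope with an inaccurate baseline estimate, which is why the sketch uses a tolerance of one glyph height rather than a strict comparison.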
Converts a unicode character name to a unicode symbol.
Signature:
return_char (classname, extra_chars_dict={})
with
- classname:
- A class name derived from a unicode character name. Example: latin.small.letter.a returns the character a.
- extra_chars_dict:
- A dictionary of additional translations of classnames to character codes. This is necessary when you use class names that are not unicode names. The character 'code' does not need to be an actual code, but can be any string. This can be useful, e.g. for ligatures:
return_char(glyph.get_main_id(), {'latin.small.ligature.st':'st'})
classname must correspond to the standard unicode character names, as in the examples of the following table:
Character | Unicode Name | Class Name |
---|---|---|
! | EXCLAMATION MARK | exclamation.mark |
2 | DIGIT TWO | digit.two |
A | LATIN CAPITAL LETTER A | latin.capital.letter.a |
a | LATIN SMALL LETTER A | latin.small.letter.a |
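
The mapping in the table can be reproduced with Python's standard unicodedata module. The following is a minimal sketch of the idea, not the toolkit's implementation of return_char: the name classname_to_char is hypothetical, and the conversion (dots replaced by spaces, result looked up as a unicode character name) is inferred from the table above.

```python
import unicodedata


def classname_to_char(classname, extra_chars_dict=None):
    """Translate a dotted class name into a character (hypothetical helper)."""
    extra = extra_chars_dict or {}
    # User-supplied translations take precedence; the value can be any string.
    if classname in extra:
        return extra[classname]
    # "latin.small.letter.a" -> "LATIN SMALL LETTER A"
    unicode_name = classname.replace(".", " ").upper()
    # unicodedata.lookup raises KeyError for names that are not in Unicode.
    return unicodedata.lookup(unicode_name)


letter = classname_to_char("latin.small.letter.a")
ligature = classname_to_char("latin.small.ligature.st",
                             {"latin.small.ligature.st": "st"})
```

Class names that are not unicode names and have no entry in the dictionary raise a KeyError in this sketch; how return_char itself reacts in that case is not stated here.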
Groups the given glyphs to words based upon the horizontal distance between adjacent glyphs.
with
- glyphs:
- A list of Cc data types, each of which represents a character. All glyphs must stem from the same single line of text.
- threshold:
- Horizontal white space greater than threshold will be considered a word-separating gap. When None, the threshold value is calculated automatically as 2.5 times the median white space between adjacent glyphs.
The result is a nested list of glyphs with each sublist representing a word. This is the same data structure as used in Textline.words.
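The grouping rule described above can be sketched in plain Python. The sketch works on simple (left, right) column pairs instead of Cc objects, and the name group_into_words as well as the exact way the gap is measured are assumptions; only the "2.5 times the median white space" default is taken from the description.

```python
from statistics import median


def group_into_words(boxes, threshold=None):
    """Group (left, right) glyph extents of one text line into words."""
    boxes = sorted(boxes)
    if not boxes:
        return []
    # White space between each glyph and its right neighbour (never negative).
    gaps = [max(0, nxt[0] - cur[1]) for cur, nxt in zip(boxes, boxes[1:])]
    if threshold is None:
        # Default from the description: 2.5 times the median gap.
        threshold = 2.5 * median(gaps) if gaps else 0
    words = [[boxes[0]]]
    for gap, box in zip(gaps, boxes[1:]):
        if gap > threshold:
            words.append([box])    # gap is wide enough: start a new word
        else:
            words[-1].append(box)  # same word as the previous glyph
    return words


# Five glyphs; the gap of 12 between the third and fourth separates two words.
line = [(0, 8), (10, 18), (20, 28), (40, 48), (50, 58)]
words = group_into_words(line)
```

With these values the gaps are 2, 2, 12 and 2, so the median is 2 and the automatic threshold is 5; the result is a nested list with one sublist per word.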
These functions are used in the segmentation methods of class Page. You will generally not need to call them, unless you are implementing a custom segmentation method.
Splits image regions representing text lines into characters.
Signature:
get_line_glyphs (image, segments)
with
- image:
- The document image that is to be segmented further. It must contain the same underlying image data as the second argument segments.
- segments:
- A list of Cc data types, each of which represents a text line region. The image views must correspond to image, i.e. each pixel has a value that is the unique label of the text line it belongs to. This is the interface used by the plugins in the "PageSegmentation" section of the Gamera core.
The result is returned as a list of Textline objects.
Returns an RGB image with bounding boxes of the given glyphs as hollow rects. Useful for visualization and debugging of a segmentation.
Signature:
show_bboxes (image, glyphs)
with
- image:
- An image of the text document that is to be segmented.
- glyphs:
- List of rects which will be drawn on image as hollow rects. As all image types are derived from Rect, any image list can be passed.
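What "hollow rects" means can be shown with a small stand-alone sketch that draws only the outline of each bounding box onto a grid. It does not use Gamera and is not how show_bboxes is implemented; the name draw_hollow_rects and the (ul_x, ul_y, lr_x, lr_y) tuples with inclusive corners are assumptions made for this example.

```python
def draw_hollow_rects(height, width, rects, value=1):
    """Return a height x width grid with the outline of each rect set to value."""
    canvas = [[0] * width for _ in range(height)]
    for ul_x, ul_y, lr_x, lr_y in rects:
        for x in range(ul_x, lr_x + 1):
            canvas[ul_y][x] = value  # top edge
            canvas[lr_y][x] = value  # bottom edge
        for y in range(ul_y, lr_y + 1):
            canvas[y][ul_x] = value  # left edge
            canvas[y][lr_x] = value  # right edge
    return canvas


# One box from (1, 1) to (4, 3) on a grid of 5 rows and 6 columns.
canvas = draw_hollow_rects(5, 6, [(1, 1, 4, 3)])
for row in canvas:
    print("".join("#" if pixel else "." for pixel in row))
```

Only the border pixels are set, so the glyph inside each box stays visible when the outlines are drawn over the document image.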