Welcome. Clara OCR is a free OCR program, written for systems supporting the C library and the X Window System. Clara OCR is intended for the cooperative OCR of books. Some screenshots are available at http://www.claraocr.org/. This documentation is extracted automatically from the comments of the Clara OCR source code. It is known as "The Clara OCR Advanced User's Manual" and is currently unfinished. First-time users are invited to read "The Clara OCR Tutorial". Developers must read "The Clara OCR Developer's Guide".
Clara is optical character recognition (OCR) software: a program that tries to identify the graphic images of the characters in a scanned document, converting their digital images to ASCII, ISO or other codes. The name Clara stands for "Cooperative Lightweight chAracter Recognizer". Clara offers two revision interfaces: a standalone GUI and a web interface that can be used by several reviewers simultaneously. Because of this feature Clara is a "cooperative" OCR (it's also "cooperative" in the sense of its free/open status and development model).
For some years now we have tested and used OCR programs, mainly for old books. Popular OCR programs (those bundled with scanners) are useful tools. However, OCR is not a simple task. The results obtained using those programs vary widely depending on the printed document and, for most texts we're interested in, the results are really poor or even unusable. In fact, it's no surprise that many digitization projects prefer not to use OCR, but typists only. For a programmer, it is somewhat intuitive that OCR could achieve good results even from low-quality texts when an ad hoc approach is used, focusing on one specific book (for instance). Within this approach, OCR becomes a matter of finding software adequate for the texts you're trying to OCR, or perhaps developing new software. So a free OCR that is easy to customize (at the source code level) would be a valuable resource for text digitization projects. Dealing with graphics is not among our main occupations, but after analysing many scanned materials, we began to write some simple and specialized recognition tools. More recently (in the third quarter of 1999) a simple X interface linked to a naive bitmap comparison heuristic was written. From that prototype, Clara OCR evolved. Since then, many new ideas from various people have helped to make it better.
It's not a bad idea to enumerate some principles that have driven Clara OCR development. They'll make it easier to understand the features and limitations of the software (these principles may change over time).
1. Clara is an OCR for printed texts, not for handwritten texts.
2. Clara was not designed to OCR one or two single pages, but to OCR a large number of documents with the same graphic characteristics (font, size, etc). So it can take advantage of a fine (and perhaps expensive) training. This will typically be the case when OCRing an entire book.
3. We chose not to support multiple graphic formats directly, but only Jef Poskanzer's raw PBM and PGM. Non-PBM/PGM files will be read through filters.
4. Clara OCR wants to be a tool that makes it viable to accumulate and reuse human revision effort. Because of this, in the OCR model implemented by Clara, training and revision are one and the same thing. The revision is a sum of small, independent acts and alternates with reprocessing steps along a refinement process.
5. The Clara GUI was implemented as, and behaves like, a minimalistic HTML viewer. This is just an easy and standard way to implement a forms interface.
6. We have tried to make the source code portable across platforms that support the C library and the Xlib. Clara has no special provision to be ported to environments that do not support the Xlib. We avoided using a higher-level graphic environment like Motif, GTK or Qt, but we do not discourage initiatives to add code to Clara OCR to adapt it, or adapt it better, to these or other graphic environments.
7. We generally try to make the code efficient in terms of RAM usage. CPU and disk usage (for session files) have lower priority.
Clara OCR focuses on the Latin alphabet ("a", "b", "c", ...), used by most European languages, and the Indo-Arabic digits ("0", "1", "2", ...), but we're trying to support as many alphabets as possible.
To say that Clara OCR supports a given alphabet means that Clara OCR
(a) is able to be trained from the keyboard for the symbols of that alphabet, possibly applying some transliteration from that alphabet to Latin. For instance, when OCRing a Greek text, if the user presses the Latin "a" key (assuming that the keyboard has Latin labels), Clara is expected to train the current symbol as "alpha";
(b) knows the vertical alignment of each letter of that alphabet, for instance, knows that the bottom of an "e" is aligned at the baseline;
(c) knows which letters accept or require which signs (accents and others, like the dot found on "i" and "j");
(d) contains code to help avoid common mistakes, like recognizing "e" as "c", "l" as "1", etc.
To say that Clara OCR supports a given alphabet does not necessarily mean that Clara OCR
(a) knows some particular encoding (ISO-8859-X, Unicode, etc) for that alphabet;
(b) contains or is able to use fonts for that alphabet to display the OCR output on the PAGE (OUTPUT) window.
Even ignoring the standard encodings for a given alphabet (e.g. ISO-8859-7 for Greek), Clara will eventually be able to produce output using TeX macros, like {\Alpha}.
Clara differs from other OCR programs in various aspects:
1. Most known OCRs are non-free and Clara is free. Clara focuses on the X Window System. Clara offers batch processing and a web interface, and supports cooperative revision effort.
2. Most OCR programs focus on omnifont technology, disregarding training. Clara does not implement omnifont techniques and concentrates on building specialized fonts (some day in the future, however, maybe we'll try classification techniques that do not require training).
3. Most OCR programs make the revision of the recognized text a process totally separate from the recognition. Clara pragmatically joins the two processes, and makes training and revision one and the same thing. In fact, the OCR model implemented by Clara is an interactive effort where the usage of the heuristics alternates with revision and visual fine-tuning of the OCR, guided by the user's experience and feeling.
4. Clara allows the transliteration of each pattern to be entered using an interface that displays a graphic cursor directly over the image of the scanned page, and builds and maintains a mapping between graphic symbols and their transliterations in the OCR output. This is a potentially useful mechanism for documentation systems, and a valuable tool for typists and reviewers. In fact, Clara OCR may be seen as a productivity tool for typists, instead of a typical OCR.
5. Most OCR programs are integrated with scanning tools, offering the user a unified interface to execute all steps from scanning to recognition. Clara does not offer such an integrated interface, so you need separate software (e.g. SANE) to perform scanning.
6. Most OCR programs expect the input to be a graphic file encoded in TIFF or other formats. Clara supports only raw PBM/PGM.
Clara OCR will run on a PC (386, 486 or Pentium) with GNU/Linux and the X Window System. Clara OCR will hopefully compile and run on a PC with any Unix-like operating system and the X Window System. Currently Clara OCR won't run on big-endian CPUs (e.g. Sparc) nor on systems lacking X Window System support (e.g. MS-Windows). Higher-level libraries like Motif, GTK or Qt are not required. A relatively fast CPU is recommended (300MHz or more). Memory usage depends on the documents, and may range from a few megabytes to several tens of megabytes. Normal operation will create session files on your hard disk, so some megabytes of free disk space are required (a large project may require many gigabytes). Clara OCR can read and write gzipped files (see the -z command-line switch). If you need to build the executable and/or the documentation, then an ANSI C compiler (with some GNU extensions) and a Perl interpreter (version 5) are required.
For those who need to compile it (hopefully this will be unnecessary for most users as soon as Clara binary distributions become available), the source code may be downloaded from http://www.claraocr.org/. It's a compressed tar archive with a name like clara-x.y.tar.gz (x.y is the version number). The compilation will generally require no more than issuing the following commands at the shell prompt:
If some of these steps fail, please try to obtain assistance from your local experts. They will solve most simple problems concerning wrong paths or compiler options. You can also read the subsection "Compilation and startup pitfalls".
This subsection is intended to help people who are experiencing fatal errors when building the executable or when starting it. After each error message we'll point out some hints. Bear in mind that most hints given below are very elementary concerning Unix-like systems. If you have problems, try to read all hints because details explained once are not repeated. If you cannot understand them, please try to ask your local experts, or try to read an introductory book on Unix. Please don't email questions like these to the Clara developers, except when the hint suggests it.
1. Path-related pitfalls
If you don't know what I'm speaking about, take a look at the directory where the Clara source code is, and you'll see there a file named "makefile". This file contains the names of the tools to be used and the rules to build the Clara executable. It also contains important paths, like those where the system headers (.h files) and libraries can be found. If the names or the paths don't reflect those on your system, you need to edit the makefile accordingly.
2. Compilation pitfalls
3. Runtime pitfalls
Clara OCR is intended to OCR a relatively large collection of pages at once, typically a book. So we will refer to the material that we are OCRing as "the book". Let's describe a small but real project as an example of how to use Clara to OCR one "book". This section is in fact an in-depth tutorial on using Clara OCR. In order to try all techniques explained in this section, please download and uncompress the file referred to as "page 143" of the Manuel Bernardes Branco Dictionary (Lisbon, 1879), available at http://www.claraocr.org. It's a tarball containing the two text columns (one per file) of that page. Just to make things easier, we will assume that the files 143-l.pgm and 143-r.pgm were downloaded to the directory /home/clara/books/MBB/pgm/. We will also assume that the programs "clara" and "selthresh.pl" are on the PATH. Some programs to handle PBM files (pgmtopbm, pnmrotate and others, by Jef Poskanzer) are also required. These programs are easy to find, and are included in most free operating systems.
Clara OCR cannot scan paper documents by itself. Scanning must be performed by another program. The Clara OCR development effort is using SANE (http://www.mostang.com/sane) to produce 600 or 300 dpi images. The Clara OCR heuristics are tuned to 600 dpi. Scanners offer three scanning modes: black-and-white (also known as "bitmap" or "lineart", although the meaning of these words may vary depending on the context), "grayscale" and "color". Clara OCR requires black-and-white or grayscale input. Both black-and-white and grayscale images may be saved in a variety of formats by scanning programs. However, only the PBM (for black-and-white) and PGM (for grayscale) formats are recognized. Generally grayscale at 600 or 300 dpi will be the best choice, but black-and-white at 600 dpi may be good for new, high-quality printed materials. If your scanning program does not support the PBM or PGM formats, try to save the images in TIFF format and convert them to PBM or PGM using the command tifftopnm. If for some reason the TIFF format cannot be used, choose any other format that preserves all data (don't use lossy "compressing" formats like JPEG) and for which a tool is available to convert it to PBM or PGM.
Obs. Programs that scan or handle (e.g. rotate) images may sometimes perform unexpected tasks, such as applying dithering or reducing algorithms by themselves. An image transformed to become nice or small may be useless for OCR purposes.
Obs. The PBM and PGM formats do not carry the original resolution (dots per inch) at which the image was scanned. As some heuristics require that information, Clara OCR expects to be informed about it through the command-line switch -y (so take note of the resolution used).
Grayscale means that each pixel assumes one gray "level", typically from 0 (black) to 255 (white). This is a good choice for scanning old or low-quality printed materials, because it's possible to use specialized programs to analyse the image and choose a "threshold", in such a way that all pixels above that threshold will be considered "white", and all others will be considered "black" (when scanning in black-and-white mode, the threshold is chosen by the scanning program or by the user). The threshold may be global (fixed for the entire page) or local (varying along the page). In most cases grayscale will achieve better results. However, as grayscale images are much larger than black-and-white images, 300 dpi (instead of 600 dpi) may be mandatory when using grayscale due to disk space requirements.
Obs. Try to limit yourself to the optical resolution offered by the scanner. Most old scanners are 300 dpi, but the scanning software obtains higher resolutions through interpolation. Newer scanners may be optical 600 dpi or 1200 dpi or more.
Obs. Page 143 of the Manuel Bernardes Branco Dictionary that we're using in these tests was scanned using the SANE scanimage command:
Obs. The segmentation step can now try to avoid links. Just set the "Try avoid links" parameter on the TUNE tab (normal values are <= 1).
The four thresholding methods currently available are: manual (global), histogram-based (global), classification-based (local), classification-based (global).
Histogram-based thresholding is the default method. It automatically computes a thresholding value based on the distribution of gray shades. To use it, just enter the TUNE tab and select the "use histogram-based global thresholder" entry (it's selected by default). To try it, load a PGM image and press OCR or request the "Segmentation" OCR step.
Obs. You can correct the automatically detected threshold with the "Threshold factor" on the TUNE tab.
A global thresholding value can also be specified manually. This corresponds to the "use manual global thresholder" entry. The thresholding value is chosen through a visual interface called "instant thresholding". To use it, load one PGM image and select the "Instant thresholding" entry (Edit menu). Then use '<', '>', '+' and '-' to change the thresholding value. When done, press ESC. Note that the selected value will be applied only when the segmentation step runs.
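To make the idea of a histogram-based global threshold concrete, here is a minimal sketch in Python. It uses Otsu's method, a well-known histogram-based technique, purely as an illustration: this manual does not say which formula Clara OCR actually applies, and the function names below are hypothetical. Pixels are assumed to be integers from 0 (black) to 255 (white), as in a PGM file.

```python
def otsu_threshold(pixels, levels=256):
    """Pick a global threshold from the gray-level histogram (Otsu's method).

    This is an illustrative stand-in, not Clara OCR's own thresholder.
    """
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1

    total = len(pixels)
    sum_all = sum(level * n for level, n in enumerate(counts))

    best_t, best_score = 0, -1.0
    n_dark = 0    # number of pixels with level <= t
    sum_dark = 0  # sum of the gray levels of those pixels
    for t in range(levels):
        n_dark += counts[t]
        sum_dark += t * counts[t]
        n_light = total - n_dark
        if n_dark == 0:
            continue  # no pixel is dark yet
        if n_light == 0:
            break     # every pixel is dark; nothing left to separate
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance (up to a constant factor).
        score = n_dark * n_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(pixels, threshold):
    """1 = black (level <= threshold), 0 = white (level above the threshold)."""
    return [1 if p <= threshold else 0 for p in pixels]


# A toy "page": half dark ink, half light paper.
page = [10] * 50 + [200] * 50
t = otsu_threshold(page)
bitmap = binarize(page, t)
```

Any threshold between the ink level and the paper level separates this toy page equally well; real pages have overlapping distributions, which is why a correction such as the "Threshold factor" can be useful.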
Global thresholding does not address those cases where the printing intensity (or the paper properties) varies within a single page. Local thresholding methods are required in such cases. Clara OCR implements a classification-based local (per-symbol) thresholder. Saying that it's classification-based means that the OCR engine is used to choose the threshold. In other words, the threshold chosen is the one for which the classifier successfully recognizes the symbol (in fact, this is a brute-force approach). The local binarizer can be manually applied to any symbol. To do so, load one PGM page and click any symbol directly on the PAGE tab. Two thresholding values will be chosen. The pixels found to be "black" for each one are painted "black" (smaller value) and "gray" (larger value). At this moment, it's possible to add the thresholded symbol as a pattern (just press the key corresponding to its transliteration). Remember that this thresholder relies on the classifier, so if the OCR is not trained, you'll get no benefit. Two versions of the local binarizer were developed, a "weak" one and a "strong" one. The "weak" one just tries to change the threshold on those symbols not successfully classified using the default threshold. The "strong" one (unfinished) also tries to criticize the segmentation results locally. By default, the weak version is used. To try the strong one, check the corresponding checkbox on the TUNE tab.
Obs. As an alternative, use the "Balance" feature plus global thresholding.
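The brute-force idea behind the "weak" local thresholder can be sketched as follows. This is only an illustration under stated assumptions: the classifier is passed in as a function that returns a transliteration or None, the candidate thresholds are tried in the order given, and all names are hypothetical rather than taken from the Clara OCR source.

```python
def binarize(gray, threshold):
    """Turn a grayscale symbol (rows of 0..255) into a bitmap; 1 = black."""
    return tuple(tuple(1 if p <= threshold else 0 for p in row) for row in gray)


def local_threshold(gray, classify, default, candidates):
    """Weak variant: keep the default threshold if the classifier accepts
    the symbol; otherwise try the candidates until one is recognized."""
    for t in [default] + [c for c in candidates if c != default]:
        label = classify(binarize(gray, t))
        if label is not None:
            return t, label
    return None, None  # the symbol stays unknown


# Toy classifier (an assumption): it knows a single 2x2 pattern, "x".
KNOWN = {((1, 0), (0, 1)): "x"}

def toy_classify(bitmap):
    return KNOWN.get(bitmap)


# Faintly printed symbol: ink at level 50, paper at level 120.
symbol = [[50, 120],
          [120, 50]]

threshold, label = local_threshold(symbol, toy_classify,
                                   default=150, candidates=range(0, 256, 32))
```

With the default threshold of 150 the whole symbol turns black and is not recognized, so the search falls back to the candidates and stops at the first one that yields the trained pattern. As the text says, an untrained classifier would never accept anything, and the search would bring no benefit.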
Clara OCR includes a simple threshold selection script to compute the best global thresholds based on classification results. Let's try it on our 2-page book. Just create a directory, cd to it and run the selthresh.pl script, giving it the resolution and the names of the images:
Once the best thresholds are known, use pgmtopbm to produce the black-and-white images. It's also a good idea to bring the resolution closer to 600 dpi using pnmenlarge. Although pnmenlarge does not add information to the image, the classification heuristics will behave better. In our case, the command should be
In order to capture the output of selthresh.pl (to extract the per-page best thresholds), it's ok to regenerate it as many times as needed: just repeat the same selthresh.pl command, because once all computations have been performed, the script will just read the results from selthresh.out and output them. A final warning: selthresh.pl may be fooled by images that are too dark. So if the right limit of the interval is much larger than it should be, selthresh.pl may produce bad results. So be careful concerning the right limit of the interval. As practical advice, keep in mind that the best threshold for most images is less than 0.6. In the near future we'll use statistical measurements to choose the interval to analyse, in order to prevent such problems and to make a manual choice unnecessary.
Obs. The tarball also includes an alternative to selthresh.pl, named slethresh_fidian.pl. It contains instructions on how to use it.
Sometimes the printing is skewed relative to the paper margins. Skew is a problem for the OCR heuristics. As the Clara OCR engine just detects components by pixel contiguity and builds classes of symbols, in practice the effect of skew will be a larger number of patterns, and therefore a larger revision cost. In some cases, careful manual scanning can solve the problem. When acceptable, a set square solves the problem: just align one text line with one edge of the set square and the edge of the scanner glass with the other edge (we're supposing that the book binding was disassembled). The bundled preprocessor now includes a method to compute and correct skew, but it's not on by default. To activate it, enter the TUNE tab and select the "Use deskewer" checkbox. Deskewing will then be applied when the OCR button is pressed (or when the "Preprocessing" OCR step is requested). Note that preprocessing is performed only once per page, so if the page was already preprocessed, it won't be deskewed.
Clara OCR expects to find one or more images of scanned pages in a single directory. In our case, this directory is assumed to be /home/clara/books/BC/pbm. By default, various files will be created in this same directory to store the OCR data structures. So, if 143-l.pbm and 143-r.pbm are the pages to OCR, then after processing all pages at least once (not done yet) the work directory will contain the following files:
When Clara OCR is processing the page x.pbm, the files "x.session", "acts" and "patterns" are in memory. These three files together are generally referred to as "the session". So the menu option "save session" means saving all three files.
Patterns are selected symbols from the book. They're obtained from manual training, or from automatic selection. The patterns are used by the bitmap comparison heuristics to deduce the transliteration of the unknown symbols. In other words, the OCR discovers that one symbol is the letter "a" or the digit "1" by comparing it with the patterns. The book font is the collection of all patterns. The term "book font" was chosen to make sure that we're not talking about the X font used by the GUI. The book font is stored in a separate file ("patterns", in the work directory). Clara OCR classifies the patterns into "types", one type for each printing font. For now, most of this work must be done manually. Someday in the future, the auto-tuning features and the pre-built customizations will hopefully make this process less painful. So, before OCRing one book, it's convenient to observe the different fonts used. In our case, we have three fonts (the quotations refer to the page 5.pbm):
Now we can select some patterns from the pages 143-l.pbm and 143-r.pbm. Try:
At this point, the "Auto-classify" feature (Edit menu) may be quite useful. When on, Clara OCR will apply the pattern just trained to solve all unknown symbols, so after training an "a", only those "a" letters dissimilar to the trained one will remain unknown (grayed). Now save the session (menu "File"), exit Clara OCR (menu "File"), and enter Clara OCR again using the same commands as above. Try to load one file and/or to observe the patterns on the tabs PATTERN, PATTERN (list), TUNE (SKEL), etc. This is a good way to experience the fact that Clara OCR is started and exited many times over the duration of one OCR project. One last remark in this subsection: instead of the manual pattern selection just described, Clara OCR is able to select by itself the patterns to use from the pages. In order to use this feature, select the checkbox "Build the bookfont automatically" (TUNE tab) and then classify the symbols (just press the OCR button using mouse button 1, or press mouse button 3 over it and select the "classify" item). However, the current recommendation is to prefer the manual selection of patterns, at least as a first step.
Currently, symbol classification can be performed by three different classifiers: skeleton fitting, border mapping or pixel distance. The choice is made on the TUNE tab. Border mapping is currently experimental. Pixel distance has been used as an auxiliary classifier. Skeleton fitting is more mature code and is highly customizable. It's the default classification method for now. When using skeleton fitting, two symbols are considered similar when each one contains the skeleton of the other. So the classification result depends strongly on how skeletons are computed. As an example, the figure presents one symbol ("e"). The symbol black pixels are the dots ('.'). The skeleton black pixels are stars ('*').
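The similarity rule stated above (each symbol must contain the skeleton of the other) can be illustrated with a short Python sketch. It is not Clara OCR's code: the skeletons are taken as given instead of being computed, the two symbols are assumed to be already aligned on the same grid, and the function names are hypothetical. The drawings follow the same convention as the figure: '.' for a symbol pixel and '*' for a skeleton pixel (which is also a symbol pixel).

```python
def pixels(rows, marks):
    """Set of (row, column) positions whose character is one of `marks`."""
    return {(y, x) for y, row in enumerate(rows)
            for x, ch in enumerate(row) if ch in marks}


def parse(rows):
    """Return (symbol pixels, skeleton pixels) from an ASCII drawing."""
    return pixels(rows, ".*"), pixels(rows, "*")


def similar(a, b):
    """Skeleton fitting: each symbol must contain the other's skeleton."""
    symbol_a, skeleton_a = a
    symbol_b, skeleton_b = b
    return skeleton_b <= symbol_a and skeleton_a <= symbol_b


thick_bar = parse(["...",
                   ".*.",
                   ".*.",
                   "..."])

thin_bar = parse([" ..",
                  " *.",
                  " *.",
                  " .."])

dash = parse(["   ",
              "   ",
              "***",
              "   "])

same_shape = similar(thick_bar, thin_bar)
different_shape = similar(thick_bar, dash)
```

The dash shows why the test must hold in both directions: its skeleton fits inside the thick bar, but the bar's skeleton does not fit inside the dash, so the two are not considered similar. It also suggests why the result depends so strongly on the skeleton parameters: a thicker skeleton makes the containment test stricter.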
Instead of using the TUNE (SKEL) tab, it's possible to specify skeleton computation parameters through the -k command-line switch. Note however that if a selection was made through the TUNE (SKEL) tab, that selection will override the parameters given to -k, so be careful. Clara OCR has an auto-tune feature to choose the "best" skeleton computation parameters. To use it, check the "Auto-tune skeleton parameters" entry on the TUNE tab. This feature is currently off by default because manual tuning can achieve better results.
Examples:
1. Quality printing without thin details
To classify the book symbols (i.e. to discover the transliteration of unknown symbols using the patterns), enter Clara OCR, select "Work on all pages" ("Options" menu) and press the OCR button using mouse button 1, or press mouse button 3 and select "Classification". The classification may be performed many times. Each time, different parameters may be tried to refine the results already achieved. When the classification finishes, observe the pages 5.pbm and 6.pbm. Most probably, some symbols will be greyed. In other words, the classifier was unable to classify all symbols. The statistics presented on the PAGE (LIST) tab may be useful now. To reduce the number of unknown symbols there are three choices: add more patterns, change the skeleton computation parameters, or try another classifier.
To add more patterns, just train some greyed symbols and reclassify all pages. The reclassification will be faster than the first classification because most symbols, already classified, won't be touched.
To change the skeleton computation parameters, exit Clara OCR, restart it giving the new parameters through -k, select "Re-scan all patterns" ("Edit" menu), select "Work on all pages" ("Options" menu) and reclassify. It may be easier to choose and set the new parameters using the TUNE (SKEL) tab, as explained earlier. However, remember that the parameters chosen through the TUNE (SKEL) tab override the parameters given through -k.
To try another classifier, first select the "Re-scan all patterns" entry on the "Edit" menu. Then enter the TUNE tab and select the classifier to use from the available choices (skeleton-based, border mapping and pixel distance). Pixel distance may be a good choice. Then reclassify all pages.
The "Re-scan all patterns" step is required because, for each symbol, Clara OCR remembers the patterns already tried to classify it, and does not try those patterns again. However, when the skeleton computation parameters change, or when the classifier changes, those same patterns must be tried again. Maybe in the future Clara OCR will decide by itself about re-scanning all patterns.
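The bookkeeping described above (each symbol remembers which patterns were already tried, and "Re-scan all patterns" makes them eligible again) can be sketched as follows. The data structures and names are assumptions made for this illustration only; the manual does not describe how Clara OCR stores this information.

```python
class Symbol:
    def __init__(self, bitmap):
        self.bitmap = bitmap
        self.label = None   # transliteration, once classified
        self.tried = set()  # ids of the patterns already compared


def classify_all(symbols, patterns, matches):
    """Try untried patterns on unknown symbols; return the comparison count."""
    comparisons = 0
    for s in symbols:
        if s.label is not None:
            continue  # already classified: not touched again
        for pid, (pattern_bitmap, pattern_label) in patterns.items():
            if pid in s.tried:
                continue  # remembered from a previous run
            s.tried.add(pid)
            comparisons += 1
            if matches(s.bitmap, pattern_bitmap):
                s.label = pattern_label
                break
    return comparisons


def rescan_all_patterns(symbols):
    """Forget which patterns were tried, so that they are compared again."""
    for s in symbols:
        s.tried.clear()


# Toy data: bitmaps are plain strings, and two "classifiers" are simulated.
symbols = [Symbol("A"), Symbol("b"), Symbol("B")]
patterns = {0: ("A", "a"), 1: ("B", "b")}
strict = lambda x, y: x == y
loose = lambda x, y: x.lower() == y.lower()

first_run = classify_all(symbols, patterns, strict)   # "b" stays unknown
stale_run = classify_all(symbols, patterns, loose)    # nothing is retried
rescan_all_patterns(symbols)
fresh_run = classify_all(symbols, patterns, loose)    # "b" is retried
```

In this toy run, switching to the looser matcher has no effect until the tried sets are cleared, which mirrors the reason given in the text for selecting "Re-scan all patterns" after changing the classifier or the skeleton parameters.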
At this point, we can generate the output for all pages. The output is already available if the classification was performed by clicking the OCR button with mouse button 1. If not, just select the "Work on all pages" item on the "Options" menu, and click the OCR button using mouse button 1. The per-page output will be saved to the files 5.html and 6.html. Maybe the output will contain unknown symbols. Maybe the output will present broken lines or broken words. If so, the numbers used to perform symbol alignment must be changed. These numbers are configured on the TUNE tab ("Magic numbers" section). They're part of the session data, so they'll be saved to disk. There are 7 such numbers:
To OCR an entire book is a long process. Perhaps a problem is detected along the way: a bad choice of skeleton computation parameters, a bad page contaminating the bookfont, the loss of some files due to a crash, etc. How can such problems be solved? Clara OCR does not currently offer a complete set of tools to solve all of them. In some cases, a simple solution is available. In others, a solution is expected to become available in future versions. This section will depict some practical cases, and explain what can be done and what cannot be done for each one.
In order to make the usage of read-only media easier, Clara OCR allows splitting the files into two directories, one for images and another for work files. The path of the first is stored in pagesdir, and that of the second in workdir. For instance:
A somewhat rigid directory structure is recommended for high-volume digitization projects based on Clara and using the web interface. In this case, there will be multiple "pagesdir" directories ("book1" and "book2" under the docsroot in the figure) and, for each one, a corresponding "workdir" ("book1" and "book2" under the workroot in the figure).
From the stats presented by the PAGE (LIST) tab it's possible to detect problems on specific pages. A low factorization may be a symptom of a bad choice of brightness for that page. In such a case, it's probably a good idea to remove that page completely. Removing a page is a delicate operation. Clara OCR currently does not offer a "remove page" feature. Basically, it should remove all patterns from that page, remove the revision data acquired from that page, and remove the page image and its session file.
What to do when the OCR classifies incorrectly a large quantity of symbols? (to be written)
When OCRing a large book, a good approach is to divide its pages into a number of smaller sections and OCR each one. So for a book with, say, 1000 pages, we could OCR pages 1-200, then 201-400, etc. After finishing the first section, of course we want to reuse on the second section the training and revision effort already spent. This is not the same as adding the pages 201-400 to the first section, because we do not want to handle the pages 1-200 anymore. Basically we need to import the patterns of the first section when starting to process the second. Well, Clara OCR is currently unable to perform this operation.
The Clara OCR web interface allows remote training of symbols. To use it, a web server able to run Perl CGIs (e.g. Apache) is required. Let's present the steps to activate the web interface for a simple case, with only one book (named "book1"). Basically, one needs to create a subtree anywhere on the server disk (say, "/home/clara/www/"), owned by the user that will manage the project (say, "clara"), with the subdirectories "bin", "book1" and "book1/doubts":
Now we need to process the PBM files in order to create some "doubts". The script clara.pl also requires a symlink to the clara binary (change the path /usr/local/bin/clara as required):
1. Apache expects to be explicitly allowed to follow symlinks. The file access.conf should contain, in our case, a section similar to the following:
Types of revision acts (to be written). Discarding deduced data (to be written).
The "page (list)" tab offers recognition statistics on a per-page basis. The contents of each column on this tab are described below:
POS: The sequential position on the list. The current page is indicated by an asterisk in this column.
FILE: The name of the file that contains the PBM image of the document.
RUNS: The number of OCR runs on this page. Partial OCR runs, like classification (started by the "classify" button), also count as one run.
TIME: Total CPU time spent on OCR operations on this page. I/O time (reading and saving session files) is not included.
WORDS: Current number of words on this page. This variable is updated by the "build" step.
SYMBOLS: Current number of symbols on this page. This variable is updated by the "build" step.
DOUBTS: Current number of untransliterated CHAR symbols on this page. This variable is updated by the "build" step.
CLASSES: Current number of classes on this page.
FACT: Quotient between the number of symbols and the number of classes.
RECOG: Quotient between (symbols-doubts) and symbols, where "symbols" is the number of symbols and "doubts" is the number of doubts as defined above.
PROGRESS: Difference between the current recog rate and the recog rate for the previous run.
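The three derived columns follow directly from the definitions above. As a small worked example (a sketch only: the function name, the dictionary layout and the handling of empty pages are assumptions, not taken from the Clara OCR source):

```python
def page_stats(symbols, doubts, classes, previous_recog=0.0):
    """Compute FACT, RECOG and PROGRESS from the per-page counters."""
    # Guard against empty pages (an assumption about the desired behaviour).
    fact = symbols / classes if classes else 0.0
    recog = (symbols - doubts) / symbols if symbols else 0.0
    return {
        "FACT": fact,                         # symbols per class
        "RECOG": recog,                       # share of transliterated symbols
        "PROGRESS": recog - previous_recog,   # gain since the previous run
    }


# A page with 1000 symbols grouped into 100 classes, 50 of them still
# unknown, whose recog rate after the previous run was 0.90.
stats = page_stats(symbols=1000, doubts=50, classes=100, previous_recog=0.90)
```

For this page FACT is 10 (on average, each class stands for ten symbols), RECOG is 0.95 and PROGRESS is about 0.05. A FACT close to 1 would mean that almost every symbol forms its own class, which is the "low factorization" situation mentioned earlier as a symptom of a badly scanned page.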
In this section, the Clara application window will be described in detail, both to document all its features and to define the terminology.
The application window is divided into three major areas: the buttons ("zoom", "OCR", "stop", etc), the "plate" (right), including the tabs ("page", "symbol" and "font"), and one or more "document windows" inside the plate. We say "document window" because each window is exhibiting one "document". This "document" may be the scanned page (PAGE window), the current OCR output for this page (PAGE OUTPUT window), the symbol form (PAGE SYMBOL window), the GPL (GPL window) and so on. However, we'll refer to the document windows merely as "windows". Around each window there are two scrollbars. At the bottom of the application window there is a status line. At the top there is a menu bar (fully documented in the section "Reference of the menus").
Three tabs are offered, and each one may operate in one or more "modes". For instance, pressing the PATTERN tab many times will cycle through two modes: one presenting the windows "pattern" and "pattern (props)" and another with the window "pattern (list)". On each tab, Clara OCR displays one or more windows on the plate. Each such window is called a "document window" to distinguish it from the application window. Each such window is supposed to be displaying a portion of a larger document, for instance
All available tabs and the modes for each one are listed below. The numbers (1, 2, etc) are only there to make it easier to distinguish one mode from the others. There is no effective association between the modes and the numbers.
The application buttons are those displayed on the left portion of the Clara X window. They're labelled "zoom", "OCR", etc. Three types of buttons are available. There are on/off buttons (like "italic"), multi-state buttons (like the alphabet button), where the state is indicated by the current label, and there are buttons that merely capture mouse clicks, like the "zoom" button. Some are sensitive both to mouse button 1 and to mouse button 3, others are sensitive only to mouse button 1.
zoom - enlarge or reduce bitmaps. Mouse button 1 enlarges bitmaps, mouse button 3 reduces bitmaps. The bitmaps to enlarge or reduce are determined by the current window. If the PAGE window is active, then the scanned document is enlarged or reduced. If the PAGE (fatbits) or the PATTERN window is active, then the grid is enlarged or reduced. If the PAGE (symbol) or the PATTERN (props) or the PATTERN (list) window is active, then the web clip is enlarged or reduced.
OCR - start a full OCR run on the current page or on all pages, depending on the state of the "Work on current page only" item of the Options menu.
stop - stop the current OCR run (if any). OCR does not stop immediately, but will stop as soon as possible.
zone - start definition of the OCR zone. Currently zoning in Clara OCR is useful only for saving the zone as a PBM file, using the "save zone" item on the "File" menu. For now, only one zone can be defined and the OCR operations consider the entire document, ignoring the zone.
type - read-only button, set according to the pattern type of the current symbol or pattern. The various letter sizes or styles (normal, footnote, etc) used by the book are numbered from 0 by Clara OCR ("type 0", "type 1", etc).
bad - toggles the button state. The bad flag is used to identify damaged bitmaps.
latin/greek/etc - read-only button, set according to the alphabet of the current symbol or pattern.
When the "Show alphabet map" option of the "View" menu is selected, the GUI will include an alphabet map between the buttons and the plate. This map presents all symbols from the current alphabet. The current alphabet is selected using the alphabet button. The alphabet button cycles through all alphabets selected on the "Alphabets" menu. Clara OCR offers initial support for multiple alphabets. To become useful, it needs more work. The alphabet map currently does not offer any functionality. For some alphabets (Cyrillic and Arabic) the alphabet map is disabled in the source code due to the large alphabet size. Currently Clara OCR does not contain bitmaps for displaying Katakana.
Most menus are accessible from their labels on the menu bar (at the top of the application window). The labels are "File", "Edit", etc. Other menus are presented when the user clicks mouse button 3 on some special places (for instance the button "OCR"). Let's describe all menus and their entries.
This menu is activated from the menu bar on the top of the application X window.
Enter the page list to select a page to be loaded.
Save on disk the page session (file page.session), the patterns (file "patterns") and the revision acts (file "acts").
Save on disk the first zone as the file zone.pbm.
Save the contents of the PAGE LIST window to the file report.txt on the working directory.
Just quit the program (asking first whether the session is to be saved).
This menu is activated from the menu bar on the top of the application X window.
When selected, the right or the left arrows used on the PATTERN or the PATTERN PROPS windows will move to the next or the previous untransliterated patterns.
When selected, the classification heuristic will retry all patterns for each symbol. This is required when trying to resolve the unclassified symbols using a second classification method.
When selected, the engine will re-run the classifier after each new pattern trained by the user. So if various letters "a" remain unclassified, training one of them will perhaps recognize some others, helping to complete the recognition.
When selected, mouse button 1 will fill the region around one pixel on the pattern bitmap being edited on the font tab.
When selected, mouse button 1 will paint individual pixels on the pattern bitmap being edited on the font tab.
When selected, mouse button 1 will clear the region around one pixel on the pattern bitmap being edited on the font tab.
When selected, mouse button 1 will clear individual pixels on the pattern bitmap being edited on the font tab.
When selected, the pattern list window will divide the patterns in blocks according to their (page) sources.
When selected, the pattern list window will use the number of matches of each pattern as the first criterion when sorting the patterns.
When selected, the pattern list window will use the transliterations of the patterns as the second criterion when sorting them.
When selected, the pattern list window will use the number of pixels of the patterns as the third criterion when sorting them.
When selected, the pattern list window will use the widths of the patterns as the fourth criterion when sorting them.
When selected, the pattern list window will use the heights of the patterns as the fifth criterion when sorting them.
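The five sorting criteria above combine into a multi-key sort, where each later criterion only breaks ties left by the earlier ones. The following Python sketch illustrates the idea under stated assumptions: the Pattern class, its field names and the sort_patterns function are hypothetical, and ascending order is assumed for every criterion since the manual does not state the sort direction.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    """Hypothetical pattern record with the five sortable attributes."""
    matches: int
    transliteration: str
    pixels: int
    width: int
    height: int


# Fixed priority order of the criteria, as listed on the menu.
CRITERIA = ("matches", "transliteration", "pixels", "width", "height")


def sort_patterns(patterns, selected):
    """Sort by the selected criteria only, keeping their fixed priority order."""
    active = [name for name in CRITERIA if name in selected]
    # Python's sort is stable, so with no criterion selected the order is unchanged.
    return sorted(patterns, key=lambda p: tuple(getattr(p, name) for name in active))
```

With only "matches" and "transliteration" selected, two patterns having the same number of matches are ordered by their transliterations.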
Remove from the font all untransliterated patterns.
Set the pattern type for all patterns marked as "other".
Try to find a barcode on the loaded page.
Perform on-the-fly global thresholding.
Reset the parameters for skeleton computation for all patterns.
This menu is activated from the menu bar on the top of the application X window.
Use a small X font (6x13).
Use the medium font (9x15).
Use a large X font (10x20).
Use the default font (7x13 or "fixed" or the one informed on the command line).
Toggle the hide scrollbars flag. When active, this flag hides the scrollbars on all windows.
Toggle the hide fragments flag. When active, fragments won't be included on the list of patterns.
Show the HTML source of the document, instead of the graphic rendering.
Toggle the web clip feature. When enabled, the PAGE_SYMBOL window will include the clip of the document around the current symbol that will be used during web revision.
Toggle the alphabet map display. When enabled, a mapping from Latin letters to the current alphabet will be displayed.
Identify the symbols on the current class using a gray ellipse.
Display bitmap matches when performing OCR.
Display all bitmap comparisons when performing OCR.
Display bitmap matches when performing OCR, waiting for a key press after each display.
Display all bitmap comparisons when performing OCR, waiting for a key press after each display.
Display each candidate when tuning the skeletons of the patterns.
Perform a presentation. This item is visible on the menu only when the program is started with the -A option.
This item selects the alphabets that will be available on the alphabets button.
This is a provision for future support of the Arabic alphabet.
This is a provision for future support of the Cyrillic alphabet.
This is a provision for future support of the Greek alphabet.
This is a provision for future support of the Hebrew alphabet.
This is a provision for future support of the Kana alphabet.
Words that use the Latin alphabet include those from the languages of most Western European countries (English, German, French, Spanish, Portuguese and others).
Indo-Arabic decimal numbers like 1234, +55-11-12345678 or 2000.
Ideograms.
OCR operations (classification, merge, etc) will be performed only on the current page.
OCR operations (classification, merge, etc) will be performed on all pages.
Toggle the emulate deadkeys flag. Deadkeys are useful for generating accented characters. Deadkey emulation is disabled by default. It may be enabled on startup through the -i command-line switch.
Toggle the automenu feature. When enabled, the menus on the menu bar will pop up automatically when the pointer reaches the menu bar.
When selected, the PAGE tab will display only the PAGE window. The windows PAGE_OUTPUT and PAGE_SYMBOL will be hidden.
This menu is activated when the mouse button 3 is pressed on the PAGE window.
Change to PAGE_FATBITS focusing this symbol.
Scroll the window contents so that the current pointer position becomes the bottom left corner.
The pattern of the class of this symbol will be the only pattern used on all subsequent symbol classifications. This feature is intended to be used with the "OCR this symbol" feature, so it becomes possible to choose two symbols to be compared, in order to test the classification routines.
Starts classifying only the symbol under the pointer. The classification will re-scan all patterns even if the "re-scan all patterns" option is unselected.
Merge this fragment with the current symbol.
Create a symbol link from the current symbol (the one identified by the graphic cursor) to this symbol.
Make the current symbol nonpreferred and each of its components preferred.
Create an accent link from the current symbol (the one identified by the graphic cursor) to this symbol.
Run locally the symbol pairing heuristic to try to join this symbol to the word containing the current symbol. This is useful to find out why the OCR is not joining two symbols into the same word.
Run locally the word pairing heuristic to try to join this word with the word containing the current symbol on the same line. This is useful to find out why the OCR is not joining two words on the same line.
Run locally the geometrical merging heuristic to try to merge this piece to the current symbol.
Present the coordinates and color of the current pixel.
Identify the individual closures when displaying the current document.
Identify the individual symbols when displaying the current document.
Identify the individual words when displaying the current document.
Display absent symbols on pattern type 0, to help building the bookfont.
Report the scale on the tab when the PAGE window is active.
Display the bounding boxes on the PAGE window, instead of the symbols themselves. This is useful when designing new heuristics.
This menu is activated when the mouse button 3 is pressed on the PAGE.
Scroll the window contents so that the current pointer position becomes the bottom left corner.
Scroll the window contents so as to center the closure under the pointer.
Build the closure border path and activate the flea.
Build the closure border path and search straight lines there using linear distances.
Build the closure border path and search straight lines there using correlation.
Apply the isbar test on the closure.
Detect closure extremities.
Show the skeleton on the window PAGE_FATBITS. The skeleton is computed on the fly.
Show the border on the window PAGE_FATBITS. The border is computed on the fly.
For each symbol, will show the skeleton of its best match on the PAGE (fatbits) window.
For each symbol, will show the border of its best match on the PAGE (fatbits) window.
This menu is activated when the mouse button 3 is pressed on the OCR button. It allows running specific OCR steps (all steps run in sequence when the OCR button is pressed).
Start preprocessing.
Start detecting text blocks.
Start binarization and segmentation.
All OCR data structures are submitted to consistency tests. This is under implementation.
Compute the skeletons and analyse the patterns for the achievement of best results by the classifier. Not fully implemented yet.
Revision data from the web interface is read, and added to the current OCR training knowledge.
Start classifying the symbols of the current page or of all pages, depending on the state of the "Work on current page only" item of the Options menu. It will also build the font automatically if the corresponding item is selected on the Options menu.
Merge closures on symbols depending on their geometry.
Start building the words and lines. These heuristics will be applied on the current page or on all pages, depending on the state of the "Work on current page only" item of the Options menu.
Note: this is not implemented yet. Start filtering through ispell to generate transliterations for unknown symbols or alternative transliterations for known symbols. Clara will use the dictionaries available for the languages selected on the Languages menu. Filtering will be performed on the current page or on all pages, depending on the state of the "Work on current page only" item of the Options menu.
The OCR output is generated to be displayed on the "PAGE (output)" window. The output is also saved to the file page.html.
Files containing symbols to be revised through the web interface are created on the "doubts" subdirectory of the work directory. This step is performed only when Clara OCR is started with the -W command-line switch.
Bookfont handling options
Run in batch mode. The application window will not be created, and the OCR will automatically execute a full OCR run on all pages (or on the page specified through -f).
Choose the number of gray levels or the colors to be used by the GUI. To choose both the number of gray levels and the colors, this option must be used twice. The Clara OCR GUI uses by default only five colors, internally called "white", "black", "gray", "darkgray" and "vdgray" ("very dark gray"). There are two predefined schemes to map these internal colors into RGB values: "c" (color) and the default (grayscale). Alternatively, the mapping may be given explicitly, listing the RGB values separated by commas. The notation #RRGGBB is not supported; RGB values must be specified through color names known to the X server (e.g. "brown", "pink", "navyblue", etc, see the file /etc/X11R6/lib/X11/rgb.txt). The following example specifies the default mapping:
X Display to connect (by default read the environment variable DISPLAY).
Run in debug mode. Debug messages will be sent to stderr. Debug messages are generated when an acceptable but not reasonable event is detected.
Reviewer and reviewer type. All revision data is attributed by Clara to its originator. By default the reviewer name is "nobody" and its type is "A". The reviewer will generally be an email address or a nickname. The type may be T (trusted), A (arbiter) or N (anonymous). Example:
The X font to use (must be a font with fixed column size, e.g. "fixed" or "9x15").
Scanned page or page directory. Defaults to the current directory. The argument must be a pbm file (with absolute or relative path) or the path (absolute or relative) of the directory where the pbm file(s) was (were) placed.
X geometry.
Display short help and exit.
Emulate dead keys functionality.
Parameters SA,RR,MA,MP,ML,MB,RX,BT used to compute skeletons. BUG: these parameters are ignored when a "patterns" file already exists. In this case, Clara will read the parameters from the "patterns" file.
Switch off optimizations. Generally useful only for debug purposes. Unsupported display depths (if any) may require '-N d'. The argument is the list of the (one or more) optimizations to switch off (s, a, j, q, c, x or d). Examples:
Select output format (t=text, h=html). The default is HTML.
Parameters for filtering symbol comparison. PNT1 and PNT2 are the pixel number thresholds. These thresholds are used to filter out bad candidates when classifying symbols. The first threshold is for strong similarity and the second for weak similarity. The comparison algorithm performs two passes. The first pass uses PNT1 to filter. The second pass uses PNT2. So on the first pass only patterns "quite similar" to the symbol to classify are tried. On the second pass, we relax and permit more patterns to be tried. This method helps to achieve good performance. As PNT1 becomes larger, fewer patterns will be tried on the first pass. As PNT2 becomes smaller, more patterns will be tried on the second pass. MD is the maximum clearance to try a skeleton. The clearance must be an integer in the range 4..30 (default 6). The shape recognition algorithm will refuse to try to fit a skeleton into a symbol if the difference between their widths or their heights is larger than twice the clearance. Examples:
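Apart from the command-line examples, the clearance rule for the MD parameter can be illustrated by a minimal Python sketch. It is not Clara OCR's actual code: the function name may_try_skeleton and its arguments are hypothetical, and only the rule stated above (refuse when a width or height difference exceeds twice the clearance) is modelled.

```python
def may_try_skeleton(sym_width, sym_height, pat_width, pat_height, md=6):
    """Return True when the skeleton of a pattern may be tried on a symbol.

    md is the maximum clearance; the manual gives the range 4..30, default 6.
    """
    if not 4 <= md <= 30:
        raise ValueError("clearance must be an integer in the range 4..30")
    limit = 2 * md  # differences larger than twice the clearance are refused
    return (abs(sym_width - pat_width) <= limit
            and abs(sym_height - pat_height) <= limit)
```

With the default clearance of 6, a symbol 20 pixels wide may be compared with a pattern up to 32 pixels wide, but not with one 33 pixels wide.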
Maximum number of doubts per run. The argument must be an integer (default 30).
Avoid loading and creation of session files. Also reports bookfont size on stdout before exiting. This option is intended to be used by the selthresh.pl script.
Switch on trace messages. Trace messages depict the execution flow, and are useful for developers. Trace messages are written to stderr.
Verbose mode. Without this option, Clara runs quietly (default). Otherwise, informative warnings about potentially relevant events are sent to stderr.
Print version and compilation options and exit.
Web mode. Will read from the doubts subdirectory the input collected from the web, and will dump on that same directory the doubts to be reviewed.
Work directory. Defaults to the page directory (see -f). The path of the directory where the OCR will write the output, the acts, the book font and the session files. The doubts directory (web operation) is assumed to be a subdirectory of the work directory.
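The relation between the page directory, the work directory and the doubts directory can be sketched in Python as follows. This is an illustration only: the function resolve_dirs is hypothetical, it assumes that when -f names a PBM file the page directory is the directory containing that file, and it recognizes a file merely by its ".pbm" suffix.

```python
from pathlib import Path


def resolve_dirs(page_arg=".", work_arg=None):
    """Return (page_dir, work_dir, doubts_dir) for the given -f and -w arguments."""
    page = Path(page_arg)
    # Assumption: a path ending in .pbm is a page file, anything else a directory.
    page_dir = page.parent if page.suffix == ".pbm" else page
    # The work directory defaults to the page directory.
    work_dir = Path(work_arg) if work_arg is not None else page_dir
    # The doubts directory is a subdirectory of the work directory.
    return page_dir, work_dir, work_dir / "doubts"
```

For example, resolve_dirs("books/p1.pbm") places the work directory in "books" and the doubts directory in "books/doubts".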
Switch off (0) or on (1) index checking. Index checking is performed in some critical points in order to detect memory leaks. Index checking is unavailable when Clara is compiled with the symbol MEMCHECK undefined.
Specify the resolution of the scanned image in dots per inch (default 600). This resolution applies to all pages to be processed until the program exits.
Write (and read) compressed session files (*.session, acts and patterns will be compressed using GNU zip). Be careful: if -z is used, any existing uncompressed file (*.session, acts or patterns) will be ignored. So if you start using uncompressed files and later decide to begin using compressed files, then compress manually all existing files before starting Clara with the -z switch. Clara OCR support for reading and writing compressed files depends on the platform, and requires gzip and gunzip to be installed in some directory of binaries included in the PATH.
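The manual compression step mentioned above could be scripted, for instance, as in the Python sketch below. It is a hedged illustration, not part of Clara OCR: the function compress_session_files is hypothetical, and it assumes that the compressed files carry the ".gz" suffix that the gzip program produces by default.

```python
import gzip
import shutil
from pathlib import Path


def compress_session_files(work_dir):
    """Gzip every *.session file plus "acts" and "patterns" found in work_dir."""
    work = Path(work_dir)
    targets = sorted(work.glob("*.session")) + [work / "acts", work / "patterns"]
    compressed = []
    for src in targets:
        if not src.is_file():
            continue  # skip files that do not exist in this work directory
        dst = src.with_name(src.name + ".gz")  # assumed naming, as gzip does
        with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        src.unlink()  # like gzip, remove the uncompressed original
        compressed.append(dst)
    return compressed
```

Running the gzip program by hand on each file has the same effect; the sketch only shows which files are involved.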
ZPS, that is, the size of the bitmap pixels measured in display pixels, when in fat bit mode. Must be a small odd integer (1, 3, 5, 7 or 9).
Clara OCR is free software. Its source code is distributed under the terms of the GNU GPL (General Public License), and is available at http://www.claraocr.org/. If you don't know what the GPL is, please read it and check the GPL FAQ at http://www.gnu.org/copyleft/gpl-faq.html. You should have received a copy of the GNU General Public License along with this software; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. The Free Software Foundation can be found at http://www.fsf.org.
Clara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati wrote the internal preprocessor. Clara OCR includes bugfixes produced by other developers. The Changelog (http://www.claraocr.org/CHANGELOG) acknowledges all of them (see below). Imre Simon contributed high-volume tests, discussions with experts, selection of bibliographic resources, publicity and many ideas on how to make the software more useful. Ricardo authored various free materials, some included (at least) in Conectiva, Debian, FreeBSD and SuSE (the verb conjugator "conjugue", the ispell dictionary br.ispell and the proxy axw3). He recently ported the EiC interpreter to the Psion 5 handheld and patched the Xt-based vncviewer to scale framebuffers and compute image diffs. Ricardo works as an independent developer and instructor. He received no financial aid to develop Clara OCR. He's not an employee of any company or organization. Imre Simon promotes the usage and development of free technologies and information from his research, teaching and administrative labour at the University. Roberto Hirata Junior and Marcelo Marcilio Silva contributed ideas on character isolation and recognition. Richard Stallman suggested improvements on how to generate HTML output. Marius Vollmer is helping to add Guile support. Jacques Le Marois helped with the announcement process. We acknowledge Mike O'Donnell and Junior Barrera for their good criticism. We acknowledge Peter Lyman for his remarks about the Berkeley Digital Library, and Wanderley Antonio Cavassin, Janos Simon and Roberto Marcondes Cesar Junior for some web and bibliographic pointers. Bruno Barbieri Gnecco provided hints and explanations about GOCR (main author: Jorg Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is kindly supporting our attempts to use portions of his code. Adriano Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried the tutorial before the first announcement.
Eduardo Marcel Macan packaged Clara OCR for Debian and suggested some improvements. Mandrakesoft is hosting claraocr.org. We acknowledge Conectiva and SuSE for providing copies of their outstanding distributions. Finally, we acknowledge the late Jose Hugo de Oliveira Bussab for his interest in our work. The fonts used by the "view alphabet map" feature came from Roman Czyborra's "The ISO 8859 Alphabet Soup" page at http://czyborra.com/charsets/iso8859.html. The names cited by the CHANGELOG and not cited before follow (small patches, bug reports, specfiles, suggestions, explanations, etc). Brian G., Bruce Momjian, Charles Davant (server admin), Daniel Merigoux, De Clarke, Emile Snider (preprocessor, to be released), Erich Mueller, groggy, Harold van Oostrom, Ho Chak Hung, Jeroen Ruigrok, Laurent-jan, Nathalie Vielmas, Romeu Mantovani Jr (packager), Ron Young, R P Herrold, Sergei Andrievskii, Stuart Yeates, Terran Melconian, Thomas Klausner (packager), Tim McNerney, Tyler Akins.
|