The Clara OCR FAQ
-----------------

WELCOME

These are the Clara OCR Frequently Asked Questions. They are meant
as a first contact with Clara OCR. If you're looking for information
on how to use Clara OCR, please try the Clara OCR Tutorial instead.

Clara OCR can be found at http://www.claraocr.org/.

CONTENTS

     1. What is Clara OCR?
     2. Why is Clara a "cooperative OCR"?
     3. Is Clara OCR Free? Open Source?
     4. Is Clara OCR a GNU program?
     5. On which platforms does Clara OCR run?
     6. Does Clara OCR have a command-line interface?
     7. Does Clara OCR run on KDE? GNOME?
     8. Which languages are supported by Clara OCR?
     9. Does Clara OCR support Unicode?
    10. Is Clara OCR omnifont?
    11. How does Clara differ from other OCRs?
    12. What is PBM/PGM/PPM/PNM?
    13. How can I scan paper documents using Clara OCR?
    14. I've tried Clara OCR, but the results disappointed me
    15. How can I get support on Clara OCR?
    16. Does Clara OCR induce to Copyright Law infringements?
    17. How can I help the Clara OCR development effort?

1. What is Clara OCR?

Clara is an OCR program. OCR stands for "Optical Character
Recognition". An OCR program tries to recognize the characters in
the digital image of a paper document. The name Clara stands for
"Cooperative Lightweight chAracter Recognizer".

2. Why is Clara a "cooperative OCR"?

Clara is a cooperative OCR because it offers a web interface for
training and revision, so these tasks can benefit from the revision
effort of many people across the Internet. However, Clara OCR also
offers a powerful X-based GUI for standalone usage.

3. Is Clara OCR Free? Open Source?

Clara OCR is distributed under the terms of the GNU General Public
License (GPL) version 2. Yes, Clara OCR is Free. Yes, Clara OCR is
Open Source. Clara OCR is neither "Shareware" nor "Public Domain".

4. Is Clara OCR a GNU program?

Clara OCR is unrelated to the GNU Project, but its development
relies heavily on GNU programs (GCC, Emacs and others), as well as
on other free software, like the Linux kernel and XFree86. Clara OCR
is free software because we agree with the free software ideal as
stated by the GPL. To make this agreement explicit we also adopted
some suggestions from the Free Software Foundation. These
suggestions apply to the Clara OCR documentation:

    (a) GPL programs are referred to as "free software", not "open
        source".

    (b) The term "GNU/Linux (operating system)" is used rather than
        "Linux (operating system)".

    (c) We do not recommend non-free software and do not refer the
        user to non-free documentation for free software.

Furthermore, Clara OCR will support Guile as an extension language
in the near future.

Note: we write "free software" instead of "open source" just for
consistency. We dislike antagonisms between the various initiatives
created over the years to freely produce, use, change and distribute
software.

5. On which platforms does Clara OCR run?

Clara OCR is being developed on 32-bit Intel running GNU/Linux.
Currently Clara OCR runs neither on big-endian CPUs (e.g. SPARC) nor
on systems lacking X Window System support (e.g. MS-Windows). A
relatively fast CPU (300MHz or more) is recommended. A port to
MS-Windows is being worked on. See also the next question.

6. Does Clara OCR have a command-line interface?

Yes, but the X Window System headers and libraries are still
required to compile the source code, and the X libraries are
required to run even the Clara OCR command-line interface. Unless
someone reworks the code, it's not possible to detach the GUI in
order to compile Clara OCR on systems that do not support the X
Window System.

7. Does Clara OCR run on KDE? GNOME?

Clara OCR will hopefully run on any graphic environment based on the
X Window System, including KDE, GNOME, CDE, WindowMaker and others.
Clara OCR depends only on the X library, and does not require GTK,
Qt or Motif to run. Clara OCR does not use the X Toolkit (aka "Xt").
Clara OCR has been successfully tested on X11R5 and X11R6
environments with twm, fvwm, mwm and others.

8. Which languages are supported by Clara OCR?

As a generic recogniser, Clara OCR may be tried with any language
and any alphabet. However, there are some restrictions. Currently
Clara OCR expects the words to be written horizontally, and some of
its heuristics assume geometric relationships typical of the Latin
alphabet and of the accents used by most European languages. Support
for language-specific spell checking is expected to be added soon.

9. Does Clara OCR support Unicode?

No, Clara OCR does not support Unicode, and support for the ISO-8859
charsets is partial.

10. Is Clara OCR omnifont?

No, Clara OCR is not omnifont. Clara OCR implements an OCR model
based on training. This model makes training and revision one and
the same thing, making it possible to reuse training and revision
information (see also the next question).

11. How does Clara differ from other OCRs?

This is a quote from the Clara Advanced User's Manual:

    Clara differs from other OCR software in various aspects:

    1. Most known OCRs are non-free and Clara is free. Clara focuses
       on the X Window System. Clara offers batch processing, a web
       interface and supports cooperative revision effort.

    2. Most OCR software focuses on omnifont technology, disregarding
       training. Clara does not implement omnifont techniques and
       concentrates on building specialized fonts (some day in the
       future, however, maybe we'll try classification techniques
       that do not require training).

    3. Most OCR software makes the revision of the recognized text a
       process totally separated from the recognition. Clara
       pragmatically joins the two processes, and makes training and
       revision parts of one and the same thing.
       In fact, the OCR model implemented by Clara is an interactive
       effort where the usage of the heuristics alternates with
       revision and fine-tuning of the OCR, guided by the user's
       experience and feeling.

    4. Clara allows the user to enter the transliteration of each
       pattern using an interface that displays a graphic cursor
       directly over the image of the scanned page, and builds and
       maintains a mapping between graphic symbols and their
       transliterations in the OCR output. This is a potentially
       useful mechanism for documentation systems, and a valuable
       tool for typists and reviewers. In fact, Clara OCR may be
       seen as a productivity tool for typists.

    5. Most OCR software is integrated with scanning tools, offering
       the user a unified interface to execute all steps from
       scanning to recognition. Clara does not offer such an
       integrated interface, so you need separate software (e.g.
       SANE) to perform scanning.

    6. Most OCR software expects the input to be a graphic file
       encoded in TIFF or other formats. Clara supports only raw PBM
       and PGM.

12. What is PBM/PGM/PPM/PNM?

PBM, PGM and PPM are graphic file formats defined by Jef Poskanzer.
PNM is not a graphic file format, but a generic reference to those
three formats. In other words, to say that a program supports PNM
means that it handles PBM, PGM and PPM.

    PBM = Portable BitMap
    PGM = Portable GrayMap
    PPM = Portable PixMap
    PNM = Portable aNyMap

PBM files are black-and-white images, 1 bit per pixel. PGM files are
grayscale images, 8 bits per pixel. PPM files are color images, 24
bits per pixel.

Currently Clara OCR accepts raw PBM and raw PGM files only. A
scanned page stored in some format other than PBM or PGM can be
converted to PBM or PGM using the netpbm tools, ImageMagick or
others.

PNM files may be "raw" or "plain". The plain versions are rarely
used. Clara OCR supports neither plain PBM nor plain PGM.

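
The raw/plain distinction is visible in the first two bytes of a PNM
file, the so-called magic number: P1, P2 and P3 mark plain PBM, PGM
and PPM, while P4, P5 and P6 mark their raw counterparts. As a rough
illustration only (this is not part of Clara OCR, and the names
classify_pnm, clara_can_read and check_file are made up for this
sketch), a check written in Python could look like this:

```python
# Netpbm magic numbers: P1-P3 are the plain (ASCII) variants,
# P4-P6 are the raw (binary) variants.
MAGIC = {
    b"P1": ("PBM", "plain"),
    b"P2": ("PGM", "plain"),
    b"P3": ("PPM", "plain"),
    b"P4": ("PBM", "raw"),
    b"P5": ("PGM", "raw"),
    b"P6": ("PPM", "raw"),
}


def classify_pnm(header: bytes):
    """Return (format, encoding) for the leading bytes of a file,
    or None if they do not start with a PNM magic number."""
    return MAGIC.get(header[:2])


def clara_can_read(header: bytes) -> bool:
    """True only for raw PBM and raw PGM, the two inputs this FAQ
    says Clara OCR accepts (hypothetical helper, not a Clara API)."""
    return classify_pnm(header) in (("PBM", "raw"), ("PGM", "raw"))


def check_file(path: str) -> bool:
    """Read the first two bytes of a file and apply the check above."""
    with open(path, "rb") as f:
        return clara_can_read(f.read(2))
```

Calling check_file("test.pbm") would then return True for a raw PBM
page and False for a plain PBM, a PPM or a non-PNM file.
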
To check the file format, try the "file" utility, for instance:

    file test.pbm

Remember that image conversion sometimes implies data loss. For
instance, to convert a color image to black-and-white, each pixel
must be mapped to either black or white, so the original color (say,
red, lightblue, seagreen, tomato, mistyrose, etc.) is dropped. The
conversion process must also decide, for each pixel, whether it is
mapped to black or to white. Generally, the program that performs
the conversion offers a variety of mapping criteria. The OCR results
depend strongly on the criterion chosen.

13. How can I scan paper documents using Clara OCR?

You cannot. Clara OCR includes no support for scanners. To scan
paper documents, use other software, like the one bundled with your
scanner, or SANE (http://www.mostang.com/sane/). The development
tests use SANE.

14. I've tried Clara OCR, but the results disappointed me

All OCR programs will disappoint you, depending on the texts you're
trying to recognize. If you're a developer, join the Clara OCR
development effort and try to make it behave better on your texts.
If you are not a developer, wait for a new version and try again.

15. How can I get support on Clara OCR?

If the documentation did not solve your problems, try the discussion
list.

16. Does Clara OCR induce to Copyright Law infringements?

No. Clara OCR is just a tool for character recognition, like many
others that can be purchased or are bundled with scanners. The Clara
OCR Project asks all users to be aware of copyright law and not to
infringe it. The Clara OCR Project condemns any attempt to infringe
the legitimate laws of any country. Nonetheless, the Clara OCR
Project supports the free and public availability of materials
produced to be free, or of materials out of copyright due to their
age. The Clara OCR Project recognizes the right of anyone to produce
free or non-free materials.

17. How can I help the Clara OCR development effort?

The best way is to use Clara OCR to recognize the texts you're
interested in, and try to make it adapt better to them. The
Developer's Guide should help in this case (C programming skills are
required). The Clara OCR Project acknowledges all efforts to make
Clara OCR more widely known and used.