LOCR - An Optical Character Recognition Program for Linux

Miguel A. Lerma

I worked on this project during the summer of 2000. Then I got busy with other projects and never had a chance to come back to this one. In its current state the recognition algorithm needs improvement (e.g. it confuses similar-looking symbols such as 0 and O, or 1 and l, and it does not handle italics well), but I made it public so that others can take advantage of the ideas in it.

Sources (version 0.1.0)

How to use it

First, compile it:
     gcc -O2 locr.c -o locr
Next, use it. It works in two modes:

Example:

  1. Original document (PDF version)
  2. Document in PNM format
  3. Output of locr (ISO-8859-1 encoding)

How it works

  1. Load image as an array of 0's and 1's.
  2. Remove dust and snow.
  3. Make table of blocks. (A block is a set of contiguous pixels. Some characters are made of more than one block, for instance "i" has two blocks: the body and the dot.)
  4. Remove atypical blocks (with unrealistic dimensions). In this way pictures are removed.
  5. Find columns of text. Make table of columns. Sort blocks by columns.
  6. On each column find lines of text. Make table of lines. Sort blocks by lines and by their position on each line.
  7. On each line, join blocks that overlap horizontally. This merges the component pieces of each character (such as the body and the dot of an "i") into a single object.
  8. Compute the attributes of each character (e.g. number of pieces, number of holes, vertical position on the line, etc.).
  9. For each character, compare its attributes to those stored in the data file. Select possible "candidates". Compare the given character to the candidates selected and decide which one is the closest. The final comparison is made by computing the Hamming distance (number of pixels where they differ) between scaled 16x16 versions of the characters. If no candidate is found similar enough to the current character, relax the criteria and try again.
  10. In "generate data" mode the previous step is replaced by just dumping the information collected into a file.
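
The final comparison in step 9 can be illustrated with a short sketch. This is not the code from locr.c; it only shows one way to compute the Hamming distance between two 16x16 bitmaps and pick the closest candidate. The type Glyph, the functions glyph_distance and closest_candidate, and the max_dist threshold are hypothetical names introduced for this example, and storing each bitmap row as a 16-bit word is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define GLYPH_SIZE 16

/* Hypothetical representation: a 16x16 bitmap, one 16-bit word per row. */
typedef struct {
    unsigned short rows[GLYPH_SIZE];
} Glyph;

/* Count the bits set in the low 16 bits of v. */
static int popcount16(unsigned int v)
{
    int n = 0;
    for (v &= 0xFFFFu; v != 0; v &= v - 1)  /* clear lowest set bit */
        n++;
    return n;
}

/* Hamming distance: number of pixels where the two bitmaps differ. */
int glyph_distance(const Glyph *a, const Glyph *b)
{
    int d = 0;
    assert(a != NULL && b != NULL);
    for (int i = 0; i < GLYPH_SIZE; i++)
        d += popcount16((unsigned int)(a->rows[i] ^ b->rows[i]));
    return d;
}

/*
 * Return the index of the candidate closest to g, or -1 if no candidate
 * is within max_dist pixels (the threshold is an assumption of this sketch).
 */
int closest_candidate(const Glyph *g, const Glyph *cands, int n, int max_dist)
{
    int best = -1;
    int best_d = max_dist + 1;
    assert(g != NULL && n >= 0 && (n == 0 || cands != NULL));
    for (int i = 0; i < n; i++) {
        int d = glyph_distance(g, &cands[i]);
        if (d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return best;
}
```

With a helper like this, the "relax the criteria and try again" part of step 9 could be approximated by calling closest_candidate again with a larger max_dist (or a wider candidate set) whenever it returns -1.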

Other free OCR projects


Email me at: mlerma at math dot northwestern dot edu