LOCR - An Optical Character Recognition Program for Linux
Miguel A. Lerma
I worked on this project during the summer of 2000.
Then I got busy with other projects and never had a chance to come
back to this one. In its current state the recognition algorithm
needs improvement (e.g. to avoid confusing similar-looking symbols such
as 0 and O or 1 and l, and to recognize italics), but I made it public
so that others can take advantage of the ideas in it.
Sources (version 0.1.0)
- locr.c: C source of the program.
- foo.ps: Sample document used for
scanning and generating data.
- foo.txt: Plain text version of the sample
text (ISO-8859-1 encoding).
- foo.pnm: Scanned version of the sample text.
- foo.dat: Data generated from the sample text.
- gpl.txt: GNU General Public License.
- locr-0.1.0.tgz: The whole thing in a
single file.
How to use it
First, compile it:
gcc -O2 locr.c -o locr
Next, use it. It works in two modes:
- Generate data mode: Used for generating a database of
information to be used later in recognizing mode. Example: we have a
document "foo.ps" and we want to use it for compiling data. Do the
following:
- Print it.
- Scan it. Make sure that the output is a P1 or P4 pnm file "foo.pnm".
The recommended resolution is 300dpi.
Alternatively, the two previous steps can be replaced by the following
command:
gs -sDEVICE=pnmraw -r300x300 -sOutputFile=foo.pnm -dNOPAUSE foo.ps quit.ps
- Write a plain text version "foo.txt" of the document.
- Generate the data file "foo.dat" with the following command:
./locr -g -t foo.txt foo.pnm foo.dat
The original document should be very simple (single column, no
pictures...), contain a wide range of characters (we want to collect
as much information as possible) and be virtually perfect (otherwise
the program will make mistakes and the information collected will be
useless). Test the result by running the program in recognizing mode on
"foo.pnm" with "foo.dat" as the data file. The output should be identical
to "foo.txt", except possibly for the spacing of the characters.
- Recognizing mode: Uses information previously collected in a
data file for recognizing a scanned text. Example: after scanning a
document we get a PNM file "document.pnm" and we want to recognize it
using "foo.dat" as data file. Do it with the following command (the
output goes to "document.out"):
./locr -d foo.dat document.pnm document.out
Example:
- Original document (PDF version)
- Document in PNM format
- Output of locr (ISO-8859-1 encoding)
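Both modes expect a P1 (plain) or P4 (raw) PNM bitmap as input. As a quick way to check a scan before passing it to locr, the header can be inspected along the following lines. This is a minimal sketch and not the reader used in locr.c: the function names (pbm_skip_blank, pbm_read_int, pbm_read_header) are hypothetical, and the header layout assumed is the standard PBM one (magic number, width, height, with optional "#" comments).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Skip whitespace and '#' comments; return the next significant
   character, or EOF. (Hypothetical helper, not part of locr.) */
int pbm_skip_blank(FILE *f)
{
    int c;
    for (;;) {
        c = fgetc(f);
        if (c == '#') {
            /* A comment runs to the end of the line. */
            do { c = fgetc(f); } while (c != EOF && c != '\n');
            if (c == EOF) return EOF;
        } else if (c == EOF || !isspace(c)) {
            return c;
        }
    }
}

/* Read a non-negative decimal number; return -1 if none is found.
   The single character following the number is consumed, which matches
   the one whitespace byte that separates the header from the pixels.
   Overflow is not checked in this sketch. */
int pbm_read_int(FILE *f)
{
    int c = pbm_skip_blank(f);
    int n = 0;
    if (c == EOF || !isdigit(c)) return -1;
    while (c != EOF && isdigit(c)) {
        n = n * 10 + (c - '0');
        c = fgetc(f);
    }
    return n;
}

/* Return 1 for a P1 file, 4 for a P4 file, -1 for anything else.
   On success *width and *height hold the image dimensions. */
int pbm_read_header(FILE *f, int *width, int *height)
{
    int kind;
    assert(f != NULL && width != NULL && height != NULL);
    if (fgetc(f) != 'P') return -1;
    kind = fgetc(f);
    if (kind != '1' && kind != '4') return -1;
    *width = pbm_read_int(f);
    *height = pbm_read_int(f);
    if (*width <= 0 || *height <= 0) return -1;
    return kind - '0';
}
```

A return value of -1 means the file is not a bitmap PNM (for instance a P5 grayscale scan) and should be converted or rescanned in black and white.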
How it works
- Load image as an array of 0's and 1's.
- Remove dust and snow.
- Make table of blocks. (A block is a set of contiguous pixels.
Some characters are made of more than one block, for instance "i" has
two blocks: the body and the dot.)
- Remove atypical blocks (with unrealistic dimensions). In this way
pictures are removed.
- Find columns of text. Make table of columns. Sort blocks by
columns.
- On each column find lines of text. Make table of lines. Sort
blocks by lines and by their position on each line.
- On each line join blocks that overlap horizontally. This transforms
the component pieces of each character (such as the body and the dot
in an "i") into a single object.
- Compute attributes of each character (e.g.: number of pieces,
number of holes, vertical position on the line, etc.)
- For each character, compare its attributes to those stored in the
data file. Select possible "candidates". Compare the given character
to the candidates selected and decide which one is the closest. The
final comparison is made by computing the Hamming distance (number of
pixels where they differ) between scaled 16x16 versions of the
characters. If no candidate is found similar enough to the current
character, relax the criteria and try again.
- In "generate data" mode the previous step is replaced by just
dumping the information collected into a file.
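The final comparison step described above (Hamming distance between scaled 16x16 versions of the characters) can be sketched as follows. This is an illustration under stated assumptions, not the code in locr.c: the type glyph16 and the functions glyph_scale, glyph_hamming and glyph_best_match are hypothetical names, the nearest-neighbor scaling rule is an assumption, and so is the choice to pack each row into a 16-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GLYPH_SIZE 16

/* A normalized character: 16 rows, one bit per pixel.
   (Hypothetical representation; bit 15 is the leftmost pixel.) */
typedef struct {
    uint16_t rows[GLYPH_SIZE];
} glyph16;

/* Scale a w x h bitmap (one byte per pixel, 0 or nonzero, row-major)
   to 16x16 by nearest-neighbor sampling. The sampling rule is an
   assumption made for this sketch. */
glyph16 glyph_scale(const unsigned char *pix, int w, int h)
{
    glyph16 g;
    assert(pix != NULL && w > 0 && h > 0);
    for (int y = 0; y < GLYPH_SIZE; y++) {
        int sy = y * h / GLYPH_SIZE;        /* source row */
        uint16_t row = 0;
        for (int x = 0; x < GLYPH_SIZE; x++) {
            int sx = x * w / GLYPH_SIZE;    /* source column */
            row = (uint16_t)((row << 1) | (pix[sy * w + sx] ? 1u : 0u));
        }
        g.rows[y] = row;
    }
    return g;
}

/* Count the set bits of a 16-bit word. */
int popcount16(uint16_t v)
{
    int n = 0;
    while (v) {
        v = (uint16_t)(v & (v - 1));        /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hamming distance: number of pixels where the two glyphs differ. */
int glyph_hamming(const glyph16 *a, const glyph16 *b)
{
    int d = 0;
    for (int i = 0; i < GLYPH_SIZE; i++)
        d += popcount16((uint16_t)(a->rows[i] ^ b->rows[i]));
    return d;
}

/* Index of the candidate closest to g, or -1 if no candidate is
   within max_dist pixels. */
int glyph_best_match(const glyph16 *g, const glyph16 *cands, int n, int max_dist)
{
    int best = -1;
    int best_dist = max_dist + 1;
    for (int i = 0; i < n; i++) {
        int d = glyph_hamming(g, &cands[i]);
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;
}
```

In this sketch, "relax the criteria and try again" would correspond to calling glyph_best_match again with a larger max_dist (or a wider candidate list) whenever it returns -1.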
Other free OCR projects
Email me at: mlerma at math dot northwestern dot edu