Optimizing Binarization for OCRopus

In many hours of work and frustration I have learned that page segmentation and character models have a strong influence on the result of OCR. However, I always underestimated the effects of the initial step – the binarization.

Now I have looked at the binarization step more closely. Starting material are rather poor scans  of the Austrian-Hungarian casualty lists. The scans themselves are not so bad. However, the paper is more than 100 years old and yellowed. There is a cutout:

Instead of looking at the binarization result directly or applying any kind of metric to it, I deceided to measure the final OCR result, i.e. the character accuracy. To avoid any problems during page segmentation I cut the two-column scans in half manually. Additionally I counted the detected lines. Since I used the same ground truth data for all measurements a missing/additional line would immediately be noticed by a very high error rate.

Typically ocropus-nlbin is used to create a bi-tonal image. As you can see the result is not very promissing. May characters are eroded.

After running the page segmentation ocropus-gpagesep without additional parameters I noticed that some characters/words were eroded. By using the debug mode (“-d” option) you can see these holes in the line seeds:

With the “–usegauss” option that switches to a Gaussian kernel instead of a uniform kernel that problem can be avoided. But there was still another problem. Occasionally two lines were merged into one image. I also investiaged that problem using the debug mode. This is the relevant section:

The 7th and 8th line get merged. Apparently, the small red connection between the two lines is the cause of the merging. It seems as if the problem always occurs near white column separators. Therefore, I have to deactivate the detection of white column separators:

ocropus-gpageseg --csminheight 100000 --usegauss 0001.bin.png

Now the page separation works without problems and I can run the character prediction using my trained character model for the casualty lists.

As a different approach for binarization I have used ScanTailor, a program that has served me well in the past optimizing scans. By default ScanTailtor produces a 600dpi bi-tonal image from 300dpi gayscale input. It also allows the use the configure the threshold used for binarization. Although values from -50 to +50 are possible, for this example only values between 0 and +40 seem reasonable.

Finally, here are the results of my experiments:

The bars in blue show the results of the ScanTailor binarization. The most right bar shows the effect the resolution of the bi-tonal image has. It is a 300dpi version with the best threshold setting determined for the 600dpi images.

As you can see, the binarization method used has a significant influence on the recognition quality. In my case the character recognition error is reduced by more than 60%.

You can find all images and ground truth data for this experiment on GitHub: https://github.com/jze/ocropus-binarization-experiment

One Response to “Optimizing Binarization for OCRopus”
  1. Philipp says:

    This is a very interesting analysis on the effects of different parameters. I definitely have to look closer at the Gaussian kernel and try it out. Thank you for sharing all of these.

    Did you tweak the image with ScanTailor manually further, or would it also be possible to do this step automatically? The improving recognition result seems to come from the different binarization but also from the different resolution. Would the error rate go further down, if one increase the resolution more?

    Another improvement is to use the grayscale images for prediction (you still need the binarized images for the steps in between), i.e. something like
    i) ocropus-gpageseg -n –csminheight 100000 –usegauss –gray left.bin.png right.bin.png
    ii) ocropus-rpred -m verlustliste_österreich-ungarn.pyrnn.gz OCRpus-nlbin/*.nrm.png

    This results for me in an error rate of 3,7% (no ScanTailor optimization, 300dpi).