Sunday, April 22, 2012

Linux and OS X OCR

Mostly for my notes:
pdftoppm -f 2 -gray AmericanLegion.pdf AL

for i in *.pgm
do
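    # Quote the names so files with spaces work; delete the .pgm only if the conversion succeeded
    pnmtotiff "$i" > "${i%%.pgm}.tif" && rm "$i"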
done


for i in *.tif
do
    tesseract "$i" "${i%%.tif}"
done

Tesseract needs images of decent resolution (roughly 300 dpi or more is the usual recommendation). For example, when exporting slides from PowerPoint, it's better to "Save as Pictures" at a higher resolution than the default.
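
The same resolution concern applies to the PDF route: pdftoppm renders at 150 dpi unless told otherwise, and its -r option sets the resolution. Below is a rough sketch that folds the steps above into one function. The names page_base and ocr_pdf and the 300 dpi default are my own assumptions for illustration, not something from the original commands.

```shell
#!/bin/sh
# Sketch only: render PDF pages to grayscale, convert to TIFF, then OCR.
# page_base, ocr_pdf and the 300 dpi default are illustrative assumptions.

# Strip a known extension: page_base AL-02.pgm pgm prints AL-02
page_base() {
    printf '%s\n' "${1%."$2"}"
}

# Usage: ocr_pdf FILE.pdf PREFIX [FIRST_PAGE] [DPI]
ocr_pdf() {
    pdf=$1 prefix=$2 first=${3:-1} dpi=${4:-300}

    # -f: first page to render, -r: resolution in dpi, -gray: write .pgm files
    pdftoppm -f "$first" -r "$dpi" -gray "$pdf" "$prefix" || return 1

    for page in "$prefix"-*.pgm
    do
        [ -e "$page" ] || continue          # no pages matched the glob
        base=$(page_base "$page" pgm)
        # Remove the intermediate .pgm only if the TIFF was written
        pnmtotiff "$page" > "$base.tif" && rm "$page" || return 1
        # tesseract appends .txt to the output base name
        tesseract "$base.tif" "$base" || return 1
    done
}
```

After sourcing this, something like `ocr_pdf AmericanLegion.pdf AL 2` should do the same as the commands above, starting at page 2 but rendering at 300 dpi. Higher resolutions give larger files and slower OCR, so treat the number as a starting point to tune.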