Wednesday, September 14, 2011

Convert Scanned PDF Files To Text In Linux With OCR

   Recently I needed to get a scanned PDF document onto my Kindle. Anyone who has tried this knows it's a problem, since the Kindle doesn't handle regular PDF files as well as a computer can, and scanned PDFs are even worse. The fonts are too small, and you have to zoom in manually and constantly readjust things to read the entire document. It's a pain. Calibre is pretty good at converting text PDFs into more Kindle-compatible formats, but for scanned PDFs, where the text is stored as unsearchable images, Calibre can't do much of anything. You'll need to use OCR (Optical Character Recognition) software to create a text copy of the image-based PDF and hand that over to Calibre to convert to mobi or whatever your ereader takes.

   Normally I don't even bother trying to read scanned PDFs on my Kindle; I just read them on my computer. This time, though, I needed to go over some longer documents and didn't want to strain my eyes looking at an LCD for hours. Text PDFs are different. I read those on my Kindle all the time, after having Calibre mangle them into mobi format first.

   First I searched for graphical tools to do this under Linux and found gscan2pdf. I messed around with it for longer than I should have. I still don't know if it can actually do what I wanted; I couldn't figure out how to make it work, and I no longer care to try. Apparently the OCR functionality in gscan2pdf is meant for adding text to the existing scanned PDF so that it can be searched or indexed by a desktop search program like Beagle/Tracker/Spotlight/Google Desktop/Whatever. Just use the command line to OCR your image-based PDF files into text files. It's quicker and not hard once you find the right commands, which I've already gone and Googled for you.

   All you need to do is use Ghostscript to extract the pages of the scanned PDF into one large TIFF file, and then run Tesseract to OCR those pages into a (hopefully) coherent text file that can then be passed to Calibre or direct to your ereader.

   First install Ghostscript and Tesseract if you don't already have them.

   In Debian/Ubuntu-based distros:

      sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-eng

   (Though you should really be using aptitude in place of apt-get, IMO.)
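   If you're not sure whether both tools are already on your system, a quick check along these lines should tell you. This is only a sketch: missing_tools is a made-up helper name, and it relies on nothing more than the shell's standard command -v lookup.

```shell
#!/bin/sh
# Sketch: report which of the required tools are not on the PATH.
# missing_tools is a hypothetical helper, not part of either package.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can find the command
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools gs tesseract)
if [ -n "$missing" ]; then
    echo "Not installed yet:"
    echo "$missing"
else
    echo "Ghostscript and Tesseract are both installed."
fi
```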

   Then for each scanned PDF you want to convert to text, run the following:

      gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 -sOutputFile=OUTPUT.tif PDFNAME.pdf

   and then:

      tesseract OUTPUT.tif TEXTNAME -l eng

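   Obviously change PDFNAME.pdf to the name of your scanned PDF file and TEXTNAME to the desired name of your end result text file. Duh. :) Tesseract adds the .txt extension itself, so leave it off TEXTNAME; the text ends up in TEXTNAME.txt. OUTPUT.tif is just the intermediate image file and can be deleted afterwards.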
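   If you have a pile of PDFs to get through, the two commands above can be wrapped in a small shell script. Treat this as a rough sketch rather than something I ran as-is: the names run and pdf_to_text and the DRY_RUN variable are made up for illustration, and it assumes every file name ends in .pdf. The gs and tesseract options are the same ones shown above.

```shell
#!/bin/sh
# Sketch: convert each scanned PDF given on the command line to a text file.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

pdf_to_text() {
    pdf=$1
    base=${pdf%.pdf}    # strip the .pdf extension (assumes there is one)
    # Step 1: render every page into one multi-page TIFF.
    run gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=a4 \
        -sOutputFile="$base.tif" "$pdf" &&
    # Step 2: OCR the TIFF; tesseract writes its result to "$base.txt".
    run tesseract "$base.tif" "$base" -l eng
}

for f in "$@"; do
    pdf_to_text "$f"
done
```

   Saved as, say, ocrpdf.sh (a hypothetical name), you could run DRY_RUN=1 sh ocrpdf.sh *.pdf first to see what it would do, then drop the DRY_RUN=1 for the real thing.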

   This worked well for one of my work documents, but it spat out gibberish text for another two. Your mileage will definitely vary from file to file. It's too bad some scanned documents are so difficult for this software to read. OCR clearly isn't perfect at this point, but recognizing text in images is a hard problem to solve. At least we've got a workable free software OCR option in things like Tesseract.

   Another problem is that it also took forever to run Tesseract on long documents on my slow laptop. It took around 12 hours for one particularly long PDF file to get OCR'd. You'll want to run Tesseract on the fastest machine you've got, not your junk netbook.
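   Since a full run can eat hours, it may be worth OCR'ing just a few pages first to see whether the output is readable at all. Ghostscript's -dFirstPage and -dLastPage options limit which pages of the PDF get rendered. The sketch below assumes the same settings as the commands above; ocr_sample, SAMPLE.tif and the GS/TESSERACT override variables are names I made up for illustration.

```shell
#!/bin/sh
# Sketch: OCR only pages FIRST..LAST of a scanned PDF as a quick quality check.
# GS and TESSERACT can be overridden (e.g. GS=echo) to preview the commands.
GS=${GS:-gs}
TESSERACT=${TESSERACT:-tesseract}

ocr_sample() {
    pdf=$1
    first=$2
    last=$3
    # Render just the requested page range into SAMPLE.tif ...
    "$GS" -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=a4 \
        -dFirstPage="$first" -dLastPage="$last" \
        -sOutputFile=SAMPLE.tif "$pdf" &&
    # ... then OCR it; the text lands in SAMPLE.txt.
    "$TESSERACT" SAMPLE.tif SAMPLE -l eng
}

# Example: ocr_sample PDFNAME.pdf 1 3 && less SAMPLE.txt
```

   If the sample comes out as gibberish, the full document probably will too, and you've saved yourself the wait.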

   Despite the problems and utter failure with two documents, I got one readable document out of this effort, plus the knowledge to at least have a chance of converting any scanned PDFs I come across in the future. I guess that's something.

   If anyone reading this has a better free solution I'd love to hear it, because rescanning the original paper documents and doing a better job scanning them this time isn't an option for me.

1 comment:

  1. Good post. I'd recommend adding -dFirstPage=XXX and -dLastPage=YYY to do a sample before doing an entire document. Might save some time in the long run