Wednesday, September 28, 2011

Run 32bit Java Apps on 64bit Linux With a Second JVM

So yesterday I ran into an application I wanted to run that required a 32bit JVM to operate without errors. Since I'm running 64bit Ubuntu Linux and I've (obviously) got a 64bit JVM installed, that meant getting a second JVM installed and running the app with that JVM instead of the system default one. Bit of a problem when you don't want to switch the system default JVM just to run one app, but there's a way around that.

Wednesday, September 21, 2011

DIY WiFi Signal Booster

Building my own HDTV antenna out of aluminum foil reminded me of a few years back when I needed to improve my WiFi signal, and I built foil reflectors to do that as cheaply as possible. There are a bunch of designs out there, but I like the EZ-12 parabolic reflector from since it's quick, cheap and easy to make. It's an aluminum, paper and tape solution to weak WiFi signal strength.

Wednesday, September 14, 2011

Convert Scanned PDF Files To Text In Linux With OCR

   Recently I needed to get a scanned PDF document onto my Kindle. Anyone who has tried this knows it's a problem, since the Kindle doesn't handle regular PDF files as well as a computer can, and scanned PDFs are even worse. The fonts are too small, and you have to zoom in manually and constantly readjust things to be able to read the entire document. It's a pain. Calibre is pretty good at converting text PDFs into more Kindle-compatible formats, but for scanned PDFs that have the text stored in unsearchable images, Calibre can't do much of anything. You'll need to use OCR (Optical Character Recognition) software to create a text copy of the image-based PDF and hand that over to Calibre to convert to mobi or whatever your ereader takes.

   Normally I don't even bother trying to read scanned PDFs on my Kindle. I just read them on my computer, but on this occasion I needed to go over some longer documents and didn't want to strain my eyes looking at an LCD for hours. Text PDFs are different. I read those on my Kindle all the time; I just have Calibre mangle them into mobi format first.

   First I searched for graphical tools to do this under Linux, and found gscan2pdf. I spent longer messing around with gscan2pdf than I should have. I still don't know if it can actually do what I wanted. I couldn't figure out how to make it work, and I no longer care to mess with it. Apparently the OCR functionality in gscan2pdf is meant for adding text to the existing scanned PDF so that it can be searched or indexed by a desktop search program like Beagle/Tracker/Spotlight/Google Desktop/Whatever. Just use the command line for OCR'ing your image-based PDF files into text files. It's quicker and not hard once you find the right commands, which I've already gone and Google'd for you.

   All you need to do is use Ghostscript to extract the pages of the scanned PDF into one large TIFF file, and then run Tesseract to OCR those pages into a (hopefully) coherent text file that can then be passed to Calibre or direct to your ereader.

   First install ghostscript and tesseract if you don't already have them.

   In Debian/Ubuntu-based distros:

      sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-eng

   (Though you should really be using aptitude in place of apt-get, IMO.)

   Then for each scanned PDF you want to convert to text, run the following:

      gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 -sOutputFile=OUTPUT.tif PDFNAME.pdf

   and then:

      tesseract OUTPUT.tif TEXTNAME -l eng

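   Obviously change PDFNAME.pdf to the name of your scanned PDF file and TEXTNAME to the desired name of your end result text file. Leave the extension off TEXTNAME, since Tesseract adds .txt to it on its own. OUTPUT.tif is just the intermediate TIFF, so call it whatever you like. Duh. :)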
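
   If you have more than one PDF to get through, the two commands above can be wrapped in a small shell function. This is only a rough sketch: ocr_pdf is a made-up helper name, it assumes the file name ends in .pdf, and deleting the intermediate TIFF afterwards is my own choice rather than something either tool requires.

```shell
#!/bin/sh
# Sketch: OCR one scanned PDF into a text file using the two commands above.
# ocr_pdf is a hypothetical helper name, not part of Ghostscript or Tesseract.
ocr_pdf() {
    pdf="$1"
    base="${pdf%.pdf}"   # strip the .pdf extension: scan.pdf -> scan
    tif="$base.tif"      # intermediate multi-page TIFF

    # Step 1: render every page of the PDF into one TIFF file.
    gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 \
       -sOutputFile="$tif" "$pdf" || return 1

    # Step 2: OCR the TIFF. Tesseract appends .txt to the output base name.
    tesseract "$tif" "$base" -l eng || return 1

    # The intermediate TIFF is no longer needed once the text file exists.
    rm -f "$tif"
}

# Usage: convert every PDF in the current directory.
# for f in *.pdf; do ocr_pdf "$f"; done
```

   Running ocr_pdf scan.pdf should leave scan.txt next to the original, ready to hand to Calibre.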

   This worked well for one of my work documents, but it spat out gibberish text for another two. Your mileage will definitely vary from file to file. It's too bad some scanned documents are so difficult for this software to read. OCR clearly isn't perfect at this point, but recognizing text in images is a hard problem to solve. At least we've got a workable free software OCR option in things like Tesseract.

   Another problem is that Tesseract took forever to run on long documents on my slow laptop. One particularly long PDF file took around 12 hours to get OCR'd. You'll want to run Tesseract on the fastest machine you've got, not your junk netbook.

   Despite the problems and the utter failure with two documents, I got one readable document out of this effort, plus the knowledge to at least have a chance of converting any scanned PDFs I come across in the future. I guess that's something.

   If anyone reading this has a better free solution I'd love to hear it, because rescanning the original paper documents and doing a better job scanning them this time isn't an option for me.

Wednesday, September 7, 2011

Android Game Console Emulators

   I've been playing around with game console emulators for Android on my HTC Inspire a lot lately. Spending more time gaming than I should be, but whatever. There will be time to do work and improve my programming later. Anyways, there's a bunch of different open source game console emulators that have been ported to Android and packaged into paid apps, but many of those have been kicked off the Android Market and you can get them for free, for now.