Monday, June 21, 2010

OCR for PDF Files and Images on Google Docs

Okay, this is basically a repost of an article on Google Operating System, but since about 100% of my readership doesn't read such things, I figure I might venture it (at the risk of redundancy).

OCR stands for optical character recognition, and it basically translates into awesome.

Okay, to be more straightforward, OCR allows a computer to take a PDF or an image (JPEG, PNG, TIFF, BMP) and extract the text from it.

This may sound underwhelming. Keep in mind that for a computer, the words in a scanned PDF or an image mean as much as a blank sheet of paper does to us (nothing). For a computer to understand the chicken scratch on a page, it needs the 0101010101 behind each character. OCR is what makes that possible.
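For the curious, here is a tiny Python sketch of what "the 0101010101 behind each character" looks like once text really is text. The function name `bits_behind` and the sample word "milk" are my own illustrations (hypothetical, not anything from Google Docs), and this only shows the end result OCR is aiming for, not how OCR itself works.

```python
def bits_behind(text):
    """Return the 8-bit binary form of each UTF-8 byte in `text`."""
    return [format(byte, "08b") for byte in text.encode("utf-8")]


# "milk" is a made-up grocery list entry, used only for illustration.
print(bits_behind("milk"))
# ['01101101', '01101001', '01101100', '01101011']
```

A picture of the word "milk" contains none of those bytes, only pixels, which is why the computer needs OCR to get from one to the other.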

Unfortunately, as you can see from the following example, taken from our grocery list, the feature is not quite fully baked, leaving out critical formatting and missing the mark on a number of words and phrases:

All in good time, gentle readers, all in good time.

Between you and me, what blows my mind is that I read about this new feature just this afternoon, and it's already there.

However, I'm still waiting for the new Docs editor/format to become the default and for the enhanced sharing features I told you about before . . .

4 comments:

Not sure said...

I'm just proud that I'm starting to understand all of the computer terms. I feel very sophisticated. Thanks for keeping me up on the technical terms.

Daniel said...

Yes, I will condescend to teaching other people my vast stores of knowledge.

Heck, soon enough, I may even teach you how to say, "merci beaucoup"!

Brandon Brooks said...

I've been a subscriber to the Google Operating System blog for a while. They basically repost everything that the Google blog posts, so I don't think you have much to worry about in terms of redundancy.

Daniel said...

Interesting.

To me, it seems that Alex usually beats Google blogs to the punch.

But again (as I implied in another comments section), which Google blog are you referring to? Docs?