Mar 25, 2012
 

A previous post showed how to OCR Chinese texts using Adobe Acrobat Pro (OCRing is the process of recognizing text in an uneditable file, like a scan, and making it editable). While it works quite well, Acrobat Pro is a fairly expensive commercial product. The free Google Docs has the ability to OCR uploaded pdfs and image files in a number of languages, including both traditional and simplified Chinese. You enable it in the upload settings by clicking the upload icon, choosing Settings, then checking “Convert text …”

Enabling OCR in Google Docs

Once you have it enabled, when you upload a pdf (or image with text), you can select the language of the source file.

Selecting the language for OCR in Google Docs

Google Docs saves the editable text together with the source file in your account. A multi-page pdf is saved with the editable text interleaved between the pages. This makes it impossible to select all the editable text at once for pasting into a clean document; you have to select each page’s text separately.

In a couple of quick tests, the results from Google were on average nearly as good as those from Acrobat X: in some passages Adobe had more wrong characters, in others Google did. While neither app is great with punctuation, Google frequently misreads the Chinese full stop (。) as either a small o or a zero, which Adobe generally does not. Google does better with quotation marks, though, generally rendering the smart quotes from the source file correctly, while Adobe changes them to ASCII quotes.
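If you paste the OCRed text into a plain text file, the two punctuation problems described above can be partly cleaned up with a short script. What follows is only a rough sketch in Python, not something either app provides: the function names are made up for illustration, and the rules are heuristics that assume the text is mostly Chinese. The full-stop rule will also change a genuine zero or letter o that sits between Chinese characters (as in 第0章), so the output still needs proofreading.

```python
import re

# One Chinese character from the basic CJK Unified Ideographs block.
# Assumption: rarer characters in the extension blocks are not covered.
CJK = r"[\u4e00-\u9fff]"

# A lone "o" or "0" directly after a Chinese character, and followed by
# whitespace, a closing quote, another Chinese character, or the end of
# the text, is treated as a misread full stop.
MISREAD_STOP = re.compile(rf"(?<={CJK})[o0](?=\s|[”’」』]|{CJK}|$)")


def fix_full_stops(text: str) -> str:
    """Replace a likely misread 'o' or '0' with the Chinese full stop."""
    return MISREAD_STOP.sub("。", text)


def restore_double_quotes(text: str) -> str:
    """Turn straight double quotes into alternating curly quotes.

    Assumption: quotes are balanced and not nested, so the first straight
    quote opens, the second closes, and so on.
    """
    parts = text.split('"')
    out = [parts[0]]
    for i, part in enumerate(parts[1:]):
        out.append("“" if i % 2 == 0 else "”")
        out.append(part)
    return "".join(out)


if __name__ == "__main__":
    print(fix_full_stops("我是学生o 他是老师0"))
    print(restore_double_quotes('他说"你好"'))
```

Numbers such as 10 in 我有10本书 are left alone, because the zero there follows a digit rather than a Chinese character.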

Stay tuned for fuller coverage of the OCR showdown.

  8 Responses to “OCR Chinese with Google Docs”

  1. [...] can OCR image files in a number of languages, including traditional and simplified Chinese (see the post on using Google Docs to OCR Chinese); no sign up is required. A pdf has to be converted to an image file before it can be uploaded. [...]

  2. [...] for performing OCR on Chinese texts, but the options all required a desktop or laptop computer (Google Docs, Adobe Acrobat, Sciweavers i2OCR). In this post, we’ll look at several options for OCRing [...]

  3. Has anyone tried this recently? I have been OCR’ing simplified Chinese from digital photos for over a year, but since Saturday, none of the characters convert. This is true even for images that worked well in the past.

    • I also have this problem with Google Docs. I cannot get it to recognise text from Chinese .pdf files, although it always worked well before. This seemed to coincide with the release of Google Drive. The OCR help in Google Docs now says that it only recognises Latin character fonts.

      • I just tried this again and OCR of Simplified Chinese in a pdf still works for me. When I logged into Google Docs, I got a message saying that Docs would soon become Drive, so perhaps CB is right in thinking that the problem has to do with the introduction of Google Drive and it still works for me because I don’t have Drive yet.

  4. Hi,
    I have tried to accomplish OCR for Chinese in Google but it has not worked for me yet. It just creates an image of my pdf and pastes it on the Google document. Is there something I’m doing wrong?

    thanks,

    Adam

    • Hi, Adam,
      I’m not sure if this is what you’re describing, but what Google Docs gives you when you upload a file for OCR is a doc with the image first and then the text on subsequent pages. I don’t mean to ask a dumb question, but did you scroll past the image?

      If there’s no text added to the end of the document, all I can suggest is to check that you have the settings properly enabled and for the right language.

      • Same thing happened to me. I did set it to convert to text. After the pasted image in the document there was only a blank page; no Chinese text had been converted. I tested the same with an English document and it worked, but not Chinese. Any tips?
