Suppose you have a printed magazine and you want a digitized version of an article (for converting from traditional to simplified or vice versa, adding interlinear pinyin, annotating/highlighting, carrying around on your iPhone, or whatever). You can scan the page, but the scan alone will not let you change or add to the text. That’s where OCR (optical character recognition) comes in. OCR software makes the text of the scan selectable for copying and editing.

To recognize characters, OCR software needs to be aware of the language (or at least the character set) that it is “reading.” Among its many features, Adobe Acrobat (not the free Adobe Reader) can perform OCR on Chinese texts in both traditional and simplified Chinese. Simply open a PDF scan in Acrobat and choose Document > OCR Text Recognition > Recognize Text Using OCR. In the dialogue box that appears, click the “Edit” button and choose the language of your scanned text. When recognition finishes, you will be able to select text, copy it, and paste it into a word processor, text editor, or whatever.

The process is not perfect: stray marks or other writing/printing on the scan will result in extra symbols or characters. Some characters may just not be read correctly. For example, Acrobat seemed to consistently fail to recognize  from a text printed in a Kaiti-like font. Punctuation was a bit of a mess, but find-and-replace can usually take care of that quickly. Proofreading and correcting will take a little time, but it should be nothing like retyping from scratch unless you type Chinese very quickly.
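
If you would rather script the punctuation cleanup than run find-and-replace by hand, something like the following Python sketch might do. It assumes the OCR has turned full-width Chinese punctuation into ASCII look-alikes, which is only a guess at the kind of mess you will see; the exact errors will vary. The function name and the mapping are my own illustration, not anything Acrobat provides. To avoid mangling numbers such as 3.14 or 1,000, it only touches a punctuation mark that comes directly after a Chinese character.

```python
import re

# Hypothetical mapping from ASCII punctuation to full-width Chinese
# punctuation; adjust it to match the errors in your own OCR output.
PUNCT_MAP = {
    ",": "，",
    ".": "。",
    ";": "；",
    ":": "：",
    "?": "？",
    "!": "！",
}

# Match one ASCII punctuation mark that directly follows a character in
# the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).
_PATTERN = re.compile(r"(?<=[\u4e00-\u9fff])[,.;:?!]")


def normalize_punctuation(text: str) -> str:
    """Replace ASCII punctuation after a Chinese character with its full-width form."""
    return _PATTERN.sub(lambda m: PUNCT_MAP[m.group()], text)


if __name__ == "__main__":
    print(normalize_punctuation("你好,世界."))
```

Paste the text copied from Acrobat into a string (or read it from a file), run it through normalize_punctuation, and then proofread as usual; a script like this will not catch misread characters.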

The downside is that Acrobat is quite expensive commercial software. If you have other reasons to own Adobe Creative Suite, though, Acrobat Pro is included, and if you are a student or have an academic affiliation, you should be able to get an educational discount.

If you have experience with other OCR software that supports Chinese, please leave a comment.

Please note that I am not suggesting you violate copyright. Making digital copies of texts that you own, and for your personal use, should be legal (like ripping a CD that you own in order to load the music onto your iOS device), but I cannot guarantee that.

