Illuminating the Dark Web

In its ongoing effort to index the world, Google adds searchability to scanned documents

furryscaly (CC Licensed)

When a government agency, medical office, or other institution scans a document and uploads it to a Web site, the resulting pages are not searchable -- they are pictures of text, not the text itself. This is the so-called "Dark Web" -- a sinister-sounding name that simply refers to how difficult this material is to search.

In late October, Google started addressing this. Using optical character recognition, the search engine now converts scanned images to text and includes that text in its results. The process is not a straightforward one: should "O" be read as the letter O or the numeral zero? Is the text in English or another language? Still, as the search engine crawls the Web at regular intervals, it is steadily deciphering this vast storehouse of information.
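To make the letter-or-number question concrete, here is a minimal Python sketch of one common-sense heuristic: decide each ambiguous character by looking at its neighbors in the same word. This is an illustration only, not Google's method, which the article does not describe; the function names `resolve_o_zero` and `clean_line` are hypothetical.

```python
# Illustrative heuristic only; not how any particular OCR engine works.
# Characters that scanned text often confuses with one another.
AMBIGUOUS = {"O", "o", "0"}


def resolve_o_zero(token):
    """Rewrite ambiguous O/0 glyphs in one token based on its other characters."""
    others = [ch for ch in token if ch not in AMBIGUOUS]
    digits = sum(ch.isdigit() for ch in others)
    letters = sum(ch.isalpha() for ch in others)

    if digits > letters:
        # Mostly digits around it: treat every ambiguous glyph as a zero.
        return "".join("0" if ch in AMBIGUOUS else ch for ch in token)

    if letters > digits:
        # Mostly letters around it: treat a zero as the letter, matching case.
        all_upper = all(ch.isupper() for ch in others if ch.isalpha())
        letter = "O" if all_upper else "o"
        return "".join(letter if ch == "0" else ch for ch in token)

    # No clear majority: leave the token unchanged rather than guess.
    return token


def clean_line(line):
    """Apply the heuristic to every whitespace-separated token in a line."""
    return " ".join(resolve_o_zero(tok) for tok in line.split())


if __name__ == "__main__":
    print(clean_line("Oct0ber 2OO8"))
```

A real system would lean on far more context, such as dictionaries and a guess at the document's language, but the sketch shows why a single character cannot be read in isolation.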

In April, Google started including HTML form text in search results, and it has been including PDF text as well. It's all part of an ongoing effort to make sure all data on the Internet is searchable, not just the most common kinds of text.