An Unnecessarily Complicated Method to Read Text from an Image: a shallow dip into the pool of Optical Character Recognition.
Let me preface this with the simple method: open your eyes and read. For most purposes, that's more than enough. But unfortunately, things aren't always that simple. For many purposes, we need computers to do this, and it's not as simple or as boring a task as you might think.
Generally, an OCR engine such as Tesseract, an open-source project maintained by Google, is used. But here we're looking at the method the engine uses, not how to call it from our code.
When faced with any problem you need to translate into a program, the first thing to ask is "How would I, a human, solve this on my own?", and then extrapolate that logic.
So, when you see the first letter of this sentence, how do you know it's the letter S?
If you said it's because you remember what the letter S looks like, then how did you know this sentence began with the letter I, when it's in a different font?
Or that this sentence began with an O?
Ideally, we look at a few criteria before we come to a conclusion.
- Shape
- Pattern type: is it a specific combination of straight lines and curves?
- Relative shape: for example, these dots are above and below this horizontal set of dots.
- Recognition
- We compare it with the set of characters we know
- Context
- Is it a letter or a number? If the clues above can't tell us, we look at the nearby characters and see what they let us infer. Unless it's something like a username, a number doesn't usually appear sandwiched between a few letters.
- Identification
- We still aren't always 100% sure at this point, but we'll have a strong guess, so we run with it.
As you can see, it's not a step-by-step process, but rather a series of factors you weigh before coming to a decision.
Well, that's what computer scientists have done. They've implemented our thought process, in one form or another, with the help of machine learning.
There are two broad methods of text recognition that we can look at.
The first involves comparing the image with a stored glyph on a pixel-by-pixel basis. You look at one pixel, and the ones above and below it, and see how they compare with the stored glyph. Based on that, you come up with a confidence level.
But this has its own issues. Suppose you're using an image with a higher resolution than the glyph you're comparing against. Suddenly the pixels no longer line up one-to-one, and the comparison falls apart.
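As a rough illustration of this pixel-by-pixel idea (often called matrix matching or template matching), here's a minimal sketch in Python. It assumes the unknown character and the stored glyphs have already been cut out, binarized, and resized to the same shape; the glyph images themselves are hypothetical.

```python
import numpy as np

def match_confidence(unknown, template):
    """Fraction of pixels that agree between two same-sized binary glyph images."""
    assert unknown.shape == template.shape, "glyphs must be the same resolution"
    return float(np.mean(unknown == template))

def recognise(unknown, templates):
    """templates: dict mapping a character to its stored binary glyph.
    Returns the best-matching character and its confidence level."""
    scores = {ch: match_confidence(unknown, glyph) for ch, glyph in templates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Note the assert: the moment the scan's resolution doesn't match the stored glyph's, this naive comparison simply can't be made.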
What if you try to look at patterns instead? A line a hundred pixels tall, connected at the top to a line fifty pixels wide, gives you lengths in a specific ratio. If we reduce or increase the resolution, the ratio of the line lengths won't change. That should always map to the letter T, right? But what if I change the font? Or you're reading handwriting? There's no way a handwritten letter will have an exact template to compare against. And YET OCR engines manage to read handwritten text with fairly impressive accuracy. How do they do that, then?
That brings us to the second method: you look at the properties of the lines, like closed loops, line bends, line intersections, line direction, and so on. No matter what the font, T has the distinctive feature of a horizontal line intersecting a vertical line roughly twice its length. Similar logic can be extended to other characters as well.
As you might guess, this has its own issues. Take a curved line: that could be a C or a U, depending on who's writing it and what font they're using. You can't say with certainty that it is THIS character just by looking at it with a program. The best you can do is assign a probability to what the character could be.
With that, you generate a list of closely matching characters that this one COULD be.
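Here's a minimal sketch of that feature-based idea, again assuming the character has already been binarized into a small NumPy array. The features here (ink density per quadrant, stroke crossings along the middle row and column) and the table of known characters are simplified stand-ins for what a real engine would compute.

```python
import numpy as np

def stroke_crossings(line):
    """Number of separate runs of ink along a 1-D slice of a binary glyph."""
    padded = np.concatenate(([0], line.astype(int), [0]))
    return int(np.sum(np.diff(padded) == 1))

def features(glyph):
    """Crude feature vector: ink density in each quadrant, plus stroke
    crossings along the middle row and the middle column."""
    h, w = glyph.shape
    quads = [glyph[:h//2, :w//2], glyph[:h//2, w//2:],
             glyph[h//2:, :w//2], glyph[h//2:, w//2:]]
    densities = [float(q.mean()) for q in quads]
    return np.array(densities + [stroke_crossings(glyph[h//2, :]),
                                 stroke_crossings(glyph[:, w//2])])

def candidates(glyph, known):
    """known: dict of character -> reference feature vector.
    Returns characters ranked by how close their features are to this glyph's."""
    f = features(glyph)
    return sorted(known, key=lambda ch: np.linalg.norm(f - known[ch]))
```

The first few entries of that ranked list are exactly the "characters this one COULD be."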
Then you look at the nearby characters and look at their probable matches, and try to make possible words with them (Simple permutations and combinations.) Then we look into a dictionary and see if any of those made-up words are real. Based on which words are real and in a dictionary, it updates its prediction. Once it updates predictions, it compares a few other factors, such as context within the paragraph, and then comes up with the final probability.
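A minimal sketch of that word-level step: each position has a short list of candidate characters, we enumerate the combinations, and we keep only the ones that show up in a dictionary. The tiny word list here is purely for illustration.

```python
from itertools import product

# One candidate list per character position, best guess first
# (say the engine can't decide between 'l'/'I' or 'o'/'0').
candidates = [["l", "I"], ["o", "0"], ["o", "0"], ["k", "h"]]

dictionary = {"look", "loon", "lock", "book"}   # stand-in for a real word list

# Simple permutations and combinations of the candidates...
possible = ("".join(chars) for chars in product(*candidates))
# ...kept only if the made-up word turns out to be real.
real_words = [w for w in possible if w.lower() in dictionary]

print(real_words)   # -> ['look']
```

A real engine would then weigh these surviving words against the per-character probabilities and the surrounding context before settling on one.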
In the end, what comes out is the most probable out of this list, so it often shows errors in what it reads. This is an issue that can be fixed with lots of training and testing.
The last step, where we look for context clues from the nearby characters, could be greatly improved if we had a good computational model of natural language, but unfortunately, that's another problem computer scientists are still working to solve.
There are many ways that this process can be made more accurate with just what we know so that it doesn't rely on the technology of the future.
Sometimes, maybe because of the scanning conditions, faded ink, or some other factor, the text isn't very clear to read. For this, a bit of preprocessing can be done on the image: increasing the contrast, converting it to grayscale, inverting it, or all of the above, whatever is needed to make the text pop out. This can help increase the accuracy of the text read from the image. Other steps, such as removing all the non-text regions, reduce the amount of image data the engine has to look through, making the process faster.
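A minimal sketch of that kind of preprocessing using the Pillow library; the file name is hypothetical, and which steps you actually need depends on the scan.

```python
from PIL import Image, ImageEnhance, ImageOps

img = Image.open("faded_scan.png")                    # hypothetical scanned page

gray = ImageOps.grayscale(img)                        # drop colour information
boosted = ImageEnhance.Contrast(gray).enhance(2.0)    # make faded ink stand out

# Binarize: everything darker than the threshold becomes pure black.
binary = boosted.point(lambda p: 0 if p < 128 else 255)

# Invert if the source is light text on a dark background,
# since most engines expect dark text on a light page.
# binary = ImageOps.invert(binary)

binary.save("cleaned_scan.png")                       # hand this to the OCR engine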
Even to this day, OCR-friendly fonts are used where a document is expected to be run through such a system. The distinctive electronic script we associate with computing was initially designed to be friendly to early OCR, from the days before good computers. Code such as Fortran was written in capitalized, monospaced fonts with thick, dark strokes, so it could be reliably reproduced in the event it needed to be run through OCR. Methods such as Magnetic Ink Character Recognition (MICR) were developed so the machine would have less data to process when picking out the characters.
Some software, such as Tesseract and Cuneiform, uses a two-pass process to read the text in an image. In the first pass, it does everything we discussed before; in the second, it uses the characters it identified confidently as template glyphs to take another shot at the characters it couldn't identify. This way, even low-resolution images can be read without much issue.
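Here's a rough sketch of that two-pass idea, reusing the pixel-agreement score from earlier. Everything in it (the threshold, the helper names, the idea of harvesting confident hits as page-specific templates) is an illustrative assumption, not how Tesseract actually implements it.

```python
import numpy as np

def match_score(a, b):
    """Fraction of pixels that agree between two same-sized binary glyphs."""
    return float(np.mean(a == b))

def two_pass_ocr(glyphs, templates, threshold=0.9):
    """glyphs: binary arrays cut from the page, all the same shape.
    templates: dict mapping a character to a stock glyph of that shape.
    Returns one (character, score) guess per glyph."""
    # Pass 1: match every glyph against the stock templates.
    first = []
    for g in glyphs:
        ch, score = max(((c, match_score(g, t)) for c, t in templates.items()),
                        key=lambda x: x[1])
        first.append((ch, score))

    # Keep the confident hits as templates in this page's own typeface.
    learned = {ch: g for g, (ch, score) in zip(glyphs, first) if score >= threshold}

    # Pass 2: retry the uncertain glyphs against the learned templates.
    final = []
    for g, (ch, score) in zip(glyphs, first):
        if score < threshold and learned:
            ch2, score2 = max(((c, match_score(g, t)) for c, t in learned.items()),
                              key=lambda x: x[1])
            if score2 > score:
                ch, score = ch2, score2
        final.append((ch, score))
    return final
```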
Many OCR engines, Tesseract among them, use neural networks to identify the characters, feeding features like the ones above into the network's layers and training it. Obviously, that requires a huge pool of training data, and this is where Captcha came in. Tesseract and reCAPTCHA, both being Google-run projects, were used together to solve two problems at once.
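As a minimal, hedged illustration of the neural-network approach (nothing like Tesseract's actual engine), here's a tiny classifier trained on scikit-learn's bundled 8x8 digit images as a stand-in for a real glyph dataset:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

digits = load_digits()                      # 1797 tiny images of the digits 0-9
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# One hidden layer of 64 units; production OCR uses far larger networks.
model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
model.fit(X_train, y_train)

print("accuracy on unseen digits:", model.score(X_test, y_test))
```

The point is simply that, given enough labelled examples, the network learns its own version of the shape/pattern criteria we listed earlier.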
The purpose of Captcha was to ensure that only humans could access certain websites, in an era when web bots were on the rise. While developing their OCR and scanning text for Google Books and other projects, Google ran into many, many words that their OCR engines couldn't predict well. The text was run through two different engines, and only if both returned the same string was it accepted. If the two engines disagreed, the word was marked as suspicious, so a human could read it and return their interpretation.
Captcha was a security feature, meant to guard against web bots and let only humans browse a website. An effective test that humans could answer but computers couldn't was required, and those suspicious words fit the bill. You might have wondered why many of those security checks showed you what looked like scans from textbooks or handwritten notes. Those were words marked as suspicious while trying to digitize a library into Google Books. Two words were presented: one whose text the engine had managed to recognize, and another that the engine had marked as suspicious. This way, you presented a problem that humans could solve but machines couldn't, and managed to prevent web bots from accessing the page. It was a win-win until the scammers found a way around it, so it was slowly phased out. But even today, in the latest version of reCaptcha, you can see them crowdsourcing datasets for training their image-processing algorithms.
This model of crowdsourcing information is not limited to Google's efforts. Among others, Amazon has a system for this called Amazon Mechanical Turk, used to crowdsource interpretations of these suspicious characters.
Well, it was interesting to look at how text, both handwritten and in images, is turned into machine-readable text. But what is the purpose of all this? Where do we see it? Why bother?
Firstly, to archive old books and publications so they remain accessible even after all their physical copies are gone. Books and comics such as Heroes of Olympus, DC's New 52, Marvel's Secret Empire, and the more recent Star Wars books and comics were written fairly recently and had to exist as a word-processor document or something similar before being published. That makes accessing the text in those books fairly easy. Don't recognize a word in the book? Long-press on it in the ebook and you immediately have a dictionary definition ready. Older works such as On the Origin of Species, H. G. Wells' The Invisible Man, the early Sherlock Holmes novels, Principia Mathematica, or even Amazing Fantasy #15, where Spider-Man first appeared, never went through a digital medium before being published. To preserve them digitally as ebooks, with all the features modern ebooks get, it's essential to make digital scans and isolate the text. The alternative to OCR-scanning these books would be hiring people to retype all the text on a keyboard, a far more expensive, laborious, and slower task.
Another use for this technology is to help the visually impaired. OCR, along with software that can read the result aloud, is critical in helping visually impaired people interact with today's technology, which is so heavily reliant on input and output via a display. A smartphone with an app that can connect to an OCR API, paired with a text-to-speech synthesizer, makes everyday printed text accessible as well.
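A minimal sketch of that "read it aloud" idea: OCR an image, then speak the result. It assumes the Tesseract binary plus the pytesseract, Pillow, and pyttsx3 packages are installed, and "sign.png" is a hypothetical photo of some printed text.

```python
from PIL import Image
import pytesseract
import pyttsx3

# Extract whatever text Tesseract can find in the photo.
text = pytesseract.image_to_string(Image.open("sign.png"))
print(text)

# Speak it with a local text-to-speech engine.
engine = pyttsx3.init()
engine.say(text)
engine.runAndWait()
```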
Even though OCR still struggles with Latin script, and even more with other scripts, another use case has sprung up to aid tourists and travelers despite the imperfect accuracy. Google Translate has a feature where you can take a picture of a piece of text, and it extracts and translates the text into the language of the user's choice. This is an extremely useful feature that can help tourists find their way if they're lost, or simply give them easier access to information when they have no other way to communicate with the locals.
While these are use cases that have sprung up with advancements in the technology, there's a lot more that can be done with what's already there, and everywhere you look, something or other is using this technology in innovative ways that make our lives easier. Cars going through toll booths are automatically logged by their registration number, reducing the work it would take to manually maintain a log of every vehicle passing through the booth each day. The same technology can add a contact to your phone just by looking at a business card. Scanning a passport at an international checkpoint cuts the time spent at customs considerably, since you don't have to wait for someone to write down or type in your passport details to keep a record.
Optical Character Recognition is a long way from perfect, but the benefits it can bring are most certainly worth the effort it will take. With technology bridging the world together, this is one field that can lead the charge in making a unified world possible, connecting us with cultures we've never seen before and presenting us and others with opportunities that would never be available without it.