Let us start, then, by supposing the simplest sort of image: a binary array of pixels, say 256 by 256, in which each pixel is either black (0) or white (1). Hardly any modern equipment actually produces such images, unless you count photocopiers, and I have doubts about them. Video cameras and scanners usually produce larger arrays in many colours or grey levels. Monochrome (grey scale) cameras are getting hard to find, and scanners will soon follow the same progression. So the images I shall start with are rather simple. The image in Fig.1.14, reproduced here for your convenience, is an example of the kind of thing we might hope to meet.
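To make the object concrete, a binary image can be held as a list of rows of 0s and 1s. The sketch below is illustrative only. It assumes a plain list of lists rather than any particular image library, and the helper names `blank_image` and `count_black` are hypothetical.

```python
# A binary image as a list of rows; 0 = black, 1 = white (the convention above).
WIDTH = HEIGHT = 256

def blank_image(width=WIDTH, height=HEIGHT):
    """Return an all-white binary image (every pixel set to 1)."""
    return [[1] * width for _ in range(height)]

def count_black(image):
    """Count the black (0) pixels in a binary image."""
    return sum(row.count(0) for row in image)

# Draw a one-pixel-wide black vertical stroke in column 100, rows 50..149.
img = blank_image()
for r in range(50, 150):
    img[r][100] = 0

print(len(img), len(img[0]), count_black(img))  # 256 256 100
```

Building each row with a separate `[1] * width` matters here: it gives every row its own list, so setting one pixel does not silently change the same column in every other row.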
Getting an image to this simple, binary state may take a lot of work. In some forms of image recognition, particularly in Optical Character Recognition (OCR), the first operation performed upon the image is called thresholding. It takes a monochrome image and rewrites all the light grey (above threshold) pixels as white, and all the dark grey (below threshold) pixels as black. How to make a judicious choice of threshold is dealt with in image processing courses, and will not be considered here. Although the binary image is a little unphysical, because real life comes with grey levels, it is not without practical importance.
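The thresholding step described above can be sketched as follows. This is a minimal illustration. It assumes grey levels run from 0 (black) to 255 (white) and that a fixed threshold has been chosen by hand, since choosing it well is left to image processing courses. The function name `threshold` is hypothetical, and treating a pixel exactly equal to the threshold as black is an arbitrary choice.

```python
def threshold(grey, t):
    """Map a grey-level image (list of rows of ints) to a binary image.

    Pixels strictly above t become white (1); all others become black (0).
    Assumes 0 is black and larger values are lighter.
    """
    return [[1 if p > t else 0 for p in row] for row in grey]

grey = [
    [250, 240,  30],
    [200,  20,  10],
    [128, 129, 255],
]
print(threshold(grey, 128))
# [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
```

Raising `t` turns more pixels black. On a faint scan, a threshold set too high tends to merge neighbouring characters, and one set too low tends to break strokes apart.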
The result of digitising and thresholding a piece of handwriting is not unlike Fig.2.1, where the word /the/ was drawn by hand with a mouse using the Unix bitmap utility. It is reasonable to want to read the characters /t/, /h/, /e/ by machine. For handwriting of the quality shown, this is not too difficult. It is, however, far more difficult than reading printed text, and so we shall treat the printed case first.
To make the problem moderately realistic, we look at the output of a scanner, or at least a piece of such output, when applied to two pages of text taken from a book (Fig.2.2). The quality of the image is not, we regret to say, up to the quality of the book. Such images are, however, not uncommon, and it is worth noting a number of their features.
The problem of reading such text is therefore far from trivial, even if we stipulate somewhat higher quality reproduction than is present in Fig.2.2. In any serious reading of a document such as a newspaper, large and uncomfortable problems arise: working out which bits of text belong together, which bits of an image are text and which bits are the pin-up girl, and similar high-level segmentation issues. Any illusion that reading text automatically is easy should be abandoned now.