Next: Telling the guys from Up: Measurement and Representation Previous: Measurement and Representation

From objects to points in space

If you point a video camera at the world, you get back an array of pixels, each with a particular gray level or colour. You might get a square array of 512 by 512 such pixels, and each pixel value would, on a gray scale, perhaps be represented by a number between 0 (black) and 255 (white). If the image is in colour, there will be three such numbers for each of the pixels, say the intensity of red, blue and green at the pixel location. The numbers may change from system to system and from country to country, but you can expect to find, in each case, that the image may be described by an array of `real' numbers, or in mathematical terminology, a vector in $\mathbb{R}^n$ for some positive integer n. The number n, the length of the vector, can therefore be of the order of a million. To describe the image of the screen on which I am writing this text, which has 1024 by 1280 pixels and a lot of possible colours, I would need $1024 \times 1280 \times 3$ = 3,932,160 numbers. This is rather more than the ordinary television screen, but about what High Definition Television will require.
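To make the coding concrete, here is a minimal sketch in Python of how a colour image becomes one long vector. The nested-list image layout and the names `flatten_image` and `vector_length` are assumptions made for illustration; real systems store pixels in many different ways.

```python
def flatten_image(image):
    """Return the pixel values of `image` as a single flat list.

    `image` is a list of rows, each row a list of pixels, and each
    pixel a tuple of channel intensities (three numbers for colour).
    """
    return [value for row in image for pixel in row for value in pixel]


def vector_length(rows, columns, channels):
    """Number of coordinates needed for an image of the given shape."""
    return rows * columns * channels


# A tiny 2 by 3 colour image with values between 0 and 255 (made-up data).
tiny = [
    [(0, 0, 0), (255, 255, 255), (255, 0, 0)],
    [(0, 255, 0), (0, 0, 255), (128, 128, 128)],
]
tiny_vector = flatten_image(tiny)   # a point in R^18

# The monitor described in the text: 1024 by 1280 pixels, three numbers each.
monitor_n = vector_length(1024, 1280, 3)
```

Here `tiny_vector` has 2 x 3 x 3 = 18 entries, and `monitor_n` reproduces the count of 3,932,160 given above.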

An image on my monitor can, therefore, be coded as a vector in $\mathbb{R}^{3,932,160}$. A sequence of images, such as would occur in a sixty second commercial sequenced at 25 frames a second, is a trajectory in this space. I don't say this is the best way to think of things; in fact it is a truly awful way (for reasons we shall come to), but it's one way.
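As a rough illustration of a sequence of images as a trajectory, the sketch below builds a list of points, one per frame. The function `make_frame` and its 4 by 4 gray frames are invented stand-ins for real video; only the frame rate and duration come from the text.

```python
FRAMES_PER_SECOND = 25
DURATION_SECONDS = 60


def make_frame(t, size=4):
    """Hypothetical toy frame number t: a size-by-size gray image,
    already flattened, whose brightness depends on t."""
    level = t % 256
    return [level] * (size * size)


# The trajectory is simply the ordered list of points, one per frame.
trajectory = [make_frame(t) for t in range(FRAMES_PER_SECOND * DURATION_SECONDS)]
```

The commercial is thus a list of 60 x 25 = 1500 points. In the toy version each point lives in a space of dimension 16 rather than 3,932,160.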

If you want to see that image recognition usually involves rather more subtle methods of making measurements, then a brief diversion is offered as a guided tour to the state of the art.

More generally, when a scientist or engineer wants to say something about a physical system, he is less inclined to launch into a haiku or sonnet than he is to clap a set of measuring instruments on it, whether it be an electrical circuit, a steam boiler, or the solar system.

This set of instruments will usually produce a collection of numbers. In other words, the physical system gets coded as a vector in $\mathbb{R}^n$ for some positive integer n. The nature of the coding is clearly important, but once it has been set up, it doesn't change. By contrast, the measurements often do; we refer to this as the system changing in time. In real life, real numbers do not actually occur: decimal strings come in some limited length, numbers are specified to some precision. Since this precision can change, it is inconvenient to bother about what it is in some particular case, and we talk rather sloppily of vectors of real numbers.
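The remark about limited precision can be illustrated with a small sketch. It assumes an instrument that reports one of $2^b$ evenly spaced levels over a fixed range; the function name `quantise` and the sample reading are hypothetical.

```python
def quantise(value, low, high, bits):
    """Map a reading in [low, high] to the nearest of 2**bits evenly
    spaced levels, as an idealised instrument might report it."""
    levels = 2 ** bits - 1
    clipped = min(max(value, low), high)          # out-of-range readings saturate
    step = round((clipped - low) / (high - low) * levels)
    return low + step * (high - low) / levels


reading = 0.7231859                               # made-up "true" value
coarse = quantise(reading, 0.0, 1.0, bits=8)      # what an 8-bit device reports
fine = quantise(reading, 0.0, 1.0, bits=16)       # a more precise device
```

The two devices report slightly different numbers for the same state of the world, which is why it is convenient to speak of real numbers and leave the precision unspecified.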

I have known people who have claimed that $\mathbb{R}^n$ is quite useful when n is 1, 2 or 3, but that larger values were invented by Mathematicians only for the purpose of terrorising honest engineers and physicists, and can safely be ignored. Follow this advice at your peril.

It is worth pointing out, perhaps, that the representation of the states of a physical system as points in $\mathbb{R}^n$ has been one of the great success stories of the world. Natural language has been found to be inadequate for talking about complicated things. Without going into a philosophical discursion about why this particular language works so well, two points may be worth considering. The first is that it separates two aspects of making sense of the world: it separates out the `world' from the properties of the measuring apparatus, making it easier to think about these things separately. The second is that it allows the power of geometric thinking, incorporating metric or more generally topological ideas, something which is much harder inside the discrete languages. The claim that `God is a Geometer', based upon the success of geometry in Physics, may be no more than the assertion that geometrical languages are better at talking about the world than non-geometrical ones. The general failure of Artificial Intelligence paradigms to crack the hard problems of how human beings process information may be in part due to the limitations of the language employed (often LISP!).

In the case of a microphone monitoring sound levels, there are many ways of coding the signal. It can be simply a matter of a voltage changing in time, that is, n = 1. Or we can take a Fourier Transform and obtain a simulated filter bank, or we can put the signal through a set of hardware filters. In these cases n may be, typically, anywhere between 12 and 256.
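A simulated filter bank can be sketched as follows, assuming a plain discrete Fourier transform of one short frame of samples and equal-width frequency bands. Both choices, and the names `dft_magnitudes` and `filter_bank`, are assumptions of this sketch rather than a description of any particular system.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the discrete Fourier transform of `frame`
    for the non-negative frequencies 0 .. len(frame) // 2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)
        )
        mags.append(abs(total))
    return mags


def filter_bank(frame, bands):
    """Sum the spectral power into `bands` equal-width frequency bands."""
    power = [m * m for m in dft_magnitudes(frame)][1:]   # drop the constant term
    width = len(power) // bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(bands)]


# One frame of 96 samples of a pure tone completing 10 cycles in the frame.
frame = [math.sin(2 * math.pi * 10 * i / 96) for i in range(96)]
features = filter_bank(frame, bands=12)   # a point in R^12
```

Each frame of sound is thereby coded as a vector with n = 12 components. For this pure tone nearly all of the power lands in the third band, which covers frequencies 9 to 12.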

The system may change in continuous or discrete time, although since we are going to get the vectors into a computer at some point, we may take it that the continuously changing vector `signal' is discretely sampled at some appropriate rate. What appropriate means depends on the system. Sometimes it means once a microsecond, other times it means once a month.
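The following sketch shows discrete sampling of a continuously changing measurement, and what can go wrong when the rate is not appropriate. The `voltage` signal and the two rates are invented for the example.

```python
import math


def sample(signal, rate, duration):
    """Evaluate `signal` (a function of time) `rate` times per unit time
    over `duration` time units, starting at time 0."""
    count = int(round(rate * duration))
    return [signal(i / rate) for i in range(count)]


def voltage(t):
    """Hypothetical measurement: a sine wave making 2 cycles per second."""
    return math.sin(2 * math.pi * 2 * t)


good = sample(voltage, rate=80, duration=1.0)      # 80 samples per second
too_slow = sample(voltage, rate=2, duration=1.0)   # 2 samples per second
```

At 80 samples a second the oscillation is plainly visible in `good`. At 2 samples a second every sample falls on a zero crossing, so `too_slow` suggests, wrongly, that nothing is happening.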

We describe such dynamical systems in two ways. Frequently we need to describe the law of time development, which is done by writing down a formula for a vector field, or as it used to be called, a system of ordinary differential equations. Sometimes we have to specify only some particular history of change: this is done formally by specifying a map from $\mathbb{R}$ representing time to the space $\mathbb{R}^n$ of possible states. We can simply list the vectors corresponding to different times, or we may be able to find a formula for calculating the vector output by the map when some time value is used as input to the map. It is both entertaining and instructive to consider the map:

\begin{displaymath}
f: \mathbb{R} \longrightarrow \mathbb{R}^2
\end{displaymath}

\begin{displaymath}
t \leadsto \left(\begin{array}{c} \cos(t) \\ \sin(t) \end{array}\right)
\end{displaymath}

If we imagine that at each time t between 0 and $2\pi$ a little bug is to be found at the location in $\mathbb{R}^2$ given by f(t), then it is easy to see that the bug wanders around the unit circle at uniform speed, finishing up back where it started, at the location $\left(\begin{array}{c} 1 \\ 0 \end{array}\right)$, after $2\pi$ time units. The terminology which we use to describe a bug moving in the two dimensional space $\mathbb{R}^2$ is the same as that used to describe a system changing its state in the n-dimensional space $\mathbb{R}^n$. In particular, whether n is 2, 3 or a few million, we shall refer to a vector in $\mathbb{R}^n$ as a point in the space, and we shall make extensive use of the standard mathematician's trick of thinking of pictures in low dimensions while writing out the results of his thoughts in a form where the dimension is not even mentioned. This allows us to discuss an infinite number of problems at the same time, a very smart trick indeed. For those unused to it this is breathtaking, and the hubris involved makes beginners nervous, but one gets used to it.
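The claims about the bug are easy to check numerically. The sketch below lists the positions f(t) at 1001 equally spaced times between 0 and $2\pi$; the step count and the helper name `distance` are arbitrary choices.

```python
import math


def f(t):
    """The map f: time t goes to the point (cos t, sin t)."""
    return (math.cos(t), math.sin(t))


def distance(p, q):
    """Euclidean distance between two points of the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


steps = 1000
dt = 2 * math.pi / steps
path = [f(i * dt) for i in range(steps + 1)]

# Every position is at distance 1 from the origin: the unit circle.
radii = [math.hypot(x, y) for x, y in path]

# Equal time steps cover equal distances: uniform speed.
hops = [distance(path[i], path[i + 1]) for i in range(steps)]

# The total distance is close to the circumference 2*pi, and the bug
# finishes at the point (1, 0) where it started.
total = sum(hops)
gap = distance(path[0], path[-1])
```

Nothing in the code depends on the dimension being 2, apart from the formula for f itself. A trajectory in a space of a few million dimensions would be listed and measured in just the same way.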


 
Figure 1.1: A bug marching around the unit circle according to the map f.
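The same motion can also be described by a law of time development rather than by a formula for the history. One vector field that does this, stated here as an illustration and easily checked by differentiating, assigns the velocity (-y, x) to the point (x, y). The sketch below follows that field from the point (1, 0) using the classical fourth-order Runge-Kutta method, which is just one convenient numerical integrator, and compares the result with f.

```python
import math


def vector_field(state):
    """Law of time development: the velocity at the point (x, y) is (-y, x)."""
    x, y = state
    return (-y, x)


def rk4_step(state, dt):
    """Advance `state` by one classical fourth-order Runge-Kutta step."""
    def shifted(s, k, h):
        return (s[0] + h * k[0], s[1] + h * k[1])

    k1 = vector_field(state)
    k2 = vector_field(shifted(state, k1, dt / 2))
    k3 = vector_field(shifted(state, k2, dt / 2))
    k4 = vector_field(shifted(state, k3, dt))
    return (
        state[0] + dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
        state[1] + dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]),
    )


steps = 1000
dt = 2 * math.pi / steps
state = (1.0, 0.0)
worst_error = 0.0
for i in range(1, steps + 1):
    state = rk4_step(state, dt)
    exact = (math.cos(i * dt), math.sin(i * dt))
    worst_error = max(
        worst_error, math.hypot(state[0] - exact[0], state[1] - exact[1])
    )
```

The integrated path stays very close to (cos t, sin t) all the way round, so the two descriptions agree to within numerical error.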

This way of thinking is particularly useful when time is changing the state of the system we are trying to recognise, as would happen if one were trying to tell the difference between a bird and a butterfly by their motion in a video sequence, or more significantly if one is trying to distinguish between two spoken words. The two problems, telling birds from butterflies and telling a spoken `yes' from a `no', are very similar, but the representation space for the words has much higher dimension than that for the birds and butterflies. `Yes' and `no' are trajectories in a space of dimension, in our case, 12 or 16, whereas the bird and butterfly move in a three dimensional space and their motion is projected down to a two dimensional space by a video camera. We shall return to this when we come to discuss Automatic Speech Recognition.
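The projection from three dimensions to two can be sketched with an idealised pinhole camera looking along the z axis. The pinhole model, the `focal_length` parameter and the flight path are all assumptions made for illustration; a real camera is messier.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a 3-D point (x, y, z) with z > 0
    onto a 2-D image plane at distance `focal_length`."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


# Hypothetical flight path: 3-D positions at three successive frames.
flight = [(0.0, 1.0, 2.0), (0.5, 1.5, 2.5), (1.0, 1.0, 4.0)]

# What the camera records: a trajectory in a two dimensional space.
track = [project(p) for p in flight]
```

The recognition problem then has to be solved from `track` alone, since the depth z has been thrown away.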

Let us restrict attention for the time being, however, to the static case of a system where we are not much concerned with the time changing behaviour. Suppose we have some images of characters, say the letters A and B.

Then each of these, as pixel arrays, is a vector of dimension up to a million. If we wish to be able to say of a new image whether it is an A or a B, then our new image will also be a point in some rather high dimensional space. We have to decide which group it belongs with, the collection of points representing an A or the collection representing a B. There are better ways of representing such images as we shall see, but they will still involve points in vector spaces of dimension higher than 3.
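To fix ideas, here is one very crude way of deciding which group a new point belongs with: assign it to the group whose mean is nearest. This rule is offered only as an illustrative assumption and is not claimed to be a good method. The 3 by 3 pixel patterns are invented stand-ins for real character images.

```python
def mean_vector(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def squared_distance(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def classify(point, groups):
    """Return the label of the group whose mean is closest to `point`."""
    means = {label: mean_vector(pts) for label, pts in groups.items()}
    return min(means, key=lambda label: squared_distance(point, means[label]))


# Made-up 3 by 3 binary images, flattened row by row into vectors in R^9.
groups = {
    "A": [[0, 1, 0, 1, 1, 1, 1, 0, 1], [0, 1, 0, 1, 0, 1, 1, 0, 1]],
    "B": [[1, 1, 0, 1, 1, 0, 1, 1, 1], [1, 1, 0, 1, 1, 1, 1, 1, 0]],
}

new_image = [0, 1, 0, 1, 1, 1, 1, 0, 0]   # an A with one pixel changed
label = classify(new_image, groups)
```

The same three functions would work unchanged on vectors of dimension a million, which is the point of treating images as points in a space.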

So as to put our thoughts in order, we replace the problem of telling an image of an A from one of a B with a problem where it is much easier to visualise what is going on because the dimension is much lower. We consider the problem of telling men from women.


Mike Alder
9/19/1997