If you point a video camera at the world, you
get back an array of pixels,
each with a particular gray level or colour.
You might get a square array of
512 by 512 such pixels, and each pixel value would,
on a gray scale, perhaps, be
represented by a number between 0 (black) and
255 (white). If the image is in
colour, there will be three such numbers for each
of the pixels, say the intensity
of red, blue and green at the pixel location.
The numbers may change from system
to system and from country to country, but you
can expect to find, in each case,
that the image may be described by an array of
`real' numbers,
or in mathematical terminology, a vector in
$\mathbb{R}^n$ for some positive
integer n. The number n, the length of the
vector, can therefore be of the
order of a million.
To describe the image of the screen on which I
am writing this text, which has 1024 by 1280
pixels and a lot of possible colours, I would
need 3 × 1024 × 1280 = 3,932,160 numbers. This is rather more
than the ordinary television screen, but about
what High Definition Television will require.
An image on my monitor can, therefore, be coded
as a vector in
$\mathbb{R}^{3932160}$.
A sequence of images such
as would occur in a sixty second commercial sequenced
at 25 frames a second, is a
trajectory in this space. I don't say this is
the best way to think of things; in fact it is a truly awful way (for
reasons we shall come to), but it's one way.
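The arithmetic above, and the idea of an image as one long vector, can be checked with a small sketch. The helper names `vector_length` and `image_to_vector` and the tiny 2 by 2 test image are my own illustrative assumptions; they do not come from any particular imaging library.

```python
def vector_length(rows, cols, channels=3):
    """Number of real numbers needed to code a rows x cols image."""
    return rows * cols * channels


def image_to_vector(image):
    """Flatten a nested list image[row][col] = (r, g, b) into one long list."""
    return [value for row in image for pixel in row for value in pixel]


# The monitor described in the text: 1024 by 1280 pixels, three numbers each.
n = vector_length(1024, 1280)

# A tiny hypothetical 2 x 2 colour image, just to show the flattening.
tiny = [
    [(0, 0, 0), (255, 255, 255)],
    [(255, 0, 0), (0, 0, 255)],
]
v = image_to_vector(tiny)

# A sixty second commercial at 25 frames a second is a trajectory
# consisting of this many points in the space.
frames = 60 * 25
```

Here `n` is the dimension of the space for the monitor, `v` is the tiny image coded as a point in a space of dimension 12, and `frames` is the number of points on the trajectory traced out by the commercial.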
If you want to see that image recognition usually involves rather more subtle methods of making measurements, then a brief diversion is offered as a guided tour to the state of the art.
More generally, when a scientist or engineer wants to say something about a physical system, he is less inclined to launch into a haiku or sonnet than he is to clap a set of measuring instruments on it, whether it be an electrical circuit, a steam boiler, or the solar system.
This set of instruments will usually produce a
collection of numbers. In other words, the
physical system gets coded as a vector in
$\mathbb{R}^n$ for some positive integer n. The nature
of the coding is clearly important, but once it
has been set up, it doesn't change. By
contrast, the measurements often do; we refer
to this as the system changing in time. In
real life, real numbers do not actually occur:
decimal strings come in some limited
length, numbers are specified to some precision.
Since this precision can change, it
is inconvenient to bother about what it is in
some particular case, and we talk
rather sloppily of vectors of real numbers.
I have known people who have claimed that
$\mathbb{R}^n$ is quite useful when n is 1, 2 or 3,
but that larger values were invented by Mathematicians
only for the purpose of terrorising
honest engineers and physicists, and can safely
be ignored. Follow this advice at
your peril.
It is worth pointing out, perhaps, that the representation
of the states of a physical
system as points in
$\mathbb{R}^n$ has been one of the
great success stories of the world.
Natural language has been found to be inadequate
for talking about complicated things.
Without going into a philosophical discursion
about why this particular language
works so well, two points may be worth considering.
The first is that it separates
two aspects of making sense of the world: it separates
out the `world' from the
properties of the measuring apparatus, making
it easier to think about these things
separately. The second is that it allows the power
of geometric thinking, incorporating
metric or more generally topological ideas, something
which is much harder inside the
discrete languages. The claim that `God is a Geometer',
based upon the success of
geometry in Physics, may be no more than the assertion
that geometrical languages
are better at talking about the world than non-geometrical
ones. The general failure
of Artificial Intelligence paradigms to crack
the hard problems of how human beings
process information may be in part due to the
limitations of the language employed
(often LISP!).
In the case of a microphone monitoring sound levels, there are many ways of coding the signal. It can be simply a matter of a voltage changing in time, that is, n = 1. Or we can take a Fourier Transform and obtain a simulated filter bank, or we can put the signal through a set of hardware filters. In these cases n may be, typically, anywhere between 12 and 256.
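A minimal sketch of the second coding, a Fourier Transform turned into a simulated filter bank, might look as follows. Everything specific here is an assumption made for illustration: the naive discrete Fourier transform, the equal-width frequency bands (practical filter banks are often spaced differently), the frame of 96 samples and the function names.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of the discrete Fourier transform for frequencies 0..N/2."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        mags.append(abs(total))
    return mags


def filter_bank(samples, bands=12):
    """Sum squared magnitudes into `bands` equal-width frequency bands."""
    mags = dft_magnitudes(samples)[1:]  # drop the constant (DC) term
    width = len(mags) // bands
    return [
        sum(m * m for m in mags[b * width:(b + 1) * width])
        for b in range(bands)
    ]


# One hypothetical frame of 96 samples: a pure tone at 5 cycles per frame.
frame = [math.sin(2 * math.pi * 5 * t / 96) for t in range(96)]
vector = filter_bank(frame, bands=12)  # a point in a space of dimension 12
```

Each frame of sound is thereby coded as one vector with n = 12, and a stretch of sound becomes a sequence of such vectors. For the pure tone above, almost all of the energy lands in the single band containing frequency 5.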
The system may change in continuous or discrete time, although since we are going to get the vectors into a computer at some point, we may take it that the continuously changing vector `signal' is discretely sampled at some appropriate rate. What appropriate means depends on the system. Sometimes it means once a microsecond, other times it means once a month.
We describe such dynamical systems in two ways:
frequently we need to describe the law of
time development, which is done by writing down
a formula for a vector field, or as it used
to be called, a system of ordinary differential
equations. Sometimes we have to
specify only some particular history of change:
this is done formally by specifying a map
from $\mathbb{R}$,
representing time, to the space
of possible states. We can simply list the
vectors corresponding to different times, or we
may be able to find a formula for
calculating the vector output by the map when
some time value is used as input to the
map. It is both entertaining and instructive to
consider the map:
$$f : [0, 2\pi] \rightarrow \mathbb{R}^2, \qquad f(t) = (\cos t, \sin t)$$

If we imagine that at each time t between 0
and $2\pi$
a little bug is to be found at the
location in $\mathbb{R}^2$
given by f(t), then it is
easy to see that the bug wanders around the
unit circle at uniform speed, finishing up back
where it started, at the location $(1, 0)$,
after $2\pi$
time units. The
terminology which we use to describe a bug moving
in the two dimensional space
$\mathbb{R}^2$ is the
same as that used to describe a system changing
its state in the n-dimensional space
$\mathbb{R}^n$. In particular, whether n is 2, 3 or
a few million, we shall refer to a vector in
$\mathbb{R}^n$ as a point in the space, and we
shall make extensive use of the standard
mathematician's trick of thinking of pictures
in low dimensions while writing out
the results of his thoughts in a form where the
dimension is not even mentioned.
This allows us to discuss an infinite number of
problems at the same time, a very
smart trick indeed. For those unused to it this
is breathtaking, and the hubris
involved makes beginners nervous, but one gets
used to it.
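The bug's journey can be checked numerically. The sketch below assumes the map is f(t) = (cos t, sin t) for t between 0 and 2π, which is the standard parametrisation matching the description of the bug; the function `sample_trajectory` and the choice of eight steps are illustrative only.

```python
import math


def f(t):
    """The map from time to the plane: f(t) = (cos t, sin t)."""
    return (math.cos(t), math.sin(t))


def sample_trajectory(steps):
    """Sample f at steps + 1 equally spaced times from 0 to 2*pi inclusive."""
    return [f(2 * math.pi * i / steps) for i in range(steps + 1)]


points = sample_trajectory(8)

# Every sampled point lies on the unit circle.
radii = [math.hypot(x, y) for x, y in points]

# Equal time steps give equal distances travelled: uniform speed.
step_lengths = [math.dist(p, q) for p, q in zip(points, points[1:])]
```

Listing the vectors at different times, as `points` does, is exactly the first of the two ways of specifying a history of change; the function `f` is the second.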
This way of thinking is particularly useful when time is changing the state of the system we are trying to recognise, as would happen if one were trying to tell the difference between a bird and a butterfly by their motion in a video sequence, or more significantly if one is trying to distinguish between two spoken words. The two problems, telling birds from butterflies and telling a spoken `yes' from a `no', are very similar, but the representation space for the words has a much higher dimension than that for the birds and butterflies. `Yes' and `no' are trajectories in a space of dimension, in our case, 12 or 16, whereas the bird and butterfly move in a three dimensional space and their motion is projected down to a two dimensional space by a video camera. We shall return to this when we come to discuss Automatic Speech Recognition.
Let us restrict attention for the time being, however, to the static case of a system where we are not much concerned with the time changing behaviour. Suppose we have some images of characters, say the letters A and B.
Then each of these, as pixel arrays, is a vector of dimension up to a million. If we wish to be able to say of a new image whether it is an A or a B, then our new image will also be a point in some rather high dimensional space. We have to decide which group it belongs with, the collection of points representing an A or the collection representing a B. There are better ways of representing such images as we shall see, but they will still involve points in vector spaces of dimension higher than 3.
So as to put our thoughts in order, we replace the problem of telling an image of an A from one of a B with a problem where it is much easier to visualise what is going on because the dimension is much lower. We consider the problem of telling men from women.