The first consideration is that of invariance: if there are known to be groups, or parts of groups, of transformations under which the trajectory classification is invariant, this information needs to be taken into account. If, for example, the trajectory classification is invariant under a monotone reparametrisation of the path, then we can reduce each trajectory to a canonical time and rate by reparametrising by path length. If there are operations on the space under which the classification remains invariant, then the transformations should be factored out. If, for example, adding any value to one co-ordinate does not change the classification, then simply project down onto the other co-ordinates and ignore the one that doesn't make a difference. It can only complicate the recognition. Of course, there ought to be some good grounds for believing that the invariance is as claimed; if it is not, you could be making big trouble for yourself.
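As an illustration of the first point, here is a minimal sketch, in Python with NumPy, of reducing a sampled trajectory to a canonical time and rate by reparametrising by path length; the function name and the resampling resolution are my own choices, not anything prescribed in the text.

```python
import numpy as np

def reparametrise_by_path_length(points, n_samples=100):
    """Resample a trajectory (array of shape (T, d)) at equal steps of
    arc length, removing dependence on the original timing and rate."""
    points = np.asarray(points, dtype=float)
    # cumulative arc length along the polygonal path
    steps = np.linalg.norm(np.diff(points, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(steps)])
    # target arc lengths, equally spaced from 0 to the total length
    s_new = np.linspace(0.0, s[-1], n_samples)
    # linearly interpolate each co-ordinate against arc length
    return np.column_stack([np.interp(s_new, s, points[:, j])
                            for j in range(points.shape[1])])

# Two tracings of the same half-circle at different speeds map to
# (nearly) the same canonical representation.
curve = lambda t: np.column_stack([np.cos(np.pi * t), np.sin(np.pi * t)])
a = reparametrise_by_path_length(curve(np.linspace(0, 1, 200) ** 2))
b = reparametrise_by_path_length(curve(np.linspace(0, 1, 200)))
print(np.max(np.abs(a - b)))   # small
```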
The main reason for wanting to factor out invariance of as many sorts as possible is, as has been mentioned above, to get more data into the recognition system. To make this clear: if you have a hundred printed digits in each category from 0 to 9, but they can be anywhere on the page and you do not register them in some way, you are obliged to learn the (shift) invariance as well, which wastes data. Similarly, if you had them in different sizes but otherwise identical, it would be sensible to normalise them. This way, you increase the effective size of your training set by reducing the size of your parameter space.
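The registration meant here is easy to sketch. Assuming, purely for illustration, that each digit is represented as a set of black-pixel co-ordinates, one can factor out shift and size before any learning takes place:

```python
import numpy as np

def normalise_shift_and_scale(pixels):
    """pixels: array of shape (N, 2) of black-pixel co-ordinates for one digit.
    Returns the same points centred on the origin and scaled to unit
    root-mean-square radius, so position and size carry no information."""
    pixels = np.asarray(pixels, dtype=float)
    centred = pixels - pixels.mean(axis=0)               # factor out shifts
    scale = np.sqrt((centred ** 2).sum(axis=1).mean())   # RMS radius
    return centred / scale                               # factor out size
```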
In practice we may be obliged to learn the invariance because we don't know what it is in advance.
This may be done more or less efficiently. Setting out with the conscious aim of discovering what the invariants are is hard to beat as a prelude to finding them efficiently; relying on chance or the kindness of the gods is not recommended.
To focus ideas, suppose we have measured some properties of an aeroplane silhouette which describe the shape in some way. For the present we could suppose that it was done by computing moments of the set of pixels shown in Fig. 6.4.
Let us suppose we have the plane coded as a point in $\mathbb{R}^n$ for some reasonably large $n$. I shall assume that no invariance has been built into the measurement process. Now suppose we rotate the figure about its centroid by some small angle and measure the resulting object as a new point in $\mathbb{R}^n$. If the change is sufficiently small then we may expect that the resulting point will be close to the original point. Continue in this way with a sequence of very small rotations, and the measurement process will embed the Lie group of rotations, $SO(2)$, in $\mathbb{R}^n$. If the aeroplane had some symmetry under rotation, for example if it were a square cross, then the map of the circle group $S^1$ into $\mathbb{R}^n$ would not be an embedding: the result of rotating by $\pi/2$ and integer multiples thereof would take the object to the same point in $\mathbb{R}^n$. For objects not having rotational symmetry, the map embeds the one dimensional manifold of rotations, otherwise, as has been said already, known as the Lie group $SO(2)$ and looking uncommonly like the circle $S^1$, into $\mathbb{R}^n$.
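To see the embedding numerically, here is a small illustrative experiment along the lines just described; the particular moment features, the crude "aeroplane", and all the names are my own choices rather than anything taken from Fig. 6.4.

```python
import numpy as np

def central_moments(points,
                    orders=((2, 0), (0, 2), (1, 1), (3, 0), (0, 3), (2, 1))):
    """Measure a 2-D point set by a few central moments:
    one point set becomes one point in R^6."""
    p = points - points.mean(axis=0)
    return np.array([np.mean(p[:, 0] ** i * p[:, 1] ** j) for i, j in orders])

# A crude, rotationally asymmetric "aeroplane": a fuselage and an off-centre spar.
plane = np.vstack([
    np.column_stack([np.linspace(-1.0, 1.0, 50), np.zeros(50)]),
    np.column_stack([np.zeros(30), np.linspace(-0.3, 0.8, 30)]),
])

angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
measured = []
for a in angles:
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    measured.append(central_moments(plane @ R.T))
measured = np.array(measured)   # 200 points tracing a closed curve in R^6

# Small rotations move the measured point only a little,
# while a large rotation moves it much further.
print(np.linalg.norm(measured[1] - measured[0]))     # small
print(np.linalg.norm(measured[50] - measured[0]))    # larger
```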
Similarly, if we scaled the object by different amounts between, say, 60% and 140%, we would be embedding a copy of the interval $[0.6, 1.4]$ in $\mathbb{R}^n$. And if we generated some shifts, it would embed a neighbourhood of the origin in $\mathbb{R}^2$ in $\mathbb{R}^n$; and if we did the lot we should have a four dimensional manifold embedded in $\mathbb{R}^n$ by the measurement process. The embedding would not generally be flat, so although the dimension of the embedded set would be four, it could normally be expected to go outside any four dimensional affine subspace; after all, the dimension of the circle is one, but it would be quite normal to find an elastic band taking up a two dimensional piece of space, usually a bit of the carpet.
Now if we have a lot of different aeroplanes rotated, shifted and scaled, we might want to work out which aeroplane we were looking at, or we might want to say something about the shape of the manifold of transformations. How could we detect the circle embedded in the enclosing space $\mathbb{R}^n$ by the rotations? Remember, all we have is a finite sample of points. Well, the dimension is one, which means that locally we would find that the set of points would lie along a line. If we were to take the set of points obtained by rotating the aeroplane, choose one of them, take a small ball in the space and compute the covariance matrix for the set of points in the ball, then, provided there are enough points in the ball and the embedding is reasonably smooth, we would find that all the eigenvalues except for the largest were very close to zero. And this would be true for any such starting point. The number of non-negligible eigenvalues would tell us something about the dimension of the transformations in general.
This procedure is generally used for dimension reduction, though not usually locally, and is known, inter alia, as Principal Component Analysis. It is essential to do it locally if we want an honest estimate of the dimension of a non-flatly embedded manifold. If the dimension is not the same in different parts, it isn't a manifold. Many transformations may be expected to arise from manifolds (Lie groups, or neighbourhoods thereof), although not all.
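A minimal sketch of this local covariance test, in the same illustrative Python as before; the function, the tolerance and the radius are my own choices, and an honest estimate would of course check several starting points and several radii.

```python
import numpy as np

def local_dimension(X, index, radius, tol=0.01):
    """Estimate the local dimension of the point set X (shape (N, n)) near
    X[index]: count the covariance eigenvalues inside a small ball that are
    a non-negligible fraction of the largest one."""
    ball = X[np.linalg.norm(X - X[index], axis=1) < radius]
    if len(ball) < 2:
        return 0      # not enough points in the ball to say anything
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(ball, rowvar=False)))[::-1]
    return int(np.sum(eigvals > tol * eigvals[0]))

# Applied to the rotated-aeroplane measurements sketched above, this should
# report dimension 1 at (almost) every starting point, for any radius small
# enough that the curve looks straight inside the ball but large enough to
# contain a few dozen points, e.g.
# print(local_dimension(measured, index=0, radius=0.05))
```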
If vowel sounds are excised and embedded in a space of simulated filterbank coefficients, it is found that the set of trajectory fragments corresponding to a single vowel is pretty much a gaussian cluster in the space, the dimension of which is less than that of the enclosing space. The centres of the vowels lie in a lower dimensional region, almost a plane in fact, and the locations of the centres projected onto this plane look just like the diagrams of the vowel space which phoneticians draw. This suggests that the space of vowels is two dimensional, that it is nearly but not quite embedded flatly in the filterbank space, and that the variation between vocal tracts and patterns of enunciation occurs in a low dimensional complement of this space. Confirming this would be of some potential value in Speech Recognition, because one could normalise trajectories by shifting them and projecting them onto the appropriate manifold; the projection would not in general be a simple matter of finding the closest distance, however, which is all that has been tried so far.
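In the nearly flat case, the closest-distance normalisation just mentioned as the only thing tried so far is easy to sketch: fit a low dimensional affine subspace to the cluster centres by principal components and project new measurements onto it. The code below is purely illustrative, not the experiment described.

```python
import numpy as np

def fit_affine_subspace(centres, dim=2):
    """Fit a dim-dimensional affine subspace to the rows of `centres`
    (e.g. estimated vowel centres) by principal components."""
    origin = centres.mean(axis=0)
    # rows of Vt are the principal directions, ordered by variance explained
    _, _, Vt = np.linalg.svd(centres - origin, full_matrices=False)
    return origin, Vt[:dim]            # origin plus an orthonormal basis

def project_onto_subspace(x, origin, basis):
    """Closest-distance (orthogonal) projection of a measurement x
    onto the fitted affine subspace."""
    return origin + (x - origin) @ basis.T @ basis
```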
This method allows us to estimate the dimension of the transformations under which the classification is invariant, provided we have enough data. Indeed, if a new vocal tract comes along which is a transformed version of one we have heard before, and if we hear enough identifiable phonemic elements, we might learn the transformation and then apply it. For example, by learning part of a scaling transformation on aeroplane images, it has been found possible to extrapolate from a learnt range of 60% to 140% of the size of an image down to about 20%, at which level resolution becomes an issue. The sudden transition from aeroplane images with different amounts of scaling to squeaky voices with different amounts of squeak will not alarm the reflective reader: we are talking about the same thing, abstractly. I shall discuss this at greater length when we come to syntactic pattern recognition, of which dynamic pattern recognition is a special case.

In the case where the embedding of the transformation group is flat, we may not only compute its dimension, we can also factor it out of the measurement space altogether by taking a linear transformation of the variables we use to measure the state of the system and then neglecting the last $r$ co-ordinates, where $r$ is the dimension of the transformation group. If the transformation group is embedded in a non-flat way, which is regrettably the normal situation, we may seek a non-linear transformation of the space which will accomplish the same thing. This requires that we understand a little geometry. There are some exercises at the end of the chapter which will enable you to see how to proceed in special cases.
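For the flat case, the factoring-out step can be sketched as follows. This is my own illustrative code: it projects onto the orthogonal complement of the estimated group directions rather than literally rotating co-ordinates and dropping the last $r$, though the two come to the same thing up to a change of basis.

```python
import numpy as np

def transformation_directions(orbit, r):
    """orbit: measurements (shape (T, n)) of a single object under the
    transformation group. In the flat case the orbit spans an r-dimensional
    affine subspace; return an orthonormal basis of its direction space."""
    _, _, Vt = np.linalg.svd(orbit - orbit.mean(axis=0), full_matrices=False)
    return Vt[:r]                                  # shape (r, n)

def factor_out(X, directions):
    """Remove from the measurements X (shape (N, n)) their components along
    the transformation directions, leaving co-ordinates that are invariant
    under the (flatly embedded) transformation group."""
    return X - (X @ directions.T) @ directions
```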
The assumption that the transformations will derive from the action of a Lie group, or some neighbourhood of the identity in a Lie group, is overly optimistic. It is reasonable to suppose, however, that this is largely the case, and that many transformations can be approximated adequately in this way.
As well as factoring out the transformations under which the trajectory classification is invariant, or at least giving it a shot, there is another desideratum: stability under noise. The measurements of the state of the system at different times will generally be, in this imperfect world, contaminated with noise. This will have an effect on the job of estimating the dimension and structure of any manifold of transformations. It would be as well, therefore, to smooth the data first. Unfortunately, the smoothing process might easily eliminate the significant parts of the signal or trajectory. For example, in handwriting, cusps and regions of high curvature are usually highly informative, often being separators between different regimes. Having a curve blunted just when it is starting to point somewhere interesting would not be good for recognition. The assumptions behind most smoothing operations, that we have a stationary or at least slowly changing process which is afflicted by gaussian noise, may be grotesquely inappropriate. It may be necessary to divide the trajectory into segments, each of which is relatively homogeneous, with boundaries where the regimes are plainly distinct. If one does this with cursive writing, one gets what might reasonably be called strokes, and one can also find such things in the speech signal. I shall discuss this at some length in the chapter on Syntactic Pattern Recognition.
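One crude way of respecting the point about cusps, offered as a sketch rather than as the method of the text: estimate the turning angle along the trajectory and cut it into strokes at the sharp corners, so that any smoothing is applied within segments and never across them. The threshold is a free parameter to be tuned to the data.

```python
import numpy as np

def split_at_high_curvature(points, threshold):
    """Cut a 2-D trajectory (shape (T, 2)) into segments wherever the discrete
    turning angle exceeds `threshold` radians, so that cusps and sharp corners
    become segment boundaries instead of being smoothed away."""
    v = np.diff(points, axis=0)                          # chord vectors
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    # turning angle between successive chords
    turn = np.arccos(np.clip(np.sum(v[:-1] * v[1:], axis=1), -1.0, 1.0))
    cuts = np.where(turn > threshold)[0] + 1             # indices of sharp corners
    return np.split(points, cuts)

# A V-shaped trace is cut into its two strokes at the cusp.
x = np.linspace(0.0, 2.0, 101)
trace = np.column_stack([x, np.abs(x - 1.0)])
print(len(split_at_high_curvature(trace, threshold=np.pi / 4)))   # 2
```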