
Fundamentals of dynamic patterns

Let us suppose that, as with speech, we have decided what $n$ things to measure, and have made a sequence of measurements for each of a number of events, as will certainly be the case if we are attempting to recognise words. We may suppose that these have been labelled appropriately, and that we next wish to take a new, unlabelled trajectory of the same kind in $\mathbb{R}^n$ and classify it. What basic issues arise? What considerations need to be investigated, whether the problem is in speech or in any other form of dynamic pattern recognition?

The first consideration is that of invariance: if there are known to be groups, or parts of groups, of transformations under which the trajectory classification is invariant, this information needs to be taken into account. If, for example, the trajectory classification is invariant under a monotone reparametrisation of the path, then we can reduce each trajectory to a canonical time and rate by reparametrising by path length. If there are operations on the space $\mathbb{R}^n$ under which the classification remains invariant, then these transformations should be factored out. If, for example, adding any value to one co-ordinate does not change the classification, then simply project down onto the other co-ordinates and ignore the one that doesn't make a difference. It can only complicate the recognition. Of course, there ought to be some good grounds for believing that the invariance is as claimed; if it is not, you could be making big trouble for yourself.
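To make the path-length idea concrete, here is a minimal sketch in Python using NumPy; the function name, the number of resampling points and the toy half-circle example are my own choices, not anything prescribed in the text. It resamples a measured trajectory at equal steps of arc length, so that two traversals of the same path at different rates map to essentially the same canonical trajectory.

import numpy as np

def reparametrise_by_path_length(points, n_samples=100):
    """Resample a trajectory in R^n at equal steps of arc length.

    points: array of shape (T, n), the measured trajectory.
    Returns an array of shape (n_samples, n) whose rows are equally
    spaced along the path, so any monotone reparametrisation of the
    original trajectory yields (approximately) the same result.
    """
    points = np.asarray(points, dtype=float)
    # Cumulative arc length along the polygonal path.
    steps = np.linalg.norm(np.diff(points, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(steps)])
    # Target arc lengths, equally spaced from 0 to the total length.
    targets = np.linspace(0.0, s[-1], n_samples)
    # Interpolate each coordinate against arc length.
    return np.column_stack(
        [np.interp(targets, s, points[:, k]) for k in range(points.shape[1])]
    )

# Two recordings of the "same" path at different speeds give
# (nearly) identical canonical forms.
curve = lambda t: np.column_stack([np.cos(np.pi * t), np.sin(np.pi * t)])
t_fast = np.linspace(0, 1, 50) ** 2      # accelerating traversal
t_slow = np.linspace(0, 1, 200)          # uniform traversal
a = reparametrise_by_path_length(curve(t_fast))
b = reparametrise_by_path_length(curve(t_slow))
print(np.max(np.abs(a - b)))             # small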

The main reason for wanting to factor out invariances of as many sorts as possible is, as has been mentioned above, to get more data into the recognition system. To make this clear: if you have a hundred printed digits in each category from 0 to 9, but they can be anywhere on the page and you do not register them in some way, you are obliged to learn the (shift) invariance, and this wastes data. Similarly, if you had them in different sizes but otherwise identical, it would be sensible to normalise them. This way, you increase the effective size of your training set by reducing the size of your parameter space.
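As an illustration of what such a registration might look like, here is a sketch; the function and the choice of the centroid and root-mean-square radius as the position and size measures are mine, not the text's.

import numpy as np

def normalise_shift_and_scale(pixels):
    """Register a set of 'on' pixel coordinates for position and size.

    pixels: array of shape (N, 2), the (row, col) coordinates of the
    black pixels of a digit image.
    Returns the coordinates translated so the centroid is at the origin
    and scaled so the root-mean-square radius is 1, removing shift and
    (isotropic) size variation before any learning is done.
    """
    pixels = np.asarray(pixels, dtype=float)
    centred = pixels - pixels.mean(axis=0)             # factor out position
    rms = np.sqrt((centred ** 2).sum(axis=1).mean())   # typical radius
    return centred / rms                               # factor out size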

In practice we may be obliged to learn the invariance because we don't know what it is in advance.

This may be done more or less efficiently. Setting out with the conscious aim of discovering what the invariants are is hard to beat as a prelude to finding them efficiently; relying on chance or the kindness of the gods is not recommended.

To focus ideas, suppose we have measured some properties of an aeroplane silhouette which describe the shape in some way. For the present we may suppose that this was done by computing moments of the set of pixels shown in Fig. 6.4.


  
[Figure 6.4: An aeroplane silhouette.]
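One way the coding by moments might go is sketched below; the particular choice of central moments up to order five, and the centring at the centroid, are my own illustration, since the text does not specify the measurement process. With max_order = 5 the feature vector has dimension 18; raising the order gives the $n \gtrsim 20$ used in what follows.

import numpy as np

def moment_features(pixels, max_order=5):
    """Code a silhouette (a set of 'on' pixel coordinates) as a point
    in R^n using central moments.

    pixels: array of shape (N, 2) of (x, y) coordinates.
    Returns the vector of central moments mu_pq for 2 <= p + q <= max_order.
    """
    pixels = np.asarray(pixels, dtype=float)
    x = pixels[:, 0] - pixels[:, 0].mean()
    y = pixels[:, 1] - pixels[:, 1].mean()
    feats = []
    for p in range(max_order + 1):
        for q in range(max_order + 1 - p):
            if p + q >= 2:                    # skip the trivial low moments
                feats.append(np.mean(x ** p * y ** q))
    return np.array(feats)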

Let us suppose we have the plane coded as a point in $\mathbb{R}^n$ for some reasonably large $n \gtrsim 20$. I shall assume that no invariance has been built into the measurement process. Now suppose we rotate the figure about its centroid by some small angle and measure the resulting object as a new point in $\mathbb{R}^n$. If the change is sufficiently small then we may expect that the resulting point will be close to the original point. Continue in this way with a sequence of very small rotations, and the measurement process will embed the Lie group of rotations, $SO(2,\mathbb{R}) \cong S^1$, in $\mathbb{R}^n$. If the aeroplane had some symmetry under rotation, for example if it were a square cross, then the map of the circle group $S^1$ into $\mathbb{R}^n$ would not be an embedding: rotating by $\pi/2$, or any integer multiple thereof, would take the object to the same point in $\mathbb{R}^n$. For objects without rotational symmetry, the map embeds the one dimensional manifold of rotations, otherwise known (as has been said already) as the Lie group $SO(2)$, and looking uncommonly like the circle $S^1 = \{z \in \mathbb{C} : |z| = 1\}$, into $\mathbb{R}^n$. Similarly, if we scaled the object by different amounts between, say, 60% and 140%, we would be embedding a copy of the interval [0.6, 1.4] in $\mathbb{R}^n$.
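A rough numerical illustration of this embedding follows, under my own assumptions: an L-shaped point set stands in for the silhouette, and the moment map from the earlier sketch is repeated so the fragment is self-contained. Sweeping through small rotations traces a closed one dimensional loop of measurement vectors in $\mathbb{R}^{18}$.

import numpy as np

def moment_features(pixels, max_order=5):
    """Central moments mu_pq, 2 <= p + q <= max_order, as a vector in R^n."""
    pixels = np.asarray(pixels, dtype=float)
    x = pixels[:, 0] - pixels[:, 0].mean()
    y = pixels[:, 1] - pixels[:, 1].mean()
    return np.array([np.mean(x ** p * y ** q)
                     for p in range(max_order + 1)
                     for q in range(max_order + 1 - p) if p + q >= 2])

# A stand-in 'silhouette' with no rotational symmetry: an L shape.
t = np.linspace(0, 1, 200)
shape = np.concatenate([np.column_stack([4.0 * t, np.zeros_like(t)]),   # long arm
                        np.column_stack([np.zeros_like(t), 1.5 * t])])  # short arm

# Sweep through small rotations about the centroid and measure each one.
angles = np.linspace(0.0, 2 * np.pi, 200, endpoint=False)
centroid = shape.mean(axis=0)
curve = []
for a in angles:
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    rotated = (shape - centroid) @ R.T + centroid
    curve.append(moment_features(rotated))
curve = np.array(curve)   # 200 points tracing a closed loop in R^18

# Nearby angles give nearby measurement vectors; widely separated
# angles do not: the rotations have been embedded as a circle.
print(np.linalg.norm(curve[0] - curve[1]))    # small
print(np.linalg.norm(curve[0] - curve[100]))  # much larger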

And if we generated some shifts it would embed a neighbourhood of the origin in $\mathbb{R}^2$ in $\mathbb{R}^n$, and if we did the lot we should have a four dimensional manifold embedded in $\mathbb{R}^n$ by the measurement process. The embedding would not generally be flat, so although the dimension of the embedded set would be four, it could normally be expected to go outside any four dimensional affine subspace; after all, the dimension of the circle is one, but it would be quite normal to find an elastic band taking up a two dimensional piece of space, usually a bit of the carpet.

Now if we have a lot of different aeroplanes rotated, shifted and scaled, we might want to work out which aeroplane we were looking at, or we might want to say something about the shape of the manifold of transformations. How could we detect the circle embedded in the enclosing space $\mathbb{R}^n$ by the rotations? Remember, all we have is a finite sample of points. Well, the dimension is one, which means that locally the set of points would lie along a line. If we were to take the set of points obtained by rotating the aeroplane, choose one of them, take a small ball in the space and compute the covariance matrix for the set of points in the ball, then, provided there are enough points in the ball and the embedding is reasonably smooth, we would find that all the eigenvalues except the largest were very close to zero. And this would be true for any such starting point. The number of non-negligible eigenvalues would tell us something about the dimension of the transformations in general.
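Here is a minimal sketch of this local covariance test; the radius, the tolerance for "negligible", and the toy data (a slightly noisy circle sitting in $\mathbb{R}^5$, standing in for the manifold of rotated measurements) are my own choices.

import numpy as np

def local_dimension(points, centre_index, radius, tol=1e-2):
    """Estimate the local dimension of a point cloud near one of its points.

    Take all points within `radius` of the chosen point, compute the
    covariance matrix of that local patch, and count the eigenvalues
    that are not negligible compared with the largest one.
    """
    points = np.asarray(points, dtype=float)
    centre = points[centre_index]
    ball = points[np.linalg.norm(points - centre, axis=1) < radius]
    cov = np.cov(ball, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending
    return int(np.sum(eigvals > tol * eigvals[0])), eigvals

# A noisy circle embedded in R^5: locally one dimensional.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 2000)
circle = np.column_stack([np.cos(theta), np.sin(theta),
                          np.zeros_like(theta), np.zeros_like(theta),
                          np.zeros_like(theta)])
# Rotate it into 'general position' and add a little measurement noise.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
data = circle @ Q + 0.001 * rng.normal(size=circle.shape)

dim, spectrum = local_dimension(data, centre_index=0, radius=0.2)
print(dim)   # expected: 1, one dominant eigenvalue in each small ball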

This procedure is generally used for dimension reduction, not usually locally, and is known, inter alia, as Principal Component Analysis. It is essential to do it locally if we want an honest estimate of the dimension of a non-flatly embedded manifold. If the dimension is not the same in different parts, it isn't a manifold. Many transformations may be expected to arise from manifolds (Lie groups, or neighbourhoods thereof), although not all.

If vowel sounds are excised and embedded in a space of simulated filterbank coefficients, it is found that the set of trajectory fragments corresponding to a single vowel is pretty much a gaussian cluster in the space, the dimension of which is less than that of the enclosing space. The centres of the vowels lie in a lower dimensional region, almost a plane in fact, and the locations of the centres projected onto this plane look just like the diagrams of the vowel space which phoneticians draw. This suggests that the space of vowels is two dimensional, that it is nearly but not quite embedded flatly in the filterbank space, and that the variation between vocal tracts and patterns of enunciation occurs in a low dimensional complement of this space. Confirming this would be of some potential value in Speech Recognition, because one could normalise trajectories by shifting them and projecting them onto the appropriate manifold; the projection would not in general be a simple matter of finding the point at the closest distance, however, which is all that has been tried so far.

This method allows us to estimate the dimension of the transformations under which the classification is invariant, provided we have enough data. Indeed, if a new vocal tract comes along which is a transformed version of one we have heard before, and if we hear enough identifiable phonemic elements, we might learn the transformation and then apply it. For example, by learning part of a scaling transformation on aeroplane images, it has been found possible to extrapolate from a learnt range of 60% to 140% of the size of an image down to about 20%, at which level resolution becomes an issue. The sudden transition from aeroplane images with different amounts of scaling to squeaky voices with different amounts of squeak will not alarm the reflective reader. We are talking about the same thing abstractly. I shall discuss this at greater length when we come to syntactic pattern recognition, of which dynamic pattern recognition is a special case.

In the case where the embedding of the transformation group is flat, we may not only compute its dimension, we can factor it out of the measurement space altogether by taking a linear transformation of the variables we use to measure the state of the system, and then neglecting the last r co-ordinates, where r is the dimension of the transformation group. If the transformation group is embedded in a non-flat way, which is regrettably the normal situation, we may seek a non-linear transformation of the space which will accomplish the same thing. This requires that we understand a little geometry. There are some exercises at the end of the chapter which will enable you to see how to proceed in special cases.
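For the flat case, the required linear change of variables can be read off from a principal component analysis of measurements of transformed versions of a single object. The sketch below (the function name and the toy example, adding arbitrary amounts to one coordinate, are my own) rotates the coordinate system so the transformation directions come first and then drops them.

import numpy as np

def factor_out_flat_group(samples_of_transforms, data, r):
    """Remove a flatly embedded r-dimensional transformation group.

    samples_of_transforms: points in R^n obtained by transforming one
        fixed object; if the embedding is flat they lie (approximately)
        in an r-dimensional affine subspace.
    data: points in R^n to be normalised.
    Returns the data in a rotated coordinate system with the r
    directions spanned by the transformations dropped.
    """
    S = np.asarray(samples_of_transforms, dtype=float)
    X = np.asarray(data, dtype=float)
    centred = S - S.mean(axis=0)
    # Principal directions of the transformation samples.
    _, _, Vt = np.linalg.svd(centred, full_matrices=True)
    complement = Vt[r:]                  # directions the group leaves alone
    # Express the data in the complement: the group has been factored out.
    return (X - S.mean(axis=0)) @ complement.T

# Example: in R^4, adding any amount to the first coordinate does not
# change the class.  Points differing only in that coordinate become
# identical once it is factored out.
rng = np.random.default_rng(2)
base = rng.normal(size=4)
shifted = np.array([base + [t, 0, 0, 0] for t in np.linspace(-3, 3, 50)])
reduced = factor_out_flat_group(shifted, shifted, r=1)
print(np.ptp(reduced, axis=0))   # essentially zero in every direction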

The assumption that the transformations will derive from the action of a Lie group, or some neighbourhood of the identity in a Lie group, is overly optimistic. It is reasonable to suppose, however, that this is largely the case, and that many transformations can be approximated adequately in this way.

As well as factoring out the transformations under which the trajectory classification is invariant, or at least giving it a shot, there is another desideratum: stability under noise. The measurements of the state of the system at different times will generally be, in this imperfect world, contaminated with noise. This will have an effect on the job of estimating the dimension and structure of any manifold of transformations. It would therefore be as well to smooth the data first. Unfortunately, the smoothing process might easily eliminate the significant parts of the signal or trajectory. For example, in handwriting, cusps and regions of high curvature are usually highly informative, often being separators between different regimes. Having a curve blunted just when it is starting to point somewhere interesting would not be good for recognition. The assumptions behind most smoothing operations, that we have a stationary or at least slowly changing process which is afflicted by gaussian noise, may be grotesquely inappropriate. It may be necessary to divide the trajectory up into segments, cutting where there are plainly distinct regimes so that each segment is relatively homogeneous. If one does this with cursive writing, one gets what might reasonably be called strokes, and one can also find such things in the speech signal. I shall discuss this at some length in the chapter on Syntactic Pattern Recognition.
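One simple way such a segmentation might be done is sketched below; the turning-angle criterion and the threshold are my own choices rather than a method prescribed in the text.

import numpy as np

def split_into_strokes(points, angle_threshold=1.0):
    """Split a sampled pen trajectory at points of sharp turning.

    points: array of shape (T, 2), successive pen positions.
    The turning angle at each interior sample is the change of heading
    between the incoming and outgoing steps; a cusp or sharp corner
    gives a large angle.  The trajectory is cut there, giving
    stroke-like segments that can be smoothed separately without
    blunting the corners.
    """
    points = np.asarray(points, dtype=float)
    steps = np.diff(points, axis=0)
    headings = np.arctan2(steps[:, 1], steps[:, 0])
    # Change of heading between consecutive steps, wrapped to [0, pi].
    turn = np.abs(np.angle(np.exp(1j * np.diff(headings))))
    cut_points = np.where(turn > angle_threshold)[0] + 1   # indices into points
    return np.split(points, cut_points)

# A 'V' written as one trajectory: the cusp at the bottom separates
# it into two strokes.
t = np.linspace(0, 1, 50)
down = np.column_stack([t, -t])
up = np.column_stack([1 + t[1:], -1 + t[1:]])
strokes = split_into_strokes(np.vstack([down, up]))
print(len(strokes))   # 2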

