In the second chapter I suggested ways of turning an image, or a part of an image, into a vector of numbers. In the last chapter, I showed, inter alia, how to model a collection of points in $\mathbb{R}^n$ by gaussians and mixtures of gaussians.
If you have two or more categories of point (paint them different colours) in $\mathbb{R}^n$, and if you fit a gaussian or mixture of gaussians to each category, you can use the decision process (also described in the last chapter) to decide, for any new point, to which category it probably belongs.
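As a concrete, purely illustrative sketch of this procedure, the following Python fragment fits a single gaussian to each category and assigns a new point to the category with the largest prior-weighted likelihood; the equal-cost decision rule, the estimation of priors from class counts and the use of scipy are my assumptions for the sketch, not part of the text.

\begin{verbatim}
# Sketch: fit one gaussian per category of labelled points in R^n,
# then classify a new point by the largest prior-weighted density.
# Assumes equal misclassification costs; priors come from class counts.
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(points):
    # points: (m, n) array of m sample points in R^n
    return points.mean(axis=0), np.cov(points, rowvar=False)

def classify(x, class_points):
    # class_points: dict mapping a category label to its (m, n) array
    total = sum(len(p) for p in class_points.values())
    scores = {}
    for label, pts in class_points.items():
        mean, cov = fit_gaussian(pts)
        prior = len(pts) / total
        scores[label] = prior * multivariate_normal(mean, cov).pdf(x)
    return max(scores, key=scores.get)
\end{verbatim}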
It should be clear that in modelling the set of points belonging to one category by gaussians (or indeed any other family of distributions) we are making some assumptions about the nature of the process responsible for producing the data. The assumptions implicit in the gaussian mixture pdf are very modest and amount to supposing only that a large number of small and independent factors are producing unpredictable fluctuations about each of a small number of `ideal' or `template' stereotypes described by the measuring process. This is, frequently, not unreasonable: if we are reading printed text, we can suppose that there are several ideal shapes for a letter /A/, depending on whether it is in an italic, Roman or sans-serif font, and that in addition there are small wobbles at the pixel level caused by quantisation and perhaps the noise of scanning. There should be as many gaussians for each letter as there are distinct stereotypes, and each gaussian should describe the perturbations from this ideal. So the approach has some attractions. Moreover, it may be shown that any pdf may be approximated arbitrarily closely by a mixture of gaussians, so even if the production process is more complex than the simple model suggested for characters, it is still possible to feel that the model is defensible.
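The character example can be pushed a little further in the same illustrative spirit: fit a small mixture to each letter, one component per stereotype, and score a new feature vector under each mixture. The choice of scikit-learn and of three components per letter is mine, made only for the sketch.

\begin{verbatim}
# Sketch: one small gaussian mixture per letter, e.g. three components
# for /A/ to stand in for the italic, Roman and sans-serif stereotypes.
# Library and component count are illustrative choices, not prescriptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixtures(training_data, n_components=3):
    # training_data: dict mapping letter -> (m, n) array of feature vectors
    return {letter: GaussianMixture(n_components=n_components).fit(X)
            for letter, X in training_data.items()}

def most_likely_letter(x, mixtures):
    # score_samples gives the log density of the sample under each mixture
    x = np.asarray(x).reshape(1, -1)
    return max(mixtures,
               key=lambda letter: mixtures[letter].score_samples(x)[0])
\end{verbatim}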
If we take two clusters of points, each described by a gaussian, there is, for any choice of costs, a decision hypersurface separating the two regions of $\mathbb{R}^n$ containing the data, as in the figure. This
hypersurface is defined by a linear combination of quadratics and hence is itself the zero set of a quadratic function. In the particular case when
both clusters have the same covariance matrix,
this reduces to a hyperplane. If the covariance
matrices are not very different, then a
hyperplane between the two regions will still
be a fair approximation in the region between
the
two clusters, which is usually the region we care
about. And you can do the sums faster with an
affine hyperplane, so why not use hyperplanes
to implement decision boundaries? Also, we
don't usually have any good grounds for believing
the clusters are gaussian anyway, and
unless there's a whole lot of data, our estimates
of the covariance matrices and centres are
probably shaky, so the resulting decision boundary
is fairly tentative, and approximating it
with a hyperplane is quick and easy. And for more
complicated regions, why not use piecewise
affine decision boundaries? Add to this the proposition
that neurons in the brain implement
affine subspaces as decision boundaries, and the
romance of neural nets is born.
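Before leaving this argument, it may help to spell out the reduction claimed above. Write the two class densities as gaussians with means $\mu_1, \mu_2$, covariances $\Sigma_1, \Sigma_2$ and prior probabilities $\pi_1, \pi_2$ (the notation here is mine, chosen for this aside). With equal costs, the decision boundary lies where the prior-weighted log densities agree:
\[
\log\pi_1 - \tfrac{1}{2}\log|\Sigma_1| - \tfrac{1}{2}(x-\mu_1)^{T}\Sigma_1^{-1}(x-\mu_1)
\;=\;
\log\pi_2 - \tfrac{1}{2}\log|\Sigma_2| - \tfrac{1}{2}(x-\mu_2)^{T}\Sigma_2^{-1}(x-\mu_2),
\]
which is the zero set of a quadratic in $x$. When $\Sigma_1 = \Sigma_2 = \Sigma$, the terms in $x^{T}\Sigma^{-1}x$ cancel and the boundary collapses to the hyperplane
\[
(\mu_1 - \mu_2)^{T}\Sigma^{-1}x
\;=\;
\tfrac{1}{2}\left(\mu_1^{T}\Sigma^{-1}\mu_1 - \mu_2^{T}\Sigma^{-1}\mu_2\right) + \log\frac{\pi_2}{\pi_1}.
\]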
In this chapter, I shall first outline the history and some of the romance of neural nets. I shall explain the Perceptron convergence algorithm, which tells you how to train a single neuron, and explain how it was modified to deal with networks of more than one neuron, back in the dawn of neural net research. Then I shall discuss the hiatus in Neural Net research caused by, or at least attributed to, Marvin Minsky, and the rebirth of Neural Nets. I shall explain layered nets and the Back-Propagation algorithm. I shall discuss the menagerie of other Artificial Neural Nets (ANNs for short) and indicate where they fit into Pattern Recognition methods, and finally I shall make some comparisons with statistical methods of doing Pattern Recognition.
There was an interesting exchange on the net in 1990 which concerned the relative merits of statistical and neural net (NN) models. Part of it went as follows:
` ... I think NNs are more accessible because the mathematics is so straightforward, and the methods work pretty well even if you don't know what you're doing (as opposed to many statistical techniques that require some expertise to use correctly)' ......

However, it seems just about no one has really attempted a one-to-one sort of comparison using traditional pattern recognition benchmarks. Just about everything I hear and read is anecdotal.

Would it be fair to say that ``neural nets'' are more accessible, simply because there is such a plethora of `sexy' user-friendly packages for sale? Or is back-prop (for example) truly a more flexible and widely-applicable algorithm than other statistical methods with uglier-sounding names?

If not, it seems to me that most connectionists should be having a bit of a mid-life crisis about now.'
From Usenet News, comp.ai.neural-net, August 1990.
This may be a trifle naive, but it has a refreshing honesty and clarity. The use of packages by people who don't know what they're doing is somewhat worrying: if you don't know what you're doing, you probably shouldn't be doing it. But Statistics has had to put up with Social Scientists (so called) doing frightful things with SPSS and other statistical packages for a long time now. And the request for some convincing arguments in favour of Neural Nets is entirely reasonable and to be commended. The lack of straight and definitive answers quite properly concerned the enquirer.
The question of whether neural nets are the answer to how brains work, the best known way of doing Artificial Intelligence, or just the current fad to be exploited by the cynical as a new form of intellectual snake oil, merits serious investigation. The writers tend to be partisan and the evidence confusing. We shall investigate the need for a mid-life crisis in this chapter.