Both kinds of chunking can be performed in either category, text or images. If, for example, we used k-grams on a text consisting of long strings of one letter followed by long strings of another, then in the middle of one string we should predict the next letter to be the same, and the chunks (words) would all be strings of a single letter. Dually, if we took sequences of forms and learnt that one form tended to predict another in a particular position reasonably close to it and in a similar orientation, we should be able to use extrinsic chunking of forms. We see also that storing the possibilities and their counts allows us to extract straight line segments and also `corners' as significant `features', by what is a clustering UpWrite rather than the chunking UpWrite we have used exclusively thus far in our discussion of images. Here we do not use the entropy of the conditional probability distribution; we use the information supplied to the predictor by the data.
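As a concrete illustration, here is a minimal sketch of intrinsic chunking by k-gram prediction on such a text. The toy corpus, the choice k = 2, and the function names are my own; a chunk boundary is placed wherever the predictor's most probable continuation fails.

```python
from collections import defaultdict

def kgram_counts(text, k):
    """For each k-gram context, count which letters follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - k):
        counts[text[i:i + k]][text[i + k]] += 1
    return counts

def chunk(text, k, counts):
    """Cut the text wherever the predictor is surprised, i.e. wherever
    the observed letter is not the most probable continuation."""
    chunks, start = [], 0
    for i in range(k, len(text)):
        following = counts.get(text[i - k:i])
        if following and max(following, key=following.get) != text[i]:
            chunks.append(text[start:i])
            start = i
    chunks.append(text[start:])
    return chunks

model = kgram_counts("aaaabbbbaaaabbbb" * 8, 2)
print(chunk("aaaabbbbaaaa", 2, model))   # → ['aaaa', 'bbbb', 'aaaa']
```

The chunks recovered are exactly the strings of a single letter, as the text predicts.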
This is of particular significance when we contemplate alternative training procedures for generating recognition of such things as cubes. It was tacitly supposed that the system was trained first on line segments, then on faces consisting of parallelograms, and finally on cubes, so that it would have some regions of the UpWrite spaces at the intervening levels already containing points which could be recognised. It is hardly feasible, however, to train a human face recogniser on noses, eyes and mouths separately, and we confront some problems when we try to train exclusively on cubes. The question is: what particular set of points representing line segments should be selected so as to find faces? Why not instead select the `Y' shaped triple of line segments which is a distinctive feature of a cube seen in general position? Indeed, why not? It seems likely that such `features' are indeed used by the human eye.
In order to extract a face of a cube, as a distinctive feature (i.e. as a point in a suitable UpWrite space), in a set of images of cubes, we would seem to need to perform a rather painful combinatorial search through the space of line segments, choosing some subset which recur often enough to be worth grouping into an entity or feature. This problem of entity formation has troubled psychologists and also neurophysiologists, by whom it is referred to as binding. It is an important issue, and I shall try to persuade the reader that the method outlined here reduces the matter to a formal problem in geometry and statistics.
Suppose our data consists entirely of binary images of line drawings of cubes, each in general position, differing in location, size, and orientation with respect to the plane onto which they are projected. Suppose, moreover, that each image is perfect, with only quantisation noise occurring. Then there is nothing to be gained from a syntactic decomposition at all, and it would be sensible to chunk the whole object on each occasion into one set of pixels, and to UpWrite this using higher order moments to obtain an embedding of the transformation group. The UpWrite would be done in one step, and would hardly be justified in being called an UpWrite at all. The presumption is that new data will also look like a cube in general position, so there is nothing to classify. Similarly, if there are images of pyramids in general position and images of cubes in general position, then we could simply UpWrite each directly to a point in a moment space and obtain two manifolds, one for each object, with little difficulty. Recognition could be accomplished with another level of UpWrite, in other words by modelling each manifold crudely and computing distances.
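A sketch of the one-step UpWrite by moments follows; the moment order and the normalisation are my own choices. For simplicity I normalise location and scale out, so that images of the same figure land on (essentially) the same point of the moment space; keeping the raw moments instead would trace out the manifold of the transformation group.

```python
import numpy as np

def moment_vector(pixels, max_order=3):
    """UpWrite a set of 'on' pixel coordinates, shape (N, 2), to a single
    vector of central moments, normalised for position and scale."""
    pts = pixels - pixels.mean(axis=0)                    # translation invariance
    pts = pts / np.sqrt((pts ** 2).sum(axis=1).mean())    # crude scale invariance
    feats = []
    for p in range(max_order + 1):
        for q in range(max_order + 1 - p):
            if p + q >= 2:       # the constant and centred first moments are trivial
                feats.append((pts[:, 0] ** p * pts[:, 1] ** q).mean())
    return np.array(feats)

# the same figure at two different positions and sizes maps to
# (essentially) the same point of the moment space
rng = np.random.default_rng(0)
shape = rng.normal(size=(200, 2))
a = moment_vector(3.0 * shape + np.array([10.0, -4.0]))
b = moment_vector(0.5 * shape + np.array([-2.0, 7.0]))
# a and b agree to within floating-point error
```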
Now let us suppose that the data set of cube and pyramid line drawings is modified to contain noise at the level of the line segments; that is to say, some of the lines are too short, too long, slightly slanted, or slightly displaced from the ideal position. Now a direct moment description of the pixel sets leads to very considerable variation at the UpWritten level. If, in addition, some of the lines are missing altogether, something that the human eye will recognise as a cube with a bit missing will not usually be correctly classified. The situation arises naturally in the case of handprinted characters, where variation at the stroke level can reduce recognition rates considerably.
In this case there is a plain evolutionary advantage in finding intermediate structure, and it can be done by intrinsic chunking or by extrinsic chunking. The former is simpler, but the latter is more powerful, since it tells a program trained on the data described that, at the lowest level of pixels, there is a structure which can be described locally, and which extends either linearly or so as to produce corners. We collect sets of quadratic forms and discover that they are locally correlated in one or the other of two simple ways. In general, we take a small neighbourhood of a point in the space and describe it by UpWriting it. Next we take a ball in the UpWrite space and UpWrite the points in that. Then we look for clusters in this space. In the case of the cubes and pyramids, when we look at, say, pairs or triples of ellipses, each describing a small region of the image, we get a cluster of ellipses in straight lines. We may get a smaller cluster of ellipses in right-angled bends for cube images, and similarly for 120° bends if there are pyramids. The existence of the clusters tells us that there is structure and that it should be exploited. It also tells us how to exploit it.
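The following sketch illustrates the idea on a toy L-shaped figure. The neighbourhood radius, the use of the second-moment ellipse's major axis as the local description, and all names are my own choices rather than anything prescribed here.

```python
import numpy as np

def local_orientation(pixels, centre, radius):
    """Describe a small neighbourhood by its second-moment ellipse and
    return the direction of the major axis, as an angle in [0, pi)."""
    d = pixels - centre
    nb = pixels[(d ** 2).sum(axis=1) < radius ** 2]
    c = nb - nb.mean(axis=0)
    cov = c.T @ c / len(nb)                  # the quadratic form (ellipse)
    vals, vecs = np.linalg.eigh(cov)
    major = vecs[:, np.argmax(vals)]
    return np.arctan2(major[1], major[0]) % np.pi

def pair_angles(pixels, centres, radius):
    """For nearby pairs of ellipses, record the angle between their major
    axes; a cluster near 0 signals straight lines, one near pi/2 signals
    right-angled corners."""
    thetas = [local_orientation(pixels, c, radius) for c in centres]
    out = []
    for i in range(len(centres)):
        for j in range(i + 1, len(centres)):
            if np.linalg.norm(centres[i] - centres[j]) < 3 * radius:
                d = abs(thetas[i] - thetas[j])
                out.append(min(d, np.pi - d))
    return np.array(out)

# an L-shaped figure: a horizontal and a vertical stroke meeting at a corner
horiz = np.stack([np.linspace(0, 10, 60), np.zeros(60)], axis=1)
vert = np.stack([np.zeros(60), np.linspace(0, 10, 60)], axis=1)
pixels = np.vstack([horiz, vert])
angles = pair_angles(pixels, pixels[::10], radius=1.5)
# most pair angles lie near 0 (collinear pairs) or near pi/2 (pairs across
# the corner), with a few intermediate values where a neighbourhood
# straddles the bend
```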
We then select a prediction system extracted from the clustering, and use it to predict where the next ellipse should be in each image; and of course we use the information supplied by the image to partition the pixels into line segments.
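A minimal sketch of such prediction-driven chunking, assuming an idealised two-stroke figure; the step size, the tolerance, and the nearest-pixel rule are my own choices.

```python
import numpy as np

def trace_segment(pixels, start, direction, step=0.17, tol=0.1):
    """Grow a chunk by prediction: repeatedly predict where the next
    pixel ought to be (straight ahead of the current one) and claim the
    nearest actual pixel; when the prediction fails, the segment ends."""
    claimed = [np.asarray(start, dtype=float)]
    pos = claimed[0]
    while True:
        pred = pos + step * direction            # where the next pixel should be
        dists = np.linalg.norm(pixels - pred, axis=1)
        k = int(np.argmin(dists))
        if dists[k] > tol:                       # too surprising: stop the chunk
            return np.array(claimed)
        direction = (pixels[k] - pos) / np.linalg.norm(pixels[k] - pos)
        claimed.append(pixels[k])
        pos = pixels[k]

# two strokes meeting at a right angle; tracing from the left end of the
# horizontal stroke claims exactly that stroke and stops at the corner
horiz = np.stack([np.linspace(0, 10, 60), np.zeros(60)], axis=1)
vert = np.stack([np.full(60, 10.0), np.linspace(0, 10, 60)], axis=1)
pixels = np.vstack([horiz, vert])
seg = trace_segment(pixels, horiz[0], np.array([1.0, 0.0]))
# seg contains the 60 pixels of the horizontal stroke only
```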
This procedure works at every level. If there is noise at the level of faces of a cube or pyramid, we look at the sets of points in the UpWrite space corresponding to, say, a cube. There will be nine of them. Taking any neighbourhood of one big enough to include at least one other such point, and taking note of where it is by computing moments of the pair, we repeat on a lot of images of cubes and discover a cluster. This cluster might denote the angled bends between adjacent edges, the `Y' shaped feature in the middle, or the parallelism of opposite edges. All that is necessary is that the property constitute a cluster when data is accumulated from a set of images, in other words that it recur with sufficient frequency. If the neighbourhoods chosen are small, then we may find ourselves going through intervening stages where we group together edges into corners or opposite faces. If they are larger we shall find faces, and if even larger, we shall find only the whole cube.
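The pair-moment idea can be illustrated with the crudest possible descriptor, the angle between two edge directions. The particular drawing angles below are an assumption (the idealised hexagon-with-`Y' view of a cube), in which the nine edges fall into three parallel triples.

```python
import numpy as np
from collections import Counter

# assumed drawing: the idealised hexagon-with-'Y' view of a cube, whose
# nine visible edges point in one of three directions (my choice of angles)
directions = np.radians([90, 210, 330])
edges = np.repeat(directions, 3) % np.pi        # edge orientation is mod pi

# the crudest pair descriptor: the folded angle between two edge directions
pairs = []
for i in range(9):
    for j in range(i + 1, 9):
        d = abs(edges[i] - edges[j])
        pairs.append(int(round(np.degrees(min(d, np.pi - d)))))

print(Counter(pairs))   # → Counter({60: 27, 0: 9}): two clusters, of which
                        # the 0-degree one is the parallelism of edges
```

With noisy drawings the two spikes become genuine clusters, but the principle is the same: the recurring pair properties announce themselves as clusters.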
It is plain that there is a case for some kind of feedback from one level to its predecessor, saying what is a sensible resolution radius. If the data is too surprising, a smaller choice of radius at the preceding level might be a good idea.
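A toy version of such feedback, with the surprisal threshold and the shrink factor entirely my own choices: the predecessor level's radius is halved whenever the mean information supplied by the data (in bits) exceeds the threshold.

```python
import math

def mean_surprisal(probs):
    """Average information (in bits) the data supplies to the predictor:
    minus log2 of the model probability of each observed continuation."""
    return sum(-math.log2(p) for p in probs) / len(probs)

def feedback_radius(radius, probs, threshold_bits=2.0, shrink=0.5):
    """If the data at this level is too surprising, feed back a request
    for a smaller resolution radius at the preceding level."""
    if mean_surprisal(probs) > threshold_bits:
        return radius * shrink
    return radius

print(feedback_radius(1.5, [0.9, 0.8, 0.85]))   # → 1.5 (unsurprising data)
print(feedback_radius(1.5, [0.05, 0.1, 0.02]))  # → 0.75 (radius halved)
```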
Note that the combinatorial issue of which points to group together is solved by the strategy just outlined: we UpWrite small neighbourhoods and group together precisely those points whose joint description recurs often enough, over the data set, to form a cluster.
Such an approach has been used in finding strokes making up the boundaries of moving solid objects such as heads.