This brief discussion of adaptive filtering is intended to place the general ideas of what adaptive filters do and how they do it in some sort of context relevant to pattern recognition. The details of how to do the sums are easily obtained from the sources given in the bibliography, should that look to be a good idea.
The last subsection should have made it clear
that computing the values of an optimal MA filter
was essentially the
same idea as fitting a straight line through a
set of points. After all, if I give you data
consisting of three points (x1, y1), (x2, y2), (x3, y3)
and ask for the line y = mx+c that fits the
data best, then you follow Gauss (by about two
hundred years; my how time flies when you're
having fun) and simply take the sum of squares:
(y1 - (mx1 + c))^2 + (y2 - (mx2 + c))^2 + (y3 - (mx3 + c))^2
and choose m and c so as to make this a minimum. This sum of squares we regard as the (squared) distance from the point (y1, y2, y3) to the plane in R^3 spanned by the vectors (x1, x2, x3) and (1, 1, 1). This is a minimum at a point m(x1, x2, x3) + c(1, 1, 1) of the plane when the line joining that point to (y1, y2, y3) is orthogonal to the plane and hence to its basis vectors. This is precisely what we did with a filter having two coefficients.
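To make the geometry concrete, here is a small sketch in Python (using NumPy; the data values and variable names are mine, chosen only for illustration) that fits m and c by least squares and then checks that the residual vector really is orthogonal to the two vectors spanning the plane.

import numpy as np

# Three made-up data points (x_i, y_i); any values will do for the illustration.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])

# The plane in R^3 is spanned by the columns of A: (x1, x2, x3) and (1, 1, 1).
A = np.column_stack([x, np.ones_like(x)])

# Least squares picks (m, c) so that m(x1, x2, x3) + c(1, 1, 1) is the point
# of the plane closest to the point (y1, y2, y3).
(m, c), *_ = np.linalg.lstsq(A, y, rcond=None)

# The residual, the line joining that point to (y1, y2, y3), is orthogonal
# to both basis vectors of the plane.
residual = y - A @ np.array([m, c])
print(m, c)
print(x @ residual, np.ones(3) @ residual)   # both (numerically) zero

The two printed inner products come out as zero up to rounding, which is the orthogonality condition doing the work.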
Gauss was confronted with the problem of
continuously updating his estimate; he had the
batch mode least squares worked out by 1795,
and
devised an online least squares system a quarter
of a century later. He also did rather a lot
of
other things in the same period.
If I give you a stream of (xi, yi) and leave you to find the best fit straight line to the data, you can wait until the stream stops and then start working, or you can take a block of some number of points, do the fit for that block, and then work out how to pool your results with those for the next block of data. A block of data can of course consist of just one datum. If you have reason to suspect the best fit line may be changing in time, you may want to discount old data somewhat. When the model does not change its parameters in time we say it is stationary (a definition which raises as many questions as it answers); a recursive least squares best fit algorithm which can discount the remote past can track a non-stationary process.
The various methods of doing adaptive filtering are, in effect, doing just this. The details can be found in some of the books given in the references.
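As a sketch of what doing it sequentially with discounting can look like (one common recursive least squares recipe, not a construction taken from this text; the names theta, P and lam are mine), assuming the straight line model y = mx + c:

import numpy as np

def rls_update(theta, P, x, y, lam=0.99):
    # One recursive least squares step for the model y = m*x + c.
    # theta is the current estimate [m, c]; P is the current "inverse
    # information" matrix; lam is the forgetting factor (lam = 1 keeps
    # all of the past, lam < 1 discounts the remote past).
    phi = np.array([x, 1.0])                # regressor for this datum
    k = P @ phi / (lam + phi @ P @ phi)     # gain vector
    theta = theta + k * (y - phi @ theta)   # correct by the prediction error
    P = (P - np.outer(k, phi @ P)) / lam    # update the discounted covariance
    return theta, P

# Start vague (large P), then feed the stream one datum at a time.
theta, P = np.zeros(2), 1e6 * np.eye(2)
for x, y in [(0.0, 1.0), (1.0, 3.1), (2.0, 4.9), (3.0, 7.2)]:
    theta, P = rls_update(theta, P, x, y)
print(theta)   # close to the batch best fit [m, c]

With lam = 1 this reproduces the batch answer; pushing lam below 1 lets the estimate drift after a line whose parameters are changing in time.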
In Kalman filtering the process is taken to a high level. In the simplest case, the equation y = mx + c is generalised so that y, x and c are vectors and m is a matrix, and c is often taken to be Gaussian white noise that has been filtered by an MA filter so as to autocorrelate the noise.
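For concreteness, here is a minimal sketch of one standard predict/correct form of the Kalman filter for a linear-Gaussian state space model; the matrices F, H, Q, R and the toy numbers are placeholders of my own, not values taken from the text.

import numpy as np

def kalman_step(x_est, P, y, F, H, Q, R):
    # One predict/correct cycle of a linear Kalman filter.
    # x_est, P : current state estimate and its covariance
    # y        : new (vector) observation
    # F, H     : state transition and observation matrices
    # Q, R     : process and observation noise covariances
    x_pred = F @ x_est                      # predict the next state
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ (y - H @ x_pred)   # correct, weighted least squares style
    P_new = (np.eye(len(x_est)) - K @ H) @ P_pred
    return x_new, P_new

# Toy example: estimate a (nearly) constant scalar observed in noise.
F = np.array([[1.0]]); H = np.array([[1.0]])
Q = np.array([[0.01]]); R = np.array([[1.0]])
x_est, P = np.array([0.0]), np.array([[1.0]])
for y in [0.9, 1.1, 1.05, 0.95]:
    x_est, P = kalman_step(x_est, P, np.array([y]), F, H, Q, R)
print(x_est)

The gain K is where the weighting of old against new information lives, which is the point made in the next paragraph.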
It remains true, however, that except for the complication of an inner product other than the old dot product (to describe the varying degree of credibility of past data and the autocorrelations of the noise with time), what we have is basically a least squares best fit problem done sequentially rather than in batch mode. The choice
of a modified least squares can either be justified
by placing restrictions on the model class (which
are never in practice confirmed to be correct),
or not justified at all and regarded as a simple
and workable heuristic (as Gauss did) for getting
an answer of some sort which is justified
post hoc and traded in for a better method when
one turns up.
Statisticians seem to be tempted by the first
approach; they feel the comforts of philosophy
keenly. Engineers, as is well known, incline to
be brutal, insensitive oafs
who can't distinguish clearly between rationales and
rationalisations and don't trust either.
Kalman filtering is of considerable interest partly because the vector c above, instead of being a noise vector to be got rid of, may also be interpreted as a control function used to force the system from one state to another, so there is a theory of control which is dual to the theory of filters. This halves the amount of work you have to put in to understand a given amount of material, which is a good thing.
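As a toy illustration of the dual reading (my own sketch, not an example from the text): the same scalar state equation x_{k+1} = a*x_k + c_k can be run with c_k as noise the filter would have to cope with, or with c_k chosen as a control which forces the state to a target.

import numpy as np

rng = np.random.default_rng(0)
a, target = 0.9, 5.0
x_noise = x_ctrl = 0.0

for _ in range(20):
    # Filtering view: c is unwanted noise, and the job is to estimate x in spite of it.
    x_noise = a * x_noise + rng.normal(scale=0.5)
    # Control view: c is ours to choose; here a one-step (deadbeat) law
    # cancels the dynamics and lands the state on the target.
    c = target - a * x_ctrl
    x_ctrl = a * x_ctrl + c

print(x_noise)   # wanders about under the noise
print(x_ctrl)    # sits exactly at the target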