Nevertheless, constructing a program which can learn some of the structure is worth a try.
First write a program which takes the sentences as strings at the letter level. Construct a trigrammar, a list of consecutive triples of letters, including whitespace and punctuation. Use it to build a prediction model for the data, and to determine, among other things, how to chunk letters into words. Cope with the problem of new bigrams by the `fallback' method: ignore the first letter of the bigram. If the last letter of the bigram has never been seen before, fall back all the way to letter frequency. Compute the entropy of the trigrammar model of the text. Compare it with the mean information supplied to the model by a piece of text it hasn't seen before.
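By way of illustration, a minimal Python sketch of such a letter-level trigrammar with the fallback rule might run as follows; the corpus strings and the floor probability given to a wholly unseen letter are illustrative assumptions, not part of the exercise:

from collections import Counter, defaultdict
from math import log2

def build_counts(text):
    # Count trigrams, bigrams and single letters, whitespace and punctuation included.
    tri = defaultdict(Counter)   # (a, b) -> counts of the letter that follows
    bi = defaultdict(Counter)    # b -> counts of the letter that follows
    uni = Counter(text)          # plain letter frequencies
    for a, b, c in zip(text, text[1:], text[2:]):
        tri[(a, b)][c] += 1
        bi[b][c] += 1
    return tri, bi, uni

def prob(tri, bi, uni, a, b, c):
    # P(c | ab) with fallback: unseen bigram ab -> ignore a; unseen letter b -> letter frequency.
    if (a, b) in tri:
        ctx = tri[(a, b)]
    elif b in bi:
        ctx = bi[b]
    else:
        ctx = uni
    total = sum(ctx.values())
    # Floor for a letter never seen in this context (an assumption, not from the text).
    return ctx[c] / total if ctx[c] else 1.0 / (total + len(uni))

def mean_bits(tri, bi, uni, text):
    # Average -log2 P per symbol: an entropy estimate on the training text,
    # the mean information supplied to the model on unseen text.
    bits = [-log2(prob(tri, bi, uni, a, b, c))
            for a, b, c in zip(text, text[1:], text[2:])]
    return sum(bits) / len(bits)

train = "the cat sat on the mat. the dog sat on the mat."
test = "the dog sat on the cat."
tri, bi, uni = build_counts(train)
print(mean_bits(tri, bi, uni, train), mean_bits(tri, bi, uni, test))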
Repeat for the word level. It may be convenient to have punctuation regarded as words. A vocabulary of not more than about 100 words is advisable. Again, compute the entropy of the model of the data. If you obtain estimates by trying it out on new data, it will be necessary in both cases to cope with the situation where a new symbol is encountered: it is suggested that you reserve a probability for anything you haven't seen before, given that you've seen n different things so far.
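The wording above leaves the actual reserved probability open; one common choice, used purely for illustration in the sketch below, is to hold back mass 1/(n+1) once n distinct words have been seen. The trigram machinery is as in the letter-level sketch, with words in place of letters; only the handling of new symbols is shown here:

import re
from collections import Counter
from math import log2

def tokens(text):
    # Split into words, with punctuation marks regarded as words in their own right.
    return re.findall(r"[a-z']+|[.,;!?]", text.lower())

class WordModel:
    # Unigram word model reserving probability for unseen words: having seen n
    # distinct words so far, mass 1/(n + 1) is kept back for anything new
    # (one possible choice of reserve, assumed here for illustration).
    def __init__(self, words):
        self.counts = Counter(words)
        self.total = len(words)

    def prob(self, w):
        n = len(self.counts)
        reserve = 1.0 / (n + 1)
        if w in self.counts:
            return (1.0 - reserve) * self.counts[w] / self.total
        return reserve

def mean_bits(model, words):
    return sum(-log2(model.prob(w)) for w in words) / len(words)

train = tokens("The cat sat on the mat. The dog chased the cat.")
test = tokens("The dog sat on the rug.")
m = WordModel(train)
print(mean_bits(m, train), mean_bits(m, test))   # 'rug' draws on the reserved mass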
Classify the words into types by eye. Names of children, for example, form a subcategory. UpWrite the original text at the word level into `bags', each bag containing the terms of a category such as childname. Now model the UpWritten stream. Determine the entropy of the model of the UpWritten stream. Determine the entropy of the model for the original stream consisting of the bag stream and the DownWrite.
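A sketch of the bag UpWrite, with a hand-made classification and a zeroth-order entropy standing in for the full model (both simplifying assumptions of the sketch):

from collections import Counter
from math import log2

# Classification done `by eye'; the particular words and bags are illustrative.
bags = {
    "janet": "CHILDNAME", "john": "CHILDNAME",
    "ball": "TOY", "kite": "TOY",
    "throws": "VERB", "catches": "VERB",
}

def upwrite(words):
    # Replace each word by its bag label; unclassified words stand for themselves.
    return [bags.get(w, w) for w in words]

def stream_entropy(stream):
    # Zeroth-order entropy in bits per symbol, as a simple stand-in for a full model.
    counts, total = Counter(stream), len(stream)
    return -sum(c / total * log2(c / total) for c in counts.values())

words = "janet throws the ball . john catches the kite .".split()
labels = upwrite(words)
# Since the bag label is a function of the word, the entropy of the original
# stream splits into the entropy of the bag stream plus that of the DownWrite.
print(stream_entropy(labels), stream_entropy(words) - stream_entropy(labels))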
Take a sentence template of the form:

Article  Adjective  Noun  Verb  Article  Noun  .
Now generate a list of possible words in each category, not more than twenty words in each. Finally generate sentences from any choice of a pdf for each `bag'.
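A sketch of this generation step; the word lists and the pdf attached to each bag are arbitrary choices made for illustration:

import random

# A small lexicon for each bag with an accompanying pdf (all values illustrative).
lexicon = {
    "Article":   (["the", "a"],                 [0.7, 0.3]),
    "Adjective": (["red", "small", "happy"],    [0.4, 0.3, 0.3]),
    "Noun":      (["dog", "ball", "child"],     [0.4, 0.3, 0.3]),
    "Verb":      (["chases", "throws", "sees"], [0.4, 0.3, 0.3]),
    ".":         (["."],                        [1.0]),
}

template = ["Article", "Adjective", "Noun", "Verb", "Article", "Noun", "."]

def sentence():
    # DownWrite the template: draw one word from each bag according to its pdf.
    return " ".join(random.choices(*lexicon[bag])[0] for bag in template)

for _ in range(5):
    print(sentence())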
The problem is to try to recover the bag structure from the sample. Try it as follows: Take a trigrammar and, for any word pair ab, list the sequents of ab and make them into a probability distribution over the alphabet. Now, in the space of all such distributions, look for clusters. If you find at least one cluster, label it C and assign to it the group of pdfs in the cluster. Now perform a cluster UpWrite on the original stream by replacing whatever follows ab by C precisely when C is the cluster containing a distribution obtained from collecting sequents to ab.
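One way the recovery step might be sketched; the greedy clustering and its distance threshold are assumptions of this sketch rather than part of the prescription:

from collections import Counter, defaultdict
from math import sqrt

def successor_pdfs(words):
    # For each word pair ab, the distribution (over the vocabulary) of what follows ab.
    seq = defaultdict(Counter)
    for a, b, c in zip(words, words[1:], words[2:]):
        seq[(a, b)][c] += 1
    vocab = sorted(set(words))
    return {ab: [cts[w] / sum(cts.values()) for w in vocab] for ab, cts in seq.items()}

def cluster(pdfs, radius=0.5):
    # Greedy clustering in the space of distributions: join the first centre
    # within 'radius' (Euclidean), otherwise start a new cluster.
    centres, labels = [], {}
    for ab, p in pdfs.items():
        for i, c in enumerate(centres):
            if sqrt(sum((x - y) ** 2 for x, y in zip(p, c))) < radius:
                labels[ab] = i
                break
        else:
            labels[ab] = len(centres)
            centres.append(p)
    return labels

def cluster_upwrite(words, labels):
    # Replace whatever follows ab by the label of the cluster holding ab's sequent pdf.
    out = list(words)
    for i in range(len(words) - 2):
        ab = (words[i], words[i + 1])
        if ab in labels:
            out[i + 2] = f"C{labels[ab]}"
    return out

words = ("the red dog chases a small ball . the happy child throws the ball . "
         "a small dog sees the red ball .").split()
labels = cluster(successor_pdfs(words))
print(cluster_upwrite(words, labels))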
Replace the old stream by the new one, and do it again. How much data is required before the categories of `noun' and `verb' emerge? What is the entropy of the model of the stream which first predicts bags and then predicts elements within a bag, assuming the choices of bag element are independent, with probabilities given by the mean pdf for the cluster? How does this compare with a trigram word model?