
Exercises

1.
Get one of the children's reading primers recounting the adventures of Janet and John, or Dick and Dora, or Peter and Jane, or some similar pair. Put the sentences into the computer. Alternatively, find such a source on the net if you can, and then tell me what it is. Note that teaching these sentences to a computer program is not at all the same as teaching them to a child, because (a) the child has already acquired a spoken language and (b) the child can look at the pictures that come with the text and interpret them. The program can't.

Nevertheless, constructing a program which can learn some of the structure is worth a try.

First write a program which takes the sentences as strings at the letter level. Construct a trigrammar, a list of consecutive triples of letters, including whitespace and punctuation. Use it to build a prediction model for the data, and use the model to work out, among other things, how to chunk letters into words. Cope with the problem of new bigrams by the `fallback' method: ignore the first letter of the bigram, and if the last letter of the bigram has never been seen before either, fall back all the way to raw letter frequency. Compute the entropy of the trigrammar model of the text, and compare it with the mean information supplied to the model by a piece of text it hasn't seen before.
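
A minimal sketch of one way to set this up in Python follows; the fallback chain (trigram context, then last letter only, then raw letter frequency) matches the exercise, but the handling of completely novel letters is my own assumption.

\begin{verbatim}
from collections import Counter, defaultdict
from math import log2

def train(text):
    tri = defaultdict(Counter)      # (a, b) -> counts of the following letter
    bi  = defaultdict(Counter)      # b -> counts of the following letter
    uni = Counter(text)             # raw letter frequencies
    for a, b, c in zip(text, text[1:], text[2:]):
        tri[(a, b)][c] += 1
        bi[b][c] += 1
    return tri, bi, uni

def prob(model, a, b, c):
    tri, bi, uni = model
    if (a, b) in tri:               # bigram context seen before
        counts = tri[(a, b)]
    elif b in bi:                   # fallback: ignore the first letter
        counts = bi[b]
    else:                           # fall back all the way to letter frequency
        counts = uni
    total = sum(counts.values())
    return counts[c] / total if counts[c] else 1.0 / (total + 1)

def mean_bits(model, text):
    # mean information (bits per letter) supplied to the model by the text
    triples = list(zip(text, text[1:], text[2:]))
    return -sum(log2(prob(model, a, b, c)) for a, b, c in triples) / len(triples)
\end{verbatim}

Training on the primer text and then calling mean_bits on a held-out passage gives the comparison asked for; the entropy of the model itself comes from the same sum taken over the training text.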

Repeat for the word level. It may be convenient to regard punctuation marks as words. A vocabulary of not more than about 100 words is advisable. Again, compute the entropy of the model of the data. If you obtain estimates by trying the model out on new data, it will be necessary in both cases to cope with the situation where a new symbol is encountered: it is suggested that you reserve a $\frac{1}{n}$ probability for anything you haven't seen before, given that you've seen $n$ different things so far.
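
A sketch of the suggested escape scheme, again in Python; the floor used to guard the degenerate start-up cases (when at most one distinct word has been seen) is my own assumption.

\begin{verbatim}
from math import log2

def mean_bits_per_word(words):
    counts, total, bits = {}, 0, 0.0
    for w in words:
        n = len(counts)                       # distinct words seen so far
        if w not in counts or n == 0:
            p = 1.0 / max(n, 1)               # reserve 1/n for anything new
        else:
            p = (1.0 - 1.0 / n) * counts[w] / total
        bits += -log2(max(p, 1e-9))           # floor guards the start-up cases
        counts[w] = counts.get(w, 0) + 1
        total += 1
    return bits / len(words)
\end{verbatim}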

Classify the words into types by eye. Names of children, for example, form a subcategory. UpWrite the original text at the word level into `bags', each bag containing the terms of a category such as childname. Now model the UpWritten stream and determine the entropy of that model. Then determine the entropy of the model for the original stream, which consists of the bag stream together with the DownWrite.
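
A sketch of the UpWrite and of the bookkeeping for the DownWrite cost; the bag names and the particular word-to-bag dictionary here are invented for illustration only.

\begin{verbatim}
from math import log2

# hand-built classification by eye; these entries are examples only
BAGS = {"janet": "CHILDNAME", "john": "CHILDNAME",
        "spot": "ANIMALNAME",
        "sees": "VERB", "runs": "VERB"}

def upwrite(words):
    # replace each word by its bag label; unclassified words stay as themselves
    return [BAGS.get(w, w) for w in words]

def downwrite_bits(words, bag_pdfs):
    # bag_pdfs: label -> {word: probability of that word within the bag};
    # the per-word cost of recovering the original stream from the bag stream
    cost = sum(-log2(bag_pdfs[BAGS[w]][w]) for w in words if w in BAGS)
    return cost / len(words)
\end{verbatim}

The entropy of the model for the original stream is then the entropy of your model of the UpWritten stream plus this DownWrite cost.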

2.
Make up a bag grammar by choosing some particular sentence structure such as

Article < Adjective < Noun < Verb < Article < Noun < .

Now generate a list of possible words in each category, not more than twenty words in each. Finally generate sentences from any choice of a pdf for each `bag'.
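
A sketch of the generator, assuming each bag is stored as a list of (word, probability) pairs; the template follows the structure above, and the particular words and probabilities are placeholders.

\begin{verbatim}
import random

TEMPLATE = ["Article", "Adjective", "Noun", "Verb", "Article", "Noun", "."]

BAGS = {                                  # pdfs over at most twenty words each
    "Article":   [("the", 0.6), ("a", 0.4)],
    "Adjective": [("big", 0.5), ("little", 0.5)],
    "Noun":      [("dog", 0.5), ("ball", 0.5)],
    "Verb":      [("sees", 0.5), ("chases", 0.5)],
    ".":         [(".", 1.0)],
}

def sentence():
    out = []
    for bag in TEMPLATE:
        words, weights = zip(*BAGS[bag])
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

sample = [sentence() for _ in range(1000)]   # the corpus for the recovery problem
\end{verbatim}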

The problem is to try to recover the bag structure from the sample. Try it as follows: take a trigrammar and, for any word pair ab, list the sequents of ab and make them into a probability distribution over the alphabet. Now, in the space of all such distributions, look for clusters. If you find at least one cluster, label it C and assign to it the group of pdfs it contains. Now perform a cluster UpWrite on the original stream by replacing whatever follows ab by C precisely when C is the cluster containing the distribution obtained from collecting the sequents of ab.
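
A sketch of collecting the sequent distributions and of the cluster UpWrite; the clustering step itself (k-means over the pdfs, say) is left to whatever method you prefer and is not prescribed by the exercise.

\begin{verbatim}
from collections import Counter, defaultdict

def sequent_pdfs(words):
    # for each pair ab, the distribution over what follows ab,
    # expressed as a vector over a fixed ordering of the vocabulary
    succ = defaultdict(Counter)
    for a, b, c in zip(words, words[1:], words[2:]):
        succ[(a, b)][c] += 1
    vocab = sorted(set(words))
    return {pair: [cnt[w] / sum(cnt.values()) for w in vocab]
            for pair, cnt in succ.items()}, vocab

def cluster_upwrite(words, labels):
    # labels: pair ab -> cluster name (e.g. "C0") whenever ab's sequent pdf
    # fell in a found cluster; whatever follows ab is replaced by that label
    out = list(words[:2])
    for a, b, c in zip(words, words[1:], words[2:]):
        out.append(labels.get((a, b), c))
    return out
\end{verbatim}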

Replace the old stream by the new one, and do it again. How much data is required before the categories `noun' and `verb' emerge? What is the entropy of the model of the stream which first predicts the bag and then predicts the element within the bag, assuming the choices of bag element are independent and governed by the mean pdf for the cluster? How does this compare with a trigram word model?
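
Under that independence assumption the bookkeeping is straightforward: if the bag-stream model has entropy $H_{\rm bags}$ per symbol, bag $c$ accounts for a fraction $p(c)$ of the symbols, and $q_c$ is the mean pdf for cluster $c$, then (a sketch of the computation, with this notation introduced here for convenience)

\begin{displaymath}
H \;=\; H_{\rm bags} \;+\; \sum_{c} p(c)\, H(q_c),
\qquad
H(q_c) \;=\; -\sum_{w} q_c(w) \log_2 q_c(w) .
\end{displaymath}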

3.
Automate the process of picking bags such as childname or animalname in the case of data taken from the children's primers. This could be described as `real data', almost. Do you get any compression using these methods? Do you get better results than an $n$-grammar, for various $n$?


Mike Alder
9/19/1997