If we suppose that the vocal tract has some fixed number of states, that any word consists of some definite sequence of them which we can decide upon, but that the time spent in each state varies and that states are sometimes skipped, then a diagram such as Fig. 6.1 becomes almost defensible against a weak and incoherent attack. If we take the view that it is a simple model which compresses strings and works, sort of, then the only useful attack upon it is to improve it, which has been done successfully.
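The picture just described, states with self-loops for variable duration and forward jumps for skipped states, is what is usually called a left-to-right HMM, and its likelihood of producing a given string is computed by the forward algorithm. A minimal sketch follows; the transition and emission numbers are purely illustrative, not taken from Fig. 6.1.

```python
import numpy as np

# Illustrative left-to-right HMM: each state may loop on itself
# (variable duration), advance to the next state, or skip one ahead.
A = np.array([            # transition matrix
    [0.6, 0.3, 0.1],      # state 0: loop, step, skip
    [0.0, 0.7, 0.3],      # state 1: loop, step
    [0.0, 0.0, 1.0],      # state 2: absorbing
])
B = np.array([            # emission probabilities over 2 discrete symbols
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
])
pi = np.array([1.0, 0.0, 0.0])   # always start in the first state

def forward(obs):
    """Forward algorithm: P(observation sequence | model)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward([0, 0, 1, 1]))
```

The zeros below the diagonal of `A` are what make the model left-to-right: the process can never return to an earlier state, which is the sense in which it compresses a string into a short sequence of state occupancies.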
It is found in the case of Hidden Markov Models, just as for Gaussian Mixture Modelling, that EM tends to be sensitive to initialisation, and in the former case good initialisations for different words are passed between nervous-looking individuals in seedy-looking bars in exchange for used notes of small denomination. At least, nobody publishes them, which one might naively think people would. But this rests on the assumption that people publish papers and books in order to inform readers about useful techniques, and not to impress them with their supernal cleverness. To be fairer, initialisations are word and sample dependent, so you may simply have to try a lot of different random ones and see which work best.
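The try-lots-of-random-ones-and-keep-the-best strategy can be sketched for the simpler Gaussian Mixture case; the same random-restart loop applies unchanged to HMM training. Everything here, the data, the number of restarts, the bare-bones EM, is illustrative rather than a recipe from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: two overlapping 1-D clusters.
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

def em_gmm(x, mu0, iters=50):
    """Bare-bones EM for a 1-D Gaussian mixture, started from means mu0."""
    k = len(mu0)
    mu, var, w = mu0.astype(float).copy(), np.ones(k), np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        n = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n
        w = n / len(x)
    ll = np.log((w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                 / np.sqrt(2 * np.pi * var)).sum(axis=1)).sum()
    return ll, mu

# Random restarts: several random initial means, keep the best likelihood.
best_ll, best_mu = max(
    (em_gmm(data, rng.choice(data, 2)) for _ in range(10)),
    key=lambda t: t[0],
)
print(best_ll, best_mu)
```

Since EM only finds a local maximum of the likelihood, and the likelihood of each fitted model is cheap to evaluate, ranking the restarts by final log-likelihood is the obvious way of deciding "which work best".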