Start off with an arbitrarily chosen set of probabilities of choosing an arc from each node except the last, and another set of arbitarily chosen probabilities for each symbol being emitted on arriving at a node. Suppose you have to hand your sample of strings allegedly produced by this markov model. Now select the first string from the set of samples, and trace it through the diagram. If this cannot be done, some of your probabilities must have been made zero, which was naughty of you, go back and make them all positive. If it could be done in a lot of different ways, make a list of all of them. Since it is a finite length string and a finite diagram, there can only be a finite set of possibilities.
Each way of laying down the string from < to > through the diagram can have a probability computed for it, based on the arbitrarily assigned probabilities for transitions and emissions of symbols. The probabilities of choosing the sequences of arcs are readily computed by multiplying all the arc probabilities together, and the probability of having emitted a symbol at each arc, given that the arc was selected, can also be multiplied in at each symbol. So there is a net probability that the string was generated by any given parse, and we can add these together for all the different possible parses of the string. There are two approaches next, the Viterbi approach is to select the most probable parse and use that. The other approach due to Baum and Welch is to split up the string into fractional strings according to the probability of each parse, and lay down fractional strings over the diagram. If one imagines a diagram such as the above figure, and a string with symbols as beads on a piece of thread, the symbols have to get allocated to arcs or nodes in a sequence which lays out the thread over the diagram. In the case of the Baum-Welch method, we have lots of copies of the same threaded necklace of symbols, each with a weight attached to it, the sum of the weights being one. Each of them gets layed out on the diagram, making for lots of fractional threads everywhere in the Baum-Welch case, and just one in the Viterbi case. We maybe ought to make the threads we lay down on the diagram a nice striking colour, say scarlet, to contrast with the sober black of the network connections.
This is repeated for each string of the sample. This gives a diagram covered with a lot of scarlet threads. We then proceed to reestimate the probabilities using these threads. To reestimate the probability of travelling from A to C, we look at all the bits of string we have laid down, and count the total number leaving A. Of this total number (which for the Baum-Welch algorithm may not be an integer) we count the fraction going to C, and this fraction becomes the new probability assigned to the arc from A to C. Similarly for all the other arcs. Given that we go from A to C, we now count up the (possibly fractional) occurrences of each of the symbols that we have laid down along that path. The ratio of each symbol count to the total number of counts gives the probability reestimate for the symbol.
From the initial estimate and the sample of strings, we have a reestimate of the probabilities on the diagram. From the first, possibly ill chosen selection of probabilities, we have obtained a more plausible set which have done a better job of accounting for the actual data. We now do it again with the reestimate as our starting point. We repeat the process until the numbers don't change and claim that this gives a maximum likelihood model for the data given the network topology.
The Baum-Welch algorithm is clearly a particular case of the EM algorithm, which can be used for much more than just mixture modelling. The description given in English may be replaced either with some code or some algebra, but if you write your own code, making up a few little threaded necklaces of strings and working out where to put them will be much the quickest way. People of inferior intellects who have trouble with such arcana as reading and writing in natural languages may prefer algebra or C; such folk may find the above description confusing in which case they may find alternative formulations via the bibliography.
The process of effectively calculating all the probabilities may be organised, and the way of organising it is known as the forward-backward algorithm.