Wolfram Computation Meets Knowledge

Centennial of Markov Chains

On January 23, 1913 of the Julian calendar, Andrey A. Markov presented for the Royal Academy of Sciences in St. Petersburg his analysis of Pushkin’s Eugene Onegin. He found that the sequence of consonants and vowels in the text could be well described as a random sequence, where the likely category of a letter depended only on the category of the previous or previous two letters.

At the time, the Russian Empire was using the Julian calendar. The 100th anniversary of the celebrated presentation is actually February 5, 2013, in the now used Gregorian calendar.

WolframAlpha["(Julian calendar January 23, 1913) plus 100 years",  {{"CorrespondingGregorianDate", 1}, "ComputableData"}]

Tuesday, February 5, 2013

To perform his analysis, Markov invented what are now known as “Markov chains,” which can be represented as probabilistic state diagrams where the transitions between states are labeled with the probabilities of their occurrences.

Andrey A. Markov and a Markov Chain

Alice in Wonderland: A Case Study

Here we repeat the analysis that Markov applied to Pushkin’s text on Alice’s Adventures in Wonderland, by Lewis Carroll. To this end, let’s define a function computing frequencies of elements of a list, returning results as rules.

Frequencies[data_] :=   Block[{ta = Tally[data]}, ta[[All, 2]] /= Length[data]; Rule @@@ ta]

First, extract words from the text, making them lowercase:

Extract words from the text and make them lowercase

{i, down, the, rabbit, hole, alice, <<9957>>, a, wonderful, dream, it, had, been}

Split the text into a sequence of letters:

Short[AiWletters = Flatten[Characters[AiWwords]]]

{1, d, o, w, n, t, h, e, r, a, <<39227>>, m, i, t, h, a, d, b, e, e, n}

Then classify them as vowels or consonants:

Classifying vowels or consonants

(vcseq = VowelOrConsonant /@ AiWletters) // Short

{vowel, consonant, vowel, consonant, <<39239>>, consonant, vowel, vowel, consonant}

And compute the frequencies of vowels and consonants in the text:

vcfreq1 = Frequencies[vcseq] // N

Frequencies of vowels and consonants

Therefore, if we treat the text as a random sequence of either a vowel or a consonant, the probability of a vowel turning up is p = 0.3866.

Following Markov, let’s look at the frequencies of pairs of consecutive symbols:

vcfreq2 = Frequencies[Partition[vcseq, 2, 1]] // N

Frequencies of pairs of consecutive symbols

Then we find the probabilities of a vowel or a consonant given by what precedes it:

Find the probabilities of a vowel or a consonant given by what precedes it

Probabilities of a vowel or a consonant given by what precedes it

In his paper, Markov observed that the sequence of vowels and consonants agreed much better with the model where the probability of a vowel depended on the preceding characters than with the model where it did not. Moreover, he found that the model where the probability depended on the two preceding characters agreed yet better.

Empowered with Mathematica, we can continue this investigation. Markov found that pairs of consecutive vowel-consonants carry more information than the sequence of vowel-consonants itself. One measure of the information stored in the data is the Entropy. The greater the entropy, the more information the data contains. Let’s compute it for the sequences of k-tuples of vowel-consonants (known as k-grams) for different values of k.

Table[{k, Entropy[Partition[vcseq, k, 1]] // N}, {k, 1,     25}] // ListPlot

Entropy plot

The plot confirms Markov’s findings. Curiously, it also shows that 25-grams carry little more information that 20-grams.

The probabilistic 2-gram model describing the sequence is now known as a Markov chain process.

The Markov chain process describes the evolution of a probability distribution πn on a state space at step n. Assuming that the state space is finite (in this example it consists of two elements {“vowel”,”consonant”}), the probability distribution can be thought of as a vector, and the conditional transition probabilities can be arranged in a matrix P:

Probability distribution

In the case at hand, the transition matrix is:

MatrixForm[  tm = Outer[List, {"vowel", "consonant"}, {"vowel", "consonant"}] /.     condprobs]

Transition matrix

Assuming the initial state is a vowel (encoded as 1), the 2-gram model is defined in Mathematica as follows:

mproc = DiscreteMarkovProcess[1, tm]

DiscreteMarkovProcess[1, {{0.15666, 0.84334}, {0.531508, 0.468492}}]

With it, we can ask about the distribution of the distance between vowels in the text and compare the result with the data:

Mean[FirstPassageTimeDistribution[mproc, 1]]

2.58669

Mean[Differences[Pick[Range[Length[vcseq]], vcseq, "vowel"]]] // N

2.58667

Thus the Markov model accurately predicts that the average distance between vowels in the text is about 2.586 characters. The distributions of distances predicted by the model also agree well with those actually found in the text:

Using DiscretePlot

Plot of the first passage time

Building a histogram

Histogram

Text Generation

The text of Alice in Wonderland only uses 1,484 unique words:

DeleteDuplicates[AiWwords] // Length

1484

Repeating the same analysis with words, rather than vowel-consonants, we find that 4-grams carry essential information:

Building an entropy table

Entropy table

With the 4-gram model, the frequency of {w1, w2, w3, w4} encodes the probability of w4, given the three most recent words w1 w2 w3.

(gram4data =     Flatten[Frequencies /@ GatherBy[Partition[AiWwords, 4, 1], Most]] //      N) // Short[#, 3] &

Probability of recent words

We encode transitions {w1, w2, w3 } → {w2, w3, w4} in a directed graph, associating each edge with the “Probability” property, giving the conditional transition probability.

Building a directed graph

Directed graph

Assuming the initial probability vector given by consecutive word pair frequencies defines the discrete Markov chain process:

initialProbabilityVector =    SparseArray[    VertexList[gram4gr] /. Frequencies[Partition[AiWwords, 3, 1]]];

transitionMat =   WeightedAdjacencyMatrix[gram4gr, EdgeWeight -> "Probability"]

SparseArray[<9728>, {9036, 9036}]

mc = DiscreteMarkovProcess[initialProbabilityVector // N,    transitionMat]

DiscreteMarkovProcess[SparseArray[<9036>, {9036}], SparseArray[<9729>, {9036, 9036}]]

Now we define a function, assembling a sequence of k-grams that resulted from walking the graph into a text.

GramListToText[list_] :=   StringJoin[Riffle[Join[First[list], Rest[list][[All, -1]]], " "]]

We can now simulate the resulting Markov chain to generate a random 100-word text:

GramListToText[  Extract[VertexList[gram4gr],    Transpose[RandomFunction[mc, {0, 100}]["States"]]]]

tell you said alice she had grown to her full size by this time alice had found her way into a tidy little room with a table in the window and on it a fan and two or three pairs of tiny white kid gloves and the fan and the pair of white kid gloves in one hand and a piece of bread and butter just at this moment alice felt a very curious sensation she was beginning to get very tired of swimming about here o mouse the mouse looked at her rather inquisitively and seemed to quiver all over with fright

Speak[%]

Linguistic Analysis

The graph associated with 4-grams has visibly long linear subgraphs, seen as long threads of vertexes. Words occurring along these threads will always appear in combination in the randomly generated text. It is interesting to examine length of such unchanged sequences.

These long subgraphs can be singled out by removing from the original graph all the vertexes that share more than two edges.

lines = VertexDelete[gram4gr,    VertexList[gram4gr, x_ /; VertexDegree[gram4gr, x] > 2]]

Vertexes that share more than two edges

We use WeaklyConnectedComponents to extract vertexes of these lines. After sorting them in the order in which they appeared in the text, we recreate the longest six such sequences. Each is a passage from the text (minus punctuation) in which every four-word sequence occurs uniquely in the text:

lineData = WeaklyConnectedComponents[lines];

Sorting the vertexes in the order in which they appeared in the text

Longest six sequences

As you can see, the six sequences are actually quite long. In fact, they are exceptionally long. A median length of such fragments is eight words. One could continue this analysis on other pieces of literature and compare the results, but we’ll leave that for another time.

Finite Markov processes in Mathematica can be used to solve a wide range of applied and theoretical problems. There are many examples at the Wolfram Demonstrations Project to help you celebrate 100 years of Markov chains.

Download this post as a Computable Document Format (CDF) file.

Comments

Join the discussion

!Please enter your comment (at least 5 characters).

!Please enter your name.

!Please enter a valid email address.

10 comments

  1. Nice, though it would be harder to replicate Markov’s results with original Pushkin’s text as Mathematca isn’t able to ToLowerCase Cyrillic strings (except with Shifrin’s java workaround).

    Reply
  2. Brilliant!
    I am compelled to try this myself. Thank goodness I have Mathematica to help. I’m curious how far Wolfram Alfa will go on this endeavor.

    Reply
  3. I have written In[2] and In[3] into my notebook and hit shift enter. instead of seeing Out[3] I am getting an error message.
    StringSplit::strse: String or list of strings expected at position 1 in StringSplit[ExamplData[{ } , AliceInWonderland Text],RegularExpression[[\W_]+]]. >>

    Can somebody let me know what I’m doing wrong. Thanks.

    Reply
  4. Also, I am using version 8.

    Reply