Wolfram Blog

Centennial of Markov Chains

February 4, 2013 — Oleksandr Pavlyk, Kernel Technology

On January 23, 1913 of the Julian calendar, Andrey A. Markov presented to the Royal Academy of Sciences in St. Petersburg his analysis of Pushkin’s Eugene Onegin. He found that the sequence of consonants and vowels in the text could be well described as a random sequence in which the probable category of a letter depends only on the category of the preceding one or two letters.

At the time, the Russian Empire was using the Julian calendar. The 100th anniversary of the celebrated presentation is thus February 5, 2013, in the Gregorian calendar used today.

WolframAlpha["(Julian calendar January 23, 1913) plus 100 years",  {{"CorrespondingGregorianDate", 1}, "ComputableData"}]

Tuesday, February 5, 2013

To perform his analysis, Markov invented what are now known as “Markov chains,” which can be represented as probabilistic state diagrams where the transitions between states are labeled with the probabilities of their occurrences.

Andrey A. Markov and a Markov Chain

Alice in Wonderland: A Case Study

Here we repeat, on Lewis Carroll’s Alice’s Adventures in Wonderland, the analysis that Markov applied to Pushkin’s text. To this end, let’s define a function that computes the frequencies of the elements of a list, returning the results as rules.

Frequencies[data_] := Block[{ta = Tally[data]}, ta[[All, 2]] /= Length[data]; Rule @@@ ta]
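For example:

Frequencies[{"a", "b", "a"}]

{"a" -> 2/3, "b" -> 1/3}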

First, extract words from the text, making them lowercase:

Extract words from the text and make them lowercase
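A minimal sketch of this input, consistent with the code quoted in the comments below (assuming the built-in ExampleData copy of the text):

Short[AiWwords = ToLowerCase[StringSplit[ExampleData[{"Text", "AliceInWonderland"}], RegularExpression["[\\W_]+"]]]]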

{i, down, the, rabbit, hole, alice, <<9957>>, a, wonderful, dream, it, had, been}

Split the text into a sequence of letters:

Short[AiWletters = Flatten[Characters[AiWwords]]]

{i, d, o, w, n, t, h, e, r, a, <<39227>>, m, i, t, h, a, d, b, e, e, n}

Then classify them as vowels or consonants:

Classifying vowels or consonants
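A minimal sketch of such a classifier, assuming the five letters a, e, i, o, u count as vowels and everything else (including "y") as a consonant:

VowelOrConsonant[c_String] := If[MemberQ[Characters["aeiou"], c], "vowel", "consonant"]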

(vcseq = VowelOrConsonant /@ AiWletters) // Short

{vowel, consonant, vowel, consonant, <<39239>>, consonant, vowel, vowel, consonant}

And compute the frequencies of vowels and consonants in the text:

vcfreq1 = Frequencies[vcseq] // N

Frequencies of vowels and consonants

Therefore, if we treat the text as a random sequence in which each letter is independently either a vowel or a consonant, the probability of a vowel is p = 0.3866.

Following Markov, let’s look at the frequencies of pairs of consecutive symbols:

vcfreq2 = Frequencies[Partition[vcseq, 2, 1]] // N

Frequencies of pairs of consecutive symbols

Then we find the probabilities of a vowel or a consonant given what precedes it:

Find the probabilities of a vowel or a consonant given what precedes it

Probabilities of a vowel or a consonant given what precedes it
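A sketch of this computation, mirroring the 4-gram construction used later in the post (the name condprobs is what the transition-matrix input below expects):

(* group consecutive pairs by their first element, so each rule {prev, next} -> p estimates P(next | prev) *)
condprobs = Flatten[Frequencies /@ GatherBy[Partition[vcseq, 2, 1], First]] // N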

In his paper, Markov observed that the sequence of vowels and consonants agreed much better with a model in which the probability of a vowel depends on the preceding character than with one in which it does not. Moreover, he found that a model in which the probability depends on the two preceding characters agreed better still.

Empowered with Mathematica, we can continue this investigation. Markov found that pairs of consecutive vowel/consonant labels carry more information than the individual labels themselves. One measure of the information stored in the data is the entropy, computed with Entropy: the greater the entropy, the more information the data contains. Let’s compute it for the sequences of k-tuples of labels (known as k-grams) for different values of k.

Table[{k, Entropy[Partition[vcseq, k, 1]] // N}, {k, 1, 25}] // ListPlot

Entropy plot

The plot confirms Markov’s findings. Curiously, it also shows that 25-grams carry little more information than 20-grams.

The probabilistic 2-gram model describing the sequence is now known as a Markov chain process.

The Markov chain process describes the evolution of a probability distribution πₙ on a state space at step n. Assuming that the state space is finite (in this example it consists of the two elements {"vowel", "consonant"}), the probability distribution can be thought of as a vector, and the conditional transition probabilities can be arranged in a matrix P:

πₙ₊₁ = πₙ P,  where Pᵢⱼ = Prob(state j at step n + 1 | state i at step n)

In the case at hand, the transition matrix is:

MatrixForm[tm = Outer[List, {"vowel", "consonant"}, {"vowel", "consonant"}] /. condprobs]

Transition matrix

Assuming the initial state is a vowel (encoded as 1), the 2-gram model is defined in Mathematica as follows:

mproc = DiscreteMarkovProcess[1, tm]

DiscreteMarkovProcess[1, {{0.15666, 0.84334}, {0.531508, 0.468492}}]
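As a quick consistency check (not part of the original analysis), the stationary distribution of this chain reproduces the overall vowel frequency p found above:

PDF[StationaryDistribution[mproc], 1] (* long-run probability of state 1 ("vowel"); ≈ 0.3866 *)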

With it, we can ask about the distribution of the distance between vowels in the text and compare the result with the data:

Mean[FirstPassageTimeDistribution[mproc, 1]]

2.58669

Mean[Differences[Pick[Range[Length[vcseq]], vcseq, "vowel"]]] // N

2.58667

Thus the Markov model accurately predicts that the average distance between vowels in the text is about 2.586 characters. The distributions of distances predicted by the model also agree well with those actually found in the text:

Using DiscretePlot

Plot of the first passage time
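A minimal sketch of the model side of this comparison:

DiscretePlot[PDF[FirstPassageTimeDistribution[mproc, 1], k], {k, 1, 12}]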

Building a histogram

Histogram
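And a minimal sketch of the corresponding histogram of vowel-to-vowel distances measured directly in the text:

Histogram[Differences[Pick[Range[Length[vcseq]], vcseq, "vowel"]], {1}, "PDF"]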

Text Generation

The text of Alice in Wonderland only uses 1,484 unique words:

DeleteDuplicates[AiWwords] // Length

1484

Repeating the same analysis with words rather than vowel/consonant labels, we find that 4-grams already carry the essential information:

Building an entropy table

Entropy table
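A sketch of such a table, mirroring the letter-level entropy computation above (the upper limit of k = 6 is an arbitrary choice here):

Table[{k, Entropy[Partition[AiWwords, k, 1]] // N}, {k, 1, 6}] // TableForm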

With the 4-gram model, the frequency of {w1, w2, w3, w4} encodes the probability of w4 given the three most recent words w1, w2, w3.

(gram4data = Flatten[Frequencies /@ GatherBy[Partition[AiWwords, 4, 1], Most]] // N) // Short[#, 3] &

Probability of recent words

We encode the transitions {w1, w2, w3} → {w2, w3, w4} in a directed graph, associating with each edge a “Probability” property that gives the conditional transition probability.

Building a directed graph
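A minimal sketch of one such construction, reading the edges and their probabilities off gram4data (the name gram4gr is what the inputs below expect):

gram4gr = Graph[
   (* each 4-gram rule {w1, w2, w3, w4} -> p becomes an edge {w1, w2, w3} -> {w2, w3, w4} carrying p *)
   Property[DirectedEdge[Most[#[[1]]], Rest[#[[1]]]], "Probability" -> #[[2]]] & /@ gram4data];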

Directed graph

The initial probability vector, given by the frequencies of consecutive word triples, together with the transition matrix defines the discrete Markov process:

initialProbabilityVector = SparseArray[VertexList[gram4gr] /. Frequencies[Partition[AiWwords, 3, 1]]];

transitionMat = WeightedAdjacencyMatrix[gram4gr, EdgeWeight -> "Probability"]

SparseArray[<9728>, {9036, 9036}]

mc = DiscreteMarkovProcess[initialProbabilityVector // N, transitionMat]

DiscreteMarkovProcess[SparseArray[<9036>, {9036}], SparseArray[<9729>, {9036, 9036}]]

Now we define a function that assembles a sequence of k-grams, produced by a walk on the graph, into text.

GramListToText[list_] := StringJoin[Riffle[Join[First[list], Rest[list][[All, -1]]], " "]]
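For example:

GramListToText[{{"down", "the", "rabbit"}, {"the", "rabbit", "hole"}}]

"down the rabbit hole"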

We can now simulate the resulting Markov chain to generate a random 100-word text:

GramListToText[Extract[VertexList[gram4gr], Transpose[RandomFunction[mc, {0, 100}]["States"]]]]

tell you said alice she had grown to her full size by this time alice had found her way into a tidy little room with a table in the window and on it a fan and two or three pairs of tiny white kid gloves and the fan and the pair of white kid gloves in one hand and a piece of bread and butter just at this moment alice felt a very curious sensation she was beginning to get very tired of swimming about here o mouse the mouse looked at her rather inquisitively and seemed to quiver all over with fright

Speak[%]


Linguistic Analysis

The graph associated with 4-grams has visibly long linear subgraphs, seen as long threads of vertexes. Words occurring along these threads will always appear in the same combination in the randomly generated text. It is interesting to examine the lengths of such unchanged sequences.

These long subgraphs can be singled out by removing from the original graph all the vertexes with more than two incident edges.

lines = VertexDelete[gram4gr, VertexList[gram4gr, x_ /; VertexDegree[gram4gr, x] > 2]]

Vertexes that share more than two edges

We use WeaklyConnectedComponents to extract the vertexes of these lines. After sorting them in the order in which they appear in the text, we recreate the six longest such sequences. Each is a passage from the text (minus punctuation) in which every four-word sequence occurs only once:

lineData = WeaklyConnectedComponents[lines];

Sorting the vertexes in the order in which they appeared in the text
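A sketch of these steps (the helper names triples, order, and sortedLines are hypothetical):

triples = Partition[AiWwords, 3, 1];
(* map each 3-gram to the index of its first occurrence in the text; with duplicate rules, the first match wins *)
order = Dispatch[MapIndexed[#1 -> First[#2] &, triples]];
(* sort the vertexes of each component by their position in the text *)
sortedLines = Map[SortBy[#, Function[v, v /. order]] &, lineData];
(* reassemble the six longest components into text *)
Column[GramListToText /@ Take[Reverse[SortBy[sortedLines, Length]], 6]]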

Longest six sequences

As you can see, the six sequences are actually quite long; in fact, they are exceptionally long, since the median length of such fragments is only eight words. One could continue this analysis on other pieces of literature and compare the results, but we’ll leave that for another time.

Finite Markov processes in Mathematica can be used to solve a wide range of applied and theoretical problems. There are many examples at the Wolfram Demonstrations Project to help you celebrate 100 years of Markov chains.

Download this post as a Computable Document Format (CDF) file.


10 Comments


Alex Isakov

Nice, though it would be harder to replicate Markov’s results with Pushkin’s original text, as Mathematica isn’t able to ToLowerCase Cyrillic strings (except with Shifrin’s Java workaround).

Posted by Alex Isakov    February 5, 2013 at 2:29 am
John Barton

Brilliant!
I am compelled to try this myself. Thank goodness I have Mathematica to help. I’m curious how far Wolfram|Alpha will go on this endeavor.

Posted by John Barton    February 6, 2013 at 11:33 am
Brian Beckman

Curious why In[5] has an optional pattern for an empty string (my test of Select[AiWletters, # === "" &] turned up no instances), and also curious why Length[Out[6]] is different from Length[Out[4]] (i.o.w., why there should be a different number of “vowel”/“consonant” labels than input characters). The differences are not statistically significant, but I haven’t been able to track down the stragglers.

Posted by Brian Beckman    February 10, 2013 at 9:20 am
    Oleksandr Pavlyk

    The presence of “” in the test is not needed, as you correctly pointed out. Length[vcseq] is equal to Length[AiWletters]. The numbers displayed in the elided form of Out[4] and Out[6] indicate the number of terms hidden. In Out[4], 20 characters are showing, with 39,227 hidden, and in Out[6], 8 labels are showing, with 39,239 hidden. The total numbers of elements are equal.

    Posted by Oleksandr Pavlyk    February 10, 2013 at 9:32 am
      Brian Beckman

      Thanks! That clears that up!

      I noticed a little later, in the word analysis, via SortBy[gram4data, #[[2]] &] // Take[#, 30] &, that some punctuation was left in. For instance, from position 19 down, some 4-grams are as follows:
      {"said", "the", "king", "when"} -> 0.0769231,
      {"said", "the", "king", "you"} -> 0.0769231,
      {"i", "don", "t", "care"} -> 0.111111,
      {"i", "don", "t", "keep"} -> 0.111111,
      creating some “false” words like “don”. A tweaked rule like the following gets rid of apostrophes:
      Short[
       AiWwords =
        ToLowerCase[
         StringSplit[
          StringReplace[
           ExampleData[{"Text", "AliceInWonderland"}],
           "'" -> ""],
          RegularExpression["[\\W_]+"]]]]

      Posted by Brian Beckman    February 10, 2013 at 10:15 am
Brian Beckman

By the way, I just wanted to say that this is a magnificent blog. I had done a similar treatment some time ago, building the text-generating random process by hand (coincidentally using Alice as my source :). It’s much shorter and clearer (and probably faster) using the new built-ins!

Posted by Brian Beckman    February 10, 2013 at 10:43 am
Michael Stern

I will admit, the very first thing I did with Mathematica 9 was test out the new Markov process capabilities with analysis of, then random generation of stylistic matches for, the songs of Bob Dylan and Shakespeare’s sonnets.

Posted by Michael Stern    February 15, 2013 at 12:09 pm
Alex W

I have written In[2] and In[3] into my notebook and hit Shift+Enter. Instead of seeing Out[3], I am getting an error message:
StringSplit::strse: String or list of strings expected at position 1 in StringSplit[ExamplData[{Text, AliceInWonderland}], RegularExpression[[\W_]+]]. >>

Can somebody let me know what I’m doing wrong? Thanks.

Posted by Alex W    April 4, 2013 at 5:23 pm
Alex W

Also, I am using version 8.

Posted by Alex W    April 4, 2013 at 5:25 pm

