Wolfram Blog
Jofre Espigule-Pons

Analyzing Shakespeare’s Texts on the 400th Anniversary of His Death

April 21, 2016 — Jofre Espigule-Pons, Consultant, Technical Communications and Strategy Group

Putting some color in Shakespeare’s tragedies with the Wolfram Language

Four hundred years on, Shakespeare’s works remain deeply embedded in our culture. He mastered the English language as no one had before, and he had a profound understanding of human emotions.

Have you ever explored Shakespeare’s texts from the perspective of a data scientist? Wolfram technologies can provide you with new insights into the semantics and statistical analysis of Shakespeare’s plays and the social networks of their characters.

William Shakespeare (baptized April 26, 1564; died April 23, 1616) is considered by many to be the greatest writer in the English language. He wrote 154 sonnets, 38 plays (divided into three main groups: comedy, history, and tragedy), and 4 long narrative poems.

Shakespeare's works

I will start by creating a word cloud with WordCloud from one of his famous tragedies, Romeo and Juliet. You can achieve this with just a couple of lines of Wolfram Language code.

First, you need to get the text. One possibility is to import the public-domain HTML versions of the complete works of William Shakespeare from this MIT site:

Importing text for Romeo and Juliet

Then make a word cloud from the text, deleting common stopwords like “and” and “the”:

Romeo and Juliet word cloud
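The code screenshot is not reproduced in this text version, so here is only a minimal sketch of the idea, using a short quotation in place of the imported play. The names sample and words are illustrative and are not taken from the attached notebook:

```wolfram
(* Stand-in text; in the post, the full play is imported from the web *)
sample = "But soft, what light through yonder window breaks? It is the east, and Juliet is the sun.";

(* Split into lowercase words, then drop common English stopwords *)
words = DeleteStopwords[ToLowerCase[TextWords[sample]]];

(* Word size reflects how often each word occurs *)
WordCloud[words]
```

With a text this short, every remaining word appears about once, so the cloud only becomes interesting once the whole play is used.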

As you can see, DeleteStopwords does not delete all the Elizabethan stopwords like “thou,” “thee,” “thy,” “hath,” etc. But I can delete them manually with StringDelete. And with some minor extra effort, you can improve the word cloud’s style a great deal:

Improving the style of a word cloud
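One way to do this manual cleanup might look like the sketch below. The stopword list, the sample sentence, and the styling options are my assumptions for illustration; the notebook's actual choices may differ:

```wolfram
(* Stand-in text containing Elizabethan stopwords *)
sample = "Wherefore art thou Romeo? Deny thy father and refuse thy name.";

(* Hypothetical list of extra stopwords to remove by hand *)
elizabethan = {"thou", "thee", "thy", "hath"};

(* Delete each listed word only where it stands as a whole word *)
cleaned = StringDelete[ToLowerCase[sample],
   WordBoundary ~~ (Alternatives @@ elizabethan) ~~ WordBoundary];

(* Restyle the cloud with a named color gradient and a fixed size *)
WordCloud[DeleteStopwords[TextWords[cleaned]],
 ColorFunction -> "SolarColors", ImageSize -> 400]
```

The WordBoundary tests keep longer words such as "thyself" intact while removing the standalone "thy".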

Now let’s analyze a tragedy more deeply. Wolfram|Alpha already offers a lot of computed data about Shakespeare’s plays. For example, if you type “Othello” as Wolfram|Alpha input, you will get the following result:

Information on Othello in Wolfram|Alpha

If you want to visualize the interactions among the characters of this tragedy via a social network, you can achieve this with ease using the Wolfram Language. As I did earlier with the word cloud, I need to first import the texts. In this case I want to work with all the acts and scenes from Othello separately:

Separating the acts and scenes in Othello

Since I want to import and save the scenes for later use in the same notebook’s folder, I can do the following:

Saving the scenes for later use in the same notebook's folder

In order to create the Graph, I first need all the character names, which will be displayed as vertices. I can gather the names by noticing that each dialog line is preceded by a character name in bold, which in HTML is written like this: <b>Name</b>. Thus it is straightforward to get an ordered list containing all character names (“speakers”) from each dialog line using StringCases:

Using StringCases to get a list of character names from each dialog line
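As a rough sketch of this step, assuming a heavily simplified HTML fragment (the real pages carry more markup around each speech), the extraction could be written as follows. The variable names sceneHTML and speakers are illustrative:

```wolfram
(* Simplified, abridged stand-in for one scene's HTML *)
sceneHTML = "<b>RODERIGO</b><blockquote>Tush! never tell me;</blockquote>" <>
   "<b>IAGO</b><blockquote>'Sblood, but you will not hear me:</blockquote>" <>
   "<b>RODERIGO</b><blockquote>Thou told'st me thou didst hold him in thy hate.</blockquote>";

(* Capture the shortest run of characters between <b> and </b> *)
speakers = StringCases[sceneHTML, "<b>" ~~ Shortest[name__] ~~ "</b>" :> name]
(* {"RODERIGO", "IAGO", "RODERIGO"} *)
```

Shortest matters here: without it, the pattern would greedily match from the first <b> to the last </b>.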

Then, using Union and Flatten, I can obtain the names of all the characters in the tragedy of Othello:

Using Union and Flatten to obtain the character names in Othello
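A small sketch of this step, with made-up per-scene speaker lists standing in for the real ones (the name speakersByScene is hypothetical):

```wolfram
(* One list of speakers per scene, in speaking order *)
speakersByScene = {
   {"RODERIGO", "IAGO", "RODERIGO", "BRABANTIO"},
   {"OTHELLO", "IAGO", "CASSIO"}};

(* Flatten merges the scenes; Union sorts and removes repeats *)
characters = Union[Flatten[speakersByScene]]
(* {"BRABANTIO", "CASSIO", "IAGO", "OTHELLO", "RODERIGO"} *)
```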

Once I have the vertices, I need to create the edges of the graph. In this case, an edge between two vertices represents a connection between two characters whose lines are fewer than two lines apart within the dialog (similar to the Demonstration by Seth Chandler that analyzes the networks in Shakespeare’s plays). For that purpose, I will use SequenceCases to create all the edges, i.e. all pairs of speakers whose lines are fewer than two lines apart:

Using SequenceCases to create all the edges
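Here is one possible reading of that rule as a sketch. I am assuming it means pairs of speakers on adjacent lines plus pairs with exactly one line in between; the notebook may define the window differently, and the names adjacent and oneApart are mine:

```wolfram
(* Speakers in the order their lines occur *)
speakers = {"RODERIGO", "IAGO", "RODERIGO", "BRABANTIO"};

(* Overlaps -> True slides the window one position at a time *)
adjacent = SequenceCases[speakers,
   {a_, b_} :> UndirectedEdge[a, b], Overlaps -> True];

(* Same idea, skipping exactly one intervening line *)
oneApart = SequenceCases[speakers,
   {a_, _, b_} :> UndirectedEdge[a, b], Overlaps -> True];

edges = Join[adjacent, oneApart]
```

At this point the list still contains repeated pairs and self-connections such as RODERIGO with RODERIGO.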

Before creating the graph, I need to delete the edges that are duplicated or are equivalent, like OTHELLO↔IAGO and IAGO↔OTHELLO, and the edges connecting to themselves, i.e. IAGO↔IAGO:

Deleting duplicate edges or equivalents
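A minimal sketch of this cleanup on a hand-written edge list (the names noLoops and unique are illustrative, and this is not necessarily how the notebook does it):

```wolfram
edges = {
   UndirectedEdge["OTHELLO", "IAGO"],
   UndirectedEdge["IAGO", "OTHELLO"],
   UndirectedEdge["IAGO", "IAGO"],
   UndirectedEdge["IAGO", "CASSIO"]};

(* Drop self-loops: the same pattern variable on both ends *)
noLoops = DeleteCases[edges, UndirectedEdge[x_, x_]];

(* Treat a-b and b-a as the same edge by comparing sorted endpoints *)
unique = DeleteDuplicatesBy[noLoops, Sort];

Graph[unique, VertexLabels -> "Name"]
```

DeleteDuplicatesBy keeps the first occurrence of each pair, so unique here holds the OTHELLO and IAGO edge followed by the IAGO and CASSIO edge.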

Finally, you can specify the size of the vertices with the VertexSize option. For example, I want the vertices’ sizes to be proportional to the number of lines per character. I can get the number of lines per character with Counts:

Lines per character using Counts

After this, I can use a logarithmic function to rescale the vertices to a reasonable size. I will also improve the design with VertexStyle and VertexLabels.
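To give a feel for the approach, here is a simplified sketch on toy data. The scaling constants 0.2 and 0.15, the color, and the variable names are arbitrary assumptions of mine, not values from the notebook:

```wolfram
(* One entry per dialog line, naming its speaker *)
speakers = {"IAGO", "OTHELLO", "IAGO", "CASSIO", "IAGO", "OTHELLO"};

(* Number of lines per character, as an association *)
lineCounts = Counts[speakers];

(* Logarithmic rescaling keeps the busiest characters from dominating *)
sizes = Map[0.2 + 0.15 Log[#] &, lineCounts];

Graph[Keys[lineCounts],
 {UndirectedEdge["IAGO", "OTHELLO"], UndirectedEdge["IAGO", "CASSIO"]},
 VertexSize -> Normal[sizes],
 VertexLabels -> "Name",
 VertexStyle -> LightBlue]
```

A character with a single line gets the base size, since Log[1] is 0.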

Since the code is getting more cumbersome, I will omit it and show only the result (for those interested in the details of the code, you can find them in the attached notebook). Also note that in the final result I’m excluding the vertex “All” since it is not a real character in the dialog:

Interactions among characters in Othello

So far, so good. Having the social network from a Shakespeare play written more than four hundred years ago is quite cool, but I’m still not 100% satisfied. I would like to visualize when these interactions occur within the dialog itself. One way to achieve this is by representing each main speaker with a different-colored bar:

Representing each main character with a different-colored bar

Note: linesColor is a list of colors representing the lines in one scene, and linesLength is the list of those lines’ lengths, computed with StringLength and then rescaled. Building both involves some text manipulation, similar to what I did earlier to obtain the character names from the HTML version of the play. If you wish, you can see their construction in the attached notebook:

Play progress grid construction
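For readers without the notebook at hand, the two lists could be built roughly as in the sketch below. The palette, the sample lines (which are not consecutive in the play), and the use of BarChart are my assumptions, so the result will not match the figure above exactly:

```wolfram
(* Toy scene: one speaker and one text per dialog line *)
speakers = {"IAGO", "OTHELLO", "IAGO"};
lines = {"I am not what I am.",
   "Keep up your bright swords, for the dew will rust them.",
   "O, beware, my lord, of jealousy;"};

(* Hypothetical palette; unlisted speakers would fall back to Gray *)
palette = <|"IAGO" -> Red, "OTHELLO" -> Blue|>;
linesColor = Lookup[palette, speakers, Gray];

(* Line lengths as a fraction of the longest line *)
lengths = StringLength /@ lines;
linesLength = N[lengths/Max[lengths]];

(* One bar per line, colored by speaker *)
BarChart[linesLength, ChartStyle -> linesColor, BarSpacing -> 0]
```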

Additionally, I can mark when a particular character says a particular word—for example, the word “love” (note: the variable words is the list of words per line in the scene, created with the new function TextWords; see the attached notebook for details):

Marking when a particular character says a particular word
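The detection part of this step can be sketched as follows, on three sample lines that are not consecutive in the play. The names hasLove and loveLines are hypothetical, and the drawing of the marks themselves is left to the notebook:

```wolfram
lines = {"But I do love thee!",
   "My life upon her faith!",
   "Put money in thy purse."};

(* Lowercase word list for each line *)
words = ToLowerCase[TextWords[#]] & /@ lines;

(* True for every line that contains the exact word "love" *)
hasLove = MemberQ[#, "love"] & /@ words
(* {True, False, False} *)

(* Positions of the matching lines within the scene *)
loveLines = Flatten[Position[hasLove, True]]
(* {1} *)
```

Because the comparison is on whole words, forms such as "loved" or "loving" would not be marked unless they were added to the test.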

Now I can combine all of this with the social network graph and have a colorful and compact infographic about a Shakespeare tragedy:

Othello social network graph

Dialog lines with the word love

There are so many other interesting things that I would like to explore about Shakespeare’s works and life. But I will finish with a map representing the locations at which his plays occur. I hope you got a glimpse of what is possible to achieve with the Wolfram Language. The only limit is our imagination:

Mapping the locations at which Shakespeare's plays occurred

For a few places, the Interpreter fails to find a GeoPosition, so I used Cases to obtain all the successfully interpreted locations:

Mapping the locations at which Shakespeare's plays occurred
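The filtering step can be sketched like this. To keep the example independent of an internet connection, the interpreted list is written out by hand with approximate coordinates and one failure; in practice it would come from something like Interpreter["Location"] mapped over the place names:

```wolfram
(* Hand-written stand-in for the interpreter results *)
interpreted = {
   GeoPosition[{45.44, 12.33}],                (* roughly Venice *)
   Failure["InterpretationFailure", <||>],     (* a place that was not recognized *)
   GeoPosition[{45.44, 10.99}]};               (* roughly Verona *)

(* Keep only the entries whose head is GeoPosition *)
positions = Cases[interpreted, _GeoPosition]
```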

Finally, I’m using GeoDisk to depict geopositions by disks with a radius proportional to the number of times each location appears in Shakespeare’s plays:

Map of locations where Shakespeare's plays occur
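A compact sketch of the disk construction, assuming a toy list of positions with approximate coordinates and an arbitrary scale of 15 kilometers per occurrence (both are my choices, not the notebook's):

```wolfram
(* One entry per occurrence of a location across the plays *)
positions = {
   GeoPosition[{45.44, 12.33}], GeoPosition[{45.44, 10.99}],
   GeoPosition[{45.44, 12.33}], GeoPosition[{45.44, 12.33}]};

(* How often each location occurs *)
counts = Counts[positions];

(* One disk per location, radius proportional to its count *)
disks = KeyValueMap[
   GeoDisk[#1, Quantity[15 #2, "Kilometers"]] &, counts];

GeoGraphics[{Opacity[0.5], Red, disks}]
```

Drawing the map itself needs access to Wolfram's map tiles, so the last line requires an internet connection.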

Many fellow Wolfram users have shown keen interest in analyzing Shakespeare’s corpus and have shared astonishing approaches on Wolfram Community. We hope this post will inspire you to join that collaborative effort to explore the mysteries of Shakespeare data.

Download this post as a Computable Document Format (CDF) file.


Jesse Friedman

This is excellent work! I’ve done some analysis of Romeo and Juliet myself using Jon Bosak’s excellent Shakespeare 2.00 dataset (http://xml.coverpages.org/bosakShakespeare200.html) It has all of the Bard’s plays marked up in XML, with each line separated and the speaker identified. It makes computational analysis much easier. I’d strongly suggest anyone interested in text analysis check it out and play around with it.

Posted by Jesse Friedman    April 21, 2016 at 3:13 pm
Michael Stern

That’s utterly charming.

Posted by Michael Stern    April 26, 2016 at 8:53 am

This is Literally techsavy!

Posted by Paul    May 9, 2016 at 1:52 pm
