Analyzing Shakespeare’s Texts on the 400th Anniversary of His Death
Putting some color in Shakespeare’s tragedies with the Wolfram Language
After four hundred years, Shakespeare’s works are still highly present in our culture. He mastered the English language as never before, and he deeply understood the emotions of the human mind.
Have you ever explored Shakespeare’s texts from the perspective of a data scientist? Wolfram technologies can provide you with new insights into the semantics and statistical analysis of Shakespeare’s plays and the social networks of their characters.
William Shakespeare (baptized April 26, 1564; died April 23, 1616) is considered by many to be the greatest writer in the English language. He wrote 154 sonnets, 38 plays (divided into three main groups: comedies, histories, and tragedies), and 4 long narrative poems.
I will start by creating a nice WordCloud from one of his famous tragedies, Romeo and Juliet. You can achieve this with just a couple of lines of Wolfram Language code.
First, you need to get the text. One possibility is to import the public-domain HTML versions of the complete works of William Shakespeare from this MIT site:
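A minimal sketch of this import step follows. The exact URL and the variable name romeoJuliet are my assumptions for illustration; check the address against the MIT site before relying on it.

```wolfram
(* Assumed address of the full play on the MIT Shakespeare site *)
url = "http://shakespeare.mit.edu/romeo_juliet/full.html";

(* Importing an HTML page as "Plaintext" returns one string with the markup stripped *)
romeoJuliet = Import[url, "Plaintext"];

(* Sanity check: look at the first few hundred characters *)
StringTake[romeoJuliet, UpTo[300]]
```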
Then make a word cloud from the text, deleting common stopwords like “and” and “the”:
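As a rough sketch, here is that step on a short stand-in passage (the variable names sample and cleaned are hypothetical; in practice you would pass the imported play text instead):

```wolfram
(* A short stand-in passage; replace with the imported play text *)
sample = "But, soft! what light through yonder window breaks? It is the east, and Juliet is the sun.";

(* DeleteStopwords removes common words such as "and", "the" and "is" *)
cleaned = DeleteStopwords[sample];

(* WordCloud sizes each remaining word by how often it occurs *)
WordCloud[cleaned]
```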
As you can see, DeleteStopwords does not delete all the Elizabethan stopwords like “thou,” “thee,” “thy,” “hath,” etc. But I can delete them manually with StringDelete. And with some minor extra effort, you can improve the word cloud’s style a great deal:
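One way to do the manual cleanup is sketched below. The word list elizabethan and the styling options are illustrative choices of mine, not necessarily the ones used for the image in this post.

```wolfram
(* Stand-in text; replace with the imported play *)
sample = "What hath thou done? Thy love, thy life, I give to thee.";

(* Hypothetical list of archaic words to drop; extend as needed *)
elizabethan = {"thou", "thee", "thy", "hath"};

(* WordBoundary prevents deleting these letters inside longer words such as "thyme" *)
cleaned = StringDelete[sample,
   WordBoundary ~~ (Alternatives @@ elizabethan) ~~ WordBoundary,
   IgnoreCase -> True];

(* A few style options: font, color function and image size *)
WordCloud[DeleteStopwords[cleaned],
 FontFamily -> "Times", ColorFunction -> "SolarColors", ImageSize -> 400]
```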
Now let’s analyze a tragedy more deeply. Wolfram|Alpha already offers a lot of computed data about Shakespeare’s plays. For example, if you type “Othello” as Wolfram|Alpha input, you will get the following result:
If you want to visualize the interactions among the characters of this tragedy via a social network, you can achieve this with ease using the Wolfram Language. As I did earlier with the word cloud, I need to first import the texts. In this case I want to work with all the acts and scenes from Othello separately:
Since I want to import and save the scenes for later use in the same notebook’s folder, I can do the following:
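Here is a hedged sketch of the import-and-save step for the first act only. The base URL and the file-name pattern are assumptions about how the MIT site is organized, and the names base, sceneFiles and scenes are hypothetical.

```wolfram
(* Assumed base URL and file-name pattern; verify them against the site *)
base = "http://shakespeare.mit.edu/othello/";

(* Only the three scenes of Act 1 here; extend the list for the remaining acts *)
sceneFiles = Table["othello.1." <> ToString[s] <> ".html", {s, 3}];

(* "Source" keeps the raw HTML, including the <b> tags needed later *)
scenes = Import[base <> #, "Source"] & /@ sceneFiles;

(* Save each scene next to the current notebook (the notebook must already be saved) *)
MapThread[
 Export[FileNameJoin[{NotebookDirectory[], #1}], #2, "Text"] &,
 {sceneFiles, scenes}]
```

Keeping the raw source rather than the plain text matters here, because the speaker names are recovered from the markup in the next step.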
In order to create the Graph, I first need all the character names, which will be displayed as vertices. I can gather the names by noticing that each dialog line is preceded by a character name in bold, which in HTML is written like this: <b>Name</b>. Thus it is straightforward to get an ordered list containing all character names (“speakers”) from each dialog line using StringCases:
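A small sketch of that extraction, run on a made-up fragment that mimics the markup described above (sceneHTML and speakers are names I chose for illustration):

```wolfram
(* Stand-in for one scene's HTML, with each speech preceded by <b>Name</b> *)
sceneHTML = "<b>RODERIGO</b> Tush! never tell me. <b>IAGO</b> But you will not hear me. <b>RODERIGO</b> Thou told'st me thou didst hold him in thy hate.";

(* Shortest stops each match at the first closing tag instead of the last one *)
speakers = StringCases[sceneHTML, "<b>" ~~ Shortest[name__] ~~ "</b>" :> name]
```

On this fragment the result is the ordered list of speakers: RODERIGO, IAGO, RODERIGO. Mapping the same StringCases call over every scene gives one such list per scene.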
Then, using Union and Flatten, I can obtain the names of all the characters in the tragedy of Othello:
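For instance, with a toy version of the per-scene speaker lists (the sample data is invented; the name lines matches the variable used for the vertex sizes later on):

```wolfram
(* One list of speakers per scene *)
lines = {{"RODERIGO", "IAGO", "RODERIGO"}, {"OTHELLO", "IAGO", "OTHELLO"}};

(* Flatten merges the scenes; Union sorts and removes duplicates *)
characters = Union[Flatten[lines]]
```

This returns the three distinct names in alphabetical order: IAGO, OTHELLO, RODERIGO.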
Once I have the vertices, I need to create the edges of the graph. In this case, an edge between two vertices will represent a connection between two characters whose lines are fewer than two lines apart within the dialog (similar to the Demonstration by Seth Chandler that analyzes the networks in Shakespeare’s plays). For that purpose, I will use SequenceCases to create all the edges, i.e. pairs of speakers whose lines are fewer than two lines apart:
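A minimal sketch of the idea, under my reading that an edge joins every pair of consecutive speakers (the sample list is invented):

```wolfram
(* Ordered speakers of one scene *)
speakers = {"RODERIGO", "IAGO", "RODERIGO", "BRABANTIO"};

(* Overlaps -> True lets every speaker start a new match, so all consecutive pairs are found *)
edges = SequenceCases[speakers,
  {a_, b_} :> UndirectedEdge[a, b], Overlaps -> True]
```

Without Overlaps -> True, each element would be used in at most one match, and the pair IAGO, RODERIGO in the middle would be skipped.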
Before creating the graph, I need to delete the edges that are duplicated or are equivalent, like OTHELLO↔IAGO and IAGO↔OTHELLO, and the edges connecting to themselves, i.e. IAGO↔IAGO:
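One possible way to do this cleanup is sketched here on a few invented edges; the attached notebook may do it differently.

```wolfram
edges = {UndirectedEdge["OTHELLO", "IAGO"], UndirectedEdge["IAGO", "OTHELLO"],
   UndirectedEdge["IAGO", "IAGO"], UndirectedEdge["IAGO", "CASSIO"]};

(* Sort puts the endpoints of each edge in canonical order, so equivalent edges become identical *)
edgesReduced = DeleteDuplicates[Sort /@ edges];

(* Remove self-loops, i.e. edges whose two endpoints are the same *)
edgesReduced = DeleteCases[edgesReduced, UndirectedEdge[x_, x_]];

Graph[edgesReduced, VertexLabels -> "Name"]
```

Only two edges survive in this example: one between IAGO and OTHELLO and one between CASSIO and IAGO.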
Finally, you can specify the size of the vertices with the VertexSize option. For example, I want the vertices’ sizes to be proportional to the number of lines per character. I can get the number of lines per character with Counts:
After this, I can use a logarithmic function to rescale the vertices to a reasonable size. I will also improve the design with VertexStyle and VertexLabels.
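The counting and rescaling can be sketched on toy data as follows. The offset 1.4 is a tuning constant that keeps every logarithm positive; other values work too, and lineCounts is a name I introduce for illustration.

```wolfram
(* One list of speakers per scene, one entry per spoken line *)
lines = {{"OTHELLO", "IAGO", "OTHELLO", "IAGO", "IAGO"}, {"CASSIO", "IAGO"}};

(* Number of lines per character, as an association *)
lineCounts = Counts[Flatten[lines]];

(* Normalize by the largest count, shift, take the logarithm, and convert to a list of rules *)
vertexSizes = Normal[Log[1.4 + lineCounts/Max[lineCounts]]];

Graph[{UndirectedEdge["OTHELLO", "IAGO"], UndirectedEdge["IAGO", "CASSIO"]},
 VertexSize -> vertexSizes, VertexLabels -> "Name"]
```

The logarithm compresses the range, so a character with four times as many lines gets a visibly larger vertex without dwarfing the others.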
Since the code is getting more cumbersome, I will omit it and show only the result (for those interested in the details of the code, you can find them in the attached notebook). Also note that in the final result I’m excluding the vertex “All” since it is not a real character in the dialog:
So far, so good. Having the social network from a Shakespeare play written more than four hundred years ago is quite cool, but I’m still not 100% satisfied. I would like to visualize when these interactions occur within the dialog itself. One way to achieve this is by representing each main speaker with a different-colored bar:
Note: linesColor is a list of colors representing the lines in one scene, and linesLength is the list of the lines’ StringLength with a rescaling function. These functions involve some text manipulation, like I did earlier to obtain the character names from the HTML version of the play. If you wish, you can see their construction in the attached notebook:
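The following is only a rough sketch of how such lists could be built and drawn with BarChart. The sample lines, the palette association and the simple rescaling are my assumptions; the notebook's actual construction may differ.

```wolfram
(* Stand-in data: who speaks each line, and the text of each line *)
speakers = {"OTHELLO", "IAGO", "OTHELLO", "DESDEMONA"};
lineTexts = {"Most potent, grave, and reverend signiors",
   "I am not what I am",
   "She loved me for the dangers I had pass'd",
   "My noble father"};

(* Hypothetical color assignment, one color per main speaker *)
palette = <|"OTHELLO" -> Orange, "IAGO" -> Blue, "DESDEMONA" -> Purple|>;

(* One color and one rescaled length per line *)
linesColor = Lookup[palette, speakers];
linesLength = N[StringLength[lineTexts]/Max[StringLength[lineTexts]]];

(* For a single dataset, the colors in ChartStyle are applied to successive bars *)
BarChart[linesLength, ChartStyle -> linesColor, BarSpacing -> None, Axes -> False]
```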
Additionally, I can mark when a particular character says a particular word—for example, the word “love” (note: the variable words is the list of words per line in the scene, created with the new function TextWords; see the attached notebook for details):
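A small sketch of locating the lines that contain a given word, using invented stand-in lines (sceneLines and loveLines are hypothetical names):

```wolfram
(* Stand-in lines of one scene *)
sceneLines = {"I do love thee", "Farewell, my lord", "My love doth so approve him"};

(* Lowercase first so that "Love" and "love" are treated alike, then split into words *)
words = TextWords /@ ToLowerCase[sceneLines];

(* Indices of the lines in which the word occurs at least once *)
loveLines = Flatten[Position[MemberQ[#, "love"] & /@ words, True]]
```

These indices can then be used to place a marker on the corresponding bars.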
Now I can combine all of this with the social network graph and have a colorful and compact infographic about a Shakespeare tragedy:
There are so many other interesting things that I would like to explore about Shakespeare’s works and life. But I will finish with a map representing the locations at which his plays occur. I hope you got a glimpse of what is possible to achieve with the Wolfram Language. The only limit is our imagination:
For a few places, the Interpreter fails to find a GeoPosition, so I used Cases to obtain all the successfully interpreted locations:
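The filtering step can be sketched like this. The list interpreted is a hand-written stand-in for the interpreter output, with approximate coordinates and $Failed marking a lookup that did not succeed.

```wolfram
(* Stand-in for interpreter results: two positions and one failed lookup *)
interpreted = {GeoPosition[{45.44, 12.33}], $Failed, GeoPosition[{45.44, 10.99}]};

(* Keep only the elements whose head is GeoPosition *)
located = Cases[interpreted, _GeoPosition]
```

Anything that is not a GeoPosition is dropped, whatever form the failure takes.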
Finally, I’m using GeoDisk to depict geopositions by disks with a radius proportional to the number of times each location appears in Shakespeare’s plays:
Many fellow Wolfram users expressed keen interest in and came up with astonishing approaches to Shakespeare’s corpus analysis on Wolfram Community. We hope this blog post will inspire you to join that collaborative effort exploring the mysteries of Shakespeare data.
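A hedged sketch of the drawing step follows. The coordinates are approximate, the scale of 20 kilometers per occurrence is an arbitrary choice of mine, and GeoGraphics needs an internet connection to fetch the background map.

```wolfram
(* Stand-in positions: one location appears twice, another once *)
located = {GeoPosition[{45.44, 12.33}], GeoPosition[{45.44, 12.33}],
   GeoPosition[{45.44, 10.99}]};

(* How many times each position occurs *)
tally = Counts[located];

(* One disk per distinct position, with a radius proportional to its count *)
disks = KeyValueMap[GeoDisk[#1, Quantity[20*#2, "Kilometers"]] &, tally];

GeoGraphics[{Opacity[0.5], Red, disks}]
```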
Download this post as a Computable Document Format (CDF) file.
This is excellent work! I’ve done some analysis of Romeo and Juliet myself using Jon Bosak’s excellent Shakespeare 2.00 dataset (http://xml.coverpages.org/bosakShakespeare200.html). It has all of the Bard’s plays marked up in XML, with each line separated and the speaker identified. It makes computational analysis much easier. I’d strongly suggest anyone interested in text analysis check it out and play around with it.
That’s utterly charming.
This is literally tech-savvy!
Just been referred to this by the Wolfram U X-plorations webinar and after spending ages I still cannot see the code that Ruben requested and you said is in the CDF. Sorry to be a pain but can you explain exactly where or how this code can be seen? Many thanks.
Hello Linda,
It’s about 3/4ths of the way through the CDF. Evaluating line by line may help instead of evaluating the whole notebook. (You can download the file at the end of the blog post.)
– Wolfram Blog Team
Thanks team but this does not help at all. 3/4ths of the way through the CDF, it reads:
Since the code is getting more cumbersome, I will omit it and show only the result (for those interested in the details of the code, you can find them in the attached notebook).
Just as in the blog post above, there is no code visible in the CDF to evaluate line by line, which is why I asked ‘can you explain exactly where or how this code can be seen?’ What am I missing? If the ‘code’ is actually in the CDF, why can neither I nor Ruben actually see it? Sorry, but I am still completely mystified and hope you can help me actually see the code you say is there. Many thanks.
Linda,
You’re correct, the snippet of code was removed as it’s quite large. I’ve posted it below to provide clarity.
vertexSizes =
  Normal[Log[
     1.4 + Counts[Flatten[lines]]/Max[Counts[Flatten[lines]]]]] /.
   {("All" -> _) -> Nothing, ("Herald" -> _) -> Nothing};

Graph[(edgesReduced /. "All" \[UndirectedEdge] _ -> Nothing),
 VertexSize -> vertexSizes,
 VertexLabels -> {
   "BIANCA" ->
    Placed[Style["Bianca", Bold, FontSize -> 16], {2.3, -0.4}],
   "BRABANTIO" ->
    Placed[Style["Brabantio", Bold, FontSize -> 22], {2.1, -0.8}],
   "CASSIO" ->
    Placed[Style["Cassio", Bold, FontSize -> 22], {1, -0.8}],
   "Clown" ->
    Placed[Style["Clown", Bold, FontSize -> 16], {2.3, -0.4}],
   "DESDEMONA" ->
    Placed[Style["Desdemona", Bold, FontSize -> 20], {0.5, -0.5}],
   "EMILIA" ->
    Placed[Style["Emilia", Bold, FontSize -> 22], {2.2, -0.4}],
   "GRATIANO" ->
    Placed[Style["Gratiano", Bold, FontSize -> 16], {2.3, -0.6}],
   "IAGO" ->
    Placed[Style["Iago", Bold, FontSize -> 22], {1.7, 0}],
   "LODOVICO" ->
    Placed[Style["Lodovico", Bold, FontSize -> 22], {1.2, -0.9}],
   "MONTANO" ->
    Placed[Style["Montano", Bold, FontSize -> 16], {1.6, -0.6}],
   "OTHELLO" ->
    Placed[Style["Othello", Bold, FontSize -> 28], {0.6, 1.5}],
   "RODERIGO" ->
    Placed[Style["Roderigo", Bold, FontSize -> 22], {.2, -1}],
   "DUKE OF VENICE" ->
    Placed[Style["Duke of Venice", Bold, FontSize -> 16], Above],
   "Fourth Gentleman" ->
    Placed[Style["Fourth Gentleman", Bold, FontSize -> 14], {-1., 1.7}]
   },
 VertexLabelStyle -> Directive[Bold, FontSize -> 14],
 VertexStyle -> {
   "OTHELLO" -> RGBColor[1, 0.84, 0, 0.75],
   "BRABANTIO" -> RGBColor[0.79, 0.38, 0, 0.63],
   "DESDEMONA" -> RGBColor[0.73, 0.09, 0.89, 0.65],
   "LODOVICO" -> RGBColor[0.28026441037696703`, 0.715, 0.62, 0.88],
   "IAGO" -> RGBColor[0.363898, 0.71, 0.91, 0.85],
   "RODERIGO" -> RGBColor[0.571589, 0.79, 0., 0.71],
   "CASSIO" -> RGBColor[0.14, 0.15, 0.81, 0.65],
   "EMILIA" -> RGBColor[1., 0.29, 0.76, 0.68]
   },
 EdgeStyle -> Gray, GraphStyle -> "BasicGray", ImageSize -> 1200]
– Wolfram Blog Team
Lovely, thanks, it works
Hi Ruben. Sorry to hear that you’re having difficulties. The code actually is in the CDF file available for download at the end of the blog post. If there is something specific that you are looking for in addition to that code, please let me know. I don’t have any additional code, though.