Analyzing Shakespeare’s Texts on the 400th Anniversary of His Death
April 21, 2016 — Jofre Espigule-Pons, Machine Learning
Putting some color in Shakespeare’s tragedies with the Wolfram Language
After four hundred years, Shakespeare’s works are still highly present in our culture. He mastered the English language as never before, and he deeply understood the emotions of the human mind.
Have you ever explored Shakespeare’s texts from the perspective of a data scientist? Wolfram technologies can provide you with new insights into the semantics and statistical analysis of Shakespeare’s plays and the social networks of their characters.
William Shakespeare (April 26, 1564 (baptized)–April 23, 1616) is considered by many to be the greatest writer of the English language. He wrote 154 sonnets, 38 plays (divided into three main groups: comedy, history, and tragedy), and 4 long narrative poems.
First, you need to get the text. One possibility is to import the public-domain HTML versions of the complete works of William Shakespeare from this MIT site:
Then make a word cloud from the text, deleting common stopwords like “and” and “the”:
As you can see, DeleteStopwords does not delete all the Elizabethan stopwords like “thou,” “thee,” “thy,” “hath,” etc. But I can delete them manually with StringDelete. And with some minor extra effort, you can improve the word cloud’s style a great deal:
Now let’s analyze a tragedy more deeply. Wolfram|Alpha already offers a lot of computed data about Shakespeare’s plays. For example, if you type “Othello” as Wolfram|Alpha input, you will get the following result:
If you want to visualize the interactions among the characters of this tragedy via a social network, you can achieve this with ease using the Wolfram Language. As I did earlier with the word cloud, I need to first import the texts. In this case I want to work with all the acts and scenes from Othello separately:
Since I want to import and save the scenes for later use in the same notebook’s folder, I can do the following:
In order to create the Graph, I first need all the character names, which will be displayed as vertices. I can gather the names by noticing that each dialog line is preceded by a character name in bold, which in HTML is written like this: <b>Name</b>. Thus it is straightforward to get an ordered list containing all character names (“speakers”) from each dialog line using StringCases:
Once I have the vertices, I need to create the edges of the graph. In this case, an edge between two vertices will represent the connection between two characters that are separated by less than two lines within the dialog (similar to the Demonstration by Seth Chandler that analyzes the networks in Shakespeare’s plays). For that purpose, I will use SequenceCases to create all the edges, i.e. pairs of lines separated by less than two lines:
Before creating the graph, I need to delete the edges that are duplicated or are equivalent, like OTHELLO↔IAGO and IAGO↔OTHELLO, and the edges connecting to themselves, i.e. IAGO↔IAGO:
Finally, you can specify the size of the vertices with the VertexSize option. For example, I want the vertices’ sizes to be proportional to the number of lines per character. I can get the number of lines per character with Counts:
Since the code is getting more cumbersome, I will omit it and show only the result (for those interested in the details of the code, you can find them in the attached notebook). Also note that in the final result I’m excluding the vertex “All” since it is not a real character in the dialog:
So far, so good. Having the social network from a Shakespeare play written more than four hundred years ago is quite cool, but I’m still not 100% satisfied. I would like to visualize when these interactions occur within the dialog itself. One way to achieve this is by representing each main speaker with a different-colored bar:
Note: linesColor is a list of colors representing the lines in one scene, and linesLength is the list of the lines’ StringLength with a rescaling function. These functions involve some TextManipulation, like I did earlier to obtain the character names from the HTML version of the play. If you wish, you can see their construction in the attached notebook:
Additionally, I can mark when a particular character says a particular word—for example, the word “love” (note: the variable words is the list of words per line in the scene, created with the new function TextWords; see the attached notebook for details):
Now I can combine all of this with the social network graph and have a colorful and compact infographic about a Shakespeare tragedy:
There are so many other interesting things that I would like to explore about Shakespeare’s works and life. But I will finish with a map representing the locations at which his plays occur. I hope you got a glance of what is possible to achieve with the Wolfram Language. The only limits are our imagination:
Finally, I’m using Geodisk to depict geopositions by disks with a radius proportional to the number of times each location appears in Shakespeare’s plays:
Many fellow Wolfram users expressed keen interest in and came up with astonishing approaches to Shakespeare’s corpus analysis on Wolfram Community. We hope this blog will inspire you to join that collaborative effort exploring the mysteries of Shakespeare data.