The Distance between “Zero” and “Hero”: Exploring Synonym Chains with Mathematica
There is an old word game where you try to get from one word to another through connections with other words. For example, you might get from “cold” to “stationary” via the word “frozen”, since “cold” and “frozen” are synonyms and “frozen” and “stationary” are synonyms, albeit for different meanings of the word “frozen”.
I thought of this game when I started to learn the new graph theory functions in Mathematica 8. We can think of the words in the English language as the vertices of one large graph and the synonym connections between them as the graph edges. If you do that, it looks like this:
So let’s see if we can generally solve this synonym chain problem.
First we need to create this graph representation of English and its synonym connectivity. I can get the raw data from the built-in WordData function that accesses Wolfram|Alpha‘s dictionary capabilities. The first problem is that the synonyms provided include idioms and phrases. While it might be useful to know that “zombie” means the same thing as “living dead”, I don’t think it’s in the spirit of the game to use phrases. So first I will create a function to eliminate these, so that we are left with only single words.

Next we need a function to pull together all the synonyms for all the alternative meanings of a word.

Now we just apply that function to every single word in the WordData source and de-duplicate the results. The  character indicates an undirected graph edge.
 character indicates an undirected graph edge.

Now that we have a graph representation of the synonyms, we can use the efficient FindShortestPath function to get between any two connected words in the fewest number of hops. The rest of this function is just presentation.

There is a lot of fun to be had out of this function, though sometimes the jumps that it makes can be hard to understand until you look up the more obscure meanings of some words in the dictionary. For me, the most fun is trying to connect seemingly opposite words. Some chains are very short, others quite long. Here are some examples:

Once you get bored of this game, the obvious question becomes, “What is the longest such shortest chain?” Or to put it another way, “Which two words are furthest apart?” If you click the GraphPlot visualization of the data at the top of the post to see the large version, you will see that most of the words are in small clusters, many of only two or three words, but there is one large, highly interconnected group of words. It seems quite likely that that is the place to start looking. So I’ll start by getting the subgraph represented by the largest group of connected components.

This subgraph has just under 20,000 vertices.


The next largest group contains only 37 vertices, so we can probably just look at this cluster.
Now, the brute-force method would be to call GraphDistanceMatrix on this and find all the shortest distances. If I were on a 64-bit platform, I might give that a go, but I would hit the memory ceiling on my 32-bit laptop, so I will have to use a more directed attack.
First, I will find some candidate words by measuring the eccentricity of each word. Eccentricity is the longest shortest distance between a vertex and all other vertices. It doesn’t return which partner vertex is that distance, but this way I don’t have to store all that data for the 20,000 words that are not candidates, or for all the words that are closer than the maximum. (Unfortunately, due to a bug in the initial release, you will need Mathematica 8.0.1 or higher to run this next line.)

By sorting the results, we can see here the 50 most eccentric words.


So now we can see that the longest shortest paths are 24 steps long. Here are the candidate words from this subgraph:


Now since this is an undirected graph, the word that partners with each of these words for the longest shortest path must also be from this set. So we just have to check the GraphDistance for each combination to find which word it pairs up with.


What we see is that each has multiple words that are the same 24 steps apart. But very quickly we see that they form groups: “dejectedness”, “dispiritedness”, “downheartedness”, and “low-spiritedness” are all synonyms of “lowness”, and “aby”, “abye”, and “expiate” are all synonyms of “atone”. So once we have seen one example, the other 11 possible ways to join the two groups are uninteresting. Here is one of them:


“Heterosexuality” and “heterosexualism” are obviously synonyms, but “nonrecreational” and “mulishness” are not related, so we have two more distinct cases here:




There is another way to look at this. Instead of looking for the longest shortest path, we could look for the longest longest path. Of course we could go round in circles forever, so we probably want the longest chain that only visits each word once. If such a tour exists for our subgraph, then we can find it with FindHamiltonianCycle, but a quick test shows that there is no such path.


So the challenge would be to find the largest Hamiltonian subgraph of this graph. Since the result will probably form a synonym chain that is thousands of words long, and too big for this blog, I will leave that as a challenge to the reader!

 
     
     
    
These step-by-step are very useful for me. I’m trying to learn Mathematica. How long do you think it would take someone to have a similar level of knowledge of Mathematica as you have, Jon? Is there a way of learning Mathematica that you can recommend?
Thanks
This is a cool post. I will try this once upgrade to 64bit and install M8.0.1. But! I would use phrase and make the coumpound words adjectival or adverbial.
so zombie -> lving-dead-> vampire -> blood-sucking -> ?. (avoid lawyer jokes). Neverthelss I would embrace the hypen in this case and maybe find a longer shortest-run chain. I would embrace homograph nonrecreational -> non-re-creational (the act of not creating again) dead end. See wiki on hyphens.
A similar game is trying to connect two actors by linking them by their costars. If you had access to a database of movies and actors, this same approach would knock that out.
What is “FindHamiltonianTour” referring to? Perhaps FindHamiltonianCycle?
Great postm thank you! It would be nice to be able to do the same in other languages too. Any chance that the WordData database will be extended to other languages at some point?
@Eric. Well spotted. I did mean the command FindHamiltonianCycle
http://reference.wolfram.com/mathematica/ref/FindHamiltonianCycle.html
Great post! And testimony just how code-efficient and succinct M can be!
Does this address Andy’s (Mar 17) question?
http://www.imdb.com/help/show_leaf?usedatasoftware
I’ve been working on a version of this that instead of using synonym changes, simply changes one letter at a time. Ex. dear->bear->bean->lean->leak (not necessarily the shortest, but you get the point). It’s coming along.