Wolfram Computation Meets Knowledge

Word Facts

Recent versions of Mathematica introduced useful data functions related to linguistics. In Version 7, Mathematica‘s integrated string manipulation and visualization functions provide a powerful platform for computational linguistics. Both DictionaryLookup and WordData give programmatic access to curated data that’s ready for computation.

DictionaryLookup has been extended to cover more languages, while WordData has information on word meaning, structure, and usage, as well as cognitive and grammatical relationships between words. Let’s look at a range of examples, starting with some interesting and amusing word facts.

DictionaryLookup currently contains word lists for 27 different languages:

In[1]:= DictionaryLookup[All]

The longest word in all its 27 languages is the Finnish term for a fancy four-color press:

In[2]:= Last@SortBy[First /@ Reverse /@(SortBy[DictionaryLookup[{#, All}],StringLength] & /@ DictionaryLookup[All]), StringLength]

And we can plot the distribution of English words starting with each letter:

In[3]:= ListPlot[Length[DictionaryLookup[# ~~ ___]] & /@ CharacterRange["a", "z"], Filling -> Axis, AxesLabel -> {"first letter", "number of words"}, Ticks -> {Thread[{Table[i, {i, Length[CharacterRange["a", "z"]]}], CharacterRange["a", "z"]}], Automatic, {-1, 1}}]

These are, for example, all of DictionaryLookups‘s palindromes that are both English and French words:

In[4]:= Intersection[DictionaryLookup[{"English", x__ /; x === StringReverse[x]}], DictionaryLookup[{"French", x__ /; x === StringReverse[x]}]]

Remember that some words in WordData are considered too obscure for DictionaryLookup, and that unlike DictionaryLookup, WordData is English-only.

By using both WordData and DictionaryLookup we can find some hidden word facts, such as the English words with the most syllables:

In[5]:= w = {#, Length[#]} & /@ (WordData[#, "Hyphenation"] & /@ DictionaryLookup[]);

In[6]:= Last[Last[SortBy[w, Last]]];

In[7]:= First /@ Select[w, #[[2]] == % &]

or find that the word “break” has 76 definitions in WordData and DictionaryLookup and that the word “set” has 44 definitions in WordData:

In[8]:= Last@Sort[{Length[WordData[#]], #} & /@ DictionaryLookup[]]

In[9]:= Length[WordData["set"]]

This is the distribution of the number of words in WordData with a given number of meanings:

In[10]:= ListLogPlot[Sort[Tally[Length[WordData[#]] & /@ WordData[]]], AxesLabel -> {"number of meanings", "number of words"}]

and these are the words in WordData that have more than 50 different possible meanings:

In[11]:= Select[{#, Length[WordData[#]]} & /@ WordData[], #[[2]] > 50 &]

These words occur as the maximum number of different parts of speech:

In[12]:= WordData[][[ Ordering[Length[WordData[#, "PartsOfSpeech"]] & /@ WordData[], -5]]]

In[13]:= # -> WordData[#, "PartsOfSpeech"] & /@ %

We can find 692 antonym pairs in English where both words have the same length:

In[14]:= Select[{#, With[{w = WordData[#, "Antonyms", "List"]}, If[Length[w] > 0, First[SortBy[w, StringLength]], w]]} & /@ WordData[], (#[[2]] =!= {} && (StringLength[#[[1]]] === StringLength[#[[2]]])) &];

In[15]:= Length[%]

Here is a random selection out of the 692:

In[16]:= RandomChoice[%%, 10]

We can learn that “underground” and “underfund” are the English words in DictionaryLookup that begin and end with the letters “und”:

In[17]:= DictionaryLookup[StartOfString ~~ "und" ~~ ___ ~~ "und" ~~ EndOfString, IgnoreCase -> True]

and that there are only five English words in there that end in “-dous”:

In[18]:= DictionaryLookup[___ ~~ "dous" ~~ EndOfString, IgnoreCase -> True]

and that these are its only words in English that begin and end with the same four letters:

In[19]:= Select[DictionaryLookup[], StringMatchQ[#, StringExpression[w_, x_, y_, z_, ___, w_, x_, y_, z_] ] &]

With a slight change we can do the same for any of the other 26 languages in DictionaryLookup. In Spanish there are, like in English, nine words that begin and end with the same four letters:

In[20]:= Select[DictionaryLookup[{"Spanish", All}], 
 StringMatchQ[#, StringExpression[r_, w_, x_, y_, z_, ___, r_, w_, x_, y_, z_] ] &]

“Therein” contains twelve words without rearranging any of its letters:

In[21]:= Union@Flatten[ DictionaryLookup /@ Flatten[Table[{StringDrop[#, i], StringDrop[#, i]}, {i, 0, StringLength[#]}] & /@ Table[StringTake["therein", j], {j, StringLength["therein"]}]]]

The same word can form 24 words if we allow letters to be skipped but not reordered:

In[22]:= Union@Flatten[ DictionaryLookup /@ (StringJoin /@ Rest[Subsets[StringSplit["therein", ""]]])]

Only the word “neither” can be formed by rearranging “therein”:

In[23]:= Flatten[DictionaryLookup /@ (StringJoin /@ Rest[Permutations[StringSplit["therein", ""]]])]

and there are four English words that can be formed using only vowels:

In[24]:= DictionaryLookup[RegularExpression["aeiou]*"], IgnoreCase -> True]

There are several more in the French language, though:

In[25]:= DictionaryLookup[{"French", RegularExpression["[aeiou]*"]}, IgnoreCase -> True]

For query methods, we can use string expressions, regular expressions, or a combination of both. For example, this is a way to get the longest English word with a repeated vowel by using a regular expression and a pure function:

In[26]:= Last@SortBy[Flatten[DictionaryLookup[RegularExpression["[a-z]+" <> # <> "{2}+[a-z]+"]] & /@ {"a", "e", "i", "o", "u"}], StringLength]

The only DictionaryLookup English words with three consecutive double letters are those containing “bookkeep,” found using a string expression:

In[27]:= Select[DictionaryLookup[], StringMatchQ[#, StringExpression[___, y_, y_, x_, x_, z_, z_, ___] ] &]

and there are six English words in it with four different consecutive vowels:

In[28]:= Flatten[DictionaryLookup[RegularExpression["[a-z]+" <> # <> "[a-z]*"]] & /@ (StringJoin /@ Permutations[{"a", "e", "i", "o", "u"}, {4}])]

We can ask whether the word “queue” is pronounced the same way when the last four letters are removed (this says it is):

In[29]:= WordData["queue", "PhoneticForm"] == WordData["q", "PhoneticForm"]

There are 862 sets of words that have the same pronunciation. A total of 1796 English words are pronounced exactly the same as at least one other word:

In[30]:= res = Split[Sort[Reverse /@ Select[SortBy[{#, WordData[#, "PhoneticForm"]} & /@ DictionaryLookup[], Last &], (! MatchQ[#[[2]], Missing["NotAvailable"]]) && (StringLength[#[[1]]] > 1) &]], #1[[1]] == #2[[1]] &];

In[31]:= Length[Map[Last, Cases[res, x_ /; Length[x] > 1], {2}]]

In[32]:= Length[Flatten[Map[Last, Cases[res, x_ /; Length[x] > 1], {2}]]]

and at least 68 sets with more than two words with the very same pronunciation:

In[33]:= Sort@Map[Last, Cases[res, x_ /; Length[x] > 2], {2}]

In[34]:= Length[%]

Ed Pegg Jr has written a Rhyme Finder Demonstration using Mathematica‘s WordData:

Screenshot from Rhyme Finder Demonstration

We can keep on asking DictionaryLookup questions. These are the only words in it with a double consecutive “u” in English:

In[35]:= DictionaryLookup[__ ~~ "uu" ~~ ___]

and there are at least 20 English words with no vowels:

In[36]:= First /@ (Select[{#, WordData[#]} & /@ Pick[DictionaryLookup[], LowerCaseQ[#] && StringFreeQ[#, {"a", "e", "i", "o", "u"}, IgnoreCase -> True] & /@ DictionaryLookup[]], #[[2]] != {} &])

with “rhythms” being the longest:

In[37]:= Last@SortBy[Pick[DictionaryLookup[], LowerCaseQ[#] && StringFreeQ[#, {"a", "e", "i", "o", "u"}, IgnoreCase -> True] & /@ DictionaryLookup[]], StringLength]

We can take advantage of Mathematica‘s visualization capabilities and plot a vowel distribution table of those languages in DictionaryLookup that use the Latin alphabet, and find that among these languages, vowel distribution is very similar—in most, “a” is more common than “e”, “e” more common than “i”, and so on, with Latin having the most equal distribution:

In[38]:= SortBy[Tally[Flatten[Characters /@ DictionaryLookup[{#, All}]]], Last] & /@ DictionaryLookup[All];

In[39]:= Select[Thread[{DictionaryLookup[All], Sort[Select[#, MatchQ[#[[1]], Alternatives @@ {"a", "e", "i", "o", "u"}] &]] & /@ %}], #[[2]] != {} &];

In[40]:= BarChart[Map[Last, Last /@ %, {2}], ChartLabels -> (Placed[{"a", "e", "i", "o", "u"}, Center]), ChartLegends -> {First /@ %, None}, ChartLayout -> "Percentile", ChartStyle -> {"Rainbow", None}, PerformanceGoal -> "Speed", PlotLabel -> "Vowel distribution", ImageSize -> 270]

By using ExampleData we can find that word statistics vary not only by language, but by whether the words are chosen from a dictionary or from written text.

Screenshot from Multilanguage Word Lengths Demonstration

We can verify Zipf’s law as applied to word and letter frequencies.

Screenshot from Zipf's Law Applied to Word and Letter Frequencies Demonstration

And by using WordData, we can find the synonyms of a word, the synonyms of those synonyms, et cetera. Plotting the connections between all those synonyms produces a 3D semantic network. Here are all the words related to “computer” out to four degrees of separation.

In[41]:= GraphPlot3D[Union@Flatten[Rest[NestList[Union[Flatten[Thread[# -> WordData[#, "Synonyms", "List"]] & /@ Last /@ #]] &, {"" -> "computer"}, 4]]], VertexRenderingFunction -> (Text[#2, #1, Background -> White] &)]

It is amazing how many experiments one can imagine and easily perform by using these built-in data functions. But even more exciting are the types and numbers of problems that can be addressed in fields like computational linguistics and natural language processing with Mathematica.

Download this notebook