# The Data Science of MathOverflow

February 1, 2019 — Andrew Steinacher, Lead Developer, Wolfram|Alpha Scientific Content

This post discusses new Wolfram Language features from the upcoming release of Version 12. Copyable input expressions and a downloadable notebook version of this post will be available when Version 12 is released.

## New Archive Conversion Utility in Version 12

Soon there will be 100,000 questions on MathOverflow.net, a question-and-answer site for professional mathematicians! To celebrate this event, I have been working on a Wolfram Language utility package to convert archives of Stack Exchange network websites into Wolfram Language entity stores.

The archives are hosted on the Internet Archive and are updated every few months. The package, although not yet publicly available, will be released in the coming weeks as part of Version 12 of the Wolfram Language—so keep watching this space for more news about the release!

Although some data analysis can be done with tools such as the Stack Exchange Data Explorer, queries are usually limited in size or computation time, as well as to text-only formats. Additionally, they require some knowledge of SQL. With a local copy of the data, far more is possible, including output as images, plots and graphs.

With the utility package operating on a local archive, it’s easy to perform much deeper data analysis using all of the built-in tools in the Wolfram Language. In particular, Version 12 of the Wolfram Language adds support for RDF and SPARQL queries, as well as useful constructs such as FilteredEntityClass and SortedEntityClass.

For professional mathematicians who already use Mathematica and the Wolfram Language, this utility allows for seamless investigation into the data on MathOverflow.net or any Stack Exchange network site. Feel free to follow along with me as I do some of this investigation by running the code in a notebook, or just sit back and enjoy the ride as we explore MathOverflow.net with the Wolfram Language!

## Importing a MathOverflow EntityStore

The entity stores created by the utility package allow for quick access to the data in a format that’s easy for Wolfram Language processing, such as queries using the Entity framework, machine learning functionality, visualization, etc.

downloadedFile=URLDownload[CloudObject["StackExchange2EntityStore/mathoverflow.net.mx"],NotebookDirectory[]]

store=Import[downloadedFile];

The store is quite large, consisting of nearly three million entities in several entity types:

entityStoreMetaData=AssociationMap[<|"Entity Count"->Length[store[#,"Entities"]],"Property Count"->Length[store[#,"Properties"]]|>&,store[]]//ReverseSortBy[Lookup["Entity Count"]];

Dataset[entityStoreMetaData]

Total[#"Entity Count"&/@entityStoreMetaData]

Lastly, we need to register the EntityStore for use in the current session with EntityRegister:

EntityRegister[store]

This returns a list of all of the new entity types from the EntityStore that are now available through EntityValue.

For those who are familiar with the Stack Exchange network, these types may be very familiar. But for those who are not, or if you just need a refresher, here’s a basic rundown of a few of the different types:

The remaining types not listed are beyond the scope of my post, but you can learn more about them in the README on the archives, or by visiting the frequently asked questions on any Stack Exchange network site.

## Accessing MathOverflow.net Posts

Now that the EntityStore is loaded, we can access it through the Entity framework.

Let’s look at some random posts:

RandomEntity["StackExchange.Mathoverflow:Post",3]

The “Post” entities are formatted with the post type (Q for question, A for answer), the user who authored the post in square brackets, a short snippet of the post and a hyperlink (the blue ») to the original post on the web.

Many of the other entity types format similarly—this is to give proper context, allow for manual exploration on the site itself and give attribution to the original authors (they created the content on the site, after all).

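Taking just one of these posts, we can find a lot of information about it with a property association. In the notebook, each name in the list below is a formatted EntityProperty object, shown here by its display label: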

Entity["StackExchange.Mathoverflow:Post", "272527"][{accepted answer,answer count,body,closed date,comment count,comments,community owned date,creation date,duplicate posts,favorite count,id,last activity date,last edit date,last editor,linked posts,owner,post type,score,tags,title,URL,view count},"PropertyAssociation"]

For example, one may be interested in the posts for a given tag, such as set theory.

We can find how many set theory questions have been asked:

EntityValue[EntityClass["StackExchange.Mathoverflow:Post",{"Tags"->Entity["StackExchange.Mathoverflow:Tag", "SetTheory"],"PostType"->Entity["StackExchange:PostType", "1"] }],"EntityCount"]

We can even see the intersections of different tags, such as set theory and plane geometry:

EntityValue[EntityClass["StackExchange.Mathoverflow:Post",{"Tags"->ContainsAll[{Entity["StackExchange.Mathoverflow:Tag", "SetTheory"],Entity["StackExchange.Mathoverflow:Tag", "PlaneGeometry"]}],"PostType"->Entity["StackExchange:PostType", "1"] }],"Entities"]

It’s important to note that as of this writing, the archives have not been updated to include the 100,000th question, so we can see that there are only 98,165 questions as of December 2, 2018:

EntityClass["StackExchange.Mathoverflow:Post","PostType"->Entity["StackExchange:PostType", "1"]]["EntityCount"]

EntityValue[Entity["StackExchange.Mathoverflow:Post"],"LastPostTime"]

Of course, there is a seemingly endless number of queries one can make on this dataset.

A few ideas that I had were to find and analyze:

• The distribution of post scores (specifically the (nearly) 100k questions)
• Word distributions and frequencies (e.g. n-grams)
• TeX snippets (TeX being the language frequently used on Stack Exchange to format equations and other relations)
• Mathematical propositions (e.g. theorems, lemmas, axioms) mentioned in posts
• Famous mathematicians and propositions that are named after them

Let’s tackle these one at a time.

## Analyzing Posts

### Post Score Distributions

Since there are over 237,000 posts on MathOverflow in total, their scores span a wide range.

Let’s look at this distribution, noting that some post scores can be negative if they are downvoted by users in the community:

allScores=EntityValue["StackExchange.Mathoverflow:Post",EntityProperty["StackExchange.Mathoverflow:Post", "Score"]];

postScoreDistribution=Tally[allScores];

ListPlot[postScoreDistribution,PlotRange -> Full,PlotTheme->"Detailed",ImageSize->400]

That’s hard to read. On a log-log scale it looks better, and the distribution becomes mostly straight beyond the first several points:

ListLogLogPlot[postScoreDistribution,PlotRange -> All,PlotTheme->"Detailed",ImageSize->400]

Let’s focus on the positive post scores below 50:

scoresBelowFifty=EntityValue[EntityClass["StackExchange.Mathoverflow:Post","Score"->Between[{1,50}]],"Score"];

ListPlot[Tally[scoresBelowFifty],PlotRange->Full,Filling->Axis,PlotTheme->"Detailed",ImageSize->400]

It looks like it might be a log-normal distribution, so let’s find the fitting parameters for it:

distributionParameters=FindDistributionParameters[scoresBelowFifty,LogNormalDistribution[μ,σ]]
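As a sanity check on what this fit computes, here is a minimal sketch on synthetic data rather than the MathOverflow scores. It assumes the default estimator of FindDistributionParameters is maximum likelihood, for which the log-normal estimates have a closed form: the mean and the population standard deviation of the logged data. The sample, seed and variable names are made up for illustration.

```wolfram
(* Synthetic stand-in for the post scores; the parameters 1.0 and 0.8 are arbitrary *)
SeedRandom[1234];
sample = RandomVariate[LogNormalDistribution[1.0, 0.8], 5000];

(* Fit with the default estimator (assumed here to be maximum likelihood) *)
fit = FindDistributionParameters[sample, LogNormalDistribution[μ, σ]];

(* Closed-form maximum-likelihood estimates computed from the logged data *)
logs = Log[sample];
closedForm = {μ -> Mean[logs], σ -> Sqrt[Mean[(logs - Mean[logs])^2]]};

(* The two pairs of estimates should agree to within numerical tolerance *)
{μ, σ} /. {fit, closedForm}
```

Post scores are integers, so fitting a continuous log-normal to them is an approximation; the comparison in the post is against the normalized tally.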

Plotting both on the same (normalized) scale shows they agree quite well:

With[
{pdf=PDF[LogNormalDistribution[μ,σ]/.distributionParameters,x]},
Show[
{
Plot[pdf,{x,0,50},
PlotRange -> All,PlotTheme->"Detailed",
PlotLegends->Placed[{pdf},{Right,0.75}],
PlotStyle->Red,ImageSize->400
],
ListPlot[
{#1,#2/Length[scoresBelowFifty]}&@@@Tally[scoresBelowFifty],
PlotRange->All,
Filling->Axis,
PlotStyle->Blue
]
}
]
]

We can repeat this analysis on the (almost) 100k questions:

allQuestionScores=EntityValue[EntityClass["StackExchange.Mathoverflow:Post","PostType"->Entity["StackExchange:PostType", "1"]],EntityProperty["StackExchange.Mathoverflow:Post", "Score"]];

allQuestionScores//Length

questionScoreDistribution=Tally[allQuestionScores];

ListLogLogPlot[questionScoreDistribution,PlotRange -> All,PlotTheme->"Detailed",ImageSize->400]

questionScoresBelowFifty=EntityValue[EntityClass["StackExchange.Mathoverflow:Post",{"PostType"->Entity["StackExchange:PostType", "1"],"Score"->Between[{1,50}]}],EntityProperty["StackExchange.Mathoverflow:Post", "Score"]];
questionScoreDistributionParameters=FindDistributionParameters[questionScoresBelowFifty,LogNormalDistribution[μ,σ]]

With[
{pdf=PDF[LogNormalDistribution[μ,σ]/.questionScoreDistributionParameters,x]},
Show[
{
Plot[pdf,{x,0,50},
PlotRange -> All,PlotTheme->"Detailed",
PlotLegends->Placed[{pdf},{Right,0.75}],
PlotStyle->Red,ImageSize->400
],
ListPlot[
{#1,#2/Length[questionScoresBelowFifty]}&@@@Tally[questionScoresBelowFifty],
PlotRange->All,
Filling->Axis,
PlotStyle->Blue
]
}
]
]

### Words Much More Common in Mathematics Than Normal Language

There are many words that appear in mathematics that are not found in typical English. Some examples include names of mathematicians (e.g. Riemann, Euler, etc.) or words that have special meanings (integral, matrix, group, ring, etc.).

We can start to investigate these words by gathering all of the post bodies:

postBodies=EntityValue["StackExchange.Mathoverflow:Post","Body"];

We’ll need to create functions to normalize strings (normalizeString) and extract sentences (extractSentences), removing HTML tags and replacing any equations with the script-letter placeholder “ℳ𝒜𝒯ℋ”:

normalizeString=StringReplace[
{
Shortest["$$"~~__~~"$$"]:>"\[ScriptCapitalM]\[ScriptCapitalA]\[ScriptCapitalT]\[ScriptCapitalH]",
Shortest["$"~~__~~"$"]->"\[ScriptCapitalM]\[ScriptCapitalA]\[ScriptCapitalT]\[ScriptCapitalH]",
Shortest["\\begin{equation*}"~~__~~"\\end{equation*}"]->"\[ScriptCapitalM]\[ScriptCapitalA]\[ScriptCapitalT]\[ScriptCapitalH]",
Shortest["\\begin{align}"~~__~~"\\end{align}"]->"\[ScriptCapitalM]\[ScriptCapitalA]\[ScriptCapitalT]\[ScriptCapitalH]",
Shortest[""]->"",
Shortest[""]->"",
Shortest[""]->"",
"<"~~("/"|"")~~("blockquote"|"strong"|"span"|"sup"|"sub"|"ul"|"ol"|"li"|"em"|"p"|"i"|"b"|"a")~~">"->"",
"–"->"-",
"\""->"",
" "->"",
Whitespace->" "
}
];
extractSentences=RightComposition[
normalizeString,
StringSplit[#,"."|","|"?"|";"|":"|"("|")"|"+"]&
];
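To see why the equation patterns are wrapped in Shortest, here is a small self-contained sketch on a made-up sentence. The string "MATH" stands in for the script-letter placeholder, and the variable names are illustrative only.

```wolfram
(* Toy sentence with two inline equations *)
toy = "Let $G$ be a group and $H$ a subgroup.";

(* Shortest stops at the nearest closing dollar sign, so each equation is replaced separately *)
withShortest = StringReplace[toy, Shortest["$" ~~ __ ~~ "$"] -> "MATH"]
(* "Let MATH be a group and MATH a subgroup." *)

(* Without Shortest, the match runs from the first dollar sign to the last one *)
withoutShortest = StringReplace[toy, "$" ~~ __ ~~ "$" -> "MATH"]
(* "Let MATH a subgroup." *)
```

The greedy version swallows the prose between the two equations, which would distort the word counts that follow.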

We’ll also need to extract, count up and sort the words from all of the post bodies:

wordToCount=postBodies//RightComposition[
extractSentences,
Flatten/*StringSplit/*ToLowerCase,
Flatten,
Counts/*ReverseSort
];

This gives a list of almost 400k words:

wordToCount//Length

We can trim it down to just the top 500 words, being careful to remove some extra noise with websites, equations, inequalities and single letters:

topWordToCount=KeySelect[
wordToCount[[;;1000]],
Not[StringMatchQ[#, "*'*"|"*=*"|"*<*"|"*www*"|"http"|"https"|"--"]]&&StringLength[#]>1&
]//Take[#,UpTo[500]]&;

Note that “ℳ𝒜𝒯ℋ” is the most common word, since all equations were replaced with it:

Dataset[topWordToCount[[;;20]]]

It’s useful to visualize it as a word cloud:

WordCloud[topWordToCount]

Removing “ℳ𝒜𝒯ℋ” and stopwords like “the”, “is” and “of” from the data will avoid some clutter:

topWordToCountNoStopwords=KeySelect[
wordToCount[[;;1000]],
Not[StringMatchQ[#, "*'*"|"*=*"|"*<*"|"*www*"|"http"|"https"|"\[ScriptCapitalM]\[ScriptCapitalA]\[ScriptCapitalT]\[ScriptCapitalH]"|"--"]]&&StringLength[#]>1&
];
topWordToCountNoStopwords=KeyTake[topWordToCountNoStopwords,topWordToCountNoStopwords//Keys//DeleteStopwords]//Take[#,UpTo[500]]&;

Dataset[topWordToCountNoStopwords[[;;20]]]

Now the results are more interesting and meaningful:

WordCloud[topWordToCountNoStopwords]

Of course, we can take this analysis further. We can get the frequencies for the top words in usual English with WordFrequencyData:

wordToEnglishFrequency=WordFrequencyData[Keys[topWordToCountNoStopwords]];//AbsoluteTiming

Normalize the counts of the words on MathOverflow, and then join the two as coordinates in 2D frequency space:

wordToMOFrequency=N[topWordToCountNoStopwords/Total[topWordToCountNoStopwords]];

wordToFrequencyCoordinates={wordToEnglishFrequency,wordToMOFrequency}//Merge[Identity]//Select[FreeQ[_Missing]];

wordToFrequencyCoordinates[[;;3]]
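The Merge and Select steps above can be seen in isolation with a tiny sketch. The frequencies below are invented, and "zzyzx" stands for a word that has no English frequency available.

```wolfram
(* Hypothetical frequencies; the Missing value mimics a word with no English frequency data *)
english = <|"ring" -> 0.00002, "zzyzx" -> Missing["NotAvailable"]|>;
mathOverflow = <|"ring" -> 0.001, "zzyzx" -> 0.0005|>;

(* Merge[Identity] pairs the values by key; Select drops any pair that contains a Missing *)
coordinates = {english, mathOverflow} // Merge[Identity] // Select[FreeQ[_Missing]]
(* <|"ring" -> {0.00002, 0.001}|> *)
```

Each surviving value is an {English, MathOverflow} pair, which is the coordinate form the plots expect.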

We can visualize these coordinates, adding a red region to the plot for words more commonly used in typical English than in MathOverflow posts (below y = x), and a gray region for words that are more commonly used in MathOverflow posts by less than a factor of 10 (below y = 10x).

This arbitrary factor allows us to narrow down the words that are much more common to MathOverflow than typical English, which appear in the white region (above y = 10x):

Show[
LogLogPlot[{x,10x},{x,5*10^-11,0.02},
PlotStyle->{Red,Gray},
Filling->{2->{1},1->Bottom},
FillingStyle->{1->Red,2->Gray},
PlotStyle->PointSize[0.002],
PlotRange->{{8*10^-11,0.02},{3*10^-4,0.02}},
ImageSize->Large,
PlotTheme->"Detailed",
FrameLabel->{"Fraction of English","Fraction of MathOverflow.net"}
],
ListLogLogPlot[
wordToFrequencyCoordinates,
PlotStyle->PointSize[0.002],
PlotRange->{{8*10^-11,0.02},{3*10^-4,0.02}},
ImageSize->Large,
PlotTheme->"Detailed"
]
]

We can take this another step further by looking at the words in the white region that are much more likely to occur on MathOverflow than they are in typical English:

wordsMuchMoreCommonInMO=wordToFrequencyCoordinates//Select[Apply[Divide]/*LessThan[1/10]];
wordsMuchMoreCommonInMO//Length

ListLogLogPlot[wordsMuchMoreCommonInMO,ImageSize->500,PlotTheme->"Detailed",PlotStyle->PointSize[0.002],FrameLabel->{"Fraction of English","Fraction of MathOverflow.net"}]

Of course, an easy way to visualize this data is in a word cloud, where the words are weighted by combining their frequency of use via Norm:

WordCloud[Norm/@wordsMuchMoreCommonInMO,WordSpacings->2]

### Analysis of n-Grams

Of course, individual words are not the only way to analyze the MathOverflow corpus.

We can create a function to compute n-grams using Partition and recycling extractSentences from earlier:

ClearAll[getNGrams];
getNGrams[n_Integer?Positive]:=RightComposition[
extractSentences,
Map[
ToLowerCase/*StringSplit/*Map[Counts[Partition[#,n,1]]&]/*Merge[Total]
],
Merge[Total],
ReverseSort,
(* Keep only the top 10,000 to save memory *)
Take[#,UpTo[10000]]&
];
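The core of getNGrams is Partition with an offset of 1, which slides a window of n words along each sentence. A minimal sketch on made-up phrases (the variable names are illustrative):

```wolfram
(* Sliding window of three words over a lowercased phrase *)
words = StringSplit[ToLowerCase["If and only if"]];
trigrams = Partition[words, 3, 1]
(* {{"if", "and", "only"}, {"and", "only", "if"}} *)

(* Counting repeated windows, as getNGrams does for each sentence *)
trigramCounts = Counts[Partition[StringSplit["the set of all the set of"], 3, 1]];
trigramCounts[{"the", "set", "of"}]
(* 2 *)
```

A sentence with fewer than n words yields an empty list from Partition, so it contributes nothing to the counts.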

Next, we’ll need to build a function to show the n-grams in tables and word clouds, both with and without math (since putting them together would clutter the results a bit):

ClearAll[showNGrams];
showNGrams[nGramToCount_Association]:=Module[
{phraseToCount,phraseToCountWithMath,phraseToCountWithoutMath},

phraseToCount=KeyMap[StringRiffle,nGramToCount];
phraseToCountWithMath=Take[phraseToCount,UpTo[200]]//Normal//Select[#,Not@StringFreeQ[#[[1]],"\[ScriptCapitalM]\[ScriptCapitalA]\[ScriptCapitalT]\[ScriptCapitalH]"]&,50]&//Association;
phraseToCountWithoutMath=Take[phraseToCount,UpTo[200]]//Normal//Select[#,StringFreeQ[#[[1]],"\[ScriptCapitalM]\[ScriptCapitalA]\[ScriptCapitalT]\[ScriptCapitalH]"]&,50]&//Association;

Print@WordCloud[Take[phraseToCountWithMath,UpTo[50]],ImageSize->500,WordSpacings->2];
Print[
Row[
{
Column[{Style["↑ Including \[ScriptCapitalM]\[ScriptCapitalA]\[ScriptCapitalT]\[ScriptCapitalH] ↑",24,FontFamily->"Source Code Pro"],Dataset[Take[phraseToCountWithMath,UpTo[20]]]}],
Column[{Style["↓ Without \[ScriptCapitalM]\[ScriptCapitalA]\[ScriptCapitalT]\[ScriptCapitalH] ↓",24],Dataset[Take[phraseToCountWithoutMath,UpTo[20]]]}]
},
Spacer[20]
]
];
Print@WordCloud[Take[phraseToCountWithoutMath,UpTo[50]],ImageSize->500,WordSpacings->2];
];

### 3-grams

Looking at the 3-grams, there are lots of “The number of…”, “The set of…”, “is there a…” and more.

There are definitely signs of “if and only if,” but they’re not well captured here since we’re looking at 3-grams. They should show up later in the 4-grams, anyway.

There is a lot of “Let ℳ𝒜𝒯ℋ be…”, “ℳ𝒜𝒯ℋ is a,” and similar. It’s clear that MathOverflow users frequently use TeX for mathematical notation:

postBodies//getNGrams[3]//showNGrams

### 4-grams

Expanding on the 3-grams, the 4-grams give several mathy phrases like “if and only if,” “on the other hand,” “is it true that” and “the set of all.”

We also see more proof-like phrases like “Let ℳ𝒜𝒯ℋ be a,” “ℳ𝒜𝒯ℋ such that ℳ𝒜𝒯ℋ” and similar.

It’s interesting how the two word clouds begin to show the split of “proof-like” phrases and “natural language” phrases:

postBodies//getNGrams[4]//showNGrams

### 5-grams

We see similar trends with the 5- and 6-grams:

postBodies//getNGrams[5]//showNGrams

### 6-grams

postBodies//getNGrams[6]//showNGrams

Moving past natural language processing, another way to analyze the MathOverflow site is as a network.

We can create a network of MathOverflow users that communicate with each other. One way to do this is to connect two users if one user posts an answer to another user’s question. In this way, we can create a directed graph of MathOverflow users.

Although it’s possible to do this graph-like traversal and matching with the usual EntityValue syntax, it could get somewhat messy.

To start, we can write a symbolic representation of a SPARQL query to find all connections between question writers and the writers of answers, and then do some processing to turn it into a Graph:

Needs["GraphStore"]

questionerToAnswererGraph=Entity["StackExchange.Mathoverflow:Post"]//RightComposition[
SPARQLSelect[
{
RDFTriple[SPARQLVariable["post"],post type,Entity["StackExchange:PostType", "1"]],
SPARQLPropertyPath[SPARQLVariable["post"],{owner},SPARQLVariable["questioner"]],
],
Map[Apply[DirectedEdge]],
Graph
]

From the icon of the output, we can see it’s a very large directed multigraph. Networks of this size have very little hope of being visualized easily, so we should find a way to reduce the size of it.

We can trim down the size by writing a similar SPARQL query that limits us to posts with a few numerical mathematics post tags:

questionerToAnswererGraphSmaller=Entity["StackExchange.Mathoverflow:Post"]//RightComposition[
SPARQLSelect[
{
RDFTriple[SPARQLVariable["post"],post type,Entity["StackExchange:PostType", "1"]],
Alternatives[
RDFTriple[SPARQLVariable["post"],tags,Entity["StackExchange.Mathoverflow:Tag", "NumericalLinearAlgebra"]],
RDFTriple[SPARQLVariable["post"],tags,Entity["StackExchange.Mathoverflow:Tag", "NumericalAnalysisOfPde"]],
RDFTriple[SPARQLVariable["post"],tags,Entity["StackExchange.Mathoverflow:Tag", "NumericalIntegration"]],
RDFTriple[SPARQLVariable["post"],tags,Entity["StackExchange.Mathoverflow:Tag", "RecreationalMathematics"]]
],
SPARQLPropertyPath[SPARQLVariable["post"],{owner},SPARQLVariable["questioner"]],
],
Map[Apply[DirectedEdge]],
Graph[#,GraphStyle->"LargeGraph"]&
]

This graph is much smaller and can be more reasonably visualized. For simplicity, let’s focus only on the largest (weakly) connected component:

questionerToAnswererGraphSmallerConnected=First@WeaklyConnectedGraphComponents[questionerToAnswererGraphSmaller]
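The last two processing steps (turning pairs into directed edges, then keeping the largest weakly connected component) can be tried on a toy example. The vertex names below are invented, and the sketch relies on WeaklyConnectedGraphComponents listing the largest component first.

```wolfram
(* Hypothetical {from, to} pairs standing in for the query results *)
pairs = {{"a", "b"}, {"b", "c"}, {"d", "e"}};

(* Same edge construction as in the post: Map[Apply[DirectedEdge]] followed by Graph *)
toyGraph = Graph[Map[Apply[DirectedEdge]][pairs]];

(* Components are ordered by size, so First picks the largest one *)
largest = First@WeaklyConnectedGraphComponents[toyGraph];
Sort[VertexList[largest]]
(* {"a", "b", "c"} *)
```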

### Questioner-Answerer Communities by Geographic Region

We can group the vertices of the graph (MathOverflow users) by geography by using the location information users have entered into their profiles.

Here, we can use Interpreter["Location"] to handle a variety of input forms, including countries, cities, administrative divisions (such as states) and universities:

userToLocation=EntityValue[VertexList[questionerToAnswererGraphSmallerConnected],"Location","NonMissingEntityAssociation"];
userToLocation=AssociationThread[Keys[userToLocation],Interpreter["Location"]@RemoveDiacritics@Values[userToLocation]];

The results are pretty good, giving over 250 approximate locations:

userToLocation//Values//CountsBy[Head]

Of course, these individual locations are not that helpful, as they are very localized. We can use GeoNearest to find the nearest geographic region as a basis for determining groups for the users:

ClearAll[getRegion];
getRegion[location_GeoPosition]:=First@getRegion[{location}];
getRegion[locations:{__GeoPosition}]:=First[#,Missing["NotAvailable"]]&/@DeleteCases[GeoNearest[GeoVariant["GeographicRegion","Center"],locations],Entity["GeographicRegion", "World"],Infinity];

Next, we group users into communities based on this geographic information:

userToGeographicRegion=DeleteMissing@AssociationThread[Keys[#],getRegion@Values[#]]&[Select[userToLocation,MatchQ[_GeoPosition]]];
geographicRegionToUsers=GroupBy[userToGeographicRegion,Identity,Keys];
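The GroupBy call inverts a user-to-region association into a region-to-users association. A minimal sketch with made-up users and plain strings in place of entities:

```wolfram
(* Hypothetical user-to-region association *)
userToRegion = <|"user1" -> "Europe", "user2" -> "Asia", "user3" -> "Europe"|>;

(* Group by the region value, then keep only the user keys of each group *)
regionToUsers = GroupBy[userToRegion, Identity, Keys]
(* <|"Europe" -> {"user1", "user3"}, "Asia" -> {"user2"}|> *)
```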

Length/@geographicRegionToUsers

Lastly, we can use CommunityGraphPlot to build a graphic that shows the geographic communities of the questioner-answerer network:

regionToPointColor=Lookup[
<|
Entity["GeographicRegion", "Europe"]->Darker@,Entity["GeographicRegion", "NorthAmerica"]->Darker[Green,0.5],Entity["GeographicRegion", "Australia"]->Orange,Entity["GeographicRegion", "Asia"]->Purple,Entity["GeographicRegion", "SouthAmerica"]->Darker[Red,0.25]
|>,
#,Brown]&;
regionToLabelPlacement=Lookup[
<|
Entity["GeographicRegion", "Europe"]->Below,Entity["GeographicRegion", "NorthAmerica"]->After,Entity["GeographicRegion", "Asia"]->Below
|>,
#,Above]&;
regionToRotation=Lookup[<|Entity["GeographicRegion", "NorthAmerica"]->-(π/2)|>,#,0]&;
styledCommunities=KeyValueMap[
Function[
{region,users},
With[
{regionPointColor=regionToPointColor[region]},
Labeled[
Style[users,regionPointColor],
Rotate[
Style[
FontColor->ResourceFunction["FontColorFromBackgroundColor"][regionPointColor],
FontSize->12
],
regionToRotation[region]
],
regionToLabelPlacement[region]
]
]
],
allRegionsToUsers
];
CommunityGraphPlot[
questionerToAnswererGraphSmallerConnected,
styledCommunities,
ImageSize->500,
Method->"Hierarchical",
VertexStyle->PointSize[Medium],
CommunityRegionStyle->(Opacity[0.1,regionToPointColor[#]]&/@Keys[allRegionsToUsers]),
CommunityBoundaryStyle->(Opacity[1,Black]&/@Keys[allRegionsToUsers])
]

### Post Owner-Commenter Network

Of course, we could do a similar analysis on connections between post owners and their commenters for posts tagged with “linear-programming”:

postOwnerToCommenterGraph={
Entity["StackExchange.Mathoverflow:Post"],Entity["StackExchange.Mathoverflow:Comment"]
}//RightComposition[
SPARQLSelect[
{
SPARQLPropertyPath[SPARQLVariable["post"],{owner},SPARQLVariable["owner"]],
RDFTriple[SPARQLVariable["post"],tags,Entity["StackExchange.Mathoverflow:Tag", "LinearProgramming"]],
}->{SPARQLVariable["owner"],SPARQLVariable["commenter"]}
],
Lookup[#,{SPARQLVariable["commenter"],SPARQLVariable["owner"]}]&,
Map[Apply[DirectedEdge]],
Graph[#,GraphStyle->"LargeGraph"]&
]

However, further analysis on this network will be left as an exercise for the reader.

## Analyzing TeX Snippets

TeX, in its various forms, has been around for over 40 years and is widely used in math and science for typesetting.

On MathOverflow, there are not many posts without it, so exploring TeX snippets can give interesting insights into the content available on the site.

### Extract TeX Snippets

First, we need to extract the TeX snippets from post bodies. Consider a simple example post:

body=Entity["StackExchange.Mathoverflow:Post", "40686"]["Body"]

We can write a function to extract the snippets in a string, noting the two main input forms ("$…$" or "\\begin{…}…\\end{…}"):

ClearAll[extractTeXSnippets];
extractTeXSnippets[s_String] :=
Module[
{dd,d,o},

dd=StringCases[s,Shortest["$$"~~__~~"$$"]];
d=StringCases[StringReplace[s,Shortest["$$"~~__~~"$$"]:>""],Shortest["$"~~__~~"$"]];
o=StringCases[
s,
Alternatives[
Shortest["\\begin{equation*}"~~__~~"\\end{equation*}"],
Shortest["\\begin{align}"~~__~~"\\end{align}"]
]
];

StringReplace[Join[dd, d,o], {".$":>"$"}]
]
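The function strips the double-dollar blocks before searching for single-dollar snippets, so that the delimiters of a display equation are not mistaken for inline ones. A small sketch of that ordering on a made-up post body:

```wolfram
(* Toy body with one display equation and one inline snippet *)
toyBody = "Euler's identity $$e^{i\\pi}+1=0$$ holds, where $i$ is the imaginary unit.";

(* Display equations first *)
display = StringCases[toyBody, Shortest["$$" ~~ __ ~~ "$$"]];

(* Then inline snippets, searched only after the display equations have been removed *)
inline = StringCases[
  StringReplace[toyBody, Shortest["$$" ~~ __ ~~ "$$"] :> ""],
  Shortest["$" ~~ __ ~~ "$"]
];

{display, inline}
(* {{"$$e^{i\\pi}+1=0$$"}, {"$i$"}} *)
```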

Testing this on the simple example gives the snippets wrapped in dollar signs:

extractTeXSnippets[body]

### Format TeX Snippets

Of course, once we have TeX snippets, it would be valuable to format them into actual typesetting that’s easier on the eyes than the raw TeX code.

We can write a quick function to do this with proper formatting:

(* In the notebook, the next two right-hand sides are iconized rule lists, shown here only by their labels *)
blackboardBoldRules=character to double struck;
frakturGothicRules=character to gothic;
formatTeXSnippet[s_String] :=
Which[
StringMatchQ[s, "$\\mathbb{"~~ _ ~~ "}$"],
StringReplace[s,"$\\mathbb{"~~ a_ ~~ "}$":> a]/.blackboardBoldRules,

StringMatchQ[s, "$\\mathbb "~~ _ ~~ "$"],
StringReplace[s,"$\\mathbb "~~ a_ ~~ "$":> a]/.blackboardBoldRules,

StringMatchQ[s, "$\\mathfrak{"~~ _ ~~ "}$"],
StringReplace[s,"$\\mathfrak{"~~ a_ ~~ "}$":> a]/.frakturGothicRules,

StringMatchQ[s, "$\\mathfrak "~~ _ ~~ "$"],
StringReplace[s,"$\\mathfrak "~~ a_ ~~ "$":> a]/.frakturGothicRules,

StringMatchQ[s, "$\\mathcal{"~~ _ ~~ "}$"],
Style[StringReplace[s,"$\\mathcal{"~~ a_ ~~ "}$":>a],FontFamily->"Snell Roundhand"],

StringMatchQ[s, "$\\mathcal "~~ _ ~~ "$"],
Style[StringReplace[s,"$\\mathcal "~~ a_ ~~ "$":>a],FontFamily->"Snell Roundhand"],

True,
StringReplace[s,"␣"->"_"]//RightComposition[
ImportString[#,"TeX"]&,
FirstCase[#,c:Cell[_,"InlineFormula"|"NumberedEquation"]:>DisplayForm[c],Missing["NotAvailable"],∞]&
]
] 
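The contents of the two rule lists used above are not shown in the post. As a rough sketch only, rules of this kind could be built from the Wolfram Language named characters; everything below (the helper namedCharacter, the Sketch-suffixed names and the restriction to capital letters) is an assumption for illustration, not the author's actual data.

```wolfram
(* Hypothetical stand-ins for the iconized rule lists; capital letters only *)
capitals = CharacterRange["A", "Z"];

(* Build a named character such as \[DoubleStruckCapitalR] from its name *)
namedCharacter[prefix_String, letter_String] :=
  ToExpression["\"\\[" <> prefix <> letter <> "]\""];

blackboardBoldRulesSketch =
  Thread[capitals -> (namedCharacter["DoubleStruckCapital", #] & /@ capitals)];
frakturGothicRulesSketch =
  Thread[capitals -> (namedCharacter["GothicCapital", #] & /@ capitals)];

(* Applied the same way as in formatTeXSnippet *)
"R" /. blackboardBoldRulesSketch
```

A fuller table would presumably also cover lowercase letters and digits where named characters exist.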

We can test the results on the previously extracted snippets:

AssociationMap[formatTeXSnippet,extractTeXSnippets[body]]//KeyValueMap[List]//Grid[#,Frame->All]&

We can also test them on a completely different post:

Entity["StackExchange.Mathoverflow:Post", "40686"]["Body"] //
extractTeXSnippets // AssociationMap[formatTeXSnippet] //
KeyValueMap[List] // Grid[#, Frame -> All, Alignment -> Left] &

### Set Up TeX Snippets Property

This system works well, so we should make it easier to use. We can do this by hooking up these functions as a property for posts, keeping the formatting function separate so that analysis can still be done on the raw strings:

EntityProperty["StackExchange.Mathoverflow:Post","TeXSnippets"]["Label"]="TEX snippets";
EntityProperty["StackExchange.Mathoverflow:Post","TeXSnippets"]["DefaultFunction"]=Function[entity,
extractTeXSnippets[entity["Body"]]
];

Now we can just call the property on an entity instead:

Entity["StackExchange.Mathoverflow:Post", "67739"]["TeXSnippets"]//Map[formatTeXSnippet]

From here, it should be easy to extract all of the TeX fragments from all of the posts:

allTeXSnippets=EntityValue["StackExchange.Mathoverflow:Post","TeXSnippets"];

### Create a TeX Word Cloud

A simple way to analyze the TeX snippets is to count up the number of times each fragment is used:

teXToCount=allTeXSnippets//RightComposition[Flatten,Counts,ReverseSort];

There are almost one million unique TeX snippets used in the post bodies on MathOverflow:

teXToCount//Length

We can also make a simple word cloud from the top 100 snippets:

teXToCount[[;;100]]//KeyMap[formatTeXSnippet]//WordCloud[#,MaxItems->All,ImageSize->400]&

It’s easy to see that there are a lot of single-letter snippets. But there are a lot more interesting things hiding beyond these top 100. Let’s take a look at a few different cases!

### Integrals

Integrals are fairly easy to find with some simple string pattern matching:

integralToCount=KeySelect[teXToCount,StringMatchQ[("$"|"$$")~~(Whitespace|"")~~"\\int"~~__~~"$"]];
integralToCount//Length

Looking at the top 50 gives some interesting results, some very simple and some rather complex:

WordCloud[integralToCount[[;;50]]//KeyMap[formatTeXSnippet],ImageSize->400,WordSpacings->2]

### Analyze Equations

Another interesting subset of TeX snippets to consider is equations. Again, we can find these with some string pattern matching that requires an equals sign:

equations=KeySelect[teXToCount,StringMatchQ[("$"|"$$")~~(Whitespace|"")~~__~~" = "~~__~~"$"]];
equations//Length

Visualizing the top 50 gives mostly single-letter variable assignments to numbers:

WordCloud[equations[[;;50]]//KeyMap[formatTeXSnippet],ImageSize->400,WordSpacings->2]

### Equations of the Form <letter> = <number>

If we look at the single-letter variable assignments, we can find the minimum and maximum values of <number> for each <letter>.

Note that this includes a list of special characters, such as \alpha:

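(* In the notebook, the right-hand side is an iconized table of TeX commands and their characters, shown here only by its label *)
teXRepresentationToCharacter=TeX to character;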

specialLettersPattern=Alternatives@@Keys[teXRepresentationToCharacter];
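The full table of TeX commands is not shown in the post. The sketch below uses a small hypothetical subset (the names teXToCharacterSketch and specialLettersSketch are invented) to show how such a table feeds into a pattern for equations of the form letter = number:

```wolfram
(* Hypothetical subset of a TeX-command-to-character table *)
teXToCharacterSketch = {"\\alpha" -> "α", "\\beta" -> "β", "\\lambda" -> "λ"};
specialLettersSketch = Alternatives @@ Keys[teXToCharacterSketch];

(* Match "letter = number" snippets, mapping TeX commands to characters and uppercasing *)
assignments = StringCases[
  {"$\\alpha = 2$", "$n=10$"},
  "$" ~~ lhs : (specialLettersSketch | LetterCharacter) ~~ (Whitespace | "") ~~ "=" ~~
    (Whitespace | "") ~~ rhs : NumberString ~~ "$" :>
   (ToUpperCase[lhs /. teXToCharacterSketch] -> ToExpression[rhs])
]
(* {{"Α" -> 2}, {"N" -> 10}} *)
```

The first key in the result is the Greek capital alpha, not the Latin letter A, which is why special characters and plain letters stay distinct in the later tallies.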

variableEqualsNumberDistributions=equations//Keys//StringCases[("$$"|"$")~~lhs:(specialLettersPattern|LetterCharacter)~~(Whitespace|"")~~"="~~(Whitespace|"")~~rhs:NumberString~~(Whitespace|"")~~("$$"|"$"):>Rule[ToUpperCase[lhs/.teXRepresentationToCharacter],ToExpression[rhs]]]//Merge[Counts];

KeySort[MinMax/@Keys/@variableEqualsNumberDistributions]//KeyValueMap[Prepend[N@#2,#1<>" | "<>ToLowerCase[#1]]&]//Grid[#,Frame->All]&

It’s interesting to see that most letters take positive values, but S is strangely very negative. It’s also interesting to note the very large scale of U, V and W.

Perhaps not surprisingly, N is the most common letter, though its neighbor O is the least common:

ReverseSort[Total/@variableEqualsNumberDistributions]//BarChart[#//KeySort//Reverse,ChartLabels->Automatic,PlotTheme->"Detailed",ImageSize->400,BarOrigin->Left,AspectRatio->1.5]&

Trimming these single-variable assignments out of the original equation word cloud makes the results a bit more diverse:

WordCloud[
KeySelect[equations,StringMatchQ[("$$"|"$")~~(specialLettersPattern|LetterCharacter)~~(Whitespace|"")~~"="~~(Whitespace|"")~~NumberString~~(Whitespace|"")~~("$$"|"$")]/*Not][[;;50]]//KeyMap[formatTeXSnippet],
ImageSize->400,
WordSpacings->2
]

### Equations of the Form <letter> = <letter>

It’s interesting to see that there are a lot of letters assigned to (or compared with) another letter. We can make a simple graph that connects two letters in these equations, again taking into account special characters like \alpha:

letterGraph=Keys[equations]//StringCases[("$$"|"$")~~lhs:(specialLettersPattern|LetterCharacter)~~(Whitespace|"")~~"="~~(Whitespace|"")~~rhs:LetterCharacter~~(Whitespace|"")~~("$$"|"$"):>(DirectedEdge[lhs,rhs]/.teXRepresentationToCharacter)]//Flatten//Counts//KeySortBy[First];

The graph, without combining upper and lowercase letters, is quite messy:

Graph[
Union@Cases[Keys[letterGraph],_String,Infinity],
Keys[letterGraph],
GraphLayout->"CircularEmbedding",
EdgeWeight->Normal[letterGraph],
EdgeStyle->Normal[Opacity[N@(#/Max[letterGraph])]&/@letterGraph],
VertexLabels->Placed[Automatic,Center],
VertexLabelStyle->Directive[Bold,Small],
VertexSize->Large,
PlotTheme->"Web",
ImageSize->400
]

If we combine the upper and lowercase letters, the graph becomes a little bit cleaner:

upperCaseLetterGraph=Merge[Normal[letterGraph]/.c_String:>ToUpperCase[c],Total];
Graph[
Union@Cases[Keys[upperCaseLetterGraph],_String,Infinity],
Keys[upperCaseLetterGraph],
GraphLayout->"CircularEmbedding",
EdgeWeight->Normal[upperCaseLetterGraph],
EdgeStyle->Normal[Opacity[N@(#/Max[letterGraph])]&/@upperCaseLetterGraph],
VertexLabels->Placed[Automatic,Center],
VertexLabelStyle->Bold,
VertexSize->Large,
PlotTheme->"Web",
ImageSize->400
]

If we again remove these equation types, the word cloud becomes even cleaner:

WordCloud[
KeySelect[equations,StringMatchQ[
Alternatives[
("$$"|"$")~~(specialLettersPattern|LetterCharacter)~~(Whitespace|"")~~"="~~(Whitespace|"")~~NumberString~~(Whitespace|"")~~("$$"|"$"),
("$$"|"$")~~(specialLettersPattern|LetterCharacter)~~(Whitespace|"")~~"="~~(Whitespace|"")~~LetterCharacter~~(Whitespace|"")~~("$$"|"$")
]
]/*Not][[;;50]]//KeyMap[formatTeXSnippet],
ImageSize->400,
WordSpacings->2
]

### Functional Equations

Another interesting subset of equations to look into is functional equations. With a little bit of string pattern matching, we can find many examples:

functionalEquations=KeySelect[equations,StringMatchQ[___~~(f:_)~~"("~~__~~")"~~__~~(f:_)~~"("~~__~~")"~~___]];
functionalEquations//Length

functionalEquations[[;;16]]//Keys//Map[formatTeXSnippet]//Multicolumn[#,Frame->All]&

By focusing on functional equations that have one function with arguments on the left side of an equals sign, we get fewer results:

functionalEquations2=KeySelect[equations,StringMatchQ[("$$"|"$")~~(f:LetterCharacter)~~"("~~__~~")"~~(Whitespace|"")~~"="~~(Whitespace|"")~~__~~(f:_)~~"("~~__~~")"~~___~~("$$"|"$")]];
functionalEquations2//Length

functionalEquations2[[;;16]]//Keys//Map[formatTeXSnippet]//Multicolumn[#,Frame->All]&

However, we’ll need to go further to find equations that are easier to work with. Let’s limit ourselves to single-letter, single-argument functions:

functionalEquations3=KeySelect[equations,StringMatchQ[("$$"|"")~~(f:LetterCharacter)~~"("~~x:LetterCharacter~~")"~~(Whitespace|"")~~"="~~(Whitespace|"")~~__~~(f:_)~~"("~~x:_~~")"~~___~~("$$"|"$")]];
functionalEquations3//Length

functionalEquations3[[;;16]]//Keys//Map[formatTeXSnippet]//Multicolumn[#,Frame->All]&

This is much more pointed, but we can go further. If we limit ourselves to functional equations with only one equals sign with single, lowercased arguments that only consist of a single head and argument (modulo operators and parentheses), we find just six equations:

functionalEquations4=KeySelect[
equations,
StringMatchQ[s:(("$$"|"")~~(f:LetterCharacter)~~"("~~x:LetterCharacter?LowerCaseQ~~")"~~(Whitespace|"")~~"="~~(Whitespace|"")~~__~~(f:_)~~"("~~x:_~~")"~~___~~("$$"|"$"))/;(StringCount[s,"="]===1&&StringFreeQ[s,"\\"~~LetterCharacter..]&&Complement[Union[Characters[s]],{"$","^","(",")","-","+","=","{","}","."," ","0","1","2","3","4","5","6","7","8","9"}]===Sort[{f,x}])]
];
functionalEquations4//Length

functionalEquations4//Keys//Map[formatTeXSnippet]//Column[#,Frame->All]&

Interestingly, there are only two functionally unique equations in this list:

f(x) = 1 + x f(x)^2

f(x) = 1 + x^2 f(x)^2

If we clean up these functional equations, we can put them through Interpreter["TeXExpression"] to get actual Wolfram Language representations of them:

interpretedFunctionalEquations=Interpreter["TeXExpression"][StringReplace[f:LetterCharacter~~"("~~x:LetterCharacter~~")^"~~("{"|"")~~n:DigitCharacter~~("}"|""):>"("<>f<>"("<>x<>")"<>")^"<>n]/@Keys[functionalEquations4]]/.C[x_]:>F[x]

Finally, we can solve these equations with RSolve:

AssociationMap[
Replace[eqn:((f_)[x_]==rhs_):>RSolve[eqn,f[x],x]],
interpretedFunctionalEquations
]

### Analyze “Big O” Notation Arguments

Moving past equations, another common notation among mathematicians is big O notation.
Widely used in computational complexity and numerical error scaling, this notation should appear fairly often on MathOverflow. Let’s take a look by finding snippets wrapped in O that consist of a single argument and have equal numbers of opening and closing parentheses:

bigONotationArguments=StringCases[Keys[teXToCount],Shortest["O("~~args__~~")"]/;StringFreeQ[args,","|";"]&&StringCount[args,"("]===StringCount[args,")"]:>"$$"<>args<>"$$"]//Flatten//Counts//ReverseSort;
bigONotationArguments//Length

The results are varied:

bigONotationArguments[[;;100]]//KeyMap[formatTeXSnippet]//WordCloud[#,WordSpacings->3]&

One can note that many of these results are functionally equivalent: they differ only in the letter chosen for the variable. We can clean these cases up with a little bit of effort:

normalizeBigOStrings=StringReplace[
{
(* Any constant number -> 1 *)
"$$"~~NumberString~~"$$"->"$$1$$",
"$$"~~LetterCharacter~~"$$"->"$$n$$",
"$$"~~LetterCharacter~~"^"~~exp:(DigitCharacter|("{"~~__~~"}"))~~"$$":>"$$n^"<>exp<>"$$",
"$$"~~LetterCharacter~~"\\log "~~LetterCharacter~~"$$":>"$$n\\log n$$",
"$$"~~b:DigitCharacter~~"^"~~x_~~"/"~~x_~~"$$":>"$$"<>b<>"^n/n$$",
"$$"~~numerator:DigitCharacter~~"/"~~LetterCharacter~~"$$":>"$$"<>numerator<>"/n$$",
"$$"~~factor:DigitCharacter~~LetterCharacter~~"$$":>"$$"<>factor<>"n$$",
"$$1/"~~LetterCharacter~~"^"~~exp:(DigitCharacter|("{"~~__~~"}"))~~"$$":>"$$1/n^"<>exp<>"$$",
"$$1/|"~~LetterCharacter~~"|^"~~exp:(DigitCharacter|("{"~~__~~"}"))~~"$$":>"$$1/|n|^"<>exp<>"$$",
"$$"~~LetterCharacter~~"^"~~exp:(DigitCharacter|("{"~~__~~"}"))~~"\\log "~~LetterCharacter~~"$$":>"$$n^"<>exp<>"\\log n$$",
"$$|"~~LetterCharacter~~"|^"~~exp:(DigitCharacter|("{"~~__~~"}"))~~"$$":>"$$|n|^"<>exp<>"$$",
"$$\\"~~op:("log"|"dot")~~Whitespace ~~LetterCharacter~~"$$":>"$$\\"<>op<>" n$$",
"$$\\log|"~~LetterCharacter~~"|$$"->"$$\\log|n|$$",
"$$\\sqrt{"~~LetterCharacter~~"}$$":>"$$\\sqrt{n}$$",
"$$"~~LetterCharacter~~"/\\log "~~LetterCharacter~~"$$":>"$$n/\\log n$$"
}/.LetterCharacter->(specialLettersPattern|LetterCharacter)
];

normalizedBigOArguments=Normal[bigONotationArguments]//RightComposition[
GroupBy[#,First/*normalizeBigOStrings->Last,Total]&,
ReverseSort
];

Now the data is much cleaner:

normalizedBigOArguments[[;;15]]//KeyMap[formatTeXSnippet]//Dataset

normalizedBigOArguments[[;;100]]//Keys//Map[formatTeXSnippet]//Multicolumn

And the word cloud looks much nicer:

wc=normalizedBigOArguments[[;;50]]//KeyMap[formatTeXSnippet]//WordCloud[#,WordSpacings->3,ImageSize->400,ScalingFunctions->Sqrt]&

Lastly, since these are arguments to O, let’s set the word cloud as an argument of O to make a nice picture:

Style[HoldForm[O][wc],90]//TraditionalForm

## Mentioned Propositions and Mathematicians

Another way to analyze MathOverflow is to look at the mathematical propositions and famous mathematicians that are mentioned in the post bodies. An easy way to do this is to use more entity stores to keep track of the different types.

### Mathematical Propositions: Build EntityStore

To begin, let’s set up an EntityStore for mathematical propositions and their types. Specifically, we can set up "MathematicalPropositionType" for “base” words like “theorem,” “hypothesis” and “conjecture,” and "MathematicalProposition" for specific propositions like the “mean value theorem” and “Zorn’s lemma.”

The proposition types will serve as a means of programmatically finding the specific propositions, so we’ll need to pre-populate "MathematicalPropositionType" with entities. The "MathematicalProposition" type can start out with no entities; we’ll populate it in the next step by processing the post bodies.

Note that I’ve added some properties to keep track of the propositions found in each post. Specifically, "Wordings" will hold an Association with strings for the keys and the counts of each of those strings for the values.
Additionally, we’ll set up "MentionedPostCount" to keep track of the number of posts that mention each proposition:

propositionStore=EntityStore[
{
"MathematicalPropositionType"-><|
"Entities"-><|
"Theorem"-><|"Label"->"theorem"|>,
"Hypothesis"-><|"Label"->"hypothesis"|>,
"Principle"-><|"Label"->"principle"|>,
"Conjecture"-><|"Label"->"conjecture"|>,
"Thesis"-><|"Label"->"thesis"|>,
"Lemma"-><|"Label"->"lemma"|>,
"Corollary"-><|"Label"->"corollary"|>,
"Axiom"-><|"Label"->"axiom"|>
|>,
"Properties"-><|
"Label"-><|"Label"->"label"|>
|>
|>,
"MathematicalProposition"-><|
"Entities"-><||>,
"Properties"-><|
"PropositionType"-><|
"Label"->"proposition type"
|>,
"Wordings"-><|
"Label"->"wordings",
"DefaultFunction"->Function[<||>],
"FormattingFunction"->ReverseSort
|>,
"Label"-><|
"Label"->"label",
"DefaultFunction"->Function[entity,entity["Wordings"]//Keys//First]
|>,
"MentionedPostCount"-><|
"Label"->"mentioned post count",
"DefaultFunction"->Function[0]
|>
|>
|>
}
];
EntityUnregister/@propositionStore[];
EntityRegister[propositionStore]

### Populate

Now that the EntityStore is set up and registered, we can use the properties I set up in the store.
Let’s start with a list of theorems that don’t have names in them:

$specialTheorems={"prime number theorem","central limit theorem","implicit function theorem","spectral theorem","incompleteness theorem","universal coefficient theorem","intermediate value theorem","mean value theorem","uniformization theorem","inverse function theorem","four color theorem","binomial theorem","index theorem","fundamental theorem of algebra","residue theorem","dominated convergence theorem","open mapping theorem","ergodic theorem","fundamental theorem of calculus","h-cobordism theorem","closed graph theorem","modularity theorem","adjoint functor theorem","geometrization theorem","primitive element theorem","fundamental theorem of arithmetic","fixed point theorem","4-color theorem","four colour theorem","isotopy extension theorem","proper base change theorem","well-ordering theorem","loop theorem","slice theorem","odd order theorem","isogeny theorem","group completion theorem","convolution theorem","reconstruction theorem","equidistribution theorem","contraction mapping theorem","principal ideal theorem","ergodic decomposition theorem","orbit-stabilizer theorem","4-colour theorem","tubular neighborhood theorem","three-squares theorem","martingale representation theorem","purity theorem","triangulation theorem","multinomial theorem","graph minor theorem","strong approximation theorem","universal coefficients theorem","localization theorem","positive mass theorem","identity theorem","cellular approximation theorem","transfer theorem","bounded convergence theorem","fundamental theorem of symmetric functions","subadditive ergodic theorem","annulus theorem","rank-nullity theorem","elliptization theorem"};

Next, we can build a function that will introduce new "MathematicalProposition" entities, keeping track of how often they are mentioned, their types and specific wordings for later use in cleaning things up.

Note that we strip off any possessives and remove special characters via RemoveDiacritics:

ToCamelCase[s_String]:=First@ToCamelCase[{s}];
ToCamelCase[s:{__String}]:=StringSplit[RemoveDiacritics[s]]//Map[Capitalize/*StringJoin];
toStandardName=ToCamelCase[StringReplace[StringRiffle@StringTrim[StringSplit[#],"'s"],{"'"->"",Except[LetterCharacter|DigitCharacter]->" "}]]&;
With[
{propositionTypePattern=Alternatives@@EntityValue[EntityList["MathematicalPropositionType"],"Label"]},
addPropositionEntity[proposition_String]:=With[
{
entity=Entity["MathematicalProposition",toStandardName[proposition]]
},

(* Keep track of mentions *)
entity["MentionedPostCount"]=Replace[entity["MentionedPostCount"],{i_Integer:>i+1,_->0}];

(* Keep track of specific wordings and their counts *)
entity["Wordings"]=Replace[Replace[entity["Wordings"],Except[_Association]-><||>],a_Association:>Append[a,proposition->Lookup[a,proposition,0]+1]];

(* Extract PropositionType *)
entity["PropositionType"]=StringCases[proposition,propositionTypePattern,IgnoreCase->True]//Replace[{{x_String,___}:>Entity["MathematicalPropositionType",Capitalize[ToLowerCase[x]]],_->Missing["NotAvailable"]}];

entity
]
];
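The wording counter used inside the function can be tried on its own. The following is a minimal sketch on made-up strings (they do not come from actual posts); `countWording` is a hypothetical helper name that only mirrors the Append/Lookup idiom above and is not part of the utility package.

```wolfram
(* Hypothetical helper: increment the count stored under a wording,
   starting from 0 when the wording has not been seen before *)
countWording[counts_Association, wording_String] :=
  Append[counts, wording -> Lookup[counts, wording, 0] + 1];

(* Made-up mentions, not taken from actual posts *)
mentions = {"Zorn's lemma", "Zorn lemma", "Zorn's lemma"};

(* Fold threads the running association through every mention *)
wordingCounts = Fold[countWording, <||>, mentions];

(* Most frequent wording first *)
ReverseSort[wordingCounts]
(* <|"Zorn's lemma" -> 2, "Zorn lemma" -> 1|> *)
```

Sorting by count is what the "FormattingFunction" of "Wordings" does when the property is displayed, so the most frequent wording appears first.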

Note that there are currently no proposition entities:

EntityList["MathematicalProposition"]

But if we run the list of special theorems through the function…

addPropositionEntity/@$specialTheorems;

… then there are proposition entities defined:

EntityList["MathematicalProposition"]//Take[#,UpTo[5]]&

We should reset counters for the introduced entities to keep things uniform (the list I provided was fabricated; those strings did not come from actual posts, so not resetting these values may throw off the numbers a bit):

#["MentionedPostCount"]=0;&/@EntityList["MathematicalProposition"];
#["Wordings"]=Association[#["Label"]->1];&/@EntityList["MathematicalProposition"];

Of course, we can go further and detect other forms of propositions. Specifically, let’s look for propositions of the following forms:

1. One of the special theorems we just introduced
2. “<person name> theorem” (and similar)
3. “theorem of <person name>” (and similar)

When we find these propositions, we can add them as entities to the proposition EntityStore (via addPropositionEntity), as well as store them with the posts so lookups are faster (as they will already be stored in memory through the EntityStore).

To start, we’ll need to do some normalization. Here’s a useful function that uses a list of words that should always be lowercased:

$lowercaseWords= list of words that should be lowercased;
lowerCaseSpecificWords= StringReplace[(#->ToLowerCase[#])&/@$lowercaseWords];

Additionally, here’s a list of ordinals and how to normalize them (including “Last,” as in “Fermat’s last theorem”):

$ordinalToWord=<|"1st"->"First","2nd"->"Second","3rd"->"Third","4th"->"Fourth","5th"->"Fifth","6th"->"Sixth","7th"->"Seventh","8th"->"Eighth","9th"->"Ninth","10th"->"Tenth","last"->"Last"|>;
$ordinals=Join[
Flatten[{ToLowerCase[#],#}&/@Values[$ordinalToWord]],
List@@Keys[$ordinalToWord]
]//ReverseSortBy[StringLength];

Now we can create a function to extract propositions from strings, normalize them with normalizeString from earlier and then create new "MathematicalProposition" entities using addPropositionEntity:

With[
{
propositionTypePattern=EntityList["MathematicalPropositionType"]//EntityValue[#,"Label"]&//Join[#,Capitalize/@#]&//Apply[Alternatives],
upperCaseWordPattern=(WordBoundary|Whitespace ~~(_?UpperCaseQ ~~ (LetterCharacter| "-"|"'")..)~~WordBoundary|Whitespace),
anyCaseWordPattern=(WordBoundary|Whitespace ~~((Alternatives@@$ordinals)|( (LetterCharacter| "-"|"'")..))~~WordBoundary|Whitespace),
possibleOrdinalPattern=(Alternatives@@$ordinals~~Whitespace)|""
},
extractNamedPropositions=RightComposition[
normalizeString,
DeleteDuplicates@Join[
(* Case 1: E.g. "central limit theorem" *)
StringCases[#,Alternatives@@$specialTheorems,IgnoreCase->True],

StringCases[
#,
Alternatives[

(* Case 3: "(nth) Theorem of Something (Something (Something))" *)
Shortest[(WordBoundary|Whitespace)~~possibleOrdinalPattern~~propositionTypePattern ~~ Whitespace~~"of"~~Longest@Repeated[upperCaseWordPattern,3]],

(* Case 2: "Something (something (something)) Theorem" *)
Shortest[(WordBoundary|Whitespace)~~(x:upperCaseWordPattern)~~Longest@Repeated[anyCaseWordPattern,2]~~propositionTypePattern]/;(Not@StringMatchQ[StringTrim[x],"The"|"A"|"Use",IgnoreCase->True])
]
]
]&,

(* Remove cases with useless words in them *)
Select[
With[
(* Ignore "of" so that case #3 is allowed *)
{split=DeleteCases[StringSplit[ToLowerCase[#],Whitespace|"-"],"of"]},
split===DeleteCases[DeleteStopwords[split],"using"|"like"|"phd"|"understanding"|"finally"|"concerning"|"regarding"|"ℳℋ"|""|"satisfies"|"following"|"stated"|"usually"|"implies"|"hence"|"course"|"assuming"|"wikipedia"|"article"|"usual"|"actually"|"analysis"|"entitled"|"apply"]
]&
],

StringReplace[Normal@$ordinalToWord],
lowerCaseSpecificWords,
StringTrim,
DeleteDuplicates,
Map[addPropositionEntity]
]
];

Let’s try the function on a simple example:

Entity["StackExchange.Mathoverflow:Post", "40686"]["Body"]//extractNamedPropositions

We can see that an entity was added to the store:

EntityList["MathematicalProposition"][[-3;;]]

We can also see that its properties were populated:

Entity["MathematicalProposition", "MartinAxiom"]["PropertyAssociation"]

Of course, we can automate this a bit more by introducing this function as a property for MathOverflow posts that will store the results in the EntityStore itself:

EntityProperty["StackExchange.Mathoverflow:Post","NamedPropositions"]["Label"]="named propositions";
EntityProperty["StackExchange.Mathoverflow:Post","NamedPropositions"]["DefaultFunction"]=Function[entity,entity["NamedPropositions"]=extractNamedPropositions[entity["Body"]]];

Let’s test out the property on the same Entity as before:

Entity["StackExchange.Mathoverflow:Post", "40686"]["NamedPropositions"]

We can see that the in-memory store has been populated:

Entity["MathematicalProposition"]["EntityStore"]

Pro tip: in case you want to continue to work on an EntityStore you’ve been modifying in-memory in a future Wolfram Language session, you can Export Entity["type"]["EntityStore"] to an MX file and then Import it in the new session. Just don’t forget to register it with EntityRegister!
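The round trip described in the tip can be sketched on a toy store. Everything named here is made up for illustration: the "ToyProposition" type, its single entity and the file name are assumptions, and the `Entity["type"]["EntityStore"]` form is taken from the tip itself rather than verified independently.

```wolfram
(* Toy store; the type name "ToyProposition" and its contents are made up *)
toyStore = EntityStore[
   "ToyProposition" -> <|
     "Entities" -> <|
       "ZornLemma" -> <|"Label" -> "Zorn's lemma", "MentionedPostCount" -> 0|>
     |>,
     "Properties" -> <|
       "Label" -> <|"Label" -> "label"|>,
       "MentionedPostCount" -> <|"Label" -> "mentioned post count"|>
     |>
   |>
];
Quiet[EntityUnregister["ToyProposition"]];
EntityRegister[toyStore];

(* Modify the in-memory copy, as the post does for its own stores *)
Entity["ToyProposition", "ZornLemma"]["MentionedPostCount"] = 3;

(* Save the modified store; the file location is an arbitrary choice *)
file = FileNameJoin[{$TemporaryDirectory, "toy-proposition-store.mx"}];
Export[file, Entity["ToyProposition"]["EntityStore"]];

(* Later: import the saved store and register it again *)
Quiet[EntityUnregister["ToyProposition"]];
restored = Import[file];
EntityRegister[restored];
Entity["ToyProposition", "ZornLemma"]["MentionedPostCount"]
```

The second EntityUnregister only matters when both halves run in the same session; in a fresh session the type is not registered yet.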

At this point, we can now gather propositions mentioned in all of the MathOverflow posts, taking care to reset the counters again to avoid contamination of the results:

(* Reset counters again to avoid contaminating the results *)
#["MentionedPostCount"]=0;&/@EntityList["MathematicalProposition"];
#["Wordings"]=Association[#["Label"]->1];&/@EntityList["MathematicalProposition"];

Computing the "NamedPropositions" property for every post will take a while to run (it took about 20 minutes on my machine), but it will allow for a very thorough analysis of the site’s content:

postToNamedPropositions=EntityValue["StackExchange.Mathoverflow:Post","NamedPropositions","EntityAssociation"];

After processing all of the posts, there are now over 10k entities in the proposition EntityStore:

EntityValue["MathematicalProposition","Entities"]//Length

### Data Cleanliness

Having kept track of the wordings for each proposition was a good choice—now we can see that proposition entities will format with the most commonly used wording. For example, look at Stokes’ theorem:

Entity["MathematicalProposition", "StokesTheorem"]["Wordings"]

It’s named after George Gabriel Stokes, and so the correct possessive form ends in “s’,” not “’s,” despite about 15 percent of mentions using the incorrect form.

I’ll admit that this normalization is not perfect—when someone removes the first “s” altogether, it is picked up in a different entity:

Entity["MathematicalProposition", "StokeTheorem"]["Wordings"]

Rather than spend a lot of time and effort to normalize these small issues, I’ll move on and work around these problems for now.

### Proposition Analysis

Now that we have a lot of data on the propositions mentioned in the post bodies, we can visualize the most commonly mentioned propositions in a word cloud:

topPropositionToCount=ReverseSort@EntityValue[EntityClass["MathematicalProposition",{"MentionedPostCount"->TakeLargest[100]}],"MentionedPostCount","EntityAssociation"];

WordCloud[topPropositionToCount[[;;20]],ImageSize->400,WordOrientation->{{π/4,-(π/4)}}]

It seems that the prime number theorem is the most commonly mentioned:

topPropositionToCount[[;;20]]//Dataset

We can also see that about two-thirds of all propositions are theorems:

propositionTypeBreakdown=EntityValue["MathematicalProposition","PropositionType"]//Counts//ReverseSort

PieChart[propositionTypeBreakdown,ChartLabels->Placed[Automatic,"RadialCallout"],ImageSize->400,PlotRange->All]

### Mathematicians

Now that we have the propositions, we can look for mathematician names in the propositions.

### Build

To start, we can find all of the labels for the propositions:

propositionToLabel=EntityValue["MathematicalProposition","Label","EntityAssociation"];

Next, we’ll need to find the words in the labels, drop proposition types and stopwords and count them up (taking care to not separate names that start with “de” or “von”):

commonWordsInPropositions=propositionToLabel//RightComposition[
Values,
StringReplace[prefix:"Van der"|"de"|"De"|"von"|"Von"~~(Whitespace|"-")~~name:(_?UpperCaseQ~~LetterCharacter):>StringReplace[prefix,Whitespace->"_"]<>"_"<>name],
StringSplit/*Flatten/*DeleteStopwords,
Counts/*ReverseSort,
KeyDrop[Flatten@EntityValue[EntityClass["MathematicalPropositionType",All],{"CanonicalName","Label"}]]
];

Next, we can look for groups of mathematician names joined by dashes (taking care to remove words that are obviously not names):

namePattern=((_?UpperCaseQ|("von"|"de"))~~(LetterCharacter|"_")..);
mathematiciansJoinedByDashes=Select[Keys[commonWordsInPropositions],StringMatchQ[namePattern~~Repeated["-"~~namePattern,{1,Infinity}]]]//StringSplit[#,"-"]&//Flatten//Union//StringReplace["_"->" "];
mathematiciansJoinedByDashes=DeleteCases[mathematiciansJoinedByDashes,"Anti"|"Foundation"|"Max"|"Flow"|"Min"|"Cut"];
mathematiciansJoinedByDashes//Length

Another source of names is possessive words, as in “Zorn’s lemma”:

possesiveNames=Keys[commonWordsInPropositions]//Select[StringEndsQ["'s"|"s'"]];
possesiveNames//Length

After some cleanup (e.g. removing inline snippets and references to the Clay Mathematics Institute’s Millennium Prize problems), most of these are likely last names for mathematicians:

mathematicansByPossessives=StringTrim[possesiveNames,"'s"|"s'"]//RightComposition[
StringSplit[#,"-"]&,
Flatten/*DeleteDuplicates,
StringReplace["_"->" "],
Select[StringMatchQ[#,_?UpperCaseQ~~(_?LowerCaseQ)~~___]&],
Select[StringFreeQ["ℳℋ"]]
];
mathematicansByPossessives=DeleteCases[mathematicansByPossessives,"Clay"|"Millennium"];
mathematicansByPossessives//Length
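One detail is worth knowing when adapting the possessive-stripping step. The sketch below uses made-up strings and a hypothetical helper, `stripPossessive`, to compare trimming with the "'s"|"s'" pattern against a variant that removes only the apostrophe after a trailing "s". It is an illustration under those assumptions, not a change to the code above.

```wolfram
(* Made-up possessive forms, not taken from the post data *)
names = {"Zorn's", "Gauss's", "Stokes'"};

(* Trimming "s'" also removes the final "s" of the name itself *)
StringTrim[names, "'s" | "s'"]
(* {"Zorn", "Gauss", "Stoke"} *)

(* Hypothetical variant: drop a trailing "'s", otherwise drop only a trailing apostrophe *)
stripPossessive = StringReplace[{"'s" ~~ EndOfString -> "", "'" ~~ EndOfString -> ""}];
stripPossessive[names]
(* {"Zorn", "Gauss", "Stokes"} *)
```

Which behavior is preferable depends on the data: the first form is shorter, while the second keeps surnames that genuinely end in “s” intact.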

Combining these two lists should result in a fairly complete list of mathematician names:

allMathematicians=Union[mathematiciansJoinedByDashes,mathematicansByPossessives];
allMathematicians//Length

The results seem pretty decent:

Multicolumn[RandomSample[allMathematicians,16]]

Now we need to find possible ways to write down the names of each mathematician.

After looking through the data, I found a few cases that needed to be corrected manually. Specifically, full names are written out for Jacob Lurie and R. Ranga Rao, so those two entries need to be corrected to clean up the results a bit:

mathematicianToPossibleNames=GroupBy[allMathematicians,RemoveDiacritics/*StringReplace["_"->" "]/*toStandardName];
mathematicianToPossibleNames["Lurie"]=PrependTo[mathematicianToPossibleNames["Lurie"],"Jacob Lurie"];
mathematicianToPossibleNames["Ranga"]=PrependTo[mathematicianToPossibleNames["Ranga"],"Ranga Rao"];

Now, we need to construct an Association to point from individual name words to their mathematicians:

possibleNameToMathematicians=mathematicianToPossibleNames//RightComposition[
Flatten,
GroupBy[Last->First],
Map[Entity["Mathematician",#]&/@Flatten[#]&]
];
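The inversion idiom used here, GroupBy with Last->First, is easier to see on a small made-up association. In this sketch the data and the explicit Thread step that produces the rules are assumptions for illustration; the pipeline above may arrive at its rules differently.

```wolfram
(* Made-up data: canonical key -> spellings seen in labels *)
keyToSpellings = <|
   "DeFinetti" -> {"de Finetti", "De Finetti"},
   "Lurie" -> {"Jacob Lurie", "Lurie"}
|>;

(* Expand each entry into individual key -> spelling rules *)
rules = Flatten[KeyValueMap[Thread[#1 -> #2] &, keyToSpellings]];

(* Group by the spelling (Last of each rule) and collect the keys (First) *)
spellingToKeys = GroupBy[rules, Last -> First]
(* <|"de Finetti" -> {"DeFinetti"}, "De Finetti" -> {"DeFinetti"},
     "Jacob Lurie" -> {"Lurie"}, "Lurie" -> {"Lurie"}|> *)
```

Because the grouped values are lists, a spelling shared by several keys would simply collect all of them, which is why the lookup later has to be flattened and de-duplicated.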

Note that these entities do not format—we’ll create the EntityStore for them soon:

possibleNameToMathematicians[[;;10]]

From here, we can break the proposition labels into words that we can use to look up mathematicians (taking care to fix a few cases that need special attention for full names written out):

propositonToLabelWords=propositionToLabel//RightComposition[
Map[
StringReplace[prefix:("De"|"de"|"von"|"Von"|"Jacob"|"Alex"|"Yoon"|"Ranga")~~Whitespace|"-"~~name:(namePattern|"Ho"|"Lurie"|"Lee"|"Rao"):>prefix<>"_"<>name]/*(StringSplit[#,Whitespace|"-"]&)/*(StringTrim[#,"'s"|"s'"]&)/*StringReplace["_"->" "]
]
];

propositionToMathematicians=DeleteDuplicates@Flatten@DeleteMissing@Lookup[possibleNameToMathematicians,#]&/@propositonToLabelWords;

With this done, we can add this data to the proposition store as a new property:

EntityProperty["MathematicalProposition","NamedMathematicians"]["Label"]="named mathematicians";
KeyValueMap[(#1["NamedMathematicians"]=#2)&,propositionToMathematicians];

And by rearranging the data, we can build the data to create an EntityStore for mathematicians:

mathematicanToPropositions=propositionToMathematicians//RightComposition[
Flatten,
GroupBy[Last->First]
];

KeyTake[mathematicanToPropositions,{Entity["Mathematician", "DeFinnetti"],Entity["Mathematician", "Lurie"],Entity["Mathematician", "Ranga"]}]

### Populate

Using this data, we can now build an EntityStore for mathematicians, taking into account the data we’ve accumulated:

mathematicanData=Association@KeyValueMap[
Rule[
#1,
<|
"Label"->First[#2],
"PossibleNames"->#2,
"NamedPropositions"->Lookup[mathematicanToPropositions,Entity["Mathematician",#1]]
|>
]&,
mathematicianToPossibleNames
];

mathematicianStore=EntityStore[
{
"Mathematician"-><|
"Entities"->mathematicanData,
"Properties"-><|
"Label"-><|"Label"->"label"|>,
"PossibleNames"-><|"Label"->"possible names"|>,
"NamedPropositions"-><|"Label"->"named propositions"|>
|>
|>
}
];
EntityUnregister/@mathematicianStore[];
EntityRegister[mathematicianStore]

Now we can see the propositions named after a specific mathematician, such as Euler:

EntityValue[Entity["Mathematician", "Euler"],"NamedPropositions"]

### Analysis

At this point, many interesting queries are possible.

As a start, let’s look at a network that connects two mathematicians if they appear in the same proposition name. It sounds somewhat similar to the Mathematics Genealogy Project.

An example might be the “Grothendieck–Tarski axiom”:

Entity["MathematicalProposition", "GrothendieckTarskiAxiom"]["NamedMathematicians"]

First, grab all of the named mathematicians for each proposition, taking only those that have at least two:

groupedMathematicianData=Select[EntityValue["MathematicalProposition","NamedMathematicians","EntityAssociation"],Length[#]>=2&];

groupedMathematicianData[[;;10]]

With a little bit of analysis, we can see that the majority of propositions have two mathematicians, fewer have three, there are a few groups of four and there is only one group of five:

groupedMathematicianData//Values//CountsBy[Length]//KeySort

Here is the group of five, mentioned in this answer:

TakeLargestBy[groupedMathematicianData,Length,1]

Here are the top 25 results:

TakeLargestBy[groupedMathematicianData,Length,25]//Map[Apply[UndirectedEdge]]//KeyValueMap[List]//Grid[#,Alignment->Center,Dividers->All]&

We can complete our task by constructing a network of all grouped mathematicians:

groupedMathematicianEdges=Flatten[(UndirectedEdge@@@Subsets[#,{2}])&/@Values[groupedMathematicianData]];
groupedMathematicianEdges//Length
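To see what the Subsets step produces, here is a small sketch on made-up groups. The Sort call is an addition of this sketch, not something the code above does: Counts treats an edge from A to B and one from B to A as different expressions, so sorting each group first is one way to make repeated pairings add up.

```wolfram
(* Made-up groups of co-named mathematicians *)
groups = {{"A", "B", "C"}, {"B", "A"}};

(* Every pair within a group becomes one undirected edge;
   Sort puts each group into a canonical order first *)
edges = Flatten[(UndirectedEdge @@@ Subsets[Sort[#], {2}]) & /@ groups];

Counts[edges]
(* <|UndirectedEdge["A", "B"] -> 2, UndirectedEdge["A", "C"] -> 1,
     UndirectedEdge["B", "C"] -> 1|> *)
```

A group of n names contributes n(n-1)/2 edges, so the single group of five alone accounts for ten of them.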

The network is quite large and messy.

Graph[groupedMathematicianEdges,ImageSize->400]

With some more processing, we can add weights for repeated edges and use them to determine their opacity:

weightedMathematicanEdges=ReverseSort[Counts[groupedMathematicianEdges]];

connectedMathematiciansGraph=Graph[
Keys[weightedMathematicanEdges],
EdgeWeight->Normal[weightedMathematicanEdges],
EdgeStyle->Normal[Opacity[Sqrt@N[#/Max[weightedMathematicanEdges]]]&/@weightedMathematicanEdges],
PlotTheme->"LargeGraph",
ImageSize->400
]
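The effect of the square-root scaling on edge opacity can be checked on a few made-up weights. This is a sketch with invented edges and counts, using the same Opacity and Normal idiom as the graph above.

```wolfram
(* Made-up edge weights *)
weights = <|
   UndirectedEdge["A", "B"] -> 9,
   UndirectedEdge["A", "C"] -> 4,
   UndirectedEdge["B", "C"] -> 1
|>;

(* Square-root scaling relative to the heaviest edge *)
opacities = Sqrt@N[#/Max[weights]] & /@ weights;
Values[opacities]
(* approximately {1., 0.666667, 0.333333} *)

Graph[
  Keys[weights],
  EdgeWeight -> Normal[weights],
  EdgeStyle -> Normal[Opacity /@ opacities],
  VertexLabels -> Automatic
]
```

With linear scaling the lightest edge here would get an opacity of about 0.11; the square root lifts it to about 0.33, which keeps rare pairings visible next to very common ones.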

Here are the most common pairings of mathematicians:

weightedMathematicanEdges[[;;10]]

And here’s an easy way to see who has the most propositions named after them, which can be done with SortedEntityClass, new in Mathematica 12:

mathematicianToPropositions=EntityValue[
SortedEntityClass["Mathematician",EntityFunction[entity,Length[entity["NamedPropositions"]]]->"Descending"],
"NamedPropositions",
"EntityAssociation"
];
mathematicianToPropositions//Length
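The same SortedEntityClass pattern can be tried without the MathOverflow data. The sketch below registers a toy store whose type name, entities and property values are all made up; it assumes Version 12, where SortedEntityClass is available.

```wolfram
(* Toy store; type name, entities and property values are made up *)
toyStore = EntityStore[
   "ToyMathematician" -> <|
     "Entities" -> <|
       "A" -> <|"NamedPropositions" -> {"p1"}|>,
       "B" -> <|"NamedPropositions" -> {"p1", "p2", "p3"}|>,
       "C" -> <|"NamedPropositions" -> {"p1", "p2"}|>
     |>
   |>
];
Quiet[EntityUnregister["ToyMathematician"]];
EntityRegister[toyStore];

(* Sort entities by how many propositions each one has, largest first *)
sorted = SortedEntityClass[
   "ToyMathematician",
   EntityFunction[e, Length[e["NamedPropositions"]]] -> "Descending"
];

(* Number of propositions per entity, in sorted order *)
Length /@ EntityValue[sorted, "NamedPropositions", "EntityAssociation"]
```

Expressing the ordering as an entity class keeps the sort criterion next to the query, so the same class can be reused with other properties or with EntityList.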

Here are the top 25, along with the number of propositions in which they are mentioned on MathOverflow:

Length/@mathematicianToPropositions[[;;25]]

Despite his common appearance in theorems, Euler is surprisingly very low on this list at number 61:

Position[Keys[mathematicianToPropositions],Entity["Mathematician", "Euler"]]

And here’s the full distribution of mentions (on a log scale for easier viewing):

Histogram[
Length/@Values[mathematicianToPropositions],
PlotRange->Full,
ImageSize->400,
PlotTheme->"Detailed",
ScalingFunctions->"Log",
FrameLabel->{"# of Named Propositions","# of Mathematicians"}
]

## Do It Yourself

Be sure to explore further using the newest functions coming soon in Version 12 of the Wolfram Language!

Although I used a lot of Wolfram technology to explore the data on MathOverflow.net, here are some open areas that still remain to be explored:

• What fraction of all questions are unanswered?
• Find named mathematical structures (e.g. integral, group, Riemann sum, etc.)
• Align entities to MathWorld for further investigation
• Investigate the time series ratio of user count to post count

In fact, you can download and register the entity stores used in this post here:


 Andrew Thank you very much for providing such an interesting blog. A number of questions: Is it possible to get the notebook for this blog, rather than having to retype everything? Secondly how does this link in with the recent attempts by WRI and Eric from MathWorld to create an EntityStore on Algebraic Topology, which was covered in a number of WTC 2017 talks? Lastly, if it possible to create EntityStores from the Stack Exchange archives , is it also possible to create EntityStores on mathematical theorems from the corresponding Wikipedia articles using WikipediaSearch and WikipediaData ? Thanks again for your blog and any responses to my questions Michael Posted by Michael Kelly    February 1, 2019 at 3:57 pm
 Thanks for reading! I’m glad you enjoyed it! I am not familiar with the MathWorld EntityStore you’ve mentioned, but if anything, parts of this blog can used as a stepping stone to create your own EntityStores based on any data source. Concerning EntityStores created from Wikipedia: I myself haven’t done this, but it sounds like an interesting project, and could probably be done without too much trouble. I’d be curious to see what results can be found from this data! Posted by Wolfram Blog    February 4, 2019 at 12:18 pm
 Well done Andrew! Pretty impressive data processing functionalities of Version 12. I can’t wait to get a copy of it. Love the wordsMuchMoreCommonInMO graphics. Posted by Thomas Bruckner    February 2, 2019 at 4:50 pm
 (1) As useful to Mathematica users, perhaps, as MathOverflow.net, would be an entity store of https://mathematica.stackexchange.com. (2) Is there a button somewhere that I’m missing to allow downloading this blog post as a notebook? Or at the very least, cause double clicking displayed code in this blog post to be copyable as text, so that it can be pasted into a notebook. Just having a bunch of .jpg’s showing code is very unhelpful! Posted by Murray Eisenberg    February 3, 2019 at 11:12 am
 Thanks for reading! I hope you enjoyed it! (1) Yes, I have looked at and analyzed mathematica.stackexchange.com, as well as several other SE sites, and I have found some interesting things! I hope to do another blog post, and in particular, cover mathematica.stackexchange.com in greater detail in the near future. (2) I certainly agree about the lack of a notebook and copyable code! Some functionality in the post requires the upcoming Version 12 (which, as of this writing, is unreleased). I didn’t want to give out the code as some would not work properly in older versions (e.g. the SPARQL query parts). Once Version 12 does is released, the notebook and copyable code will be available. Sorry for the inconvenience! Posted by Wolfram Blog    February 4, 2019 at 12:17 pm
 could we have https://gamedev.stackexchange.com as an entity store? or maybe only https://gamedev.stackexchange.com/questions/tagged/unity Thank you, Posted by Luc Barthelet    February 5, 2019 at 1:59 pm
 Hi Luc, I made an EntityStore of gamedev.stackexchange.com with the current version of the utility and hosted it in the Wolfram Cloud. You can get it with this WL code (it’s around 600 MB large): gameDevSEStore = Import[CloudObject["https://www.wolframcloud.com/objects/andrews/StackExchange2EntityStore/gamedev.stackexchange.com.mx"]]; Then, you can register it for use in EntityValue with this: EntityRegister[gameDevSEStore] The basic features in the store should still work in Mathematica 11.3, and I’ve made a notebook to show a few simple things I found in it, including all posts with the “unity” tag: https://www.wolframcloud.com/objects/andrews/Published/gamedev.stackexchange.com Let me know if you have any questions! Thanks, Andrew Posted by Wolfram Blog    February 13, 2019 at 9:21 am