Wolfram Blog
Aaron Enright
Eric Weisstein

Computational Exploration of the Mathematics Genealogy Project in the Wolfram Language

August 2, 2018
Aaron Enright, Data Scientist, Wolfram|Alpha Socioeconomic Content
Eric Weisstein, Senior Researcher, Wolfram|Alpha Scientific Content

The Mathematics Genealogy Project (MGP) is a project dedicated to the compilation of information about all mathematicians of the world, storing this information in a database and exposing it via a web-based search interface. The MGP database contains more than 230,000 mathematicians as of July 2018, and has continued to grow roughly linearly in size since its inception in 1997.

In order to make this data more accessible and easily computable, we created an internal version of the MGP data using the Wolfram Language’s entity framework. Using this dataset within the Wolfram Language allows one to easily make computations and visualizations that provide interesting and sometimes unexpected insights into mathematicians and their works. Note that for the time being, these entities are defined only in our private dataset and so are not (yet) available for general use.

The search interface to the MGP is illustrated in the following image. It conveniently allows searches based on a number of common fields, such as parts of a mathematician’s name, degree year, Mathematics Subject Classification (MSC) code and so on:


For a quick look at the available data from the MGP, consider a search for the prolific mathematician Paul Erdős made by specifying his first and last names in the search interface. It gives this result:

Search results

Clicking the link in the search result returns a list of available data:

Available data

Note that related mathematicians (i.e. advisors and advisees) present in the returned database results are hyperlinked. In contrast, other fields (such as school, degree years and so on) are not. Clearly, the MGP catalogs a wealth of information of interest to anyone wishing to study the history of mathematicians and mathematical research. Unfortunately, only relatively simple analyses of the underlying data are possible using a web-based search interface.

Explore Mathematicians

For those readers not familiar with the Wolfram Language entity framework, we begin by giving a number of simple examples of its use to obtain information about the "MGPPerson" entities we created. As a first simple computation, we use the EntityValue function to obtain a count of the number of people in the "MGPPerson" domain:
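Since the "MGPPerson" entities are not publicly available, readers who want to follow along can try the same queries on a small stand-in. The following is a minimal sketch, assuming a hypothetical toy EntityStore named "ToyPerson" with made-up entities; it only illustrates the query forms and is not the MGP data:

```wolfram
(* Hypothetical toy store; the type name and entities are illustrative only *)
store = EntityStore["ToyPerson" -> <|
     "Entities" -> <|
       "p1" -> <|"Label" -> "Ann Example", "Surname" -> "Example"|>,
       "p2" -> <|"Label" -> "Bob Sample", "Surname" -> "Sample"|>,
       "p3" -> <|"Label" -> "Cy Example", "Surname" -> "Example"|>
       |>|>];

(* Make the store visible to the entity framework for this session *)
PrependTo[$EntityStores, store];

(* Number of entities in the toy domain *)
toyCount = EntityValue["ToyPerson", "EntityCount"]

(* A single random entity, then a list of two random entities *)
RandomEntity["ToyPerson"]
RandomEntity["ToyPerson", 2]
```

The same two query forms apply to any registered entity type, private or built in.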



Note that this number is smaller than the 230,000+ present in the database due to subsequent additions to the MGP. Similarly, we can return a random person:



Mousing over an “entity blob” such as in the previous example gives a tooltip showing the underlying Wolfram Language representation.

We can also explicitly look at the internal structure of the entity:



Copying, pasting and evaluating that expression yields the formatted version again:



We now extract the domain, canonical name and common name of the entity programmatically:



We can simultaneously obtain a set of random people from the "MGPPerson" domain:



To obtain a list of properties available in the "MGPPerson" domain, we again use EntityValue:



As we did for entities, we can view the internal structure of the first property:



We can also view the canonical names of all the properties as strings:



The URL to the relevant MGP page is available directly as its own property, which can be queried concisely as:



… with an explicit EntityProperty wrapper:



… or using a curried syntax:



We can also return multiple properties:
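The query forms described above can be sketched on a stand-in store. This is a hedged illustration only: the "ToyScholar" type, its single entity and its "URL" value are hypothetical and chosen just to show that the three single-property forms agree:

```wolfram
(* Hypothetical toy store with one entity; not MGP data *)
store = EntityStore["ToyScholar" -> <|
     "Entities" -> <|
       "s1" -> <|"Surname" -> "Example", "URL" -> "https://example.org/s1"|>
       |>|>];
PrependTo[$EntityStores, store];

scholar = Entity["ToyScholar", "s1"];

(* Property given by its string name *)
byName = EntityValue[scholar, "URL"];

(* Property given with an explicit EntityProperty wrapper *)
byWrapper = EntityValue[scholar, EntityProperty["ToyScholar", "URL"]];

(* Curried syntax: apply the entity directly to the property name *)
byCurry = scholar["URL"];

(* Several properties in one call *)
both = EntityValue[scholar, {"Surname", "URL"}]
```

All three single-property forms should return the same value; the list form returns the values in the order the properties were requested.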



Another powerful feature of the Wolfram Language entity framework is the ability to create an implicitly defined EntityClass:



Expanding this class, we obtain a list of people with the given surname:
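As a rough sketch of how an implicitly defined class is built and expanded, the following uses a hypothetical "ToyAuthor" store with invented surnames. The class is specified by a property condition rather than by listing its members, and EntityList expands it:

```wolfram
(* Hypothetical toy store; surnames are assigned to made-up entities *)
store = EntityStore["ToyAuthor" -> <|
     "Entities" -> <|
       "a1" -> <|"Surname" -> "Hardy"|>,
       "a2" -> <|"Surname" -> "Nelson"|>,
       "a3" -> <|"Surname" -> "Hardy"|>
       |>|>];
PrependTo[$EntityStores, store];

(* Implicit class: every entity whose "Surname" equals "Hardy" *)
hardys = EntityClass["ToyAuthor", "Surname" -> "Hardy"];

(* Expand the class into its member entities *)
members = EntityList[hardys]
```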



To obtain an overview of data for a given person, we can copy and paste from that list and query for the "Dataset" property using a curried property syntax:


Entity["MGPPerson", "174871"]["Dataset"]

As a first simple computation, we use the Wolfram Language function NestGraph to produce a ten-generation-deep mathematical advisor tree for mathematician Joanna “Jo” Nelson:


NestGraph[#["AdvisedBy"]&,Entity["MGPPerson", "174871"],10,VertexLabels->Placed["Name",After,Rotate[#,30 Degree,{-3.2,0}]&]]
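To see how NestGraph grows such a tree, here is a minimal sketch on invented data. The association advisedBy is a hypothetical stand-in for the "AdvisedBy" property: the function passed to NestGraph returns a list of successors, and each successor becomes a new vertex joined to the vertex it came from:

```wolfram
(* Hypothetical advisor lookup: each key maps to that person's advisors *)
advisedBy = <|
   "student" -> {"advisor"},
   "advisor" -> {"grandadvisor1", "grandadvisor2"}
   |>;

(* People with no recorded advisor map to {}, which ends that branch *)
tree = NestGraph[Lookup[advisedBy, #, {}] &, "student", 3,
   VertexLabels -> "Name"];

VertexList[tree]
```

Asking for more generations than the data contains is harmless here, since exhausted branches simply stop producing vertices.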

Using an implicitly defined EntityClass, let’s now look up people with the last name “Hardy”:



Having found the Hardy we had in mind, it is now easy to make a mathematical family tree for the descendants of G. H. Hardy, highlighting the root scholar:


With[{scholar=Entity["MGPPerson", "17806"]},
HighlightGraph[NestGraph[#["Advised"]&,scholar,2,VertexLabels->Placed["Name",After,Rotate[#,30 Degree,{-3.2,0}]&],ImageSize->Large,GraphLayout->"RadialDrawing"],scholar]]

A fun example of the sort of computation that can easily be performed using the Wolfram Language is visualizing the distribution of mathematicians based on first and last initials:


Histogram3D[Select[Flatten[ToCharacterCode[#]]&/@Map[RemoveDiacritics@StringTake[#,1]&,DeleteMissing[EntityValue["MGPPerson",{"GivenName","Surname"}],1,2],{2}],(65<=#[[1]]<=90&&65<=#[[2]]<=90)&],AxesLabel->{"given name","surname"},Ticks->({#,#,Automatic}&[Table[{j,FromCharacterCode[j]},{j,65,90}]])]

As one might expect, mathematician initials (as well as those of all people in general) are not uniformly distributed with respect to the alphabet.

Explore Locations

The Wolfram Language contains a powerful set of functionality involving geographic computation and visualization. We shall make heavy use of such functionality in the following computations.

It is interesting to explore the movement of mathematicians from the institutions where they received their degrees to the institutions at which they did their subsequent advising. To do so, first select mathematicians who received a degree in the 1980s:



Find where their students received their degrees:



Assume the advisors were local to the advisees:



Now show the paths of the advisors:



Explore Degrees

We can also perform a number of computations involving mathematical degrees. As with the "MGPPerson" domain, we first briefly explore the contents of the "MGPDegree" domain and show how to access them.

To begin, show a count of the number of theses in the "MGPDegree" domain:



List five random theses from the "MGPDegree" domain:



Show available "MGPDegree" properties:



Return a dataset of an "MGPDegree" entity:


Entity["MGPDegree", "120366"]["Dataset"]

Moving on, we now visualize the historical numbers of PhDs awarded worldwide:



We can now make a fit to the number of new PhD mathematicians over the period 1875–1975:


fit=Fit[Select[{#1["Year"],1. Log[2,#2]}&@@@phddata,1875<#[[1]]<1975&],{1,y},y]

This gives a doubling time of about 1.5 decades:
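The step from the fit to a doubling time is short: because the fit is linear in the base-2 logarithm of the count, its slope is the number of doublings per year, and the reciprocal of the slope is the doubling time in years. A minimal sketch on synthetic counts (constructed to double every 15 years, not taken from the MGP) shows the calculation:

```wolfram
(* Synthetic yearly counts that double every 15 years; not MGP data *)
data = Table[{year, 2.^((year - 1875)/15)}, {year, 1875, 1975, 5}];

(* Linear least-squares fit of Log2[count] against year *)
fit = Fit[{#1, Log2[#2]} & @@@ data, {1, y}, y];

(* The slope is doublings per year, so its reciprocal is the doubling time *)
slope = Coefficient[fit, y];
doublingTime = 1/slope
```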



Let’s write a utility function to visualize the number of degrees conferred by a specified university over time:



Look up the University of Chicago entity of the "University" type in the Wolfram Knowledgebase:


Interpreter["University"]["university of chicago"]

Show the number of degrees awarded by the University of Chicago, binned by decade:


DegreeCountHistogram[Entity["University", "UniversityOfChicago::726rv"],"Decades"]

... and by year:


DegreeCountHistogram[Entity["University", "UniversityOfChicago::726rv"],"Years",DateTicksFormat->"Year"]

Now look at the national distribution of degrees awarded. Begin by again examining the structure of the data. In particular, there exist PhD theses with no institution specified in "SchoolEntity" but a country specified in "SchoolLocation":



There also exist theses with more than a single country specified in "SchoolLocation":



Tally the countries (excluding the pair of multiples):



A total of 117 countries are represented:



Download flag images for these countries from the Wolfram Knowledgebase:



Create an image collage of flags, with the flags sized according to the number of math PhDs:



As another example, we can explore degrees awarded by a specific university. For example, extract mathematics degrees that have been awarded at the University of Miami since 2010:


"SchoolEntity"->Entity["University", "UniversityOfMiami::9c2k9"],
"Date"-> GreaterEqualThan[DateObject[{2010}]]}

Create a timeline visualization:



Now consider recent US mathematics degrees. Select the theses written at US institutions since 2000:


loc_?(ContainsExactly[{Entity["Country", "UnitedStates"]}]),DateObject[{y_?(GreaterEqualThan[2000])},___]

Make a table showing the top US schools by PhDs conferred:



Map schools to their geographic positions:



Visualize the geographic distribution of US PhDs:


GeoBubbleChart[geopositions,GeoRange->Entity["Country", "UnitedStates"]]

Show mathematician thesis production as a smooth kernel histogram over the US:


GeoSmoothHistogram[Flatten[Table[#1,{#2}]&@@@geopositions],"Oversmooth",GeoRange->GeoVariant[Entity["Country", "UnitedStates"],Automatic]]

Explore Thesis Titles

We now make some explorations of the titles of mathematical theses.

To begin, extract theses authored by people with the surname “Smith”:



Create a WordCloud of words in the titles:



Now explore the titles of all theses (not just those written by Smiths) by extracting thesis titles and dates:



The average string length of a thesis title is remarkably constant over time:
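One way to compute such an average is to group the titles by year and take the mean title length within each group. The following is only a sketch on a few invented {year, title} pairs; the variable names are hypothetical and the original computation may differ:

```wolfram
(* Invented {year, title} pairs; not MGP data *)
titles = {{1990, "On rings"}, {1990, "Spaces"}, {2000, "Knot theory"}};

(* Group by year, map each record to its title length, then average *)
meanLengths = GroupBy[titles, First -> (StringLength[Last[#]] &), N@*Mean]
```

The result is an association from year to mean title length, which can be passed directly to a plotting function such as ListLinePlot.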



The longest thesis title on record is this giant:



Motivated by this, extract explicit fragments appearing in titles:



... and display them in a word cloud:



Extract types of topological spaces mentioned in thesis titles and display them in a ranked table:


TextGrid[{StringTrim[#1],#2}&@@@Take[Select[Reverse[SortBy[Tally[Flatten[DeleteCases[StringCases[#2,Shortest[" "~~((LetterCharacter|"_")..)~~(" space"|"Space ")]]&@@@tt,{}]]],Last]],
Not[StringMatchQ[#[[1]],(" of " | " in " |" and "|" the " | " on ")~~__]]&],12],Dividers->All,Alignment->{{Left,Decimal}}]

Explore Mathematical Subjects

Get all available Mathematics Subject Classification (MSC) category descriptions for mathematics degrees conferred by the University of Oxford and construct a word cloud from them:


WordCloud[DeleteMissing[EntityValue[EntityList[EntityClass["MGPDegree","SchoolEntity"->Entity["University", "UniversityOfOxford::646mq"]]],"MSCDescription"]],ImageSize->Large]

Explore the MSC distribution of recent theses. To begin, use Iconize to store a list of MSC category names for use in subsequent examples:



Extract degrees awarded since 2010:


Length[degrees2010andlater=Cases[Transpose[{EntityList["MGPDegree"],EntityValue["MGPDegree","Date" ]}],{th_,DateObject[{y_?(GreaterEqualThan[2010])},___]}:>th]]

Extract the corresponding MSC numbers:



Make a pie chart showing the distribution of MSC category names and numbers:


With[{counts=Sort[Counts[degreeMSCs],Greater][[;;20]]},PieChart[Values[counts],ChartLegends->(Row[{#1,": ",#2," (",#3,")"}]&@@@(Flatten/@Partition[Riffle[Keys@counts,Partition[Riffle[(Keys@counts/.mscnames),ToString/@Values@counts],2]],2])),ChartLabels->Placed[Keys@counts,"RadialCallout"],ChartStyle->24,ImageSize->Large]]

Extract the MSC numbers for theses since 1990 and tally the combinations of {year, MSC}:



Plot the distribution of MSC numbers (mouse over the graph in the attached notebook to see MSC descriptions):


AxesLabel->{"MSC","year","thesis count"},Ticks->{None,Automatic,Automatic}]

Most students do research in the same area as their advisors. Investigate systematic transitions from MSC classifications of advisors’ works to those of their students. First, write a utility function to create a list of MSC numbers for an advisor’s degrees and those of each advisee:
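A utility of this kind might look like the following sketch. It runs on invented data: mscOf and advised are hypothetical associations standing in for the entity properties, and toyTransitions is an illustrative name, not the function used in the post:

```wolfram
(* Hypothetical toy data: MSC codes per person and advisees per advisor *)
mscOf = <|"advisor" -> {"54"}, "s1" -> {"54"}, "s2" -> {"46"}|>;
advised = <|"advisor" -> {"s1", "s2"}|>;

(* All {advisor MSC, advisee MSC} pairs for one advisor *)
toyTransitions[p_] := Flatten[
   Table[{a, s},
    {st, Lookup[advised, p, {}]}, {a, mscOf[p]}, {s, mscOf[st]}], 2];

(* Count how often each transition occurs *)
transitionCounts = Counts[toyTransitions["advisor"]]
```

Pairs whose two codes agree correspond to students who stayed in the subject area of their advisor.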



For example, for Maurice Fréchet:


TextGrid[msctransition[Entity["MGPPerson", "17947"]]/.mscnames,Dividers->All]

Find MSC transitions for degree dates after 1988:













Explore Advisors

Construct a list of directed edges from advisors to their students:



Some edges are duplicated because the same student-advisor relationship exists for more than one degree:



For example:


(EntityValue[Entity["MGPPerson", "110698"],{"AdvisedBy","Degrees"}]/.e:Entity["MGPDegree",_]:>{e,e["DegreeType"]})

So build an explicit advisor graph by uniting the {advisor, advisee} pairs:



The advisor graph contains more than 3,500 weakly connected components:
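The deduplication and component count can be sketched on a toy edge list. The edges below are invented and include one deliberate duplicate; the variable names are illustrative:

```wolfram
(* Invented advisor -> advisee pairs, with one duplicate on purpose *)
edges = {"a" -> "b", "a" -> "b", "b" -> "c", "x" -> "y"};

(* Remove duplicate pairs, then build a directed graph *)
g = Graph[DeleteDuplicates[DirectedEdge @@@ edges]];

(* Weakly connected components ignore edge direction *)
comps = WeaklyConnectedComponents[g];

componentCount = Length[comps]
largestSize = Max[Length /@ comps]
```

In this toy graph there are two components, and the larger one has three members.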



Visualize component sizes on a log-log plot:



Find the size of the giant component (about 190,000 people):



Find the graph center of the second-largest component:



Visualize the entire second-largest component:



Identify the component in which David Hilbert resides:


FirstPosition[VertexList/@graphComponents,Entity["MGPPerson", "7298"]][[1]]

Show Hilbert’s students:


With[{center=Entity["MGPPerson", "7298"]},HighlightGraph[Graph[Thread[center->AdjacencyList[graphComponents[[1]],center]],VertexLabels->"Name",ImageSize->Large],center]]

As it turns out, the mathematician Gaston Darboux plays an even more central role in the advisor graph. Here is some detailed information about Darboux, whose 1866 thesis was titled “Sur les surfaces orthogonales”:


Entity["MGPPerson", "34254"] ["PropertyAssociation"]

And here is a picture of Darboux:


Show[WikipediaData["Gaston Darboux","ImageList"]//Last,ImageSize->Small]

Many mathematical constructs are named after Darboux:



... and his name can even be used in adjectival form:


StringCases[Normal[WebSearch["Darbouxian *",Method -> "Google"][All,"Snippet"]], "Darbouxian"~~" " ~~(LetterCharacter ..)~~" " ~~(LetterCharacter ..)]//Flatten//DeleteDuplicates // Column

Many well-known mathematicians are in the subtree starting at Darboux. In particular, in the directed advisor graph we find a number of recent Fields Medal winners. Along the way, we also see many well-known mathematicians such as Laurent Schwartz, Alexander Grothendieck and Antoni Zygmund:


{path1,path2,path3,path4}=(DirectedEdge@@@Partition[FindShortestPath[graphComponents[[1]],Entity["MGPPerson", "34254"],#],2,1])&/@
{Entity["MGPPerson", "13140"],Entity["MGPPerson", "22738"],Entity["MGPPerson", "43967"],Entity["MGPPerson", "56307"]}

Using the data from the EntityStore, we build the complete subgraph starting at Darboux:




advgenerations=Rest[NestList[adviseeedges,{Null->Entity["MGPPerson", "34254"]},7]];
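The generation-by-generation construction can be sketched on invented data. Here toyAdviseeEdges is a hypothetical stand-in for the helper used above, assumed to map a list of advisor -> advisee rules to the rules for the next generation, and the association advised replaces the entity lookup:

```wolfram
(* Invented advisor -> advisees lookup; not MGP data *)
advised = <|"root" -> {"a", "b"}, "a" -> {"c"}|>;

(* From one generation of edges, build the edges to the next generation *)
toyAdviseeEdges[edges_] := Flatten[
   Function[e, Thread[Last[e] -> Lookup[advised, Last[e], {}]]] /@ edges];

(* Seed with a dummy edge into the root, then drop the seed with Rest *)
generations = Rest[NestList[toyAdviseeEdges, {Null -> "root"}, 3]];

(* Combine all generations into one graph *)
subgraph = Graph[Flatten[generations]];
VertexCount[subgraph]
```

Once a branch runs out of advisees it contributes an empty list, so later generations simply become empty.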



It contains more than 14,500 mathematicians:



Because it is a complicated graph, we display it in 3D to avoid overcrowded zones. Darboux sits approximately in the center:



We now look at the degree centrality of the nodes of this graph in a log-log plot:



Let’s now highlight the path to that plot for Fields Medal winners:




Join[{Style[Entity["MGPPerson", "34254"],Orange,PointSize[Large]]},

Geographically, Darboux’s descendants are distributed around the whole world:


makeGeoPath[e1_e2_] :=
Column/@{{"degree date"},d1,d2}},Dividers->Center]]}]]

Here are the paths from the advisors’ schools to the advisees’ schools after four and six generations:




GeoGraphics[makeGeoPath /@ Flatten[Take[advgenerations, 6]],
  GeoBackground -> "StreetMapNoLabels", GeoRange -> "World"] // Quiet

Distribution of Intervals between the Date at Which an Advisor Received a PhD and the Date at Which That Advisor's First Student's PhD Was Awarded

Extract a list of advisors and the dates at which their advisees received their PhDs:



This list includes multiple student PhD dates for each advisor, so select the dates of the first students’ PhDs only:



Now extract a list of PhD awardees and the dates of their PhDs:



Note that some advisors have more than one PhD:



For example:


Entity["MGPPerson", "100896"]["Degrees"]

... who has these two PhDs:



While having two PhDs is not unheard of, having three is unique:



In particular:



Select the first PhDs of advisees and make a set of replacement rules to their first PhD dates:



Now replace advisors by their first PhD years and subtract from the year of their first students’ PhDs:



The data contains a small number of discrepancies where students allegedly received their PhDs prior to their advisors:



Removing these problematic points and plotting a histogram reveals the distribution of years between advisors’ and first advisees’ PhDs:
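The whole pipeline of this section can be condensed into a sketch on invented years. The associations below are hypothetical stand-ins for the extracted lists: one gives the first PhD year of each advisor, the other the PhD years of each advisor's students:

```wolfram
(* Invented data: first PhD year per advisor, and PhD years of their students *)
advisorPhD = <|"A" -> 1950, "B" -> 1960, "C" -> 1990|>;
studentPhDs = <|"A" -> {1962, 1958, 1970}, "B" -> {1955, 1975}, "C" -> {1996}|>;

(* Years from the advisor's PhD to the earliest student PhD *)
gaps = KeyValueMap[Min[#2] - advisorPhD[#1] &, studentPhDs];

(* Drop records where a student appears to precede the advisor *)
validGaps = Select[gaps, Positive];

Histogram[validGaps, {1}]
```

Advisor "B" is removed in the filtering step because the earliest student date precedes the advisor's own PhD, which mirrors the discrepancies mentioned above.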



We hope you have found this computational exploration of mathematical genealogy of interest. We thank Mitch Keller and the Mathematics Genealogy Project for their work compiling and maintaining this fascinating and important dataset, as well as for allowing us the opportunity to explore it using the Wolfram Language. We hope to be able to freely expose a Wolfram Data Repository version of the MGP dataset in the near future so that others may do the same.

Leave a Comment



In[3]:= EntityValue["MGPPerson","EntityCount"]
Out[3]= Missing[UnknownType,MGPPerson]

Posted by Diana    August 3, 2018 at 4:39 pm
    Wolfram Blog

    Hi Diana. That doesn’t work because we haven’t made the data available in the Wolfram Data Repository yet. Once we do, one can use ResourceSearch to find the ResourceObject and import it into a Wolfram Language session.

    Posted by Wolfram Blog    August 7, 2018 at 8:40 am
Kurt Shatov

Great blog!
Two questions: a) Have you tried to do something similar for other disciplines (https://academictree.org) and compare various quantitative characteristics of the resulting trees (e.g. vertex degree distributions)?
b) What’s the distribution underlying the last histogram? Is it a lognormal distribution (as one might conjecture from arxiv 1607.02952)?

Posted by Kurt Shatov    August 5, 2018 at 10:05 am
    Wolfram Blog

    Thanks for your feedback, Kurt!

    As for your first question, we have not looked at other disciplines. https://academictree.org/ looks fascinating. We’ll look into getting that into the Wolfram Data Repository.

    And for your second question, indeed, the distribution is well approximated by a log-normal distribution, as follows:

    In[159]:= hist = Histogram[data]

    In[170]:= fit = FindDistributionParameters[data,
    LogNormalDistribution[\[Mu], \[Sigma]]]
    Out[170]= {\[Mu] -> 2.35869, \[Sigma] -> 0.605468}

    In[172]:= Show[{Histogram[data, {1}, "PDF"],
    Plot[PDF[LogNormalDistribution[\[Mu], \[Sigma]], t] /. fit, {t, 0, 40}]}]

    Posted by Wolfram Blog    August 7, 2018 at 8:39 am
Barrie Stokes

Another tour de force blog from the Wolfram staff!
A common question: is there, or will there be, a Notebook of this blog? It’s always instructive to take a great Notebook and play with it.

Thanks Aaron, thanks Eric.

Posted by Barrie Stokes    August 5, 2018 at 9:51 pm
    Wolfram Blog

    We hope to make a NB of this blog post once we are able to publish the MGP EntityStore in the Wolfram Data Repository, from which it will then be available to all.

    Posted by Wolfram Blog    August 6, 2018 at 2:33 pm
Mpsc Ganit

Why does the entity framework make the MGP data more accessible and easily computable?

Posted by Mpsc Ganit    August 6, 2018 at 5:12 am
    Wolfram Blog

    As an EntityStore, the MGP data becomes computable in the Wolfram Language, allowing us to do the analysis you see in the blog posting. We put the data into an EntityStore because a) the data seemed well suited for it and b) we wanted to show off what an EntityStore can do.

    Posted by Wolfram Blog    August 6, 2018 at 2:28 pm
