Accessing the World with the Wolfram Language: External Identifiers and Wikidata
Wikidata is a large, community-curated repository of freely usable data. Version 12.1 of the Wolfram Language introduced dedicated functionality to access Wikidata. We came up with a new kind of entity: a fundamental building block called ExternalIdentifier, which I’ll explain in more detail shortly.
As a simple starting example, let’s retrieve the mass of the Moon according to Wikidata:
WikidataData[ExternalIdentifier["WikidataID", "Q405", <|"Label" -> "Moon", "Description" -> "only natural satellite of Earth"|>], ExternalIdentifier["WikidataID", "P2067", <|"Label" -> "mass", "Description" -> "mass (in colloquial usage also known as weight) of the item"|>]]
Where is this data coming from? Who provides it? I’ll start at the beginning with a review of the history of Wikipedia, which led to the creation of a new sister project: Wikidata. Then I’ll demonstrate how the Wolfram Language facilitates access to this large and diverse data repository.
Quick History of Wikimedia
Wikipedia—the free encyclopedia that anyone can edit—was created 20 years ago. Some advantages over traditional encyclopedias (think: heavy paper books) include hyperlinks within the text to other articles and links to the same article in a different language, images, infoboxes and so on.
Around that time, the semantic web was already in the works, with the first recommendation standardizing its data model, the Resource Description Framework (RDF), being published in 2004. The semantic web aims to apply the successful concept of hyperlinks to (machine-readable) data, creating so-called “linked data.” By standardizing the underlying data model, the data can be moved freely between systems. By linking the data—achieved by representing concepts as URLs (more technically, IRIs)—datasets can be combined easily.
About 10 years ago, Wikipedia reached the status of the “go-to” resource for anyone wishing to learn about a subject and to find references for further research. Articles and links between those articles are available in hundreds of languages. However, keeping the information available in infoboxes, which was essentially repeated across language editions, in sync was tedious. Even renaming a single article meant updating potentially hundreds of links from other language editions. The valuable data present in the articles was not readily machine readable, making queries like “give me all neurotransmitters encoded by such-and-such a protein” hard to answer.
Wikimedia, the foundation behind Wikipedia, presented a new sister project about seven years ago: Wikidata, a multilingual website to host “items”—that is, identifiers starting with the letter Q, followed by an integer.
Wikidata Development
The first item ever created was Q1, representing the universe. Shortly after followed Q2, representing Earth, and so on. In the first development phase of Wikidata, an item was created for each existing Wikipedia page, in the order of some popularity measure.
Each item contains links to Wikipedia pages about the concept identified by the item, one per language. This facilitated managing the language links because changing the title of an article now required only one change in a central location. An item also contains for each language a label, a short description and alternative labels.
The second phase introduced the concept of properties: identifiers starting with the letter P followed by an integer. With properties, one can make statements, like (Q2, P31, Q3504248). For our English-speaking readers, this can be presented as (“Earth”, “instance of”, “inner planet”) using the English labels attached to the respective items and properties.
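As a rough sketch of how such a statement looks from the Wolfram Language (using WikidataData and ExternalIdentifier, both introduced later in this post), the subject and the property of the triple become the two arguments, and the object is what comes back. The result reflects the live state of Wikidata, so it may differ from the item quoted above:

```wolfram
(* subject and property of the statement (Q2, P31, ...) *)
earth = ExternalIdentifier["WikidataID", "Q2"];
instanceOf = ExternalIdentifier["WikidataID", "P31"];

(* ask Wikidata for the object(s) of all matching statements *)
WikidataData[earth, instanceOf]
```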
So now we have a website that stores machine-readable, multilingual data. It provides interfaces (APIs) that allow retrieving that data, one item at a time. While that is immensely useful on its own (and a clear improvement over having to extract data from a Wikipedia page), the real power becomes apparent with the Wikidata query service: it is an RDF (think: semantic web) database containing all the information about all the items.
SPARQL is the query language for RDF, just as SQL is the query language for relational databases. SPARQL can be used to query the Wikidata query service. In a previous blog post, I explained the basics of SPARQL as well as the Wolfram Language’s symbolic representation of SPARQL, which facilitates writing programs that construct queries on the fly.
But you don’t have to read that (somewhat technical) blog post, nor do you have to learn SPARQL. Version 12.1 uses that same technology to build a function that is very easy to use (think: as easy as using Entity), from simple data retrieval and presentation to queries involving conditions.
Basics
Before giving a more theoretical explanation, let’s start with some examples.
WikidataData is the function to access data stored in Wikidata. For example, here is the mass of the Moon, shown previously:
WikidataData[ExternalIdentifier["WikidataID", "Q405", <|"Label" -> "Moon", "Description" -> "only natural satellite of Earth"|>], ExternalIdentifier["WikidataID", "P2067", <|"Label" -> "mass", "Description" -> "mass (in colloquial usage also known as weight) of the item"|>]]
The items to feed into WikidataData can be discovered using WikidataSearch:
WikidataSearch["moon"]
Similarly, you can search for properties:
WikidataSearch["Property" -> "mass"]
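Because WikidataSearch returns ordinary lists of ExternalIdentifier expressions, its results can be passed straight on to WikidataData. The following is a minimal sketch; it assumes that the first hit of each search is the intended item or property, which is not guaranteed, so in practice one would inspect the lists first:

```wolfram
(* candidate items and properties; order and content depend on live search results *)
items = WikidataSearch["moon"];
properties = WikidataSearch["Property" -> "mass"];

(* assumption: the first hits are the Moon (Q405) and mass (P2067) *)
WikidataData[First[items], First[properties]]
```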
External Identifiers
Now what are those bluish boxes? Let’s take a step back and look at entities first.
Identifier Systems
The following expression stands for the concept “Moon”:
Entity["PlanetaryMoon", "Moon"]
When designing the entity framework, we could have chosen to represent the concept of the Moon simply as Entity[1234...]—that is, an arbitrary number (or string) that identifies the concept within the Wolfram ontology. However, we chose a two-component identifier consisting of a type (an “entity type”) and a name (a “canonical name”) with the requirement that a name is unique for a given type.
For the purpose of identifying a concept, the structure of an identifier does not matter. Advantages of partitioning the space of identifiers by type include, for instance, being able to list the properties applicable to a given entity.
The Wolfram ontology is a particular identifier system. Examples of other identifier systems include ISBN, DOI, ISO 639-2 code (“language code”), LoC control number, MusicBrainz artist ID and so on. It is up to each identifier system to standardize the format of its identifiers (from now on, IDs), define their referents (that to which an ID refers) and mint new IDs, or to nominate organizations to do so and to provide lookup services.
An ID alone, say 123, is meaningless, because a number of identifier systems use integers as IDs (OSM relation ID, PubChem SID and CID, Entrez Gene ID, etc.), so the same integer refers to completely unrelated concepts depending on the system.
The new symbol to represent an ID within an identifier system is ExternalIdentifier.
Represent the ID Q1 within the Wikidata identifier system:
ExternalIdentifier["WikidataID", "Q1"]
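To make the role of the type concrete, here is a small sketch that needs no network access. The same string paired with two different types yields two different expressions; the type name "ExampleCatalog" is made up for illustration and is not a registered identifier type:

```wolfram
a = ExternalIdentifier["WikidataID", "Q1"];
b = ExternalIdentifier["ExampleCatalog", "Q1"]; (* hypothetical type *)

(* structurally different expressions, hence different referents *)
a === b
(* False *)
```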
Data Providers
Entities have a compact syntax for retrieving data:
Entity["PlanetaryMoon", "Moon"][EntityProperty["PlanetaryMoon", "Mass"]]
… which is a short form for:
EntityValue[Entity["PlanetaryMoon", "Moon"], EntityProperty["PlanetaryMoon", "Mass"]] === %
This raises the question of why the following is not supported:
ExternalIdentifier["WikidataID", "Q405", <|"Label" -> "Moon", "Description" -> "only natural satellite of Earth"|>][ExternalIdentifier["WikidataID", "P2067", <|"Label" -> "mass", "Description" -> "mass (in colloquial usage also known as weight) of the item"|>]]
This is because the organization that defines an identifier system does not need to be the same as the one that hosts the data. In the case of Entity, Wolfram is responsible for both. But for, say, an ISBN, there is no canonical organization responsible for recording all the ISBNs that have been issued. That implies that for working with external identifiers, selecting a service (“choosing a lookup function”) is an explicit step.
In the case of ISBN, we got lucky because Wikidata does store ISBNs corresponding to Wikidata IDs. This mapping is used to make the following possible:
WikidataData[ExternalIdentifier["ISBN10", "1-57955-008-8"], ExternalIdentifier["WikidataID", "P50", <|"Label" -> "author", "Description" -> "main creator(s) of a written work (use on works, not humans); use P2093 when Wikidata item is unknown or does not exist"|>]]
In fact, of the around eight thousand properties (and growing) that are available in Wikidata, about half are of type “external ID”—that is, mappings into other identifier systems.
Metadata
As there is no canonical service associated with certain identifier systems, one cannot rely on illustrative labels to become available when needed. But one might still want to present some human-readable information (instead of just the ID) in certain situations.
ExternalIdentifier allows embedding arbitrary metadata:
ExternalIdentifier["type", "abc", <|"notes" -> "some notes"|>]
This metadata does not change the referent. The most important piece of metadata is the "Label", which is used for display:
ExternalIdentifier["WikidataID", "Q2", <|"Label" -> "Earth"|>]
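Since metadata does not change the referent, two expressions for the same ID can still differ structurally when their metadata differs. Here is a minimal sketch of comparing by referent, using a hypothetical helper named referentKey (not a built-in function) that keeps only the type and the ID:

```wolfram
(* hypothetical helper: drop any metadata, keep {type, id} *)
referentKey[ExternalIdentifier[type_, id_, ___]] := {type, id}

plain = ExternalIdentifier["WikidataID", "Q2"];
labeled = ExternalIdentifier["WikidataID", "Q2", <|"Label" -> "Earth"|>];

(* the expressions themselves differ because of the metadata *)
plain === labeled

(* type and ID agree, so both refer to the same thing *)
referentKey[plain] === referentKey[labeled]
(* True *)
```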
URLs
For certain identifier types, a URL can be constructed. For such types, you can click inside the blue box (on the ID or label) to go to the website that describes the referent. The URL can be accessed like this:
ExternalIdentifier["WikidataID", "Q2", <|"Label" -> "Earth"|>]["URL"]
There is also a special kind of URL, the "ConceptURI", which identifies the concept within the semantic web (the same in this case):
ExternalIdentifier["WikidataID", "Q2", <|"Label" -> "Earth"|>]["ConceptURI"]
Such concept URIs are relevant when querying SPARQL endpoints.
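As a hedged sketch of where a concept URI ends up, the query below places the URI for Earth in the subject position of a SPARQL triple pattern and sends it to the public Wikidata endpoint with SPARQLExecute. The endpoint URL, the predicate IRI for P31 and the variable name are assumptions of this example rather than something defined in this post, and the result depends on live data:

```wolfram
endpoint = "https://query.wikidata.org/sparql";

(* concept URI of Q2 as subject, "instance of" (P31) as predicate *)
query = "
  SELECT ?class WHERE {
    <http://www.wikidata.org/entity/Q2>
      <http://www.wikidata.org/prop/direct/P31> ?class .
  }";

SPARQLExecute[endpoint, query]
```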
Datatypes
The Wolfram Language allows for the representation of a variety of basic as well as complex “values.” Complex values are presented to the user in an easily recognizable style. A large set of functions for plotting, creating maps and timelines, arithmetic operations, sorting and so on readily supports those datatypes.
Wikidata also supports a variety of datatypes. Those are translated to corresponding Wolfram Language expressions by WikidataData, taking into account precision of numbers, precision and units of quantities, precision and calendar type of dates and coordinate systems (“planets”) for geographic positions.
Strings
An example of a basic value is a string:
WikidataData[ExternalIdentifier["WikidataID", "Q82274", <|"Label" -> "Plochingen", "Description" -> "municipality in Germany"|>], ExternalIdentifier["WikidataID", "P281", <|"Label" -> "postal code", "Description" -> "identifier assigned by postal authorities for the subject area or building"|>]]
“Demonym” is an example of a so-called monolingual text property. Its value depends on the setting of the Language option:
WikidataData[ExternalIdentifier["WikidataID", "Q2", <|"Label" -> "Earth", "Description" -> "third planet from the Sun in the Solar System"|>], ExternalIdentifier["WikidataID", "P1549", <|"Label" -> "demonym", "Description" -> "demonym (proper noun) for people or things associated with a given place, usually based off the placename; multiple entries with qualifiers to distinguish are used to list variant forms by reason of grammatical gender or plurality."|>]]
WikidataData[ExternalIdentifier["WikidataID", "Q2", <|"Label" -> "Earth", "Description" -> "third planet from the Sun in the Solar System"|>], ExternalIdentifier["WikidataID", "P1549", <|"Label" -> "demonym", "Description" -> "demonym (proper noun) for people or things associated with a given place, usually based off the placename; multiple entries with qualifiers to distinguish are used to list variant forms by reason of grammatical gender or plurality."|>], Language -> "German"]
URLs
Retrieve a URL of an image representing an item:
WikidataData[ExternalIdentifier["WikidataID", "Q5070208", <|"Label" -> "kangaroo", "Description" -> "marsupial indigenous to Australia"|>], "ImageURL"]
As a small convenience, the image can be requested (saving you one Import call):
WikidataData[ExternalIdentifier["WikidataID", "Q5070208", <|"Label" -> "kangaroo", "Description" -> "marsupial indigenous to Australia"|>], "Image"]
Geography
Geographic positions are represented as GeoPosition:
WikidataData[ExternalIdentifier["WikidataID", "Q64", <|"Label" -> "Berlin", "Description" -> "capital and largest city of Germany"|>], "GeoPosition"]
The result can be used in any geo-plotting or geo-computation function:
GeoListPlot[%]
The automatic inclusion of the coordinate system in the position allows, for instance, GeoListPlot to choose the right background. Take the position of a crater on Mars:
WikidataData[ExternalIdentifier["WikidataID", "Q3298524", <|"Label" -> "Korolev", "Description" -> "crater on Mars"|>], "GeoPosition"]
Here’s an example of a background chosen for Mars:
GeoListPlot[%]
Numbers and Quantities
Numbers can be considered quantities of dimension one (historically: “dimensionless quantities”). This is supported by the following identity:
Quantity[1, "PureUnities"] === 1
Wikidata follows that system by using a single datatype, quantity, to represent both. When users enter a value of a property with datatype quantity, the user interface optionally allows entering a unit. The units that can be entered in that unit field are Wikidata items. If no unit is entered, it defaults to Q199, the item representing the number 1. Being a unit, the number 1 has a unit symbol (which is typically omitted when writing down a value):
WikidataData[ExternalIdentifier["WikidataID", "Q199", <|"Label" -> "1", "Description" -> "natural number"|>], ExternalIdentifier["WikidataID", "P5061", <|"Label" -> "unit symbol", "Description" -> "Abbreviation of a unit for each language. If not provided, then it should default to English."|>]]
It also has a conversion to other (coherent) SI units (1, trivially), and a statement about which quantities it measures:
WikidataData[ExternalIdentifier["WikidataID", "Q199", <|"Label" -> "1", "Description" -> "natural number"|>], {ExternalIdentifier["WikidataID", "P2370", <|"Label" -> "conversion to SI unit", "Description" -> "conversion of the unit into SI base unit(s)/SI derived unit"|>], ExternalIdentifier["WikidataID", "P111", <|"Label" -> "measured physical quantity", "Description" -> "value of a physical property expressed as number multiplied by a unit"|>]}]
Let’s find some (direct) subclasses of the class of quantities of dimension one:
WikidataData[EntityClass[All, "SubclassOf" -> ExternalIdentifier["WikidataID", "Q126818", <|"Label" -> "dimensionless quantity", "Description" -> "quantity without an associated physical dimension"|>]], ExternalIdentifier["WikidataID", "P7973", <|"Label" -> "quantity symbol (LaTeX)", "Description" -> "symbol for a mathematical or physical quantity in LaTex"|>], "Association"] // Short
But I’m digressing. The takeaway is that Wikidata contains its own quantities and units ontology, which is necessary and of fundamental importance for representing any quantity-valued statement in Wikidata.
Now an example:
WikidataData[ExternalIdentifier["WikidataID", "Q243", <|"Label" -> "Eiffel Tower", "Description" -> "tower located on the Champ de Mars in Paris, France"|>], ExternalIdentifier["WikidataID", "P2067", <|"Label" -> "mass", "Description" -> "mass (in colloquial usage also known as weight) of the item"|>]]
When hovering over the result, a tooltip indicates “unit: metric tons.”
There are quite a few “tons” out there. The linguistic interface allows selecting among the possible interpretations of “ton” known to the Wolfram Language.
Wikidata also knows a few “tons”:
WikidataSearch["ton"]
This example illustrates that there is too much ambiguity in commonly used unit names to rely just on the name to unambiguously identify a unit. To solve the ambiguity issue, Wikidata links its unit items to other unit ontologies:
WikidataData[ExternalIdentifier["WikidataID", "Q191118", <|"Label" -> "tonne", "Description" -> "metric unit of mass equal to 1000 kg"|>], {ExternalIdentifier["WikidataID", "P2968", <|"Label" -> "QUDT unit ID", "Description" -> "identifier for unit of measure definition according to QUDT ontology"|>], ExternalIdentifier["WikidataID", "P7825", <|"Label" -> "UCUM code", "Description" -> "case-sensitive code from the Unified Code for Units of Measure specification to identify a unit of measurement"|>], ExternalIdentifier["WikidataID", "P3328", <|"Label" -> "wurvoc.org measure ID", "Description" -> "concept in the Ontology of units of Measure and related concepts (OM) 1.8 of wurvoc.org"|>], ExternalIdentifier["WikidataID", "P7007", <|"Label" -> "Wolfram Language unit code", "Description" -> "input form for a unit of measurement in the Wolfram Language"|>]}]
This allows RDF (semantic web) data that uses any of those unit ontologies to represent quantities to be translated unambiguously into any of the others, including to and from the Wolfram Language.
Precision
The speed of light is a physical constant used to define the unit “meter.” It is exact when expressed in meters per second:
WikidataData[ExternalIdentifier["WikidataID", "Q2111", <|"Label" -> "speed of light", "Description" -> "speed at which all massless particles and associated fields travel in a vacuum"|>], ExternalIdentifier["WikidataID", "P1181", <|"Label" -> "numeric value", "Description" -> "numerical value of a number, a mathematical constant, or a physical constant"|>]]
Precision[First[%]]
The gravitational constant, on the other hand, is uncertain:
WikidataData[ExternalIdentifier["WikidataID", "Q18373", <|"Label" -> "gravitational constant", "Description" -> "empirical physical constant relating the gravitational force between objects to their mass and distance"|>], ExternalIdentifier["WikidataID", "P1181", <|"Label" -> "numeric value", "Description" -> "numerical value of a number, a mathematical constant, or a physical constant"|>]]
The uncertainty is contained in the previous expression (its FullForm) and it can be made “visible” by applying Around:
MapAt[Around, First[%], 1]
(Nice formatting is not the only feature of Around.)
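To illustrate that remark with a self-contained sketch: Around values carry their uncertainty through arithmetic. The numbers below are typed in by hand for illustration, in the style of a measured constant; they are not retrieved from Wikidata:

```wolfram
(* value and uncertainty, entered manually for illustration *)
g = Around[6.6743*^-11, 1.5*^-15];

(* scaling by an exact factor scales the uncertainty as well *)
doubled = 2 g;

doubled["Value"]
doubled["Uncertainty"]
```

Both the value and the uncertainty of the product are twice those of g, which is what one expects for multiplication by an exact number.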
Dates
For dates not too far in the past, the (proleptic) Gregorian calendar is typically used:
WikidataData[ExternalIdentifier["WikidataID", "Q9312", <|"Label" -> "Immanuel Kant", "Description" -> "German philosopher"|>], ExternalIdentifier["WikidataID", "P569", <|"Label" -> "date of birth", "Description" -> "date on which the subject was born"|>]]
Here is an example of a date given in the Julian calendar:
WikidataData[ExternalIdentifier["WikidataID", "Q859", <|"Label" -> "Plato", "Description" -> "ancient Greek philosopher"|>], ExternalIdentifier["WikidataID", "P569", <|"Label" -> "date of birth", "Description" -> "date on which the subject was born"|>]]
The ability to enter dates in different calendar systems allows faithful representation of values given in (primary) sources. It leaves the task of converting between calendars to the application consuming such data:
CalendarConvert[%, "Gregorian"]
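The same kind of conversion can be tried without any Wikidata lookup. As a small sketch, the hand-entered Julian date below is the last day before the Gregorian calendar reform took effect in 1582, when the two calendars were ten days apart:

```wolfram
julian = DateObject[{1582, 10, 4}, CalendarType -> "Julian"];

(* the same day expressed in the Gregorian calendar: 14 October 1582 *)
gregorian = CalendarConvert[julian, "Gregorian"]
```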
The inception of the city Berlin is only known to year precision:
WikidataData[ExternalIdentifier["WikidataID", "Q64", <|"Label" -> "Berlin", "Description" -> "capital and largest city of Germany"|>], ExternalIdentifier["WikidataID", "P571", <|"Label" -> "inception", "Description" -> "date or point in time when the subject came into existence as defined"|>]]
Result Forms
If you know entities, then you’ll probably have discovered the possibility to “shape” the result. When requesting multiple values at once, the default result is a list, or more generally, an array:
Entity["Person", "AlbertEinstein::6tb7g"][{"BirthDate", "DeathDate"}]
The position of the value in the result is the same as the position of the corresponding property in the input. To see the properties and values next to each other, one can request an Association or a Dataset instead:
Entity["Person", "AlbertEinstein::6tb7g"][{"BirthDate", "DeathDate"}, "Association"]
WikidataData supports the same result forms:
WikidataData[ExternalIdentifier["WikidataID", "Q937", <|"Label" -> "Albert Einstein", "Description" -> "German-born physicist; developer of the theory of relativity"|>], {"BirthDate", "DeathDate"}, "Association"]
Omitting the list of properties produces all available data:
WikidataData[ExternalIdentifier["WikidataID", "Q937", <|"Label" -> "Albert Einstein", "Description" -> "German-born physicist; developer of the theory of relativity"|>], "Dataset"]
Entities
Entities can be discovered easily with the linguistic assistant, thereby facilitating access to Wikidata:
WikidataData[Entity["Species", "Infraspecies:EquusZebraZebra"], "Dataset"]
One can also go in the other direction. Given an item, request the corresponding entity:
WikidataData[ExternalIdentifier["WikidataID", "Q2", <|"Label" -> "Earth", "Description" -> "third planet from the Sun in the Solar System"|>], "Entity"]
Classes
Entity classes represent an implicit collection of entities. They behave in various contexts just like an explicitly given list of entities. Here is a class of the three longest rivers in Italy:
longRivers = EntityClass["River", {EntityProperty["River", "Countries"] -> Entity["Country", "Italy"], EntityProperty["River", "Length"] -> TakeLargest[3]}];
Request the lengths of those rivers:
longRivers["Length", "Association"]
Request the same information from Wikidata instead:
WikidataData[longRivers, "Length", "Association"]
In this example, the type, constrained property and value were specified as entity type, entity property and entity. Those are translated to corresponding Wikidata items and properties for query evaluation. However, there is no need to start with terms from the Wolfram ontology.
Here’s an example of a query that only uses external identifiers to specify type, constrained property and constraint value:
WikidataData[EntityClass[ExternalIdentifier["WikidataID", "Q47461344", <|"Label" -> "written work", "Description" -> "any creative work expressed in writing like: inscriptions, manuscripts, documents or maps"|>], ExternalIdentifier["WikidataID", "P50", <|"Label" -> "author", "Description" -> "main creator(s) of a written work (use on works, not humans); use P2093 when Wikidata item is unknown or does not exist"|>] -> ExternalIdentifier["ISNI", "0000 0001 1047 0442"]], ExternalIdentifier["WikidataID", "P577", <|"Label" -> "publication date", "Description" -> "date or point in time when a work was first published or released"|>], "Association"]
But what does it mean for “written work” to appear in the type position? Taking a random article from the list of results, we see that it is an instance of (Wikidata’s property P31 to indicate class membership) a “scholarly article”:
WikidataData[ExternalIdentifier["WikidataID", "Q56091626", <|"Label" -> "Computer algebra"|>], "InstanceOf"]
“Scholarly article” is an (indirect) subclass of (Wikidata’s property P279 to indicate subclass relations) “written work”:
WikidataData[ExternalIdentifier["WikidataID", "Q13442814", <|"Label" -> "scholarly article", "Description" -> "article in an academic publication, usually peer reviewed"|>], "SubclassOf"]
WikidataData[ExternalIdentifier["WikidataID", "Q191067", <|"Label" -> "article", "Description" -> "text that forms an independent part of a publication"|>], "SubclassOf"]
When requesting members of classes (specified via EntityClass) from Wikidata, both explicit and implicit class membership relations are taken into account.
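The idea of implicit membership can be sketched offline with a toy subclass hierarchy. The three class names and the two edges below are a simplified assumption for illustration, not data retrieved from Wikidata; the point is only that membership is checked against everything reachable through subclass links:

```wolfram
(* toy "subclass of" graph: edges point from subclass to superclass *)
subclassOf = Graph[{
    "scholarly article" -> "article",
    "article" -> "written work"}];

(* a class together with all of its direct and indirect superclasses *)
allSuperclasses[class_] := VertexOutComponent[subclassOf, {class}]

(* an instance of "scholarly article" is implicitly a "written work" *)
MemberQ[allSuperclasses["scholarly article"], "written work"]
(* True *)
```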
More on Classes
Typically, the second argument of EntityClass is a list of rules. Each rule represents a (Boolean) condition indicating possible class membership. If all conditions applied to an entity indicate membership, then that entity is part of the class (“conjunction”). If we relax the syntactic constraint that only rules are allowed, we can represent more kinds of constraints.
For instance, using a GeoDisk, we can look up the three largest (by area) lakes within a disk of radius 100 km around Madrid:
WikidataData[EntityClass["Lake", {GeoDisk[Entity["City", {"Madrid", "Madrid", "Spain"}], Quantity[100, "Kilometers"]], "Area" -> TakeLargest[3]}], "Area", "Association"]
The GeoDisk lookup considers any item that has a value for “coordinate location” (P625). This idea can be generalized to allow a GeoDisk in any position where otherwise an item would be expected. For instance, in the following query, we are looking for physicists whose birthplace is any item located within the given disk:
WikidataData[EntityClass[ExternalIdentifier["WikidataID", "Q5", <|"Label" -> "human", "Description" -> "common name of Homo sapiens, unique extant species of the genus Homo"|>], {ExternalIdentifier["WikidataID", "P106", <|"Label" -> "occupation", "Description" -> "occupation of a person; see also \"field of work\" (Property:P101), \"position held\" (Property:P39)"|>] -> ExternalIdentifier["WikidataID", "Q169470", <|"Label" -> "physicist", "Description" -> "scientist who does research in physics"|>], ExternalIdentifier["WikidataID", "P19", <|"Label" -> "place of birth", "Description" -> "most specific known (e.g. city instead of country, or hospital instead of city) birth location of a person, animal or fictional character"|>] -> GeoDisk[Entity["City", {"Ulm", "BadenWurttemberg", "Germany"}], Quantity[30, "Kilometers"]]}], {"BirthPlace", "BirthDate"}, "Dataset"]
Future
Ranks and Qualifiers
Wikidata supports associating a “rank” with each statement. By default, each statement is assigned the “normal” rank. Other possible ranks are “deprecated” and “preferred.”
The deprecated rank is assigned to statements that are known to be wrong. Why not just delete the wrong value? Sometimes even reliable sources contain errors. By recording the wrong value—together with the source and deprecated rank—other curators are made aware of the error. This reduces the chance that the wrong value is being re-added over and over again.
Some properties naturally accept multiple values, which are values applicable under different conditions. Examples of these might be density of water at different temperatures and pressures, a population at different points in time and so on. While for certain applications the value is only useful in combination with the associated qualifiers, in others, one is interested in a single value. “What’s the population/the density of water/…?” can be understood as asking about the most recent value/the value at standard conditions, etc. Such values typically receive the preferred rank.
WikidataData takes into account those ranks. By default, a deprecated value is never returned. Only “best” values are included, which means either all preferred values or all normal-rank values if there are no preferred values.
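The selection rule just described can be written down in a few lines. This is only a sketch of the stated rule applied to hand-made data, not the code WikidataData uses internally; the helper name bestValues and the keys "Rank" and "Value" of the sample associations are assumptions of this example:

```wolfram
(* hypothetical helper implementing the "best values" rule *)
bestValues[statements_List] := Module[{usable, preferred},
  (* deprecated statements are never returned *)
  usable = Select[statements, #["Rank"] =!= "deprecated" &];
  preferred = Select[usable, #["Rank"] === "preferred" &];
  (* all preferred values, or else all normal-rank values *)
  If[preferred === {}, usable, preferred]
]

sample = {
   <|"Value" -> 100, "Rank" -> "normal"|>,
   <|"Value" -> 120, "Rank" -> "preferred"|>,
   <|"Value" -> 999, "Rank" -> "deprecated"|>};

Lookup[bestValues[sample], "Value"]
(* {120} *)
```

Without the preferred statement, the same call would fall back to the single normal-rank value, and the deprecated one would still be left out.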
Here is the most recent value for the population of a small human settlement in Germany:
WikidataData[ExternalIdentifier["WikidataID", "Q82911", <|"Label" -> "Weinstadt", "Description" -> "city in Baden-Württemberg, Germany"|>], ExternalIdentifier["WikidataID", "P1082", <|"Label" -> "population", "Description" -> "number of people inhabiting the place; number of people of subject"|>]]
We can request all non-deprecated values using the "StatementRank" suboption of the Method option:
WikidataData[ExternalIdentifier["WikidataID", "Q82911", <|"Label" -> "Weinstadt", "Description" -> "city in Baden-Württemberg, Germany"|>], ExternalIdentifier["WikidataID", "P1082", <|"Label" -> "population", "Description" -> "number of people inhabiting the place; number of people of subject"|>], Method -> "StatementRank" -> "NonDeprecated"] // Short
Those values are somewhat useless without, for instance, the point in time they refer to. We can request such additional detail with the "StatementFormat" suboption:
populations = WikidataData[ExternalIdentifier["WikidataID", "Q82911", <|"Label" -> "Weinstadt", "Description" -> "city in Baden-Württemberg, Germany"|>], ExternalIdentifier["WikidataID", "P1082", <|"Label" -> "population", "Description" -> "number of people inhabiting the place; number of people of subject"|>], Method -> {"StatementRank" -> "NonDeprecated", "StatementFormat" -> "Association"}];
First[populations] // Short
That allows us to plot the population over time:
TimeSeries[#[ExternalIdentifier["WikidataID", "P585", <|"Label" -> "point in time", "Description" -> "time and date something took place, existed or a statement was true"|>]][[1]] -> #["Value"] & /@ populations]
DateListPlot[%]
Access to qualifiers is essential for certain applications. However, an association is not the most natural representation of a qualified value. We are investigating new ways to represent, request and work with qualified data. Therefore, until then, that functionality is “hidden” in the Method option for the motivated expert user.
Wikibase
The software behind Wikidata is called Wikibase. It includes the interface where (human) curators view and enter data, APIs where programs can request data or an edit and the SPARQL query service for sophisticated analysis. While you can download the whole Wikidata dataset in one file and load it onto your own server (say for extended analysis that goes beyond the query time limit of Wikidata), an important use case of Wikibase is for individuals or organizations to manage their own data. Wolfram Language developers are investigating functionality similar to WikidataData to interact with such custom Wikibase installations.
Putting the Pieces Together
Before being able to make a statement about an object, one needs to identify that object. The entity framework provides both: identifiers (Entity) for a large number of objects and data access about those objects (EntityValue). With external identifiers on the one hand and data functions on the other, the Wolfram Language is separating the different concerns of identification and data retrieval. This separation is a good fit for external identifiers. It allows for a diversity of functions, either built into the Wolfram Language—WikidataData being the first such function—or developed by users. This will allow the user to choose whether they prefer looking up, for example, the composer of a musical work in Wikidata or MusicBrainz—both of which support the MusicBrainz work ID. In that sense, external identifiers are the basic building blocks for a new class of functionality to come: EntityValue-like functions for external data. Contact us with your suggestions or publish your own data function on the Wolfram Function Repository.
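A user-defined data function in this spirit could be as small as the following sketch. The name composerOf is made up for illustration; the sketch assumes that the given external identifier is one WikidataData can resolve to an item, and that the Wikidata property P86 ("composer") is the relevant one:

```wolfram
(* hypothetical data function built on top of WikidataData *)
composerOf[id_ExternalIdentifier] :=
  WikidataData[id, ExternalIdentifier["WikidataID", "P86"]]
```

It would be called with the identifier of a musical work, for example a Wikidata item found through WikidataSearch. A second function with the same interface could query a different service for the same identifier, which is exactly the separation of identification and data retrieval described above.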
Get full access to the latest Wolfram Language functionality with a Mathematica 12.1 or Wolfram|One trial.