The Practical Business of Ontology: A Tale from the Front Lines

The Philosophy of Chemicals

“We’ve just got to decide: is a chemical like a city or like a number?” I spent my day yesterday—as I have for much of the past 30 years—designing new features of the Wolfram Language. And yesterday afternoon one of my meetings was a fast-paced discussion about how to extend the chemistry capabilities of the language.

At some level the problem we were discussing was quintessentially practical. But as so often turns out to be the case for things we do, it ultimately involves some deep intellectual issues. And to actually get the right answer—and to successfully design language features that will stand the test of time—we needed to plumb those depths, and talk about things that usually wouldn’t be considered outside of some kind of philosophy seminar.

Thinker

Part of the issue, of course, is that we’re dealing with things that haven’t really ever come up before. Traditional computer languages don’t try to talk directly about things like chemicals; they just deal with abstract data. But in the Wolfram Language we’re trying to build in as much knowledge about everything as possible, and that means we have to deal with actual things in the world, like chemicals.

We’ve built a whole system in the Wolfram Language for handling what we call entities. An entity could be a city (like New York City), or a movie, or a planet—or a zillion other things. An entity has some kind of name (“New York City”). And it has definite properties (like population, land area, founding date, …).

We’ve long had a notion of chemical entities—like water, or ethanol, or tungsten carbide. Each of these chemical entities has properties, like molecular mass, or structure graph, or boiling point.

And we’ve got many hundreds of thousands of chemicals where we know lots of properties. But all of these are in a sense concrete chemicals: specific compounds that we could put in a test tube and do things with.

But what we were trying to figure out yesterday is how to handle abstract chemicals—chemicals that we just abstractly construct, say by giving an abstract graph representing their chemical structures. Should these be represented by entities, like water or New York City? Or should they be considered more abstract, like lists of numbers, or, for that matter, mathematical graphs?

Well, of course, among the abstract chemicals we can construct are chemicals that we already represent by entities, like sucrose or aspirin or whatever. But here there’s an immediate distinction to make. Are we talking about individual molecules of sucrose or aspirin? Or about these things as bulk materials?

At some level it’s a confusing distinction. Because, we might think, once we know the molecular structure, we know everything—it’s just a matter of calculating it out. And some properties—like molar mass—are basically trivial to calculate from the molecular structure. But others—like melting point—are very far from trivial.

OK, but is this just a temporary problem that one shouldn’t base a long-term language design on? Or is it something more fundamental that will never change? Well, conveniently enough, I happen to have done a bunch of basic science that essentially answers this: and, yes, it’s something fundamental. It’s connected to what I call computational irreducibility. And for example, the precise value of, say, the melting point for an infinite amount of some material may actually be fundamentally uncomputable. (It’s related to the undecidability of the tiling problem; fitting in tiles is like seeing how molecules will arrange to make a solid.)

So by knowing this piece of (rather leading-edge) basic science, we know that we can meaningfully make a distinction between bulk versions of chemicals and individual molecules. Clearly there’s a close relation between, say, water molecules, and bulk water. But there’s still something fundamentally and irreducibly different about them, and about the properties we can compute for them.

At Least the Atoms Should Be OK

Alright, so let’s talk about individual molecules. Obviously they’re made of atoms. And it seems like at least when we talk about atoms, we’re on fairly solid ground. It might be reasonable to say that any given molecule always has some definite collection of atoms in it—though maybe we’ll want to consider “parametrized molecules” when we talk about polymers and the like.

But at least it seems safe to consider types of atoms as entities. After all, each type of atom corresponds to a chemical element, and there are only a limited number of those on the periodic table. Now of course in principle one can imagine additional “chemical elements”; one could even think of a neutron star as being like a giant atomic nucleus. But again, there’s a reasonable distinction to be made: almost certainly there are only a limited number of fundamentally stable types of atoms—and most of the others have ridiculously short lifetimes.

There’s an immediate footnote, however. A “chemical element” isn’t quite as definite a thing as one might imagine. Because it’s always a mixture of different isotopes. And, say, from one tungsten mine to another, that mixture might change, giving a different effective atomic mass.

And actually this is a good reason to represent types of atoms by entities. Because then one just has to have a single entity representing tungsten that one can use in talking about molecules. And only if one wants to get properties of that type of atom that depend on qualifiers like which mine it’s from does one have to deal with such things.

In a few cases (think heavy water, for example), one will need to explicitly talk about isotopes in what is essentially a chemical context. But most of the time, it’s going to be enough just to specify a chemical element.

To specify a chemical element you just have to give its atomic number Z. And then textbooks will tell you that to specify a particular isotope you just have to say how many neutrons it contains. But that ignores the unexpected case of tantalum. Because, you see, one of the naturally occurring forms of tantalum (180mTa) is actually an excited state of the tantalum nucleus, which happens to be very stable. And to properly specify this, you have to give its excitation level as well as its neutron count.

In a sense, though, quantum mechanics saves one here. Because while there are an infinite number of possible excited states of a nucleus, quantum mechanics says that all of them can be characterized just by two discrete values: spin and parity.

Every isotope—and every excited state—is different, and has its own particular properties. But the world of possible isotopes is much more orderly than, say, the world of possible animals. Because quantum mechanics says that everything in the world of isotopes can be characterized just by a limited set of discrete quantum numbers.

We’ve gone from molecules to atoms to nuclei, so why not talk about particles too? Well, it’s a bigger can of worms. Yes, there are the well-known particles like electrons and protons that are pretty easy to talk about—and are readily represented by entities in the Wolfram Language. But then there’s a zoo of other particles. Some of them—just like nuclei—are pretty easy to characterize. You can basically say things like: “it’s a particular excited state of a charm-quark-anti-charm-quark system” or some such. But in particle physics one’s dealing with quantum field theory, not just quantum mechanics. And one can’t just “count elementary particles”; one also has to deal with the possibility of virtual particles and so on. And in the end the question of what kinds of particles can exist is a very complicated one—rife with computational irreducibility. (For example, what stable states there can be of the gluon field is a much more elaborate version of something like the tiling problem I mentioned in connection with melting points.)

Maybe one day we’ll have a complete theory of fundamental physics. And maybe it’ll even be simple. But exciting as that will be, it’s not going to help much here. Because computational irreducibility means that there’s essentially an irreducible distance between what’s underneath, and what phenomena emerge.

And in creating a language to describe the world, we need to talk in terms of things that can actually be observed and computed about. We need to pay attention to the basic physics—not least so we can avoid setups that will lead to confusion later. But we also need to pay attention to the actual history of science, and actual things that have been measured. Yes, there are, for example, an infinite number of possible isotopes. But for an awful lot of purposes it’s perfectly useful just to set up entities for ones that are known.

The Space of Possible Chemicals

But is it the same in chemistry? In nuclear physics, we think we know all the reasonably stable isotopes that exist—so any additional and exotic ones will be very short-lived, and therefore probably not important in practical nuclear processes. But it’s a different story in chemistry. There are tens of millions of chemicals that people have studied (and, for example, put into papers or patents). And there’s really no limit on the molecules that one might want to consider, and that might be useful.

But, OK, so how can we refer to all these potential molecules? Well, in a first approximation we can specify their chemical structures, by giving graphs in which every node is an atom, and every edge is a bond.

What really is a “bond”? While it’s incredibly useful in practical chemistry, it’s at some level a mushy concept—some kind of semiclassical approximation to a full quantum mechanical story. There are some standard extra bits: double bonds, ionization states, etc. But in practice chemistry is very successfully done just by characterizing molecular structures by appropriately labeled graphs of atoms and bonds.

OK, but should chemicals be represented by entities, or by abstract graphs? Well, if it’s a chemical one’s already heard of, like carbon dioxide, an entity seems convenient. But what if it’s a new chemical that’s never been discussed before? Well, one could think about inventing a new entity to represent it.

Any self-respecting entity, though, better have a name. So what would the name be? Well, in the Wolfram Language, it could just be the graph that represents the structure. But maybe one wants something that seems more like an ordinary textual name—a string. Well, there’s always the IUPAC way of naming chemicals with names like 1,1′-{[3-(dimethylamino)propyl]imino}bis-2-propanol. Or there’s the more computer-friendly SMILES version: CC(CN(CCCN(C)C)CC(C)O)O. And whatever underlying graph one has, one can always generate one of these strings to represent it.

There’s an immediate problem, though: the string isn’t unique. In fact, however one chooses to write down the graph, it can’t always be unique. A particular chemical structure corresponds to a particular graph. But there can be many ways to draw the graph—and many different representations for it. And in fact even the (“graph isomorphism”) problem of determining whether two representations correspond to the same graph can be difficult to solve.

What Is a Chemical in the End?

OK, so let’s imagine we represent a chemical structure by a graph. At first, it’s an abstract thing. There are atoms as nodes in the graph, but we don’t know how they’d be arranged in an actual molecule (and e.g. how many angstroms apart they’d be). Of course, the answer isn’t completely well defined. Are we talking about the lowest-energy configuration of the molecule? (What if there are multiple configurations of the same energy?) Is the molecule supposed to be on its own, or in water, or whatever? How was the molecule supposed to have been made? (Maybe it’s a protein that folded a particular way when it came off the ribosome.)

Well, if we just had an entity representing, say, “naturally occurring hemoglobin”, maybe we’d be better off. Because in a sense that entity could encapsulate all these details.

But if we want to talk about chemicals that have never actually been synthesized it’s a bit of a different story. And it feels as if we’d be better off just with an abstract representation of any possible chemical.

But let’s talk about some other cases, and analogies. Maybe we should just treat everything as an entity. Like every integer could be an entity. Yes, there are an infinite number of them. But at least it’s clear what names they should be given. With real numbers, things are already messier. For example, there’s no longer the same kind of uniqueness as with integers: 0.99999… is really the same as 1.00000…, but it’s written differently.

What about sequences of integers, or, for that matter, mathematical formulas? Well, every possible sequence or every possible formula could conceivably be a different entity. But this wouldn’t be particularly useful, because much of what one wants to do with sequences or formulas is to go inside them, and transform their structure. But what’s convenient about entities is that they’re each just “single things” that one doesn’t have to “go inside”.

So what’s the story with “abstract chemicals”? It’s going to be a mixture. But certainly one’s going to want to “go inside” and transform the structure. Which argues for representing the chemical by a graph.

But then there’s potentially a nasty discontinuity. We’ve got the entity of carbon dioxide, which we already know lots of properties about. And then we’ve got this graph that abstractly represents the carbon dioxide molecule.

We might worry that this would be confusing both to humans and programs. But the first thing to realize is that we can distinguish what these two things are representing. The entity represents the bulk naturally occurring version of the chemical—whose properties have potentially been measured. The graph represents an abstract theoretical chemical, whose properties would have to be computed.

But obviously there’s got to be a bridge. Given a concrete chemical entity, one of the properties will be the graph that represents the structure of the molecule. And given a graph, one will need some kind of ChemicalIdentify function, that—a bit like GeoIdentify or maybe ImageIdentify—tries to identify from the graph what chemical entity (if any) has a molecular structure that corresponds to that graph.

Philosophy Meets Chemistry Meets Math Meets Physics…

As I write out some of the issues, I realize how complicated all this may seem. And, yes, it is complicated. But in our meeting yesterday, it all went very quickly. Of course it helps that everyone there had seen similar issues before: this is the kind of thing that’s all over the foundations of what we do. But each case is different.

And somehow this case got a bit deeper and more philosophical than usual. “Let’s talk about naming stars”, someone said. Obviously there are nearby stars that we have explicit names for. And some other stars may have been identified in large-scale sky surveys, and given identifiers of some kind. But there are lots of stars in distant galaxies that will never have been named. So how should we represent them?

That led to talking about cities. Yes, there are definite, chartered cities that have officially been assigned names–and we probably have essentially all of these right now in the Wolfram Language, updated regularly. But what about some village that’s created for a single season by some nomadic people? How should we represent it? Well, it has a certain location, at least for a while. But is it even a definite single thing, or might it, say, devolve into two villages, or not a village at all?

One can argue almost endlessly about identity—and even existence—for many of these things. But ultimately it’s not the philosophy of such things that we’re interested in: we’re trying to build software that people will find useful. And so what matters in the end is what’s going to be useful.

Now of course that’s not a precise thing to know. But it’s like for language design in general: think of everything people might want to do, then see how to set up primitives that will let people do those things. Does one want some chemicals represented by entities? Yes, that’s useful. Does one want a way to represent arbitrary chemical structures by graphs? Yes, that’s useful.

But to see what to actually do, one has to understand quite deeply what’s really being represented in each case, and how everything is related. And that’s where the philosophy has to meet the chemistry, and the math, and the physics, and so on.

I’m happy to say that by the end of our hour-long meeting yesterday (informed by about 40 years of relevant experience I’ve had, and collectively 100+ years from people in the meeting), I think we’d come up with the essence of a really nice way to handle chemicals and chemical structures. It’s going to be a while before it’s all fully worked out and implemented in the Wolfram Language. But the ideas are going to help inform the way we compute and reason about chemistry for many years to come. And for me, figuring out things like this is an extremely satisfying way to spend my time. And I’m just glad that in my long-running effort to advance the Wolfram Language I get to do so much of it.

6 comments

  1. Interesting questions and challenges! I saw no mention of the innate packing patterns of atoms (and perhaps of particles in nuclei) in molecules, in relation to stability – I have especially wondered about this in nuclei. There is much that remains unknown I think . . . Thanks for your (and your teams’) work!

  2. Will the solution extend to organizational charts (e.g. for businesses, families and governments)? The solutions to puzzles about ways to organize humans can look a lot like molecular graphs: http://grinfree.com/marriage-employment-and-interdependence/

    Notice that some solutions could facilitate something like cellular automata–I wonder whether that throws a wrinkle into the potential to use Wolfram Language to reason about team-size and composition?

  3. To get around the “fuzziness” you mentioned about bonds, atoms, etc., you might try reading about RFW Baders Quantum Theory of Atoms In Molecules (QTAIM). Paul Popelier’s Atoms in Molecules and Matta and Boyds The Quantum Theory of Atoms in Molecules are a good place to start. Precise definable concepts about atoms bonds and molecules based on the concepts of electron density (a physical measurable property) are elucidated and related to measurable properties. So I think that if you are going to apply the concepts you have developed over the years to chemistry and chemical this (QTAIM) is the place to start

    • Thank you for your interest in this project. As pointed out in the blog post, different representations of a molecule or more or less appropriate in different settings. There really is no one best model that will serve all purposes, so our goal is to make a reasonable number of them available and easily interconverted so that a user’s molecule can be used effectively and efficiently in the computation of choice. Supporting users desiring QTAIM methodology is certainly feasible in the Wolfram Language.

  4. Hi, I work with chemistry organized in the form of “constitutional” graphs [name given back in the fifties]. Many graph theory operations work directly for chemistry. Symmetry,validation, subgraphs[fragments] matching, conjugation, local chemical environment, Lewis adducts,angles, torsions, chirality, hydrogen networks, proton & heavy atom apologetics, meal and inorganic coordination, accessible surface area, ring finding, resonance, flip states, tautomers, rotamers and conformers, surface chemistry, crystal symmetry operations so that the graph becomes periodic, and keep adding more.

    Your title used the word ontology but you rightly didn’t use the word in this article but instead graphs:-) I haven’t found an data structures for chemistry more useful than graphs

    • Great! Chemical graphs are one of our favorite representations, and workhorse for the computational chemistry community. Being able to treat a molecule as a graph object in the Wolfram language is very high on our list of features. We’ll also be supporting other representations where 3D atomic coordinates are appropriate, and intend to make it as easy as possible for a user’s molecule to be used effectively and efficiently in the computation of choice.