For the past couple of years, I’ve been playing with, collecting and analyzing data from used car auctions in my free time with an automotive journalist named Steve Lang to try and get an idea of what the used car market looks like in terms of long-term vehicle reliability. I figured it was about time that I showed off some of the ways that the Wolfram Language has allowed us to parse through information on over one million vehicles (and counting).
I’ll start off by saying that there isn’t anything terribly elaborate about the process we’re using to collect and analyze the information on these vehicles; it’s mostly a process of reading in reports from our data provider (and cleaning up the data), and then cross-referencing that data with various automotive APIs to get additional information. This data then gets dumped into a database that we use for our analysis, but having all of the tools we need built into the Wolfram Language makes the entire operation something that can be scripted—which greatly streamlines the process. I’ll have to skip over some of the details or this will be a very long post, but I’ll try to cover most of the key elements.
The data we get comes in from a third-party provider that manages used car auctions around the country (unfortunately, our licensing agreement doesn’t allow me to share the data right now), but it’s not very computable at first (the data comes in as a text file report once a week):
Fortunately, parsing this sort of log-like data into individual records is easy in the Wolfram Language using basic string patterns:
Then it’s mostly a matter of cleaning up the individual records into something more standardized (I’ll spare you some of the hacky details due to artifacts in the data feed). You’ll end up with something like the following:
From there, we use the handy Edmunds vehicle API to get more information on an individual vehicle using their VIN decoder:
We then insert the records into an HSQL database (conveniently included with Mathematica), resulting in an easy way to search for the records we want:
From there, we can take a quick look at metrics using larger datasets, such as the number of transmission issues for a given set of vehicles for different model years:
Or a histogram of those issues broken down by vehicle mileage:
It also lets us look at industry-wide trends, so we can develop a baseline for what the expected rate of defects for an average vehicle (or vehicle of a certain class) should be:
We can then compare a given vehicle to that model:
We then use that model, as well as other information, to generate a statistical index. We use that index to give vehicles an overall quality rating based on their historical reliability, which ranges from a score of 0 (chronic reliability issues) to 100 (exceptional reliability), with the industry average hovering right around 50:
We also use various gauges to put together informative visualizations of defect rates and the overall quality:
There is a lot more we do to pull all of this together (like the Wolfram Language templating we use to generate the HTML pages and reports), and honestly, there is a whole lot more we could do (my background in statistics is pretty limited, so most of this is pretty rudimentary, and I’m sure others here may already have ideas for improvements in presentation for some of this data). If you’d like to take a look at the site, it’s freely available (Steve has a nice introduction to the site here, and he also writes articles for the page related to practical uses for our findings).
Our original site was called the Long-Term Quality Index, which is still live but showed off my lack of experience in HTML development, so we recently rolled out our newer, WordPress-based venture Dashboard Light, which also includes insights from our auto journalist on his experiences running an independent, used car dealership.
This is essentially a two-man project that Steve and I handle in our (limited) free time, and we’re still getting a handle on presenting the data in a useful way, so if anyone has any suggestions or questions about our methodology, feel free to reach out to us.
Cheers!
Continue the conversation at Wolfram Community.
]]>
Toolbox for the Mathematica Programmers
This new guide from Viktor Aladjev and V. A. Vaganov outlines a modular approach to programming with the Wolfram Language. Providing over 800 tools that can be incorporated into a variety of projects, Toolbox for the Mathematica Programmers will be useful for students and seasoned programmers alike.
Option Valuation under Stochastic Volatility II: With Mathematica Code
In this second volume of his series about quantitative finance, Alan L. Lewis’s Option Valuation under Stochastic Volatility II: With Mathematica Code expands his original focus to include jump diffusions. The finance industry is increasingly relying on computational analysis to model risk and track customer data. Lewis’s volume is a welcome addition to the literature of the field, of interest for both researchers and investors/traders looking to learn more about computational thinking. Topics covered include spectral theory for jump diffusions, boundary behavior for short-term interest rate models, modeling VIX options, inference theory and discrete dividends.
CRC Standard Curves and Surfaces with Mathematica
The third edition of the popular CRC Standard Curves and Surfaces with Mathematica is an indispensable reference text for anyone who works with curves and surfaces, from engineers to graphic designers. With new illustrations in almost every chapter, the updated version contains nearly 1,000 visualizations, depicting nearly every geometrical figure used today. It also includes a CD with a series of interactive Computable Document Format (CDF) files.
T. D. McGlone provides a useful introduction to Butterworth and Bessel (aka Thomson) filter functions. With an overview of mathematical functions, topology choices and component selection based on sensitivity criteria, Butterworth & Bessel Filters will be particularly useful for engineers.
Automation of Finite Element Methods
Another text for engineers, Automation of Finite Element Methods provides an introduction to developing virtual prediction techniques. New finite elements need to be created for individual purposes, which can be time-consuming. Authors Jože Korelc and Peter Wriggers outline an approach to automating this process through Wolfram Language programming.
Computational Proximity: Excursions in the Topology of Digital Images
Based on James F. Peters’s popular graduate course on the topology of digital images, Computational Proximity: Excursions in the Topology of Digital Images introduces the concept of computational proximity as an algorithmic approach to finding nonempty sets of points that are either close to each other or far apart. Peters discusses the applications of this concept in computer vision, multimedia, brain activity, biology, social networks and cosmology.
Now available as well is the Chinese translation of Stephen Wolfram’s An Elementary Introduction to the Wolfram Language: Wolfram 语言入门. The translated edition includes all of the material that made the English edition popular with anyone wanting to learn to program in the Wolfram Language. Look out for translations into additional languages in the future! |
It’s been a busy year here at the Wolfram Blog. We’ve written about ways to avoid the UK’s most unhygienic foods, exciting new developments in mathematics and even how you can become a better Pokémon GO player. Here are some of our most popular stories from the year.
In August, we launched Version 11 of Mathematica and the Wolfram Language. The result of two years of development, Version 11 includes exciting new functionality like the expanded map generation enabled by satellite images. Here’s what Wolfram CEO Stephen Wolfram had to say about the new release in his blog post:
OK, so what’s the big new thing in Version 11? Well, it’s not one big thing; it’s many big things. To give a sense of scale, there are 555 completely new functions that we’re adding in Version 11—representing a huge amount of new functionality (by comparison, Version 1 had a total of 551 functions altogether). And actually that function count is even an underrepresentation—because it doesn’t include the vast deepening of many existing functions.
Using the Wolfram Language, John McLoone analyzes government data about food safety inspections to create visualizations of the most unhygienic food in the UK. The post is a treasure trove of maps and charts of food establishments that should be avoided at all costs, and includes McLoone’s greatest tip for food safety: “If you really care about food hygiene, then the best advice is probably just to never be rude to the waiter until after you have gotten your food!”
Bernat Espigulé-Pons creates visualizations of Pokémon across multiple generations of the game and then uses WikipediaData, GeoDistance and FindShortestTour to create a map to local Pokémon GO gyms. If you’re a 90s kid or an avid gamer, Espigulé-Pons’s Pokémon genealogy is perfect gamer geek joy. If you’re not, this post might just help to explain what all those crowds were doing in your neighborhood park earlier this year.
Connor Flood writes about creating “the world’s first online syntax-free proof generator using induction,” which he designed using Wolfram|Alpha. With a detailed explanation of the origin of the concept and its creation from development to prototyping, this post provides a glimpse into the ways that computational thinking applications are created.
Wolfram|Alpha Chief Scientist Michael Trott returns with a post about the history of the discovery of the exact value of the Planck constant, covering everything from the base elements of superheroes to the redefinition of the kilogram.
In January of 2016, we launched the Wolfram Open Cloud to—as Stephen Wolfram says in his blog post about the launch—“let anyone in the world use the Wolfram Language—and do sophisticated knowledge-based programming—free on the web.” You can read more about this integrated cloud-based computing platform in his January post.
In February, the Laser Interferometer Gravitational-Wave Observatory (LIGO) announced that it had confirmed the first detection of a gravitational wave. Wolfram software engineer Jason Grigsby explains what gravitational waves are and why the detection of them by LIGO is such an exciting landmark in experimental physics.
Silvia Hao uses Mathematica to recreate the renaissance engraving technique of stippling: a kind of drawing style using only points to mimic lines, edges and grayscale. Her post is filled with intriguing illustrations and is a wonderful example of the intersection of math and illustration/drawing.
In April, we reported on new books that use Wolfram technology to explore a variety of STEM topics, from data analysis to engineering. With resources for teachers, researchers and industry professionals and books written in English, Japanese and Spanish, there’s a lot of Wolfram reading to catch up on!
The year 2016 also saw the launch of Wolfram Programming Lab, an interactive online platform for learning to program in the Wolfram Language. Programming Lab includes a digital version of Stephen Wolfram’s 2016 book, An Elementary Introduction to the Wolfram Language, as well as Explorations for programmers already familiar with other languages and numerous examples for those who learn best by experimentation.
]]>The general idea of Ed Pegg’s tribute post honoring Martin Gardner, “Extreme Orchards for Gardner,” is to find patterns for planting trees in configurations with constraints like “25 trees to get 18 lines, each having 5 trees.” Most of the configurations look like ridiculous ideas of how to plant actual trees. For example:
I have a seven-acre apple orchard with 200+ trees in New York’s Adirondack Park, and so I read “Extreme Orchards for Gardner” as a gardener first. Of course, Pegg’s post was never intended as a proposal for how to plant actual orchards, but as I live in the middle of an orchard, I can’t help wondering, what if you did plant orchards this way?
When considering this as an actual planting pattern, we should borrow that character ubiquitous in physics: the observer. To the observer on the ground, only the center cluster would look much like an orchard; the trees at the vertices would appear to have nothing much to do with the rest.
One of my favorite physics jokes is the one about the theoretical physicist who loses his job as a professor and has to go to work as a milkman. (Once upon a time, milk was delivered to people’s houses by “milkmen.”) After a few weeks on the job, the physicist just can’t stand not being able to give lectures. So he assembles his colleagues in front of a blackboard, draws a circle on the board and begins by saying, “Consider a spherical cow of uniform density.” The representation of orchards by Martin Gardner, Branko Grünbaum and such in the usual rendition of the orchard planting problem is to real orchards as spherical cows are to the animals who produce the milk you drink. So, to some extent, the fact that trees are not points and need a certain spacing is an unfair criticism. Nonetheless, since every way I look out my windows I see real apple trees, I feel compelled to point this out. (I think Grünbaum, who was my professor many years ago and who encouraged us to reality-test our mathematical ideas, would approve.)
This is even more true for this configuration involving rows of six “trees.” Just how much land would it take to plant an orchard like this using real trees? No one would do this.
Pegg also shows some more possible configurations—like these, in which the lines pass through exactly four trees each. For actual, rather than hypothetical, trees, some of these look a bit more workable.
My own apple trees, planted in the mid-1980s, are planted in rows, which is practical if a bit boring.
There are pragmatic constraints involved in planting apple trees. The orchardist needs access to the trees from two sides, both for maintenance (pruning, spraying, etc.) and to harvest apples. Assuming semi-dwarf trees, this involves aisles with a minimum width of about 22 feet (ca. 6.7 meters), starting from the center of each trunk. The trees should be planted no closer than intervals of 16 feet (ca. 4.9 meters) to give them enough air and light.
Only configurations in which there is a small variation in the segments connecting trees could realistically be planted as something that would, on the ground, resemble an orchard. Most of the configurations would require an enormous amount of land and so are mostly mathematical abstractions rather than something one could really implement.
But the configuration on the lower left in Pegg’s four-tree grouping looks like something one could actually plant. Like so:
One advantage I see in the configurations with a small variation in segment length is that planting a portion of the orchard as pentagons within pentagons reduces the amount of grass under the trees to be maintained, thus significantly reducing mowing and therefore labor and gasoline costs. So it is not completely foolish to consider planting at least a small orchard this way.
I am attracted to the 25-tree pentagon configuration because of its empty center circle, creating a private grove space. Taking into account an air gap around the outside, my guess is that a circle in the field of about 125 feet in diameter should be big enough. That center circle could, for example, hold a very nice circle of wildflowers 20 feet across for bee forage, maybe some beehives in the center, and still leave room for equipment to navigate.
Another advantage: this would be a good layout for planting five types of trees in groups of five. They could then be easily identified in their mini-groves and harvested together. The more I thought about it, the more this became something I might actually want to do. I started shopping online for heritage varieties of apple trees, looking around at my farm for the right place to put the new trees, imagining new designs…. Hmm.
Pegg, on the other hand, is more concerned with finding new solutions to the abstract version of the orchard problem, which are indeed quite beautiful, if impractical for the planting of trees:
These contemplations make me want to go deeper into mathematical patterns to see what else might be plantable. Maybe this last “orchard” plot might work with bulbs.
]]>Building on thirty years of research, development and use throughout the world, Mathematica and the Wolfram Language continue to be both designed for the long term and extremely successful in doing computational mathematics. The nearly 6,000 symbols built into the Wolfram Language as of 2016 allow a huge variety of computational objects to be represented and manipulated—from special functions to graphics to geometric regions. In addition, the Wolfram Knowledgebase and its associated entity framework allow hundreds of concrete “things” (e.g. people, cities, foods and planets) to be expressed, manipulated and computed with.
Despite a rapidly and ever-increasing number of domains known to the Wolfram Language, many knowledge domains still await computational representation. In his blog “Computational Knowledge and the Future of Pure Mathematics,” Stephen Wolfram presented a grand vision for the representation of abstract mathematics, known variously as the Computable Archive of Mathematics or Mathematics Heritage Project (MHP). The eventual goal of this project is no less than to render all of the approximately 100 million pages of peer-reviewed research mathematics published over the last several centuries into a computer-readable form.
In today’s blog, we give a glimpse into the future of that vision based on two projects involving the semantic representation of abstract mathematics. By way of further background and motivation for this work, we first briefly discuss an international workshop dedicated to the semantic representation of mathematical knowledge, which took place earlier this year. Next, we present our work on representing the abstract mathematical concepts of function spaces and topological spaces. Finally, we showcase some experimental work on representing the concepts and theorems of general topology in the Wolfram Language.
In February 2016, the Wolfram Foundation, together with the Fields Institute and the IMU/CEIC working group for the creation of a Global Digital Mathematics Library, organized a Semantic Representation of Mathematical Knowledge Workshop designed to pool the knowledge and experience of a small and select group of experts in order to produce agreement on a forward path toward the semantic encoding of all mathematics. This workshop was sponsored by the Alfred P. Sloan Foundation and held at the Fields Institute in Toronto. The workshop included approximately forty participants who met for three days of talks and discussions. Participants included specialists from various fields, including:
Among the many accomplished and knowledgeable participants (a complete list of whom, together with the complete schedule of events, may be viewed on the workshop website), Georges Gonthier and Tom Hales shared their experience on the world’s largest extant formal proofs (the Feit–Thompson odd order theorem and the Kepler conjecture, respectively); Harvey Friedman, Dana Scott and Yuri Matiyasevich brought expertise on mathematical foundations, incompleteness and undecidability; Jeremy Avigad and John Harrison shared their knowledge and experience in designing and implementing two of the world’s most powerful theorem provers; Bruno Buchberger and Wieb Bosma contributed extensive knowledge on computational mathematics; Fields Medal winners Stanislav Smirnov and Manjul Bhargava expounded on the needs of practicing mathematicians; and Ingrid Daubechies and Stephen Wolfram shared their thoughts and knowledge on many technical and organizational challenges of the problem as a whole.
As one might imagine, the list of topics discussed at the workshop was quite extensive. In particular, it included type theory, the calculus of constructions, homotopy type theory, mathematical vernacular, partial functions and proof representations, together with many more. The following word cloud, compiled from the text of hundreds of publications by the workshop participants, gives a glimpse of the main topics:
Recordings of workshop presentations can be viewed on the workshop video archive, and a white paper discussing the workshop’s outcomes is also available. In addition, because of the often under-emphasized yet vital importance of the subject for the future development (and practice) of mathematics in the coming decades, 18 participants were interviewed on the technological and scientific needs for achieving such a project, culminating in a 90-minute video (excerpts also available in a 9-minute condensed version) that highlights the visions and thoughts of some of the world’s most important practitioners. We thank filmmaker Amy Young for volunteering her time and talents in the compilation and production of this unique glimpse into the thoughts of renowned mathematicians and computer scientists from around the world, which we sincerely hope other viewers will find as inspiring and enlightening as we do.
The eCF project encoded continued fraction terminology, theorems, literature and identities in computational form, demonstrating that Wolfram|Alpha and the Wolfram Language provide a powerful framework for representing, exposing and manipulating mathematical knowledge.
While the theory of continued fractions contains both high-level and abstract mathematics, it represents only a tiny first step toward Stephen Wolfram’s grand vision for computational access to all of mathematics and the dynamic use of mathematical knowledge. Our next step down this challenging path therefore sought to encode within the Wolfram Language and Wolfram|Alpha entity-property framework a domain of more abstract and inhomogeneous mathematical objects having nontrivial properties and relations. The domain chosen for this next step was the important and fairly abstract branch of mathematics known as functional analysis.
That step posed a number of new challenges, among them the need for graduate-level mathematical knowledge in the domain of interest, formulation of entity names that “naturally” contain parameters and encode additional information (say, measure spaces) and the introduction of stub extensions to the Wolfram Language.
Work was carried out from December 2014–July 2016 and consisted of knowledge curation in three interconnected knowledge domains: "FunctionSpace", "TopologicalSpaceType" and "FunctionalAnalysisSource", together with the development of framework extensions to support them. This functionality was recently made available through the Wolfram Language entity framework and consists of the following content:
Full availability on the Wolfram|Alpha website is expected by early January 2017.
Two underlying concepts in functional analysis are those of the function space and the topological space. A function space is a set of functions of a given kind from one set to another. Common examples of function spaces include L^{p} spaces (Lebesgue spaces; defined using a natural generalization of the p-norm for finite-dimensional vector spaces) and C^{k} spaces (consisting of functions whose derivatives exist and are continuous up to k^{th} order).
As a simple first example in accessing this functionality, we can use RandomEntity to return a sample list of function spaces:
Similarly, EntityValue can be used to access curated properties for a given space:
As can be seen in various properties in this table, some mathematical representations required the introduction of new symbols not (yet) present in the Wolfram Language. This was accomplished by introducing them into a special PureMath` context. For example, after evaluating the above table, the following “pure math extension symbols” appear:
For now, these constructs are just representational. However, they are not merely placeholders for mathematical concepts/computational structures, but also have the benefit of enhancing human readability by automatically adding traditional mathematical typesetting and annotations. This can be seen, for example, by comparing the raw semantic expressions in the table above with those displayed on the Wolfram|Alpha website:
In the longer term, many such concepts may be instantiated in the Wolfram Language itself. As a result, both this and any similar semantic projects to follow will help guide the inclusion and implementation of computational mathematical functionality within Mathematica and the Wolfram Language.
A slightly more involved example demonstrates how the entity framework can be used to construct programmatic queries. Here, we obtain a list of all curated function spaces associated with mathematician David Hilbert:
One interesting property from the table above that warrants a bit more scrutiny is "RelationshipGraph". This consists of a hierarchical directed graph connecting all curated topological space types, where nodes A and B are connected by a directed edge A↦B if and only if “S is a topological space of type A” implies “S is a topological space of type B”, and with the additional constraint that all nodes are connected only via paths maximizing the number of intermediate nodes. For each function space, this graph also indicates (in red) topological space types to which a given space belongs. For example, the Lebesgue space L^{2} has the following relationship graph:
Here we show a similar graph in a slightly more streamlined and schematic form:
This graph corresponds to the following topological space type memberships:
While portions of this graph appear in the literature, the above graph represents, to our knowledge, the most complete synthesis of the hierarchical structure of topological vector spaces available. (The preceding notwithstanding, it is important to keep in mind that the detailed structure depends on the detailed conventions adopted in the definitions of various topological spaces—conventions that are not uniform across the literature.) A number of interesting facts can be gleaned from the graph. In particular, it can immediately be seen that the well-known Hilbert and Banach spaces (which have high-level structural properties whose relaxations lead to more general spaces) fall at the top of the hierarchical heap together with “inner product space.” On the other hand, topological vector spaces are the “most generic” types in some heuristic sense.
During the curation process, we have taken great care that function space properties are correct for all parameter values. This can be illustrated using code like the following to generate a tab view of Lebesgue spaces for various values of its parameter p and noting how properties adjust accordingly:
One of the beautiful things about computational encoding (and part of the reason it is so desirable for mathematics as a whole) is that known results can be easily tested or verified. (Similarly, and maybe even more importantly, new propositions can be easily formulated and explored.) As an example, consider the duality of Lebesgue spaces L^{p} and L^{q} for 1/p+1/q=1 with p≥1. First, define a variable to represent the L^{p} entity:
Now, use the "DualSpace" property (which may be specified either as a string or via a fully qualified EntityProperty["FunctionSpace", "DualSpace"] object, the latter of which may be given directly in that form or the corresponding formatted form ) to obtain the dual entity:
As can be seen, this formulation allows computation to be performed and expressed through the elegant paradigm of symbolic transformation of the entity canonical name. Taking the dual space of L^{q} in turn then gives:
Finally, applying symbolic simplification to the entity canonical name:
This verifies we have obtained the same space we originally started with:
In other words, that the double dual (L^{p})^{**}, where * denotes the dual space, is equivalent to L^{p}. (Function spaces with this property are said to be reflexive.)
It is also important to emphasize that the curation of the existing literature on function spaces is not always straightforward, as illustrated in particular by the myriad of (mutually conflicting) conventions used for the interrelated collection of function spaces known as Campanato–Morrey spaces:
This challenge is made clear with the following table, whose creation required a meticulous study of the literature:
As a result of multiple conventions, we chose in cases like this to include multiple, separate entities that are equivalent under appropriate (but possibly nontrivial) transformations of parameters and notations. For example:
A topological space may be defined as a set of points and neighborhoods for each point satisfying a set of axioms relating the points and neighborhoods. The definition of a topological space relies only upon set theory and is the most general notion of a mathematical space that allows for the definition of concepts such as continuity, connectedness and convergence. Other spaces, such as manifolds and metric spaces, are specializations of topological spaces with extra structures or constraints. Common examples of topological vector spaces include the Banach space (a complete normed vector space) and the Hilbert space (an abstract Banach space possessing the structure of an inner product that allows length and angle to be measured). Topological spaces could be considered more abstract than function spaces (e.g. they are typically defined based on the existence of a norm as opposed to having a definite value for their norm). Being so general, topological spaces are a central unifying notion and appear in virtually every branch of modern mathematics. The branch of mathematics that studies topological spaces in their own right is called point-set topology or general topology.
EntityList can be used to see a complete list of curated topological space types:
Similarly, EntityValue[space type, "PropertyAssociation"] returns all curated properties for a given space:
While more could be said and done with topological space types, in this project this domain was primarily used as a convenient way to classify function spaces. However, as the second project to be discussed in this blog will show, additional exploratory work is currently being done that could result in the augmentation of the human- (but not computer-) readable descriptions of topological spaces with semantically encoded versions potentially even suitable for use with automated proof assistants or theorem provers.
A final component added in this project was a set of cross-linked literature references that provide provenance and documentation for the various conventions (definitions etc.) adopted in our curated functional analysis datasets. These references can be searched based on the journal in which a paper appears, the year or decade it was published, the author or the language in which it was written:
For mathematicians who wish to explore the source of the data down to the page (theorem etc.) level, this information has also been encoded:
Finally, we can use this detailed reference information in a way that provides a convenient overview of both existing notational conventions and those we adopted in this project:
The second project we discuss in this blog is the not-unrelated augmentation of the Wolfram Language to precisely represent the definitions of mathematical concepts, statements and proofs in the field of point-set topology. This was done by creating an “entity store” for general topology consisting of concepts and theorems curated from the second edition of James Munkres’s popular Topology textbook. Although this project did not construct an explicit proof language (suitable, say, for use by a proof assistant or automated theorem prover), it did result in the comprehensive representation of 216 concepts and 225 theorems from a standard mathematical text, which is a prelude to any work involving machine proof.
EntityStore is a function introduced in Version 11 of the Wolfram Language that allows custom entity-property data to be packaged, placed in the cloud via the Wolfram Data Repository and then conveniently loaded and used. To load and use the general topology entity store, first access it via its ResourceData handle, then make it available in the Wolfram Language by prepending it to the list of known entity stores contained in the global $EntityStores variable:
As can be seen in the output, a nice summary blob shows the contents of the registered stores (in this case, a list containing the single store we just registered), including the counts of entities and properties in each of its constituent domains. Now that the entity store is registered, the custom entities it contains can be used within the Wolfram Language entity framework just as if they were built in. For example:
Similarly, we can see a full list of currently supported properties for topological theorems using EntityValue:
Before proceeding, we perform a little context path manipulation to make output symbols format more concisely (slightly deferring a discussion of why we do this until the end of this section):
A nice summary table can now be generated to show basic information about a given theorem:
"InputFormSummaryGrid" displays the same information as "SummaryGrid", but without applying the formatting rules we’ve used to make the concepts and theorems easily readable. It’s a good way to see the exact internal representation of the data associated with the entity. This can help us to understand what is going on when the formatting rules obscure this structure:
While it’s pretty straightforward to understand the mathematical assertion being made here, let’s look at each property in detail. Here, for example, is the display name (“label”) used for the entity representing the above theorem in the entity store, formatted using InputForm to display quotes explicitly and thus emphasize that the label is a string:
Similarly, here are alternate ways of referring to the theorem:
… the universally quantified variables appearing at the top level of the theorem statement (i.e. these are the variables representing the objects that the theorem is “about”):
… the conditions these objects must satisfy in order for the theorem to apply:
… and the conclusion of the theorem:
Of course, we could have just as easily listed Math["IsHausdorff"][Χ] as a restriction to this theorem and Math["IsT1"][Χ] as the statement since the manner in which the hypotheses are split between "Restrictions" and "Statement" is not unique. However, while the details of the splitting are subject to style and readability, the mathematical content of the theorem as expressed through any of these subjective choices is equivalent.
Finally, we can retrieve metadata about the source from which the theorem was curated:
Now, backing up a bit, you may well wonder about expressions with structures such as Category[...] and Math[...] that you’ve seen above. Let’s take a look at one of them, but this time through a general topology concept instead of a theorem:
Some of these properties are shared with corresponding properties for theorems:
You can see the common properties by intersecting the full lists of supported properties for concepts and theorems:
While properties are similar across topology theorems and concepts, there are some differences that should be addressed. "Arguments" for a concept takes the role of "QualifyingObjects" for a theorem. Just as theorems are thought of as applying to certain objects, concepts are thought of as functions that can be applied to certain objects. The output can be a Boolean value, as in this case. We would call such a concept a property or a predicate. Other concepts represent mathematical structures. For example, Math["MetricTopology"] takes a metric space as an argument and outputs the corresponding topology induced by the metric. The entity that corresponds to this math concept is .
A "Restrictions" property for concepts is very similar to the corresponding property for theorems. And just as in the case with theorems, there’s nothing in principle stopping us from moving this condition from "Restrictions" and conjoining it to the output. The difference is that this can always be done for theorems, but it can only be done for concepts representing properties since the output is interpreted as having a truth value:
Finally, the "Output" property here gives the value of the expression Math["IsHausdorff"][X]:
When we use such an expression in a theorem or in the definition of another concept, we interpret it as equivalent to what we see in "Output". As we know, stating and understanding mathematics is much easier when we have such shorthands than if all theorems were stated in terms of atomic symbols and basic axioms.
Two of the most exciting properties on this list are "RelatedConcepts" and "RelatedTheorems". One of our goals is to represent mathematical concepts and theorems in a maximally computable way, and these are just an example of some of the computations we hope to do with these entities. A concept appears in "RelatedConcepts" if it is used in the "Restrictions", "Notation" or "Output" of a concept or the "Restrictions", "Notation" or "Statement" of a theorem. A theorem appears in the "RelatedTheorems" of a concept if that concept appears in the "RelatedConcepts" of that theorem. With this in mind, take a closer look at the examples above:
It is important to emphasize that these relations were not curated, but rather computed, which is possible because of the precise, consistent and expressive language used to encode the concepts and theorems. As a matter of convenience, however, they’ve been precomputed for speed to allow you to, say, easily find the definition of concepts appearing in a theorem.
As an example of the power of this approach, we can use the Wolfram Language’s graph functionality to easily analyze the connectivity and structure of the network of topological theorems and concepts in our corpus:
As was the case for topological spaces, a number of extension symbols to the Wolfram Language were introduced in this project. We already encountered the Math and Theorem extensions, but there are also a number of others. For now, they have been placed in a GeneralTopology` context (analogous to the PureMath` context introduced for function spaces). This can be verified by examining the context of such symbols, e.g.:
The motivation behind appending GeneralTopology` to our context path is also now revealed, namely to suppress verbose context formatting in our outputs (so we will see things like Math instead of GeneralTopology`Math). Here is a complete listing of language extensions introduced in the GeneralTopology` context:
Again—as was the case for language extensions introduced for function spaces—some of these may eventually find their way into the Wolfram Language. However, independent of such considerations, these two small projects already show the need for some kind of infrastructure that allows incorporation, sharing and alignment of language extensions from different—and likely independently curated—domains.
We close with some experimental tidbits used to enhance the readability and usability of the concepts and theorems in our entity store. You have probably already noted the nice formatting in "SummaryGrid" and possibly even wondered how it was achieved. The answer is that it was produced using a set of MakeBoxes assignments packaged inside the entity store via the property EntityValue["GeneralTopologyTheorem", "TraditionalFormMakeBoxAssignments"]. Similarly, in order to provide usage messages for the GeneralTopology` symbols (which must be defined prior to having messages associated with them), we have packaged the messages in the special experimental EntityValue["GeneralTopologyTheorem", "Activate"] property, which can be activated as follows:
The result is the instantiation of standard Mathematica-style usage messages such as:
While the eventual implementation details of such features into a standard framework remains the subject of ongoing design and technical discussions, the ease with which it is possible to experiment with such functionality (and to implement semantic representation of mathematical structures in general) is a testament to the power and flexibility of the Wolfram Language as a development and prototyping tool.
These projects undertaken at Wolfram Research during the last year have explored the semantic representation of abstract mathematics. In order to facilitate experimentation with this functionality, we have posted two small notebooks to the cloud (function space entity domain and the topology entity store) that allow interactive exploration and evaluation without the need to install a local copy of Mathematica. We welcome your feedback, comments and even collaboration in these efforts to extend and push the limits of the mathematics that can be represented and computed.
As a final note, we would like to emphasize that significant portions of the work discussed here were carried out as a part of internship projects. If you know or are a motivated mathematics or computer science student who is interested in trying to break new ground in the semantic representation of mathematics, please consider 1) learning the Wolfram Language (which, since you are reading this, you may well have already) and 2) joining the Wolfram internship program next summer!
]]>This is where Wolfram comes in. Our UK-based Technical Services Team worked with the British NHS to help solve a specific problem facing the NHS—one many organizations will recognize: data sitting in siloed databases, with limited analysis algorithms on offer. They wanted to see if it was possible to pull together multiple data sources, combining off-the-shelf clinical databases with the hospital trusts’ bespoke offerings and mine them for signals. We set out to help them answer questions like “Can the number of slips, trips and falls in hospitals be reduced?”
I was assigned by Wolfram to lead the analysis. The databases I was given consisted of about six years’ worth of anonymized data, just over 120 million patient records. It contained a mixture of aggregate averages and patient-level daily observations, drawn from four different databases. While Mathematica is not a database, it has the ability to interface with them easily. I was able to plug into the SQL databases and pull in data from Excel, CSV and text files as needed, allowing us to inspect and streamline the data.
Working closely with a steering committee comprising healthcare professionals, academics and patients, we identified a range of parameters to investigate, including the level of nurse staffing and training, average patient heart rate and the rate of patients suffering from slips and falls. Altogether, the team identified around 1,000 parameter pairings to investigate, far too many to work through by hand in the limited time available.
Some of the tools in the Wolfram Language that made this achievable include:
These tools enabled us to rapidly scale up the analysis across this complex dataset, allowing more time to consider the validity of the relationships and signals that emerged. Some of these seemed obvious—wards where patients were more likely to be bed-bound for medical reasons had fewer falls. But not all the signals were this easy to explain. For example, an increase in the number of nurses appeared to be linked to an increase in falls.
This observation seemed surprising. Given that there is little variation in ward size, it seemed unlikely that more nurses would lead to a decrease in patient safety. But not all nurses are equivalent. When we considered the ratio of registered nurses to healthcare support workers, we saw a strong relationship between the increase in highly trained registered nurses and the increase in patient safety.
So we see an increase in falls in some wards that rely more heavily on healthcare support workers. Could these wards be forced to rely on these less qualified, lower-paid nurses when in truth fully licensed, registered nurses are needed? I can only speculate, and the data at this stage is insufficient to answer this question. But following this analysis, the hospital trust in question has changed its staffing policy to increase the level of registered-nurse employment. Whether it leads to an increase in patient safety or a new issue raises its head—we will have to wait and see.
For the full findings, see the paper published this week in BMJ Open.
This project has only started to scrape the surface of the complexities hidden inside this rich dataset. In a mere 10 days, relying on the flexibility designed into the Wolfram Language, we’re able to deliver some insight into this complex problem.
Contact the Wolfram Technical Services group to discuss your data science or coding projects.
]]>Computational thinking needs to be an integral part of modern education—and today I’m excited to be able to launch another contribution to this goal: Wolfram|Alpha Open Code.
Every day, millions of students around the world use Wolfram|Alpha to compute answers. With Wolfram|Alpha Open Code they’ll now not just be able to get answers, but also be able to get code that lets them explore further and immediately apply computational thinking.
It takes a lot of sophisticated technology to make this possible. But to the user, it’s simple. Do a computation with Wolfram|Alpha. Now in almost every section of the output you’ll see an “Open Code” link. Click it and Wolfram|Alpha will generate code for you, then open it in a fully runnable and editable notebook that you can immediately use in the Wolfram Open Cloud:
The sections of the notebook parallel the sections of your Wolfram|Alpha output. But now each section contains not results, but instead core Wolfram Language code needed to get those results. You can run any piece of code by clicking the [>] button (or typing Shift+Enter):
But the really important thing is that right there on the web you can change and extend the code, and then instantly run it again:
If all someone wants is a single, quick result, then classic Wolfram|Alpha should be all they’ll need. But as soon as they want to go further—that’s where Wolfram|Alpha Open Code comes in.
Let’s say you just got a mathematical result from Wolfram|Alpha:
But then you wonder: “what happens for a whole range of exponents?” Well, it’s going to get pretty complicated to tell Wolfram|Alpha what you want just using natural language. But it’s easy to say what to do by giving a tiny bit of Wolfram Language code (and, yes, you can interactively spin those 3D surfaces around):
You could give code to interactively change the parameters too:
Starting with Wolfram|Alpha, then extending using the Wolfram Language, is very powerful. Here’s what happens with some real-world data. Start in Wolfram|Alpha, then get the underlying Wolfram Language code (it can be made shorter, but then it’s a little less clear what’s going on):
Evaluate the code to get a time series. Then plot it. And divide by the corresponding result for the US:
An important feature of notebooks is that they’re full, computable documents—and you can add whatever you want to them. You can do a whole series of computations. You can put in text to annotate what you’re doing. You can add section headings. You can edit out parts you don’t need. And so on. And of course you can do all of this in the cloud, using any modern web browser.
Wolfram|Alpha Open Code is going to be really useful to a lot of people—not just students. But when I invented it my immediate objective was very much educational: I wanted to be able to give the millions of students who use Wolfram|Alpha every day a taste of the power of code, and what can be achieved if one learns about code and computational thinking.
Computational thinking is a critically important skill for the future. And after 30 years of development we’re at the exciting point with the Wolfram Language of being able to directly teach serious computational thinking to a very wide range of students. I see Wolfram|Alpha Open Code as opening a window into the world of computational thinking for all the students who use Wolfram|Alpha.
There’s no learning curve to climb with Wolfram|Alpha: you just type in your question, directly in natural language. But now with Wolfram|Alpha Open Code you can explicitly see how your question gets interpreted computationally. And as soon as you want to go further, you’re immediately doing computational thinking, and writing code. You’re not doing an abstract coding exercise, or creating code in some toy context. You’re immediately using code to formulate computational ideas and get results about something you’re working on.
Of course, what makes this feasible is the character of the Wolfram Language—and its uniquely high-level knowledge-based nature. Because that’s what allows real computations that you want to do to be expressed in small amounts of code that can readily be understood and modified or extended.
Yes, the Wolfram Language has a definite structure and syntax, based on definite principles. But that’s a lot of what makes it easy to understand and to write. And in a notebook you’re always getting suggestions about what to type—and if your browser language is set to something other than English you’ll often get annotations in that language too. And the code you get from using Wolfram|Alpha Open Code will continually illustrate the core principles of the Wolfram Language.
Over the course of the past year, we’ve introduced two important paths into computational thinking, both supported by Wolfram Programming Lab, and available free in the Wolfram Open Cloud.
The first path is to start from Explorations: small projects created using code, that a student can immediately dive into, and then modify and interact with. The second path is to systematically learn the Wolfram Language, for example using my book An Elementary Introduction to the Wolfram Language.
And now Wolfram|Alpha Open Code provides a third path: start from a question that a student has asked, and then automatically generate custom code that provides a starting point for further work and thinking.
It’s a nice complement to the other two paths—and perhaps it’ll often provide encouragement to pursue one or the other of them. But it’s a perfectly good path all by itself—and students can go a long way following it.
Of course, under the hood, there’s a remarkable amount of sophisticated technology that’s being used. There’s the whole natural-language understanding system of Wolfram|Alpha that’s understanding the original question. There’s the Wolfram|Alpha computational knowledge system that’s formulating what pieces of code to generate. Then there’s the Wolfram Open Cloud, providing an interactive notebook environment on the web capable of running the code. And at the center of all of it is the Wolfram Language, with its whole integrated design and vast built-in capabilities and knowledge.
It’s taken 30 years of development to get to this point. But now we’ve been able to put everything together to create a very powerful path for students to get into computational thinking.
And I have to say that for me it’s exciting to think about kids out there using Wolfram|Alpha just for homework, but then pressing the Open Code button, and suddenly being transported into the world of code and computational thinking—and perhaps starting on a lifelong journey.
I’m thrilled to be able to provide the tools that make this possible. Try it out. Tell us what you think. Share what you do, and show others what’s possible.
]]>Congratulations! Now what? Revisions, of course! And we, the kindly Wolfram Blog Team, are here to get you through your revisions with a little help from the Wolfram Language.
By combining the Wolfram Language’s text analysis tools with the Wolfram Knowledgebase’s collection of public-domain novels by authors like Jane Austen and James Joyce, we’ve come up with a few things to help you reflect on your work and see how you measure up to some of the greats.
Literary scholars have been using computational thinking to explore things like genre and emotion for years. Working with large amounts of data, this research gives us a larger picture of things that we can’t discover through reading individual novels, illuminating patterns within the mass of published novels that we would otherwise know little about.
“That’s all well and good,” you might say, “but what about my great (and scandalously unread) dragon bildungsroman?” Well, you’re in luck! You can apply the principles of computational thinking to your writing as well by using the Wolfram Language to help you revise.
Many writers have things about their writing that they would like to improve. It might be a tendency to overuse adjectives or a penchant for bird metaphors. If you already know your writing tics, it’s easy to find them using your word processor. But what if you don’t know what you’re looking for?
A great way to find unknown writing tics is to use WordCloud, which can help you visualize words’ frequencies and relative importance. We can test this method on Herman Melville’s classic Moby-Dick.
We start by pulling up a list of Melville’s notable books.
Then use WordCloud to create a visualization of word frequency in one of his novels—say, Moby-Dick.
Of the words that appear most often, the titular Moby Dick is a whale, and the narrator reflects frequently on the obsessed ship captain Ahab. But notice something interesting: the word “like” shows up disproportionately—even more than key words such as “ship,” “sea” and “man.” And from inspecting places where he uses the word “like,” we can discover that Melville loves similes:
“like silent sentinels all around the town”
“like leaves upon this shepherd’s head”
“like a grasshopper in a May meadow”
“like a snow hill in the air”
“like a candle moving about in a tomb”
“like a Czar in an ice palace made of frozen sighs”
The similes help bring cosmic grandeur to his epic about a whaling expedition—but it also shows that even the greats aren’t immune to over-reliance on literary devices. As explained in the classic book on writing, The Elements of Style by Strunk and White, similes are handy, but should be used in moderation: “Readers need time to catch their breath; they can’t be expected to compare everything with something else, and no relief in sight.”
Our coworker Kathryn Cramer, a science fiction editor and author with some serious chops, often uses word clouds as an editing tool. She looks at her most frequently used words and asks whether there are any double meanings (what she calls “sinister puns”) that she can develop. She notes that you can also use them to clean up sloppy writing; if she sees too many instances of “then,” she knows that there are too many sentences that use “and then.”
An easy way to find if your word has some double meanings, synonyms, antonyms or any other interesting property is to use the WordData function and see how many different ways you can play with a word like “hand,” for instance.
You can try these techniques out on your own writing using Import. Maybe you could even identify some more writing tics in some of the famous authors whose works are built into the Wolfram Language!
When polishing our prose, many of us often think, “I wish I could write like [insert famous author here].”
For some added fun, in just a few lines of code, we can take a selection of authors—Virginia Woolf, Herman Melville, Frederick Douglass, etc.—and build a ClassifierFunction from their notable books.
Then with a simple FormPage, we built in about half an hour a fun, toy web app called AuthorIdentify that tries to figure out which classic author would most likely have written your text sample.
To test it out, we gave AuthorIdentify the first paragraph of James Joyce’s Finnegans Wake, which was not already in the system. To our delight, it correctly identified the author of the work to be Joyce.
But it’s more fun to let it take a stab at your own work. Our coworker Jeremy Sykes, a publishing assistant here at Wolfram, shared with us a paragraph of his novel, an intergalactic space thriller that combines sci-fi, economics and comedy: Norman Aidlebee, Galactic CPA.
It’s fun trying different samples and playing with them to see if you can get a different author. While far from perfect, our AuthorIdentify is still amusing, even when it’s incredibly wrong.
Feel free to try it out. With some more work, including more text samples, a wider range of authors and playing with the options in Classify, it is simple to build a robust author identification app with the Wolfram Language.
We hope some of these tips and tools help you aspiring writers out there as you sit down and edit your manuscript. We’ve found the Wolfram Language to be an excellent revision buddy, and are constantly on the lookout for new ways to enhance the editing process. So if you have any ideas of your own on how to code with your novel, please feel free to share in the comments!
]]>Computational thinking can be integrated across the curriculum. It is not just the purview of the math teacher or the computer club, but a key instructional tool for educators from all disciplines. For example, using the Wolfram Language to teach computational thinking, English teachers can explore palindromes, history students can explore main concepts from the Gettysburg Address and science teachers can examine dinosaurs’ weights.
How does a busy teacher apply computational thinking in the classroom? Easily: computational thinking provides a framework for learning, which makes concepts easier for students to understand. It incorporates real-world math into students’ everyday lives.
For instance, using the Wolfram Programming Lab during Computer Science Education Week, you can teach students to think computationally about geography. The “Miles around You” starter Exploration will allow your students to see what exists in their vicinity. Students can make a map of the location, then draw a disk of any size around it and zoom in and out to gain perception. How many sites show on the map at a radius of 100 miles? What about 150 miles?
This exercise requires no knowledge of the Wolfram Language. The activity can last as long or as short as the students and teacher desire. Yet it introduces in a relevant way how computational thinking can answer questions. As students advance with their Wolfram Language engagement, they can complete Wolfram challenges on a variety of subjects, from basketball scores to Pig Latin.
Students are often bored with math in school because they do not see the real-world applications of their lessons. Computer-Based Math education lets students use computers at school the same way they would in their everyday lives: with the computers, not the humans, performing rote calculations. Computational thinking helps students discern which calculations the computer needs to solve so they can explore higher concepts. For instance, if your students are basketball fans, they can take a Wolfram challenge to discover how a basketball team can reach a certain score. The many applications of computational thinking make it easy to incorporate math throughout the curriculum.
Computational thinking in the classroom encourages student engagement when students see the results of their efforts. Maybe your students are excited about the upcoming holidays. Why not let them create a unique decoration in the Wolfram Demonstrations Project?
This example and other Wolfram Demonstrations are accessible ways to explore computational thinking at any level of classroom. Once you’ve played with a few interactive examples, contributing your own Demonstration might be a fun and informative way for you and your students to spend an Hour of Code.
The Wolfram Summer Programs are one example of a place where students learn computational thinking through achieving their personal goals. This year at the Wolfram Summer School in Armenia, students developed their original ideas into working prototypes. Prior to the Armenian and other Wolfram camps, students prepare by completing homework assignments. Students can do the same in a flipped classroom, where they experience material before coming to class and arrive at the in-person lesson ready to engage with an activity.
If you’ve flipped your classroom, then computational thinking can be easily integrated into this environment by introducing your students to the Wolfram Language and using it to work with real-world data. An Elementary Introduction to the Wolfram Language Training Series will provide the pre-class materials for your students. Using this video series, you and your students can learn the basics of the Wolfram Language. Maybe you’re teaching an astronomy lesson this week. The Real-World Data video can introduce your students to computational thinking by using the Wolfram Language to explore planets, stars, galaxies and more—perhaps during the Hour of Code.
With computational thinking, students will learn by doing. Allow students to follow their own interests. Let them choose projects that intrigue them or relate to something they are already undertaking in class. Work computational thinking into the syllabus. Computational thinking is part of the learning process, not a single lesson.
Computational thinking can lead students to answer big questions. Are your students interested in public health? Teaming up with each other—and perhaps members of Wolfram Community—they can use the Wolfram Language to model the spread of a global disease outbreak.
Professors can teach computational thinking too. Perhaps you’re a humanities faculty member. Why not flex your own computational thinking by learning to analyze your data with the Wolfram Language? You and your students may be surprised by what you discover.
Here at Wolfram, there are more plans to help educators teach a generation of students computational thinking. For Computer Science Education Week, we will be hosting another Hour of Code event: middle- and high-school students will go on a computation adventure. If you’re unable to join us in person, why not host your own event?
If you’ve decided to have an Hour of Code, perhaps spend your time having students create tweetable programs—code that fits into 140 characters. Or analyze sea level rise, like Anush Mehrabyan did during the 2015 Wolfram High School Summer Camp. Or create a camera-controlled musical instrument. The examples are inspiring; the possibilities are exciting.
Whatever you decide to do with your students, don’t confine computational thinking and Computer-Based Math to Computer Science Education Week or an Hour of Code. Have fun exploring—and please let us know what you and your students create and learn.
]]>
This operation returns a FeatureExtractorFunction, which can be applied to the original data:
As you can see, the examples are transformed into vectors of numeric values. This operation can also be done in one step using FeatureExtraction’s sister function FeatureExtract:
But a FeatureExtractorFunction allows you to process new examples as well:
In the example above, the transformation is very simple: the nominal values are converted using a “one-hot” encoding, but sometimes the transformation can be more complex:
In that case, a vector based on word counts is extracted for the text, another vector is extracted from the color using its RGB values and another vector is constructed using features contained in the DateObject (such as the absolute time, the year, the month, etc.). Finally, these vectors are joined and a dimensionality reduction step is performed (see DimensionReduction).
OK, so what is the purpose of all this? Part of the answer is that numerical spaces are very handy to deal with: one can easily define a distance (e.g. EuclideanDistance) or use classic transformations (Standardize, AffineTransform, etc.), and many machine learning algorithms (such as linear regression or k-means clustering) require numerical vectors as input. In this respect, feature extraction is often a necessary preprocess for classification, clustering, etc. But as you can guess from the example above, FeatureExtraction is more than a mere data format converter: its real goal is to find a meaningful and useful representation of the data, a representation that will be helpful for downstream tasks. This is quite clear when dealing with images; for example, let’s use FeatureExtraction on the following set:
We can then extract features from the first image:
In this case, a vector of length 31 is extracted (a huge data reduction from the 255,600 pixel values in the original image). This extraction is again done in two steps: first, a built-in extractor, specializing in images, is used to extract about 1,000 features from each image. Then a dimensionality reducer is trained from the resulting data to reduce the number of features down to 31. The resulting vectors are much more useful than raw pixel values. For example, let’s say that one wants to find images in the original dataset that are similar to the following query images:
We can try to solve this task using Nearest directly on images:
Some search results make sense, but many seem odd. This is because, by default, Nearest uses a simple distance function based on pixel values, and this is probably why the white unicorn is matched with a white dragon. Now let’s use Nearest again, but in the space of features defined by the extractor function:
This time, the retrieved images seem semantically close to the queries, while their colors can differ a lot. This is a sign that the extractor captures semantic features, and an example of how we can use FeatureExtraction to create useful distances. Another experiment we can do is to further reduce the dimension of the vectors in order to visualize the dataset on a plot:
As you can see, the examples are somewhat semantically grouped (most dragons in the lower right corner, most griffins in the upper right, etc.), which is another sign that semantic features are encoded in these vectors. In a sense, the extractor “understands” the data, and in a sense this is what FeatureExtraction is trying to do.
In the preceding, the “understanding” is mostly due to the first step of the feature extraction process—that is, the use of a built-in feature extractor. This extractor is a byproduct of our effort to develop ImageIdentify. In a nutshell, we took the network trained for ImageIdentify and removed its last layers. The resulting network transforms images into feature vectors encoding high-level concepts. Thanks to the large and diverse dataset (about 10 million images and 10,000 classes) used to train the network, this simple strategy gives a pretty good extractor even for objects that were not in the dataset (such as griffins, centaurs and unicorns). Having such a feature extractor for images is a game-changer in computer vision. For example, if one were to label the above dataset with the classes “unicorn,” “griffin,” etc. and use Classify on the resulting data (as shown here), one would obtain a classifier that correctly classifies about 90% of new images! This is pretty high considering that only eight images per class have been seen during the training. This is not yet a “one-shot learning,” as humans can perform on such tasks, but we are getting there… This result would have been unthinkable in the first versions of Classify, which did not use such an extractor. In a way, this extractor is the visual system of the Wolfram Language. There is still progress to be made, though. For example, this extractor can be greatly enhanced. One of our jobs now is to train other feature extractors in order to boost machine learning performance for all classic data types, such as image, text and sound. I often think of these extractors, and trained models in general, as a new form of built-in knowledge added to the Wolfram Language (along with algorithms and data).
The second step of the reduction, called dimensionality reduction (also sometimes “embedding learning” or “manifold learning”), is the “learned” part of the feature extraction. In the example above, it is probably not the most important step to obtain a useful representation, but it can play a key role for other data types, or when the number of examples is higher (since there is more to learn from). Dimensionality reduction stems from the fact that, in a typical dataset, examples are not uniformly distributed in their original space. Instead, most examples are lying near a lower-dimensional structure (think of it as a manifold). The data examples can in principle be projected on this structure and thus represented with fewer variables than in their original space. Here is an illustration of a two-dimensional dataset reduced to a one-dimensional dataset:
The original data (blue points) is projected onto a uni-dimensional manifold (multi-color curve) that is learned using an autoencoder (see here for more details). The colors indicate the value of the (unique) variable in the reduced space. This procedure can also be applied to more complex datasets, and given enough data and a powerful-enough model, much of the structure of the data can be learned. The representation obtained can then be very useful for downstream tasks, because the data has been “disentangled” (or more loosely again, “understood”). For example, you could train a feature extractor for images that is just as good as our built-in extractor using only dimensionality reduction (this would require a lot of data and computational power, though). Also, reducing the dimension in such a way has other advantages: the resulting dataset is smaller in memory, and the computation time needed to run a downstream application is reduced. This is why we apply this procedure to extract features even in the image case.
We talked about extracting numeric vectors from data in an automatic way, which is the main application of FeatureExtraction, but there is another application: the possibility of creating customized data processing pipelines. Indeed, the second argument can be used to specify named extraction methods, and more generally, named processing methods. For example, let’s train a simple pipeline that imputes missing data and then standardizes it:
We can now use it on new data:
Another classic pipeline, often used in text search systems, consists of segmenting text documents into their words, constructing tf–idf (term frequency–inverse document frequency) vectors and then reducing the dimension of the vectors. Let’s train this pipeline using the sentences of Alice in Wonderland as documents:
The resulting extractor converts each sentence into a numerical vector of size 144 (and a simple distance function in that space could be used to create a search system).
One important thing to mention is that this pipeline creation functionality is not as low-level as one might think; it is a bit automatized. For example, methods such as tf–idf can be applied to more than one data type (in this case, it will work on nominal sequences, but also directly on text). More importantly, methods are only applied to data types they can deal with. For example, in this case the standardization is only performed on the numerical variable (and not on the nominal one):
These properties make it quite handy to define processing pipelines when many data types are present (which is why Classify, Predict, etc. use a similar functionality to perform their automatic processing), and we hope that this will allow users to create custom pipelines in a simple and natural way.
FeatureExtraction is a versatile function. It offers the possibility to control processing pipelines for various machine learning tasks, but also unlocks two new applications: dataset visualization and metric learning for search systems. FeatureExtraction will certainly become a central function in our machine learning ecosystem, but there is still much to do. For example, we are now thinking of generalizing its concept to supervised settings, and there are many interesting cases here: data examples could be labeled by classes or numeric values, or maybe ranking relations between examples (such as “A is closer to B than C”) could be provided instead. Also, FeatureExtraction is another important step in the domain of unsupervised learning, and like ClusterClassify, it enables us to learn something useful about the data—but sometimes we need more than just clusters or an embedding. For example, in order to randomly generate new data examples, predict any variable in the dataset or detect outliers, we need something better: the full probability distribution of the data. It is still a long shot, but we are working to achieve this milestone, probably through a function called LearnDistribution.
To download this post as a CDF, click here. New to CDF? Get your copy for free with this one-time download.
]]>