New in the Wolfram Language: TextCases
August 12, 2015 — Gopal Sarma, Advanced Research Group
The Wolfram Language has had extensive support for string manipulation since Mathematica 5, and in Version 10 it provided uniform symbolic access to a huge repository of computable data via the Wolfram Knowledgebase. Taking advantage of both of these fundamental capabilities, along with new machine learning functionality with Classify and Predict, we’re excited to be making further inroads into the rich domains of natural language processing and text analytics with TextCases, new in Version 10.2.
TextCases, like its sister functions Cases and StringCases, finds instances of patterns in a given input. Whereas Cases operates on Wolfram Language expressions and StringCases on strings, TextCases assumes that the input is human-readable text, from which it can extract known syntactic and semantic entities. These include basic textual types such as words, sentences, and paragraphs, as well as more sophisticated semantic types such as countries, cities, and numbers.
As a simple example, let’s use TextCases to find instances of countries in a sentence:
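A minimal sketch of such a call might look like the following. The sample sentence is invented for illustration, and the type name "Country" is assumed to be the string TextCases uses for countries. Because identification is statistical, the exact matches may vary.

```wolfram
(* Sample sentence invented for illustration *)
sentence = "Last summer we traveled from France to Germany and then on to Italy.";

(* Return the substrings that TextCases identifies as countries *)
TextCases[sentence, "Country"]
```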
Since the Wolfram Knowledgebase includes computable entities for countries, you can have TextCases return the corresponding computable Entity objects rather than the matched strings:
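One way to request entities is the rule form with the "Interpretation" property, sketched below. The sample sentence is again invented, and the property name is an assumption to check against the TextCases documentation for your version.

```wolfram
(* Sample sentence invented for illustration *)
sentence = "Last summer we traveled from France to Germany and then on to Italy.";

(* Ask for the interpretation of each match instead of the matched string; *)
(* for countries this is expected to be a list of Entity objects *)
TextCases[sentence, "Country" -> "Interpretation"]
```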
Let’s now explore a simple natural language processing workflow. Natural language processing often begins with basic segmentation tasks, such as splitting text at word and sentence boundaries. To demonstrate, let’s first use WikipediaData to grab the plain text of an article:
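As a sketch, fetching an article could look like this. The article title is an arbitrary choice for illustration and is not necessarily the one used in the original post. The call requires network access.

```wolfram
(* Article title chosen arbitrarily for illustration *)
text = WikipediaData["Natural language processing"];

(* Preview the first 300 characters of the plain text *)
StringTake[text, 300]
```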
Here are the article’s first three sentences:
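A possible way to get them, assuming "Sentence" is the type name for sentences and using an arbitrarily chosen article, is to extract all sentences and keep the first three with Take:

```wolfram
(* Article title chosen arbitrarily for illustration *)
text = WikipediaData["Natural language processing"];

(* Segment the text into sentences and keep the first three *)
Take[TextCases[text, "Sentence"], 3]
```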
And here are its first 10 words:
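The same approach should work at the word level, assuming "Word" is the corresponding type name. The article title is again an arbitrary choice.

```wolfram
(* Article title chosen arbitrarily for illustration *)
text = WikipediaData["Natural language processing"];

(* Segment the text into words and keep the first ten *)
Take[TextCases[text, "Word"], 10]
```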
Words and sentences are just two of the many different types of textual units that TextCases is able to handle. As a short exercise in exploratory data analysis, let’s find instances of numbers (expressed as digit strings or scientific notation) in the same Wikipedia article:
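A sketch of that query follows, assuming "Number" is the type name for numbers. The article is an arbitrary choice, so the results will differ from those in the original post.

```wolfram
(* Article title chosen arbitrarily for illustration *)
text = WikipediaData["Natural language processing"];

(* Collect every substring that TextCases identifies as a number *)
TextCases[text, "Number"]
```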
TextCases has a convenient syntax for drilling down even further into a body of text. For example, you can use the new symbol Containing to find, say, all sentences that contain numbers.
Here are the first three sentences that Containing found:
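A query of this kind can be sketched by nesting the inner type inside the outer one with Containing. The type names "Sentence" and "Number" are assumptions, and the article title is an arbitrary choice.

```wolfram
(* Article title chosen arbitrarily for illustration *)
text = WikipediaData["Natural language processing"];

(* Find sentences that contain at least one number, then keep the first three *)
Take[TextCases[text, Containing["Sentence", "Number"]], 3]
```

Take fails if fewer than three sentences match, so a shorter article may need a smaller count.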
There are many other uses for TextCases, such as finding email addresses, telephone numbers, or paragraphs of text in a given language. And by combining the symbols Containing and Alternatives, one can create complex queries for in-depth data exploration and data analysis.
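As a rough sketch of such queries, the example below looks for an email address and then for sentences that mention either a city or a country. The sample text is invented, and the type names "EmailAddress", "City", and "Country" are assumptions to verify against the TextCases documentation.

```wolfram
(* Sample text invented for illustration *)
sample = "Write to info@example.com for details. Our offices are in Paris and Tokyo. The weather was pleasant.";

(* Find email addresses *)
TextCases[sample, "EmailAddress"]

(* Find sentences containing a city or a country, *)
(* combining Containing with Alternatives (the | operator) *)
TextCases[sample, Containing["Sentence", "City" | "Country"]]
```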
But TextCases is just the beginning. Stay tuned: many additional useful tools for natural language processing and text analytics will be added to the Wolfram Language soon.
TextCases is supported in Version 10.2 of the Wolfram Language and Mathematica, and is rolling out soon in all other Wolfram products.