Wolfram Computation Meets Knowledge

New in the Wolfram Language: TextCases

The Wolfram Language has had extensive support for string manipulation since Mathematica 5, and in Version 10 it provided uniform symbolic access to a huge repository of computable data via the Wolfram Knowledgebase. Taking advantage of both of these fundamental capabilities, along with new machine learning functionality with Classify and Predict, we’re excited to be making further inroads into the rich domains of natural language processing and text analytics with TextCases, new in Version 10.2.

TextCases, like its sister functions Cases and StringCases, finds instances of patterns in a given input. Whereas Cases operates on Wolfram Language expressions and StringCases on strings, TextCases assumes that the input is human-readable text, from which one can extract known syntactic and semantic entities. These include basic textual types such as words, sentences, and paragraphs, as well as more sophisticated semantic types such as countries, cities, and numbers.

As a simple example, let’s use TextCases to find instances of countries in a sentence:

Finding countries in a sentence using TextCases
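A minimal sketch of such a call, assuming a made-up sample sentence (the text and the variable name `sentence` are illustrative and not taken from the original post):

```wolfram
(* Hypothetical sample sentence, chosen only for illustration *)
sentence = "The delegation flew from France to Japan, with a stop in Brazil.";

(* Return the substrings that TextCases identifies as country names *)
TextCases[sentence, "Country"]
```

Because the recognition is driven by machine learning, the exact matches can vary with the input and the version in use.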

Since the Wolfram Knowledgebase includes computable entities for countries, you can specify that TextCases returns the corresponding computable Entity objects, rather than the matched string:

TextCases returning the corresponding Entity
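As a sketch, the current TextCases documentation describes a `form -> property` syntax, and the `"Interpretation"` property requests the interpreted object rather than the matched string. The sample sentence is again hypothetical:

```wolfram
(* Hypothetical sample sentence, chosen only for illustration *)
sentence = "The delegation flew from France to Japan, with a stop in Brazil.";

(* Ask for the interpreted objects (Entity expressions for countries)
   instead of the matched substrings *)
TextCases[sentence, "Country" -> "Interpretation"]
```

The resulting Entity objects can then be passed to other functions, for example to look up properties of each country.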

Let’s now explore a simple natural language processing workflow. Natural language processing often begins with basic segmentation tasks, such as splitting text at word and sentence boundaries. To demonstrate, let’s first use WikipediaData to grab the plain text of an article:

Using WikipediaData to grab plain text of an article
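A sketch of this step might look as follows. The article title `"Pluto"` is an arbitrary choice for illustration; the post does not state here which article was used.

```wolfram
(* Fetch the plain text of a Wikipedia article; the title is an
   assumption made for this example. Requires network access. *)
text = WikipediaData["Pluto"];

(* Check how many characters were retrieved *)
StringLength[text]
```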

Here are the article’s first three sentences:

Article's first three sentences

And here are its first 10 words:

First ten words
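One way to sketch these two segmentation steps is with the dedicated functions TextSentences and TextWords, each of which accepts a count as its second argument. This is a swapped-in alternative; the original post may have used TextCases for the same purpose. The article title is again an assumption.

```wolfram
(* Article title is an assumption made for this example *)
text = WikipediaData["Pluto"];

(* First three sentences of the article *)
TextSentences[text, 3]

(* First ten words of the article *)
TextWords[text, 10]
```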

Words and sentences are just two of the many different types of textual units that TextCases is able to handle. As a short exercise in exploratory data analysis, let’s find instances of numbers (expressed as digit strings or scientific notation) in the same Wikipedia article:

Instances of numbers in Wikipedia article
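A minimal sketch of this query, assuming the same arbitrarily chosen article as the input text:

```wolfram
(* Article title is an assumption made for this example *)
text = WikipediaData["Pluto"];

(* All substrings recognized as numbers *)
numbers = TextCases[text, "Number"];

(* How many were found, and a look at the first few *)
Length[numbers]
Take[numbers, UpTo[10]]
```

The matches come back as strings; the `"Number" -> "Interpretation"` form could be used instead if numeric values are wanted.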

TextCases has a convenient syntax for drilling down even further into a body of text. For example, you can use the new symbol Containing to find, say, all sentences that contain numbers:

Using Containing to find all sentences that contain numbers

Here are the first three sentences Containing found:

First three sentences Containing found
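The query described above could be sketched like this. It assumes that `"Sentence"` is accepted as the outer type in `Containing[outer, inner]` and that the article title is an arbitrary stand-in; both are assumptions rather than details confirmed by the post.

```wolfram
(* Article title is an assumption made for this example *)
text = WikipediaData["Pluto"];

(* All sentences that contain at least one number *)
TextCases[text, Containing["Sentence", "Number"]]

(* Only the first three such sentences, using the optional count argument *)
TextCases[text, Containing["Sentence", "Number"], 3]
```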

There are many other uses for TextCases, such as finding email addresses, telephone numbers, or paragraphs of text of a given language. And using the symbols Containing and Alternatives, one can create complex queries for in-depth data exploration and data analysis.
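As a rough sketch of combining types, Alternatives (written with `|`) lets a single call match any of several forms. The sample text is invented, and the type name `"EmailAddress"` is assumed from the current list of text content types rather than taken from the post.

```wolfram
(* Hypothetical sample text, chosen only for illustration *)
sample = "Write to info@example.com before travelling from Paris to Canada.";

(* Match anything recognized as either a country or a city *)
TextCases[sample, "Country" | "City"]

(* Match email addresses; the type name is an assumption *)
TextCases[sample, "EmailAddress"]
```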

But TextCases is just the beginning. Stay tuned: many additional useful tools for natural language processing and text analytics will be added to the Wolfram Language soon.

TextCases is supported in Version 10.2 of the Wolfram Language and Mathematica, and is rolling out soon in all other Wolfram products.


Comments

4 comments

  1. Great stuff. What languages other than English are supported?

    • Unfortunately, some formats are not yet supported for every language. Basic formats such as “Words” or “Sentences” should perform correctly in many languages, but semantic types such as “Country” are not expected to work properly outside English in this first version. We are working on better support for other languages, and we will probably introduce a “Language” option for that purpose.

  2. Is this an alternative to regular expressions? An extension?

    • I think the functionality that already existed for Cases (especially StringCases) is the alternative or extension to regular expressions.
      This adds additional “classes” to that structure: classes based on semantic meaning, as gleaned by machine learning.
