New in the Wolfram Language: FindTextualAnswer

Are you ever certain that somewhere in a text or set of texts, the answer to a pressing question is waiting to be found, but you don’t want to take the time to skim through thousands of words to find what you’re looking for? Well, soon the Wolfram Language will provide concise answers to your specific, fact-based questions directed toward an unstructured collection of texts (with a technology very different from that of Wolfram|Alpha, which is based on a carefully curated knowledgebase).

Let’s start with the essence of FindTextualAnswer. This feature, available in the upcoming release of the Wolfram Language, answers questions by quoting the most appropriate excerpts of a text that is presumed to contain the relevant information.

 ✕ FindTextualAnswer["Lake Titicaca is a large, deep lake in the Andes on the border of Bolivia and Peru. By volume of water and by surface area, it is the largest lake in South America", "Where is Titicaca?"]

 ✕ bandArticle = WikipediaData["The Who"]; Snippet[bandArticle]
 ✕ FindTextualAnswer[bandArticle, "Who founded the Who?"]

 ✕ FindTextualAnswer[bandArticle, "Who founded the Who?", 3, {"Probability", "HighlightedSentence"}] // TableForm

 ✕ text = "Even thermometers can't keep up with the plunging temperatures in Russia's remote Yakutia region, which hit minus 88.6 degrees Fahrenheit in some areas Tuesday. In Yakutia - a region of 1 million people about 3,300 miles east of Moscow - students routinely go to school even in minus 40 degrees. But school was cancelled Tuesday throughout the region and police ordered parents to keep their children inside. In the village of Oymyakon, one of the coldest inhabited places on earth, state-owned Russian television showed the mercury falling to the bottom of a thermometer that was only set up to measure down to minus 50 degrees. In 2013, Oymyakon recorded an all-time low of minus 98 Fahrenheit.";
 ✕ questions = {"What is the temperature in Yakutia?", "Name one of the coldest places on earth?", "When was the lowest temperature recorded in Oymyakon?", "Where is Yakutia?", "How many live in Yakutia?", "How far is Yakutia from Moscow?"};
 ✕ Thread[questions -> FindTextualAnswer[text, questions]] // Column

Because FindTextualAnswer is based on statistical methods, asking the same question in different ways can provide different answers:

 ✕ cityArticle = WikipediaData["Brasília"]; Snippet[cityArticle]
 ✕ questions = {"Brasilia was inaugurated when?", "When was Brasilia finally constructed?"}; FindTextualAnswer[cityArticle, questions, 3, {"Probability", "HighlightedSentence"}]

The answers to similar questions found in different pieces of text can be merged and displayed nicely in a WordCloud:

 ✕ WordCloud[ Catenate[FindTextualAnswer[cityArticle, questions, 5, {"String", "Probability"}]], WordSpacings -> {10, 4}, ColorFunction -> "TemperatureMap"]
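
As a minimal sketch of the merging step, identical answer strings can be pooled by summing their probabilities before they are displayed. The answer strings and probabilities below are invented for illustration and are not actual output of FindTextualAnswer, and mergeAnswers is a hypothetical helper name:

```wolfram
(* Invented {string, probability} pairs, shaped like the results for several questions *)
answers = {{"1960", 0.5}, {"April 21, 1960", 0.25}, {"1960", 0.25}};

(* Pool identical strings by summing their probabilities *)
mergeAnswers[pairs_List] := GroupBy[pairs, First -> Last, Total];

merged = mergeAnswers[answers]
```

With this pooling, an answer that comes back for several phrasings of a question weighs more than one that appears only once.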

Any specialized textual knowledge database can be given to FindTextualAnswer. It can be a set of local files, a URL, a textual resource in the Wolfram Data Repository, the result of a TextSearch or a combination of all of these:

 ✕ texts = {File["ExampleData/USConstitution.txt"], WikipediaData["US Constitutional Law"]};
 ✕ FindTextualAnswer[texts, "which crimes are punished in the US Constitution?", 5]

FindTextualAnswer is good, but not perfect. It can occasionally make silly, sometimes funny or inexplicable mistakes. You can see why it is confused here:

 ✕ question = "Who is Raoul?"; context = ResourceData["The Phantom of the Opera"]; FindTextualAnswer[context, question, 1, "HighlightedSentence"] // First

We will keep improving the underlying statistical model in future versions.

Under the Hood…

FindTextualAnswer combines well-established techniques for information retrieval and state-of-the-art deep learning techniques to find answers in a text.

If a significant number of paragraphs is given to FindTextualAnswer, it first selects the ones closest to the question. The distance is based on a term frequency–inverse document frequency (TFIDF) weighting of the matching terms, similar to the following lines of code:

 ✕ corpus = WikipediaData["Rhinoceros"]; passages = TextCases[corpus, "Sentence"];
 ✕ tfidf = FeatureExtraction[passages, "TFIDF"];
 ✕ question = "What are the horns of a rhinoceros made of?";
 ✕ TakeSmallestBy[passages, CosineDistance[tfidf@#, tfidf@question] &, 2]
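
To make the weighting itself more concrete, here is a hand-rolled sketch on three toy passages. It assumes one common TFIDF variant (raw term count times log(N/df)); the exact formula used by FeatureExtraction is not stated in this post, and the passages and the name tfidfVector are made up for the example:

```wolfram
(* Toy passages standing in for the sentences of a longer article *)
docs = {"the rhino has a horn", "the horn is made of keratin", "birds can fly"};
tokens = StringSplit /@ ToLowerCase[docs];

(* Document frequency: number of passages that contain each term *)
df = Counts[Catenate[Union /@ tokens]];

(* Inverse document frequency, assumed variant: Log[N / df] *)
idf = Map[Log[N[Length[docs]]/#] &, df];
vocab = Keys[idf];

(* TFIDF vector: term count times idf; words outside the vocabulary are ignored *)
tfidfVector[words_List] := Table[Count[words, t]*idf[t], {t, vocab}];

query = StringSplit[ToLowerCase["What is the horn made of"]];

(* Passages sorted from closest to farthest from the question *)
ranked = SortBy[docs, CosineDistance[tfidfVector[StringSplit[#]], tfidfVector[query]] &]
```

Terms shared by many passages (such as "the") get a small weight, so the ranking is driven by the rarer matching terms such as "made" and "of".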

The TFIDF-based selection allows us to discard a good number of irrelevant passages and to spend the more expensive computations on locating the answer(s) more precisely within a subset of candidate paragraphs:

 ✕ FindTextualAnswer[corpus, question, 2, "HighlightedSentence"] // Column

This finer detection of the answer is done by a deep artificial neural network inspired by the cutting-edge deep learning techniques for question answering.

The neural network at the core of FindTextualAnswer was constructed, trained and deployed using the Wolfram neural network capabilities, primarily NetGraph, NetTrain and NetModel. The network is shown in the following directed graph of layers:

 ✕ net = NetModel["Wolfram FindTextualAnswer Net for WL 11.3"]

This network was first developed using the Stanford Question Answering Dataset (SQuAD) before using similarly labeled data from various domains and textual sources of knowledge, including the knowledgebase used to power Wolfram|Alpha. Each training sample is a tuple with a paragraph of text, a question and the position of the answer in the paragraph. The current neural network takes as input a sequence of tokens, where each token can be a word, a punctuation mark or any symbol in the text. As the network was trained to output a unique span, the positions of the answers are given as start and end indices of these tokens, as in the tokenized version of the SQuAD dataset in the Wolfram Data Repository. A single training sample is shown here:

 ✕ ResourceData["SQuAD v1.1 Tokens Generated with WL", "TrainingData"][[All, 14478]]
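
The structure of such a sample can be sketched as follows. The key names "AnswerStart" and "AnswerEnd" are assumptions chosen for illustration and may differ from the keys actually used in the dataset:

```wolfram
(* Hypothetical sample layout: tokenized paragraph, tokenized question, answer span *)
sample = <|
   "Context" -> {"Elephants", "have", "a", "grey", "or", "white", "skin", "."},
   "Question" -> {"What", "colour", "are", "elephants", "?"},
   "AnswerStart" -> 4,
   "AnswerEnd" -> 6
   |>;

(* Recover the answer text from the start and end token indices *)
answerTokens = Take[sample["Context"], {sample["AnswerStart"], sample["AnswerEnd"]}];
answer = StringRiffle[answerTokens, " "]
```

Because the answer is stored as a pair of token indices, it is always a contiguous span of the paragraph.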

Several types of questions and answers are used to train; these can be classified as follows for the SQuAD dataset:

The following chart shows the different components of the network and their roles in understanding the input text in light of the question:

The first part encodes all the words of the context and the question in a semantic space. It mainly involves two deep learning goodies: (1) word embeddings that map each word into a semantic vector space, independently of the other words in the text; and (2) a bidirectional recurrent layer to get the semantics of the words in context.

The embeddings already capture a lot about the semantics—putting together synonyms and similar concepts—as illustrated below using FeatureSpacePlot to show the computed semantic relationships among fruits, animals and colors.

 ✕ animals = {"Alligator", "Bear", "Bird", "Bee", "Camel", "Zebra", "Crocodile", "Rhinoceros", "Giraffe", "Dolphin", "Duck", "Eagle", "Elephant", "Fish", "Fly"};
 ✕ colors = {"Blue", "White", "Yellow", "Purple", "Red", "Black", "Green", "Grey"};
 ✕ fruits = {"Apple", "Apricot", "Avocado", "Banana", "Blackberry", "Cherry", "Coconut", "Cranberry", "Grape", "Mango", "Melon", "Papaya", "Peach", "Pineapple", "Raspberry", "Strawberry", "Fig"};
 ✕ FeatureSpacePlot[Join[animals, colors, fruits], FeatureExtractor -> NetModel["GloVe 300-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data"]]

Word embeddings have been a key ingredient in natural language processing since 2013. Several embeddings are available in the Wolfram Neural Net Repository. The current model in FindTextualAnswer is primarily based on GloVe 300-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data.

A second part of the neural network produces a higher-level representation that takes into account the semantic matching between different passages of the text and the question. This part uses yet another powerful deep learning ingredient, called attention, that is particularly suited for natural language processing and the processing of sequences in general. The attention mechanism assigns weights to all words and uses them to compute a weighted representation. Like most of the state-of-the-art models of question answering, the neural network of FindTextualAnswer uses a two-way attention mechanism. The words of the question focus attention on the passage and the words of the passage focus attention on the question, meaning that the network exploits both a question-aware representation of the text and a context-aware representation of the question. This is similar to what you would do when answering a question about a text: first you read the question, then read the text with the question in mind (and possibly reinterpret the question), then focus on the relevant pieces of information in the text.

Let’s illustrate how encoding and attention work on a simple input example:

 ✕ question = "What colour are elephants?"; context = "Elephants have a grey or white skin.";

The network is fed with the list of tokens from the context and the question:

 ✕ getTokens = StringSplit[#, {WhitespaceCharacter, x : PunctuationCharacter :> x}] &; input = <|"Context" -> getTokens@context, "Question" -> getTokens@question, "WordMatch" -> Join[{{0, 1, 1}}, ConstantArray[0, {7, 3}]]|>

Note that this input includes a "WordMatch" array that indicates, for each word of the context, whether it occurs in the question in some form. For instance, here the word “Elephants” is matched if we ignore the case. The goal of this tailored feature is to cope with out-of-vocabulary words, i.e. words that are not in the word embeddings’ dictionary (their embedding will be a vector full of zeros).
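
The post does not spell out what each of the three match components means, so the following is only a sketch of how two plausible flags (exact match and case-insensitive match) could be computed. The names exactMatch and caseInsensitiveMatch are hypothetical:

```wolfram
contextTokens = {"Elephants", "have", "a", "grey", "or", "white", "skin", "."};
questionTokens = {"What", "colour", "are", "elephants", "?"};

(* 1 if the context token appears verbatim in the question, else 0 *)
exactMatch = Boole[MemberQ[questionTokens, #]] & /@ contextTokens;

(* 1 if it appears once letter case is ignored, else 0 *)
caseInsensitiveMatch =
  Boole[MemberQ[ToLowerCase[questionTokens], ToLowerCase[#]]] & /@ contextTokens;

(* One row of flags per context token *)
Transpose[{exactMatch, caseInsensitiveMatch}]
```

Only the first token, "Elephants", gets a nonzero flag here, and only when case is ignored, which is in line with the first row of the "WordMatch" input above.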

The encodings of the text and the question are computed by two subparts of the full network. These intermediate representations can be extracted as follows:

 ✕ questionEncoded = net[input, NetPort["encode_question", "Output"]];
 ✕ contextEncoded = net[input, NetPort["encode_context", "Output"]];

Each encoding consists of one vector per word, and is therefore a sequence of vectors for a full text or question. These sequences of numbers are hardly interpretable per se, and would just be perceived as noise by an average human being. Yes, artificial neural networks are kind of black boxes.

The attention mechanism is based on a similarity matrix that is just the outer dot product of these two representations:

 ✕ outerProduct = Outer[Dot, questionEncoded, contextEncoded, 1];

This similarity matrix is normalized using a SoftmaxLayer. Each word of the question focuses attention on the text, with a row of weights that sum to 1.

Each word of the text also focuses attention on the question with a set of weights that are this time obtained by normalizing the columns:
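
Both normalizations can be sketched on toy encodings. The vectors below are invented and much smaller than the real encodings, and the softmax is written out by hand rather than with SoftmaxLayer so that the example does not depend on the neural network framework:

```wolfram
(* Toy encodings: 2 question tokens and 3 context tokens, 2 dimensions each *)
questionEnc = {{1., 0.}, {0., 1.}};
contextEnc = {{1., 2.}, {3., 4.}, {5., 6.}};

(* Similarity matrix: one row per question token, one column per context token *)
similarity = Outer[Dot, questionEnc, contextEnc, 1];

(* Softmax of a vector; subtracting the maximum keeps Exp from overflowing *)
softmax[v_List] := With[{e = Exp[v - Max[v]]}, e/Total[e]];

(* Question-to-context attention: every row sums to 1 *)
questionToContext = softmax /@ similarity;

(* Context-to-question attention: every column sums to 1 *)
contextToQuestion = Transpose[softmax /@ Transpose[similarity]];
```

The same similarity matrix therefore yields two different sets of weights, depending on whether it is normalized along its rows or along its columns.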


Finally, the network builds upon the joint context-question representation, again with recurrent layers aggregating evidence to produce a higher-level internal representation. A last part of the network then assigns probabilities to each possible selection of text. The outputs of the network are two probability distributions over the positions of, respectively, the start and the end of the answer:

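 ✕ netOutput = net[input]; probas = Flatten /@ KeyTake[netOutput, {"Start", "End"}];
 ✕ ticksContext = Transpose[{Range[Length[input["Context"]]], input["Context"]}]; (* assumed definition, not given in the post: label each position with its token *)
 ✕ ListPlot[probas, FrameTicks -> {ticksContext, Automatic}, Filling -> Axis, Joined -> True, PlotTheme -> "Web", PlotStyle -> {Blue, Red}, PlotRange -> {0, 1}]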

The most probable answer spans are then chosen using a beam search.
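
The search used inside FindTextualAnswer is not detailed here, so the following is only a simplified stand-in for it. It assumes that a span is scored by the product of its start and end probabilities, that a span may not end before it starts, and that only the few most probable start and end positions are kept. The probabilities are invented:

```wolfram
contextTokens = {"Elephants", "have", "a", "grey", "or", "white", "skin", "."};
startP = {0.04, 0.05, 0.10, 0.60, 0.06, 0.11, 0.02, 0.02};
endP = {0.02, 0.03, 0.04, 0.30, 0.06, 0.50, 0.03, 0.02};

(* Keep only the most probable start and end positions *)
beamWidth = 3;
topStarts = Ordering[startP, -beamWidth];
topEnds = Ordering[endP, -beamWidth];

(* Valid candidate spans: the start may not come after the end *)
candidates = Select[Tuples[{topStarts, topEnds}], First[#] <= Last[#] &];

(* Assumed score: product of the start and end probabilities *)
spanScore[{s_, e_}] := startP[[s]]*endP[[e]];
bestSpans = TakeLargestBy[candidates, spanScore, 2];

StringRiffle[Take[contextTokens, #], " "] & /@ bestSpans
```

Restricting the candidates to the top few positions avoids scoring every possible pair of start and end indices, which matters for long paragraphs.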

These posterior probabilities are based on the assumptions that: (1) the answer is in the context; and (2) there is a unique answer. Therefore, they are not suitable for estimating the probability that the answer is right. This probability is computed differently, using a logistic regression on a few intermediate activations of the network at the start and end positions. These activations are accessible through two output ports of the network, named "StartActivation" and "EndActivation":

 ✕ {startActivations, endActivations} = netOutput /@ {"StartActivation", "EndActivation"};

Logistic regression can be expressed as a shallow neural network with just one linear layer and a LogisticSigmoid function:

 ✕ scorer = NetModel["Wolfram FindTextualAnswer Scorer Net for WL 11.3"]
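
The equivalence between logistic regression and such a shallow network can be checked on a small example. The weights, bias and feature values below are made up and have nothing to do with the trained scorer:

```wolfram
(* Invented parameters and input features *)
weights = {0.8, -0.4, 0.3, 0.1};
bias = -0.2;
features = {1.0, 2.0, -1.0, 0.5};

(* Logistic regression written out directly *)
manualProbability = LogisticSigmoid[weights . features + bias];

(* The same model as a one-layer network followed by a sigmoid *)
logisticNet = NetChain[
   {LinearLayer[1, "Weights" -> {weights}, "Biases" -> {bias}],
    ElementwiseLayer[LogisticSigmoid]},
   "Input" -> 4];

netProbability = First[logisticNet[features]];
```

The two values agree up to the single-precision arithmetic used by the neural network framework.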

In the current example, the positions of the answers “grey,” “white” and “grey or white” are given by:

 ✕ positions = <|"grey or white" -> {4, 6}, "grey" -> {4, 4}, "white" -> {6, 6}|>;

Then their probabilities can be obtained by accessing the intermediate activations at these positions and applying the logistic regression model:

 ✕ Map[scorer@ Join[startActivations[[First@#]], endActivations[[Last@#]]] &, positions]

Now look at how the network takes into account some additional nuance in the input statement. With the word “sometimes,” the probability of the subsequent word “white” drops:

 ✕ context2 = "Elephants have a grey or sometimes white skin.";

So Try It Out!

FindTextualAnswer is a promising achievement of deep learning in the Wolfram Language that mines knowledge in unstructured texts written in natural language. The approach is complementary to the principle of Wolfram|Alpha, which consists of querying a structured knowledge database that is carefully curated, updated and tuned with a unique magical sauce. FindTextualAnswer is different, and enables you to use any personal or specialized unstructured text source. It can, for example, search for the answer to a question of yours in a long history of emails.

If you’d like to work with the code you read here today, you can download this post as a Wolfram Notebook.

Posted in: Developer Insights

 This looks like a fantastic tool for NLP, and I can’t wait to try it out when it’s released for (I assume) either 11.3 or 12.0. You mentioned that the word embedding model in FindTextualAnswer is based on one of the GloVe datasets – will it have the ability to take in other word embedding models (such as ones trained by the user), or will it be limited just to the default? Posted by David Freiberg    February 25, 2018 at 3:19 pm
 Thanks for your comment! FindTextualAnswer is based on a recurrent neural net whose first layer is GloVe word embedding. This net was trained on some data, with these embeddings frozen as a first step. If one changes these embeddings, then it “invalidates” all the upcoming layers in the network. It means that the full network would have to be retrained to have a chance to give meaningful results, based on a different word embedding. Unless the word embedding trained by the user would be trained to interpolate the GloVe embeddings (it could make sense, to overcome out-of-vocabulary words by using a character-based embedding)… But usually, word embeddings are trained in an unsupervised way, as an intermediate representation that is useful to solve a prediction task. Each embedding component has no particular meaning. There is no reason to have one-to-one correspondence between two different embeddings. So what should be considered is the ability to change the recurrent neural network in FindTextualAnswer. This can be done with the hidden option “Net”. And the suggestion to use another word embedding is a good one. We are actually working on training neural networks based on ELMo contextual embeddings, that were pre-trained by the Allen Institute and that provide significant improvement for NLP tasks in general. Posted by Wolfram Blog    February 27, 2018 at 2:51 pm
 I hadn’t realized that the specific word embedding used to train FindTextualAnswer was so fundamentally integrated into its structure, but it makes sense; I’ve made various word embedding models for some more specialized lexicons and was hoping there was some way to retrain the NetGraph model FindTextualAnswer uses to ask questions from corpora with those words, but from what you’re saying that’s probably still years away from being something an average user can do. Using ELMo contextual embedding on future releases is exciting stuff, though, and being able to change the recurrent neural network is as well; I still can’t wait to use these functions. Posted by David Freiberg    March 1, 2018 at 10:32 am
 Hi, I am not able to get any response for FindTextualAnswer[], can you let me know if this is available only for desktop version ? Regards, Pradeep Posted by Pradeep Ankem    March 27, 2018 at 9:51 am
 Hello Pradeep: Thank you for reaching out. First, I want to make sure that you are running the current version of Mathematica (11.3). If you are using any previous version, FindTextualAnswer will not be available. If in fact you are using version 11.3, it’s impossible to know what can be wrong without having any details on which product you’re running. If you could send more details on the command you’re trying to run, and whether you’ve been able to use the command elsewhere (cloud, etc.), that would allow us to provide the help you are looking for. Thanks again for taking the time to reach out! Best regards, Wolfram Research Posted by Wolfram Blog    March 27, 2018 at 2:47 pm