# Behind Wolfram|Alpha’s Mathematical Induction-Based Proof Generator

July 14, 2016 — Connor Flood, Consultant, Wolfram|Alpha Math Content

An idea, some initiative, and great resources allowed me to design and create the world’s first online syntax-free proof generator using induction, which recently went live on Wolfram|Alpha.

#### Motivation

It is no secret that Wolfram|Alpha is a well-used tool for math students, especially those taking first-year college and university math courses. In fact, one of the most common responses I receive from students when I explain my role with Wolfram|Alpha is, “OMG, that site saved me during calculus.” This is usually a reference to the differential equation solver, or derivative and integral features of Wolfram|Alpha.

A majority of math topics that are taught at the high-school and first-year level are computation based, similar to the features listed above. This means questions on tests and exams are often in the following form:

- Solve for
*x* - Find the derivative of
*f*(*x*) - Determine the roots of the following equation

Computation questions typically involve performing a series of calculations (i.e. steps) in order to obtain a final result. Topics that can be solved computationally can often be done so using repetitive approaches. Once a student has mastered the limited steps and rules associated with one of these topics, they should be able to handle just about any problem that is thrown at them on an assignment or an exam related to that topic. This is because computation-based questions are generally consistent in the approaches used to solve them. Similar to learning how to calculate derivatives of functions, once a student has gained experience with all the necessary “derivative rules,” answering these questions is a matter of repeatedly applying those rules.

The problem, though, is what about topics that are *not* computation based? More specifically, how do students study and practice questions for math topics that don’t have a limited set of rules or approaches? I encountered this problem as a student in a first-year discrete math course. The topic being taught was proofs by principle of mathematical induction, which I will give a brief background on. This topic was completely different than anything I had previously been taught in math, for a couple reasons:

- Proofs are not computation questions. The same set of rules can’t be applied to every question.
- Proofs don’t have a single final answer.

It’s weird to think of a math problem that doesn’t have a nice, simple, clean answer at the bottom of the page. If a student was solving an equation on a test, they might use five lines to show their steps, and then at the very end they would likely have the final answer written in the form “*x* = 21.” Mathematical proofs, however, don’t work that way. In fact, the “answer” actually is the steps. The goal of a proof is to logically show why something is true.

Consider if you were playing the game Clue (a.k.a. Cluedo), a murder mystery board game. The goal of the game is to determine who the murderer is, the weapon used, and the location of the crime. Someone would therefore win by stating something along the lines of, “Colonel Mustard with the revolver in the dining room.” Now imagine instead of trying to figure out this information, you were told who the murderer was, the weapon used, and the location of the crime at the very beginning of the game. The goal now is then not to determine the conclusion (since you were already told it) but instead figure out *why* this is true. Based on your own cards and conclusions made from information being revealed throughout the game, you should eventually be able to prove using logic why the information you were told at the beginning of the game must be true. Therefore, you might win the game by stating something along the lines of, “Since I have all of the weapon cards besides the revolver, and Joe has all the location cards except the dining room, and I know neither of you have Colonel Mustard based on your cards played, I have proved the proposed murderer to be true.”

I am not at all suggesting that the above example would be a fun alteration to Clue; in fact, it would likely kill the fun of the game (bad joke). However, it demonstrates the type of question/answer format that proofs represent. Below is a sample induction proof question a first-year student might see on an exam:

Prove using mathematical induction that 8^*n* – 3^*n* is divisible by 5, for *n* > 0.

The assertion made, that 8^*n* – 3^*n* is divisible by 5 when *n* is greater than 0, is completely true (assuming *n* is an integer). However, the question is asking the student to logically show why the statement is correct. Just like in that alternate Clue example, you were told the information at the beginning of the game and then asked to prove why it’s true. Writing these proofs is not trivial for most students, and is often viewed as one of the more difficult math concepts to master.

Due to the complexity of these questions, students often rely on online resources for further help (e.g. Wolfram|Alpha). If students had a differential equation they were solving and wanted to verify their answer, they can easily input it into Wolfram|Alpha and be given the final answer as well as the step-by-step solution. This approach can also be taken for derivatives, integrals, simplifying expressions, etc.

… but what about proofs?

We’ve established that proofs are not considered to be computation questions. As a result, to my understanding, no calculator or online tool has been able to generate solutions for proof questions… until now.

I vividly remember in my first year wanting to verify my solutions for proofs on an assignment. I had written out what I thought were the correct solutions to the proof questions. But how could I check if my answers were right? Surprisingly, there were only two ways to verify the correctness of my proofs:

- Find the exact same proof as an example online (which was very rare)
- Compare my solutions with someone else in the class, and assume if we both had the same solution that we must both be right (bad assumption)

So why is it so easy to find a “derivative calculator” online, but not a “proof calculator”?

The answer is mainly due to the fact that proofs have generally not been considered computable. Since the same set of rules can’t be applied to cover 100% of proofs, a computer has difficulty creating the logical steps of which the proof is composed. My solution to this problem was to build a library of pattern-matched proofs (as explained further below). This approach demonstrated the existence of computation playing a role in constructing proofs, and led to the creation of a working “proof generator.”

#### Prototype

Over a summer I began building a prototype application, which was able to perform a limited number of proofs using the principle of mathematical induction (PMI). PMI is a common tactic for constructing proofs. The logic for proving a query true via PMI is often compared to proving that a line of dominoes will all fall down in succession. First, you want to prove that the first domino will fall over if you push it. Second, you want to prove that if any given domino falls over, the one following it will also fall over. By proving these two facts, you are able to show that a line of dominoes will fall over if the first one is pushed.

This domino example well describes the logic behind induction proofs, which are composed of:

- Base case (first domino): Prove the statement is true for the base case (the smallest
*n*value), i.e. if you were proving something was true for*n*≥ 1, the base case is showing the statement is true for*n*= 1. - Inductive step (consecutive dominoes): This is the more challenging step. In the inductive step, you assume the statement is true for some value (i.e.
*k*), and then try to show this is also true for*k*+ 1.

If both of these steps are done correctly, then the proof is complete.

Before development, I had discussed building this application with a professor at my undergraduate university (Algoma University). He compared this challenging idea to building a recipe book that contains every recipe in the world. You can keep adding as many recipes as you can get your hands on, but it is essentially impossible to include every recipe ever made. Similarly, I can keep adding more coverage for proofs to be generated, but I will never be able to include every proof that a user will ask for.

While this was a rather notable limitation, I thought about it in a slightly different sense. Instead I considered that if someone new to cooking wanted a cookbook, it might be a fairly large cookbook. There’s a wide range of recipes that they would want to try and make. At the same time, they would not likely be relying on a recipe for Gordon Ramsay’s Beef Wellington to be included in the book. The level of difficulty for preparing that dish is likely too high for someone who is brand-new to cooking. As for someone who is new to learning proofs (first-year university level), they likely won’t need access to a resource that generates very advanced proofs. As long as their resource can just cover queries used at the first-year level, then in their eyes the resource has terrific coverage. This was an important distinction made in the planning stage.

During the development of the prototype, I quickly realized there was a rather large challenge right out of the gate. Many of the proofs I wanted the application to cover involved simplifying expressions, which can be a very complex program on its own to write. It didn’t make sense to reinvent the wheel, and this was when I first discovered Wolfram|Alpha’s APIs. Using its APIs, I could simply make calls directly to Wolfram|Alpha (i.e. simplify an expression), and return the result back to my application. I had a few specific questions about the usage of the APIs, and so I decided to just contact Wolfram|Alpha directly with my questions.

The helpful person at the other end answered my questions quickly, and then the conversation took a direction I didn’t expect. After I explained the little side project I was building, I was asked to demonstrate the application to a few different members at Wolfram|Alpha (after signing a nondisclosure agreement, of course). At this point the application was fairly limited in terms of the number of proofs it could handle, but it performed well as a prototype or proof of concept. The demonstration went well, but I was once again surprised at the direction the meeting took. It was just a couple weeks after the demonstration that I was offered—and then accepted—a position with Wolfram|Alpha as a Math Content Developer.

#### How the app works

I moved to the Wolfram|Alpha HQ in Champaign, Illinois, the following summer. There, I was able to work on several projects related to new math features being added to the site. However, my main project was to reimplement my induction proof program to become part of Wolfram|Alpha.

The goal of the project was that a student would be able to input any induction proof question they had in a first-year course. For that to become a reality, I scoured the internet and several textbooks to search for all of the induction proof problems I could find… there were a lot.

To be clear, this project was *not* a database of all induction questions I could find. For the program to work consistently for all proofs (even ones it hadn’t seen before), I needed to first generalize the types of induction proofs that students are being taught. The decision was made early on that the program would handle three main types of queries for proofs:

- Summation/product equalities, e.g. prove 1 + 2 + 3 +… +
*n*=*n*(*n*+ 1)/2, for*n*> 0 - Expression divisibility, e.g. prove 8^
*n*– 3^*n*is divisible by 5, for*n*> 0 - Expression inequalities, e.g. prove 2^
*n*>*n*^3, for*n*> 9

Using these three main types of proof queries, I then further broke down each of these query types into subsets of more specific types of proofs. Below is an example of what I mean by that.

Let’s say we want to write a proof for the following:

Prove 8^*n* – 3^*n* is divisible by 5, for *n* > 0

Instead of “hard-coding” the full solution of this proof into the application, I want to generalize it as much as I can so that it can possibly cover even more proofs. Therefore, I don’t add very specific proofs to the application, but instead add pattern-matched proofs.

In the expression 8^*n* – 3^*n*, I really don’t care all that much about the 8 and the 3. The reason I don’t care is because even if those numbers change, it’s not going to change the structure of the proof being generated. That means if a user instead asked to prove that 10^*n* – 3^*n* is divisible by 7, the exact same steps would be followed for the 8^*n* – 3^*n* example, just with different values. In order to take advantage of this, I would therefore write a proof solution for this type of query:

Prove *a*^*n* – *b*^*n* is divisible by *d*, for n > *x*

By pattern matching the user’s input, if it is detected that their query is in this form (i.e. follows this pattern), then their proof will be generated according to this structure. The variables (*a*, *b*, *n*, *d*, *x*) are then extracted in order to provide the values for the generated proof.

This concept is then applied to many different structures for proof queries. The pattern-matching approach seemed efficient in quickly covering most of the example proof questions I came across online and in textbooks. This approach is what allows the application to generate proofs for queries it hasn’t seen before, so long as it matches the pattern of one of the structures the app is aware of (of which there are many).

However, specific pattern matching is really only used for the expression divisibility and expression inequalities query types. For the summation/product equalities, the application is theoretically able to handle 100% of these queries. There is a specific mathematical approach for summation/product equalities that can be followed, regardless of the right-hand expression of the equality or the contents of the summation or product. This approach works both for proofs and disproofs of summation equalities. This is a nice feature, since it ensures complete coverage of induction proofs for this query type, so long as Wolfram|Alpha doesn’t time out due to large input.

For expression divisibility, most generated proofs are based solely off pattern matching the input. There is, however, a somewhat general-case algorithm used as a last-ditch effort, in case the given expression was not successfully pattern matched. The algorithm performs a certain manipulation on the expression in an attempt to put the expression into a specific form. This form is simply the addition of two terms, where one term is the inductive hypothesis and the other term contains a coefficient that has the given divisor as a factor. If this manipulation is successful, then a proof can be generated. Otherwise, it will fail.

As for expression inequalities, all the generated proofs are pattern matched since I am unaware of any general-case algorithm to apply toward expression inequality proofs. It makes more sense to pattern match the difference between the two expressions. Not only is this beneficial for inequality manipulation but this is also a crucial step in case any terms need to be eliminated. For example, here is a possible query:

Prove 2*n* + 7 + sin(*n*) < (*n* +7)^2 + sin(*n*), for *n* > 0

The proof generated here would ideally be pattern matched for the following:

Prove *a* *n* + c < (*n* + *b*)^*d*, for *n* > *g*

The problem is that the pattern-matched expression doesn’t include a sin(*n*) on each side of the inequality. However, if the difference between each side of the expression (e.g. left-hand side – right-hand side) is taken, the new query would just be 2*n* + 7 – (*n*+7)^2, which would be pattern matched against *a* *n* + *c* – (*n*+*b*)^*d*, which matches correctly.

It is also important to note that some of these variables can sometimes be optional (i.e. equal to 0 or 1), such as the constants, coefficients, and exponents.

Once testing for the application began, I noticed there were two types of queries that were consistently generating incorrect proofs. While the goal is to include coverage for as many proofs as possible, it is much, much, much worse to generate a proof that is mathematically incorrect as opposed to not generating a proof at all. As soon as the application starts producing invalid content, the quality and reliability of the application goes down the drain.

The first type of query that generated invalid proofs can be demonstrated with the following example:

Prove 4*n* + 7 > 2^*n*, for *n* ≥ 0

In this example, the base case (*n* = 0) correctly passes (since 7 > 1). However, the inductive step *should* fail because the actual claim being made is false (i.e. when n = 5, since 27 ≱ 32). But for whatever reason, the app attempted to generate an inductive step to make the proof valid. This resulted in faulty mathematics, and needed to be eliminated. Proofs using induction are terrific for verifying claims, but they are not so great for disproving claims. The decision was made then to not generate proofs (or “disproofs”) for expression inequality queries where the base case is valid but the given query is false.

The other type of query that generated invalid proofs can be explained using this example:

Prove 4*n* < 2^*n*, for *n* ≥ 0

This is actually a fairly tricky example. The graph below illustrates the comparison of these expressions:

The base case of this example is *n* = 0, which results in 4(0) < 2(0), which simplifies to 0 < 1, which is true. Since the base case passes, the inductive step can proceed. Like the previous example, this query also generated an invalid inductive step. The problem is that many of the manipulations that are done in the inductive step may be assuming a certain range of *k*. For example, in the above proof, one of the steps relies on the fact 3 < 2^*k*, for *k* ≥ 2. However, the range of *k* values between the base case and *k* ≥ 2, where *n* = *k*, has not been proven to satisfy the given claim of

“Prove 4*n* < 2^*n*”. Therefore, the base case must now also include *n* = 1 and *n* = 2 in the base case coverage. Note that a “modified proof” would need to suggest that *n* > 4 since the case that *n* = 4 is not true for 4*n* < 2^*n*.

If a manipulation is made in the inductive step that requires *n* ≥ 2 but the base case is *n* = 0, then we haven’t proven the query to be true for *n* = 1 or *n* = 2. This is very similar to the false paradox “all horses are the same color.” The solution to this problem was not to get rid of the proof altogether. Instead, the base case step would test multiple values of *n* if any assumption was to be made in the inductive step. Therefore, the query 4*n* < 2^*n* would fail in the base case because it is not true for

*n* = 1.

The final challenge with this project involved the natural language processing (NLP). This allows a user to ask a question in plain English and have Wolfram|Alpha understand what they’re looking for. While I had no applicable experience with NLP before, the existing parsing/scanning framework on Wolfram|Alpha made it fairly easy to integrate the NLP component with the application. The main challenge was trying to determine all the possible ways a user can ask Wolfram|Alpha to prove a query using induction. This continues to be an ongoing development, as different phrasings for queries are still being added.

#### Using the app

Here’s a screen shot of the application on Wolfram|Alpha:

By clicking on the “Step-by-step solution” button (a Wolfram|Alpha Pro feature), you will be able to see the generated proof. All users, however, will be able to see the above content. Additional information has been added to the application in order to provide important details to the user. A feature added to this result page is that a different divisor is suggested when the proof is invalid.

Below is a screen shot of the generated proof with the “modified divisor”:

For the multiple-base case example discussed, two proofs were generated for this result—one that shows the initial proof (invalid), and then a second proof that recommends a different base case in order to make the proof valid:

One final, cool feature is that an invalid summation/product equality proof will be corrected with a new recommended expression:

#### Aftermath

The application went live on Wolfram|Alpha on October 15, 2015 (I’ll never forget it). This was approximately fifteen months after I began designing the prototype program. This date was also six days before I presented the application at the Wolfram Technology Conference on my twenty-first birthday. The success and feedback from this project has been more than I ever could have imagined.

By taking an idea and some initiative, I landed a position with the world’s largest computational knowledge engine (where I’ve been an employee for more than a year now). This initiative also played a key role in being accepted as a graduate student in the University of Waterloo’s symbolic computation department.

While the goal of the project was to generate induction proofs, this whole process was a terrific proof for the rewards that can result from imagination, originality, and hard work.

## 18 Comments

Is it supposed to handle cases like this, or is this example simply beyond the scope of the algorithm:

prove by induction sum(floor(i/2),i=1..n)=floor(n/2)ceiling(n/2)

Right now it erroneously says “not always true for n > 0 and n in Z”.

Hello,

Thank you for pointing that out. It is within the scope of the algorithm, and is actually a bug which occurs in the second-last section of steps. For some reason, it is incorrectly stating that ceiling(k/2) floor(k/2)+floor((1+k)/2) != ceiling((k+1)/2) floor((k+1)/2). If this had correctly been interpreted as an equality, then the generated proof would verify the claim.

I will fix this shortly!

Fantastic Connor! I remember your presentation at WTC last year very well. I was the guy in the right front row who didn’t stop clapping.

Do you think it would be possible to build up a more extensive library of pattern-matched proofs, not so they can possibly hope to cover all known proofs which is impossible, but rather so that it can act as a computational aid to constructing most types of mathematical proofs? Something along the lines of the collection of proofs at ProofWiki at https://proofwiki.org/wiki/Main_Page

Michael Kelly

Thanks Michael!

Yes, I absolutely agree. There are some proofs that I think will always need to be hard-coded, for example proving there’s an infinite number of primes, since pattern-matching wouldn’t be beneficial here. But if we generalized the structure for different approaches used for many of these proofs (besides induction), then pattern-matching could definitely be helpful from a computational stand-point in constructing a wider breadth of proofs.

I didn’t mention it in the blog post, but in my talk I think I discussed the possibility of A.I. playing a role in generating proofs. In the case of induction for example (where the base case is considered trivial): if an agent was aware of a start state (inductive hypothesis), a goal state (conclusion), and its types of operations (expression manipulations), it’s possible the agent could determine the necessary operations to create a logical path between the start state and goal state.

Great read. Very cool work, well done sir!

Thanks Brandon!

This is really cool work. Congratulations on the project and the job.

One point of clarification, you say a proof wouldn’t be significantly different if you varied the bases in the arithmetic example and you give the example “10^n – 3n”. Is this meant to read “10^n – 3^n”?

Thanks a lot Matthew!

Yes you’re absolutely right, it was a typo and has now been corrected. Thanks for pointing that out!

Neat post – I’m impressed that Wolfram|Alpha can generate proofs, even if it’s as restricted as this. Taking a first step makes the next steps that much easier.

For proving inequalities, would it be useful to use the following argument? You can prove f(k+1) > g(k+1) using f(k) > g(k) and f(k+1)-f(k) >= g(k+1)-g(k), or alternatively that f(k+1)/f(k)) >= g(k+1)/g(k). I feel like this could be useful for a lot of simple induction exercises. For the 2n + 7 + sin(n) < (n +7)^2 + sin(n) problem, for the inductive step it would be enough to show that 2 <= (n+8)^2 – (n+7)^2, which is simple after simplifying the right side. The 2^n > n^3 example would be done with the division version, which simplifies to 2 >= ((k+1)/k)^3, which is done by computing the minimum value of (k+1)/k.

Thanks Michael!

Yes both those approaches seems like they could definitely work! While there’s often going to be more than one way to perform the inductive step, it makes the most sense to choose the simplest path. I’m going to look into this a bit more, and see what kind of breadth this can cover, and how strict its reliance is on pattern matching each expression.

When presented with:

prove by induction (a^n-b^n) is divisible by (a-b) for n > 0 and n in Z

It seems to correctly state the problem but gives a puzzling counter-example:

a^n-b^n is not divisible by a-b for True

True is not in Z. What went wrong?

Hi Bob,

This is a mix between a bug and going slightly outside the scope of what it can currently handle. So far, the divisibility proofs have only been able to handle univariate expressions, which are divisible by an integer. Since the proof couldn’t be generated, it still wanted to at least verify or disprove the given claim. When doing so, it would have searched for a possible counter-example in an attempt to disprove the claim. However, it was prepared for multivariate cases, and so one of the calculations would have defaulted to True instead of what was likely supposed to be an integer, which messed things up.

With that being said, the program should ideally be able to handle this type of query anyway since the logic for the univariate case (when a and b are provided) would follow the same logic. I’m going to first get rid of the bug altogether, but I’ve added this to problem to my TODO as a future type of query to support.

Thanks for the input!

I’d like to add a couple of things to my first comment. It’s a nice idea to try to automate induction proofs, but there are some pitfalls. First, if we have Sum[s[k], {k, n}] == S[n], then the task is to verify S[n+1]-S[n] == s[n+1]. But if s[k] depends on n, the proof is not as simple as that. To illustrate, for

prove by induction sum(n,k=1..n)=n^2

W|A right now says “not always true for n > 0 and n in Z”. Second, if the problem is Sum[k^-a, {k, n}] == HarmonicNumber[n, a], in that case HarmonicNumber in Mma is defined to be equal to that sum. So it’s not clear what the proof actually proves. More generally, when asked to simplify S[n+1]-S[n] == s[n+1], Mma uses its built-in identities, some of which may be equivalent to the statement that we’re proving, making the proof circular.

Hi,

For the “prove by induction sum(n,k=1..n)=n^2″ claim, it gave the “not always true” result because the program incorrectly failed in the base case. So far, only summations whose summand contains the iterator variable where examined, and thus failed to evaluate the sum properly. I will be fixing this soon.

For the HarmonicNumber summation equality, a similar issue was encountered. It assumed a base case of a=1, but it couldn’t properly evaluate for n. I will look further into this.

You’re absolutely right about encountering pitfalls. Given the small amount of time this has been live, and the little history it has, there’s no doubt going to be a variety logical adjustments being made. This blog was great for receiving some suggestions and pointing out bugs, which allows the proof generator to keep improving!

Great work, Connor! Congratulations on imagining, then creating a new capability for Wolfram|Alpha, on landing a position with Wolfram Research, and on getting accepted into the symbolic computation program at the University of Waterloo. Your blog is one of the most interesting I’ve read in the Wolfram blog series.

Thank you very much Rand, that means a lot!

(a^n-b^n) is divisible by (a-b) for n > 0 and n in Z

is obviously false if (a-b) = 0, but there doesn’t seem to be a way to express this additional constraint.

Hello, somehow it has problems with binomials, I think that is a bug? For example it will not prove the statement

prove sum of j from 0 to n binomial(n,j)=2^n

which is certainly true and not too difficult to prove.