This blog post was co-authored with Alfred Saidlo, Jatan Shrestha, and Reemet Ammer as part of the Neural Networks course project at the University of Tartu. This article illustrates the idea behind Deep Learning for Symbolic Mathematics and strives to reproduce the results outlined in the paper by Guillaume Lample and François Charton.

AlexNet singlehandedly kickstarted the Deep Learning Revolution in 2012 by winning the ImageNet image classification challenge, leaving the competitors in the dust and demonstrating the power of deep neural networks. The advancement of Graphics Processing Units (GPUs), the availability of large-scale labeled datasets courtesy of crowd-sourcing marketplaces, and the rapid improvement of deeper neural architectures led to crucial breakthroughs in a wide range of problems.
Applications of Deep Learning
Deep learning methods are the current state of the art in many applications in Computer Vision, Speech Recognition, and Natural Language Processing. Deep learning has exhibited stellar effectiveness in pattern recognition and in symbol-manipulation tasks such as machine translation, but it has failed to demonstrate any comparable success in symbolic computation. Deep neural networks capable of classifying thousands of images with a frightening degree of accuracy remain inferior at menial mathematical tasks like integer multiplication.
But what if we could use the current state-of-the-art architectures for machine translation to solve symbolic computations? These architectures treat sentences as sequences of tokens and need no domain-specific knowledge such as grammars or dictionaries. Moreover, they can perform complex tasks without hand-crafted rules. So, the central theme behind this paper is to perceive mathematics as a language, generate large datasets of problems and their solutions, and solve a problem by directly translating it into its solution.
Machine translation for symbolic mathematics follows the same workflow as any other NLP / machine translation task:
Represent problems and solutions as sequences to use seq2seq models
Generate large datasets of problems and solutions
Train models to translate problems into their solutions
So, how do we represent math problems and their solutions as sequences? The idea is to convert the expressions to trees and from trees to sequences. A simple procedure to represent expressions as sequences is illustrated below.
Expressions to trees

Trees to sequences
Now the expressions can be processed by seq2seq models.
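As a rough illustration, here is a minimal Python sketch of this conversion using SymPy's expression trees; the function and token names are our own, not the paper's exact tokenizer:

```python
import sympy as sp

def to_prefix(expr):
    """Flatten a SymPy expression tree into a prefix-order token sequence."""
    if expr.is_Symbol or expr.is_Number:
        return [str(expr)]  # leaves: variables and constants
    # The operator labels the internal node, followed by its subtrees.
    tokens = [type(expr).__name__]  # e.g. 'Add', 'Mul', 'Pow', 'cos'
    for arg in expr.args:
        tokens += to_prefix(arg)
    return tokens

x = sp.Symbol('x')
print(to_prefix(sp.cos(x) + 3 * x**2))
# e.g. ['Add', 'Mul', '3', 'Pow', 'x', '2', 'cos', 'x']
```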
The paper mainly considers three tasks in Symbolic Mathematics:
Symbolic Integration
First-order differential equations
Second-order differential equations
Now the question remains: how do we generate the datasets? Creating large datasets of problems and their solutions boils down to a supervised learning problem. Generating a random problem and its solution starts with sampling a random unary-binary tree: operators are drawn randomly for its internal nodes, and constants or variables for its leaves.
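For intuition, here is a toy Python sketch of such a sampler. The paper's actual algorithm samples unary-binary trees uniformly by node count; this simplified depth-bounded version does not attempt that, and degenerate expressions (e.g. division by zero) are not filtered out:

```python
import random
import sympy as sp

x = sp.Symbol('x')
UNARY = [sp.sin, sp.cos, sp.exp, sp.log]                           # unary operators
BINARY = [sp.Add, sp.Mul, lambda a, b: a - b, lambda a, b: a / b]  # binary operators

def random_expr(depth=3):
    """Grow a random expression tree; leaves are the variable or small constants."""
    if depth == 0 or random.random() < 0.3:
        return random.choice([x, sp.Integer(random.randint(1, 5))])
    if random.random() < 0.5:                  # unary internal node
        return random.choice(UNARY)(random_expr(depth - 1))
    op = random.choice(BINARY)                 # binary internal node
    return op(random_expr(depth - 1), random_expr(depth - 1))

print(random_expr())  # output varies per run
```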
Consequently, there are three different ways to generate data, i.e., a problem and its solution, for symbolic integration.
The first one is the Forward (FWD) approach

This approach generates a random function f, computes its antiderivative F using an external symbolic framework (SymPy, Mathematica, etc.), and adds the pair (f, F) to the training set.
However, this approach is slow and limited to functions the framework can integrate.
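A minimal SymPy sketch of the FWD direction, with a fixed example function standing in for a randomly sampled one:

```python
import sympy as sp

x = sp.Symbol('x')
f = x * sp.exp(x)            # randomly generated problem (fixed here)
F = sp.integrate(f, x)       # the external framework computes the solution
if not F.has(sp.Integral):   # keep the pair only if SymPy actually succeeded
    print((f, F))            # training pair: (x*exp(x), (x - 1)*exp(x))
```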
The alternative is the Backward (BWD) approach

This approach generates a random function F, computes its derivative f, and adds the pair (f, F) to the training set.
BWD leads to long problems with short solutions, whereas FWD leads to short problems with longer solutions.
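The BWD direction is the trivial inverse, sketched below with a fixed example; since differentiation always succeeds, every sample yields a valid pair:

```python
import sympy as sp

x = sp.Symbol('x')
F = sp.sin(x) * sp.exp(x)   # randomly generated solution (fixed here)
f = sp.diff(F, x)           # its derivative becomes the problem
print((f, F))               # training pair: (exp(x)*sin(x) + exp(x)*cos(x), exp(x)*sin(x))
```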
The middle ground is the Integration by parts (IBP) approach. Given two sampled functions F and G with derivatives f and g, if the integral of fG is already known, integration by parts yields the integral of Fg as FG minus the integral of fG, with no external integrator required; a toy sketch follows below.

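Here is a toy Python sketch of the IBP idea, assuming a small table of integrals already produced by the other generators; the table and sampled functions are hard-coded for illustration:

```python
import sympy as sp

x = sp.Symbol('x')
known = {sp.exp(x): sp.exp(x)}       # already-known pair: the integral of exp(x) is exp(x)

F, G = x, sp.exp(x)                  # two sampled functions
f, g = sp.diff(F, x), sp.diff(G, x)  # their derivatives

fG = sp.simplify(f * G)
if fG in known:                      # the integral of f*G is already in the table
    integral_Fg = sp.expand(F * G - known[fG])  # integration by parts
    print((F * g, integral_Fg))      # new pair: (x*exp(x), x*exp(x) - exp(x))
```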
Ordinary Differential Equations (ODEs) have a slightly more involved generation procedure, which is therefore omitted from this blog post. Five different datasets were generated using the FWD, BWD, IBP, ODE1, and ODE2 approaches, with expressions of up to 15 internal nodes in their trees.
A Transformer model is leveraged to tackle the seq2seq task. The Transformer is a non-recurrent neural network built around an attention mechanism. The architecture comprises six layers, eight attention heads, and 512 dimensions, and is trained with a cross-entropy loss and the Adam optimizer.

Transformer model architecture
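As a rough sketch, PyTorch's built-in Transformer can be configured with the same hyperparameters; the vocabulary size and learning rate below are placeholders, not the paper's exact values:

```python
import torch.nn as nn
import torch.optim as optim

VOCAB_SIZE = 256  # placeholder token vocabulary size

model = nn.Transformer(
    d_model=512,           # 512-dimensional embeddings
    nhead=8,               # 8 attention heads
    num_encoder_layers=6,  # 6 encoder layers
    num_decoder_layers=6,  # 6 decoder layers
)
embedding = nn.Embedding(VOCAB_SIZE, 512)  # token -> vector
generator = nn.Linear(512, VOCAB_SIZE)     # vector -> token logits

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
```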
The seq2seq model takes an input sequence of arbitrary length and transforms it into an output sequence of arbitrary length, as illustrated below.

Input sequence in German transformed to output sequence in English
In the case of symbolic integration, the input sequence consists of the symbols that define the input function, and the output sequence consists of the symbols that define its integral. Moreover, beam search is used at inference time to generate the most likely answers token by token. Using beam search at inference allows the generation of several candidate solutions, as sketched below.
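To make the idea concrete, here is a generic beam-search sketch over a toy next-token scorer; the scorer is a uniform stand-in for the trained Transformer decoder, and the vocabulary is our own invention:

```python
import math

VOCAB = ["x", "sin", "cos", "+", "<eos>"]

def log_probs(prefix):
    """Toy stand-in for the decoder: a uniform distribution over the vocabulary."""
    return {tok: math.log(1.0 / len(VOCAB)) for tok in VOCAB}

def beam_search(beam_size=3, max_len=5):
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":  # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            for tok, lp in log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # keep only the beam_size most likely hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

for seq, score in beam_search():
    print(seq, round(score, 2))
```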
The evaluation process involves:
Testing the trained models on 5,000 held-out examples from the same dataset
Ensuring no overlap, i.e., no test problem has been seen during training
Checking the solutions with an external framework (SymPy), as sketched after this list
Using beam search: if a solution is incorrect, simply trying the next candidate, up to 10 or 50 solutions
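Checking a candidate is cheap because differentiation always succeeds: a predicted integral is accepted if its derivative matches the problem after simplification. A minimal sketch, with a hard-coded candidate standing in for a beam hypothesis:

```python
import sympy as sp

x = sp.Symbol('x')
f = x * sp.cos(x)                       # problem: find the integral of f
candidate = x * sp.sin(x) + sp.cos(x)   # one hypothesis from the beam

def is_correct(problem, guess):
    """Accept the guess if its derivative equals the problem after simplification."""
    return sp.simplify(sp.diff(guess, x) - problem) == 0

print(is_correct(f, candidate))  # True
```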
The benchmarks from the paper for different datasets are attached below. All accuracies below are given using a beam search of size 10.


Benchmarks from the paper
Initially, we decided to reproduce the results with the provided models and datasets. They were evaluated on Google Colab and Google Cloud virtual machines, typically equipped with an Nvidia Tesla V100 with 16 GB of VRAM and 24 GB of RAM. The table below shows the results achieved during the evaluation process with the provided models on the feasible datasets. Some of the datasets were massive in size and infeasible to evaluate on our hardware.


Evaluation results on workable datasets
We observed that the results for integration are better than those for ODEs, and that they degrade as the order of the equations increases. Notably, the results, especially on the backward-generated datasets, are better than those of Mathematica and Maple.
Moreover, we decided to generate our own FWD, BWD, IBP, ODE1, and ODE2 datasets of around 10,000 equations each. Generating the datasets turned out to be a daunting task due to the time required for each individual approach. The next step was to train the seq2seq model on the newly generated datasets and evaluate it on the benchmark datasets. The evaluation results are attached below; in comparison to the benchmark, the performance of the model is below par.


Evaluation results on generated datasets
To conclude, machine translation models do apply to tasks in symbolic mathematics. Seq2seq models are capable of processing mathematical expressions and translating problems into their solutions. The data generation process remains the key to improvement on the generalization front: mixing different generators yields better results, as exhibited by the BWD and FWD models performing better on the IBP test set. With limited hardware resources, tuning the beam search size turned out to be a challenge. However, symbolic problems have many equivalent solutions, and beam search allows the retrieval of such equivalent solutions to improve the results.