Neural Networks for the Prediction of Organic Chemistry Reactions
We introduce a neural network machine learning algorithm for the prediction of basic organic chemistry reactions, and explore its performance on two compound types using a novel reaction fingerprint.
Reaction prediction
remains one of the major challenges for organic chemistry and is a prerequisite
for efficient synthetic planning. It is desirable to develop algorithms that,
like humans, “learn” from being exposed to examples of the application of the rules
of organic chemistry. We explore the use of neural networks for predicting
reaction types, using a new reaction fingerprinting method. We combine this
predictor with SMARTS transformations to build a system which, given a set of
reagents and reactants, predicts the likely products. We test this method on
problems from a popular organic chemistry textbook.
To develop the intuition and understanding for
predicting reactions, a human must take many semesters of organic chemistry and
gather insight over several years of lab experience. Over the past 40 years,
various algorithms have been developed to assist with synthetic design,
reaction prediction, and starting material selection. LHASA was the first of
these algorithms to aid in developing retrosynthetic pathways. This
algorithm required over a decade of effort to encode the necessary subroutines
to account for the various subtleties of retrosynthesis such as functional group
identification, polycyclic group handling, relative protecting group
reactivity, and functional group based transforms.
In the late 1980s to the early 1990s, new
algorithms for synthetic design and reaction prediction were developed. CAMEO, a
reaction predicting code, used subroutines specialized for each reaction type,
expanding to include reaction conditions in its analysis. EROS identified leading structures for
retrosynthesis by using bond polarity, electronegativity across the molecule,
and the resonance effect to identify the most reactive bond. SOPHIA was
developed to predict reaction outcomes with minimal user input; this algorithm
would guess the correct reaction type subroutine to use by identifying
important groups in the reactants; once the reactant type was identified,
product ratios would be estimated for the resulting products. SOPHIA was followed by the KOSP algorithm, which uses the same database to predict retrosynthetic targets. Other methods generated rules based on published reactions and used these transformations when designing a retrosynthetic pathway. Some methods encoded expert rules in the form of electron flow
diagrams. Another group attempted to grasp the diversity of reactions by
creating an algorithm that automatically searches for reaction mechanisms using
atom mapping and substructure matching.
While these algorithms have their subtle
differences, all require a set of expert rules to predict reaction outcomes.
Taking a more general approach, one group has encoded all of the reactions of
the Beilstein database, creating a “Network of Organic Chemistry”. By searching
this network, synthetic pathways can be developed for any molecule similar
enough to a molecule already in its database of 7 million reactions,
identifying both one-pot reactions that do not require time-consuming purification of intermediate products, and full multistep reactions that account for the cost of the materials, labor, and safety of the reaction. Algorithms that use encoded expert rules or databases of published reactions are able to accurately predict chemistry for queries that match reactions in their knowledge base. However, such algorithms do not
have the ability of a human organic chemist to predict the outcomes of
previously unseen reactions. In order to predict the results of new reactions,
the algorithm must have a way of connecting information from reactions that it
has been trained upon to reactions that it has yet to encounter.
Another strategy for reaction prediction algorithms draws from principles of physical chemistry and first predicts the energy barrier of a reaction in order to estimate its likelihood. Specific
examples of reactions include the development of a nanoreactor for early Earth
reactions, heuristic aided quantum chemistry, and
ROBIA, an algorithm for reaction prediction. While methods that are guided
by quantum calculations have the potential to explore a wider range of
reactions than the heuristic-based methods, these algorithms would require new
calculations for each additional reaction family and will be prohibitively
costly over a large set of new reactions.
A third strategy for reaction prediction
algorithms uses statistical machine learning. These methods can sometimes
generalize or extrapolate to new examples, as in the recent examples of picture
and handwriting identification, playing video games, and, most recently, playing Go. This last example is particularly interesting as Go is a complex board game with a search space of 10^170, which is on the order of the size of chemical space for medium-sized molecules. SYNCHEM was one early
effort in the application of machine learning methods to chemical predictions,
which relied mostly on clustering similar reactions and learning when reactions could be applied based on the presence of key functional groups.
Today, most machine learning approaches in
reaction prediction use molecular descriptors to characterize the reactants in
order to guess the outcome of the reaction. Such descriptors range from
physical descriptors such as molecular weight, number of rings, or partial
charge calculations to molecular fingerprints, a vector of bits or floats that
represent the properties of the molecule. ReactionPredictor is an
algorithm that first identifies potential electron sources and electron sinks
in the reactant molecules based on atom and bond descriptors. Once identified,
these sources and sinks are paired to generate possible reaction mechanisms.
Finally, neural networks are used to determine the most likely combinations in
order to predict the true mechanism. While this approach allows for the
prediction of many reactions at the mechanistic level, many of the elementary
organic chemistry reactions that are the building blocks of organic synthesis
have complicated mechanisms, requiring several steps that would be costly for
this algorithm to predict.
Many algorithms that predict properties of
organic molecules use various types of fingerprints as the descriptor. Morgan
fingerprints and extended circular fingerprints have been used to predict molecular properties such as HOMO–LUMO gaps, protein–ligand binding affinity, and drug toxicity levels, and even to predict synthetic accessibility. Recently, Duvenaud et al. applied graph neural networks to generate continuous
molecular fingerprints directly from molecular graphs. This approach
generalizes fingerprinting methods such as the ECFP by parametrizing the
fingerprint generation method. These parameters can then be optimized for each
prediction task, producing fingerprint features that are relevant for the task.
Other fingerprinting methods that have been developed use the Coulomb matrix, radial distribution functions, and atom pair descriptors. For classifying
reactions, one group developed a fingerprint to represent a reaction by taking
the difference between the sum of the fingerprints of the products and the sum
of the fingerprints of the reactants. A variety of fingerprinting methods were
tested for the constituent fingerprints of the molecules.
In this work, we apply fingerprinting methods,
including neural molecular fingerprints, to predict organic chemistry
reactions. Our algorithm predicts the most likely reaction type for a given set
of reactants and reagents, using what it has learned from training examples.
These input molecules are described by concatenating the fingerprints of the
reactants and the reagents; this concatenated fingerprint is then used as the
input for a neural network to classify the reaction type. With information
about the reaction type, we can make predictions about the product molecules.
One simple approach for predicting product molecules from the reactant
molecules, which we use in this work, is to apply a SMARTS transformation that
describes the predicted reaction. Previously, sets of SMARTS transformations
have been applied to produce large libraries of synthetically accessible
compounds in the areas of molecular discovery, metabolic networks, drug
discovery and discovery of one-pot reactions. In our algorithm, we use SMARTS
transformation for targeted prediction of product molecules from reactants.
However, this method can be replaced by any method that generates product
molecule graphs from reactant molecule graphs. An overview of our method can be found in Figure 1 and is explained in further detail in Prediction Methods.
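The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the 3-bit fingerprints, the stub classifier, and the function names are hypothetical stand-ins for the real Morgan/neural fingerprints and the trained neural network.

```python
# Sketch of the prediction pipeline: concatenate molecular fingerprints into
# a reaction fingerprint, classify the reaction type, take the argmax.

def reaction_fingerprint(substrate_fp, reactant2_fp, reagent_fp):
    """Concatenate the three molecular fingerprints into one reaction fingerprint."""
    return substrate_fp + reactant2_fp + reagent_fp

def predict_reaction_type(reaction_fp, classifier):
    """Return the index of the most probable reaction type."""
    probs = classifier(reaction_fp)
    return max(range(len(probs)), key=probs.__getitem__)

# Stub standing in for the trained neural network: returns a fixed
# probability vector over 3 toy reaction types.
def stub_classifier(fp):
    return [0.1, 0.7, 0.2]

fp = reaction_fingerprint([1, 0, 1], [0, 1, 0], [1, 1, 0])
rtype = predict_reaction_type(fp, stub_classifier)  # -> 1
```

The predicted index would then select the SMARTS transformation to apply to the reactants.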
Figure 1. An overview of our method for predicting reaction
type and products. A reaction fingerprint, made from concatenating the
fingerprints of reactant and reagent molecules, is the input for a neural
network that predicts the probability of 17 different reaction types,
represented as a reaction type probability vector. The algorithm then predicts
a product by applying to the reactants a transformation that corresponds to the
most probable reaction type. In this work, we use a SMARTS transformation for
the final step.
Performance on
Cross-Validation Set
We created a data set covering four alkyl halide reaction types and 12 alkene reaction types; further details on the construction
of the data set can be found in Methods.
Our training set consisted of 3400 reactions from this data set, and the test
set consisted of 17,000 reactions; both the training set and the test set were
balanced across reaction types. During optimization on the training set, k-fold
cross-validation was used to help tune the parameters of the neural net. Table 1 reports
the cross-entropy score and the accuracy of the baseline and fingerprinting
methods on this test set. Here the accuracy is defined by the percentage of
matching indices of maximum values in the predicted probability vector and the
target probability vector for each reaction.
Table 1.

| fingerprint method | fingerprint length | train NLL | train accuracy (%) | test NLL | test accuracy (%) |
| baseline           | 51                 | 0.2727    | 78.8               | 2.5573   | 24.7               |
| Morgan             | 891                | 0.0971    | 86.0               | 0.1792   | 84.5               |
| neural             | 181                | 0.0976    | 86.0               | 0.1340   | 85.7               |
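The accuracy metric defined above (a prediction counts as correct when the index of the maximum of the predicted probability vector matches the index of the maximum of the target vector) can be sketched as follows; the probability vectors are toy examples, not data from the paper.

```python
# Accuracy as the percentage of matching argmax indices between predicted
# and target probability vectors.

def argmax(v):
    return max(range(len(v)), key=v.__getitem__)

def accuracy(predicted, target):
    correct = sum(argmax(p) == argmax(t) for p, t in zip(predicted, target))
    return 100.0 * correct / len(target)

preds   = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
acc = accuracy(preds, targets)  # all three argmaxes match -> 100.0
```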
Figure 2 shows
the confusion matrices for the baseline, neural, and Morgan fingerprinting
methods, respectively. The confusion matrices for the Morgan and neural
fingerprints show that the predicted reaction type and the true reaction type
correspond almost perfectly, with few mismatches. The only exceptions are in
the predictions for reaction types 3 and 4, corresponding to nucleophilic
substitution reaction with a methyl shift and the elimination reaction with a
methyl shift. As described in Methods,
these reactions are assumed to occur together, so they are each assigned
probabilities of 50% in the training set. As a result, the algorithm cannot
distinguish these reaction types and the result on the confusion matrix is a 2
× 2 square. For the baseline method, the first reaction type, the “NR” classification, is often overpredicted, with some additional overprediction of other reaction types, as shown by the horizontal bands.
Performance on
Predicting Reaction Type of Exam Questions
Kayala et al. had
previously employed organic textbook questions both as the training set and as
the validation set for their algorithm, reporting 95.7% accuracy on their
training set. We similarly decided to test our algorithm on a set of textbook
questions. We selected problems 8-47 and 8-48 from the Wade sixth edition organic chemistry textbook, shown in Figure 3. The reagents listed in each problem
were assigned as secondary reactants or reagents so that they matched the
training set. For all prediction methods, our networks were first trained on
the training set of generated reactions, using the same hyperparameters found
by the cross-validation search. The similarity of the exam questions to the
training set was determined by measuring the Tanimoto (49) similarity of the fingerprints of the reactant and reagent molecules in each reactant set. The average Tanimoto score between the training set reactants and reagents and the exam set reactants and reagents is 0.433, and the highest Tanimoto scores observed between exam questions and training questions were 1.00 on 8-48c and 0.941 on 8-47a. This indicates that 8-48c was one of the training set examples. Table SI.1 shows more detailed results for this Tanimoto analysis.
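The Tanimoto score used above compares two binary fingerprints as the ratio of shared "on" bits to total distinct "on" bits; a score of 1.00 means identical fingerprints. A minimal sketch, with toy fingerprints:

```python
# Tanimoto similarity between two binary fingerprints: |A and B| / |A or B|
# over the sets of "on" bit positions.

def tanimoto(fp1, fp2):
    on1 = {i for i, b in enumerate(fp1) if b}
    on2 = {i for i, b in enumerate(fp2) if b}
    if not (on1 | on2):
        return 1.0  # two all-zero fingerprints are conventionally identical
    return len(on1 & on2) / len(on1 | on2)

a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 0, 1, 1, 0]
s = tanimoto(a, b)  # 3 shared bits / 5 distinct "on" bits -> 0.6
```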
Using our fingerprint-based neural network
algorithm, we were able to identify the correct reaction type for most
reactions in our scope of alkene and alkyl halide reactions, given only the
reactants and reagents as inputs. We achieved an accuracy of 85% on our test reactions and 80% on selected textbook questions. With this prediction of the
reaction type, the algorithm was further able to guess the structure of the
product for a little more than half of the problems. The main limitation in the
prediction of the product structure was due to the limitations of the SMARTS
transformation to describe the mechanism of the reaction type completely.
While previously developed machine learning
algorithms are also able to predict the products of these reactions with
similar or better accuracy, the structure of our algorithm allows for greater flexibility. Our algorithm is able to learn the
probabilities of a range of reaction types. To expand the scope of our
algorithm to new reaction types, we would not need to encode new rules, nor
would we need to account for the varying number of steps in the mechanism of
the reaction; we would just need to add the additional reactions to the
training set. The simplicity of our reaction fingerprinting algorithm allows
for rapid expansion of our predictive capabilities given a larger data set of
well-curated reactions. Using data sets of experimentally published reactions, we
can also expand our algorithm to account for the reaction conditions in its
predictions and, later, predict the correct reaction conditions.
This paper represents a step toward the goal of
developing a machine learning algorithm for automatic synthesis planning of
organic molecules. Once we have an algorithm that can predict the reactions
that are possible from its starting materials, we can begin to use the
algorithm to string these reactions together to develop a multistep synthetic
pathway. This pathway prediction can be further optimized to account for
reaction conditions, cost of materials, fewest number of reaction steps, and
other factors to find the ideal synthetic pathway. Using neural networks helps
the algorithm to identify important features from the reactant molecules’
structure in order to classify new reaction types.
Methods
Data Set Generation
The data set of reactions was developed as
follows: A library of all alkanes containing 10 carbon atoms or fewer was
constructed. To each alkane, a single functional group was added, either a
double bond or a halide (Br, I, Cl). Duplicates were removed from this set to
make the substrate library. Sixteen different reactions were considered, 4
reactions for alkyl halides and 12 reactions for alkenes. Reactions resulting
in methyl shifts or resulting in Markovnikov or anti-Markovnikov product were
considered as separate reaction types. Each reaction is associated with a list
of secondary reactants and reagents, as well as a SMARTS transformation to
generate the product structures from the reactants.
To generate the reactions, every substrate in
the library was combined with every possible set of secondary reactants and
reagents. Those combinations that matched the reaction conditions set by our
expert rules were assigned a reaction type. If none of the reaction conditions
were met, the reaction was designated a “null reaction” or NR for short. We
generated a target probability vector to reflect this reaction type assignment
with a one-hot encoding; that is, the index in the probability vector that
matches the assigned reaction type had a probability of 1, and all other
reaction types had a probability of 0. The notable exception to this rule was
for the elimination and substitution reactions involving methyl shifts for
bulky alkyl halides; these reactions were assumed to occur together, and so 50%
was assigned to each index corresponding to these reactions. Products were
generated using the SMARTS transformation associated with the reaction type
with the two reactants as inputs. Substrates that did not match the reaction
conditions were designated “null reactions” (NR), indicating that the final
result of the reaction is unknown. RDKit was
used to handle the requirements and the SMARTS transformation. A total of
1,277,329 alkyl halide and alkene reactions were generated. A target reaction
probability vector was generated for each reaction.
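The target-vector construction described above can be sketched as follows. The 17-way vector (16 reaction types plus NR) and the methyl-shift pair at indices 3 and 4 come from the text; everything else is an illustrative stand-in.

```python
# Target probability vectors: one-hot for most reaction types, 50/50 for the
# paired methyl-shift substitution/elimination types.

N_TYPES = 17                 # 16 reaction types + the "null reaction" (NR) class
METHYL_SHIFT_PAIR = (3, 4)   # substitution / elimination with methyl shift

def target_vector(reaction_type):
    vec = [0.0] * N_TYPES
    if reaction_type in METHYL_SHIFT_PAIR:
        for i in METHYL_SHIFT_PAIR:
            vec[i] = 0.5     # these reactions are assumed to occur together
    else:
        vec[reaction_type] = 1.0
    return vec

t1 = target_vector(7)   # one-hot: 1.0 at index 7
t2 = target_vector(3)   # 0.5 at indices 3 and 4
```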
Prediction Methods
As outlined in Figure 1, to predict the reaction outcomes of a given
query, we first predict the probability of each reaction type in our data set
occurring, and then we apply SMARTS transformations associated with each
reaction. The reaction probability vector, i.e., the vector encoding the
probability of all reactions, was predicted using a neural network with
reaction fingerprints as the inputs. This reaction fingerprint was formed as a
concatenation of the molecular fingerprints of the substrate (Reactant1), the
secondary reactant (Reactant2), and the reagent. Both the Morgan fingerprint
method, in particular the extended-connectivity circular fingerprint (ECFP),
and the neural fingerprint method were tested for generating the molecular
fingerprints. A Morgan circular fingerprint hashes the features of a molecule
for each atom at each layer into a bit vector. Each layer considers atoms in
the neighborhood of the starting atom that are at less than the maximum
distance assigned for that layer. Information from previous layers is
incorporated into later layers, until the highest layer, e.g., the maximum bond
length radius, is reached. A neural fingerprint also records atomic
features at all neighborhood layers but, instead of using a hash function to
record features, uses a convolutional neural network, thus creating a
fingerprint with differentiable weights. Further discussion about circular
fingerprints and neural fingerprints can be found in Duvenaud et al. The
circular fingerprints were generated with RDKit, and the neural fingerprints
were generated with code from Duvenaud et al. The neural network used for prediction had one hidden layer of 100 units. Hyperopt in conjunction with Scikit-learn was used to optimize the learning rate, the initial scale, and the
fingerprint length for each of the molecules.
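The layered hashing idea behind circular fingerprints can be illustrated with a toy molecular graph. This is only a sketch of the concept: real Morgan/ECFP implementations (e.g., RDKit's) use richer atom invariants and a different hashing scheme, and the molecule below is a hypothetical adjacency list.

```python
import zlib

N_BITS = 16

def stable_hash(obj):
    # Deterministic stand-in for a fingerprint hash function.
    return zlib.crc32(repr(obj).encode())

def circular_fingerprint(atoms, bonds, radius=2, n_bits=N_BITS):
    """atoms: element symbols; bonds: atom index -> list of neighbor indices."""
    fp = [0] * n_bits
    env = list(atoms)  # layer-0 environments are the atom symbols themselves
    for layer in range(radius + 1):
        for identifier in env:
            fp[stable_hash((layer, identifier)) % n_bits] = 1
        # layer r+1 environment = (atom's layer-r env, sorted neighbor envs),
        # so information from previous layers is carried into later layers
        env = [(env[i], tuple(sorted(env[j] for j in bonds[i])))
               for i in range(len(atoms))]
    return fp

# Toy C-C-O graph (an ethanol-like skeleton, hydrogens implicit)
atoms = ["C", "C", "O"]
bonds = {0: [1], 1: [0, 2], 2: [1]}
fp = circular_fingerprint(atoms, bonds)
```

A neural fingerprint replaces the hash with a learned, differentiable function, so the bit assignments themselves can be optimized for the prediction task.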
For some reaction types, certain reagents or
secondary reactants are required for that reaction. Thus, it is possible that
the algorithm may learn to simply associate these components in the reaction
with the corresponding reaction type. As a baseline test to measure the impact
of the secondary reactant and the reagent on the prediction, we also performed
the prediction with a modified fingerprint. For the baseline metric, the
fingerprint representing the reaction was a one-hot vector representation for
the 20 most common secondary reactants and the 30 most common reagents. That
is, if one of the 20 most common secondary reactants or one of the 30 most
common reagents was found in the reaction, the corresponding bits in the
baseline fingerprint were turned on; if one of the secondary reactants or
reagents was not in these lists, then a bit designated for “other” reactants or
reagents was turned on. This combined one-hot representation of the secondary
reactants and the reagents formed our baseline fingerprint.
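The baseline fingerprint described above can be sketched as follows. Note that 20 common secondary reactants + 1 "other" bit + 30 common reagents (the 51-bit length in Table 1 suggests no separate "other" reagent bit, but we include one here for symmetry); the vocabulary lists below are hypothetical shortened stand-ins.

```python
# Baseline fingerprint: one-hot over common secondary reactants and reagents,
# with an "other" bit per category for anything outside the common lists.

COMMON_REACTANTS = ["HBr", "HCl", "H2O", "Br2"]   # stands in for the top 20
COMMON_REAGENTS  = ["H2SO4", "CCl4", "ROOR"]      # stands in for the top 30

def baseline_fingerprint(secondary_reactant, reagent):
    fp = [0] * (len(COMMON_REACTANTS) + 1 + len(COMMON_REAGENTS) + 1)
    i = (COMMON_REACTANTS.index(secondary_reactant)
         if secondary_reactant in COMMON_REACTANTS else len(COMMON_REACTANTS))
    fp[i] = 1  # "other" bit if the reactant is not in the common list
    j = (COMMON_REAGENTS.index(reagent)
         if reagent in COMMON_REAGENTS else len(COMMON_REAGENTS))
    fp[len(COMMON_REACTANTS) + 1 + j] = 1
    return fp

fp = baseline_fingerprint("HBr", "unknown_reagent")
```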
Once a reaction type has been predicted by the
algorithm, the SMARTS transformation associated with the reaction type is applied
to the reactants. If the input reactants meet the requirements of the SMARTS
transformation, the product molecules generated by the transformation are the
predicted structures of the products. If the reactants do not match the requirements of the SMARTS transformation, the algorithm instead returns the structures of the reactants, i.e., it is assumed that no reaction occurs.
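The product-prediction step with its no-reaction fallback can be sketched as follows. The transform function, reaction-type indices, and SMILES strings are hypothetical stand-ins; the paper applies RDKit SMARTS transformations rather than string matching.

```python
# Each predicted reaction type maps to a transform; if the reactants do not
# meet the transform's requirements, the reactants themselves are returned,
# i.e., "no reaction" is predicted.

def hydrohalogenation(reactants):
    # Hypothetical stand-in: only applies when an alkene ("C=C") is present.
    if any("C=C" in r for r in reactants):
        return ["CC(Br)C"]  # toy Markovnikov product
    return None  # requirements not met

TRANSFORMS = {0: hydrohalogenation}

def predict_products(reaction_type, reactants):
    products = TRANSFORMS[reaction_type](reactants)
    return products if products is not None else list(reactants)

p1 = predict_products(0, ["CC=C", "[H]Br"])  # transform applies
p2 = predict_products(0, ["CCC", "[H]Br"])   # no match -> reactants returned
```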