TY - JOUR
T1 - DFT-Machine Learning Approach for Accurate Prediction of pKa
AU - Lawler, Robin
AU - Liu, Yao Hao
AU - Majaya, Nessa
AU - Allam, Omar
AU - Ju, Hyunchul
AU - Kim, Jin Young
AU - Jang, Seung Soon
N1 - Publisher Copyright: © 2021 American Chemical Society
PY - 2021/10/7
Y1 - 2021/10/7
N2 - In this study, we propose a novel method of pKa prediction in a diverse set of acids, which combines density functional theory (DFT) method with machine learning (ML) methods. First, the DFT method with B3LYP/6-31++G**/SM8 is used to predict pKa, yielding a mean absolute error of 1.85 pKa units. Subsequently, such pKa values predicted from the DFT method are employed as one of 10 molecular descriptors for developing ML models trained on experimental data. Kernel Ridge Regression (KRR), Gaussian Process Regression, and Artificial Neural Network are optimized using three pipelines: Pipeline 1 involving only hyperparameter optimization (HPO), Pipeline 2 involving HPO followed by a relative contribution analysis (RCA) and recursive feature elimination (RFE), and Pipeline 3 involving HPO followed by RCA and RFE on an expanded set of composite features. Finally, it is demonstrated that KRR with Pipeline 3 yields optimal pKa prediction at an MAE of 0.60 log units. This algorithm was then utilized to predict the pKa of 37 novel acids. The two most important features were determined to be the number of hydrogen atoms in the molecule and the degree of oxidation of the acid. The predicted pKa values were documented for future reference.
AB - In this study, we propose a novel method of pKa prediction in a diverse set of acids, which combines density functional theory (DFT) method with machine learning (ML) methods. First, the DFT method with B3LYP/6-31++G**/SM8 is used to predict pKa, yielding a mean absolute error of 1.85 pKa units. Subsequently, such pKa values predicted from the DFT method are employed as one of 10 molecular descriptors for developing ML models trained on experimental data. Kernel Ridge Regression (KRR), Gaussian Process Regression, and Artificial Neural Network are optimized using three pipelines: Pipeline 1 involving only hyperparameter optimization (HPO), Pipeline 2 involving HPO followed by a relative contribution analysis (RCA) and recursive feature elimination (RFE), and Pipeline 3 involving HPO followed by RCA and RFE on an expanded set of composite features. Finally, it is demonstrated that KRR with Pipeline 3 yields optimal pKa prediction at an MAE of 0.60 log units. This algorithm was then utilized to predict the pKa of 37 novel acids. The two most important features were determined to be the number of hydrogen atoms in the molecule and the degree of oxidation of the acid. The predicted pKa values were documented for future reference.
UR - https://www.scopus.com/pages/publications/85116701045
U2 - 10.1021/acs.jpca.1c05031
DO - 10.1021/acs.jpca.1c05031
M3 - Article
C2 - 34554744
AN - SCOPUS:85116701045
SN - 1089-5639
VL - 125
SP - 8712
EP - 8722
JO - Journal of Physical Chemistry A
JF - Journal of Physical Chemistry A
IS - 39
ER -