25.7k views
0 votes
I am assigning 3D coordinates to the atoms in some 10K molecules that are currently represented as SMILES. The motivation is that, as many chemoinformatics papers have shown, 3D and 2D structures together can represent a molecule better. This is also emphasized in large-scale molecular representation learning challenges, where the 3D structures are estimated with DFT. However, computing DFT with publicly available libraries (e.g. pyscf) seems time-consuming, or maybe I am missing something. The faster approach used by RDKit or OpenBabel (force field-based, e.g. MMFF, UFF) generates less accurate 3D coordinates (yes, I did notice this post). I have been thinking of some options: (1) map my compounds to database identifiers (e.g. DrugBank ID) and extract 3D structures from those databases; (2) just run pyscf; (3) find papers that predict 3D structure from SMILES (do such papers even exist?).

1 Answer

5 votes

Final answer:

SMILES strings can be converted to 3D structures with DFT when accuracy matters most, or with faster force field-based methods in tools like RDKit when speed matters most. Other options are to pull structures from databases that already contain the compounds, or to look into machine learning models that predict 3D structures from SMILES.

Step-by-step explanation:

There are several ways to convert SMILES strings to 3D structures. Density Functional Theory (DFT) calculations with publicly available libraries like pyscf give the most accurate 3D coordinates of the options discussed here, but they are time-consuming for a set as large as 10K molecules. Note that a DFT geometry optimization still needs starting coordinates, which typically come from a cheaper method. Force field-based methods in RDKit or OpenBabel, using MMFF or UFF, generate 3D structures much more quickly, though less accurately.
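As a rough illustration of the fast route, here is a minimal sketch using RDKit. It assumes RDKit is installed; the function name `smiles_to_3d`, the seed, and the iteration limit are my own choices for illustration. RDKit first embeds a conformer with distance geometry (ETKDG) and then the force field refines it, falling back to UFF when MMFF has no parameters for the molecule.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def smiles_to_3d(smiles, seed=42):
    """Return an RDKit Mol with one force-field-refined 3D conformer, or None on failure.

    Hypothetical helper for illustration; seed and maxIters are arbitrary choices.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # SMILES could not be parsed
        return None

    # Explicit hydrogens are needed for a sensible 3D geometry.
    mol = Chem.AddHs(mol)

    # Distance-geometry embedding (ETKDG); a fixed seed makes the result reproducible.
    params = AllChem.ETKDGv3()
    params.randomSeed = seed
    if AllChem.EmbedMolecule(mol, params) != 0:  # -1 signals embedding failure
        return None

    # Refine with MMFF when parameters exist for every atom, otherwise use UFF.
    if AllChem.MMFFHasAllMoleculeParams(mol):
        AllChem.MMFFOptimizeMolecule(mol, maxIters=500)
    else:
        AllChem.UFFOptimizeMolecule(mol, maxIters=500)
    return mol
```

Calling `smiles_to_3d("CCO").GetConformer().GetPositions()` gives the coordinates as an array with one row per atom, and `Chem.MolToMolBlock(mol)` writes them out in MOL format. A geometry like this can also serve as the starting point for a DFT optimization if you later decide some molecules need it.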

Predicting 3D structure directly from SMILES without time-consuming computation is a topic of current research interest, and some studies have explored machine learning models that make such predictions. For your case, the trade-off would be between the computational demand and the level of accuracy required.

If databases such as DrugBank contain your compounds, pulling pre-computed 3D structures from them can be an efficient route. However, coverage varies, so expect some of your molecules to be missing.
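Because coverage is partial, one practical pattern is to check the database first and compute only what is missing. The sketch below uses only the standard library and is purely illustrative: `split_by_coverage` and the `precomputed` mapping are hypothetical names, and the mapping stands in for whatever lookup table you build from a database export. It is keyed on the SMILES string for simplicity, whereas in practice a canonical identifier such as an InChIKey is a safer key, since one molecule can be written as many different SMILES.

```python
def split_by_coverage(smiles_list, precomputed):
    """Split compounds into those with a stored 3D structure and those without.

    `precomputed` is a hypothetical dict mapping an identifier (here the SMILES
    string itself) to a stored 3D structure, e.g. a MOL block from a database.
    """
    hits = {}
    misses = []
    for smiles in smiles_list:
        if smiles in precomputed:
            hits[smiles] = precomputed[smiles]  # reuse the stored structure
        else:
            misses.append(smiles)  # needs force field or DFT generation
    return hits, misses
```

Only the `misses` list then needs to go through the force field or DFT step.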

Chemists rely on these 3D representations, alongside quantum mechanical descriptions of bonding such as orbital hybridization, to understand and predict the physical and chemical properties that depend on a molecule's shape and structure.

by User Vishnu R (7.1k points)