"The joy of discovery is certainly the liveliest that the mind of man can ever feel".
-Claude Bernard
Exactly a year ago, the Journal of Cheminformatics published a paper presenting a Chemical Structure Explorer that lets one search through all the chemicals that have an entry page on Wikipedia. I think this is a wonderful resource that puts the chemical space of Wikipedia in a nutshell. It provides both structure and substructure search via a simple web interface. Here's the paper for you and the web interface for all the chemistry enthusiasts.
I totally agree with the authors that this effort could improve the quality of the chemical entries in Wikipedia, and the resource is indeed handy for researchers and medicinal chemists who want to find molecules similar to their novel leads or other chemical compounds of interest. They also pointed out a few duplicate entries (same SMILES) in Wikipedia, noting that a few of these are due to missing stereochemistry in the SMILES. About 250 of the most frequently occurring scaffolds were also presented in a nice scaffold collage (see below).
Figure: 250 frequent scaffolds in Wiki chemical space (1)
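The duplicate entries mentioned above (different pages sharing the same SMILES) can be spotted with a simple grouping step. The following is only a minimal sketch, not the authors' method: the function name `find_duplicate_smiles` and the three sample records are hypothetical, and it compares SMILES as plain strings, so two different but equivalent SMILES for one molecule would not be caught without canonicalizing them first in a cheminformatics toolkit.

```python
from collections import defaultdict


def find_duplicate_smiles(entries):
    """Group (title, smiles) pairs by SMILES string.

    Returns a dict mapping each SMILES that occurs more than once
    to the list of page titles that share it.
    """
    titles_by_smiles = defaultdict(list)
    for title, smiles in entries:
        titles_by_smiles[smiles.strip()].append(title)
    return {s: t for s, t in titles_by_smiles.items() if len(t) > 1}


# Hypothetical records: two stereoisomers written without stereochemistry
# collapse onto the same SMILES string.
entries = [
    ("L-Alanine", "CC(N)C(=O)O"),
    ("D-Alanine", "CC(N)C(=O)O"),
    ("Ethanol", "CCO"),
]

duplicates = find_duplicate_smiles(entries)
for smiles, titles in duplicates.items():
    print(smiles, "->", ", ".join(titles))
```

In this made-up sample the two alanine pages are flagged because neither SMILES carries a stereo descriptor, which is the kind of duplicate the authors attribute to missing stereochemistry.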
While there has been little follow-up by the scientific community (2 citations so far in PubMed), it would be interesting to look at the data from different perspectives. In an earlier post, Egon pointed out that he could not parse about 42 SMILES. With the latest downloaded file, however, only 7 failed to parse for me (with CDK 1.5.13). Most of these failures were due to unclosed rings and invalid Kekulé representations in the SMILES string.
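To make the "unclosed ring" failure concrete: in SMILES, a ring is written by attaching the same ring-closure label to two atoms, so every label has to appear an even number of times. The sketch below is a rough pre-check in plain Python, not what CDK does internally. The function name `unclosed_ring_labels` is hypothetical, and the check covers neither Kekulé validity nor any other syntax error.

```python
DIGITS = "0123456789"


def unclosed_ring_labels(smiles):
    """Return the ring-closure labels that are opened but never closed."""
    open_labels = set()
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Digits inside a bracket atom are isotopes, hydrogen counts,
            # charges or atom classes, never ring closures: skip the atom.
            end = smiles.find("]", i)
            if end == -1:
                break  # unterminated bracket atom; a different kind of error
            i = end + 1
            continue
        if ch == "%":
            label = smiles[i + 1:i + 3]  # two-digit label, e.g. %12
            i += 3
        elif ch in DIGITS:
            label = ch
            i += 1
        else:
            i += 1
            continue
        # The first occurrence opens the ring, the second one closes it.
        if label in open_labels:
            open_labels.remove(label)
        else:
            open_labels.add(label)
    return sorted(open_labels)


print(unclosed_ring_labels("c1ccccc1"))  # benzene: every label is paired
print(unclosed_ring_labels("C1CCCCC"))   # label 1 is never closed
```

A string that passes this check can still fail in a real parser, so it is only useful as a quick way to sort the failures into categories.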
Further, I looked at the basic properties of the parsed molecules. The overall distribution of molecules for different properties can be found in the plots below. While about 85% of the molecules have an atom count between 0 and 60, more than 10 molecules are complex enough to contain more than 500 atoms. Also, about 90% of them weigh under 500 amu, and ~93% of them have hydrogen bond donors <= 5 and hydrogen bond acceptors <= 10, in line with Lipinski's rule of five.
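The percentages above boil down to applying a few thresholds to per-molecule descriptors. Here is a minimal sketch of that bookkeeping, assuming the descriptors have already been computed elsewhere (for example with CDK). The `Descriptors` class, the function names and the four sample records are all hypothetical. Only the three criteria quoted in this post are checked; the full rule of five also includes a logP <= 5 criterion, which is left out here.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    name: str
    mol_weight: float  # in amu
    h_donors: int      # hydrogen bond donors
    h_acceptors: int   # hydrogen bond acceptors


def passes_post_criteria(d):
    """Check the three thresholds quoted in the post (logP is not checked)."""
    return d.mol_weight < 500 and d.h_donors <= 5 and d.h_acceptors <= 10


def fraction_passing(records):
    """Fraction of records that satisfy all three thresholds."""
    if not records:
        return 0.0
    return sum(passes_post_criteria(d) for d in records) / len(records)


# Hypothetical descriptor values, for illustration only.
records = [
    Descriptors("mol_a", 180.2, 1, 4),
    Descriptors("mol_b", 310.4, 2, 5),
    Descriptors("mol_c", 452.9, 5, 10),
    Descriptors("mol_d", 1202.6, 5, 12),  # too heavy, too many acceptors
]

fraction = fraction_passing(records)
print(f"{fraction:.0%} of the sample passes")
```

The same loop, with one predicate per property, would give the separate molecular-weight and donor/acceptor percentages.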
A lot more can be done with this wealth of data. For example, it would be interesting to see how open-access chemistry data differs structurally from its patented counterpart in the pharma industry. I look forward to more interesting analyses of this resource.
References
(1) Ertl et al. Wikipedia Chemical Structure Explorer: substructure and similarity searching of molecules from Wikipedia. J Cheminform. 2015; 7:10.