Journal: Journal of Chemical Information and Modeling
Drug discovery programs frequently target members of the human kinome and try to identify small-molecule protein kinase inhibitors, primarily for cancer treatment, with additional indications being increasingly investigated. One of the challenges is controlling the inhibitors' degree of selectivity, assessed by in vitro profiling against panels of protein kinases. We manually extracted, compiled, and standardized such profiles published in the literature: we collected 356,908 data points corresponding to 482 protein kinases, 2,106 inhibitors, and 661 patents. We then analyzed this dataset in terms of kinome coverage, reproducibility of results, popularity, and degree of selectivity of both kinases and inhibitors. We used the dataset to create robust proteochemometric models capable of predicting kinase activity (the ligand-target space was modeled with an externally validated RMSE of 0.41 ± 0.02 log units and R₀² of 0.74 ± 0.03), in order to account for missing or unreliable measurements. The influence of parameters such as the number of measurements, Murcko scaffold frequency, or inhibitor type on prediction quality was assessed. Interpretation of the models highlighted inhibitor and kinase properties correlated with higher affinities, and an analysis in the context of kinase crystal structures was performed. Overall, the quality of the models allows the accurate prediction of kinase-inhibitor activities and their structural interpretation, thus paving the way for the rational design of compounds with a targeted selectivity profile.
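As an illustration of the kind of validation statistics quoted above, the following minimal sketch computes an RMSE and a conventional coefficient of determination for a held-out set. The function names and the example values are hypothetical, and the conventional R² used here is an assumption: the abstract does not define the exact R² variant it reports.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted activities."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def r_squared(observed, predicted):
    """Conventional coefficient of determination (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical activities in log units for four kinase-inhibitor pairs.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 6.0, 6.5, 8.0]
print(rmse(observed, predicted), r_squared(observed, predicted))
```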
The aim of this work is to develop property models based on the group-contribution+ (GC+) method (a combination of the group-contribution (GC) method and the atom connectivity index (CI) method) that provide reliable estimates of environment-related properties of organic chemicals together with the uncertainties of the estimated property values. For this purpose, a systematic methodology for property modeling and uncertainty analysis is used. The methodology includes a parameter estimation step to determine the parameters of the property models and an uncertainty analysis step to establish statistical information about the quality of the parameter estimation, such as the parameter covariance, the standard errors in predicted properties, and the confidence intervals. For parameter estimation, large data sets of experimentally measured property values for a wide range of chemicals (hydrocarbons, oxygenated chemicals, nitrogenated chemicals, polyfunctional chemicals, etc.) taken from the database of the US Environmental Protection Agency (EPA) and from the USEtox database are used. For property modeling and uncertainty analysis, the Marrero and Gani GC method and the atom connectivity index method have been considered. In total, 22 environment-related properties have been modeled and analyzed: the fathead minnow 96-h LC50, Daphnia magna 48-h LC50, oral rat LD50, aqueous solubility, bioconcentration factor, permissible exposure limit (OSHA-TWA), photochemical oxidation potential, global warming potential, ozone depletion potential, acidification potential, emission to urban air (carcinogenic and noncarcinogenic), emission to continental rural air (carcinogenic and noncarcinogenic), emission to continental fresh water (carcinogenic and noncarcinogenic), emission to continental seawater (carcinogenic and noncarcinogenic), emission to continental natural soil (carcinogenic and noncarcinogenic), and emission to continental agricultural soil (carcinogenic and noncarcinogenic).
The application of the developed property models for the estimation of environment-related properties and uncertainties of the estimated property values is highlighted through an illustrative example. The developed property models provide reliable estimates of environment-related properties needed to perform process synthesis, design, and analysis of sustainable chemical processes and allow one to evaluate the effect of uncertainties of estimated property values on the calculated performance of processes giving useful insights into quality and reliability of the design of sustainable processes.
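The estimation-plus-uncertainty idea can be sketched as follows, assuming a purely linear model in which the property is the sum of group occurrences times group contributions, and the prediction variance is obtained by propagating the parameter covariance through those occurrences. The group names, contribution values, covariance entries, and the multiplier `t_value` are all hypothetical. The actual Marrero and Gani model also includes higher-order groups and property-specific transformations that are not reproduced here.

```python
import math


def gc_estimate(occurrences, contributions, covariance, t_value=1.96):
    """Linear group-contribution estimate with a propagated confidence interval.

    occurrences:   {group: count in the molecule}
    contributions: {group: fitted contribution}
    covariance:    {group: {group: covariance of the fitted contributions}}
    t_value:       multiplier for the interval (1.96 is an assumed default)
    """
    groups = sorted(occurrences)
    value = sum(occurrences[g] * contributions[g] for g in groups)
    # Variance of a linear combination: n^T * COV * n
    variance = sum(
        occurrences[a] * covariance[a][b] * occurrences[b]
        for a in groups
        for b in groups
    )
    std_err = math.sqrt(variance)
    return value, std_err, (value - t_value * std_err, value + t_value * std_err)


# Hypothetical two-group molecule and fitted parameters.
occ = {"CH3": 2, "CH2": 1}
contrib = {"CH3": 0.5, "CH2": 0.3}
cov = {"CH3": {"CH3": 0.01, "CH2": 0.0}, "CH2": {"CH3": 0.0, "CH2": 0.04}}
value, std_err, interval = gc_estimate(occ, contrib, cov)
print(value, std_err, interval)
```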
This paper reports an analysis and comparison of the use of 51 different similarity coefficients for computing the similarities between binary fingerprints for both simulated and real chemical data sets. Five pairs and a triplet of coefficients were found to yield identical similarity values, leading to the elimination of seven of the coefficients. The remaining 44 coefficients were then compared in two ways: by their theoretical characteristics using simple descriptive statistics, correlation analysis, multidimensional scaling, Hasse diagrams, and the recently described atemporal target diffusion model; and by their effectiveness for similarity-based virtual screening using MDDR, WOMBAT, and MUV data. The comparisons demonstrate the general utility of the well-known Tanimoto method but also suggest other coefficients that may be worthy of further attention.
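For reference, the Tanimoto coefficient on binary fingerprints can be sketched as below, with the Dice coefficient as one other common example. Fingerprints are represented as sets of on-bit indices, which is an assumption made for brevity, as is the convention of returning 0.0 when both fingerprints are empty.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits / on-bits set in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0  # empty-vs-empty convention is assumed


def dice(fp_a, fp_b):
    """Dice coefficient: 2 * shared on-bits / total on-bits in both fingerprints."""
    a, b = set(fp_a), set(fp_b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Two hypothetical fingerprints given as on-bit indices.
fp1 = {1, 2, 3, 4}
fp2 = {3, 4, 5}
print(tanimoto(fp1, fp2), dice(fp1, fp2))
```

The two coefficients are monotonically related (Dice = 2T / (1 + T)), so they rank database molecules identically against a given query even though their numerical values differ.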
Suggestions for improving the Basin-Hopping Monte Carlo (BHMC) algorithm for unbiased global optimization of clusters and nanoparticles are presented. The traditional basin-hopping exploration scheme with Monte Carlo sampling is improved by bringing together novel strategies and techniques employed in different global optimization methods, while keeping the underlying BHMC algorithm unchanged. The improvements include a total of eleven local and nonlocal trial operators tailored for clusters and nanoparticles that allow an efficient exploration of the potential energy surface, two different strategies (static and dynamic) of operator selection, and a filter operator to handle unphysical solutions. To assess the efficiency of our strategies, we applied our implementation to several classes of systems, including Lennard-Jones and Sutton-Chen clusters with up to 147 and 148 atoms, respectively, a set of Lennard-Jones nanoparticles with sizes ranging from 200 to 1500 atoms, binary Lennard-Jones clusters with up to 100 atoms, (AgPd)_55 alloy clusters described by the Sutton-Chen potential, and aluminum clusters with up to 30 atoms described within the density functional theory framework. Using an unbiased global search, our implementation successfully reproduced the great majority of published results for the systems considered, in many cases more efficiently than the standard BHMC. We were also able to locate previously unknown global minimum structures for some of the systems considered. This revised BHMC method is a valuable tool for theoretical investigations aimed at a better understanding of the atomic structures of clusters and nanoparticles.
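The basic basin-hopping loop (perturb, locally minimize, accept or reject with a Metropolis criterion) can be sketched on a toy problem as follows. This is a minimal illustration under stated assumptions: the one-dimensional `energy` function, the crude gradient-descent minimizer, and the single uniform-displacement trial move are all hypothetical stand-ins, and none of the eleven operators, selection strategies, or filters described above are reproduced.

```python
import math
import random


def energy(x):
    """Hypothetical 1D multi-well test function (not a cluster potential)."""
    return 0.1 * x * x + math.sin(3.0 * x)


def local_minimize(x, step=0.01, iters=1000):
    """Crude gradient descent using a central-difference derivative."""
    h = 1e-6
    for _ in range(iters):
        grad = (energy(x + h) - energy(x - h)) / (2 * h)
        x -= step * grad
    return x


def basin_hopping(x0, n_steps=200, temperature=1.0, max_jump=1.5, seed=0):
    """Basin hopping: random displacement, local minimization, Metropolis test."""
    rng = random.Random(seed)
    current = local_minimize(x0)
    e_current = energy(current)
    best, e_best = current, e_current
    for _ in range(n_steps):
        trial = local_minimize(current + rng.uniform(-max_jump, max_jump))
        e_trial = energy(trial)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        accept = e_trial <= e_current or rng.random() < math.exp(
            -(e_trial - e_current) / temperature
        )
        if accept:
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


print(basin_hopping(3.0))
```

Because the Metropolis test is applied to locally minimized energies rather than raw energies, the walk effectively moves between basins of the transformed landscape, which is what allows it to cross barriers that would trap a plain local search.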
The European REACH regulation requires information on ready biodegradation, which is a screening test to assess the biodegradability of chemicals. At the same time REACH encourages the use of alternatives to animal testing which includes predictions from QSAR models. The aim of this study was to build QSAR models to predict ready biodegradation of chemicals by using different modelling methods and types of molecular descriptors. Particular attention was given to data screening and validation procedures in order to build predictive models. Experimental values of 1055 chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE): 837 and 218 molecules were used for calibration and testing purposes, respectively. In addition, models were further evaluated using an external validation set consisting of 670 molecules. Classification models were produced in order to discriminate biodegradable and non-biodegradable chemicals by means of different mathematical methods: k Nearest Neighbours, Partial Least Squares Discriminant Analysis and Support Vector Machines, as well as their consensus models. The proposed models and the derived consensus analysis demonstrated good classification performances with respect to already published QSAR models on biodegradation. Relationships between the molecular descriptors selected in each QSAR model and biodegradability were evaluated.
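One simple way to combine several classifiers into a consensus call is a majority vote, sketched below. The abstract does not state which consensus rule was used, so the majority rule, the label encoding (1 for readily biodegradable, 0 for not), and the function name are assumptions made for illustration.

```python
def consensus(predictions):
    """Majority vote over binary class labels from individual models.

    Returns 1 or 0 for a clear majority, or None when the vote is tied
    (leaving the molecule unassigned is an assumed convention).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    positive = sum(1 for label in predictions if label == 1)
    negative = len(predictions) - positive
    if positive > negative:
        return 1
    if negative > positive:
        return 0
    return None


# Hypothetical calls from three models (e.g. kNN, PLS-DA, SVM) for one molecule.
print(consensus([1, 1, 0]))
```

With an odd number of models, as in the three-method setting described above, a tie cannot occur.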
We propose a new molecular dynamics (MD) protocol to identify the binding site of a guest within a host. The method uses a four-dimensional (4D) spatial representation of the ligand, allowing for rapid and efficient sampling within the receptor. We applied the method to two different model receptors characterized by diverse structural features of the binding site and different ligand binding affinities. The Abl kinase domain has a deep binding pocket and displays high affinity for the two ligands examined here. The PDZ1 domain of PSD-95 has a shallow binding pocket that accommodates a peptide ligand with far fewer interactions and micromolar affinity. To ensure a completely unbiased search, the ligands were placed at the center of the protein receptors, away from the binding site, at the start of the 4D MD protocol. In both cases the ligands were successfully docked into the binding site identified in the published structures. The 4D MD protocol is able to overcome local energy barriers in locating the lowest-energy binding pocket and will aid in the discovery of guest binding pockets in the absence of a priori knowledge of the site of interaction.
In silico modeling is a crucial milestone in modern drug design and development. Although computer-aided approaches in this field are well studied, the application of deep learning methods in this research area is still at an early stage. In this work, we present an original deep neural network (DNN) architecture named RANC (Reinforced Adversarial Neural Computer) for the de novo design of novel small-molecule organic structures, based on the generative adversarial network (GAN) paradigm and reinforcement learning (RL). As a generator, RANC uses a differentiable neural computer (DNC), a category of neural networks with increased generation capabilities due to the addition of an explicit memory bank, which can mitigate common problems found in adversarial settings. The comparative results show that RANC trained on the SMILES string representation of the molecules outperforms its first DNN-based counterpart, ORGANIC, on several metrics relevant to drug discovery: the number of unique structures, the passing of medicinal chemistry filters (MCF) and Muegge criteria, and high QED scores. RANC is able to generate structures that match the distributions of the key chemical features/descriptors (e.g., MW, logP, TPSA) and the lengths of the SMILES strings in the training dataset. Therefore, RANC can reasonably be regarded as a promising starting point for developing novel molecules with activity against different biological targets or pathways. In addition, this approach allows scientists to save time and covers a broad chemical space populated with novel and diverse compounds.
Quantitative Structure-Activity Relationship (QSAR) models typically rely on 2D and 3D molecular descriptors to characterize chemicals and forecast their experimental activities. Previously, we showed that even the most reliable 2D QSAR models and structure-based 3D molecular docking techniques were not capable of accurately ranking a set of known inhibitors for the ERK2 kinase, a key player in various types of cancer. Herein, we calculated and analyzed a series of chemical descriptors computed from the molecular dynamics (MD) trajectories of ERK2-ligand complexes. First, 87 ERK2 ligands with known binding affinities were docked using Schrödinger's Glide software; then, explicit-solvent MD simulations (20 ns, NPT, 300 K, TIP3P, 1 fs) were performed using the GPU-accelerated Desmond program. Second, we calculated a series of MD descriptors based on the distributions of 3D descriptors computed for representative samples of the ligand's conformations over the MD simulations. Third, we analyzed the dataset of 87 inhibitors in the MD chemical descriptor space. We showed that MD descriptors (i) had little correlation with conventionally used 2D/3D descriptors, (ii) were able to distinguish the most active ERK2 inhibitors from the moderate/weak actives and inactives, and (iii) provided key and complementary information about the unique characteristics of active ligands. This study represents the largest attempt to utilize MD-extracted chemical descriptors to characterize and model a series of bioactive molecules. MD descriptors could enable the next generation of hyper-predictive MD-QSAR models for computer-aided lead optimization and analogue prioritization.
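The idea of deriving MD descriptors from the distribution of a 3D descriptor over sampled conformations can be sketched as below. The choice of summary statistics (mean, population standard deviation, minimum, maximum, median), the descriptor name `psa`, and the example values are assumptions for illustration. The abstract does not specify which distribution statistics were used.

```python
import statistics


def md_descriptors(frames):
    """Summarize per-frame 3D descriptor values into trajectory-level descriptors.

    frames: list of dicts, one per sampled conformation, mapping a descriptor
            name to its value for that conformation.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    summary = {}
    for name in frames[0]:
        values = [frame[name] for frame in frames]
        summary[name + "_mean"] = statistics.mean(values)
        summary[name + "_std"] = statistics.pstdev(values)
        summary[name + "_min"] = min(values)
        summary[name + "_max"] = max(values)
        summary[name + "_median"] = statistics.median(values)
    return summary


# Hypothetical polar surface area values for three sampled conformations.
frames = [{"psa": 80.0}, {"psa": 90.0}, {"psa": 100.0}]
print(md_descriptors(frames))
```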
In this work, a methodological extension of matched molecular pair analysis is presented. The method is based on a pharmacophore retyping of the molecular graph and a subsequent matched molecular pair analysis. The features of the new methodology are exemplified using a large dataset on CYP inhibition. We show that fuzzy matched pairs can be used to extract activity- and selectivity-determining pharmacophoric features. Based on the fuzzy pharmacophore description, the method clusters molecular transfers and offers new opportunities for the combination of data from different sources, namely public and industry datasets.
DNA is an important target for the treatment of multiple pathologies, most notably cancer. In particular, DNA intercalators have often been used as anti-cancer drugs. However, despite their relevance to drug discovery, only a few systematic computational studies have been performed on DNA-intercalator complexes. In this work we analyzed ligand binding-site preferences in 63 high-resolution DNA-intercalator complexes available in the PDB and found that ligands bind preferentially between G and C and between C and A base pairs (70% and 11%, respectively). Next, we examined the ability of AUTODOCK to accurately dock ligands into pre-formed intercalation sites. Following optimization of the docking protocol, AUTODOCK was able to generate conformations with RMSD values < 2.00 Å with respect to the crystal structures in ~80% of the cases, whether focusing on the pre-formed binding site (small grid box) or on the entire DNA structure (large grid box). In addition, a top-ranked conformation with an RMSD < 2.00 Å was identified in 75% and 60% of the cases using small and large docking boxes, respectively. Moreover, under the large docking box setting, AUTODOCK was able to successfully distinguish between the intercalation site and the minor groove site. However, in all cases the crystal structure and the poses tightly clustered around it had a lower score than the best-scoring poses, suggesting a potential scoring problem with AUTODOCK. A close examination of all cases where the top-ranked pose had an RMSD value > 2.00 Å suggests that AUTODOCK may overemphasize the hydrogen bonding term. A decision tree was built to identify ligands that are likely to be accurately docked based on their characteristics. This analysis revealed that AUTODOCK performs best for intercalators characterized by a large number of aromatic rings, low flexibility, high molecular weight, and a small number of hydrogen bond acceptors.
Finally, for canonical B-DNA structures (where pre-formed sites are unavailable), we demonstrated that intercalation sites could be formed by inserting an anthracene moiety between the (anticipated) site-flanking base pairs and by relaxing the structure using either energy minimization or preferably molecular dynamics simulations. Such sites were suitable for the docking of different intercalators by AUTODOCK.
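The RMSD-based success criterion used in this kind of docking assessment can be sketched as follows. The sketch assumes that pose and crystal coordinates are already in the same reference frame and list the same heavy atoms in the same order. It applies no superposition or symmetry correction, and the function names and coordinates are hypothetical.

```python
import math


def rmsd(pose, reference):
    """Root-mean-square deviation between two matched lists of (x, y, z) coordinates."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (p[i] - r[i]) ** 2 for p, r in zip(pose, reference) for i in range(3)
    )
    return math.sqrt(squared / len(pose))


def success_rate(rmsd_values, cutoff=2.0):
    """Fraction of docked poses with RMSD strictly below the cutoff (in Å)."""
    return sum(1 for value in rmsd_values if value < cutoff) / len(rmsd_values)


# Hypothetical two-atom pose displaced by 1 Å along z from the crystal pose.
pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
crystal = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(pose, crystal))
print(success_rate([0.5, 1.9, 2.0, 3.0]))
```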