Bio-Python

"Ramblings on computational chemistry, in silico experiments and programming in python 3.x"

December 19, 2012

Configuring Emacs as Python IDE


UPDATE: The following article applies only to Python 2.7 (because ropemacs is not available for py3k).

This article puts into practice the idea presented in the Emacs article that I wrote four months ago, which can be found here.
I have compiled Emacs Lisp code that will be useful for anyone who programs in Python using Emacs. At present, there is no single IDE that offers full functionality (syntax highlighting, auto-completion, refactoring, etc.), so I decided to build one using Emacs.

I am using a configuration that is a variant of what is described on the following two pages:
1. The original article about Python and Emacs integration, which can be found at enigmacurry.com
2. A more detailed and recent step-by-step article, which can be found at jesshamrick.com

To download my ~/.emacs and ~/.emacs.d/ configuration for a Python programming environment, see the emacs-config-pythonists repository.

What I like most about python-mode in Emacs is the ability to execute the current buffer by pressing C-c C-c.

Try it yourself and share your comments and suggestions for improvement. I would like to know how many of you (readers) have tried Python development using vim or Emacs (no flame wars here!). If you have ever scripted Python in Emacs, share your views: describe what was easy (and what was annoying) by leaving a comment. If you use any other full-fledged IDE, take part in the poll below.


Where do you write python scripts? (you can select multiple choices)

December 2, 2012

QSAR: All models are wrong, but some are useful

The core of the QSAR methodology is developing a relationship between an observed activity and structural features of a molecule. The approach depends on being able to represent the structure of a molecule in quantitative terms. The quantitative representations of molecules are termed descriptors. An extensive range of descriptors can be calculated; for example, molecular weight, atom counts, partition coefficients and surface-property descriptors. So given a set of descriptors, a QSAR model can be built by defining a relationship between these descriptors and the observed activity.
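
As a small illustration of what "representing a structure in quantitative terms" can mean, the sketch below computes a few of the simplest descriptors (molecular weight and atom counts) from a molecular formula. It is only a sketch under stated assumptions: the function names and the short table of atomic masses are made up for this example, and the parser handles only plain formulas such as C9H8O4 (aspirin), with no brackets, charges or isotopes.

```python
import re

# Average atomic masses (g/mol) for a few common elements.
# Assumption: a deliberately short table; extend it for other elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
               "S": 32.06, "Cl": 35.45}


def atom_counts(formula):
    """Count atoms in a simple formula such as 'C9H8O4' (no brackets or charges)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula):
    """Sum of atomic masses weighted by the atom counts."""
    return sum(ATOMIC_MASS[el] * n for el, n in atom_counts(formula).items())


def descriptors(formula):
    """Return a small dictionary of formula-based descriptors."""
    counts = atom_counts(formula)
    return {
        "mol_weight": round(molecular_weight(formula), 3),
        "n_atoms": sum(counts.values()),
        "n_heavy_atoms": sum(n for el, n in counts.items() if el != "H"),
    }


aspirin = descriptors("C9H8O4")
print(aspirin)
```

Each compound in a data set would get one such row of numbers, and the QSAR model is then fitted to those rows rather than to the structures themselves. In practice a cheminformatics toolkit would be used to calculate descriptors instead of a hand-written parser.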

Why establish a QSAR?


• To predict biological activity and physicochemical properties by rational means.
• To understand the mechanisms of action within a series of chemicals.
• To identify novel leads with pharmacological, biocidal or pesticidal activity.
• To select compounds with optimal pharmacokinetic properties, whether stability or availability in biological systems.
• To predict toxicity to humans and environmental species.


CLASSIFICATION OF QSAR METHODS

Based on the type of chemometric method used, QSAR methods are classified as linear or non-linear, as shown in Figure 1.
  • LINEAR METHODS: The first QSAR models, developed by Hansch, specified linear relationships. Linear models are widely used owing to their simplicity and ease of development. Examples include linear regression (LR), multiple linear regression (MLR), stepwise multiple linear regression (S-MLR), partial least-squares (PLS) and principal component analysis (PCA).
  • NON-LINEAR METHODS: Developments in the field of statistics have produced many new methods of building predictive models, including non-linear regression and algorithmic techniques. Examples include support vector machines (SVM), artificial neural networks (ANN), k-nearest neighbors (kNN) and Bayesian neural nets (BNN).

Figure 1. Classification of QSAR methods (flowchart)
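
To make the linear branch of this classification concrete, here is a minimal sketch of the simplest case, linear regression (LR) with a single descriptor, fitted by ordinary least squares in plain Python. The data are hypothetical numbers invented for this example (they are not measurements), and the names fit_line and predict are my own.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    syy = sum((y - y_mean) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation coefficient
    return slope, intercept, r_squared


# Hypothetical training data: one descriptor (log P) and one activity per compound.
log_p = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r_squared = fit_line(log_p, activity)


def predict(x):
    """Predict the activity of a new compound from its descriptor value."""
    return slope * x + intercept


print(slope, intercept, r_squared)
```

MLR extends the same idea to several descriptors at once; non-linear methods such as kNN or ANN replace the straight line with a more flexible function.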


QSAR methods are also categorized into the following classes, based on how the descriptor values are derived (Figure 1):

0D-QSAR: based on descriptors derived from the molecular formula, such as molecular weight, number of atoms, atom types and sums of atomic properties.

1D-QSAR: correlates activity with global molecular properties such as pKa, solubility, log P and functional groups (substructures).

2D-QSAR: correlates activity with structural patterns such as connectivity indices, 2D pharmacophores and the Wiener index, without taking into account the 3D representation of these properties.

3D-QSAR: deals with the orientation of the molecules in space and correlates activity with the non-covalent interaction fields (steric and electrostatic) surrounding the molecules.

4D-QSAR: extends 3D-QSAR with an ensemble of ligand configurations, representing each molecule in different conformations, stereoisomers, orientations, tautomers or protonation states.

5D-QSAR: extends 4D-QSAR by explicitly representing different induced-fit models, since the manifestation and magnitude of the induced fit may vary from one molecule to another when binding to the target protein.

6D-QSAR: extends 5D-QSAR by incorporating different solvation models, which can be done explicitly by mapping parts of the molecular surface area with solvent properties.

HiT QSAR (hierarchical quantitative structure–activity relationship technology): based on the simplex representation of molecular structure (SiRMS) and applied to different QSAR tasks. The spirit of this technology is a sequential solution to the QSAR problem, in which each step uses the information obtained in the previous steps, through a series of progressively enhanced models of molecular structure description, from one-dimensional (1D) to four-dimensional (4D).
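
As an example of a 2D (topological) descriptor from the list above, the Wiener index is the sum of the shortest-path distances, counted in bonds, between all pairs of heavy atoms in the hydrogen-suppressed molecular graph. The sketch below computes it with a breadth-first search; representing a molecule as a dictionary of neighbour lists is an assumption made for this example.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered pairs of atoms.

    adjacency maps each atom index to the list of its bonded neighbours
    (hydrogen-suppressed graph, assumed to be connected).
    """
    total = 0
    for start in adjacency:
        # Breadth-first search gives bond-count distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end


# Carbon skeletons of two C4H10 isomers.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}      # C-C-C-C chain
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}     # central C with 3 methyls

print(wiener_index(n_butane), wiener_index(isobutane))
```

The two isomers share the same molecular formula, so their 0D descriptors are identical, but the more compact, branched skeleton gives a smaller Wiener index. This is the kind of structural information that 2D descriptors add.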

QSAR: Basic Steps Involved

The QSAR model development workflow involves four steps:

(a) Data collection and descriptor calculation,

(b) Analysis of correlation between the descriptors and input data structures (2D/3D),

(c) Validation of models and

(d) Design and activity prediction of new molecules.

To develop a QSAR model, a series of molecules (in some cases a congeneric series, e.g. for 3D-QSAR methods such as CoMFA/CoMSIA), defined by their structures and with known activity data, is used as input for the calculation of molecular descriptors. For all compounds, the activity data should be of the same type and in the same units of measurement (binding/functional, IC50 or Ki). Ki values are preferred over IC50 data, since Ki is independent of the substrate concentration. A sufficient number of compounds is required to develop a meaningful relationship. It is widely accepted that 5 to 10 compounds are required for every descriptor in a QSAR. Taken literally, this suggests that a one-descriptor, regression-based QSAR could be developed from as few as 5 to 10 compounds. However, in an ideal world “many more” compounds are required to obtain statistically robust QSARs.
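
The two points above (Ki versus IC50, and the compounds-per-descriptor rule of thumb) can be sketched in a few lines. The conversion uses the Cheng-Prusoff relation, Ki = IC50 / (1 + [S]/Km), which holds for a competitive inhibitor; assuming competitive inhibition is my addition and is not stated in the text above. The function names and the numerical values are hypothetical.

```python
import math


def ki_from_ic50(ic50, substrate_conc, km):
    """Cheng-Prusoff relation (competitive inhibition assumed):
    Ki = IC50 / (1 + [S] / Km). All concentrations in the same units."""
    return ic50 / (1.0 + substrate_conc / km)


def p_value(conc_molar):
    """Negative log10 of a molar concentration, e.g. pKi or pIC50."""
    return -math.log10(conc_molar)


def max_descriptors(n_compounds, compounds_per_descriptor=5):
    """Rule of thumb: at least 5 (more cautiously 10) compounds per descriptor."""
    return n_compounds // compounds_per_descriptor


# Hypothetical inhibitor assayed at two substrate concentrations (Km = 10 uM).
km = 1e-5
ki_low_s = ki_from_ic50(ic50=2e-8, substrate_conc=1e-5, km=km)
ki_high_s = ki_from_ic50(ic50=5e-8, substrate_conc=4e-5, km=km)

print(ki_low_s, ki_high_s, p_value(ki_low_s))
print(max_descriptors(30), max_descriptors(30, compounds_per_descriptor=10))
```

The two hypothetical assays report different IC50 values for the same inhibitor but convert to the same Ki, which is why mixing IC50 data measured under different assay conditions can add noise to a QSAR data set.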

The structures cannot be used directly to create structure-activity mappings because they do not usually contain, in an unambiguous form, the information that relates to activity. To overcome this barrier, rationally designed molecular descriptors convert the structure into well-defined sets of numerical values that can be correlated directly with the activity. Depending on the data set, only a few of the thousands of available descriptors are well correlated with the activity, and using an excessive number of descriptors may harm the interpretability of the final model. To avoid this problem, a wide range of techniques is used to thin the set of descriptors automatically down to the most informative ones. The data set is then split into training and test sets, using hierarchical or non-hierarchical clustering approaches based on algorithms such as sphere exclusion, or simply by activity ranking. However, according to some authors, such similarity-based splitting methods may result in the selection of a biased test set. To overcome this, the test set may instead be generated by random selection of a subset of the molecules.
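
Two of the splitting strategies mentioned above, random selection and activity ranking, are simple enough to sketch in plain Python. This is only an illustration: the compound identifiers and activity values are hypothetical, the function names are my own, and the "every fourth compound" scheme is just one simple way to implement an activity-ranked split.

```python
import random


def random_split(compound_ids, test_fraction=0.25, seed=0):
    """Shuffle the compounds and put the first `test_fraction` in the test set."""
    ids = list(compound_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]  # (training set, test set)


def activity_ranked_split(activities, every=4):
    """Rank compounds by activity and send every `every`-th one to the test set,
    so that the test set spans the whole activity range."""
    ranked = sorted(activities, key=activities.get)
    test = ranked[every - 1::every]
    train = [c for c in ranked if c not in test]
    return train, test


# Hypothetical data set: compound id -> activity (e.g. pIC50).
activities = {"m1": 6.1, "m2": 4.3, "m3": 7.8, "m4": 5.0,
              "m5": 8.4, "m6": 5.6, "m7": 6.9, "m8": 4.9}

train_r, test_r = random_split(activities)
train_a, test_a = activity_ranked_split(activities)
print(test_r, test_a)
```

Clustering-based approaches such as sphere exclusion work on distances in descriptor space rather than on the activity values, and are not shown here.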

The training set compounds are then used to build a correlation between the descriptor values and the activity data. The correlation is obtained using different chemometric tools, both linear and non-linear (PLS, PCA, ANN, kNN, etc.), and the model quality is judged using a variety of statistical parameters. The resulting model then undergoes both internal validation (e.g. the leave-one-out method) and external validation (e.g. test set prediction). The predictivity and reproducibility of the QSAR model are judged by its ability to predict the activities of the test set molecules. A proper interpretation of a QSAR model reveals the key structural features required to improve the specific response parameter being modeled. The developed model can therefore be used for the design and activity prediction of new series of molecules.
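
The internal and external validation steps described above can be sketched for the simplest possible model, a one-descriptor least-squares line. The definitions used here are common ones but are my assumption, since the text does not give formulas: the leave-one-out q2 is 1 - PRESS / SS_total, and the external R2pred compares test set residuals with the deviation of the test activities from the training set mean. All data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def q2_loo(xs, ys):
    """Internal validation: leave each compound out in turn, refit the model
    on the rest, and predict the left-out compound (xs and ys are lists)."""
    press = 0.0  # predictive residual sum of squares
    for i in range(len(xs)):
        slope, intercept = least_squares_line(xs[:i] + xs[i + 1:],
                                              ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    y_mean = sum(ys) / len(ys)
    ss_total = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - press / ss_total


def r2_pred(x_train, y_train, x_test, y_test):
    """External validation: fit on the training set, predict the test set."""
    slope, intercept = least_squares_line(x_train, y_train)
    y_train_mean = sum(y_train) / len(y_train)
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(x_test, y_test))
    ss_tot = sum((y - y_train_mean) ** 2 for y in y_test)
    return 1.0 - ss_res / ss_tot


# Hypothetical descriptor values and activities.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [2.1, 3.9, 6.2, 7.8, 10.1]
x_test = [1.5, 4.5]
y_test = [3.2, 8.8]

print(q2_loo(x_train, y_train), r2_pred(x_train, y_train, x_test, y_test))
```

Values close to 1 indicate that the model predicts compounds it has not seen during fitting; a large gap between the fitted R2 and q2 or R2pred is a warning sign of overfitting.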

QSAR is a widely used tool for developing relationships between the effects (activities and properties of interest) of a series of molecules and their structural properties. It is used in many areas of science. It is a dynamic area that integrates new technologies at a staggering rate.

The topics discussed above were needed to introduce you to QSAR. This is just the beginning; there is plenty more to share about QSAR, which is not just a method in which a few steps lead to a model. I would like to discuss QSAR models and what they signify, but I have not done so here since this is just an introduction. In future I will also discuss validation. Validating each step is important if you don’t want to end up with an erroneous model, and validation continues to play a vital role even after model development. Any QSAR study is only as good as the model on which it is based.

In conclusion, I would say that the usefulness of QSAR as a technique relies on the generation of more robust models, which is the next big challenge.

Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.
- George E. P. Box
  
This article was contributed by Pravin Ambure, who is presently a Junior Research Fellow in Dr. Kunal Roy's lab at Jadavpur University, Kolkata. Pravin's research focuses on generating QSAR models with better predictability. In his spare time he likes solving Rubik’s cube.