In silico Drug Design: some concepts & tools

Drug discovery, chemical biology & precision medicine

There are many in silico tools in the field, for instance to predict ADMET properties, binding pockets or protein-protein interaction binding sites, to support drug repositioning or repurposing, to predict the 3D structure of a macromolecule, to graft a sugar onto a protein structure, to dock peptides, to perform virtual screening, to investigate point mutations observed in patients (nowadays the term variation is most often used instead of mutation), or to carry out protein docking...


If you are new to the field, here are some general ideas. The first step is to define the in silico tools you need. This is directly linked to the type of questions you want to address, the type of project, the stage of the project and the data that you have to start with. In some cases, in silico approaches cannot really help initially: some experiments have to be performed first, then in silico tools take over, and then the work goes back to in vitro. In other situations, you need to start in vitro and in silico experiments at the same time, while in others the in silico work comes first.

Small chemicals or peptides can be used for chemical biology or for drug discovery projects.

About 90% of projects entering clinical trials fail

If your project is about target-based screening, you will need one or several 3D structures. You can check the PDB to find experimental 3D structures (X-ray, NMR or cryo-EM). If none is available, you can try to predict the 3D structure, for instance via comparative model building (e.g., with an online tool such as SWISS-MODEL), or you can use AlphaFold, RoseTTAFold or related tools. You can find many valuable standalone and online servers in the Shortlist page and in the sections Modeling Molecules, Simulations, etc. There are also tools to predict the 3D structures of RNA and DNA. These tools somewhat belong to the field of structural bioinformatics.

Next, assuming you have your target macromolecule in 3D (often a protein), you may need a peptide, a small non-peptidic chemical compound or an antibody that binds to your target. If you know the binding pocket, you can use structure-based virtual screening approaches or peptide-protein docking. In general you will need to prepare a compound collection, or you can use one of the many collections available online. Once you have a collection, in silico screening or related approaches should allow you to propose a small list of molecules that will then need to be tested experimentally (it is important to think about the assays and about how the molecular mechanisms are going to be investigated). If you do not know the binding pocket, you can use tools that predict binding cavities and the so-called druggable pockets. The pocket may not be visible (cryptic pocket), and in this case you may need simulations, fragment docking or a search for hotspots. You can also try tools that attempt to transfer ligands into your pocket by comparing your pocket with pockets present in the PDB.

If you know a small molecule that binds to your target, you can search databases for other molecules that are similar to your query, then test these new molecules in vitro and build some SAR (see for instance the ligand-based virtual screening tools) (Chemoinformatics).
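
As an illustration of such a similarity search, here is a minimal sketch that ranks library compounds by Tanimoto coefficient against a query. It assumes that each molecule has already been converted into a fingerprint stored as a set of "on" bit indices; the fingerprints, compound names and the 0.5 threshold are hypothetical, and a real project would compute fingerprints with a chemoinformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(query, library, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy fingerprints (hypothetical bit indices, not derived from real molecules).
query_fp = {1, 4, 7, 9, 15}
library = {
    "cmpd_A": {1, 4, 7, 9, 16},  # 4 shared bits, 6 bits in the union
    "cmpd_B": {2, 3, 5},         # nothing in common with the query
    "cmpd_C": {1, 4, 7, 9, 15},  # identical to the query
}

hits = similarity_search(query_fp, library, threshold=0.5)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

The compounds that pass the threshold would then be the ones to purchase and test in vitro; the threshold itself depends on the fingerprint type and has to be tuned.
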
If you are searching for a hit compound that could be used as a starting point for drug discovery, you may need to predict some ADME-Tox properties (this can also be very valuable for chemical biology projects). You could also check whether the molecules contain structural alerts or are PAINS or promiscuous compounds. Depending on this analysis, you may have to perform additional experiments to double check your initial results. Tools that can be of interest at this stage can belong to the QSAR section or the virtual screening sections, and obviously to the ADME-Tox section (Chemoinformatics).
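
A very simple property filter can be sketched as follows. The text above does not prescribe a specific rule; Lipinski's rule of five is used here only as a well-known stand-in for a property-based filter, and it is not a substitute for real ADME-Tox prediction or for PAINS and structural alert checks. The descriptor values and compound names are hypothetical and would normally be computed by a chemoinformatics toolkit.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    name: str
    mol_weight: float  # molecular weight in Da
    logp: float        # octanol/water partition coefficient
    h_donors: int      # hydrogen-bond donors
    h_acceptors: int   # hydrogen-bond acceptors


def rule_of_five_violations(d):
    """Count how many of the four rule-of-five criteria are violated."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def passes_filter(d, max_violations=1):
    """Keep a compound if it has at most `max_violations` violations."""
    return rule_of_five_violations(d) <= max_violations


# Hypothetical, precomputed descriptor values.
compounds = [
    Descriptors("hit_1", 342.4, 2.1, 2, 5),
    Descriptors("hit_2", 712.9, 6.3, 6, 12),
]

kept = [c.name for c in compounds if passes_filter(c)]
print(kept)
```
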

In most cases, you will have to look at databases to see if your target has been screened already or if your favorite compounds are already known to hit many targets. Open databases include for instance PubChem and ChEMBL; these are in the chemoinfo section.

You may want to know if your compound could bind to other secondary targets, often called off-targets; if the effect on health is not favorable, these secondary targets are called anti-targets. To do this, you can use tools that belong to the off-targets, repurposing, repositioning section. There, different approaches are available, from ligand-similarity searches to reverse docking. If you use phenotypic screening, several of these methods can also help to identify putative target(s).

As mentioned, developing a new drug is likely to take many years and success is far from certain. It has been estimated that it takes 13.5 years to bring a new molecular entity to market, and that the success rate for taking oncology drugs from phase I to approval by the US Food and Drug Administration (FDA) was only around 7%. These numbers change a bit in different reports but they give an overall idea. Repositioning could thus be valuable in some cases as it builds on previous research, allowing compounds to progress more quickly and saving a substantial amount of money when it works. In silico strategies can help here. One concept that helps to understand repositioning is polypharmacology: a small-molecule drug is likely to have, on average, six to seven targets.

If you are interested in protein-protein interactions (Bioinformatics) and the modulation of these interactions with a small compound, you may need to use protein docking methods. You may want to see all the known interactions with your target and thus will need some "network" tools. If you have a 3D structure of your protein-protein complex, you may want to analyze the interface and predict hotspot residues. We have some recent reviews about "in silico approaches and compound design", for instance about protein-protein interaction inhibitors (see Villoutreix et al., Molecular Informatics, June 2014).

If your protein has point mutations (experimental or naturally occurring, which relates to the idea of precision medicine), you may want to predict the impact of the amino acid substitutions on folding, function, etc. Then again, you need a different set of tools and you can go to the sections Simulations and Mutations (Bioinformatics section).

You may need to search patent databases, find databases on diseases, find tools to help represent and visualize the data, you may want to find some commercial tools... These will be in the section related tools.

Bioinformatics : Ligand binding pockets

To design a compound, if you know the target and if you have a 3D structure for this target, it can be critical to identify likely binding pockets at the surface of this target (see for example our review Perot et al., DDT 2010), to score these pockets, and to compare pockets from your target with databases of pockets. Many in silico tools have been developed (over 100) (see the Shortlist page).

Bioinformatics: Protein-protein and protein-membrane interactions

Your target can be DNA, RNA or a protein. It can also be a mechanism: for instance, you may want to modulate protein-protein interactions or a transient protein-membrane interaction. To do this, again, many approaches have been developed, starting around 1990; you may need docking tools, tools to find hotspots, etc.

In many cases, one needs to predict interface residues. Several approaches can be used. Some investigate specific features of protein sequences and/or structures: they look at amino acid composition and physico-chemical properties, and can use machine learning strategies. For example, some tools use evolutionary information to try to predict interface residues, which tend to be more conserved than other residues on the rest of the protein surface. Amino acid features and evolutionary information can then be combined to analyze amino acid sequences and perform some predictions. Of course, predictions based only on sequences are often (but not always) limited and it is important, when possible, to add 3D information. Thus it is possible to map sequence evolution onto the molecular surface. Analysis of the surface in terms of hydrophobicity or desolvation energy can also be used. These approaches are said to belong to the mapping approaches, for instance ODA (physics based), while others are based on descriptors and machine learning. Because these types of prediction are difficult, meta-predictors have been developed and tend to give better results than individual methods (e.g., meta-PPISP).
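
The idea of using evolutionary information can be illustrated with a per-position conservation score computed from a multiple sequence alignment. This is only a sketch of the general principle, not the algorithm of ODA, meta-PPISP or any specific tool: it uses Shannon entropy as the conservation measure (an assumption made here), and the four aligned sequences are invented.

```python
import math
from collections import Counter

MAX_ENTROPY = math.log2(20)  # entropy of a column with 20 equally likely residues


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return MAX_ENTROPY  # an all-gap column is treated as not conserved
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_profile(alignment):
    """Per-column score between 0 and 1, where 1 means fully conserved."""
    return [1.0 - column_entropy(col) / MAX_ENTROPY for col in zip(*alignment)]


# Hypothetical alignment of four homologous sequences (equal lengths).
alignment = ["MKLW", "MRLW", "MKIW", "MQLW"]
scores = conservation_profile(alignment)
for position, score in enumerate(scores, start=1):
    print(f"position {position}: conservation {score:.2f}")
```

In practice such scores would be mapped onto surface-exposed residues of the 3D structure, since buried residues are also conserved for reasons unrelated to binding.
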

Template-based methods can also be used. Because interfaces tend to be conserved in homologous complexes, such data can help to make predictions.

For additional information, you can for example check our recent review introducing several aspects of PPI and in silico approaches:

Drug-Like Protein-Protein Interaction Modulators: Challenges and Opportunities for Drug Discovery and Chemical Biology (review). Villoutreix BO, Kuenemann MA, Poyet J-L, Bruzzoni-Giovanelli H, Labbe C, Lagorce D, Sperandio O, Miteva MA. Molecular Informatics 2014; 6-7: 414-437. (open)


Different types of interactions could in theory be modulated. We were among the first to propose the modulation of transient interactions between a protein and the cell membrane with a small molecule and to observe that next to membrane binding regions, proteins often tend to have a cavity large enough to bind a small ligand. The work was for instance reported with application on blood coagulation cofactors by Segers et al., PNAS 2007, or Nicolaes et al., Blood 2014. This molecular mechanism is still essentially unexplored in 2015 for therapeutic intervention.

Bioinformatics: molecular modeling, variations/mutations...

There are many tools here again (e.g., see the Shortlist page).


Mutations (or variations) - Personalized medicine - Precision medicine
Based on genome sequencing of individuals, it is estimated that each person’s proteome contains roughly 10,000–11,000 mutations compared to a reference proteome. A subset of these mutations has severe functional consequences; for the great majority, however, it is difficult to predict a priori what their effect will be on the resultant protein’s structure and/or function. There are different types of mutations. Single nucleotide polymorphisms (SNPs) fall either within non-coding or coding regions of the DNA molecules. Synonymous mutations do not change the encoded protein sequence, while non-synonymous SNPs (the most common disease-promoting mutations) produce polypeptide sequences that either have an amino acid substitution (missense mutations) or are truncated (nonsense mutations, a less common event than an amino acid change). Some mutations exert their effects via changes to the mRNA that can lead to altered mRNA splicing, folding or stability. Some mutations impact the responsiveness of patients to certain drug treatments (concept of pharmacogenomics).
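
The distinction between synonymous, missense and nonsense changes can be made concrete with a small sketch that classifies a single-base substitution inside a codon using the standard genetic code. The function name and the example codons are chosen here for illustration; the "stop-loss" label for a change that removes a stop codon is an addition not discussed in the text above.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, position, alt_base):
    """Classify a substitution at `position` (0, 1 or 2) of a DNA codon."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:position] + alt_base.upper() + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"  # label added here for completeness
    return "missense"


print(classify_coding_snp("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_coding_snp("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val)
print(classify_coding_snp("TGG", 2, "A"))  # TGG (Trp) -> TGA (stop)
```
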

With the decreasing cost of sequencing technologies, the study of the human genome on a large scale is now possible. Rapid advances in this field of research foreshadow the use of whole-genome or whole-exome sequencing towards the goal of personalized medicine, also called precision medicine. This can be defined as therapy decisions tailored to individual patients (or small groups), so as to improve therapeutic efficiency and minimize side effects.
It is important to note that synonymous mutations can also cause human diseases. The key to how a synonymous mutation can affect proteins most likely lies in RNA molecules. Also, single nucleotide polymorphisms represent the most common source of genetic variation in the human population; they often determine which patients are most likely to respond to or suffer adverse consequences from specific medical treatments. In addition to genomic studies, epigenetics investigations are also of major importance.

There are different types of in silico predictive tools, mostly to predict the possible effect of missense mutations. For instance, just to introduce this huge field of research, in silico approaches can help to discriminate between deleterious nsSNPs and neutral polymorphisms. There are numerous tools. One way to cluster them is to distinguish methods that make use of machine learning from methods that attempt to compute a delta-delta G using the 3D structure of the protein and/or that are rule-based. Then you have meta-tools that combine approaches. As always, some methods could belong to several groups.

Another way to look at these tools is to group them as being sequence-based, structure-based or both. Many databases and datasets are also available to develop and to evaluate these engines. In silico approaches thus help in gaining an understanding of sequence-structure-function relationships and of the disease state.

If the protein is known in 3D, you can also run your own simulations and perform structural analysis. See for instance the table below published in Current Protein and Peptide Science, 2002, 3, 341-364: Title: Structural Bioinformatics: Methods, Concepts and Applications to Blood Coagulation Proteins by Villoutreix BO. Section Biostructural Pathology and Conformational Diseases: Some Rules for Assigning the Effects of Missense Mutations on Molecular Functions, Folding and Stability

It should be mentioned also that in general "mutation" data have not been fully explored to improve the effectiveness and efficiency of drug discovery. Genetic, epigenetic and environmental factors define pathophysiological states. For complex diseases, the one gene one drug paradigm may not be the best approach.

Peptides (binding sites and/or folding and/or docking)
Peptides can be used as chemical probes (and drugs). They are usually easy to synthesize and are thus very often used in biology labs. In fact, for many years they were the first choice, as small chemical compounds were difficult to obtain in academic labs.


Chemoinformatics: Virtual screening, scoring, hit2lead, repositioning

Virtual screening can be done using different types of information. When it uses known small bioactive ligands, it is usually called ligand-based virtual screening; when it uses information about the 3D structure of the target, often a protein, it is called structure-based virtual screening. Note that for ligand-based approaches, structural information can also be used, and that for structure-based screening, knowing bioactive molecules also helps, for instance to calibrate the scoring function. In some situations, it is possible to combine these two main screening strategies. Also, one should not forget about fragment screening and de novo drug design. De novo approaches usually exploit information from the 3D structure of the receptor to build a compound inside a binding pocket, while screening takes molecules directly from a collection and checks whether they fit in the pocket; both approaches can be combined, for instance during the optimization phase.




Chemoinformatics: ADME-Tox predictions - Personalized medicine

There are many tools, some are more appropriate for toxicity predictions, some are for the preparation of a compound collection or for the selection of molecules, etc.

Off-target predictions (and in this case it will in general be anti-targets) can be investigated in silico with:

  • Binding site similarity approaches: similar binding sites may bind similar ligands
  • Reverse docking
  • Ligand Based approaches
  • Ligand−receptor pharmacophoric descriptors
  • Chemogenomic approaches (e.g., derive information from the simultaneous biological evaluation of multiple compounds on multiple targets) ...
  • Patient data can be taken into account...
  • Network-based...
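
To make the chemogenomic idea in the list above more concrete, here is a minimal sketch that reads a compound-by-target activity matrix and lists, for each compound, the targets hit in addition to the intended one. All names and potency values are hypothetical, and the pIC50 threshold of 6 (about 1 µM) is an arbitrary assumption used only for illustration.

```python
# Hypothetical activity matrix: compound -> {target: potency as pIC50}.
ACTIVITY = {
    "cmpd_1": {"kinase_A": 8.1, "kinase_B": 6.9, "hERG": 5.2},
    "cmpd_2": {"kinase_A": 7.4, "kinase_B": 4.0, "hERG": 6.5},
    "cmpd_3": {"kinase_A": 5.0, "kinase_B": 4.8, "hERG": 4.1},
}


def targets_hit(profile, threshold=6.0):
    """Targets on which a compound is considered active (pIC50 >= threshold)."""
    return sorted(t for t, potency in profile.items() if potency >= threshold)


def off_targets(compound, primary, threshold=6.0):
    """Active targets other than the intended (primary) one."""
    return [t for t in targets_hit(ACTIVITY[compound], threshold) if t != primary]


for name in ACTIVITY:
    print(name, "off-targets:", off_targets(name, primary="kinase_A"))
```

A compound such as the hypothetical cmpd_2, which is active on hERG, would be flagged as hitting an anti-target.
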

Comments about virtual ligand screening and drug design

The process of drug discovery and development is challenging, time consuming, expensive, and requires consideration of many aspects. Some numbers: it takes 7–15 years and about $1.2 billion to bring a new molecule to the market; about 5 out of 50,000 compounds tested in animals reach clinical trials (these compounds usually result from the experimental screening of millions of molecules, and the experimental screening of 1 million compounds costs around $500,000); only 1 of 5 compounds reaching clinical studies is approved, and some people mention that in general 90% of the compounds entering clinical trials will fail.
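
The figures quoted above can be chained together to get a feeling for the overall attrition. The short calculation below only restates those numbers (5 in 50,000, 1 in 5, $500,000 per million compounds screened); the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Figures quoted in the text above.
animal_to_clinic = Fraction(5, 50_000)   # compounds tested in animals that reach the clinic
clinic_to_approval = Fraction(1, 5)      # clinical candidates that are approved

# Probability that a compound tested in animals ends up approved.
animal_to_approval = animal_to_clinic * clinic_to_approval

# Experimental screening cost per compound, in dollars.
cost_per_compound = 500_000 / 1_000_000

print(f"animal testing -> approval: 1 in {animal_to_approval.denominator:,}")
print(f"clinical failure rate implied by '1 of 5': {float(1 - clinic_to_approval):.0%}")
print(f"screening cost per compound: ${cost_per_compound:.2f}")
```

Note that the "1 of 5" figure implies an 80% failure rate in the clinic, while the other estimate quoted above is 90%; as said elsewhere on this page, these numbers vary between reports.
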

One first step can be to find hits. There are many ways, for instance:



But finding binders is not enough! One needs to optimize the compounds (hit2lead, lead optimization...) and here again in silico approaches can help, for instance through multiparameter optimization tools.

Drug discovery requires a multidisciplinary approach. Indeed, several concepts, disciplines, skills and techniques (e.g., NMR, Xray, bioinformatics, chemoinformatics, data science, AI, medicinal chemistry, toxicology, medical sciences, biology, genetics…) have to work together to succeed as no technology/science alone is likely to make it. Drugs can be small chemical compounds, proteins, peptides, vaccines ... but here, we discuss only small chemical compounds and how in silico approaches can help the process.

To find interesting candidates, one possible way is to find substrates of the target (e.g., an enzyme) and to make them drug-like. Another way is to start with someone else's hit and thus mine patent databases to find new ideas. These starting points can then be modified by combining, for instance, medicinal chemistry and chemoinformatics strategies. Since around the 1990s, a very common approach to find hits has been high-throughput screening (HTS). In this case, one usually assumes that a target is critical to a disease condition (there is of course experimental evidence linking a target to a disease, but the value of a target is often fully known only at the end of the process, when it can be too late). The target is then screened experimentally against thousands of compounds, with robots. There are some difficulties: cost and time, as it easily takes over a year to screen 200,000 compounds and analyze the results, and a hit rate that can be low (~0.2%, or even lower for challenging targets such as the modulation of protein-protein interactions). It is also only possible to explore a very small part of the chemical space: one can screen, for example, 3 million compounds, while the number of small drug-like molecules is basically infinite. Thus only in silico approaches can be used to explore this almost infinite chemical space.

It is possible to use in silico screening instead of HTS (or with HTS) and then generate a small list of molecules (e.g., 300 molecules) for experimental assays. Virtual screening can also be used to search for latent hits missed in an HTS project. Of course, combining experimental screening and virtual screening seems to be the best solution, but to save time and money, often only virtual screening is used, at least in many academic groups and in small companies. In silico tools are not perfect: problems with scoring and docking, failures due to the use of the wrong training sets and difficulties with flexibility are well known, but solutions to improve these situations are difficult to find. Yet, in many cases, the results obtained after in silico screening are interesting and the hit rate with in silico approaches is generally better than with HTS (from 1% to 10% or more). Of course many parameters will play a role here, like the preparation of the compound collection (filtering, ADMETox prediction, use of focused libraries, libraries to explore new challenging targets, like collections prepared to modulate protein-protein interactions). And again, finding hits is one thing, finding a drug is something else.
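
The hit rates mentioned on this page can be compared with a short calculation. It combines the HTS figures quoted earlier (200,000 compounds, ~0.2% hit rate) with the virtual screening figures above (a list of 300 molecules, 1% to 10% hit rate); applying the 1% to 10% range to a 300-molecule list is an assumption made here for illustration, not a reported result.

```python
def expected_hits(n_tested, hit_rate):
    """Expected number of hits when n_tested compounds are assayed."""
    return n_tested * hit_rate


# HTS scenario quoted on this page.
hts_tested, hts_rate = 200_000, 0.002
hts_hits = expected_hits(hts_tested, hts_rate)

# Virtual screening scenario: only a short list is tested experimentally.
vs_tested = 300
vs_hits_low = expected_hits(vs_tested, 0.01)
vs_hits_high = expected_hits(vs_tested, 0.10)

# Fold improvement in hit rate of virtual screening over HTS.
fold_low = 0.01 / hts_rate
fold_high = 0.10 / hts_rate

print(f"HTS: about {round(hts_hits)} hits from {hts_tested:,} assays")
print(f"Virtual screening: about {round(vs_hits_low)} to {round(vs_hits_high)} hits from {vs_tested} assays")
print(f"Hit-rate improvement: about {round(fold_low)}x to {round(fold_high)}x")
```

HTS returns more hits in absolute terms, but at the cost of several hundred times more assays; this is the sense in which the two approaches complement each other.
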

After in vitro, phenotypic or in silico screening, hits have to be optimized and here again, many in silico approaches can be used to assist the process. Most of the approaches mentioned above can be used for chemical biology endeavors. Finding binders is of course very different from finding a drug, yet one important step is to find high quality starting compounds. In the paragraphs below, I'll give some examples published many years ago by different research groups illustrating the use of in silico drug design and screening. Obviously, you can find many newer ones in PubMed or on the WWW.

Example 1: AIDS, compound docking, receptor flexibility, new binding pocket and the generation of ideas to design Isentress
A binding pocket for a new class of drugs to treat AIDS was discovered using docking while considering the flexibility of the receptor through molecular dynamics. McCammon and his colleagues used AutoDock in conjunction with the Relaxed Complex Method to discover novel modes of inhibition of HIV integrase (J Med Chem, 2004). Researchers at the Merck Pharmaceutical Company then used these data to design the orally available raltegravir (HIV integrase inhibitor, brand name Isentress, approved by the Food and Drug Administration in 2007, with approval for pediatric use following in 2011).

Example 2: Virtual screening versus experimental screening, in silico investigations proposed interesting compounds at a reduced cost
An interesting example which can serve as a proof of principle on the benefit of using in silico approach involves a type I TGF‐beta receptor kinase inhibitor. The same molecule (HTS‐466284), a 27 nM inhibitor, was discovered independently using virtual screening by Biogen IDEC (J. Singh et al., Bioorg. Med. Chem. Lett. 13, 2003, p4355) and traditional enzyme and cell‐based high‐throughput screening by Eli Lilly (J.S. Sawyer et al., J. Med. Chem. 46, 2003, p3953). The in silico work involved pharmacophore‐screening of 200,000 compounds and used as a starting point the knowledge of hit compounds published several years before. The compound discovered experimentally at Lilly required in vitro screening of a large library of compounds to find potential inhibitors in a TGF‐β‐dependent cell‐based assay and chemical synthesis

Example 3: antianxiety, antidepression, 5HT1A agonist using several in silico strategies
An in silico modeling drug development program (homology modeling, virtual screening with DOCK, hit to lead optimization and in silico profiling) led to clinical trials of a novel, potent, and selective antianxiety, antidepression 5‐HT1A agonist in less than 2 years from the start and requiring less than 6 months of lead optimization and synthesis of only 31 compounds (O.M. Becker et al., J. Med. Chem. 49, 2006, p3116)

Example 4: ADMETox, drug design, cost and ethics
Applying QSAR algorithms to toxicity data and corresponding chemical structures led to the development of in silico tools that predict toxicity response (mutagenicity, carcinogenicity) and toxicity dosing (no observed effect level, NOEL; maximum recommended starting dose, MRSD). For example, a carcinogenicity QSAR model using 53 descriptors and data from a 2‐year rodent study stored in an FDA database exhibited 76% sensitivity and 84% specificity (Contrera et al, QSAR modeling of carcinogenic risk using discriminant analysis and topological molecular descriptors, Curr. Drug Discov. Technol. 2, 2005, 55–67, see also Regul. Toxicol. Pharmacol. 40, 2004, 185–206). Rodent carcinogenicity studies are required for the marketing of most chronically administered drugs. These studies are the most costly and time-consuming nonclinical regulatory testing requirement in the development of a drug. The cost is approximately $2 million for a study on rats and mice, requiring 2 years of treatment and at least an additional 1‐2 years for histopathological analysis and report writing. Thus, computational or predictive toxicology has potential regulatory and drug development applications that can ultimately benefit public health as well as reduce the use of animals in the assessment of safety.
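
For readers not familiar with these two metrics, the sketch below shows how sensitivity and specificity are computed from a confusion matrix. The counts are hypothetical and were chosen only so that they reproduce the 76% and 84% quoted above; they are not the data of the cited study.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real carcinogens that the model flags as carcinogenic."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-carcinogens that the model correctly clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical test set: 100 carcinogens and 100 non-carcinogens.
tp, fn = 76, 24   # carcinogens predicted positive / predicted negative
tn, fp = 84, 16   # non-carcinogens predicted negative / predicted positive

print(f"sensitivity: {sensitivity(tp, fn):.0%}")
print(f"specificity: {specificity(tn, fp):.0%}")
```
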

Example 5: compound optimization, myocardial infarction and ligand-based screening
In a recent review, Clark (Expert Opinion on Drug Discovery (2008) 8: 841‐851) commented on Aggrastat (tirofiban). This molecule from Merck, a GP IIb/IIIa antagonist (used in myocardial infarction; it is an anticoagulant and platelet aggregation inhibitor), results from a lead compound that was further optimized using ligand-based pharmacophore screening and medicinal chemistry. This compound appears to modulate a protein‐protein interaction (between fibrinogen and its receptor on platelets, the integrin glycoprotein alpha IIb/beta III). It is among the first drugs whose origins can be traced back to in silico design. (See Hartman et al. (1992). "Non-Peptide Fibrinogen Receptor Antagonists. Discovery and Design of Exosite Inhibitors". J Med Chem 35: p4640)

Example 6: 1,2,4-Oxadiazoles identified by virtual screening and their non-covalent inhibition of the human 20S proteasome
Although several constitutive proteasome inhibitors have been reported in recent years, potent organic, noncovalent and readily available inhibitors are still poorly documented. Two studies have been performed by two different groups, one using experimental HTS, the other virtual screening. Ozcan et al. screened 50,000 molecules from ChemBridge experimentally, while Marechal et al. screened 400,000 molecules from the same vendor in silico and then tested some molecules experimentally, ending up, as in the HTS work, with oxadiazole noncovalent proteasome inhibitors (see figure to the left). The cellular effects of these compounds validate their utility as potential pharmacological agents for anti-cancer pre-clinical studies.

Example 7: Relenza, combining Xray studies with computer modeling and medicinal chemistry
Zanamivir is a neuraminidase inhibitor (transition‐state analogue inhibitor) used in the treatment and prophylaxis of influenza caused by influenza A virus and influenza B virus. Zanamivir was the first neuraminidase inhibitor commercially developed; the initial steps were indeed performed by a small company and in a university in Melbourne. It is currently marketed by GlaxoSmithKline under the trade name Relenza as a powder for oral inhalation. The strategy relied on the availability of the Xray structure of influenza neuraminidase, while computational chemistry techniques were also used. The active site was investigated in silico and suggestions were made to optimize the initial hits up to the design of Zanamivir.

Example 8: A recent 2012 report from GlaxoSmithKline about the contributions of in silico drug design

  • Direct contributions to the discovery and design of 2 molecules that reached positive proof-of-concept clinical decisions
  • Direct contributions to 8 candidate and pre-candidate decisions
  • 37 contributions resulting in new hit/lead series
  • 18 examples of significant contributions to lead optimization
  • More than 70 examples of screening data analysis resulting in program progression
  • Contributions to drug-discovery programs recognized in 25 published manuscripts and 12 issued or published patents
    See Green DV et al., 2012, J Comput Aided Mol Des (2012) 26:51–56

Example 9: SYK
We found, in collaboration with the group of Dr. P Dariavach, tyrosine kinase SYK non-enzymatic inhibitors and potential anti-allergic drug-like compounds by using virtual and in vitro screening.

The most likely binding area of these compounds that are inhibiting a protein-protein interaction instead of acting on the kinase site was found using binding pocket prediction and validated by site directed mutagenesis. This region was not known prior to our computational analysis. Some compounds are working on animal models and some molecules are patented. (See Villoutreix et al., PloS One 2011; Mazuc et al., J Allergy Clin Immunol. 2008).



Example 10: in big pharma

You can find additional comments about the impact of in silico methods in drug discovery in this review Hillisch et al., ChemMedChem 2015, 10:1958-62. The authors mentioned that for 20 molecules in clinical trials developed in the company, 50% of them strongly benefited from in silico computations.


In silico methods help the drug discovery process. They can be (or are) combined with biophysical approaches, experimental high-throughput screening and biology/chemistry/toxicology/clinical studies. They assist decision making, contribute to reducing the cost and to generating new ideas and concepts, bring solutions to problems, and make it possible to rapidly test new hypotheses and to explore “areas” that could not be assessed experimentally, either because the experiments could not be performed, because they would cost too much, or because they would not be ethical. For instance, they make it possible to investigate new compounds before they are even synthesized. In silico tools help to analyze, mine and rationalize millions of (heterogeneous) data points coming from multiple sources, assist in defining the functions of a molecule, and help to understand and to predict polypharmacology, off‐targets and ADMETox properties. They provide supplemental information to resolve conflicting experimental results, reduce the need to repeat studies or to perform some experiments, accelerate clinical trials by supporting the entry of subjects in clinical trials before standard toxicology studies are completed, support risk‐based testing and the reduction of animal testing, and supply additional supporting information for the selection of the first dose in humans to be used in standard phase I clinical trials…
Although promising, in silico methods are not without limitations… They thus have to be continuously developed and challenged, and research and funding in this field are needed, not only for applications but also for methodological developments. Apart from the screening tools, we now see an aggressive development of data mining and pipelining tools to keep pace with the massive amount of data generated by both experimental and computational experiments. Choosing the right strategy is critical, and increasing the interaction between experimentalists and computational groups should increase the quality and efficiency of the lead discovery stage and the development of new and safer drugs.

Additional comments about cost and the need to combine and integrate in silico and in vitro strategies

In silico strategies usually contribute to a better understanding of a molecular event; they tend to shed new light and contribute new ideas, and they allow exploring data that the human mind cannot rationalize. This is very valuable, as mentioned above, but the impact on cost and time is also important. For example, a recent talk from P. Ertl commented on these points. It is possible to predict some ADMETox properties, like the interaction of a compound with the potassium ion channel protein hERG (associated with a potentially fatal disorder called long QT syndrome). One experimentally measured value of hERG blockade with the patch‐clamp technique occupies one laboratory assistant for 1 day and consumes many research chemicals. Thus, if the prediction method is accurate, one can easily conceive the impact on time and money. Further, computer models allow making predictions for compounds that are not yet synthesized (for a research chemist it typically takes 1 week to synthesize a compound if everything goes right). But clearly computer tools are not only about cost and number crunching: they allow us to gain new knowledge and can guide experimental design.

Note about experimental work

In silico tools are sometimes considered to be inaccurate by some scientists. While this may be true in some cases (such problems are usually alleviated when the tools are used by experts in the field), it is important to note that there is nearly no experimental measurement without error. Even for a simple log P value, different scientists working in different laboratories will measure different values. So wisdom and humility are needed on both sides, at the bench and behind the computer.


© Bruno Villoutreix. A first version of this Website was launched in 2006. Thanks to Natacha Oliveira