AlphaFold2 + ZINC20, open a new era of virtual drug screening!
Archimedes is said to have declared, “Give me a lever, and I can pry up the whole earth,” a claim that shocked the physics world. AlphaFold2 makes a comparable promise: “Give me an amino acid sequence, and I can accurately predict the three-dimensional structure of the protein.” The appearance of AlphaFold2 surprised the whole biological science community. Why has AlphaFold2 become so popular for predicting protein structures? What is the appeal of ZINC20[1], a database of 1.4 billion compounds? Read on for a systematic analysis of methods for obtaining three-dimensional protein structures and of compound databases.

In principle, the three-dimensional structure of a protein is determined by its linear amino acid sequence, so knowing the sequence should be enough to obtain the structure. In practice, it is not that simple. Currently, there are about 250 million proteins with known amino acid sequences, but as of today the RCSB PDB (www.rcsb.org) database contains only 181,295 three-dimensional protein structures, which is less than 0.1% of the total number of proteins[2]. Obtaining the three-dimensional structure of a protein with experimental techniques such as X-ray diffraction (X-ray), nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM) is time-consuming, laborious, and expensive. Computational prediction of protein structure, on the other hand, has had many limitations: SWISS-MODEL requires sequence homology >30%, I-TASSER requires the sequence to be threaded onto existing structures, and ROBETTA requires sequences of <200 amino acids. Scientists kept searching for a reliable way to obtain the three-dimensional structure of proteins until AlphaFold2 appeared.

Figure 1. Cartoon representation of primary structure, secondary structure, tertiary structure, and quaternary structures of proteins.
Discovery of AlphaFold2

At the end of 2020, in the 14th Critical Assessment of protein Structure Prediction (CASP14) competition, the AlphaFold2 program developed by the AI team at DeepMind won first place. AlphaFold2 raised the accuracy score for protein structure prediction from about 40 points to 92.4 points, achieving structure prediction with atomic or near-atomic precision. This news amazed the biological world.

On July 16, 2021, the DeepMind team published AlphaFold2 and its source code in Nature, and just a week later the team published the AlphaFold2 dataset, again in Nature[3]. The dataset not only covers almost the entire human proteome (98.5% of human proteins), but also includes proteome data for 20 species commonly used in scientific research, such as Escherichia coli, Drosophila, and mouse. The total number of protein structures in the dataset is over 350,000, and 58% of the predicted structures in the dataset have good reliability, with 35.7% of them reaching high reliability.

Figure 2. AlphaFold dataset website.
(Free and open URL: alphafold.ebi.ac.uk)
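As a rough illustration of how such reliability figures can be tallied for a single downloaded model, the sketch below assumes that the reliability described above corresponds to AlphaFold's per-residue confidence score (pLDDT), which the AlphaFold database stores in the B-factor column of its PDB files. The cutoffs of 70 ("confident") and 90 ("very high") follow the bands shown on the database website; the function names and the four-residue sample are made up for the example.

```python
def ca_plddt(pdb_text):
    """Read per-residue pLDDT values from the B-factor column of CA atoms."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB is a fixed-column format: atom name in columns 13-16,
        # B-factor (pLDDT in AlphaFold models) in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def confidence_fractions(scores, confident=70.0, very_high=90.0):
    """Return the fractions of residues above the two pLDDT cutoffs."""
    n = len(scores)
    if n == 0:
        return 0.0, 0.0
    frac_confident = sum(s > confident for s in scores) / n
    frac_very_high = sum(s > very_high for s in scores) / n
    return frac_confident, frac_very_high


def toy_ca_line(serial, resseq, plddt):
    """Build one fixed-column CA record (hypothetical coordinates) for the demo."""
    return (f"ATOM  {serial:5d}  CA  ALA A{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


# Invented pLDDT values, not taken from any real AlphaFold model.
sample_pdb = "\n".join(
    toy_ca_line(i, i, p) for i, p in enumerate([95.2, 88.1, 72.4, 45.0], start=1)
)
plddt_scores = ca_plddt(sample_pdb)
frac_confident, frac_very_high = confidence_fractions(plddt_scores)
```

For a real model, the text of a PDB file downloaded from the database would take the place of `sample_pdb`.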

The AlphaFold2 computational model draws on the Transformer architecture that has emerged in recent AI research, instead of the ResNet-like residual convolutional neural network used by the first AlphaFold. The Transformer architecture in AlphaFold2 treats the amino acid sequence of a protein as a text-like data structure. Through multiple sequence alignment, protein structure and biological information are integrated into the deep learning algorithm, which then computes a highly reliable predicted structure. In addition, AlphaFold2 does not use the simplified inter-atomic distance or contact maps that AlphaFold relied on; instead, it is trained directly on the atomic coordinates of protein structures and predicts a credible protein topology through machine learning. Statistical analysis of the structures predicted by AlphaFold2 found that about 2/3 of the protein predictions were as accurate as those measured by structural biology experiments.
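To make the "sequence as text" idea concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation of a generic Transformer, applied to a short amino acid string. It is not AlphaFold2's actual network: the one-hot embedding, the identity projections (queries, keys, and values all equal to the input), and the five-residue sequence are simplifying assumptions chosen for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def one_hot(sequence):
    """Encode each residue as a 20-dimensional one-hot vector (toy embedding)."""
    return [[1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS] for aa in sequence]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    weights = []
    for query in x:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in x]
        weights.append(softmax(scores))
    # Each output row is an attention-weighted mixture of all residue vectors.
    mixed = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
             for row in weights]
    return weights, mixed


weights, mixed = self_attention(one_hot("MKTAM"))
```

In this toy setting, each residue attends most strongly to identical residues elsewhere in the sequence (the two M positions here); a trained model would instead use learned projections to decide which positions are relevant to each other.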

Figure 3. AlphaFold2 computational model of the 3D protein structure[3].
ZINC20 adds billions of molecules

AlphaFold2 has revolutionized drug discovery research. It can predict disease-related protein structures at low cost, and potential drugs for these diseases can then be found through drug repositioning, virtual screening, and other methods. As an important tool for virtual screening, the compound database also determines the speed and quality of small-molecule drug development.

ZINC is both a public database of compound information and a service website. It lets users download 2D and 3D molecular structures and quickly search for compounds and their analogs. Currently, the ZINC database contains close to 2 billion compounds, of which 1.3 billion are available from 150 companies across a total of 310 product catalogs. While the number of compounds in stock worldwide is growing by a few percent per year (about 14 million compounds today), the number of on-demand compounds is growing almost exponentially. At present, the number of on-demand compounds has grown to ten billion, and it is expected to reach a hundred billion in the near future. ZINC20 (zinc20.docking.org) added tens of billions of on-demand compounds that were not previously in the ZINC database, and these significantly outperform physical screening libraries in terms of scaffold and molecular diversity.

Figure 4. On-demand compound growth demand (NPMI analysis)[1].
VirtualFlow, a 15-hour virtual screening of 1 billion molecules

On the one hand, the rapid determination of protein structures, the rapid development of synthetic methodologies, and the exponential growth of compound databases mean that testing every candidate experimentally would require ever more scientists and specialized equipment to find early drug leads. Because virtual screening can quickly pick out potentially active compounds from libraries of dozens to millions of compounds, it reduces the number of biological experiments needed to verify compounds, shortens the research cycle, and lowers the cost of drug development. Virtual screening is therefore increasingly favored by medicinal chemists. On the other hand, cloud platforms and AI algorithms are becoming more and more popular. For example, if docking a protein with each ligand takes 15 seconds on average, screening 1 billion compounds on one CPU would take 475 years, while the VirtualFlow platform can use 160,000 CPUs to screen 1 billion compounds in only 15 hours. With higher hit rates, faster computing speed, and stronger iterative ability, virtual screening has secured a lasting place in the drug development process.
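The arithmetic behind these figures can be checked with a short sketch. It assumes perfectly linear scaling across CPUs and a 365-day year, neither of which is stated in the article, and the function and variable names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def screening_seconds(n_compounds, seconds_per_ligand, n_cpus):
    """Wall-clock seconds for a screen, assuming ideal linear scaling over CPUs."""
    return n_compounds * seconds_per_ligand / n_cpus


# One CPU, 15 s per ligand, 1 billion compounds.
single_cpu_years = screening_seconds(1_000_000_000, 15, 1) / SECONDS_PER_YEAR

# The same workload spread evenly over 160,000 CPUs.
parallel_hours = screening_seconds(1_000_000_000, 15, 160_000) / 3600

# Per-ligand time that would make the 160,000-CPU run finish in 15 hours.
implied_seconds_per_ligand = 15 * 3600 * 160_000 / 1_000_000_000
```

Under these assumptions the single-CPU estimate comes to about 475 years, matching the text. Ideal scaling at 15 seconds per ligand would give about 26 hours on 160,000 CPUs, so the 15-hour figure would correspond to an average of roughly 8.6 seconds per ligand in that run.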

MCE's virtual screening team is highly professional and has high-performance computing servers and a high standard of data privacy management. We offer professional molecular docking and virtual screening services, with more than 40 high-throughput compound libraries covering 6 million purchasable, reproducible, structurally diverse, and drug-like compounds. The final project report includes not only protein background research, a process overview, and result analysis, but also 2D/3D molecular docking diagrams that meet the requirements for article publication.

MCE's one-stop drug screening platform, which includes virtual screening, compound bioactivity screening, and ion channel-based compound screening, is a complete package for drug discovery and drug screening projects.

Virtual Screening Service