Improving genetic risk prediction and drug target discovery using primate DNA and advanced artificial intelligence

Petko Fiziev, Jeremy McRae, Tobias Hamp, Hong Gao, Kyle Farh; published June 1, 2023


Interpreting the impact of genetic variants on human health is an essential step in unlocking the promise of personalized genomic medicine. Genetic risk prediction and drug target discovery are two areas that can benefit tremendously from reverse engineering the genetic architecture of common diseases such as type 2 diabetes, cardiovascular disease, and cancer. In the context of genetic risk prediction, accurate estimation of effects of genetic variants on disease risk is key for identifying individuals at high risk at population scale. In the case of drug target discovery, understanding the underlying genetics can substantially facilitate the successful development of novel therapeutics.1 One exciting strategy is to identify genes where loss-of-function (LoF) variants protect against diseases and to design drugs that similarly inhibit function. For example, having a nonfunctional copy of the PCSK9 gene lowers low-density lipoprotein (LDL) cholesterol levels and thus protects against cardiovascular disease, which in turn renders this gene a viable drug target.2

In the past decade, genome-wide association studies (GWAS) have identified tens of thousands of common variants associated with many traits and complex diseases.3,4 Each associated variant typically explains a very small fraction of the trait. Multiple associated variants can be combined into a polygenic risk score (PRS), which can explain a considerable portion of disease risk. However, uncovering the specific genes that mediate disease risk from GWAS is challenging because most GWAS variants reside in the noncoding part of the genome.
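As a rough illustration of how a PRS aggregates many small effects, the sketch below computes a score as a weighted sum of risk-allele counts. The variant IDs, effect sizes, and genotypes are invented for illustration and do not come from any GWAS; treating a missing genotype as zero copies is also an assumption.

```python
def polygenic_risk_score(genotypes, effect_sizes):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per variant)."""
    score = 0.0
    for variant_id, beta in effect_sizes.items():
        # Assumption: a variant missing from the genotype dict counts as 0 copies.
        dosage = genotypes.get(variant_id, 0)
        score += beta * dosage
    return score


# Hypothetical per-allele effect sizes (for example, log odds ratios).
effect_sizes = {"rs_a": 0.05, "rs_b": -0.02, "rs_c": 0.10}

# Hypothetical individual: two copies of rs_a, one of rs_b, none of rs_c.
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

prs = polygenic_risk_score(person, effect_sizes)
print(round(prs, 4))
```

Each variant contributes little on its own, which is why the score only becomes informative once many associated variants are summed.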

Unlike GWAS, rare variant studies connect variants in specific genes directly to clinical phenotypes. Rare variants are routinely examined via sequencing for rare genetic disorders and cancers, and can explain why a disease occurred and thus improve clinical management.5,6 Recent whole-genome and whole-exome sequencing studies suggest that rare variants contribute to common polygenic diseases as well.7 However, efforts to study rare variants in common diseases have been hindered due to imprecise interpretation of variant function and insufficient cohort sizes for rare variant analyses.

To maximize insights from human variant studies, we developed PrimateAI-3D, a deep-learning network trained on 4.5 million common genetic variants from 233 primate species. This state-of-the-art classifier accurately quantifies missense variant pathogenicity in humans, which improves discovery of genes affecting clinical phenotypes. Integrating rare and common variant PRS models into a unified risk score provides a more comprehensive understanding of disease risk, bringing us one step closer to personal genome sequencing for the general population. Furthermore, using PrimateAI-3D scores in rare variant analyses finds genes with protective effects on disease risk that can be considered as candidate drug targets, including known examples such as PCSK9 ($1 billion market), HMGCR (target of lipid-lowering statins, $14 billion market), ANGPTL3, and NPC1L1.

Nonhuman primate sequencing reveals benign variants in humans

Previously we showed that information from closely related primate species can help to infer pathogenicity of human variants, and thus improve clinical variant interpretation on a genome-wide scale.8 Our earlier work used 385,000 missense variants from sequencing 134 individuals across six primate species. We have now expanded this over 10-fold to 4.5 million primate missense variants by sequencing another 703 individuals across 211 species. The selected species represent approximately half of the 521 extant primate species and cover all major primate families. We targeted an average of 3.5 individuals per species to make sure that we primarily sampled common variants rather than rare mutations. As a result, we dramatically increased the number of common variants that can be used to train a machine-learning classifier.

These missense variants from primates are largely benign in humans. Missense variants found in at least one species from the primate cohort were classified as benign or likely benign 99% of the time in the ClinVar database compared to 63% for ClinVar missense variants in general. The regions of human disease genes that contained many ClinVar pathogenic variants were also depleted for benign primate common variants (Figure 1).

Overall, our primate population database consists almost entirely of benign variants with previously unknown significance and has 50-fold more annotated missense variants than the ClinVar database. We have made the primate variants publicly available as a resource for the genomics community.

Figure 1. Example of pathogenic missense variants in the CACNA1A gene.

Top: gnomAD (green) or primate (blue) missense variants observed in each amino acid position in the CACNA1A gene. Red circles represent the positions of annotated ClinVar pathogenic missense variants.
Bottom: The scatter plot shows the pathogenicity scores predicted by PrimateAI-3D for all possible missense substitutions along the gene.
gnomAD = Genome Aggregation Database.

PrimateAI-3D: A deep-learning network for predicting variant pathogenicity

We used the 4.5 million primate missense variants with likely benign consequence to train PrimateAI-3D. This semi-supervised 3D-convolutional neural network, developed by Illumina scientists, incorporates evolutionary conservation and protein 3D structure from AlphaFold DB9 to predict the pathogenicity of missense variants. Unlike earlier deep-learning architectures that relied on linear protein sequence,8,10 PrimateAI-3D uses 3D convolutions to recognize key structural and evolutionary patterns that may not be apparent from protein sequence alone (Figure 2). The network learns to infer pathogenicity based on the local enrichment or depletion of common primate variants, taking only the protein’s multiple sequence alignment and 3D structure as inputs instead of engineered features. By not relying on annotations from clinical variant databases, this approach enables PrimateAI-3D to provide an unbiased review of variant pathogenicity.

Augmenting the primate variant classification task, we also taught the network to predict masked amino acids from the surrounding 3D context, a technique borrowed from language models that are trained to predict missing words in sentences.11 In a separate task, we used language models and multiple sequence alignments to incorporate evolutionary amino acid constraints across diverse species (Figure 2).

We evaluated PrimateAI-3D against 15 published missense pathogenicity prediction methods. PrimateAI-3D outperformed all other classifiers by accurately distinguishing pathogenic variants from benign across four cohorts—the UK Biobank, a neurodevelopmental disorders cohort (DDD), an autism spectrum disorders cohort (ASD), and a congenital heart disease cohort (CHD). In addition, PrimateAI-3D was the best classifier for separating benign and pathogenic variants in the ClinVar database and had the highest average correlation with deep mutational scan assays (Figure 3).
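One common way to quantify how well a classifier separates pathogenic from benign variants, for example in a ClinVar-style comparison, is the area under the ROC curve (AUC). The article does not state which metric was used for each dataset, so the sketch below is an assumption, and the scores in it are made up.

```python
def roc_auc(pathogenic_scores, benign_scores):
    """Probability that a random pathogenic variant outscores a random
    benign variant, counting ties as half a win."""
    wins = 0.0
    for p in pathogenic_scores:
        for b in benign_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(pathogenic_scores) * len(benign_scores))


# Hypothetical pathogenicity scores (higher means more likely pathogenic).
pathogenic = [0.9, 0.8, 0.6]
benign = [0.7, 0.3, 0.2, 0.1]

auc = roc_auc(pathogenic, benign)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which makes the metric convenient for comparing methods on the same labeled set.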

Figure 2. PrimateAI-3D architecture.

Human protein structures and multiple sequence alignments are voxelized (left) as input to a 3D convolutional neural network that predicts pathogenicity of all possible point mutations of a target residue (middle). The network was trained using a loss function with three components (right): common human and primate variants, fill-in-the-blank of protein structure, and score ranks from language models.

Figure 3. PrimateAI-3D variant classification performance.

Bar plots show method performance for six testing datasets, with PrimateAI-3D and PrimateAI-language model (LM) only in orange. DMS = deep mutational scan; UKBB = UK Biobank; DDD = neurodevelopmental disorders cohort; ASD = autism spectrum disorder cohort; CHD = congenital heart disease cohort.

Using PrimateAI-3D improves rare variant testing in common diseases

Rare variant burden tests look for the association between the combined effect of multiple rare variants and specific phenotypes. We used rare variant burden tests to identify genes underlying 90 complex human traits and diseases in 454,712 exome-sequenced individuals from the UK Biobank. At an allele frequency (AF) threshold of up to 0.1%, we detected 3035 gene-phenotype associations when combining missense and LoF variants. When PrimateAI-3D was used to classify missense variants, we found 73% more gene-phenotype associations compared to the standard burden tests.
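The idea of a burden test with score-based variant selection can be sketched as follows. This is a simplified illustration, not the study's pipeline: it collapses qualifying variants into a carrier flag and compares phenotype means with a Welch t statistic, whereas a real analysis would typically use regression with covariates. The 0.1% AF ceiling comes from the text; the score cutoff, field names, and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

AF_MAX = 0.001    # 0.1% allele-frequency ceiling, as in the text
SCORE_MIN = 0.8   # hypothetical pathogenicity cutoff for missense variants


def qualifies(variant):
    """Keep rare LoF variants and rare missense variants with a high score."""
    if variant["af"] > AF_MAX:
        return False
    if variant["consequence"] == "lof":
        return True
    return variant["consequence"] == "missense" and variant["score"] >= SCORE_MIN


def burden_t_statistic(individuals):
    """Welch t statistic for phenotype in carriers versus non-carriers."""
    carriers, others = [], []
    for person in individuals:
        is_carrier = any(qualifies(v) for v in person["variants"])
        (carriers if is_carrier else others).append(person["phenotype"])
    se = sqrt(variance(carriers) / len(carriers) + variance(others) / len(others))
    return (mean(carriers) - mean(others)) / se, len(carriers)


# Hypothetical cohort for one gene; phenotype could be a lab measurement.
individuals = [
    {"phenotype": 5.1, "variants": [{"af": 0.0002, "consequence": "missense", "score": 0.95}]},
    {"phenotype": 4.9, "variants": [{"af": 0.0001, "consequence": "lof", "score": None}]},
    {"phenotype": 5.3, "variants": [{"af": 0.0005, "consequence": "missense", "score": 0.90}]},
    {"phenotype": 3.4, "variants": [{"af": 0.0003, "consequence": "missense", "score": 0.10}]},
    {"phenotype": 3.6, "variants": []},
    {"phenotype": 3.2, "variants": [{"af": 0.05, "consequence": "lof", "score": None}]},
    {"phenotype": 3.5, "variants": []},
]

t_stat, n_carriers = burden_t_statistic(individuals)
print(n_carriers, round(t_stat, 2))
```

Excluding the low-scoring missense variant keeps a likely benign carrier out of the carrier group, which is the mechanism by which better variant classification can sharpen the association signal.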

Next, we explored the correlations between PrimateAI-3D scores and clinical phenotypes. We focused first on LDLR, a gene in which pathogenic mutations can contribute to heart disease by modifying LDL levels.12,13 PrimateAI-3D scores of rare LDLR missense variants were positively correlated with LDL cholesterol levels (Figure 4 top): individuals carrying high-scoring variants had much higher LDL cholesterol levels than individuals carrying low-scoring variants. Among individuals diagnosed with dyslipidemia, those with the most deleterious missense variants developed disease ~15 years earlier, similar to LoF carriers.

We also examined rare variants in PCSK9, a target of cholesterol-lowering medications.2,14 Rare variants with higher PrimateAI-3D scores typically matched with lower LDL cholesterol levels, opposite to the effect of LDLR variants (Figure 4 bottom). LDL cholesterol levels increased steadily with age, but individuals with variants with high PrimateAI-3D scores had lower LDL cholesterol levels at all ages. As a consequence, fewer of these carriers had hypercholesterolemia or elevated CVD risk,15 while those that did experienced these symptoms later in life.
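The opposite directions of effect described for LDLR and PCSK9 can be illustrated with a rank correlation between variant scores and carrier LDL levels. The article does not say which correlation measure was used, so Spearman correlation is an assumption here, and all values are invented.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical variant scores and carrier LDL levels (arbitrary units).
scores = [0.1, 0.3, 0.5, 0.7, 0.9]
ldl_in_ldlr_carriers = [3.0, 3.2, 3.1, 4.0, 4.8]
ldl_in_pcsk9_carriers = [3.6, 3.4, 3.5, 2.9, 2.5]

rho_ldlr = spearman(scores, ldl_in_ldlr_carriers)
rho_pcsk9 = spearman(scores, ldl_in_pcsk9_carriers)
print(round(rho_ldlr, 2), round(rho_pcsk9, 2))
```

A positive coefficient matches the LDLR pattern, where damaging variants raise LDL, and a negative coefficient matches the PCSK9 pattern, where damaging variants lower it.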

Figure 4. Example of correlation between PrimateAI-3D scores and clinical phenotypes.

Top left: Positive correlation of LDL cholesterol levels (y-axis) with PrimateAI-3D scores (x-axis) for rare missense variants in LDLR.
Top right: PrimateAI-3D score is predictive of age of onset for dyslipidemia in carriers of rare missense variants in LDLR.
Bottom left: Negative correlation of LDL cholesterol concentrations with PrimateAI-3D scores for rare missense variants in PCSK9, a down-regulator of LDLR and target of cholesterol-lowering drugs.
Bottom right: LDL cholesterol concentrations increase with age at a similar rate regardless of carrier status, but carriers of rare PCSK9 variants prioritized by PrimateAI-3D have lower LDL concentrations across all ages.

Rare variant polygenic risk scores (PRS) identify individuals at high risk for common diseases

Rare variants are frequently excluded from common variant PRS models due to challenges in interpreting variants of unknown significance (VUS) and estimating the effects of these low-frequency variants. We developed a rare variant PRS model that uses PrimateAI-3D for variant effect estimation. Using cholesterol metabolism as an example, our rare variant PRS model identified 31 genes where low-frequency variants had an effect on serum cholesterol levels; 25 of these genes play key roles in lipid homeostasis, including PCSK9, ANGPTL3, NPC1L1, and HMGCR, which are known drug targets for cholesterol-lowering medications.16 Our rare variant PRS model accurately identified individuals at phenotypic extremes, who were 10 times as likely as the overall population to have a rare variant polygenic score in the 0.1st or 99.9th percentile (Figure 5). Clinical screening and risk management can be considered for these high-risk individuals.

Up until around four-fold increased odds of disease, the common variant PRS identified more at-risk individuals, whereas after this threshold the rare variant PRS overtook the common variant PRS (Figure 5). Because the rare and common variant PRS models use nonoverlapping sets of variants, by combining them into a unified model we can identify significantly more individuals at high risk of common diseases than common variant PRS alone.
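A minimal sketch of why a unified score can flag more people than either score alone is shown below. It assumes both scores are on a log-odds scale and simply adds them, which is plausible when the variant sets do not overlap but is not described in the article; the numbers and the four-fold threshold are illustrative.

```python
from math import log


def high_risk(scores, odds_threshold):
    """Indices of individuals whose log-odds score exceeds log(odds_threshold)."""
    cutoff = log(odds_threshold)
    return {i for i, s in enumerate(scores) if s > cutoff}


# Hypothetical per-individual log-odds contributions from each model.
common = [0.2, 1.5, 0.9, -0.1, 0.8]
rare = [0.0, 0.0, 0.7, 1.6, 0.1]

# Assumption: contributions add because the two variant sets do not overlap.
combined = [c + r for c, r in zip(common, rare)]

by_common = high_risk(common, 4.0)
by_rare = high_risk(rare, 4.0)
by_combined = high_risk(combined, 4.0)
print(sorted(by_common), sorted(by_rare), sorted(by_combined))
```

In this toy example the third individual (index 2) has moderate common and rare scores that only cross the four-fold threshold once they are combined.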

Figure 5. Rare variant polygenic risk scores have benefits at phenotype extremes and for disease risk.

Left: Phenotype outlier individuals were defined as exceeding a certain z-score cutoff (X-axis), and the Y-axis shows the enrichment of outlier PRS in phenotype outlier individuals versus the baseline population, aggregated across 78 phenotypes.
Middle and right: Number of individuals at risk for type 2 diabetes and dyslipidemia, respectively, identified by rare and common variant PRSs at varying risk thresholds (x-axes). Rare variant PRSs identified more individuals at higher risk (> 3.8-fold higher odds for type 2 diabetes, and > 4.4-fold higher odds for dyslipidemia) than common variant PRSs.

Rare variant PRS retains robust performance across ethnicities

Common variant PRS models depend on their training cohorts and do not transfer easily between populations. Most PRS models are trained primarily on individuals with European ancestry. Unfortunately, these models perform worse in non-European populations, which contributes to health disparities.17 Our rare variant PRS model is less ancestry dependent, as it leverages PrimateAI-3D, built from 233 nonhuman primate species, to estimate missense variant effect size.

Our PRS models used individuals of European ancestry for training. We evaluated our common and rare variant PRSs in individuals of non-European ancestry from the UK Biobank along with an independent cohort, the Massachusetts General Brigham Biobank. As expected, the common variant PRS performed much worse in individuals with African, East Asian, and South Asian ancestries (Figure 6). In contrast, the rare variant PRS was substantially more portable, with a smaller drop in performance in these ancestries. Despite rare variant PRSs generalizing better across ethnicities than common variant PRSs, their overall performance still fell slightly in non-European populations. This should become less of a problem over time as population allele panels become more accurate and diverse.

Figure 6. Performance of common and rare variant PRS models across ethnicities.

Performance is shown relative to European individuals in the UK Biobank. P-values indicate whether the difference in performance versus held out Europeans is significant. MGB = Massachusetts General Brigham Biobank; EUR = European ancestry; AFR = African ancestry; EAS = East Asian ancestry; SAS = South Asian ancestry.


Uncovering the role of rare, but penetrant, variants in common diseases is crucial for personalized medicine and can facilitate drug development. The inability to predict variant pathogenicity has hindered efforts to study these rare variants. We have addressed this issue by developing PrimateAI-3D, a novel deep-learning method that predicts variant pathogenicity from protein 3D structure and is trained on 4.5 million common primate variants. We integrated PrimateAI-3D predictions in rare variant burden tests and found 73% more gene-phenotype associations compared to not using variant prioritization. This effectively reduces the cohort sizes required to discover disease-associated genes. We developed PRS models that incorporate rare variants to identify individuals at the greatest risk of disease, for whom preventive screening and clinical management would be most impactful. Our rare variant PRS models were more portable across ethnicities than common variant PRSs, and so could help alleviate current health care disparities. Finally, identifying pathogenic genetic variants underpinning disease can reveal novel drug target candidate genes, further increasing the clinical actionability of such variants. The success of PrimateAI-3D highlights the value of nonhuman primate genomes in understanding our own genomes and underscores the importance of preserving these irreplaceable primate species, many of which are critically endangered.

PrimateAI-3D publication:

Rare variant PRS publication:


Primate variant database:


  1. Nelson MR, Tipney H, Painter JL, et al. The support of human genetic evidence for approved drug indications. Nat Genet. 2015;47(8):856-860. doi:10.1038/ng.3314
  2. Pasta A, Cremonini AL, Pisciotta L, et al. PCSK9 inhibitors for treating hypercholesterolemia. Expert Opin Pharmacother. 2020;21(3):353-363. doi:10.1080/14656566.2019.1702970
  3. Buniello A, MacArthur JAL, Cerezo M, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47(D1):D1005-D1012. doi:10.1093/nar/gky1120
  4. Tam V, Patel N, Turcotte M, Bossé Y, Paré G, Meyre D. Benefits and limitations of genome-wide association studies. Nat Rev Genet. 2019;20(8):467-484. doi:10.1038/s41576-019-0127-1
  5. Delvecchio M, Pastore C, Giordano P. Treatment Options for MODY Patients: A Systematic Review of Literature. Diabetes Ther. 2020;11(8):1667-1685. doi:10.1007/s13300-020-00864-4
  6. Smith KL, Isaacs C. BRCA mutation testing in determining breast cancer therapy. Cancer J. 2011;17(6):492-499. doi:10.1097/PPO.0b013e318238f579
  7. Karczewski KJ, Francioli LC, Tiao G, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434-443. doi:10.1038/s41586-020-2308-7
  8. Sundaram L, Gao H, Padigepati SR, et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet. 2018;50(8):1161-1170. doi:10.1038/s41588-018-0167-z
  9. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583-589. doi:10.1038/s41586-021-03819-2
  10. Frazer J, Notin P, Dias M, et al. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021;599(7883):91-95. doi:10.1038/s41586-021-04043-8
  11. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240. doi:10.1093/bioinformatics/btz682
  12. Brown MS, Goldstein JL. Expression of the familial hypercholesterolemia gene in heterozygotes: mechanism for a dominant disorder in man. Science. 1974;185(4145):61-63. doi:10.1126/science.185.4145.61
  13. Goldstein JL, Brown MS. Binding and degradation of low density lipoproteins by cultured human fibroblasts. Comparison of cells from a normal subject and from a patient with homozygous familial hypercholesterolemia. J Biol Chem. 1974;249(16):5153-5162.
  14. Sabatine MS. PCSK9 inhibitors: clinical evidence and implementation. Nat Rev Cardiol. 2019;16(3):155-165. doi:10.1038/s41569-018-0107-8
  15. Grundy SM, Stone NJ, Bailey AL, et al. 2018 AHA/ACC/AACVPR/AAPA/ABC/ACPM/ADA/AGS/APhA/ASPC/NLA/PCNA Guideline on the Management of Blood Cholesterol: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Circulation. 2019;139(25):e1082-e1143. doi:10.1161/CIR.0000000000000625
  16. Luo J, Yang H, Song BL. Mechanisms and regulation of cholesterol homeostasis. Nat Rev Mol Cell Biol. 2020;21(4):225-245. doi:10.1038/s41580-019-0190-7
  17. Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet. 2019;51(4):584-591. doi:10.1038/s41588-019-0379-x