April 17, 2026

MEDAISY

Small Steps Towards a Healthier Life

Rare disease gene association discovery in the 100,000 Genomes Project

Rare disease gene association discovery in the 100,000 Genomes Project

Rare disease genomes from the 100KGP

The Health Research Authority Research Ethics Committee East of
England—Cambridge South (Ref. 14/EE/111) gave ethical approval for the
100KGP. Patients with rare diseases and affected and unaffected family members
were enrolled to the 100KGP through 1 of the 13 NHS Genomic Medicine Centres
across England, Northern Ireland, Scotland and Wales2. Consent was obtained
from all participants to the 100KGP. The recruiting clinicians assigned each
proband to a specific disease (according to a hierarchical disease
classification available in the project described below) and provided
patient’s phenotypic data according to the HPO49. An initial cohort of
74,061 genomes (35,548 single probands and larger families) from the rare
disease pilot and main programme of the 100KGP (data release v.11) was available
for analysis (March 2021). Genomes were sequenced using the TruSeq DNA
polymerase-chain-reaction-free sample preparation kit (Illumina) on a HiSeq 2500
sequencer, which generates a mean depth of 32× (range, 27–54)
and a depth greater than 15× for at least 95% of the reference human
genome. Whole-genome sequencing reads were aligned to either the Genome
Reference Consortium human genome build 37 (GRCh37) for the minority of earlier
samples, or build 38 (GRCh38), with the use of Isaac Genome Alignment Software.
Family-based variant calling of single-nucleotide variants and insertions or
deletions for chromosomes 1 to 22, the X chromosome, and the
mitochondrial genome (mean coverage, 2,814×; range, 142–16,581)
was performed with the use of the Platypus variant
caller50. Quality control performed by Genomics
England highlighted that 81 of the probands had been recruited and sequenced
twice and these duplicates were removed from our cohort. In addition, the
required data for our Exomiser-based gene burden analysis, for example,
recruited disease category and phenotypic terms, was not available for 16
families and these were also excluded from our cohort. The demographics of the
cohort, presented in Supplementary Table 1, were obtained using Labkey in the Genomics England
Research Environment. Genetic ancestry inference was performed by Genomics
England by principal component analysis. A random forest model was subsequently
trained to predict ancestry across five super-populations (European, African,
Admixed American, South Asian, East Asian), with people assigned to ancestries
on the basis of a probability threshold of greater than 0.8
(https://re-docs.genomicsengland.co.uk/ancestry_inference/).

Pool of rare, putative disease-causing variants for gene burden
testing

The variant prioritization tool Exomiser11 (v.12.1.0 with default
settings and latest 2007* (July 2020) databases) was then run on all
available 35,451 single proband and family-based VCF files to obtain a pool of
rare, protein-coding, segregating and most predicted pathogenic (per gene)
variants to use in an rare variant gene-based burden testing analysis for the
discovery of new rare Mendelian disease–gene associations as described
below. For each proband/family and each gene, Exomiser selected a single
configuration of contributing variants, that is, the most predicted (REVEL and
MVP) pathogenic, rare (less than 0.1% autosomal/X-linked dominant or homozygous
recessive, less than 2% autosomal/X-linked compound-heterozygote recessive;
using publicly available sequencing datasets including gnomAD) protein-coding
homozygous/heterozygous variant or compound-heterozygote variants that
segregated with disease for each possible mode of inheritance. Coding variants
(including canonical splice acceptor and donor and splice region (bases one to
three of exon to three to eight of intron)) were selected by Exomiser by
removing all those classified as FIVE_PRIME_UTR_EXON_VARIANT,
FIVE_PRIME_UTR_INTRON_VARIANT, THREE_PRIME_UTR_EXON_VARIANT,
THREE_PRIME_UTR_INTRON_VARIANT, NON_CODING_TRANSCRIPT_EXON_VARIANT,
UPSTREAM_GENE_VARIANT, INTERGENIC_VARIANT, REGULATORY_REGION_VARIANT,
CODING_TRANSCRIPT_INTRON_VARIANT, NON_CODING_TRANSCRIPT_INTRON_VARIANT and
DOWNSTREAM_GENE_VARIANT. The Exomiser analysis did not return any candidate
variants for 29 families, generally for larger families with several affected
people where no rare, putative disease-causing variants remained after
filtering, leading to an interim dataset size of 35,422 single probands and
larger families. To control for false positive variant calls and/or relatively
common variants in the project itself, we further discarded variants on the
basis of how often they were observed in the Exomiser master dataset itself
(frequency greater than 2% for variants in a compound-heterozygote genotype,
greater than 0.2% for mitochondrial DNA genome variants, greater than 0.1% for
heterozygote/homozygote variants). This led us to discard data from a further 41
families. Finally, potentially digenic probands with more than one recruited
disease category were discarded from the analysis, leaving a total of 35,008
probands. As part of the sample quality control, kinship coefficients were used
to control for cryptic relatedness. Genomics England provided kinship
coefficients only for 29,180 of the 35,008 probands. Therefore, we calculated a
genetic variant overlap measure (number of variants in common in the Exomiser
results for two probands per total number of variants in the Exomiser results
for the two probands) for all pairwise combinations of the 35,008 probands, and
demonstrated that this was correlated strongly with the available kinship
coefficients (Supplementary Fig. 2).
Within each recruited disease category, we then identified genetically related
probands (kinship coefficient greater than 0.088 corresponding to second-degree
relatives or above, or an equivalent variant overlap score (0.1) threshold where
kinship coefficient was not available). The proband in the pair with the most
Exomiser results (less refined list of rare, putative disease-causing variants,
usually from a smaller family size) was then dropped from all further analyses.
This removed 157 probands/families, leading to a final input analysis dataset of
34,851 single probands and larger families (40,402 probands and affected family
members and 32,288 unaffected family members) and 4,643,230 Exomiser-based
candidate heterozygote/homozygote variants and compound-heterozygote genotypes
(Supplementary Table 1 and Fig.
1). Furthermore, for each
case–control analysis (see case–control definition below), a
further variant quality control was applied with variants seen in at least one
case (as heterozygote, homozygote or in a compound-heterozygote genotype) being
discarded if seen in at least one control (as
heterozygote/homozygote/compound-heterozygote, homozygote/compound-heterozygote,
homozygote/compound-heterozygote, respectively) to mimic a Mendelian, fully
penetrant-like disease model.

Exomiser-based rare variant gene burden testing for Mendelian
diseases

A rare variant gene-based burden case–control analytical
framework that exploits rare, putative disease-causing variants as annotated,
filtered and scored by the variant prioritization tool Exomiser was used to
identify new rare Mendelian disease–gene associations. The annotation of
variants to genes comes from this Exomiser analysis using its default settings
to identify the most damaging consequence to the set of Gencode-basic tagged
Ensembl transcripts. The framework has been described
previously1 and extended in this study
( Briefly, as to the application of the analytical
framework to the rare disease component of the 100KGP, cases and controls were
defined exploiting the hierarchical disease classification in the project itself
where the recruiting clinicians assigned each proband to any of 228
‘specific diseases’ (level 4); the ‘specific
diseases’ are in turn grouped into less specific 91 ‘disease
sub-groups’ (level 3), each of which corresponds to 1 of 20 broad
‘disease groups’ (level 2) (Supplementary Table 6). Two specific diseases (pontine tegmental
cap dysplasia and childhood onset leukodystrophy) were never used for
recruitment in the end, leaving a final set of 226 level 4 disease categories. A
case set was then defined as all probands recruited under each of the 226 level
4 disease categories and its corresponding control set as all recruited probands
except those under the level 2 category containing the specific level 4 disease,
for example, hypertrophic cardiomyopathy cases were compared with all
non-cardiovascular disorders probands as controls. As with the gene burden
testing, the gene-based enrichment of variants in cases versus controls was
quantified using the cohort allelic sums test (CAST)51 statistic under four
proband genotype scenarios (irrespective of the mode of inheritance): (1)
presence of at least one rare, predicted LoF variant; (2) presence of at least
one rare, highly predicted pathogenic variant (Exomiser variant score of at
least 0.8 (either predicted LoF or missense variants predicted to be
pathogenic)); (3) presence of at least one rare, highly predicted pathogenic
variant in a CCR and (4) presence of a rare, de novo variant (restricted to only
trios or larger families where de novo calling is possible and provided by the
user). These CCR regions were defined previously by looking for the absence of
variation in gnomAD15 at various levels of certainty and in the
application to the rare disease component of the 100KGP we used the 95%
percentile download. Given that Exomiser selects by default a single
configuration of ‘contributing’ variants for each
proband/family, each gene and each possible mode of inheritance as compatible
with available family-based data and we calculated the CAST statistic as the
best observed irrespective of the mode of inheritance, the CAST statistic
corresponds to a sum test statistic40. The gene burden association is then
assessed either using a binary case–control status versus the CAST
statistic in a right-tailed Fisher’s exact test as in the original
implementation of the analytical framework1, or using the CAST
statistic as a covariate in a Firth’s logistic regression
model46 that is tailored to testing unbalanced
case–control datasets with rare events. In the application to the rare
disease component of the 100KGP, Firth’s logistic regression models were
adjusted for age, sex, family size (single proband/duos/trios and larger
families) and inferred genetic ancestry (Supplementary Table 1). To maintain statistical validity and power,
the analysis was limited to those disease–gene associations where an
arbitrary set of at least five cases exist for the specific disease tested and,
for each of the four gene-based proband genotype scenarios above, where relevant
variants in the gene were seen in at least four probands, of which at least two
were cases (we would not follow up associations signals driven by single
cases/families in the first instance). Only gene-based enrichments that were
more frequent in cases than controls (putative disease-causing as opposed to
protective) were considered in the statistical correction for multiple testing.
The Benjamini and Hochberg method52 was used to correct for multiple
testing; an overall FDR-adjusted P value threshold of 0.5% was used for claiming
statistically significant disease–gene associations for further
triaging.

Triaging

First in silico triage

The statistically significant associations were further
filtered for those where (1) the gene was protein-coding as the Exomiser
coding variant filtering settings also identified variants disrupting
non-protein-coding RNA genes (gene type definitions from the human gene
nomenclature committee (HGNC) website); (2) for dominant, LoF signals there
was gnomAD evidence for haplo-insufficiency (gnomAD v.2.1.1
observed/expected LoF less than 0.5) and (3) none of the cases driving the
signal were already assigned a molecular diagnosis in other genes as part of
the 100KGP routine diagnostic pipeline.

Application of ClinGen framework for gene–disease
validity

Classification of the disease–gene associations
according to ClinGen criteria
( was applied using in silico approaches where possible.
The case-level variant score was calculated from scoring and summing all
case variants that support a particular mode of inheritance for a
disease–gene association. LoF variants (stop gain, frameshift or
splice acceptor/donor) scored 1.5 points or 2 if de novo, whereas others
scored 0.1 points or 0.5 if de novo. A case–control study score of 5
points for an odds ratio greater than 5, 4 points for odds ratio greater
than 3 or 3 points for odds ratio less than 3 was assigned. The larger of
the case-level variant score or case–control study score was used as
the genetic evidence score, capped at a maximum of 12 for those associations
that had many supporting case variants. Experimental evidence categories
were calculated using a variety of sources. Existing evidence for a gene
function fitting the likely disease mechanism was assessed using PubMed
searches using the disease and gene name and the background knowledge of the
experts in the various disease-specific GeCIPs. Scores of 0, 1 or 2 were
awarded depending on whether there was no, some tenuous or lots of evidence.
Gene expression was assessed using GTEx Project data through the web portal
of the Human Protein Atlas
( (ref. 53); and/or publicly available relevant
RNA-seq datasets34,35 processed with STAR v.2.7.6a and Salmon
v.1.4.0, and a score of 0, 1 or 2 assigned for no, widespread or solely
specific expression in the relevant disease tissue. Defaults of one point
for protein–protein association evidence (high quality, direct
experimental interactions scoring greater than 0.7 in StringDB with genes on
the disease panel from PanelApp16) and 2 points for mouse/zebrafish
evidence (phenotypic similarity as calculated by Exomiser between the
patient’s phenotypes and the mouse/zebrafish phenotypes where the
orthologous gene was disrupted) were used. The rounded sum of genetic and
experimental evidence points was used to assign the final ClinGen
classification of the evidence for the association as being limited
(0.1–6 points), moderate (7–11 points) or
strong (12–18 points). Definitive evidence for an
association is considered to be a score of 12–18 as well as
convincing replication of the result in more than two publications over more
than 3 years. Therefore, none of our associations will be classified
as definitive at this early stage.

Visual representation of variant location in lollipop plots

Visual representations of the variant locations in the protein were
generated by extending the Mutplot software54. The x axis represents the amino acid chain and
its annotated protein domain from UniProt. Each lolly indicates a variant by its
protein change annotated on the MANE Select transcript or MANE Plus Clinical if
a stronger impact is predicted (the transcript used is specified in the plot)
and the frequency is shown on the y axis. Its shape indicates the genotype found in the
proband. The colour indicates the type of variant and the variant’s
functional annotation. If the variant has both a p. change annotation and a
number in parenthesis it means that the original p. change was annotated on a
different transcript and the amino acid position in parenthesis indicates the
re-annotation on the selected transcript. If the only annotation available
indicates a number in parenthesis it means that the variant was in the
non-coding region for that transcript; therefore, the lolly was placed on the
closest amino acid.

PhenoDigm patient similarity comparisons

During the assessment of some disease–gene associations,
the phenotypic similarity between the probands driving the signal was calculated
using their HPO term annotations and the Exomiser API to give a
PhenoDigm55 score between 0 and 1. The mean of the
pairwise, reciprocal, non-self hits was calculated and compared with those
obtained from 1,000 iterations when the same number of probands was selected at
random from the set of cases with that disease.

Co-expression network analysis

Co-expression network analysis of our candidate genes and known
genes linked to the potentially associated disease (green genes in PanelApp
v.1.120) was performed using GTEx v.6 tissue-specific modules and the CoExp tool
accessible at

(ref. 56).

Peripheral blood mononuclear cell expression analysis

RNA-seq data from peripheral blood mononuclear cells collected from
three volunteer donors was analysed (poly A-selected libraries, mean of
two replicates untreated and two replicates treated with cycloheximide for
1 h to inhibit protein translation and mimic integrated stress
response)57. In-house, R was used for DeSeq2
normalizations per library and calculation of the mean values for each
transcript for the two replicate libraries per donor per condition. For global
evaluations, across all three donors, the mean base value,
log2 fold change post cycloheximide and
Benjamini–Hochberg adjusted P value were then calculated.

Gene and variant look-up in independent rare disease cohorts

In a cohort of patients from the Irish Kidney Gene
Project58 (278 cystic kidney disease and 141 chronic
kidney disease cases), rare (gnomAD minor allele frequency less than 0.1%) LoF,
missense, splicing or intronic variants were extracted for our new renal
disease-associated genes. A further cohort of more than 3,000 Dutch renal
patients was queried for likely pathogenic/pathogenic variants in those genes
using the Alissa bioinformatics pipeline. Similarly, a sequencing cohort of 212
participants with inherited corneal diseases, recruited in the United Kingdom
and Czech Republic and pre-screened for known genetic causes, was interrogated
for any rare variants in the candidate gene POMK.

Reporting summary

Further information on research design is available in
the Nature Portfolio Reporting
Summary linked to this article.

link

Leave a Reply

Your email address will not be published. Required fields are marked *

Copyright © All rights reserved. | Newsphere by AF themes.