About the Genotype-Phenotype Map (GPMap)
Developed at the MRC Integrative Epidemiology Unit (IEU) at the University of Bristol, the
Human Genotype-Phenotype Map (GPMap) is an integrated discovery engine designed to bridge
the gap between GWAS discovery and functional follow-up.
While standard browsers identify genes in proximity to lead SNPs, the GPMap uses rigorous
fine-mapping and colocalization to identify causal links between thousands of complex traits
and molecular layers (eQTL, pQTL, sQTL, and methQTL).
For licensing, privacy, and service terms, please see our Terms of Use.
Core Capabilities
-
Causal Locus Resolution: Transition from "nearest gene" heuristics to empirical
evidence. By scanning Colocalization Groups (CGs), you can identify the specific
phenotypes and molecular mechanisms sharing a genetic architecture at a single locus.
-
Systemic Pleiotropy & Comorbidity: Instantly visualize "pleiotropic
neighbors." The GPMap allows you to distinguish whether a variant affects multiple traits
independently (horizontal pleiotropy) or acts through a molecular mediator such as a
protein (vertical pleiotropy).
-
Precision MR Instruments: Streamline Mendelian Randomization by selecting
instruments backed by high colocalization posterior probabilities (H4 ≥ 0.8). This
reduces "LD-contamination" and helps ensure your instrumental variables (IVs) are
functionally relevant.
-
User-Led Extensibility: Beyond our library of 4,500+ traits, you can upload your
own GWAS summary statistics. The platform will automatically run fine-mapping and
colocalization against our entire multi-omic database to identify supported mechanisms
for your novel hits.
Accessing the Map
The GPMap is an open-access resource available via our web interface or the
gpmapr R package for programmatic analysis.
Quick Start with gpmapr
# Install the package (requires devtools; run install.packages('devtools') first if needed)
devtools::install_github('MRCIEU/gpmapr')
# Search for traits, genes, or variants
gpmapr::search_gpmapr('Haemoglobin')
# Retrieve high-resolution data for a specific target
gpmapr::gene('TREM2')
# Project your own results against the map ('...' stands for any further arguments)
gpmapr::upload_gwas(file = 'my_gwas.tsv.gz', name = 'Discovery Study', ...)
Project Components
The Genotype-Phenotype Map comprises three distinct efforts:
Nomenclature
We have defined these terms as follows:
-
Traits: A trait denotes the outcome variable assessed in any of the GWAS from
which we have taken summary statistics for the GPMap. The trait names remain as defined
in the original study. Other commonly used names for a trait are 'phenotype' and
'study'.
-
Complex Trait: A complex trait represents a polygenic organismal phenotype or a
clinical disease state. All complex traits have genome-wide association data.
-
Molecular Trait: A molecular trait represents a discrete cellular process such as
mRNA expression or protein abundance. Molecular traits do not typically have
genome-wide data associated with them. Instead, significant cis (and sometimes
trans) signals are extracted in specific genomic regions, and each molecular trait
is associated with a gene and a tissue.
-
Trait Category: Complex traits (molecular phenotypes were excluded) were assigned
to one of 23 categories. For each trait, the OpenAI model gpt-5-nano was prompted
to return the best-matching category for the trait name, along with a confidence
score for the match. The trait-to-category mapping was then manually inspected. A
number of traits had a low category match confidence score but were retained for
completeness. Some traits were too broad and were therefore manually set to
'undefined'.
-
A third category of trait, which falls in between these, consists of ultra-specific
measurements that have genome-wide associations rather than cis- and trans-
windows. These are denoted 'Cell Traits' (e.g. 'IgD- CD27- B cell %B cell') and
'Targeted Protein Measure' (e.g. 'VDBP plasma levels'), and are considered neither
complex traits nor molecular traits.
-
Pleiotropy score: Two pleiotropy scores are calculated, each at both the variant
and the gene level. The first is the number of distinct trait categories, and the second
is the number of distinct protein-coding genes, found in the colocalization groups tagged
by the variant (variant level) or containing gene-specific QTLs (gene level). Rare
variant results are not included in the calculation.
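As a rough illustration of the variant-level scores described above, the sketch below
counts distinct trait categories and distinct protein-coding genes among the members of the
colocalization groups tagged by each variant. The data frame, its column names (`variant`,
`trait_category`, `gene`, `protein_coding`) and the helper functions are hypothetical. They
do not reflect the GPMap schema or the gpmapr API.

```r
# Hypothetical colocalization group members, one row per trait.
# Column names and values are illustrative assumptions, not the GPMap schema.
members <- data.frame(
  variant        = c("rs1", "rs1", "rs1", "rs1", "rs2", "rs2"),
  trait_category = c("Cardiovascular", "Metabolic", "Metabolic", NA, "Immune", NA),
  gene           = c(NA, NA, NA, "GENE_A", NA, "GENE_B"),
  protein_coding = c(NA, NA, NA, TRUE, NA, FALSE),
  stringsAsFactors = FALSE
)

# Count distinct non-missing values in a vector.
n_distinct <- function(x) length(unique(x[!is.na(x)]))

# For each variant, count distinct trait categories and distinct
# protein-coding genes across the rows it tags.
pleiotropy_scores <- function(df) {
  variants <- unique(df$variant)
  data.frame(
    variant = variants,
    n_trait_categories = vapply(variants, function(v) {
      n_distinct(df$trait_category[df$variant == v])
    }, integer(1), USE.NAMES = FALSE),
    n_protein_coding_genes = vapply(variants, function(v) {
      keep <- df$variant == v & !is.na(df$protein_coding) & df$protein_coding
      n_distinct(df$gene[keep])
    }, integer(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

scores <- pleiotropy_scores(members)
print(scores)
```

The gene-level scores would follow the same pattern, grouping by gene instead of by variant.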
-
Coverage: dense vs. sparse. Summary statistics that only include results for SNPs
reaching a specific p-value threshold are considered sparsely populated. All others are
considered densely populated. Sparsely populated summary statistics had their missing
values zero-padded, and both the imputation and fine-mapping steps were skipped.
-
Cell Type: Some QTLs were derived from 'single cell' expression assays which
derive gene expression measurements from specific cell subsets following single-cell
transcriptional profiling. A list of the cell types currently in the map is in
Supplementary Table 11.
-
Gene Annotation: There are two different types of gene annotation that occur for
QTLs, 'gene' and 'situated gene'. 'Gene' refers to the gene that was assayed and is
annotated by the QTL resource itself (e.g. GTEx gene expression of TREML2 in blood is
annotated as 'TREML2'). 'Situated gene' currently applies only to rare-variant
studies conducted using whole-exome sequencing data and denotes the gene in which the
variant is physically situated. Common variant QTL studies have a 'Gene', most rare
variant phenotypic studies have a 'Situated Gene', and rare variant QTL studies have
both a 'Gene' and a 'Situated Gene', which may differ. Genes have been assigned to
methQTLs based on the proximity of the assayed CpG site to a gene. These were taken from
the Illumina Methylation EPICv2 manifest(1), tagging CpGs by their proximity to gene
bodies and promoter regions.
-
Colocalization definitions:
-
Colocalizing Pair: Each pairwise test is a single colocalization analysis run
between two traits using coloc(2). The two traits are considered a colocalizing
pair if the posterior probability of a shared causal variant (H4) is ≥ 0.8.
-
Colocalization Group: A set of traits that have been grouped together by
graph-based clustering and pruning methods from the results of pairwise
colocalization analysis. A detailed explanation of the clustering and pruning
methods can be found in Supplementary Notes 2-3.
-
Group Connectedness Percentage: The connectedness of a colocalization group
is the number of colocalization pairs (edges) in the group with H4 ≥ 0.8, divided
by the total number of possible edges in a group with n nodes, n(n-1)/2, and
expressed as a percentage. A higher connectedness means that the colocalization
group is more strongly connected.
-
Candidate Variant: Each colocalization group is assigned a SNP. The SNP
is chosen as the variant with the highest cumulative sum of the log Bayes factor
(LBF), calculated by SuSiE(3), across every trait in the colocalization group.
In some situations, the candidate variant may not be the variant with the
highest LBF for a given study in the colocalization group, or it may be missing
from a study in which it was not genotyped or imputed. This variant should
not be interpreted as the causal variant for traits in the group. It instead
tags a shared colocalizing signal in the region.
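The two calculations above can be sketched in a few lines of base R. This is a minimal
illustration, not the GPMap pipeline: the trait names, H4 values and LBF matrix are made
up, `connectedness_pct` is a hypothetical helper, and treating a missing LBF as
contributing nothing to the cumulative sum is an assumption of this sketch.

```r
# Hypothetical pairwise coloc results within one group of n = 4 traits.
traits <- c("trait_a", "trait_b", "trait_c", "trait_d")
pairs <- data.frame(
  trait_1 = c("trait_a", "trait_a", "trait_a", "trait_b", "trait_b", "trait_c"),
  trait_2 = c("trait_b", "trait_c", "trait_d", "trait_c", "trait_d", "trait_d"),
  h4      = c(0.95, 0.91, 0.40, 0.88, 0.10, 0.85),
  stringsAsFactors = FALSE
)

# Connectedness: edges with H4 >= 0.8 over the n(n-1)/2 possible edges,
# expressed as a percentage.
connectedness_pct <- function(h4, n_traits, threshold = 0.8) {
  possible_edges <- n_traits * (n_traits - 1) / 2
  100 * sum(h4 >= threshold) / possible_edges
}

connectedness <- connectedness_pct(pairs$h4, length(traits))

# Hypothetical SuSiE log Bayes factors: rows are variants, columns are traits.
# NA marks a variant that was not genotyped or imputed in that study.
lbf <- matrix(
  c(5.0, 4.0, 6.0, 3.0,
    6.5, 1.0, 2.0,  NA,
    1.0, 0.5, 0.2, 0.1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("rs1", "rs2", "rs3"), traits)
)

# Candidate variant: highest cumulative LBF across all traits in the group.
# Assumption: a missing LBF is skipped (na.rm = TRUE) rather than imputed.
cumulative_lbf <- rowSums(lbf, na.rm = TRUE)
candidate <- names(which.max(cumulative_lbf))

print(connectedness)
print(candidate)
```

In this toy group, rs2 has the highest LBF for trait_a, yet rs1 has the highest cumulative
LBF and is selected. This is the situation in which the candidate variant differs from the
top variant of an individual study.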
-
P-value thresholds:
-
Genome-wide significance (GWS): We use the standard European
genome-wide significance p-value threshold of 5e-8 for most calculations in the
results.
-
Suggestive Significance: We have extracted and analyzed all loci that pass a
'suggestive significance' p-value threshold of 1.5e-4. This threshold is derived
in the same way as the genome-wide one. Assuming approximately 1 million
independent loci across the roughly 3 billion base pairs of the genome
(0.05 / 1m = 5e-8), there are approximately 333 independent loci in every 1 Mb
region, and 0.05 / 333 ~= 1.5e-4. Both colocalization group and colocalization
pair data are available in the resource. Users should choose a p-value threshold
suited to their own multiple-testing correction.
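The arithmetic behind the two thresholds can be checked directly. The snippet below
restates the figures from the text (0.05, 1 million independent loci, 3 billion base pairs,
a 1 Mb region). The variable names are illustrative only.

```r
alpha            <- 0.05
genome_bp        <- 3e9   # approximate genome length in base pairs
independent_loci <- 1e6   # approximate number of independent loci genome-wide

# Genome-wide significance: Bonferroni correction across all independent loci.
gws_threshold <- alpha / independent_loci

# Suggestive significance: the same correction applied within a 1 Mb region.
window_bp       <- 1e6
loci_per_window <- independent_loci * window_bp / genome_bp   # about 333
suggestive_threshold <- alpha / loci_per_window               # about 1.5e-4

print(gws_threshold)
print(loci_per_window)
print(suggestive_threshold)
```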
-
Cis and Trans: For common variant QTLs, the cis window includes any fine-mapped
loci within ±1 Mb of the SNP tagging the QTL, and trans is defined as any region outside
of that window. For rare QTLs, any SNP falling directly within the gene that was assayed
is considered cis (i.e. the 'gene' and the 'situated gene' match). If this is not the
case, the variant is considered to be a trans signal for the assayed gene measure.
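The cis/trans rules above could be expressed as two small helper functions. This is a
sketch under stated assumptions: `classify_common` and `classify_rare` are hypothetical
names, the gene labels and positions are invented, and requiring the locus and the tagging
SNP to be on the same chromosome is implied by the ±1 Mb window rather than stated in the
text.

```r
# Common-variant QTLs: cis if the fine-mapped locus lies within +/- 1 Mb of
# the SNP tagging the QTL (same chromosome assumed), otherwise trans.
classify_common <- function(locus_chr, locus_pos, tag_chr, tag_pos, window = 1e6) {
  in_window <- locus_chr == tag_chr & abs(locus_pos - tag_pos) <= window
  ifelse(in_window, "cis", "trans")
}

# Rare-variant QTLs: cis if the variant is situated in the assayed gene,
# i.e. 'gene' and 'situated gene' match; otherwise trans.
classify_rare <- function(gene, situated_gene) {
  ifelse(gene == situated_gene, "cis", "trans")
}

# Invented example inputs.
common_calls <- classify_common(
  locus_chr = c("6", "6", "1"),
  locus_pos = c(41160000, 43500000, 41160000),
  tag_chr   = c("6", "6", "6"),
  tag_pos   = c(41100000, 41100000, 41100000)
)
rare_calls <- classify_rare(
  gene          = c("GENE_A", "GENE_A"),
  situated_gene = c("GENE_A", "GENE_B")
)

print(common_calls)
print(rare_calls)
```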
Acknowledgements
We express our sincere gratitude to the research participants and the investigators of
the various studies and consortia whose data contributed to the development of the
GPMap. This work would not have been possible without the altruistic contribution of
hundreds of thousands of individuals to genetic research.
We specifically acknowledge the following biobanks and consortia for providing the GWAS
summary statistics and functional genomic data: UK Biobank, The GTEx Consortium,
FinnGen, and the NHGRI-EBI GWAS Catalog. We also thank the IEU OpenGWAS project for
providing the computational infrastructure and data standardization that facilitated
this large-scale integration.