FAQ

How do I use the R package?

The R package is used to download the data from the API and perform the colocalization and rare variant analysis.

Installation and usage:

# Install from GitHub (requires the devtools package)
devtools::install_github('MRCIEU/gpmapr')
library(gpmapr)
# Search for a term of interest
search_gpmapr('haemoglobin')

Usage: A series of vignettes is available. Please start with the introductory vignette.

How do I upload my own data?

You can upload your own GWAS summary statistics to run colocalization and rare variant analysis against the GPMap database. You can optionally specify one or more existing GWAS upload GUIDs to also compare your upload against those (in addition to the main database).

Use the Upload GWAS for Comparison form on the homepage.

Use the gpmapr R package:

gpmapr::upload_gwas(
  file = 'gwas.tsv.gz',
  name = 'My new GWAS',
  email = 'me@example.com',
  column_names = list(...),
  ...
)

Is my uploaded data available to others?

Your uploaded data is not automatically added to the official GPMap database that others can view or search. However, because the website does not require a login, we cannot prevent the URL of your uploaded data from being shared. The results are not discoverable by other users, but anyone who knows the GUID of your upload can access them.

If you wish to use the comparison functionality but keep your data private, you can contact us to collaborate.
If you have already uploaded your data and wish to remove it, contact us to have it removed.

How does the GWAS upload pipeline work?

The GWAS upload pipeline lets you upload a GWAS and run the colocalization and rare variant analysis on it. It uses the same data processing pipeline that was used to create this resource, with some caveats. Several of the processing steps remove data, which can make the results look inconsistent.

Filtered list of comparisons: Due to server constraints, your GWAS is only compared with studies whose minimum p-value in that LD block is 1e-6 or lower.
No GWAS QC step: Due to server constraints, DENTIST (which is used in the main pipeline) is not run on GWAS uploads.
Sparsely Populated Studies Not Supported: Only studies with a minimum of 150 samples in a significant LD block will be processed.
Rare Variant Analysis Not Supported: Only variants with a MAF of 0.01 or greater will be processed.
Conditional Imputation: Imputation is only performed if the correlation between original and imputed SNPs is greater than 0.7.
Conditional Finemapping: If finemapping finds only a single credible set or does not converge, the LBF values are calculated directly from the original summary statistics.
Missing LBF Values: A series of finemapping filtering steps finds and removes over-inflated LBF values from the analysis, so you may see some missing LBF values in the results.
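
As a rough illustration of two of the caveats above (the MAF cut-off and the per-LD-block p-value filter), the following base R sketch applies them to a toy data frame. The column names (study, ld_block, maf, p), the helper functions, and the order in which the filters are applied are all assumptions made for illustration; they are not part of gpmapr or of the actual pipeline.

```r
# Toy summary statistics; all column names and values are illustrative assumptions.
sumstats <- data.frame(
  study    = c("A", "A", "A", "B", "B", "B"),
  ld_block = c("blk1", "blk1", "blk2", "blk1", "blk1", "blk2"),
  maf      = c(0.20, 0.005, 0.30, 0.15, 0.40, 0.02),
  p        = c(1e-9, 1e-12, 0.03, 1e-4, 0.2, 1e-7),
  stringsAsFactors = FALSE
)

# Keep only common variants (MAF of 0.01 or greater).
keep_common <- function(df, min_maf = 0.01) {
  df[df$maf >= min_maf, , drop = FALSE]
}

# Keep study / LD-block pairs whose smallest p-value reaches the threshold.
keep_significant_blocks <- function(df, p_threshold = 1e-6) {
  min_p <- aggregate(p ~ study + ld_block, data = df, FUN = min)
  keep  <- min_p[min_p$p <= p_threshold, c("study", "ld_block")]
  merge(df, keep, by = c("study", "ld_block"))
}

common   <- keep_common(sumstats)
filtered <- keep_significant_blocks(common)
print(filtered)
```

In this toy example the rare variant in study A is dropped first, and only the study / LD-block pairs that still contain a variant with p at or below 1e-6 survive the second filter.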

How are common and rare variant results linked?

To integrate common and rare variant datasets, we explicitly linked 1,485 studies that utilized identical UK Biobank data fields, while the remaining 14,512 studies were treated as independent phenotypic entries to maintain a conservative approach toward cross-study phenotypic equivalence.

We acknowledge that a significantly larger number of studies likely share overlapping biology or phenotypic definitions. However, to maintain a conservative strategy, we restricted explicit cross-category linking to instances where the phenotypic definitions were identical (e.g., matching UK Biobank data fields). For the remaining studies, we treat them as independent entries in the resource to avoid making assumptions about phenotypic equivalence across different cohorts or coding systems. We encourage users to consider the potential for phenotypic equivalence of traits when investigating the results of their specific trait of interest.

Choosing p-value thresholds

GPMap is intended to serve as a general-purpose research tool. Consequently, individual researchers should select p-value thresholds that are appropriate for their specific use case, ranging from the conservative multiple-testing corrections required for hypothesis-free discovery to more relaxed thresholds suitable for hypothesis-driven investigations or the validation of established signals.
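
As one illustration of the conservative end of this range, a Bonferroni correction can be computed with base R's p.adjust. The p-values and the 0.05 significance level below are made-up examples, not recommended defaults for GPMap.

```r
# Hypothetical p-values from a hypothesis-free scan of five traits.
p_values <- c(1e-8, 3e-4, 0.004, 0.02, 0.6)
alpha    <- 0.05

# Bonferroni: multiply each p-value by the number of tests (capped at 1).
p_bonferroni <- p.adjust(p_values, method = "bonferroni")

# Equivalent per-test threshold on the unadjusted p-values.
bonferroni_threshold <- alpha / length(p_values)

# Conservative (hypothesis-free) versus relaxed (hypothesis-driven) selection.
significant_strict  <- p_values[p_bonferroni < alpha]
significant_relaxed <- p_values[p_values < alpha]
```

Here the relaxed threshold keeps one more association than the Bonferroni-corrected one; which of the two is appropriate depends on how many hypotheses were tested.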

How do I interpret this graph?

Graph options

Study P-value: The p-value threshold for the traits to be displayed in the legend.
Include Trans Markers: Whether to include trans markers in the graph.
Trait Types: The type of trait to be displayed in the graph. 'Molecular Only' will still include the phenotype in question on the phenotype view, 'Phenotype Only' will not include molecular traits.
Trait Categories: The category of trait to be displayed in the graph. If no categories are selected, all traits will be displayed.

Trait view

Displays the colocalised results of the study in question, and shows all studies which colocalise with it, overlaid on the Manhattan plot of the phenotype. Also displays significant rare and non-colocalising results. To compare 2 specific traits, please use the 'Filter Results By' dropdown.

Colocalised Results: Displays the colocalised results of the phenotype in question, and shows all studies which colocalise with it.
Rare association Results: These are not colocalization groups, but single SNP associations that both show significant association with the phenotype in question.
Circle size: The size of the circle is proportional to the number of traits in the colocalisation group. The larger the circle, the more traits are in the colocalisation group.
Result Filtering: To compare 2 specific traits, please use the 'Filter Results By' dropdown. This will filter the results to only show the traits that are selected.

Variant view

The variant view displays the colocalisation results for a single SNP. Each circle represents a trait that is in the colocalisation group.

How the SNP is chosen: For each SNP, the log Bayes factors (LBF) returned by SuSiE are summed across every trait in the colocalisation group, and the SNP with the maximum sum is chosen.
Node size: Each node is sized by the p-value of the SNP for that specific trait: the smaller the p-value, the larger the node.
Links: Each link represents a colocalization pair result, as returned by coloc.
Link Strength: The strength of the link is displayed as the H4 value, which is a measure of the strength of the colocalization pair result. A significant link (H4 > 0.8) is displayed in blue, a weak link (0.8 > H4 > 0.5) is displayed in orange.
Group Connectedness: The connectedness of the colocalisation group is calculated as the sum of the H4 values of the coloc pairs, divided by the total number of coloc pairs in the group. A higher connectedness means that the traits within the colocalisation group are more strongly connected to one another.
Trait Type: The type of trait is displayed in the legend.
Common vs Rare: The common groups are visualised in the graph, but the rare results are not; they are included in the results table and forest plot.
VEP annotation: The data displayed under VEP annotation relates to the SNP and is provided by the Ensembl Variant Effect Predictor.
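
The SNP selection, link colouring, and group connectedness described above can be sketched in base R as follows. This is only an illustration on made-up numbers: the object names are hypothetical, the LBF sum is our reading of the description rather than the exact pipeline code, and the handling of H4 values of exactly 0.8 or of 0.5 and below is an assumption.

```r
# Hypothetical log Bayes factors: rows are SNPs, columns are traits in one group.
lbf <- matrix(
  c(2.0, 1.5, 0.5,
    6.0, 4.0, 5.5,
    3.0, 4.5, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("snp1", "snp2", "snp3"), c("traitA", "traitB", "traitC"))
)

# Sum the LBF across all traits for each SNP, then take the SNP with the maximum.
summed_lbf <- rowSums(lbf)
lead_snp   <- names(which.max(summed_lbf))

# Hypothetical pairwise coloc results; H4 is the posterior probability
# that the two traits share a causal variant.
coloc_pairs <- data.frame(
  trait_1 = c("traitA", "traitA", "traitB"),
  trait_2 = c("traitB", "traitC", "traitC"),
  h4      = c(0.95, 0.60, 0.85),
  stringsAsFactors = FALSE
)

# Classify links: H4 > 0.8 is significant (blue), H4 between 0.5 and 0.8 is
# weak (orange). Treating exactly 0.8 as weak, and labelling anything at or
# below 0.5 as "other", are assumptions.
classify_link <- function(h4) {
  ifelse(h4 > 0.8, "significant", ifelse(h4 > 0.5, "weak", "other"))
}
coloc_pairs$link <- classify_link(coloc_pairs$h4)

# Group connectedness: sum of H4 values divided by the number of coloc pairs.
connectedness <- sum(coloc_pairs$h4) / nrow(coloc_pairs)
```

With these toy values, snp2 has the largest summed LBF and would be the displayed SNP, and the connectedness is the mean of the three H4 values.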

Gene and region view

Displays the colocalisation results for a gene or region. Each circle represents a result that has a study marked with that gene. Results may not align with the exonic region of the gene, as some studies may have QTLs which are in the regulatory region of the gene.

Colocalised Results: Displays the colocalised results of the phenotype in question, and shows all studies which colocalise with it.
Rare association Results: These are not colocalization groups, but single SNP associations that both show significant association with the phenotype in question.
Circle size: The size of the circle is proportional to the number of traits in the colocalisation group. The larger the circle, the more traits are in the colocalisation group.
Surrounding Genes: Below the graph, the surrounding genes are displayed. These are genes that are in the same region as the gene in question, and have a study marked with them.
Result Filtering: You can filter by another surrounding gene by clicking on the gene in question. To compare 2 specific traits, please use the 'Filter Results By' dropdown. This will filter the results to only show the traits that are selected.
Pleiotropy Scores: Two pleiotropy scores are displayed. The first is the number of distinct trait categories, and the second is the number of distinct protein-coding genes, that the gene is associated with in the colocalisation results. Rare variant results are not included in the calculation.
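
As a minimal sketch of the counting behind the two pleiotropy scores, the base R example below uses a made-up results table. The column names (result_type, trait_category, coding_gene), the gene names, and the categories are all hypothetical and do not reflect the actual GPMap schema.

```r
# Hypothetical results attached to one gene; every value here is illustrative.
results <- data.frame(
  trait          = c("LDL cholesterol", "HDL cholesterol", "Height",
                     "GENE1 expression", "GENE2 expression", "Rare trait"),
  result_type    = c("coloc", "coloc", "coloc", "coloc", "coloc", "rare"),
  trait_category = c("Lipids", "Lipids", "Anthropometric", NA, NA, "Cancer"),
  coding_gene    = c(NA, NA, NA, "GENE1", "GENE2", NA),
  stringsAsFactors = FALSE
)

# Rare variant results are excluded from the calculation.
coloc_only <- results[results$result_type == "coloc", , drop = FALSE]

# Count distinct non-missing values.
count_distinct <- function(x) length(unique(x[!is.na(x)]))

category_score <- count_distinct(coloc_only$trait_category)
gene_score     <- count_distinct(coloc_only$coding_gene)
```

Because the rare result is filtered out before counting, its category does not contribute to the first score.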