Datasource integrations

We constantly work on allowing our users to leverage an ever-growing collection of high-quality datasources provided by prominent organisations. Provided the licence terms are met, these are ready to be investigated, transformed, combined and remodelled to give you an instant data advantage.

Currently available integrations include:

The Biological General Repository for Interaction Datasets (BioGRID) is a curated biological database of protein-protein interactions, genetic interactions, chemical interactions, and post-translational modifications.

Example concepts: HumanProtein, HumanProteinProteinInteraction

Licence: MIT - found here.

The Cancer Biomarkers database is curated and maintained by several clinical and scientific experts in the field of precision oncology. It contains information on genomic biomarkers (genes, including genomic coordinates of the variants in cDNA and gDNA) for cancer drugs and clinical targetability in solid tumors.

Example concepts: Gene, Drug, TumorType, GeneBiomarkerForDrugInTumor

Licence: Creative Commons Universal v1.0 - no copyright - found here.

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.

Example concepts: Target, Molecule, Disease, TargetRelation, DrugMechanism

Licence: Creative Commons Attribution-ShareAlike v3.0 - found here.

ClinGen is a National Institutes of Health (NIH)-funded resource dedicated to building an authoritative central resource that defines the clinical relevance of genes and variants for use in precision medicine and research. It contains curated gene information relevant to the clinical setting, grouped into the following sections: Gene Disease Validity, Dosage Sensitivity, Clinical Actionability, Variant Pathogenicity and Pharmacogenomics.

Example concepts: Gene, Disease, GeneDiseaseValidity

Licence: Creative Commons Universal v1.0 - no copyright - found here.

ClinicalTrials.gov, the largest source of clinical trials data in the US, is a Web-based resource that provides patients, their family members, health care professionals, researchers, and the public with easy access to information on publicly and privately supported clinical studies on a wide range of diseases and conditions. The Web site is maintained by the National Library of Medicine (NLM) at the National Institutes of Health (NIH). Information on ClinicalTrials.gov is provided and updated by the sponsor or principal investigator of the clinical study.

Example concepts: ClinicalStudy

Licence: ClinicalTrials.gov - specific. Free to access at no charge, requires attribution. - found here.

The Comparative Toxicogenomics Database (CTD) aims to advance understanding of how environmental exposures affect human health. It provides manually curated information about chemical–gene/protein interactions, chemical–disease and gene–disease relationships. These data are integrated with functional and pathway data to aid in the development of hypotheses about the mechanisms underlying environmentally influenced diseases.

Example concepts: Gene, Disease, Chemical, ChemicalGeneInteraction, ChemicalDiseaseAssociation

Licence: CTD - specific. Requires permission for commercial purposes. - found here.

The Disease Ontology has been developed as a standardized ontology for human disease with the purpose of providing descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts. Releases: https://github.com/DiseaseOntology/HumanDiseaseOntology/releases

Example concepts: Disease, DiseaseRelation

Licence: Creative Commons Universal v1.0 - no copyright - found here.

The Experimental Factor Ontology provides a systematic description of many experimental variables available in EBI databases, and for external projects such as the NHGRI GWAS catalogue. It combines parts of several biological ontologies, such as anatomy, disease and chemical compounds.

Example concepts: ExperimentalFactor, ExperimentalFactorIsA

Licence: Apache v2.0 - found here.

Ensembl aims to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well-known genome browsers for the retrieval of genomic information.

Example concepts: Gene

Licence: Ensembl - specific / no licence - found here.

G2P is a publicly-accessible online system designed to facilitate the development, validation, curation and distribution of large-scale, evidence-based datasets for use in diagnostic variant filtering. Each G2P entry associates an allelic requirement and a mutational consequence at a defined locus with a disease entity. A confidence level and evidence link are assigned to each entry.

Example concepts: Gene, Disease, GeneDiseaseAssociation

Licence: EBI - specific. No restrictions except when presenting data from other sources. - found here.
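
The shape of a G2P entry described above can be sketched as a small record type. This is only an illustration of the fields named in the description: the class name, field names, confidence labels and all values below are hypothetical placeholders, not the actual G2P schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class G2PEntry:
    """Hypothetical record mirroring the fields named in the G2P description."""
    locus: str                    # gene or locus the entry is defined at
    disease: str                  # disease entity the locus is associated with
    allelic_requirement: str      # placeholder label, e.g. "biallelic"
    mutational_consequence: str   # placeholder label, e.g. "loss of function"
    confidence: str               # confidence level assigned to the entry
    evidence_link: str            # link to the supporting evidence


def filter_by_confidence(entries, accepted):
    """Keep only entries whose confidence level is in the accepted set."""
    return [entry for entry in entries if entry.confidence in accepted]


# Placeholder data, invented purely for illustration.
entries = [
    G2PEntry("GENE_A", "Disease X", "biallelic", "loss of function",
             "confirmed", "https://example.org/evidence/1"),
    G2PEntry("GENE_B", "Disease Y", "monoallelic", "activating",
             "possible", "https://example.org/evidence/2"),
]

confident = filter_by_confidence(entries, {"confirmed", "probable"})
```

Filtering on the confidence field is one plausible way such entries could feed diagnostic variant filtering; which levels to accept would depend on the use case.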

Genomics England PanelApp is a publicly-available knowledgebase that allows virtual gene panels related to human disorders to be created, stored and queried. It includes a crowdsourcing tool that allows genes and genomic entities (short tandem repeats/STRs and copy number variants/CNVs) to be added or reviewed by experts throughout the worldwide scientific community, providing an opportunity for the standardisation of gene panels, and a consensus on which genes have sufficient evidence for disease association.

Example concepts: Gene, Phenotype, GeneToPhenotype

Licence: Genomics England PanelApp terms and conditions. Approval required for commercial use. - found here.

The Gene Ontology resource provides a computational representation of our current scientific knowledge about the functions of genes (or, more properly, the protein and non-coding RNA molecules produced by genes) from many different organisms, from humans to bacteria. It describes our knowledge of the biological domain with respect to three aspects: Molecular Function (molecular-level activities performed by gene products), Cellular Component (locations in cellular structures in which a gene product performs a function) and Biological Process (‘biological programs’ accomplished by multiple molecular activities, which are not equivalent to pathways). The three aspects are semi-disjoint: no is_a links connect terms across them, but other relations, such as part_of and regulates, do.

Example concepts: GeneClass, Protein, GeneClassRelation, GOAssociationHumanProtein

Licence: Creative Commons Attribution v4.0 - found here.
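
The relationship between the three Gene Ontology aspects described above (no is_a links across aspects, while part_of and regulates may cross them) can be sketched as a simple consistency check. The term identifiers and relations below are hypothetical placeholders, not real GO terms, and the function name is an assumption made for this example.

```python
# Term identifier -> aspect. The identifiers are placeholders, not real GO terms.
term_aspect = {
    "T:mf1": "molecular_function",
    "T:mf2": "molecular_function",
    "T:bp1": "biological_process",
    "T:bp2": "biological_process",
}

# (child term, relation type, parent term), again purely illustrative.
relations = [
    ("T:mf2", "is_a", "T:mf1"),
    ("T:bp2", "is_a", "T:bp1"),
    ("T:mf1", "part_of", "T:bp1"),
    ("T:bp2", "regulates", "T:mf2"),
]


def cross_aspect_links(relations, term_aspect):
    """Return the relations whose two terms belong to different aspects."""
    return [
        (child, relation, parent)
        for child, relation, parent in relations
        if term_aspect[child] != term_aspect[parent]
    ]


crossing = cross_aspect_links(relations, term_aspect)

# is_a links that cross aspects would contradict the rule stated above.
violations = [link for link in crossing if link[1] == "is_a"]
```

With this sample data, `violations` is empty and the crossing links are the part_of and regulates relations only.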

HGNC is responsible for approving unique symbols and names for human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication.

Example concepts: Gene, CodingGene, HumanGene, OrthologGene, OrthologyPrediction

Licence: HGNC - specific / no licence / permissive / free for all - found here.

The International Mouse Phenotyping Consortium (IMPC) is an international effort by 21 research institutions to identify the function of every protein-coding gene in the mouse genome. To achieve this, the IMPC is systematically switching off or ‘knocking out’ each of the roughly 20,000 genes that make up the mouse genome. Subsequently, the knockout mice undergo standardised physiological tests (phenotyping tests) across a range of biological systems in order to infer gene function, before the data is made freely available to the research community on the IMPC website.

Example concepts: Gene, Disease, Mouse Model, Disease Model Summary

Licence: Creative Commons Attribution v4.0 - found here.

IntOGen is a framework for automatic and comprehensive knowledge extraction based on mutational data from sequenced tumor samples from patients. The framework identifies cancer genes and pinpoints their putative mechanism of action across tumor types. It builds the results by running multiple cancer driver gene detection methods on hundreds of patient cohorts obtained from cBioPortal, pediatric cBioPortal, ICGC, TCGA, PCAWG, Hartwig Medical Foundation, Target, StJude and literature-gathered sequencing projects.

Example concepts: Cohort, Gene, Cancer, CompediumGeneCancerDriver

Licence: Creative Commons Universal v1.0 - no copyright - found here.

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a database resource for understanding high-level functions of biological systems, such as the cell, the organism and the ecosystem, from molecular-level information.

Example concepts: Disease, Pathway, Gene, GenePathwayAssociation, DiseasePathwayAssociation

Licence: KEGG - specific - found here.

Medical Subject Headings (MeSH) is the National Library of Medicine's controlled vocabulary used for indexing articles for PubMed and other NLM databases.

Example concepts: Descriptor

Licence: National Library of Medicine Terms and Conditions - found here.

The Monarch Disease Ontology (Mondo) is a semi-automatically constructed ontology that merges multiple disease resources into a single coherent ontology.

Example concepts: Disease, DiseaseIsA

Licence: Creative Commons Attribution v4.0 - found here.

OmniPath is a collection of literature-curated human and rodent signalling pathways. It aggregates over 50 sources in total, about 32 of them for protein-protein interactions (PPIs), including SIGNOR and BioGRID.

Example concepts: Protein, ProteinProteinInteraction

Licence: Omnipath - specific - found here.

The Open Targets Platform is a comprehensive and robust data integration resource for access to and visualisation of potential drug targets associated with disease. It brings together multiple data types and aims to assist users to identify and prioritise targets for further investigation. The Platform supports workflows starting from a target or disease, and shows the available evidence for target-disease associations. Target and disease profile pages showing specific information for both target (e.g. baseline expression) and disease (e.g. disease classification) are also available.

Example concepts: Target, Disease, DiseaseMapping, Pathway, TargetDiseaseAssociation

Licence: Open Targets - specific. As an aggregator they do not put any restrictions themselves - found here.

Orphanet is a unique resource, gathering and improving knowledge on rare diseases so as to improve the diagnosis, care and treatment of patients with rare diseases. Orphanet aims to provide high-quality information on rare diseases, including epidemiology, and to ensure equal access to knowledge for all stakeholders. Orphanet also maintains the Orphanet rare disease nomenclature (ORPHAcode), essential in improving the visibility of rare diseases in health and research information systems. Data are available via http://www.orphadata.org/

Example concepts: Disease, Gene, GeneRareDiseaseAssociation

Licence: Creative Commons Attribution v4.0 - found here.

Phenome-wide association studies (PheWAS) analyze many phenotypes in relation to a single genetic variant (or other attribute). This method was originally described using electronic medical record (EMR) data linked to the Vanderbilt DNA biobank, BioVU, but can also be applied to other richly phenotyped sets.

Example concepts: Gene, Phenotype, Phecode ICD10 mapping, GenePhenotypeAssociation

Licence: Creative Commons Attribution v4.0 - found here.
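
The idea behind a PheWAS, testing one variant against many phenotypes, can be illustrated with a toy scan that computes an odds ratio per phenotype. This is a deliberately simplified sketch: the function name, person identifiers and phenotype labels are hypothetical, and a real analysis would typically use regression with covariates and correct for multiple testing.

```python
def phewas_scan(carrier, phenotypes):
    """Compute an odds ratio per phenotype for a single variant.

    carrier: dict mapping person id -> True if the person carries the variant.
    phenotypes: dict mapping phenotype name -> set of person ids with it.
    Returns a dict of phenotype name -> odds ratio (None if undefined).
    """
    results = {}
    for name, cases in phenotypes.items():
        # 2x2 contingency table: carrier status versus phenotype status.
        a = sum(1 for p, has in carrier.items() if has and p in cases)
        b = sum(1 for p, has in carrier.items() if has and p not in cases)
        c = sum(1 for p, has in carrier.items() if not has and p in cases)
        d = sum(1 for p, has in carrier.items() if not has and p not in cases)
        results[name] = (a * d) / (b * c) if b * c else None
    return results


# Invented toy cohort: four carriers and four non-carriers.
carrier = {
    "p1": True, "p2": True, "p3": True, "p4": True,
    "p5": False, "p6": False, "p7": False, "p8": False,
}
phenotypes = {
    "phenotype_A": {"p1", "p2", "p3", "p5"},
    "phenotype_B": {"p1", "p2", "p5", "p6"},
}

results = phewas_scan(carrier, phenotypes)
```

In this toy cohort, phenotype_A is enriched among carriers (odds ratio 9.0) while phenotype_B is not (odds ratio 1.0).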

Reactome is an open-source, open-access, manually curated and peer-reviewed pathway database.

Example concepts: Pathway, Human Pathway, Disease, Gene, Pathway Hierarchy

Licence: Reactome-specific - found here.

The SIGnaling Network Open Resource (SIGNOR). The core of SIGNOR is a collection of approximately 12,000 manually annotated causal relationships between over 2,800 human proteins participating in signal transduction. Other entities annotated in SIGNOR are complexes, chemicals, phenotypes and stimuli.

Example concepts: Protein, ProteinProteinInteraction

Licence: Creative Commons Attribution-ShareAlike v4.0 - found here.

The Universal Protein Resource (UniProt) is a database of protein sequence and functional information, with many entries derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is a central repository of protein data created by combining the Swiss-Prot, TrEMBL and PIR-PSD databases.

Example concepts: Protein, Disease, Variant, InvolvementInDisease, GeneProteinVariantDiseaseAssiociation

Licence: Creative Commons Attribution v4.0 - found here.

If the dataset you are interested in is not listed or you need additional assistance in integrating your internal resource, we are here to help.