Presentation Overview:
Although a plethora of ontologies have been developed in a wide variety of domains, it is often difficult to measure progress in the field of applied ontology. In some domains there is a mindset that treats ontologies as being as arbitrary as software code, so that there is no point in evaluating them and there cannot possibly be any consensus on which ontologies to use. In other domains there is an abundance of ontologies but no understanding of their relationships, leading to a perception of continually reinventing the wheel. Far too often, the only criteria for selecting ontologies are political, not technical. If we proceed further down this road, we ultimately risk irrelevance. Against this viewpoint, I would offer an approach to ontology design that focuses on formalizing the intended semantics of an ontology, so that shareability and reusability are guaranteed.
Presentation Overview:
Many organizations face challenges in managing and analyzing data, especially when such data are obtained from multiple sources and created using diverse methods or protocols. Analyzing heterogeneous, structured datasets requires rigorous tracking of their interrelationships and provenance. This task has long been a Grand Challenge of data science and has more recently been formalized in the FAIR principles: that all data be Findable, Accessible, Interoperable and Reusable, both for machines and for people. Adherence to these principles is necessary for proper stewardship of information, for testing regulatory compliance, for measuring efficiency, and for effective reuse of data analysis frameworks. Interoperability and reusability are especially challenging to implement in practice, to the extent that scientists acknowledge a “reproducibility crisis” across many fields of study. We developed CORAL, a framework for organizing the large diversity of datasets that are generated and used by moderately complex organizations. CORAL features a web interface for bench scientists to upload and explore data, as well as a Jupyter notebook interface for data analysts, both backed by a common API. We describe the CORAL data model and associated tools, and discuss how they greatly facilitate adherence to all four of the FAIR principles.
Presentation Overview:
SARS-CoV-2 is the pathogen that causes COVID-19. It is commonly agreed that SARS-CoV-2 originated from an animal host; however, its exact origin remains unclear, as do the origins of other human coronaviruses, including SARS-CoV and MERS-CoV. This study focuses on the collection, ontological modeling and representation, and analysis of the hosts of various human coronaviruses, with an emphasis on SARS-CoV-2. Over 20 natural and laboratory animal hosts were found to be capable of hosting human coronaviruses. All the viruses and hosts were classified using the NCBITaxon ontology. The related terms were also imported into the Coronavirus Infectious Disease Ontology (CIDO), and the relations between human coronaviruses and their hosts were linked using axioms in CIDO. Our ontological classification of all the hosts also allowed us to hypothesize that human coronaviruses only use mammals as their hosts.
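As an illustration of the kind of virus-host axiom described above, here is a minimal sketch using the owlready2 Python library. All class and property names (Virus, Mammalia, has_host) are placeholders rather than actual CIDO or NCBITaxon identifiers, and the axiom only mirrors the shape of the linkage, not the ontology's exact content.

```python
# Minimal sketch of a host-range axiom with owlready2 (pip install owlready2).
# All names below are illustrative placeholders, not CIDO/NCBITaxon IDs.
from owlready2 import Thing, ObjectProperty, get_ontology

onto = get_ontology("http://example.org/cido-host-sketch.owl")

with onto:
    class Virus(Thing): pass
    class Mammalia(Thing): pass                # stands in for the NCBITaxon class
    class SARS_CoV_2(Virus): pass

    class has_host(ObjectProperty):            # stands in for the host relation used in CIDO
        domain = [Virus]

    # "SARS-CoV-2 has some mammalian host" -- the shape of the linking axiom.
    SARS_CoV_2.is_a.append(has_host.some(Mammalia))

onto.save(file="cido-host-sketch.owl", format="rdfxml")
```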
Presentation Overview:
Diagnosis of COVID-19 is critical to the control of the COVID-19 pandemic. Common diagnostic methods include symptom identification, chest imaging, serological tests, and RT-PCR; however, these methods differ in sensitivity and specificity. In this study, we ontologically represent different aspects of COVID-19 diagnosis using the community-based Coronavirus Infectious Disease Ontology (CIDO), an OBO Foundry library ontology. CIDO includes many new terms and also imports many relevant terms from existing ontologies. The high-level hierarchy and design patterns of CIDO are introduced to support COVID-19 diagnosis. Knowledge reported in the literature and in reliable resources such as the FDA website is ontologically represented. We modeled and compared over 20 SARS-CoV-2 RT-PCR assays, which target different gene markers in SARS-CoV-2. The sensitivity and specificity of the different methods are discussed.
Presentation Overview:
Medical practitioners record the condition status of a patient through qualitative and quantitative observations. The measurement of vital signs and molecular parameters in the clinic gives a complementary description of the abnormal phenotypes associated with the progression of a disease. The Clinical Measurement Ontology (CMO) is used to standardize annotations of these measurable traits. However, researchers have no way to describe how these quantitative traits relate to phenotype concepts in a machine-readable manner. Using the WHO clinical case report form standard for the COVID-19 pandemic, we modeled quantitative traits and developed OWL axioms to formally relate clinical measurement terms to anatomical entities, biomolecular entities, and phenotypes annotated with the Uber-anatomy ontology (Uberon), Chemical Entities of Biological Interest (ChEBI), and Phenotype and Trait Ontology (PATO) biomedical ontologies. The formal description of these relations allows interoperability between clinical and biological descriptions and facilitates automated reasoning for the analysis of patterns across quantitative and qualitative biomedical observations.
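To illustrate the shape such an OWL axiom can take, the following owlready2 sketch ties a measurement class to a quality that inheres in an anatomical entity. The classes and properties are placeholders standing in for CMO, PATO, and Uberon terms and for relations such as "inheres in"; they are not the identifiers or axioms actually used in this work.

```python
# Hedged sketch of a measurement-to-phenotype axiom; every class and property
# name is a placeholder, not an actual CMO/PATO/Uberon term or relation.
from owlready2 import Thing, ObjectProperty, get_ontology

onto = get_ontology("http://example.org/cmo-axiom-sketch.owl")

with onto:
    class ClinicalMeasurement(Thing): pass       # placeholder for a CMO term
    class Temperature(Thing): pass               # placeholder for a PATO quality
    class AnatomicalEntity(Thing): pass          # placeholder for an Uberon term
    class BodyTemperatureMeasurement(ClinicalMeasurement): pass

    class measures_quality(ObjectProperty): pass
    class inheres_in(ObjectProperty): pass

    # "A body temperature measurement measures a temperature quality that
    # inheres in some anatomical entity" -- the shape of the relating axiom.
    BodyTemperatureMeasurement.is_a.append(
        measures_quality.some(Temperature & inheres_in.some(AnatomicalEntity))
    )

onto.save(file="cmo-axiom-sketch.owl", format="rdfxml")
```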
Presentation Overview:
With the advances in Next Generation Sequencing (NGS) technologies, a huge volume of clinical genomic data has become available. Efficient exploitation of such data requires linkage to a patient's complete phenotype profile. Current resources providing disease-phenotype associations are not comprehensive, and they often do not cover all of the diseases in OMIM and particularly in ICD10, which are the primary terminologies used in clinical settings. Here, we propose a text-mining system that utilizes semantic relations in the phenotype ontologies together with statistical methods to extract disease-phenotype associations from the literature. We compare our findings against established disease-phenotype associations and also demonstrate the system's utility in covering mouse gene-disease associations from Mouse Genome Informatics (MGI). Such associations serve as necessary building blocks for understanding underlying disease mechanisms and for developing or repurposing drugs.
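The abstract does not spell out the statistical method, so the sketch below shows only a generic baseline for this kind of literature mining: scoring disease-phenotype pairs by how much more often they co-occur in documents than chance would predict (a PMI-style score). The documents and counts are toy placeholders, and in practice the scores would also be propagated along the phenotype ontology's subclass relations, which is not shown here.

```python
# Toy PMI-style co-occurrence scoring of disease-phenotype pairs; a generic
# illustration, not necessarily the method used by the system described above.
from collections import defaultdict
from itertools import product
from math import log

# Each "document" is reduced to the disease and phenotype terms recognised in
# it (the term-recognition step is assumed); the counts are made up.
documents = [
    {"diseases": {"OMIM:277900"}, "phenotypes": {"HP:0001394"}},  # Wilson disease / Cirrhosis
    {"diseases": {"OMIM:277900"}, "phenotypes": {"HP:0001394"}},
    {"diseases": {"OMIM:168600"}, "phenotypes": {"HP:0001337"}},  # Parkinson disease / Tremor
]

def pmi_scores(documents):
    """Pointwise mutual information between co-mentioned diseases and phenotypes."""
    n = len(documents)
    d_count, p_count, pair_count = defaultdict(int), defaultdict(int), defaultdict(int)
    for doc in documents:
        for d in doc["diseases"]:
            d_count[d] += 1
        for p in doc["phenotypes"]:
            p_count[p] += 1
        for d, p in product(doc["diseases"], doc["phenotypes"]):
            pair_count[(d, p)] += 1
    return {(d, p): log((c / n) / ((d_count[d] / n) * (p_count[p] / n)))
            for (d, p), c in pair_count.items()}

for (d, p), score in sorted(pmi_scores(documents).items(), key=lambda x: -x[1]):
    print(f"{d}\t{p}\t{score:.2f}")
```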
Presentation Overview:
In the poorly studied field of physician suicide, various factors can contribute to misinformation or information distortion, which in turn can influence evidence-based policies and the prevention of suicide in this unique population. Here, we report on the use of nanopublications as a scientific publishing approach to establish a citation network of claims, drawn from a variety of media, concerning the rate of suicide among US physicians. Our work integrates these various claims and enables the verification of non-authoritative assertions, thereby better equipping researchers to advance evidence-based knowledge and make informed statements in advocating for physician suicide prevention.
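For readers unfamiliar with the nanopublication model, the sketch below assembles the standard head, assertion, provenance, and publication-info named graphs with the rdflib Python library. The example.org IRIs, the claim wording, and the choice of provenance properties are placeholders for illustration only; they are not the nanopublications actually produced in this work.

```python
# Minimal rdflib sketch of a nanopublication's named graphs; all example.org
# IRIs and the claim text are placeholders.
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, PROV, RDF

NP = Namespace("http://www.nanopub.org/nschema#")
EX = Namespace("http://example.org/np/")

ds = Dataset()
head, assertion = ds.graph(EX.head), ds.graph(EX.assertion)
provenance, pubinfo = ds.graph(EX.provenance), ds.graph(EX.pubinfo)

nanopub = EX.nanopub1
head.add((nanopub, RDF.type, NP.Nanopublication))
head.add((nanopub, NP.hasAssertion, EX.assertion))
head.add((nanopub, NP.hasProvenance, EX.provenance))
head.add((nanopub, NP.hasPublicationInfo, EX.pubinfo))

# The claim itself (placeholder wording and predicate).
assertion.add((EX.claim1, RDF.value,
               Literal("placeholder claim about the US physician suicide rate")))

# Where the assertion was derived from, and who curated the nanopublication.
provenance.add((EX.assertion, PROV.wasDerivedFrom, URIRef("https://example.org/media-article")))
pubinfo.add((nanopub, DCTERMS.creator, URIRef("https://example.org/curator")))

print(ds.serialize(format="trig"))
```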
Presentation Overview:
Motivation:
Today, we have an enormous amount of biomedical data, and both its size and its complexity have been increasing over time. Implementation of standards is one of the key drivers in life sciences research as well as in technology transfer. More specifically, standards enable data accessibility, sharing, and integration, and therefore facilitate data harnessing and accelerate research and the transfer of innovation.
The life sciences community has widely developed and used Semantic Web technology standards for data representation and sharing. However, given the success of unsupervised machine learning methods such as Word2Vec and BERT, there is a need for new standards for sharing (pre-trained) vector space embeddings of entities, to facilitate data reusability and method development. Motivated by this, we propose data and metadata standards for the FAIR distribution of vector embeddings and demonstrate the use of these standards in Bio2Vec, a platform providing flexible, reliable, and standard-compliant data representation, sharing, integration, and analysis.
Availability:
The proposed metadata standard and an example are available in the ShEx format at Zenodo.
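The ShEx schema on Zenodo is the authoritative specification. Purely to illustrate the kind of information a FAIR embedding distribution needs to carry, a hypothetical metadata record might look like the following; every field name here is a guess, not the actual Bio2Vec standard.

```python
# Hypothetical embedding-metadata record; field names are illustrative guesses,
# not the Bio2Vec/ShEx schema published on Zenodo.
import json

embedding_metadata = {
    "@id": "https://example.org/embeddings/go-2021-v1",
    "title": "Example entity embeddings (placeholder)",
    "entityNamespace": "http://purl.obolibrary.org/obo/GO_",   # what the vectors describe
    "method": "Word2Vec (skip-gram)",                          # how they were trained
    "dimension": 200,                                          # length of each vector
    "trainingCorpus": "https://example.org/corpus",            # provenance of the training data
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "format": "text/tab-separated-values",
        "downloadURL": "https://example.org/embeddings/go-2021-v1.tsv",
    },
}

print(json.dumps(embedding_metadata, indent=2))
```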
Presentation Overview:
Genome-wide association studies (GWAS) have been widely used to identify potentially causative variants of a genetic disease or trait given patient phenotypes. However, they generally cannot present the complete picture, particularly regarding how the studied trait relates to other similar traits, because often not all of the available phenotype information is exploited in the analyses.
Here, we propose to use Ontology-Wide Genome Association Studies (OWAS) to complete the phenotype profiles of diseases and perform GWAS on the UK Biobank.
More specifically, with OWAS we utilize the phenotype information that exists in the literature as well as in semantic resources to expand a GWAS to cases that are not explicitly associated with the phenotypes. Our initial results show that our approach has the potential to increase the statistical power of GWAS and to identify associations for phenotypes that have not been explicitly observed.
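The central move, as described above, is to let participants annotated with a more specific phenotype count as cases for its ancestor terms. The sketch below illustrates only that propagation step, with placeholder term identifiers (not real HPO terms) and toy annotations; it is not the OWAS pipeline itself.

```python
# Sketch of propagating phenotype annotations up a hierarchy so that implicit
# cases can enter an association test; all identifiers and data are placeholders.
import networkx as nx

# Edges point from a term to its subclasses (a tiny toy hierarchy).
hierarchy = nx.DiGraph([
    ("HP:liver_abnormality", "HP:cirrhosis_like_term"),
    ("HP:liver_abnormality", "HP:hepatomegaly_like_term"),
])

# Explicit phenotype annotations per biobank participant (toy data).
annotations = {
    "participant1": {"HP:cirrhosis_like_term"},
    "participant2": {"HP:hepatomegaly_like_term"},
    "participant3": set(),
}

def cases_for(term):
    """Participants counted as cases for `term`: those annotated with the term
    itself or with any term below it in the hierarchy."""
    implied = {term} | nx.descendants(hierarchy, term)
    return {p for p, terms in annotations.items() if terms & implied}

# No participant is explicitly annotated with the broad term, but both
# participant1 and participant2 become cases for it after propagation.
print(cases_for("HP:liver_abnormality"))   # {'participant1', 'participant2'}
```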
Presentation Overview:
Human pluripotent stem cells (PSC) are immortal, represent the genotype of the donor, and can differentiate into all cell types of the human body. These features establish their enormous potential for modelling diseases and tissues in vitro, for drug and toxicity testing, and for regenerative medicine. To translate this potential into reality, large numbers of PSC lines are being generated from a wide spectrum of donors to make them available for these diverse applications. For users to identify suitable PSC lines, information about the donors, cell generation, characterization, and quality is essential. The human pluripotent stem cell registry hPSCreg contains more than 3,000 cell lines that are richly annotated. To make the hPSCreg resource more accessible and interoperable, we developed hPSCreg-CLO, a new branch of the Cell Line Ontology (CLO) that represents the various hPSC lines from hPSCreg. hPSC-specific design patterns were generated and used to support computer-assisted ontology development. hPSCreg-CLO includes over 2,400 hPSC lines and their related information, such as cell donors, anatomical entities, and original cell types. DL queries were performed to demonstrate the query capability of hPSCreg-CLO. hPSCreg-CLO will be further integrated with the hPSCreg project and will support database data integration and advanced analyses.
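To show what a DL query over such an ontology looks like, here is a hedged owlready2 sketch in which a defined class plays the role of the query "induced pluripotent stem cell line derived from a fibroblast". The class and property names are placeholders rather than actual CLO or hPSCreg-CLO terms, and running the reasoner requires Java; the actual DL queries used in the work may differ.

```python
# Hedged DL-query sketch with owlready2; names are placeholders, not CLO terms,
# and sync_reasoner() needs a Java runtime for the bundled reasoner.
from owlready2 import Thing, ObjectProperty, get_ontology, sync_reasoner

onto = get_ontology("http://example.org/hpscreg-clo-sketch.owl")

with onto:
    class CellLine(Thing): pass
    class InducedPluripotentStemCellLine(CellLine): pass
    class Fibroblast(Thing): pass                      # placeholder original cell type
    class derives_from(ObjectProperty): pass

    # A defined class acting as the DL query
    # "InducedPluripotentStemCellLine and (derives_from some Fibroblast)".
    class FibroblastDerivedLine(Thing):
        equivalent_to = [InducedPluripotentStemCellLine & derives_from.some(Fibroblast)]

    # Two placeholder individuals: a fibroblast and a line derived from it.
    fb = Fibroblast("donor1_fibroblast")
    line = InducedPluripotentStemCellLine("example_line", derives_from=[fb])

sync_reasoner()                                 # classify individuals
print(list(FibroblastDerivedLine.instances()))  # expected to contain example_line
```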
Presentation Overview:
Although knowledge graphs (KGs) are used extensively in biomedical research to model complex phenomena, many KG construction methods remain largely unable to account for the use of different standardized terminologies or vocabularies, are often difficult to use, and perform poorly as the size of the KG increases in scale. We introduce PheKnowLator (Phenotype Knowledge Translator), a novel KG framework and fully automated Python 3 library explicitly designed for optimized construction of semantically-rich, large-scale biomedical KGs. To demonstrate the functionality of the framework, we built and evaluated eight different parameterizations of a large semantic KG of human disease mechanisms. PheKnowLator is available at: https://github.com/callahantiff/PheKnowLator.
Presentation Overview:
The rapid increase in scientific knowledge has never been more obvious than in the wake of the emergence of the virus that causes COVID-19, with hundreds of new research articles published every week as scientists and medical researchers rush to share knowledge about predicting disease spread and about the management or treatment of the disease. This has left scientists scrambling to navigate and synthesise large amounts of information. We have been developing a system we call COVID-SEE (Scientific Evidence Explorer), which leverages natural language processing methods to structure key information in COVID-19-related literature and facilitates navigation of the literature through a relational lens. I will introduce our approach and discuss the many ways ontologies enable and support the project.
Presentation Overview:
Many protein function databases are built on automated or semi-automated curation and can contain various annotation errors. The correction of such misannotations is critical to improving the accuracy and reliability of the databases. We propose a new approach to detect potentially incorrect Gene Ontology (GO) annotations by comparing the ratio of annotation rates (RAR) for the same GO term across different taxonomic groups, where those with a relatively low RAR usually correspond to incorrect annotations. As an illustration, we applied the approach to 20 commonly studied species in two recent UniProt-GOA releases and identified 250 potential misannotations in the 2018-11-6 release, of which only 25% were corrected in the 2019-6-3 release. Importantly, 56% of the misannotations are “Inferred from Biological aspect of Ancestor (IBA)”, i.e. reviewed computational annotations based on phylogenetic analysis. This is in contrast to previous observations that attributed misannotations mainly to “Inferred from Sequence or structural Similarity (ISS)”, probably reflecting a shift in error sources due to new developments in function annotation databases. The results demonstrate a simple but efficient misannotation detection approach that is useful for large-scale comparative protein function studies. The code and list of identified misannotations are available at https://zhanglab.ccmb.med.umich.edu/RAR.
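To make the RAR idea concrete, a crude reading of it is sketched below: for one GO term, compute the fraction of annotated proteins in each taxonomic group, normalise by the best-covered group, and flag groups whose ratio is unusually low. The data, the grouping, and the threshold are placeholders, and the authors' exact definition may differ; their code is available at the URL above.

```python
# Crude sketch of ratio-of-annotation-rates (RAR) screening; toy data and an
# arbitrary threshold, not the authors' exact definition (see the URL above).
import pandas as pd

# One row per (protein, GO term) annotation, tagged with a taxonomic group.
goa = pd.DataFrame({
    "go_term":     ["GO:0005634"] * 8,
    "taxon_group": ["mammals"] * 4 + ["fungi"] * 3 + ["plants"],
    "protein":     ["P1", "P2", "P3", "P4", "F1", "F2", "F3", "T1"],
})

# Total number of annotated proteins per group (the rate's denominator).
proteins_per_group = {"mammals": 1000, "fungi": 900, "plants": 950}

def rar(goa, proteins_per_group, term):
    """Per-group annotation rate for `term`, relative to the best-covered group."""
    counts = goa[goa.go_term == term].groupby("taxon_group")["protein"].nunique()
    rates = {g: counts.get(g, 0) / n for g, n in proteins_per_group.items()}
    top = max(rates.values())
    return {g: r / top for g, r in rates.items()}

ratios = rar(goa, proteins_per_group, "GO:0005634")
flagged = {g for g, r in ratios.items() if r < 0.3}   # groups whose annotations to re-check
print(ratios)    # {'mammals': 1.0, 'fungi': ~0.83, 'plants': ~0.26}
print(flagged)   # {'plants'}
```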
Presentation Overview:
The Open Biological and Biomedical Ontology (OBO) Foundry is a collective of ontology developers committed to collaboration and shared principles. The OBO Foundry mission is to develop a family of logically well-formed, scientifically accurate, and interoperable ontologies. Participants voluntarily adhere and contribute to the development of an evolving set of principles, including open use, collaborative development, non-overlapping and strictly scoped content, and common syntax and relations. OBO provides services to the community such as hosting persistent URLs and ontology files, recording metadata, and supporting discussion forums and regular calls.
We developed a set of key top-level ontology terms that unify the many OBO Foundry ontologies, termed the Core Ontology for Biology and Biomedicine (COB). COB simplifies the identification of ontology terms and navigation across OBO projects. It includes logic that links ontologies together, allowing interoperability problems to be detected and corrected. Related terms from multiple ontologies can be viewed at the same time, illustrating how OBO ontologies and their terms are related and helping to ensure interoperability.
COB is still in active development; we are eager to obtain community feedback. We want to collect actionable suggestions on what users most want in COB and what this community would find most useful to their daily practices.
Presentation Overview:
An Immune Exposure is the process by which components of the immune system first encounter a potential trigger. The ability to consistently describe the details of the Immune Exposure process was needed by data resources responsible for housing scientific data related to the immune response. This need was met through the development of a structured model for Immune Exposures. The model was created during curation of the immunology literature, resulting in a robust model capable of meeting the requirements of such data. We present this model in the hope that overlapping projects will adopt and/or contribute to this work.
Presentation Overview:
Background: Variation graphs are a novel way to describe genomic variation across a population. Variation graph tools represent a significant improvement in mitigating reference bias compared to the linear-reference ecosystem. However, existing toolkits focus on algorithms for processing pangenome graphs and have limited capabilities for integrating biological annotations and providing an interface for large-scale visualization.
Description: Interpreting the biological meaning of variation graphs by integrating various kinds of annotations for further analysis requires FAIR data interchange formats. Borderless technology such as the Semantic Web allows variation graph toolkits and pangenome tools to focus on their core competence while allowing bioinformaticians to integrate, analyze, and visualize the data.
Result: We demonstrate how a graphical pangenome can be represented with pangenome ontologies and queried using a standard declarative graph query language. We then show how the vg RDF and Pantograph RDF represent data ready for the Semantic Web, and how existing data from the INSDC and UniProt can be combined, without conversion or loss of information, into a single Variation and Knowledge Graph.
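As a small illustration of querying a pangenome represented in RDF, the sketch below builds a two-node graph and runs a SPARQL query over it with rdflib. The vg namespace and the links property follow our reading of the vg RDF vocabulary and should be checked against the published ontology; the node IRIs and sequences are placeholders.

```python
# Tiny sketch: a two-node pangenome graph in RDF plus a SPARQL query over it.
# The vg property/class names should be checked against the published vg RDF
# ontology; all data and example.org IRIs are placeholders.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

VG = Namespace("http://biohackathon.org/resource/vg#")
EX = Namespace("http://example.org/pangenome/")

g = Graph()
g.add((EX.node1, RDF.type, VG.Node))
g.add((EX.node1, RDF.value, Literal("ACGT")))     # node sequence
g.add((EX.node2, RDF.type, VG.Node))
g.add((EX.node2, RDF.value, Literal("TTGA")))
g.add((EX.node1, VG.links, EX.node2))             # edge between the two nodes

# Declarative query: all node sequences reachable from node1 via link edges.
query = """
PREFIX vg:  <http://biohackathon.org/resource/vg#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.org/pangenome/>
SELECT ?node ?seq WHERE {
    ex:node1 vg:links+ ?node .
    ?node rdf:value ?seq .
}
"""
for row in g.query(query):
    print(row.node, row.seq)                      # .../node2 TTGA
```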
Presentation Overview:
RGD (https://rgd.mcw.edu) is a multi-species knowledgebase which provides a substantial corpus of genomic, genetic, phenotypic and disease-related data and an innovative suite of tools for analyzing these data. Researchers can leverage cross-species manual annotations from RGD and annotations imported from external sources to search for an appropriate model. As an example, a researcher studying Wilson disease can find a list of associated genes using RGD's OLGA tool. An integrated toolbox facilitates submission of gene lists to other analysis tools to explore annotations across ontologies and across species. In analyses related to Wilson disease, ATP7B is one gene which commonly appears. The association between ATP7B and Wilson disease is well-documented at RGD via disease and phenotype annotations and associated pathogenic variants. Links on the gene page provide access to data for other species, such as an extensive list of mouse phenotypes. For rat, RGD's PhenoMiner tool provides related quantitative measurement data. RGD's strain record details a large Atp7b deletion in the LEC/Hok strain, a Wilson disease model. RGD's Variant Visualizer provides functionality to explore pathogenic or damaging variants for human, rat and dog.