Krzysztof J. Kochut: research overview

Research

Some of my current projects

mOntage

Lead students: Shima Dastgheib (help from Arsham Mesbah)

mOntage is a system for composing domain ontologies from fragments of other ontologies (selected resources or parts of selected resources) and from other data sources, such as relational databases and CSV and XML files. The system allows an ontology engineer to define the schema (the TBox, including all classes and their hierarchies, properties, and other elements) and then map the classes and properties onto other information sources. The maps are expressed as SPARQL CONSTRUCT queries, whose graph templates specify how the new domain ontology should be populated (its ABox). The new ontology can be thought of as a montage of fragments of other ontologies and other data sources. mOntage includes an ontology population module, which creates a population plan (an ordering of the CONSTRUCT maps) and then executes the maps in that order to populate the ontology. Ontologies created using mOntage can be repopulated at regular intervals, or whenever any of the underlying data sources is updated.
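One way to picture a population plan is as a dependency ordering of the maps. The sketch below is only an illustration under that assumption: the map names, the `produces`/`requires` fields, and the `population_plan` function are hypothetical and are not taken from mOntage itself. It orders each map after the maps that populate the classes it reads.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical CONSTRUCT maps. "produces" lists the classes a map populates;
# "requires" lists classes whose individuals must already exist.
maps = {
    "diseases_from_rdf": {"produces": {"Disease"}, "requires": {"Mutation"}},
    "genes_from_csv": {"produces": {"Gene"}, "requires": set()},
    "mutations_from_db": {"produces": {"Mutation"}, "requires": {"Gene"}},
}

def population_plan(maps):
    """Return map names ordered so that producers run before consumers."""
    # Index: class name -> names of the maps that populate it.
    producers = {}
    for name, spec in maps.items():
        for cls in spec["produces"]:
            producers.setdefault(cls, set()).add(name)
    # Graph: map name -> set of maps that must run first.
    graph = {name: set() for name in maps}
    for name, spec in maps.items():
        for cls in spec["requires"]:
            graph[name] |= producers.get(cls, set()) - {name}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())

plan = population_plan(maps)
print(plan)
```

Repopulating the ontology would then amount to running the maps again in the same order.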

ProKinO

Lead students: Shima Dastgheib and Daniel McSkimming (Dr. Natarajan Kannan's lab)
Former students: Gurinder Gosal

ProKinO is a joint project with the lab of Dr. Natarajan Kannan from Biochemistry and Molecular Biology at UGA. Protein kinases are an important, diverse family of enzymes that are genomically altered in many human cancers. ProKinO is an ontology that captures knowledge and data related to protein kinases. Its schema (TBox) covers kinase genes, domains and classification (known as the Kinome Tree), sequences, mutations, structure, functional features, reactions and pathways, as well as related diseases. ProKinO also includes a large amount of data, represented as ontology individuals (ABox). The data is acquired from a variety of sources, including UniProt, KinBase, COSMIC, and Reactome.

The knowledge and data represented in ProKinO allow biologists to formulate and execute integrative, hypothesis-driven queries. We have written a software system that automatically repopulates ProKinO at regular intervals, or whenever any of its data sources is updated to a new version. A Web-based ontology browser for ProKinO is available for public access.

Ontology-aided Text Analysis

Lead students: Mehdi Allahyari
Former students: Maciej Janik

This project has a few sub-projects, including Text Classification, Topic Labeling, and Topic Identification.

Text Classification

"Traditional" text classification methods require a training set of pre-classified documents to train a classifier. In this project, we have been working towards classification methods that do not require training sets and instead rely entirely on an ontology to provide the thematic knowledge necessary for classification.

Topic Labeling

This project focuses on using ontologies to provide better models in the family of Latent Dirichlet Allocation (LDA). Subprojects use ontologies to improve topic labeling and topic coherence, among other tasks.

Kinase Mutation Impact Text Mining

Lead students: Bhargabi Chakrabarti, Reshmi De, Daniel McSkimming (Dr. Natarajan Kannan's lab), Shima Dastgheib, Anish Narayanan (Dr. Natarajan Kannan's lab)

The goal of this project is to extract, from full-text scientific articles, all information about the impact of mutations in protein kinases. We have been processing all full-text articles available from PubMed Central (nearly 1 million articles that are available for bulk download). The project includes two subsystems (and hence, subprojects):

KiMIner (lead student: Bhargabi Chakrabarti)

The Kinase Mutation Impact Miner (KiMIner) performs the actual text mining task. This subsystem (1) identifies sentences with potentially useful information about kinase mutation impacts, (2) performs NLP-based phrase structure and dependency structure analysis, (3) identifies dependency patterns commonly used in describing mutation impacts, and (4) formulates mutation impact predictions. Mutations, as well as kinase names and synonyms, are obtained from the ProKinO ontology.

CuraMI (lead student: Reshmi De)

CuraMI is a curatorial environment for reviewing mutation impact predictions generated by KiMIner. The curatorial workflow requires that every mutation impact prediction be approved by at least two curators (disagreements between curators are resolved by additional assessments from more senior curators). Furthermore, curatorial assessments that overturn predictions created by KiMIner are frequently used to create and store additional dependency patterns. These new patterns are later used by KiMIner to generate more accurate predictions.
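The approval rule above can be sketched as a small decision function. This is a simplified illustration, not CuraMI's actual code: the `Assessment` class, the status strings, and the choice to let the most recent senior assessment settle a disagreement are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assessment:
    curator: str
    approves: bool
    senior: bool = False  # assumption: seniority is a simple flag

def review_status(assessments, required=2):
    """Classify a prediction from the assessments collected so far."""
    regular = [a.approves for a in assessments if not a.senior]
    senior = [a.approves for a in assessments if a.senior]
    if len(set(regular)) == 2:  # regular curators disagree
        if not senior:
            return "needs senior review"
        # Assumption: the latest senior assessment resolves the disagreement.
        return "approved" if senior[-1] else "rejected"
    if len(regular) >= required:  # enough curators, all in agreement
        return "approved" if regular[0] else "rejected"
    return "pending"

votes = [Assessment("c1", True), Assessment("c2", False)]
print(review_status(votes))
print(review_status(votes + [Assessment("s1", True, senior=True)]))
```

A rejected prediction is the case in which, per the description above, a new dependency pattern might be recorded for KiMIner.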

Once all mutation impacts have been extracted and approved, they will be included in our ProKinO ontology. We expect to finish text mining all articles and curating the extracted mutation impacts in late 2015 or early 2016.

Qrator

Lead students: Matthew Eavenson (help from Amitabh Priyadarshi)

Qrator is a Web-based curatorial environment for reviewing and approving glycan structures to be included in the GlycO ontology. GlycO is intended to hold high-quality information and to include only verified glycan structures. Qrator implements a curatorial workflow in which a scientist who has identified a new glycan structure can submit it using GLYDE-II encoding, or create its representation using a provided glycan builder. In all user interactions, Qrator provides a graphical representation of glycans, consistent with the scientist's expectations. First, the new structure is matched against a canonical glycan tree of a suitable type (a union of all previously approved structures of this type, e.g., N-glycans or O-glycans) to provide an initial evaluation of how well it conforms to the known characteristics of the glycan class. Subsequently, the scientist selects the best fit to the GlycO canonical tree and submits the structure for review. Expert glycobiologists acting as curators are then called on to review and evaluate the new structure. Curators are presented with a graphical representation of the candidate structure in which any features that are not consistent with the canonical representation in GlycO are highlighted. If no inconsistencies are found, the curator can decide to accept or reject the candidate structure based on their personal knowledge of glycan structure and metabolism. If there are no problems, the structure is approved for inclusion in the GlycO ontology.
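The matching step can be illustrated with a toy model in which a glycan is a nested dictionary of residues and the canonical tree is the union of approved structures. This is a rough sketch only: the dictionary encoding, the residue labels, and the `merge` and `inconsistencies` functions are assumptions for illustration and do not reflect GLYDE-II or Qrator's internal representation.

```python
def merge(canonical, structure):
    """Add one approved structure to the canonical tree (union of branches)."""
    for residue, children in structure.items():
        merge(canonical.setdefault(residue, {}), children)
    return canonical

def inconsistencies(canonical, candidate, path=()):
    """Return paths to candidate residues that the canonical tree lacks."""
    found = []
    for residue, children in candidate.items():
        here = path + (residue,)
        if residue not in canonical:
            found.append(here)  # report the root of the unknown branch
        else:
            found.extend(inconsistencies(canonical[residue], children, here))
    return found

# Illustrative labels only: residue name plus linkage to the parent.
approved = [
    {"GlcNAc": {"GlcNAc(b1-4)": {"Man(b1-4)": {"Man(a1-3)": {}, "Man(a1-6)": {}}}}},
    {"GlcNAc": {"Fuc(a1-6)": {}, "GlcNAc(b1-4)": {"Man(b1-4)": {}}}},
]
canonical = {}
for structure in approved:
    merge(canonical, structure)

candidate = {"GlcNAc": {"GlcNAc(b1-4)": {"Man(b1-4)": {"Man(a1-3)": {}, "Xyl(b1-2)": {}}}}}
flagged = inconsistencies(canonical, candidate)
print(flagged)
```

The flagged paths correspond to the features a curator would see highlighted; an empty list means the candidate fits entirely within the canonical tree.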

Process Constraints Modeling and Execution

Lead students: Shasha (Amy) Liu
Former students: Manuel Correa, Anuj Shetye

BPMN and BPEL are two of the most commonly used languages for the specification of processes. However, neither language is equipped to model constraints and other non-functional requirements within process specifications. We have created a Process Constraint Language (PCL), which can be used to express a variety of constraints imposed on normal process definitions (e.g., in BPMN or BPEL). PCL is somewhat similar to the Object Constraint Language (OCL). Expressions formulated in PCL may refer to constraint attributes and operations, which are defined in an extensible Process Constraint Ontology (ProContO). ProContO allows process designers to define a common vocabulary, such as operations to compute a user's geospatial proximity to other users or places. Constraints specified in PCL are executable. While executing a process instance, a process engine (BPMN- or BPEL-based) evaluates the constraints and, if needed, throws exceptions to handle any constraint that has not been satisfied.
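The evaluate-then-throw behavior can be sketched in plain Python. The example does not reproduce PCL syntax or any ProContO definitions: `ConstraintViolation`, `distance_km`, `within`, and `run_task` are hypothetical names, and the proximity operation is implemented here with the standard haversine formula as a stand-in for a vocabulary operation.

```python
import math

class ConstraintViolation(Exception):
    """Raised when a constraint attached to a task is not satisfied."""

def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))  # 6371 km: mean Earth radius

def within(km):
    """Build a constraint: the user must be within `km` of the place."""
    return lambda ctx: distance_km(ctx["user"], ctx["place"]) <= km

def run_task(name, ctx, constraints):
    """Check every constraint before the task runs; raise on the first failure."""
    for label, check in constraints:
        if not check(ctx):
            raise ConstraintViolation(f"{name}: constraint '{label}' not satisfied")
    return f"{name} completed"

constraints = [("user near place", within(10))]
print(run_task("deliver", {"user": (0.0, 0.0), "place": (0.0, 0.05)}, constraints))
try:
    run_task("deliver", {"user": (0.0, 0.0), "place": (0.0, 1.0)}, constraints)
except ConstraintViolation as err:
    print("handled:", err)
```

In a real engine the exception would presumably be routed to the handlers defined in the process specification rather than caught inline.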

Dr. Krzysztof J. Kochut