Existing methods: Semantic Similarity on Ontologies

When using ontologies for annotation and query, a common concern is how similar two terms in the ontology vocabulary are. Such a metric facilitates searching ontology-based databases and knowledge bases, and provides insight into the quality of the ontology itself. Many methods have been proposed to calculate the semantic similarity of a pair of terms in an ontology. This post explores how they were designed, along with their strengths and weaknesses.

Background

What is an ontology

An official definition of an ontology is a set of concepts and categories in a subject area or domain that shows their properties and the relations between them. In other words, it is a group of ideas with a defined structure, designed for a particular domain or field.

A parent-child relationship, or IS_A relationship, is the most common type within an ontology. It describes one term as a more specific version of another term. For example, Abnormality of body height has a child Short stature. This means that Short stature describes a concept that is an Abnormality of body height, but more specific.

With the IS_A relationship, ontologies usually have a tree-like structure. Most of the time it is not strictly a tree, because a child may have more than one parent; it is a Directed Acyclic Graph (DAG), a directed graph with no cycles. However, the ontology may still be called a “tree” later in this post for simplicity. Keep in mind this is not an accurate description.
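To make this concrete, here is a tiny sketch using networkx (the same library used in the code later in this post), with a made-up ontology fragment:

```python
import networkx as nx

# A tiny hypothetical ontology fragment; edges point from parent (more
# general) to child (more specific). "Short stature" has two parents here,
# which is why the structure is a DAG rather than a strict tree.
g = nx.DiGraph()
g.add_edges_from([
    ("Phenotypic abnormality", "Abnormality of body height"),
    ("Phenotypic abnormality", "Growth abnormality"),
    ("Abnormality of body height", "Short stature"),
    ("Growth abnormality", "Short stature"),
])

assert nx.is_directed_acyclic_graph(g)   # no cycles
assert not nx.is_tree(g)                 # not a tree: one node has two parents
```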

What is semantic similarity

Semantic similarity in an ontology refers to the similarity between two terms in a way that makes sense in the field the ontology is concerned with. For example, difficulty running may be similar to limitation of joint mobility: the two could be correlated or appear together, simply because it makes sense that a subject who has one may well have the other too.

However, the “similarity” between terms is hard to quantify. Not only may two similar terms sit far apart in the ontology tree, but it is also difficult to define precisely how similar two things are when the concept itself cannot be described mathematically. For two signals, we can define a metric that produces a score; for two images, we can compute, say, a colour indicator that shows how much area is in which colour. But how would one define the nature of a concept like Phenotypic abnormality?

Thus, in most cases, semantic similarity is defined using a corpus: another information set that annotates the ontology.

Semantic similarity based on information content

What is information content

Information content is a way of quantifying how much information one expression carries. It is calculated from the probability of the occurrence of an event. For example, if I say, “The sun came up from the east today,” this is no surprise, because it happens daily. The probability of this event (as observed so far) is 100%, so my statement brings no new information to the observation of this event; it has an information content of 0. However, if I state, “throwing a die, I just got a 4,” this event does not occur every time, so my statement does carry new information and refines our observation of the die-throwing event.

Mathematically, the information content is defined as:

IC(x) = -\log(P(x)) = \log\left(\frac{1}{P(x)}\right)

Where P(x) is the probability of x, which, in statistics, is often estimated as the number of occurrences of x divided by the total number of events.
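As a quick sketch of the definition in Python:

```python
import math

def information_content(p: float) -> float:
    """IC(x) = log(1 / P(x)): the rarer the event, the more information."""
    return math.log(1 / p)

# "The sun came up" (P = 1) carries no information
assert information_content(1.0) == 0.0
# Rolling a 4 on a die (P = 1/6) is more informative than a coin flip (P = 1/2)
assert information_content(1 / 6) > information_content(1 / 2)
```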

Annotations and frequencies

In ontologies, there are annotations, or mappings between two sets of concepts. For example, HGNC gene symbols annotate HPO terms, representing the relationship between a phenotype and some genes (for example, those genes may control the phenotype). This can be treated as an event: assuming all annotations are equal in weight (or “annotation strength,” i.e., how related a gene is to the phenotype), randomly picking one annotation gives you a phenotype and a gene. By counting the occurrences of a phenotype in this way, we get a sense of how much gene-related information can be extracted when a phenotype is given: terms with more annotations carry less information about which genes may be in play.

Specifically, the annotation count usually includes the annotations on children, grandchildren, and so on: since a child node is a more specific version of its parent, an annotation on the child also applies to the parent. The IC of a node can therefore be defined as:

IC(n) = \log\left(\frac{Total\ annotation\ count}{annotation(n\ and\ children)}\right)

Most informative common ancestor

One important concept is the Most Informative Common Ancestor (MICA): the common ancestor of two DAG nodes with the highest information content. It measures how far down the tree the two terms diverged. For example, if the MICA is a direct ancestor of the two nodes, they are essentially siblings and may be highly similar; on the other hand, if the MICA is very close to the root node, the two nodes may not even be the same type of concept. For example, one may be a phenotype abnormality, while the other could be a quantifier such as “larger than.”

In particular, if node A is an ancestor of node B, the MICA is A itself; if nodes A and B are the same node, the MICA is likewise A (or B) itself.
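A minimal sketch of MICA lookup on a hand-built DAG with made-up IC values (the full HPO version appears in the code below):

```python
import networkx as nx

# Hypothetical mini-ontology: edges point from parent to child,
# with made-up information content per node (deeper = higher IC)
g = nx.DiGraph([("root", "A"), ("A", "B"), ("A", "C"), ("root", "D")])
ic = {"root": 0.0, "A": 1.2, "B": 2.5, "C": 2.1, "D": 0.8}

def find_mica(graph, n1, n2):
    # nx.ancestors follows edges backwards; a node counts as its own ancestor
    common = (nx.ancestors(graph, n1) | {n1}) & (nx.ancestors(graph, n2) | {n2})
    return max(common, key=ic.get)

assert find_mica(g, "B", "C") == "A"     # siblings: MICA is the shared parent
assert find_mica(g, "A", "B") == "A"     # A is an ancestor of B: MICA is A itself
assert find_mica(g, "B", "D") == "root"  # distant terms: MICA sits at the root
```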

Existing methods based on information content

Multiple similarity measures based on information content exist. Here, we review them in the order they were developed.

Resnik method

The Resnik method is defined as:

sim_{Resnik}(t_1, t_2) = IC(MICA)

Where the information content of the MICA is used directly as the similarity. The idea is that the MICA represents how deep in the tree the two terms diverge, so it can serve as a measure of how similar they are. However, this method captures no information from t_1 or t_2 themselves, so any pair of terms with the same MICA gets the same similarity (which, intuitively, does not seem right).

Lin method

The Lin method builds on the Resnik method, and can be expressed as:

sim_{Lin}(t_1, t_2)=\frac{2 \times IC(MICA)}{IC(t_1) + IC(t_2)}

Where the ICs of the two terms are also involved in the calculation. This improves on the Resnik method by penalizing term pairs whose own information contents differ greatly, and the score is normalized to the range [0, 1]. However, it focuses on the commonality between the two terms and does not fully account for their differences.

Relevance method

The Relevance method builds on the Lin method:

sim_{Rel}(t_1, t_2)=\frac{2 \times IC(MICA)}{IC(t_1) + IC(t_2)} \times (1 - P(MICA))

This method adds a relevance factor to Lin’s formula, penalizing pairs whose MICA is very general (i.e., has a high probability of being observed).
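The three formulas can be compared side by side. A small sketch with made-up probabilities and IC values:

```python
import math

def sim_resnik(ic_mica):
    return ic_mica

def sim_lin(ic_mica, ic_t1, ic_t2):
    return 2 * ic_mica / (ic_t1 + ic_t2)

def sim_rel(ic_mica, ic_t1, ic_t2, p_mica):
    # Lin's score scaled by a relevance factor that punishes general MICAs
    return sim_lin(ic_mica, ic_t1, ic_t2) * (1 - p_mica)

# Made-up values: the MICA appears in 5% of annotations; the terms are rarer
p_mica = 0.05
ic_mica = math.log(1 / p_mica)   # about 3.0
ic_t1, ic_t2 = 4.0, 5.0

lin = sim_lin(ic_mica, ic_t1, ic_t2)
rel = sim_rel(ic_mica, ic_t1, ic_t2, p_mica)

assert 0 <= lin <= 1   # Lin is normalized
assert rel < lin       # the relevance factor only shrinks the score
```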

Code for calculation

Here are some Python snippets that:

  • Load the HPO ontology from an OWL file (a format designed specifically for ontologies)
  • Load the HPO-gene annotations from a TSV file
  • Build networkx graphs from both
  • Calculate the information content, find the MICA, and compute the Relevance similarity

To use it, make sure you have downloaded the files from the HPO GitHub repository.

import math
import itertools
import networkx as nx
import pandas as pd
from owlready2 import get_ontology
from multiprocessing import Pool


# Global graph variable
hpo_graph = None
gene_annotation_graph = None

max_annotation_count = None
root_node = '0000001'


def load_hpo() -> nx.DiGraph:
    """
    Load the HPO from an OWL file into networkx graph
    :return:
    """
    file_path = 'file://hp.owl'

    hpo_ontology = get_ontology(file_path).load()
    hpo_classes = list(hpo_ontology.classes())

    g = nx.DiGraph()
    added_terms = set()

    for hpo_class in hpo_classes:
        if hpo_class.name.startswith('HP_'):
            # This is an HPO term
            # HPO OWL file contains terms from other ontologies for annotation purpose
            term_id = hpo_class.name.split('_')[-1]     # Keep only the numerical ID

            if term_id not in added_terms:
                g.add_node(term_id)
                added_terms.add(term_id)

            for children in hpo_class.subclasses():
                if children.name.startswith('HP_'):
                    child_id = children.name.split('_')[-1]

                    if child_id not in added_terms:
                        g.add_node(child_id)
                        added_terms.add(child_id)

                    g.add_edge(term_id, child_id)

    return g


def load_gene_annotation(hpo_g: nx.DiGraph) -> nx.Graph:
    """
    Load the annotation of HGNC gene code to HPO terms
    :return:
    """
    file_path = 'genes_to_phenotype.txt'

    base_graph = nx.Graph()

    # Load all HPO nodes into this graph too
    for node in hpo_g.nodes:
        base_graph.add_node(node)

    gene_annotation = pd.read_csv(file_path, sep='\t')

    for _, row in gene_annotation.iterrows():
        if row['gene_symbol'] == '-':
            # No corresponding HGNC code
            continue

        gene_symbol = row['gene_symbol']
        hpo_id = row['hpo_id'].split(':')[-1]

        if gene_symbol not in base_graph.nodes:
            base_graph.add_node(gene_symbol)

        base_graph.add_edge(gene_symbol, hpo_id)

    return base_graph


def find_mica(graph, node1, node2):
    # nx.ancestors follows edges backwards; nodes also count as their own ancestors
    node1_ancestors = set(nx.ancestors(graph, node1)) | {node1}
    node2_ancestors = set(nx.ancestors(graph, node2)) | {node2}
    common_ancestors = node1_ancestors & node2_ancestors

    if not common_ancestors:
        raise ValueError(f"No common ancestors found between {node1} and {node2}")

    max_ic = -1
    mica_node = None

    for n in common_ancestors:
        ic = graph.nodes[n].get('ic')   # nodes without annotations have no IC
        if ic is not None and ic > max_ic:
            max_ic = ic
            mica_node = n

    return mica_node


def relevance_similarity(graph, mica, node1, node2) -> float:
    return 2 * graph.nodes[mica]['ic'] / (graph.nodes[node1]['ic'] + graph.nodes[node2]['ic']) * (
            1 - graph.nodes[mica]['annotation_count'] / max_annotation_count)


def initializer(graph, max_anno_count):
    """
    Initializer function for the pool workers to load the graph into memory.

    Rebuilding the graph would cost too much time and memory on the worker initialization.
    """
    global hpo_graph, max_annotation_count
    hpo_graph = graph
    max_annotation_count = max_anno_count


def calculate_similarity(node_pair):
    node1, node2 = node_pair
    try:
        mica = find_mica(hpo_graph, node1, node2)
        relevance_sim = relevance_similarity(hpo_graph, mica, node1, node2)

        return node1, node2, relevance_sim
    except ValueError:
        return None


if __name__ == '__main__':
    # Prepare the graphs
    hpo_graph = load_hpo()
    gene_annotation_graph = load_gene_annotation(hpo_graph)

    # Calculate the annotation count of each term: every (gene, term) annotation
    # in the term's subtree is counted exactly once. Naively propagating counts
    # from children to parents would double count annotations reachable through
    # multiple parents in the DAG.
    for node in hpo_graph.nodes:
        subtree = nx.descendants(hpo_graph, node) | {node}
        hpo_graph.nodes[node]['annotation_count'] = sum(
            gene_annotation_graph.degree(term) for term in subtree
        )

    max_annotation_count = hpo_graph.nodes[root_node]['annotation_count']

    # Calculate the IC of each node; terms with no annotations are skipped
    # because log(0) is undefined
    for node in hpo_graph.nodes:
        count = hpo_graph.nodes[node]['annotation_count']
        if count > 0:
            hpo_graph.nodes[node]['ic'] = -math.log(count / max_annotation_count)

    # Keep only the nodes that have an information content
    nodes = [node for node in hpo_graph.nodes if 'ic' in hpo_graph.nodes[node]]

    # Use a multiprocessing Pool with an initializer so each worker receives
    # the graph once, instead of pickling it with every task
    with Pool(4, initializer=initializer, initargs=(hpo_graph, max_annotation_count)) as p:
        results = p.map(calculate_similarity, itertools.combinations(nodes, 2))

Semantic similarity based on co-annotation vector

What is a co-annotation vector

The co-annotation vector refers to the common part of the annotation set that is used to annotate both terms. In the following diagram:

C is used to annotate 2, and D is used to annotate 4. Since annotations on a child also apply to its parents, D annotates 2 by inference. A and B annotate 3. In this case, 2 and 3 have no common annotation.

Existing methods based on co-annotation vector

The basic method to calculate semantic similarity based on the co-annotation vector was proposed by Bodenreider:

sim_{Bodenreider} = \frac{t_1 \cdot t_2}{\left| t_1 \right| \times \left| t_2 \right|}

Where t_1 and t_2 are the annotation vectors of the terms. This method captures both the common and the differing parts of the two terms’ annotations. However, in common use cases, the annotation vectors are not as clearly defined as vectors in a mathematical space, where every dimension is fixed. Later, Mathur and Dinakarpandian developed a method that modifies the Jaccard index:

sim_{MD} = \frac{\frac{n \left( t_1 \cap t_2 \right)}{n \left( t_1 \cup t_2 \right)}}{\frac{n \left( t_1 \right)}{N} \times \frac{n \left( t_2 \right)}{N}}

Where n \left( t_1 \cap t_2 \right) is the number of nodes used to annotate both terms (and similarly for the other counts), and N is the total number of annotating nodes. This converts the vector products into counts of annotating nodes, which are easier to define and calculate. However, the similarity score produced by this method is not bounded; its maximum value depends on the maximum annotation count.
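As a minimal sketch of the two formulas (with hypothetical annotation sets; the full pipeline can reuse the graphs built earlier):

```python
def sim_bodenreider(v1, v2):
    """Cosine similarity between two (binary) annotation vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sum(a * a for a in v1) ** 0.5
    norm2 = sum(b * b for b in v2) ** 0.5
    return dot / (norm1 * norm2)

def sim_md(ann1, ann2, total):
    """Mathur-Dinakarpandian: Jaccard index divided by the terms' frequencies."""
    jaccard = len(ann1 & ann2) / len(ann1 | ann2)
    return jaccard / ((len(ann1) / total) * (len(ann2) / total))

# Hypothetical annotation sets drawn from N = 10 annotating genes
t1 = {"GENE1", "GENE2", "GENE3"}
t2 = {"GENE2", "GENE3", "GENE4"}

assert abs(sim_bodenreider([1, 1, 1, 0], [0, 1, 1, 1]) - 2 / 3) < 1e-9
assert sim_md(t1, t2, 10) > 1   # the MD score is not bounded by 1
```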

The code to calculate semantic similarity based on the co-annotation vector can easily be adapted from the previous code, so it is not repeated here. In the next blog post, we will look at some possible ways to improve the existing methods.

