In the previous post, we explored the commonly used methods to calculate semantic similarity. In this post, I will introduce some thoughts and possible process improvements.
Please note that all the methods in this post are prototypes of ideas that have not been mathematically proven. I’m sharing them here, hoping they may be useful or inspirational to someone else. I do not have the skill or time to prove them strictly or publish them properly.
Weights on information content
The basic concept behind information content is that if a term appears many times in the corpus, the term is relatively “general”. It can be calculated from the number of annotations to a node and its descendants. However, the quality of the corpus and/or the annotation determines the quality of the calculated result. For example, to determine the information content of a word in the literary world, you would ideally gather all literature ever published and calculate the word’s frequency. The corpus used in annotation, however, is usually a curated set of terms, so the corpus frequency is “flattened” (for example, each gene appears exactly once in the set). This does not reflect the real-world scenario; thus, the annotation may benefit from weights.
Qualitatively speaking, if a gene is used to annotate many ontology terms, the gene is relevant to many pathways or products, indicating that it is more likely to appear in the gene pool. Thus, the weight of its annotations should be relatively higher, reflecting the increase in gene frequency. With weights added, the previous calculation of the information content of a term $t$ in ontology $A$ with annotations from corpus $B$:

$$IC(t) = -\log \frac{\left|\mathrm{annot}(t)\right|}{\left|\mathrm{annot}(\mathrm{root})\right|}$$

can be updated into:

$$IC(t) = -\log \frac{\sum_{g \in \mathrm{annot}(t)} w_g}{\sum_{g \in \mathrm{annot}(\mathrm{root})} w_g}$$

where each annotation, instead of weighing 1, now has a different weight $w_g$, and $\mathrm{annot}(t)$ denotes the set of corpus nodes annotating $t$ or any of its descendants. The weight should be assigned to each corpus node, thus applying to all annotations from that node. The weight of a corpus node may be defined straightforwardly:
$$w_g = \frac{n_g}{n_{\max}}$$

where $n_g$ is the number of annotations this node has, and $n_{\max}$ is the maximum number of annotations a node in the corpus may have. Or, it can be defined using a recursive method, where the weight of a corpus node is its own information content, which in turn depends on the weights of the ontology terms:

$$w_g = IC_B(g), \qquad w_t = IC_A(t)$$
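As a quick illustration of the straightforward variant, a minimal Python sketch might look like the following. This is a sketch under assumptions, not a definitive implementation: all names are hypothetical, `direct_annotations` maps each ontology term to the corpus nodes annotating it directly, and `propagated_annotations` is assumed to be pre-propagated so that each term maps to the nodes annotating it or any of its descendants.

```python
import math
from collections import defaultdict

def straightforward_weights(direct_annotations):
    """w_g = n_g / n_max, where n_g counts the annotations node g has."""
    counts = defaultdict(int)
    for genes in direct_annotations.values():
        for g in genes:
            counts[g] += 1
    n_max = max(counts.values())
    return {g: n / n_max for g, n in counts.items()}

def weighted_ic(propagated_annotations, root, weights):
    """IC(t) = -log(sum of w_g over annot(t) / sum of w_g over annot(root))."""
    total = sum(weights[g] for g in propagated_annotations[root])
    return {
        term: -math.log(sum(weights[g] for g in genes) / total)
        for term, genes in propagated_annotations.items()
        if genes  # terms with no annotations have undefined (infinite) IC
    }
```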
Because the annotation and the ontology structure are arbitrary and not known in advance, there may (very likely) not be an analytical expression for the IC. An iterative algorithm may be employed to estimate it, similar to how the PageRank algorithm estimates importance (related to what we try to achieve here, but in a different setup). An initial weight of 1 may be assigned to the terms in $A$:

$$w_t^{(0)} = 1 \quad \forall\, t \in A$$
so that the IC may be calculated for $B$:

$$IC_B^{(0)}(g) = -\log \frac{\sum_{t \in \mathrm{annot}(g)} w_t^{(0)}}{\sum_{t \in A} w_t^{(0)}}$$

where $\mathrm{annot}(g)$ is the set of terms that node $g$ annotates.
Then, the IC may be used as the weight for $B$:

$$w_g^{(1)} = IC_B^{(0)}(g)$$
The weight of $A$, which is the information content, may then be updated with the new weights of $B$:

$$w_t^{(1)} = IC_A^{(1)}(t) = -\log \frac{\sum_{g \in \mathrm{annot}(t)} w_g^{(1)}}{\sum_{g \in \mathrm{annot}(\mathrm{root})} w_g^{(1)}}$$
The iteration continues until $w^{(k+1)}$ is sufficiently close to $w^{(k)}$ for all terms in both sets; a threshold may be set as the stopping criterion. However, in case the algorithm does not converge for any reason, including the structure of the ontology or annotation, the initial values selected, or anything else, a dampening factor $d \in (0, 1]$ may be added:

$$w^{(k+1)} = (1 - d)\, w^{(k)} + d \cdot IC^{(k)}$$
This would prevent oscillation when the annotations share too much in common and updates in each iteration significantly affect the other set. Considering that information content has no theoretical maximum (it is unbounded), the weight may further be updated as a normalized value:

$$w_g^{(k+1)} = \frac{IC^{(k)}(g)}{\sum_{g' \in B} IC^{(k)}(g')}$$
This way, each weight is bounded between 0 and 1, and all weights sum to a constant value, ensuring that the weighted annotation sums do not grow out of bounds during the iteration.
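Putting the pieces together, the whole loop might be sketched as below. Again, this is a sketch under assumptions rather than a definitive implementation: `term_to_genes` maps each term in $A$ to the corpus nodes annotating it or its descendants, `gene_to_terms` is the transpose, and both the dampening factor and the normalization step are folded in.

```python
import math

def normalize(values):
    """Scale non-negative values so they sum to 1 (bounding each in [0, 1])."""
    total = sum(values.values())
    return {k: v / total for k, v in values.items()}

def ic_from_weights(neighbors_of, other_weights):
    """IC(x) = -log(weighted frequency of x's annotations)."""
    total = sum(other_weights.values())
    return {
        # A small floor avoids log(0) when all of x's neighbors have ~0 weight.
        x: -math.log(max(sum(other_weights[n] for n in nbrs) / total, 1e-12))
        for x, nbrs in neighbors_of.items()
    }

def iterate_weights(term_to_genes, gene_to_terms, d=0.5, tol=1e-6, max_iter=100):
    # Initial weight of 1 for every term in A, then normalize both sides.
    w_terms = normalize({t: 1.0 for t in term_to_genes})
    w_genes = normalize({g: 1.0 for g in gene_to_terms})
    for _ in range(max_iter):
        # IC of corpus nodes from the current term weights, and vice versa.
        ic_genes = normalize(ic_from_weights(gene_to_terms, w_terms))
        ic_terms = normalize(ic_from_weights(term_to_genes, ic_genes))
        # Dampened update; both operands sum to 1, so the result does too.
        new_w_genes = {g: (1 - d) * w_genes[g] + d * ic_genes[g] for g in w_genes}
        new_w_terms = {t: (1 - d) * w_terms[t] + d * ic_terms[t] for t in w_terms}
        delta = max(
            max(abs(new_w_genes[g] - w_genes[g]) for g in w_genes),
            max(abs(new_w_terms[t] - w_terms[t]) for t in w_terms),
        )
        w_terms, w_genes = new_w_terms, new_w_genes
        if delta < tol:  # stop once both weight sets are sufficiently stable
            break
    return w_terms, w_genes
```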
By applying weights to the annotations while calculating the information content, not only is the distribution of annotation vectors on ontology $A$ considered, but the information initially ignored (the distribution of annotation vectors on corpus $B$) is also taken into account.
Weights on co-annotation vectors
Similarly, a weight may be introduced into the co-annotation vector calculation. However, unlike the information content approach, where we must compensate for the frequency difference between real-world samples and a curated set, co-annotation vectors measure the commonality between two terms through their common annotations. In this setup, when a node in the corpus has more annotations, its weight should be lower, because it will be present in many terms’ annotation sets, making it less valuable as a metric. Also, there is no need for a recursive calculation, because the co-annotation vector for two nodes within the corpus only shows how similar they are, without reflecting helpful information for the ontology set itself.
Based on the above analysis, the weight of each node in corpus $B$ may be defined as:

$$w_g = \frac{1}{n_g}$$

This value is bounded between $0$ and $1$ and decreases smoothly as $n_g$ increases.
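The weighted co-annotation vectors might then be built and compared as in the sketch below. The helper names are hypothetical, and cosine similarity is used here purely as one way of consuming the weighted vectors; any vector similarity measure would fit.

```python
import math

def coannotation_weights(gene_to_terms):
    """w_g = 1 / n_g: nodes with many annotations contribute less."""
    return {g: 1.0 / len(terms) for g, terms in gene_to_terms.items() if terms}

def weighted_coannotation_vector(term, term_to_genes, weights):
    """Sparse vector over corpus nodes: w_g where g annotates the term."""
    return {g: weights[g] for g in term_to_genes[term]}

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v[g] for g, x in u.items() if g in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Usage: similarity of two ontology terms via their weighted vectors.
# w = coannotation_weights(gene_to_terms)
# sim = cosine_similarity(
#     weighted_coannotation_vector("term_1", term_to_genes, w),
#     weighted_coannotation_vector("term_2", term_to_genes, w),
# )
```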