Personalizing the prediction of future susceptibility to a specific disease

A traceable biomarker is a member of a disease’s molecular pathway. A disease may be associated with several molecular pathways. Each different combination of these molecular pathways, to which detected traceable biomarkers belong, may serve as an indicator of the elicitation of the disease at a different time frame in the future. Based on this notion, we introduce a novel methodology for personalizing an individual’s degree of future susceptibility to a specific disease. We implemented the methodology in a working system called Susceptibility Degree to a Disease Predictor (SDDP). For a specific disease d, let S be the set of molecular pathways to which traceable biomarkers detected from most patients of d belong. For the same disease d, let S′ be the set of molecular pathways to which traceable biomarkers detected from a certain individual belong. SDDP is able to infer the subset S′′ ⊆ {S − S′} of undetected molecular pathways for the individual. Thus, SDDP can infer undetected molecular pathways of a disease for an individual based on the few molecular pathways detected from the individual. SDDP can also help in inferring the combination of molecular pathways in the set {S′ + S′′} whose traceable biomarkers are collectively indicative of the disease. SDDP is composed of the following four components: information extractor, interrelationship between molecular pathways modeler, logic inferencer, and risk indicator. The information extractor takes advantage of the exponential increase of biomedical literature to automatically extract the common traceable biomarkers for a specific disease. The interrelationship between molecular pathways modeler models the hierarchical interrelationships between the molecular pathways of the traceable biomarkers. The logic inferencer transforms the hierarchical interrelationships between the molecular pathways into rule-based specifications. It employs the specification rules and the inference rules for predicate logic to infer as many undetected molecular pathways of a disease as possible for an individual. The risk indicator outputs a risk indicator value that reflects the individual’s degree of future susceptibility to the disease. We evaluated SDDP by comparing it experimentally with other methods. Results revealed marked improvement.

The nouns "Alport syndrome" and "COL4A5" are semantically related. The nouns "COL4A3" and "COL4A4" are semantically related to the noun "autosom". However, each of the nouns "Alport syndrome" and "COL4A5" is unrelated to each of the nouns "COL4A3", "COL4A4", and "autosom", because the two sets of nouns belong to two different independent clauses connected by the preposition modifier "while".
Example 4: Consider the sentence: "In response to stimulating, Nck1 exhibited decreased CD69 expression but Nck2 exhibited increased CD69 expression". Below is the syntactic structure of the sentence in terms of its constituents of independent clauses:
In response to stimulating, Nck1 (N) exhibited (V) decreased CD69_expression (N),
but (PREP) Nck2 (N) exhibited (V) increased CD69_expression (N)
The nouns "Nck1" and "CD69 expression" are semantically related. The nouns "Nck2" and "CD69 expression" are semantically related. However, the nouns "Nck1" and "Nck2" are semantically unrelated, because they belong to two different independent clauses connected by the preposition modifier "but".
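The clause rule used in Example 4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the SDDP implementation: it assumes the sentence has already been split into independent clauses and that the nouns of each clause have already been extracted, and the function name `related_pairs` is hypothetical.

```python
from itertools import combinations


def related_pairs(clauses):
    """Return the set of noun pairs that share an independent clause.

    `clauses` is a list of noun lists, one list per independent clause
    (hypothetical input format; clause splitting and noun tagging are
    assumed to have been done beforehand).
    """
    pairs = set()
    for nouns in clauses:
        # All distinct nouns inside one independent clause are related.
        for a, b in combinations(sorted(set(nouns)), 2):
            pairs.add(frozenset((a, b)))
    return pairs


# Example 4, split by hand at the connector "but".
example_4 = [
    ["Nck1", "CD69 expression"],
    ["Nck2", "CD69 expression"],
]
pairs = related_pairs(example_4)
nck1_nck2_related = frozenset(("Nck1", "Nck2")) in pairs
```

Under these assumptions, "Nck1" and "Nck2" never appear in the same clause, so no pair is produced for them, which matches the conclusion of Example 4.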

Sentences Containing Pronouns Defining Antecedents
According to linguistics, an antecedent noun is usually related to the subsequent noun(s) if the subsequent noun(s) is connected to the antecedent by a pronoun (such as "which", "who", "it", "whom", and "that") [3]. We propose our second set of semantic rules based on this linguistic observation, as follows:
1. An antecedent noun is semantically related to a subsequent noun(s) if the two nouns are connected by a pronoun. Towards this, we replace each pronoun with the closest noun found under the predecessor independent clause. This conforms to grammar and linguistics, which treat a pronoun as a word that can be substituted by a noun or noun phrase. In Examples 5-9, we strike through each pronoun and replace it with the closest noun found under the predecessor independent clause.
2. An explicit or implicit pronoun preceded by a conjunction (i.e., "and" and "or") refers to the subject of the closest predecessor independent clause. In Examples 5-9, we strike through each pronoun preceded by a conjunction and replace it with the subject of the closest predecessor independent clause. In the case of an implicit pronoun preceded by a conjunction, we also replace it with the subject of the closest predecessor independent clause.
For the sake of clarification, we perform the following in Examples 5-9:
1. We type the subject of the first independent clause using a different font.
2. We type each noun that replaces a pronoun: (1) in italics, (2) in a different font, and (3) with quotation marks around it. The replacement noun plays the role of the subject of the independent clause that comes immediately after the pronoun.
In Examples 5-9, we demonstrate how these semantic rules conform to the linguistic theory stated above. We determine the semantic relationship between each pair of generic biomedical nouns. Recall that all nouns (including the replacement nouns) within an independent clause are related.
Example 5: Consider the sentence: "The two variants of Hemoglobin are HbSS which causes death, and HbAS which protects against malaria". The following is the syntactic structure of the sentence in terms of its constituents of independent clauses. The subject noun "Hemoglobin" is semantically related to the noun "HbSS". In the second independent clause, the pronoun "which" is replaced by the closest noun found under the predecessor independent clause (i.e., "HbSS"), which becomes the subject of the second independent clause. Therefore, the noun "HbSS" is semantically related to the nouns "death" and "HbAS". In the third independent clause, the pronoun "which" is replaced by the closest noun found under the predecessor independent clause (i.e., "HbAS"). Therefore, the nouns "HbAS" and "malaria" are semantically related.
Example 6: Consider the following sentence: "Zbtb7A is a repressor of the tumor suppressor p14ARF that in turn lowers the expression of p53 gene and it is a central regulator in oncogenesis". The following is the syntactic structure of the sentence in terms of its constituents of independent clauses.
Zbtb7A (N) is (V) a repressor of the tumor suppressor p14ARF (N)
that "p14ARF" in turn lowers (V) the expression of p53 gene (N)
and it "p14ARF" is (V) a central regulator in oncogenesis (N).
The subject noun "Zbtb7A" is semantically related to the noun "p14ARF". The pronoun "that" is replaced by the closest noun found under the predecessor independent clause (i.e., the noun "p14ARF"), which becomes the subject noun of the second independent clause. Therefore, the nouns "p14ARF" and "p53 gene" are semantically related. Since the pronoun "it" follows the conjunction "and", it is replaced by the subject noun of the closest predecessor independent clause (i.e., the noun "p14ARF"), which becomes the subject of the third independent clause. Therefore, the nouns "p14ARF" and "oncogenesis" are semantically related.
Example 7: Consider the following sentence: "p16 is a cell-cycle protein, which interacts with the sequester MDM2, and it inhibits the ability of CDK4 to interact with cyclins D". The following is the syntactic structure of the sentence in terms of its constituents of independent clauses. The subject "p16" is semantically related to the noun "cell-cycle protein". The pronoun "which" is replaced by the closest noun under the predecessor independent clause (i.e., the noun "cell-cycle protein"), which becomes the subject of the second independent clause. Therefore, the nouns "cell-cycle protein" and "MDM2" are semantically related. Since the pronoun "it" follows the conjunction "and", it is replaced by the subject noun of the closest predecessor independent clause (i.e., the noun "cell-cycle protein"), which becomes the subject of the third independent clause. Therefore, the nouns "cell-cycle protein" and "CDK4" are semantically related.
The subject noun "p14ARF gene" is semantically related to the noun "mdm2 protein". In the second independent clause, the implicit pronoun that follows the conjunction "and" is replaced by the subject noun of the closest predecessor independent clause (i.e., the noun "p14ARF gene"), which becomes the subject of the second independent clause. Therefore, the nouns "p14ARF gene" and "p53 protein" are semantically related. In the third independent clause, the pronoun "which" is replaced by the closest noun found under the predecessor independent clause (i.e., "p53 protein"), which becomes the subject of the third independent clause. Therefore, the nouns "p53 protein" and "p21 protein" are semantically related. In the fourth independent clause, the pronoun "which" is replaced by the closest noun found under the predecessor independent clause (i.e., "p21 protein"), which becomes the subject of the fourth independent clause. In the fifth independent clause, the implicit pronoun that follows the conjunction "and" is replaced by the subject noun of the closest predecessor independent clause (i.e., "p21 protein"), which becomes the subject of the fifth independent clause. Therefore, the nouns "p21 protein" and "cyclin-CDK" are semantically related.
Example 9: Consider the following sentence: "Zbtb7A protein binds to HIV type I and interacts with BCL-6".

Zbtb7A protein (N) binds (V) to HIV type I (N)
and "Zbtb7A protein" interacts (V) with BCL-6 (N)
The subject noun "Zbtb7A protein" is semantically related to the noun "HIV type I". The implicit pronoun that follows the conjunction "and" is replaced by the subject noun of the closest predecessor independent clause (i.e., the noun "Zbtb7A protein"), which becomes the subject of the second independent clause. Therefore, the nouns "Zbtb7A protein" and "BCL-6" are semantically related.
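The two pronoun-replacement rules applied in Examples 5-9 can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: clauses are assumed to be pre-segmented, each clause lists its nouns in sentence order with the subject first, and "the closest noun found under the predecessor independent clause" is taken to be the last noun of that clause. The names `RELATIVE`, `CONJUNCTION`, and `resolve_subjects` are hypothetical.

```python
from itertools import combinations

RELATIVE = "relative"        # clause introduced by a pronoun such as "which" or "that"
CONJUNCTION = "conjunction"  # clause introduced by "and"/"or" plus an explicit or implicit pronoun


def resolve_subjects(clauses):
    """Prepend the replacement subject to every dependent clause.

    `clauses` is a list of (link, nouns) tuples. The first clause has
    link None and its first noun is the subject of the sentence.
    """
    resolved = []
    for link, nouns in clauses:
        if link is None:
            resolved.append(list(nouns))
        elif link == RELATIVE:
            # Rule 1: the pronoun becomes the closest noun of the
            # predecessor clause (assumed here to be its last noun).
            resolved.append([resolved[-1][-1]] + list(nouns))
        elif link == CONJUNCTION:
            # Rule 2: the pronoun becomes the subject (first noun)
            # of the closest predecessor clause.
            resolved.append([resolved[-1][0]] + list(nouns))
        else:
            raise ValueError(f"unknown link type: {link!r}")
    return resolved


def related_pairs(resolved):
    """All nouns inside one resolved clause are pairwise related."""
    return {frozenset(p) for nouns in resolved for p in combinations(nouns, 2)}


# Example 6, segmented by hand.
example_6 = [
    (None, ["Zbtb7A", "p14ARF"]),
    (RELATIVE, ["p53 gene"]),        # "that in turn lowers ..."
    (CONJUNCTION, ["oncogenesis"]),  # "and it is ..."
]
# Example 9, with an implicit pronoun after "and".
example_9 = [
    (None, ["Zbtb7A protein", "HIV type I"]),
    (CONJUNCTION, ["BCL-6"]),
]
pairs_6 = related_pairs(resolve_subjects(example_6))
pairs_9 = related_pairs(resolve_subjects(example_9))
```

With these inputs the sketch relates "p14ARF" to both "p53 gene" and "oncogenesis" in Example 6, and "Zbtb7A protein" to both "HIV type I" and "BCL-6" in Example 9, in line with the explanations given for those examples.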

Appendix B
A complete set of inference rules for T2D. Table 1 shows the complete set of inference rules for T2D. Table 2 lists the abbreviations of the terms used in Table 1.

Description of how risk indicators are computed
An indicator value is assigned to a combination of MPs of a disease for an individual as follows. Let S be the set of MPs of a specific disease. We assign a score to each combination c ⊆ S. The score reflects the degree of association between the combination c and the disease. Specifically, it reflects the dominance status of c relative to each other combination c′ ⊆ S.
First, we compute the pairwise beats and losses for each combination. This is performed based on the co-occurrences of the MPs in the combination in the abstracts of biomedical publications associated with the disease d under consideration.
Combination ci beats combination cj if the number of abstracts in which the co-occurrence weight of ci is greater than that of cj exceeds the number of abstracts in which it is smaller. Eventually, each combination c is assigned a score, which is the difference between the number of times that c beats the other combinations and the number of times it loses. This concept is formalized in Definition 1.

Definition 1 - A combination's pairwise score:
Let ci > cj denote that the number of publications associated with the disease under consideration in which the number of co-occurrences of MP combination ci is greater than that of combination cj exceeds the number of publications in which it is smaller. The pairwise score of combination ci equals the following:
$$S_{c_i} = |\{c_j \in C : c_i > c_j\}| - |\{c_j \in C : c_j > c_i\}|$$
where C is the set of combinations.
To this end, the combination ci is assigned the dominance score Sci given by this pairwise score. If we sum the dominance scores of all combinations, we find that the result is zero. The highest possible score is (t - 1) and the lowest possible score is -(t - 1), where t is the number of combinations. Finally, the combinations are ranked based on their dominance scores.
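The dominance score can be illustrated with a minimal Python sketch. It assumes that the co-occurrence counts of each combination per publication are already available, and it reads "ci beats cj" as "ci has the larger count in more publications than cj does". The function name `dominance_scores` and the counts below are hypothetical and are not taken from the tables of this appendix.

```python
def dominance_scores(counts):
    """Compute beats-minus-losses scores for MP combinations.

    `counts` maps a combination name to its list of co-occurrence
    counts, one entry per publication (hypothetical input format).
    """
    scores = {}
    for ci, row_i in counts.items():
        beats = losses = 0
        for cj, row_j in counts.items():
            if ci == cj:
                continue
            # Publications where ci has more (or fewer) co-occurrences than cj.
            wins = sum(a > b for a, b in zip(row_i, row_j))
            defeats = sum(a < b for a, b in zip(row_i, row_j))
            if wins > defeats:
                beats += 1    # ci beats cj
            elif wins < defeats:
                losses += 1   # ci loses to cj
        scores[ci] = beats - losses
    return scores


# Hypothetical counts for three combinations in three publications.
counts = {
    "c1": [5, 2, 4],
    "c2": [3, 3, 1],
    "c3": [1, 1, 1],
}
scores = dominance_scores(counts)
ranking = sorted(scores, key=scores.get, reverse=True)
```

For this toy input, c1 beats both other combinations and reaches the maximum score t - 1 = 2, c3 loses to both and reaches -(t - 1) = -2, and the scores sum to zero, as stated in the text.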

Example 1:
Consider 10 combinations of MPs, c1-c10. Suppose that the number of co-occurrences of each of the 10 combinations in 3 biomedical publications (p1-p3) associated with the disease under consideration is as shown in Table 1. Table 2 shows how the dominance score Sci of each of the 10 combinations is computed based on its number of co-occurrences in the 3 publications presented in Table 1. For example, let c9 be the combination of MPs to which the traceable biomarkers detected from an individual belong. The individual will be given the risk indicator 3.
Table 1: The number of co-occurrences of each of the 10 MP combinations in 3 publications associated with a disease as described in Example 1
Table 2: Beats/losses scores of the combinations of the MPs described in Example 1 based on their number of co-occurrences in the 3 publications as shown in Table 1. "+" denotes: combination ci beat combination cj. "-" denotes: combination ci lost to combination cj. "0" denotes: ci and cj have the same number of beats and losses. Sci is the dominance score of ci.

Illustration that shows how Table 2 is filled out:
We show below how the fields (g3, g1), (g4, g1), and (g1, g2) are filled out:
• The field (g3, g1) is assigned the symbol "-" because g1 beat g3 one time and lost to g3 two times.
• The field (g4, g1) is assigned the symbol "+" because g4 beat g1 two times and lost to g1 one time.
• The field (g1, g2) is assigned the symbol "0" because g1 beat g2 one time and lost to g2 one time.
Illustration that shows how the dominance scores "Sc" in Table 2 are calculated: Sci = Number of "+" of ci - Number of "-" of ci.
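The filling of a single field and the row-wise score can be sketched as follows. This is a hedged illustration only: the helper names `field_symbol` and `score_from_row` are hypothetical, the beats and losses are always counted from the point of view of the combination being scored, and the sample row at the end is invented rather than copied from Table 2.

```python
def field_symbol(beats, losses):
    """Symbol for one field, given how often a combination beat and
    lost to its opponent across the publications."""
    if beats > losses:
        return "+"
    if beats < losses:
        return "-"
    return "0"


def score_from_row(row):
    """Sc = number of "+" minus number of "-" in a combination's row."""
    return row.count("+") - row.count("-")


# The three cases described in the illustration.
one_beat_two_losses = field_symbol(1, 2)   # beat once, lost twice
two_beats_one_loss = field_symbol(2, 1)    # beat twice, lost once
one_beat_one_loss = field_symbol(1, 1)     # beat once, lost once

# Hypothetical row of symbols for one combination.
sample_score = score_from_row(["+", "-", "0", "+"])
```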