Figure 1.
Overview of methods and results.
For each of the 2,362,950 possible drug-indication pairs, we calculated 9 empirical features (e.g., co-mention count) from the free text of clinical notes in STRIDE and 16 domain knowledge features (e.g., similarity in known usage to other drugs used to treat the indication) from Medi-Span and DrugBank. These features were used by an SVM classifier, trained on a gold standard dataset to recognize the used-to-treat relationship, to yield a set of predictions. The predictions were filtered to remove known usages and near misses in the indications, and to retain only those with support in two independent and complementary datasets (FAERS and MEDLINE). Predicted usages that appeared to be drug adverse events listed in SIDER 2 were removed. The resulting set of 403 well-supported novel off-label usages was binned using indices of risk and cost.
Figure 2.
Training and testing a classifier to recognize used-to-treat relationships.
We created a gold standard of positive and negative examples of known drug usage. Positive examples were taken from Medi-Span. We created negative examples by randomly selecting positive examples and then randomly choosing a drug and an indication with roughly the same frequency of mentions in STRIDE as those in the real usage. These were then checked against Medi-Span to filter out inadvertently generated known usages. The gold standard dataset contained 4 negative examples for each positive example. For each drug-indication pair in the gold standard, we calculated features summarizing the pattern of mentions of the drugs and indications in 9.5 million clinical notes from STRIDE. We used Medi-Span and DrugBank to calculate features summarizing domain knowledge about drugs and their usages. We used 80% of the gold standard to train an SVM classifier and tested the resulting model on the remaining 20%.
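The negative-example procedure and the 80/20 split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy drugs and indications, the mention counts, and the `tolerance` used to define "roughly the same frequency" are all hypothetical.

```python
import random

def sample_negatives(positives, drug_freq, indication_freq,
                     ratio=4, tolerance=0.5, seed=0, max_tries=100_000):
    """Draw `ratio` negatives per positive, matched on mention frequency.

    `tolerance` (relative difference in mention counts) is an assumed
    definition of "roughly the same frequency"; the source gives no value.
    """
    rng = random.Random(seed)
    known = set(positives)
    negatives = set()
    target = ratio * len(positives)

    def frequency_matched(freqs, name):
        ref = freqs[name]
        return [k for k, f in freqs.items() if abs(f - ref) <= tolerance * ref]

    for _ in range(max_tries):
        if len(negatives) >= target:
            break
        # Pick a real usage, then a drug and an indication of similar frequency.
        drug, indication = rng.choice(positives)
        candidate = (rng.choice(frequency_matched(drug_freq, drug)),
                     rng.choice(frequency_matched(indication_freq, indication)))
        # Discard candidates that turn out to be known usages.
        if candidate not in known:
            negatives.add(candidate)
    return sorted(negatives)

def split_train_test(examples, train_fraction=0.8, seed=0):
    """Shuffle and split examples into train and held-out test portions."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy data standing in for Medi-Span usages and STRIDE counts.
positives = [("drugA", "ind1"), ("drugB", "ind2")]
drug_freq = {"drugA": 100, "drugB": 110, "drugC": 95, "drugD": 105}
indication_freq = {"ind1": 50, "ind2": 55, "ind3": 45, "ind4": 52}

negatives = sample_negatives(positives, drug_freq, indication_freq)
labeled = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
train, test = split_train_test(labeled)
```

Feature calculation and the SVM itself are left out of this sketch; `train` and `test` would be passed to whatever classifier is used.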
Table 1.
Performance of classifier on hold-out test set using different feature sets.
Figure 3.
Distribution of indication classes in predicted novel usages.
Each indication for the 403 high-confidence novel usages with support in FAERS and MEDLINE was mapped to the first level of the NDF-RT disease hierarchy. Sixty-three usages could not be mapped to NDF-RT and were left out of this chart.
Table 2.
Selected predicted novel off-label usages.
Table 3.
Predicted off-label usages binned by risk and cost and ranked by support in FAERS.
Figure 4.
Using prior knowledge to calculate drug-drug and indication-indication similarity.
We represent known usage as a matrix in which row i represents drug i and column j represents indication j. A check in entry (i,j) indicates that drug i is used to treat indication j, while a cross indicates that it is not. Suppose we are interested in whether a given drug, lamotrigine, is used to treat migraine disorders. We thus ask: how similar is the known usage of lamotrigine to that of other drugs we know are used to treat migraine disorders? Topiramate is used to treat migraine disorders, and lamotrigine is similar to it in that both are used to treat tonic-clonic seizures and myoclonic epilepsies, but not non-Hodgkin's lymphoma. This similarity in usage profile suggests that lamotrigine is more likely to be used to treat migraine disorders than, say, rituximab. We measured this similarity using the maximum cosine and Jaccard similarity of lamotrigine versus all drugs known to treat the indication. We calculated the similarity between indications based on known usage using the same data, with the roles of drugs and indications reversed.
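The maximum-similarity feature described above can be illustrated with a small sketch. It assumes each drug's row of the usage matrix is stored as a set of indications, which is equivalent to a binary vector. The toy usage data and the function names are hypothetical, and whether the target indication is held out of the comparison profiles is not specified in the text, so it is kept here.

```python
from math import sqrt

# Hypothetical toy known-usage matrix: drug -> set of indications it treats.
known_usage = {
    "lamotrigine": {"tonic-clonic seizures", "myoclonic epilepsies"},
    "topiramate": {"tonic-clonic seizures", "myoclonic epilepsies",
                   "migraine disorders"},
    "rituximab": {"non-Hodgkin's lymphoma"},
}

def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine similarity of two binary vectors given as sets:
    |A & B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def max_similarity(drug, indication, usage, sim):
    """Highest similarity between `drug` and any other drug known to
    treat `indication`; 0.0 if no other drug treats it."""
    scores = [
        sim(usage[drug], usage[other])
        for other in usage
        if other != drug and indication in usage[other]
    ]
    return max(scores, default=0.0)

lamotrigine_jaccard = max_similarity(
    "lamotrigine", "migraine disorders", known_usage, jaccard)
lamotrigine_cosine = max_similarity(
    "lamotrigine", "migraine disorders", known_usage, cosine)
rituximab_jaccard = max_similarity(
    "rituximab", "migraine disorders", known_usage, jaccard)
```

In this toy data lamotrigine shares two of topiramate's three indications while rituximab shares none, so lamotrigine scores higher for migraine disorders. Indication-indication similarity follows by applying the same functions to a mapping from each indication to the set of drugs that treat it.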