Beyond Captions: Linking Figures with Abstract Sentences in Biomedical Articles

Although figures in scientific articles have high information content and concisely communicate many key research findings, they are currently underutilized by literature search and retrieval systems. Many systems ignore figures, and those that do not typically consider only caption text. This study describes and evaluates a fully automated approach for associating figures in the body of a biomedical article with sentences in its abstract. We use supervised methods to learn probabilistic language models, hidden Markov models, and conditional random fields for predicting associations between abstract sentences and figures. Three kinds of evidence are used: text in abstract sentences and figures, relative positions of sentences and figures, and the patterns of sentence/figure associations across an article. Each information source is shown to have predictive value, and models that use all kinds of evidence are more accurate than models that do not. Our most accurate method has an F1 score of 69% on a cross-validation experiment, is competitive with the accuracy of human experts, has significantly better predictive accuracy than state-of-the-art methods, and enables users to access figures associated with an abstract sentence with an average of 1.82 fewer mouse clicks. A user evaluation shows that human users find our system beneficial. The system is available at http://FigureItOut.askHERMES.org.


Introduction
The rapid growth of electronic full-text biomedical articles has enabled the development of information systems that allow researchers to search and navigate large literature databases. Key content of many articles resides in images, charts, plots, tables or diagrams, and there is considerable interest in developing new figure-aware systems. Because of the important role of figures, they are often referred to and discussed, explicitly and implicitly, throughout an article. However, nearly all existing systems for figure search rely solely on the text in captions, and thus fail to consider other key document elements. We present novel algorithms for automatically "linking" or "associating" sentences in the abstract of a scientific article with figures in the article body. These and related methods will help figures become a key part of next-generation search systems. We use the terms "associating" and "linking" to indicate that a figure and a sentence in the abstract are related. In particular, the figure gives supporting information for the sentence in the abstract. This use of these terms is common in data mining and text analysis. It should not be confused with genetic, biological or medical uses of the terms "links" and "association".
Our approach uses three types of evidence to predict whether or not an abstract sentence is associated with a figure. The first type of evidence is text. While the textual representation of a sentence is simply the terms in the sentence, the appropriate textual representation of a figure is not so clear. We investigate textual figure representations based on terms in the figure's caption and/or its referencing paragraphs. We use probabilistic language models to assess the textual similarity between an abstract sentence and a figure. The second type of evidence is the relative positions of a sentence and figure. Previous work by our group [1] has shown that sentences at the beginning of an abstract are more likely to be associated with figures near the beginning of an article, middle sentences are more likely to be associated with middle figures, and so on. We use probabilistic distance models to reason about the relative positions for both linked and non-linked instances. The third type of evidence is patterns of sentence/figure links across an article. Since the presence or absence of a link for one instance can affect the likelihood of a link for other instances [1], we introduce novel approaches for representing linkage patterns based on hidden Markov models (HMMs) [2] and conditional random fields (CRFs) [3].
Our experimental evaluation uses a corpus of 114 biomedical articles annotated by their authors for all links between abstract sentences and figures. Figure 1 shows the annotated linkages between figures and the abstract sentences of one such article. We use supervised learning to learn language, distance, and linkage flow models, and use probabilistic methods to effectively combine predictions of the three models. Cross-validation experiments are used to evaluate our methods. The key findings are (i) each type of evidence has predictive value, (ii) predictions of models that combine evidence sources are more accurate than the predictions of models that use a single evidence source, (iii) across articles the average maximum F1 score of our combined approach is 69%, and (iv) our predictions would save users an average of 1.82 mouse clicks when searching for a figure associated with an abstract sentence in a conceptualized literature search system.
The work presented here extends previous work of our group on linking abstract sentences with figures [1] in several significant ways, and makes important contributions to our understanding of this problem. The present system uses supervised learning approaches while previous methods are unsupervised. We introduce probabilistic language models, position models, and HMM and CRF linkage flow models for this task. We evaluate two kinds of figure representations, one based on text in figure captions and the other on text in referencing paragraphs. A new evaluation measure based on the average number of saved clicks is introduced. Finally, the accuracy of predictions is significantly improved.
Our system is fully implemented and contains over 150,000 open access full-text biomedical articles that can be accessed at http://figureitout.askhermes.org.

Related Work
In this section we discuss relationships to previous work in four areas: text-based literature search systems, classification and search methods for images in documents, textual entailment, and summarization.
A number of medical and biological text-based literature search systems have been constructed. These include systems that respond to users' queries, such as PubMed and AskHERMES [4] for medical literature. Textpresso [5] was originally designed to assist biological database curation but also functions as an information retrieval system. Arrowsmith [6] helps biologists formulate hypotheses through text mining of two topics, such as a drug and an adverse event. Other systems attempt to find specific kinds of information. For example, GeneWays [7] extracts molecular interactions related to pathways identified in the literature and iHOP [8] identifies sentences that relate two genes. Additionally, there are numerous annotated databases (for example, the Gene Ontology annotation [9], the Mouse Genome Database [10], SWISS-PROT, OMIM [11], and BIND [12]) that provide different levels of annotated literature information about genes and molecular interactions. See the review article [13] for other information systems.
In addition to text, the importance of biomedical figures and images for document classification and retrieval has been recognized. The earliest image mining system is the Subcellular Location Image Finder (SLIF) system [14-18], which extracts and analyzes fluorescence microscope images from biomedical full-text articles. Other studies have looked at applying supervised machine-learning algorithms for image categorization using flat [19] and hierarchical [20] classification schemes. These methods showed that image classification benefits document classification. Besides the image content itself, associated text has been shown to be important for image mining. Caption words, for example, can improve image classification [19]. BioText searches biomedical images on the basis of image captions [21,22], and Yale Image Finder [23] searches images on the basis of title, abstract, image caption, and the text appearing in an image. More recently, an approach has investigated using figure-associated text for automatically ranking figures by their importance [24]. While these methods utilize text for different tasks, they do not automatically associate images or the figures that contain them with specific document text.
Our approaches to associating figures with text are also related to the problem of textual entailment [25], a task that has application to numerous higher-level problems including passage retrieval, machine translation, paraphrasing, summarization, and question answering. The PASCAL Network of Excellence Recognizing Textual Entailment (RTE) challenge task is to recognize whether two text strings can be semantically inferred (entailed) from each other. Thus, a body of text is said to "entail" a hypothesis text if the body of text implies that the hypothesis is true. Our task is similar in that the aim is to determine whether or not one string (a sentence in the abstract) is associated with another string (the text of a figure). The RTE task does not directly apply to the linkage between figures and text, as the relationship between linked abstract sentences and figures is generally much weaker than entailment.
Lastly, we find similarities with the computational summarization work of Jing and McKeown [26]. They learn a summarization system from a training set consisting of human-written summary sentences in which words in the summaries are mapped to words in the original article. Their summarization approach, which assumes human summaries are created via a cut-and-paste process, uses two heuristic rules: (1) human summaries are more likely to use whole phrases than single, isolated words, and (2) humans are more likely to merge nearby sentences into a single sentence than combine sentences that are far apart. They model these tendencies with HMMs. These rules parallel patterns of associations between figures and abstract sentences that we represent with HMMs and CRFs. A key difference between these two tasks, however, is that while for summarization Jing and McKeown [26] permit only one-to-one associations between words in summary sentences and words in the original, we allow more general one-to-many, many-to-one, and many-to-many associations between figures and sentences. In this sense, our application represents a more challenging task.

Results and Discussion
We conducted experiments on a corpus of 114 manually annotated biomedical articles to empirically evaluate our approach to predicting linkages between abstract sentences and figures. Our experiments involve training models and making predictions from a progressively increasing number of evidence types. First, we consider only text, and evaluate predictions of our language models (LMs). Next, we add position evidence, and evaluate predictions of combined LMs and distance models (LM+DM). Last, we add (inferred) linkage evidence, and evaluate predictions of our hidden Markov model (HMM) and conditional random field (CRF) methods.
We designed our experiments to test several statistical hypotheses. Each experiment was evaluated using up to three performance metrics commonly used in information retrieval, as well as an application-specific performance value ("clicks"). For each test the null hypothesis is that two competing approaches have the same mean measure ($H_0: m_1 = m_2$) and the alternative hypothesis is $H_a: m_1 \neq m_2$. We report p-values (the probability, under $H_0$, of a result at least as extreme as the one observed) in all cases where $p < 0.1$. The three hypotheses were: (1) The quality of predictions of our complete language model (CompleteLM) exceeds those of the state-of-the-art approach, which uses only text. Our empirical results support each of these hypotheses. We measure performance using standard measures: precision is the fraction of figures linked by a system that are truly linked figures, recall is the fraction of truly linked figures that are correctly identified by a system, AROC is the area under the ROC curve, which plots recall (the true-positive rate) as a function of the false-positive rate, and F1 is the harmonic mean of precision and recall.
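As a small worked illustration of the standard information-retrieval definitions of precision, recall, and F1, the sketch below scores a set of predicted sentence/figure links against a set of true links. The function name `precision_recall_f1` and the toy link sets are hypothetical and are not taken from the annotated corpus.

```python
def precision_recall_f1(predicted, truth):
    """Score predicted links against true links.

    Both arguments are sets of (sentence_index, figure_index) pairs.
    """
    true_positives = len(predicted & truth)
    # Precision: fraction of predicted links that are truly linked.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of true links that were predicted.
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (hypothetical data): four true links, three predicted links,
# two of which are correct.
truth = {(1, 1), (2, 1), (2, 2), (3, 3)}
predicted = {(1, 1), (2, 2), (3, 2)}
p, r, f1 = precision_recall_f1(predicted, truth)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Here two of the three predictions are correct (precision 2/3) and two of the four true links are found (recall 1/2), giving F1 = 4/7.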
We also use a new measure that we call "clicks". This is not actual clicks by a designated user but a mathematical model meant to estimate the savings over a user reading an abstract sentence and then selecting figures sequentially, looking for supporting information for the sentence. Our model assumes that figures are selected in order until the set of figures relevant to a given sentence is found. This may be an overestimate (if the user has visual clues or has already clicked on some of these figures for a previous sentence), or an underestimate (if the user clicks on all sentences, just to be sure, or clicks on the back button) but it does provide consistent criteria for evaluating methods. More precisely, for any sentence, we assume that if Figure $k$ is the last figure (truly) linked to that sentence, a user without our system would click $k$ times to retrieve the relevant figures, accessing the figures sequentially until obtaining the desired information. If our system scores all of the relevant figures for the sentence within its top $q$ choices, then our user would click $q$ times, and the number of clicks saved would be $k - q$. We define "clicks" to be this difference, averaged over all sentences in the set of abstracts. "Clicks" thus represents the average reduction in the number of mouse clicks needed by a user to locate a figure associated with an abstract sentence when the user clicks on figures in the order determined by linkage scores rather than sequentially.
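The definition of "clicks" above can be sketched in a few lines. The function names, the toy scores, and the handling of two edge cases are assumptions made for this illustration: sentences with no linked figure are skipped, and tied scores keep article order.

```python
def clicks_saved(scores, linked):
    """Clicks saved for one abstract sentence.

    scores: system linkage scores, scores[i] belongs to figure i + 1
            (figures numbered in article order).
    linked: non-empty set of 1-based figure numbers truly linked to the sentence.
    """
    # Sequential browsing: the user stops at the last relevant figure, k.
    k = max(linked)
    # System-guided browsing: figures are visited in decreasing score order.
    ranking = sorted(range(1, len(scores) + 1), key=lambda fig: -scores[fig - 1])
    # q is the smallest rank by which every relevant figure has been visited.
    q = max(ranking.index(fig) + 1 for fig in linked)
    return k - q


def mean_clicks_saved(sentences):
    """Average k - q over sentences that have at least one linked figure."""
    saved = [clicks_saved(scores, linked) for scores, linked in sentences if linked]
    return sum(saved) / len(saved)


# Toy example (hypothetical scores and links).
sentences = [
    ([0.1, 0.2, 0.9, 0.3], {3}),   # k = 3, q = 1, saves 2 clicks
    ([0.8, 0.1, 0.5], {1, 3}),     # k = 3, q = 2, saves 1 click
    ([0.2, 0.9], {1}),             # k = 1, q = 2, costs 1 extra click
]
print(mean_clicks_saved(sentences))
```

The third sentence shows that the measure can be negative for an individual sentence when the system ranks a relevant early figure below an irrelevant one.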

Results for Language Models
We performed a leave-one-article-out cross-validation experiment to assess the performance of different LM approaches. We evaluated four types of figure-specific models (Caption Only, Referencing Only, Pooled, and Mixture) and two types of background models (Variable Size and Fixed Size) and compare to the current state-of-the-art [1] (Baseline). See the methods section for LM specifics. Since the Mixture figure-specific model considers more text than the other two methods and differentiates between caption and referencing text, and since the Fixed Size background model corrects for a bias that Variable Size has against long sentences, we sometimes refer to the LM (Fixed Size, Mixture) as the CompleteLM. We hypothesize CompleteLM will outperform the other LMs as well as Baseline.
These differing performances are given in Table 1, where the column headers on the per-article side of the table have an over-bar and subscript a to indicate that the reported values are averages across articles. Broadly, note that the per-article scores are uniformly higher for all methods and measures than their corresponding whole-corpus scores. So, indeed, the methods perform better on articles with fewer links. We will analyze these differences in detail using a permutation test, but first we discuss the results of using differing background models, i.e., differences between top and bottom rows of the table. Reported precision values in the table are those that arise when the number of predicted linkages is equal to the number of abstract sentences. That is, the precision value $\mathrm{Prec}_J(a)$ for article $a$ (used in the calculation of $\overline{\mathrm{Prec}_J}_a$ in the per-article method) is the precision for the top scoring $J(a)$ sentence-figure instances, where $J(a)$ is the number of abstract sentences in article $a$. Similarly, $\mathrm{Prec}_J$ for the whole-corpus case is the precision for the top scoring $\sum_a J(a)$ instances. Figure 2 shows whole-corpus recall-precision curves for three LM models and the baseline.
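To make the difference between the per-article and whole-corpus precision calculations concrete, the following sketch computes both on a toy corpus. The function names and the two hypothetical articles are assumptions for illustration; the scores are chosen so that one article has uniformly higher linkage scores than the other.

```python
def precision_at(instances, n):
    """Precision among the n top-scoring instances.

    instances: list of (score, is_linked) pairs.
    """
    top = sorted(instances, key=lambda inst: -inst[0])[:n]
    return sum(1 for _, is_linked in top if is_linked) / len(top)


def prec_j_per_article(articles):
    """Average over articles of precision at the top J(a) instances."""
    values = [precision_at(instances, j) for j, instances in articles.values()]
    return sum(values) / len(values)


def prec_j_whole_corpus(articles):
    """Precision at the top sum_a J(a) instances of the pooled corpus."""
    pooled = [inst for _, instances in articles.values() for inst in instances]
    total_j = sum(j for j, _ in articles.values())
    return precision_at(pooled, total_j)


# Hypothetical corpus: article name -> (J(a), scored instances).
articles = {
    "A": (2, [(0.9, True), (0.8, True), (0.7, False), (0.6, False)]),
    "B": (2, [(0.5, True), (0.4, True), (0.3, False), (0.2, False)]),
}
print(prec_j_per_article(articles))   # each article is ranked perfectly
print(prec_j_whole_corpus(articles))  # pooling lets A's unlinked pairs outrank B's links
```

In this toy corpus every article is ranked perfectly on its own, yet the pooled ranking is penalized because all of article A's scores exceed all of article B's scores.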
We continue our discussion of results by comparing CompleteLM with Baseline. We observe that the performance of our approach exceeds the baseline on all measures. To estimate significance we conducted paired t-tests for the three per-article measures. The p-value for $\overline{F_1^*}_a$ (0.063) approaches significance at the standard 0.05 level, while the p-value for the important $\overline{\mathrm{Prec}_J}_a$ case (0.0071) is significant. Thus, we conclude the expected value of $\overline{\mathrm{Prec}_J}_a$ for CompleteLM is larger than for Baseline. Since the $\overline{\mathrm{Prec}_J}_a$ measure, unlike $\overline{\mathrm{AROC}}_a$ and $\overline{F_1^*}_a$, depends only on labels of top scoring instances, the improvement in $\overline{\mathrm{Prec}_J}_a$ is especially relevant to literature browsing systems, which are likely to provide access to figures for only a few of the highest scoring instances. In the recall-precision curves (Figure 2) we observe that, except for very small recall levels less than 0.05, the CompleteLM curve dominates Baseline up to a recall of 0.8.
We now turn to a comparison of our eight LMs. We look first at the performance of different figure models. For a given background model, the LMs for pairings with Mixture and Pooled figure models have consistently better performance than pairings with both Caption Only and Referencing Only models. Paired t-tests on the per-article measures confirm that for all cases these differences are significant ($p \le 0.01$). We conclude that our LM approach successfully combines complementary sources of text. Next, we compare our two background models. The Fixed Size models, which have background vocabulary sizes set to a constant value because of a potential bias against linking long sentences, have better performance than Variable Size models when comparisons are made between pairings with the same figure model. Differences between Fixed Size and Variable Size background models paired with Mixture figure models are significant (p-values of paired t-tests $\le 0.01$) for all three measures.
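The exact figure-specific and background models are defined in the methods section. Purely as a rough illustration of how a mixture figure model can combine caption text, referencing text, and a background model, the sketch below scores an abstract sentence under a three-component unigram mixture. The function names, the mixture weights (`w_cap`, `w_ref`, `w_bg`), the probability floor, and the toy texts are all assumptions made for this example, not the trained models evaluated above.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram distribution over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def mixture_log_score(sentence, caption, referencing, background,
                      w_cap=0.4, w_ref=0.4, w_bg=0.2, floor=1e-12):
    """Log-probability of a sentence's distinct terms under a mixture model.

    The weights are hypothetical; in a supervised setting they would be
    estimated from annotated data.
    """
    p_cap, p_ref, p_bg = unigram(caption), unigram(referencing), unigram(background)
    score = 0.0
    for term in set(sentence):  # set-of-words view of the sentence
        p = (w_cap * p_cap.get(term, 0.0)
             + w_ref * p_ref.get(term, 0.0)
             + w_bg * p_bg.get(term, 0.0))
        score += math.log(max(p, floor))  # floor guards against log(0)
    return score


# Toy texts (hypothetical).
cap_a = "kinase activity increases after treatment".split()
ref_a = "figure shows kinase activity in treated cells".split()
cap_b = "crystal structure of the receptor domain".split()
ref_b = "the structure reveals a binding pocket".split()
sentence = "treatment increases kinase activity".split()
background = cap_a + ref_a + cap_b + ref_b + sentence

score_a = mixture_log_score(sentence, cap_a, ref_a, background)
score_b = mixture_log_score(sentence, cap_b, ref_b, background)
print(score_a > score_b)  # the sentence shares terms with figure A only
```

The background component plays the role of smoothing here: terms absent from a figure's caption and referencing text still receive non-zero probability, so a single missing term does not rule out a link.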
To investigate a possible bias against linking longer sentences, we plot in Figure 3 empirical cumulative distribution curves of sentence lengths for four collections of sentence/figure instances: all 5402 instances (magenta line), all 947 linked instances (black line), and the top scoring 826 instances under the (Variable Size, Mixture) (blue line) and (Fixed Size, Mixture) (red line) models. We choose 826 because this is the total number of abstract sentences in our corpus, and thus these curves show the sentence length distribution at the $\mathrm{Prec}_J$ point. The Variable Size method prefers linking short sentences rather than long sentences (gap between the blue and black curves for a given sentence length). For example, although only 58% of linked instances have sentences with ten or fewer terms, 76% of all high-scoring instances under the Variable Size approach have sentences at least this short. The Fixed Size approach eliminates this bias, especially for sentences longer than 10 terms. Interestingly, there appears to be an actual preference for longer sentences to be linked, as seen by comparing the magenta "All Instances" curve to the black "Linked Instances" curve. This may reflect a positive correlation between sentence length and information content.
Lastly, we look in more detail at the uniformly higher performance in the per-article metrics by setting up an experiment to see if the differences are due to inhomogeneity of the data. That is, do some articles have language modelling scores that make linking sentences with figures easier? This may be due to an author's style or the topic being discussed. We explore this by employing a permutation test. We hold model type fixed and compare performance values calculated using the whole-corpus method with their per-article counterparts (that is, we compare values within rows of Table 1). We observe that for all three measures and all models the per-article performance value is larger than the whole-corpus value. Although we expect larger per-article values for $\overline{F_1^*}_a$ because of the way this measure is calculated, there is no calculation bias for area under the ROC curve and precision.
The permutation test shuffled the associations between articles and sentence/figure instances, keeping the number of linked instances associated with each article fixed. Although whole-corpus performance values do not depend on article assignments, per-article performance values do. For each of 1000 permutations we computed $\overline{\mathrm{AROC}}_a$ and $\overline{\mathrm{Prec}_J}_a$ from linkage scores of CompleteLM using the shuffled article associations. Figure 4 shows normalized histograms of observed performance values from the permutation test along with actual whole-corpus and per-article values. Although there is an article effect for both measures, it is clearest for AROC. On this measure, while the whole-corpus AROC value of 0.777 was near the median (exceeding the permuted per-article value in 485 of the 1000 permutations), the actual per-article $\overline{\mathrm{AROC}}_a$ of 0.805 was substantially larger than any of the permuted values. On precision, the whole-corpus $\mathrm{Prec}_J$ value was larger than 904 of the permuted values, and the actual per-article $\overline{\mathrm{Prec}_J}_a$ was larger than 999 of the 1000 permuted values. One consequence of the observed article effect is that, since in a literature browsing system linkage scores are only considered one article at a time, whole-corpus performance measures will underestimate system performance in practice. A second and more important consequence is that a single, fixed threshold on linkage score separating positive and negative predictions is not appropriate. It will be too permissive for articles with score-increasing effects, and conversely too restrictive for articles with score-reducing effects.
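The permutation procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the per-article statistic is precision at the number of linked instances (a stand-in for the measures reported in the paper), and the two-article corpus is toy data.

```python
import random


def permute_articles(articles, rng):
    """Shuffle instances across articles, keeping each article's number of
    linked and unlinked instances fixed.

    articles: list of articles, each a list of (score, is_linked) pairs.
    """
    linked = [inst for art in articles for inst in art if inst[1]]
    unlinked = [inst for art in articles for inst in art if not inst[1]]
    rng.shuffle(linked)
    rng.shuffle(unlinked)
    shuffled, li, ui = [], 0, 0
    for art in articles:
        n_linked = sum(1 for _, is_linked in art if is_linked)
        n_unlinked = len(art) - n_linked
        shuffled.append(linked[li:li + n_linked] + unlinked[ui:ui + n_unlinked])
        li += n_linked
        ui += n_unlinked
    return shuffled


def mean_precision_at_linked_count(articles):
    """Per-article statistic (assumed for this sketch): precision among the
    top n instances, where n is the article's number of linked instances."""
    values = []
    for art in articles:
        n = sum(1 for _, is_linked in art if is_linked)
        if n == 0:
            continue
        top = sorted(art, key=lambda inst: -inst[0])[:n]
        values.append(sum(1 for _, is_linked in top if is_linked) / n)
    return sum(values) / len(values)


def permutation_test(articles, statistic, n_perm=1000, seed=0):
    """Return the actual statistic and how many permuted values fall below it."""
    rng = random.Random(seed)
    actual = statistic(articles)
    below = sum(statistic(permute_articles(articles, rng)) < actual
                for _ in range(n_perm))
    return actual, below


# Toy corpus with a strong article effect (hypothetical scores).
articles = [
    [(0.9, True), (0.8, True), (0.7, False), (0.6, False)],
    [(0.5, True), (0.4, True), (0.3, False), (0.2, False)],
]
actual, below = permutation_test(articles, mean_precision_at_linked_count, n_perm=200)
print(actual, below)
```

If the actual per-article value exceeds nearly all permuted values, as reported above for CompleteLM, the grouping of instances into articles carries information about linkage scores.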

Results for Combined Text and Non-text Models
In this section we evaluate approaches to linkage prediction that utilize both text and non-text features. We consider two kinds of non-text features: positional features and linkage features.
Information gain of non-text features. Table 2 shows the percent information gain (% gain) of non-text features. For comparison, we also show the % gain of the text-based CompleteLM model's linkage scores ($S_{LM}(j, k)$, see methods). A feature's % gain indicates how predictive of linkage that feature is in isolation. It ranges from 0% (not predictive) to 100% (completely predictive). It is not surprising that the % gain of CompleteLM scores (16.30%) is, by a wide margin, the largest, as these scores come from models of numerous text features while the other % gain values in the table are for individual features. After CompleteLM, the next three features form a well separated group with similar % gains. This group comprises two linkage features, EdgesCrossed (7.82%) and FigureDegree (7.37%), along with Distance (7.48%), a positional feature. The relatively high gain of Distance agrees with previous work [1] where we found that predictions based on text and Distance are more accurate than predictions of text-only models. The Distance feature is, however, the only non-text feature previously used for abstract sentence/figure linkage prediction. Therefore, the present work represents the first time the other features in Table 2 have been considered for this task.
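One common way to obtain a gain value that ranges from 0% to 100% is to divide the information gain of a feature by the entropy of the class label. The sketch below uses that normalization for a discrete feature; this formula, the function names, and the toy data are assumptions for illustration, and the exact computation behind Table 2 may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def percent_gain(feature_values, labels):
    """Information gain of a discrete feature as a percentage of label entropy."""
    h_label = entropy(labels)
    if h_label == 0:
        return 0.0  # a constant label cannot be made more predictable
    n = len(labels)
    h_conditional = 0.0
    for value in set(feature_values):
        subset = [lab for feat, lab in zip(feature_values, labels) if feat == value]
        h_conditional += len(subset) / n * entropy(subset)
    return 100.0 * (h_label - h_conditional) / h_label


# Toy data (hypothetical): 1 = linked, 0 = not linked.
labels = [1, 1, 0, 0]
print(percent_gain([1, 1, 0, 0], labels))  # feature determines the label
print(percent_gain([0, 1, 0, 1], labels))  # feature is independent of the label
```

A continuous feature such as a linkage score would first need to be discretized, which is another choice this sketch leaves open.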
Other than Distance, Initial Sentence, with a modest gain of 1.78%, is the only positional feature with % gain significantly different from 0.0. All linkage features, on the other hand, have statistically significant % gain values. Indeed, four linkage features have gains exceeding 4%. If appropriately modeled, these linkage features may lead to more accurate predictions of abstract sentence/figure associations. Incorporating them into a model, however, is challenging since linkage feature values are unobserved at prediction, and therefore approaches that predict linkages of each instance independently are unable to use linkage features. In fact, a key motivation behind our HMM and CRF approaches was to utilize their collective classification properties to model linkage features.
Evaluation of linkage predictions. To evaluate models of text and non-text features, we performed leave-one-article-out cross-validation experiments similar to those we used to evaluate language models. We look at three approaches to modeling text and non-text. Our CompleteLM+DM approach combines CompleteLM with the distance model (DM). Like the LM approaches, it predicts linkages independently for each sentence/figure pair. In contrast, our other two approaches, HMMs and CRFs, make collective predictions, and moreover utilize both positional and linkage non-text features. We evaluate both the sentences-in-states (SIS) and figures-in-states (FIS) HMM and CRF variants. We compare predictions of our models that merge text and non-text features to predictions of models that consider only distance (DM), only text (CompleteLM), and two baselines: the text-only Baseline described above and a combined text and distance method used in a previous study [1]. This text and distance baseline (called SIM() by its authors, but which for consistency we refer to as Baseline+DM) represents the current state of the art for predicting abstract sentence/figure linkage. As above, we use the AROC, $F_1^*$ and $\mathrm{Prec}_J$ performance measures calculated corpus-wide and per-article. Additionally, we report per-article values of clicks, labeled $\overline{\mathrm{Clicks}_J}_a$.
Table 3 shows the performance of various models, and Figure 5 shows whole-corpus recall-precision curves for a subset of models. CRF (SIS), our top performing model, has the highest performance on all measures. Paired t-tests indicate that differences between CRF (SIS) and Baseline+DM for all per-article measures are statistically significant (all p-values $< 0.001$). From the recall-precision curves in Figure 5 we see that, except for recall levels $< 0.05$, the CRF (SIS) curve dominates the Baseline+DM curve. Therefore, we conclude that CRF (SIS) represents a significant improvement over the state of the art for predicting linkages between abstract sentences and figures.
Comparing the CRF and HMM methods, we observe that CRFs usually, but not always, outperform HMMs for the same variant (either SIS or FIS). The HMM (SIS) has higher precision than CRF (SIS) for recall levels below about 0.4, while CRF (SIS) has the higher precision at higher recall levels. However, when we compare the SIS and FIS constructions we see that HMM (SIS), the lower-performing SIS construction, outperforms both FIS approaches. Hence, it appears model variant (SIS or FIS) has more effect on performance than model type (HMM or CRF). Even so, the differences between CRF (SIS) and HMM (SIS) are significant (p-value $< 0.05$) for $\overline{\mathrm{AROC}}_a$ and $\overline{F_1^*}_a$ (but not $\overline{\mathrm{Prec}_J}_a$). Since aspects of the FigureDegree feature are captured by the SIS CRF but not the FIS variant, the superiority of the performance of SIS over FIS agrees with the relative information gain values (Table 2) of FigureDegree (7.37%) and SentenceDegree (1.51%). To better understand the performance differences between CRF (SIS) and CRF (FIS), we compared articles for which CRF (SIS) had the larger $\overline{\mathrm{Prec}_J}_a$ score to those where CRF (FIS) had the higher score. Articles won by CRF (FIS) had an average of 1.11 fewer abstract sentences than articles won by CRF (SIS). A permutation test reveals that this 1.11 sentence difference is statistically significant. This suggests that a suite-of-models approach, where the model applied to linkage prediction on a given article depends on the number of abstract sentences or other observable article properties, may be effective.
The overall trend evident in the performance measures of Table 3 and the recall-precision curves in Figure 5 is that performance increases as more types of features are utilized. Of the models that use a single class of features, those that use text only are clearly superior to the DM approach. Combining DM with text models gives a substantial performance boost, most markedly for Baseline+DM versus Baseline. We see another performance bump for models that incorporate linkage features. Thus, we conclude that text, positional and linkage features are complementary for linkage prediction, and that our approaches successfully integrate these diverse types of evidence.

Results for Human Annotators
We invited authors of a disjoint set of 49 additional PNAS articles to provide annotations of abstract sentence/figure associations for their articles, and to evaluate a prototype of our online article browsing system on their articles. We subsequently asked authors to complete a short four-question usability survey. A total of 21 authors participated, for a response rate of 43%. Further, we asked three biomedical researchers who are not authors of any of these articles to provide additional annotations, from which we obtained linkage annotations for 14 of these articles.
The 14 articles annotated by both authors and non-authors contain a total of 420 abstract sentence/figure instances. Table 4(a) shows the contingency table for the linkage annotations of these instances. Authors and non-authors have a related concept of sentence/figure association ($p < 0.001$ for a $\chi^2$ test of independence of counts in Table 4). Authors and non-authors agree on linkage status on 81% of instances, and inter-annotator agreement as measured by Cohen's $\kappa$ is 0.47. The concept of association, however, is not precise, as non-authors and authors disagree 19% of the time. It is interesting that non-authors, with a 27% linked rate, appear to have a significantly more liberal notion of association than authors, who identify only 17% of instances as being linked.
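As a plausibility check on the agreement figures above, Cohen's kappa for two annotators and two labels can be recomputed from the observed agreement and each annotator's linked rate. The sketch below uses only the rounded percentages quoted in the text (81%, 17%, 27%), not the actual counts from Table 4, so its result is approximate; the function name is hypothetical.

```python
def cohens_kappa(p_observed, rate_a, rate_b):
    """Cohen's kappa for two annotators assigning a binary label.

    p_observed: fraction of instances on which the annotators agree.
    rate_a, rate_b: fraction of instances each annotator labels positive.
    """
    # Chance agreement: both say "linked" or both say "not linked".
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    return (p_observed - p_expected) / (1 - p_expected)


# Rounded values quoted in the text: 81% agreement, 17% and 27% linked rates.
kappa = cohens_kappa(0.81, 0.17, 0.27)
print(round(kappa, 2))  # approximately 0.45
```

The value obtained from rounded inputs is close to the reported 0.47; the small difference is what one would expect from rounding the three percentages.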
Besides estimating inter-annotator agreement, we can use non-author annotations to compare the computational predictions of our models with human predictions. Using author annotations as ground truth, we compare the performance of linkage predictions made by humans (i.e., non-authors) with computational predictions. Using the CRF (SIS) model trained from author annotations for the 114 articles used above, we predicted linkages for the 14 articles that have both author and non-author annotations. For an article with $J$ abstract sentences we predict that the top scoring $J$ instances are linked and that the other instances are not linked. Thus, the CRF (SIS) row of Table 4 reports performance at this fixed operating point. Note the distinction between the F1 measure in Table 4 (average of F1 calculated at one predetermined point in each article) and the $\overline{F_1^*}_a$ measure in Tables 1 and 3 (average of maximum F1 value in each article). While human annotators have higher performance values on all measures, none of these differences are statistically significant, and the performance of CRF (SIS) is competitive with that of the non-author humans. On 9 of the 14 articles the human had the higher F1 score, while on the other five articles CRF (SIS) had the higher F1 score.
Table 5 shows the pilot survey questions and average response values. We observe that users tend to have a positive view of the accuracy and usability of the prototype system. Interestingly, there is a significant positive correlation between an author's score for Q2 ("How useful are current figure-sentence associations?") and the F1 score of the system predictions for their article ($r = 0.44$, $p = 0.04$). Similar correlations were observed for other questions. From these results we conclude that methods for making more accurate predictions of sentence/figure associations, including the computational approaches we describe in this article, will lead to more usable online literature browsing systems.

Conclusion
We have described methods for computationally identifying associations between sentences in the abstract of a scientific article and figures (and tables) in the article body. We use supervised methods for learning. Our models use three types of evidence to predict whether or not an abstract sentence is linked with a figure: text (in the abstract sentence, figure caption, and passages that refer to the figure), the relative positions of the abstract sentence and figure, and patterns of inferred associations for other sentence/figure pairs in the article.
Each type of evidence has predictive value. Our experimental evaluation showed that models that use all evidence types are more accurate than models that use only one or two types of evidence. Our best performing model, based on conditional random fields (CRFs) [3], achieves a macro-average F1 score of 0.69. The area under its ROC curve is 0.86. These performance measures represent a statistically significant improvement on the state of the art for this task, an unsupervised approach developed earlier [1]. Moreover, disagreement of human annotators on linkage status is nearly as common as prediction errors of our system.
We observed that the use of a language model significantly improved the results of previous work, where a TFIDF cosine similarity was used. Once a larger data set is collected and more detailed user feedback is assembled, a natural area of future exploration is more sophisticated language models. For example, the use of word bigram models, smoothing based on related clusters of articles, and divergence metrics such as Jensen-Shannon are all possible extensions of this work [27].
Automatic methods for predicting linkages between abstract sentences and figures are important for the development of the next generation of literature search and browsing systems. A user study showed that users find the figure browsing features supported by our linkage predictions to be helpful. We have incorporated linkage predictions into our system (http://FigureItOut.askHERMES.org).

Data and Features
We re-use our collection of 114 full-text biomedical articles (39 from Cell, 29 from EMBO, 30 from the Journal of Biological Chemistry, and 16 from PNAS) from our previous study [1]. [3,13] and [3,11], respectively.
Term vectors. We represent the text content of captions and referencing paragraphs with the ''bag-of-words'' representation, and for abstract sentences we use the ''set of words'' representation. The term vector $\vec{T}$ for a sentence or figure has $V$ elements, one element for each term in the vocabulary. For figures, $T(t)$ is the number of occurrences of term $t$, while for sentences it is a binary indicator of the presence (1) or absence (0) of $t$.
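The two representations can be illustrated with a minimal Python sketch. The toy vocabulary, the pre-tokenized inputs, and the function names `bag_of_words` and `set_of_words` are assumptions made for illustration and are not taken from the original system.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count vector: element t is the number of occurrences of term t."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def set_of_words(tokens, vocabulary):
    """Binary vector: element t is 1 if term t is present, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

# Hypothetical vocabulary and token lists (illustration only).
vocabulary = ["binding", "kinase", "mutant", "protein"]
caption_tokens = ["kinase", "binding", "kinase", "mutant"]
sentence_tokens = ["kinase", "kinase", "protein"]

figure_vector = bag_of_words(caption_tokens, vocabulary)
sentence_vector = set_of_words(sentence_tokens, vocabulary)
```

Repeated terms raise the count in the figure vector but leave the sentence vector unchanged, which is the only difference between the two representations.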
Positional features. In addition to text we also use non-text features. The features naturally divide into two groups, positional features and linkage features. The value of the positional features for sentence $j$ and figure $k$ in article $a$ depends on the positions $j$, $k$ and the total number of abstract sentences ($J$) and figures ($K$) in $a$. We number sentences and figures sequentially as they appear in the article. So, for example, instance $(j,k)$ is for the $j$th abstract sentence and $k$th figure in the article.
Linkage features. We compute the value of linkage features from article-wide linkage patterns. We represent the linkage of an article with the $J$-by-$K$ linkage matrix $L$, where $L(j,k) = 1$ if sentence $j$ is linked with figure $k$, and 0 otherwise. Figure 6 shows an example, which we use in the following six definitions of linkage features.
EdgesCrossed. This feature is the number of links inconsistent with the preservation of relative ordering across links. The name EdgesCrossed comes from the number of edges that would be crossed by the edge $j$-$k$ in the graph representation of $L$. In the example in Figure 6, the value of EdgesCrossed(2,3) is 1 because the edge 2-3 crosses the single edge 4-1, and EdgesCrossed(3,1) is 2 because the edge 3-1 would cross 2 edges.
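One way to read the crossing count is sketched below in Python, under the assumption that two links cross exactly when their sentence order and figure order strictly disagree. The function name `edges_crossed` and the list of links are hypothetical; the links were chosen so that the two values quoted in the text come out, and they are not a reproduction of Figure 6.

```python
def edges_crossed(j, k, links):
    """Count existing links (j2, k2) that the edge (j, k) would cross.

    Assumption: a crossing occurs when the sentence order and the
    figure order of the two links strictly disagree.
    """
    return sum(1 for (j2, k2) in links if (j - j2) * (k - k2) < 0)

# Hypothetical (sentence, figure) links; NOT the actual content of Figure 6.
links = [(1, 2), (2, 2), (4, 1)]

crossed_2_3 = edges_crossed(2, 3, links)  # only the link (4, 1) disagrees in order
crossed_3_1 = edges_crossed(3, 1, links)  # the links (1, 2) and (2, 2) disagree
```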
Since L is not observed while predicting, linkage feature values are also hidden. Therefore, these features are not helpful in methods that predict instance linkages independently. The inferred values of linkage features may, however, benefit prediction by techniques like our HMM and CRF approaches.

Language Models
We model text properties of linked and non-linked instances using probabilistic language models (LM). Our LM approach is motivated by the successful application of similar methods to document retrieval [27][28][29]. For document retrieval the LM approach induces for each document a probability model over all terms in the vocabulary. Then, a document's relevance to a query is defined as the probability of the query under its model.
Hiemstra's LM approach [28] uses two kinds of term distributions: a single background distribution $\vec{b}$ shared by all documents, and a set of document-specific distributions $\vec{d}_z$, one for each document $z$. The probability of a query term $t$ for document $z$ is a mixture of the two distributions,

$$\Pr(t \mid z) = \lambda\, b(t) + (1-\lambda)\, d_z(t).$$

This mixture distribution corresponds to a generative process for constructing query terms for $z$ that first randomly selects either $\vec{b}$ or $\vec{d}_z$ according to the mixing distribution (parameterized by $\lambda$) and then samples a term from the chosen distribution. To generate a query with $L$ terms, these two steps are repeated $L$ times.
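In a similar way we use language models to predict links between abstract sentences and figures by treating abstract sentences as queries and figures as documents. Let $b(t)$ and $d_k(t)$ be the probability of term $t$ under the background distribution and figure $k$'s distribution, respectively. Then, the probability of abstract sentence $j$ given that it is linked to figure $k$ is

$$\Pr(\vec{T}_j \mid j \leftrightarrow k) = \prod_{t=1}^{V} \left[\lambda\, b(t) + (1-\lambda)\, d_k(t)\right]^{T_j(t)}, \tag{1}$$

where $\vec{T}_j$ is sentence $j$'s length-$V$ term vector, $0 \le \lambda \le 1$ is the mixing proportion for the background distribution, and $j \leftrightarrow k$ denotes that $j$ and $k$ are linked, or equivalently that $L(j,k) = 1$. If $j$ and $k$ are not linked, the background distribution generates all terms in the sentence:

$$\Pr(\vec{T}_j \mid j \nleftrightarrow k) = \prod_{t=1}^{V} b(t)^{T_j(t)}. \tag{2}$$

The LM score matrix $S_{LM}$ for an article holds the log-odds of the sentence terms given linkage for all instances $(j,k)$. For an article with $J$ sentences and $K$ figures, $S_{LM}$ is $J$-by-$K$ and

$$S_{LM}(j,k) = \log \frac{\Pr(\vec{T}_j \mid j \leftrightarrow k)}{\Pr(\vec{T}_j \mid j \nleftrightarrow k)}. \tag{3}$$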
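The log-odds score can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the sentence is given as a set of distinct terms (the set-of-words representation), every sentence term has non-zero background probability, and the dictionaries, probabilities, and the name `lm_score` are hypothetical.

```python
import math

def lm_score(sentence_terms, figure_dist, background, lam):
    """Log-odds that the sentence terms were generated by the linked
    (mixture) model rather than by the background model alone."""
    score = 0.0
    for term in sentence_terms:
        linked = lam * background[term] + (1.0 - lam) * figure_dist.get(term, 0.0)
        score += math.log(linked / background[term])
    return score

# Hypothetical distributions (illustration only).
background = {"kinase": 0.2, "protein": 0.5, "mutant": 0.3}
figure_dist = {"kinase": 0.6, "mutant": 0.4}

shared_term_score = lm_score({"kinase"}, figure_dist, background, lam=0.5)
absent_term_score = lm_score({"protein"}, figure_dist, background, lam=0.5)
```

A term that is more probable under the figure's distribution than under the background raises the score, while a term absent from the figure's text lowers it by $\log \lambda$.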
Figure-specific models. A natural and often-used representation of the document-specific term distribution $\vec{d}$ is a multinomial distribution where each probability $d(t)$ has its own parameter. Parameter estimation for the multinomial model typically treats all occurrences of $t$ in the document equally, and sets $d(t)$ to its frequency in the document. We consider multinomial representations, but we also use a representation that distinguishes caption terms from terms in referencing paragraphs.
Since in our approach to the linkage prediction task, figures play a role analogous to documents, to apply the multinomial approach we need to determine which terms represent a figure. Candidate term sources include terms in the figure's caption as well as terms in the article body close to figure references. We consider three sources: caption terms (Caption Only), terms in referencing paragraphs (Referencing Only) and the combination of terms in either the caption or a referencing paragraph (Pooled).
Let $n^c_k(t)$ be the number of occurrences of term $t$ in figure $k$'s caption and $N^c_k$ be the total number of terms in the caption. We similarly define $n^r_k(t)$ and $N^r_k$ for figure $k$'s referencing paragraphs. The probabilities $d^c_k(t)$, $d^r_k(t)$ and $d^p_k(t)$ of term $t$ in the Caption Only, Referencing Only, and Pooled representations are simply its frequency in each collection:

$$d^c_k(t) = \frac{n^c_k(t)}{N^c_k}, \qquad d^r_k(t) = \frac{n^r_k(t)}{N^r_k}, \qquad d^p_k(t) = \frac{n^c_k(t) + n^r_k(t)}{N^c_k + N^r_k}.$$

Note that we do not use pseudo-counts here, as smoothing is unnecessary because the background distribution $\vec{b}$ is used for terms that have zero probability in the figure-specific model.
Although the Pooled method has the advantage of including text from multiple sources, it is limited in that it ignores term origin even though there may be meaningful differences between terms in captions and referencing paragraphs. For example, while text in referencing paragraphs can discuss topics unrelated to the figure, caption content nearly always relates to the figure. Our final figure-specific term distribution, which we call Mixture, distinguishes between caption terms and referencing paragraph terms. In the Mixture approach we represent $d_k(t)$ itself as a mixture of the Caption Only and Referencing Only distributions,

$$d_k(t) = \alpha\, d^c_k(t) + (1-\alpha)\, d^r_k(t),$$

where $0 \le \alpha \le 1$ is the mixing proportion for the caption distribution.
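The four figure-specific representations can be compared on a toy example. The sketch below is illustrative only: the token lists, the value of the mixing proportion, and the helper names `relative_frequencies` and `mixture` are assumptions.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Unsmoothed term frequencies of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def mixture(caption_dist, referencing_dist, alpha):
    """Weight the caption distribution by alpha and the referencing
    paragraph distribution by 1 - alpha."""
    terms = set(caption_dist) | set(referencing_dist)
    return {t: alpha * caption_dist.get(t, 0.0)
               + (1.0 - alpha) * referencing_dist.get(t, 0.0)
            for t in terms}

# Hypothetical tokens for one figure (illustration only).
caption = ["kinase", "assay"]
referencing = ["kinase", "kinase", "cells", "growth"]

caption_only = relative_frequencies(caption)
referencing_only = relative_frequencies(referencing)
pooled = relative_frequencies(caption + referencing)
mixed = mixture(caption_only, referencing_only, alpha=0.8)
```

In this example the caption term "assay" has probability 1/6 under Pooled, where the longer referencing text dominates, but 0.4 under Mixture with a caption weight of 0.8.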
Background models. We consider two approaches to setting the background distribution of article $a$. One approach pools all terms present in abstract sentences, figure captions and referencing paragraphs in $a$, and sets background probabilities to the smoothed term frequencies,

$$b(t) = \frac{n(t) + 1}{\sum_{t'} n(t') + V_a}, \tag{8}$$

where $n(t)$ is the count of term $t$ in the pooled collection and $V_a$ is the number of distinct terms in article $a$'s sentence/caption/referencing paragraph pool. Since the vocabulary size $V_a$, which depends on the number of distinct terms in abstract sentences, captions and referencing paragraphs, varies from article to article, we call this approach to setting the background distribution VariableSize.
Because of finite sampling, however, Equation 8 may lead to biases that favor linking short sentences and against linking long sentences. This bias arises because the probabilities $b(t)$ of terms present in the pooled collection set according to Equation 8 are too large. The $b(t)$ tend to be overestimates because, since the pooled collection is unlikely to contain all terms in the vocabulary, any term not in the pool has (an implicit) background probability of zero. Therefore, the probabilities of the absent terms are underestimated, and their true probability mass is distributed among the probabilities of the present terms. From Equations 1-3 it can be seen that overestimates of $b(t)$ cause a corresponding overestimate of $\Pr(\vec{T}_j \mid j \nleftrightarrow k)$ (and thus underestimation of $S_{LM}$) that increases with sentence length.
To correct for the bias in VariableSize, we consider an alternative approach to estimating $\vec{b}$ that uses a fixed vocabulary size of $Z$ terms in all articles. We call this approach FixedSize. We use a pseudo-count of 1 for all terms, and set the background probability for term $t$ to

$$b(t) = \frac{n(t) + 1}{\sum_{t'} n(t') + Z}.$$
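The difference between the two background models can be seen in a small Python sketch. It assumes add-one smoothing in both variants (the text states a pseudo-count of 1 only for FixedSize), and the pooled tokens, the value of `vocab_size`, and the name `background_model` are hypothetical.

```python
from collections import Counter

def background_model(pool_tokens, vocab_size=None):
    """Return a function giving add-one smoothed background probabilities.

    vocab_size=None uses the number of distinct pooled terms (VariableSize);
    otherwise the given fixed vocabulary size is used (FixedSize).
    """
    counts = Counter(pool_tokens)
    size = len(counts) if vocab_size is None else vocab_size
    denominator = sum(counts.values()) + size

    def prob(term):
        # Counter returns 0 for terms that are not in the pool.
        return (counts[term] + 1) / denominator

    return prob

# Hypothetical pooled terms for one article (illustration only).
pool = ["kinase", "kinase", "assay"]
variable = background_model(pool)
fixed = background_model(pool, vocab_size=10)

seen_mass_variable = variable("kinase") + variable("assay")
seen_mass_fixed = fixed("kinase") + fixed("assay")
```

Under VariableSize the observed terms absorb all of the probability mass, whereas FixedSize reserves part of it for terms that do not occur in the pool, which lowers the probabilities of the observed terms.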
We describe below how we set $Z$ from training sets.
Learning language models. We evaluate our LMs with leave-one-article-out cross-validation experiments. Our experiments evaluate each of the eight kinds of LMs: one LM for each pairing of a figure-specific model (four kinds) with a background model (two kinds). For the cross-validation fold in which article $a$ is in the test set, since the background and figure-specific distributions for $a$ are set from only the terms in $a$ and not any linkages, the parameters in $\vec{b}$ and the $\vec{d}_k$ do not depend on the training set of annotated articles. We do, however, use training sets to estimate our other LM parameters: the mixing proportions $\lambda$ and $\alpha$, and the fixed-vocabulary size $Z$.
We estimate separate parameter values for each LM. We first set $\lambda$ for the (VariableSize, Caption Only), (VariableSize, Referencing Only) and (VariableSize, Pooled) approaches. We search over 99 values of $\lambda$ equally spaced from 0.01 to 0.99, and set $\lambda$ to the value that maximizes the mean $\mathrm{Prec}^{J}_{a}$ on the training set. Next, we set $Z$ for the (FixedSize, Caption Only), (FixedSize, Referencing Only) and (FixedSize, Pooled) models. For each figure-specific model we temporarily set $\lambda$ equal to the value just set for its pairing with VariableSize, and then estimate $Z$ by the value that minimizes the absolute value of the Pearson correlation between sentence length and $\Pr(j \leftrightarrow k \mid \vec{T}_j)$ for all training instances $(j,k)$. With $Z$ set, we then estimate $\lambda$ for these three LMs as we do above, by the value that maximizes the mean $\mathrm{Prec}^{J}_{a}$. Lastly, we set parameters of LMs with Mixture figure-specific models. We set $\lambda$ and $\alpha$ for (VariableSize, Mixture) with a method similar to the method we use to set $\lambda$ for the other VariableSize models, though now we compute $\mathrm{Prec}^{J}_{a}$ for joint $\lambda$, $\alpha$ settings. So as to maximize the diversity of our parameter search, we define $\lambda_1 = \lambda$, $\lambda_2 = (1-\lambda_1)\alpha$, $\lambda_1 + \lambda_2 + \lambda_3 = 1$, and conduct our search on 120 evenly spaced points on the standard 2-simplex: $\{\lambda_1, \lambda_2, \lambda_3 \mid \lambda_1 \ge 0, \lambda_2 \ge 0, \lambda_1 + \lambda_2 \le 1\}$. Finally, we set $Z$, $\lambda$ and $\alpha$ for (FixedSize, Mixture) analogously to the method we use above to set $Z$ and $\lambda$ for the other FixedSize models. We first set $Z$ by minimizing correlation between sentence length and score, and next set $\lambda$ and $\alpha$ by search on the 2-simplex.
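A grid of 120 evenly spaced points on the 2-simplex can be generated as sketched below. This is one possible construction, not necessarily the grid used in the experiments: it assumes a triangular lattice with 14 steps per axis (which yields exactly 120 points), and it assumes that the three simplex weights are the background weight, the caption weight, and the referencing paragraph weight. The function names are hypothetical.

```python
from fractions import Fraction

def simplex_grid(steps):
    """All points (l1, l2, l3) with non-negative multiples of 1/steps
    as coordinates and l1 + l2 + l3 = 1."""
    return [(Fraction(a, steps), Fraction(b, steps), Fraction(steps - a - b, steps))
            for a in range(steps + 1)
            for b in range(steps + 1 - a)]

def to_lambda_alpha(l1, l2, l3):
    """Map a simplex point back to (lambda, alpha), assuming
    l1 = lambda and l2 = (1 - lambda) * alpha."""
    alpha = l2 / (1 - l1) if l1 < 1 else None  # alpha is undefined when lambda = 1
    return l1, alpha

grid = simplex_grid(14)
example = to_lambda_alpha(Fraction(1, 2), Fraction(1, 4), Fraction(1, 4))
```

Searching on the simplex rather than on a rectangular grid of $(\lambda, \alpha)$ values spreads the candidate weightings of the three term sources evenly.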

Distance Model
We begin our description of non-text models with models of the Distance feature. We consider distance models (DMs) because it has been shown previously [1] that the relative positions of an abstract sentence and figure correlate with linkage status. For example, a sentence near the beginning of an abstract is more likely to be linked with a figure near the beginning of an article than with a figure at the end of the article.
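We learn models of discretized values of the Distance($j$, $k$) feature for linked and non-linked instances. Let $\mathrm{Distance}^{\bullet}(j,k)$ denote the bin of Distance($j$, $k$), where we have ten bins and place bin boundaries so that an approximately equal number of points falls in each bin. We set the bin probability of bin $i$ in the DM of linked instances to the Laplace smoothed fraction of linked instances with $\mathrm{Distance}^{\bullet}(j,k) = i$,

$$\Pr(\mathrm{Distance}^{\bullet}(j,k) = i \mid j \leftrightarrow k) = \frac{n_L(i) + 1}{\sum_{i'=1}^{10} n_L(i') + 10},$$

where $n_L(i)$ is the number of linked training set instances in bin $i$. We set bin probabilities for the DM of non-linked instances in a similar way,

$$\Pr(\mathrm{Distance}^{\bullet}(j,k) = i \mid j \nleftrightarrow k) = \frac{n_N(i) + 1}{\sum_{i'=1}^{10} n_N(i') + 10},$$

where $n_N(i)$ is the number of non-linked training set instances in bin $i$. The distance model score matrix $S_D$ for an article holds the DM log-odds of the article's sentence/figure instances,

$$S_D(j,k) = \log \frac{\Pr(\mathrm{Distance}^{\bullet}(j,k) \mid j \leftrightarrow k)}{\Pr(\mathrm{Distance}^{\bullet}(j,k) \mid j \nleftrightarrow k)}. \tag{12}$$

We construct scores of a combined language and distance model by adding scores,

$$S_{LM+D}(j,k) = S_{LM}(j,k) + S_D(j,k) = \log \frac{\Pr(\vec{T}_j, \mathrm{Distance}^{\bullet}(j,k) \mid j \leftrightarrow k)}{\Pr(\vec{T}_j, \mathrm{Distance}^{\bullet}(j,k) \mid j \nleftrightarrow k)}, \tag{14}$$

where $S_{LM+D}(j,k)$ is the combined LM and DM score for sentence $j$ and figure $k$, and Equation 14 follows from Equations 3 and 12 under the assumption that terms and distances are conditionally independent given linkage status.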
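The distance model can be sketched as follows, assuming the Distance values are already available as numbers. The toy values, the use of two bins instead of ten, the way bin edges are picked from the pooled values, and all function names are assumptions made to keep the example small.

```python
import bisect
import math

def equal_frequency_edges(values, n_bins):
    """Bin edges that split the sorted values into roughly equal groups."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]

def bin_index(value, edges):
    """Index of the bin that contains value."""
    return bisect.bisect_right(edges, value)

def smoothed_bin_probs(values, edges):
    """Laplace (add-one) smoothed fraction of values falling in each bin."""
    n_bins = len(edges) + 1
    counts = [0] * n_bins
    for value in values:
        counts[bin_index(value, edges)] += 1
    total = len(values) + n_bins
    return [(count + 1) / total for count in counts]

def distance_score(value, edges, p_linked, p_nonlinked):
    """Log-odds of the distance bin under the linked vs non-linked model."""
    i = bin_index(value, edges)
    return math.log(p_linked[i] / p_nonlinked[i])

# Hypothetical Distance values for training instances (illustration only).
linked = [0.0, 0.1, 0.1, 0.2]
nonlinked = [0.5, 0.6, 0.8, 0.9, 0.1]

edges = equal_frequency_edges(linked + nonlinked, n_bins=2)
p_linked = smoothed_bin_probs(linked, edges)
p_nonlinked = smoothed_bin_probs(nonlinked, edges)
near_score = distance_score(0.05, edges, p_linked, p_nonlinked)
far_score = distance_score(0.7, edges, p_linked, p_nonlinked)
```

Because both scores are log-odds, the combined score for an instance is obtained by adding the distance score to the language model score.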

Hidden Markov Models
In addition to patterns of the Distance feature for individual instances, linkage patterns across instances also have tendencies. For example, given two linked instances, $j \leftrightarrow k$ and $j' \leftrightarrow k'$, from the same article, if $j > j'$, then it is also likely that $k > k'$. We model these kinds of linkage-flow patterns using hidden Markov models (HMMs) [2] and conditional random fields (CRFs) [3], two kinds of probabilistic models widely used for representing structure in sequential problems. Because flow tendencies indicate preferences for linkage patterns that are, to a certain extent, independent of text, they complement textual evidence rather than replace it, and we do not want to ignore text. Both HMMs and CRFs are convenient in this regard as they provide a natural way to model both kinds of evidence. We model flow with state transition probabilities learned from a training corpus, and we model text with emission probabilities derived from the scores of learned language models. We first describe our HMM approach, and then we describe our related CRF approach.
We consider two HMM constructions: ''sentences in states'' (SIS) and ''figures in states'' (FIS). Our description is in terms of the SIS construction, but FIS can be understood by swapping 'sentence' with 'figure' in the description. Under the SIS construction, an article with $J$ sentences and $K$ figures has an HMM with $J+1$ states, $\{0\}, \{1\}, \{2\}, \ldots, \{J\}$, and the length-$K$ observation sequence $(1, 2, \ldots, K)$. State $\{j\}$ is associated with abstract sentence $j$, and the non-linked state $\{0\}$ is not associated with any sentence. We have a transition between every pair of states. Figure 7(a) shows an example HMM.
We associate linkage predictions with state-sequence paths. For example, for an article with $K = 5$ figures and $J = 3$ sentences, the path $[\{2\}, \{0\}, \{2\}, \{3\}, \{0\}]$ asserts that Figure 1 links with Sentence 2, Figure 2 does not link to any sentence, Figure 3 also links with Sentence 2, Figure 4 links with Sentence 3, and Figure 5 does not link with any sentence. With the SIS construction, a single path can link a sentence to any number of figures (0 to $K$), while a figure can only link with 0 or 1 sentences. Our CRF approach relaxes this constraint and permits figures to link with multiple sentences.
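The correspondence between an SIS state path and the links it asserts can be written out in a short Python sketch. The representation of a path as a list of sentence indices (with 0 for the non-linked state) and the name `path_to_links` are assumptions for illustration.

```python
def path_to_links(path):
    """Convert an SIS state path into (sentence, figure) links.

    path[k - 1] is the sentence index of the state occupied at figure k,
    with 0 standing for the non-linked state.
    """
    return [(sentence, figure)
            for figure, sentence in enumerate(path, start=1)
            if sentence != 0]

# The example path from the text: K = 5 figures, J = 3 sentences.
links = path_to_links([2, 0, 2, 3, 0])
```

Each figure appears in at most one link, which reflects the constraint that a figure links with 0 or 1 sentences on a single path.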
As articles have different numbers of abstract sentences, their HMMs have different numbers of states. Our approach to this variation is to learn transition probabilities for a single base HMM structure with $J^{*}+1$ states, where $J^{*}$ is the maximum number of abstract sentences in any article of the training corpus. Then, for an article with $J$ abstract sentences, we construct a $(J+1)$-state HMM to predict linkages. The transition probabilities of the derived HMM come from the base structure, while the emission probabilities are based on language model scores.
Training the base HMM structure involves estimating the entries of its $(J^{*}+1)$-by-$(J^{*}+1)$ transition matrix $U$. The value $U(j, j')$ is the probability for the transition from $\{j\}$ to $\{j'\}$. In other words, $U(j, j')$ is the probability of $j' \leftrightarrow k$ given $j \leftrightarrow (k-1)$, for all $k$. Here, we include unlinked figures in our notation by defining $0 \leftrightarrow k$ to mean that $k$ is unlinked. We estimate $U$ from the training corpus's transition counts matrix $C$, where $C(j, j')$ is the number of times that $j \leftrightarrow k$ and $j' \leftrightarrow (k+1)$ in the training set:

$$C(j, j') = \sum_{d} \sum_{k=1}^{K_d - 1} L_d(j, k)\, L_d(j', k+1).$$

Here $d$ indexes training set documents, $L_d$ is the linkage matrix for training document $d$, $K_d$ is the number of figures in $d$, and we take $L_d(0,k) = 1$ when figure $k$ is unlinked. We set the $U(j, j')$ to their MAP estimate using Dirichlet priors with hyperparameters set to 1.0.
We create the derived HMM with states $(\{0\}, \{1\}, \ldots, \{J\})$ by extracting the corresponding states and transitions between them from the base structure, and re-normalizing transition probabilities so that the sum of the outgoing probabilities from any state is 1.0. If a test-set article happens to have more abstract sentences than any article in the training corpus, we create its derived HMM by adding states to the base structure. Then, we also add transitions so that the derived HMM is fully connected, assign small probabilities to the new transitions, and re-normalize.
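The counting and re-normalization steps can be sketched in Python. The sketch makes several assumptions: training linkages are given as SIS state paths, the prior is applied as add-one smoothing of the counts (one possible reading of the Dirichlet prior; the exact estimator is not recoverable from the text), and the function names and toy paths are hypothetical.

```python
def transition_counts(paths, num_states):
    """Count transitions between consecutive states over all paths."""
    counts = [[0] * num_states for _ in range(num_states)]
    for path in paths:
        for source, target in zip(path, path[1:]):
            counts[source][target] += 1
    return counts

def smoothed_rows(counts, pseudo_count=1.0):
    """Row-normalize the counts after adding a pseudo-count (an assumption)."""
    n = len(counts)
    return [[(c + pseudo_count) / (sum(row) + pseudo_count * n) for c in row]
            for row in counts]

def derive_hmm(base, num_states):
    """Keep the first num_states states and re-normalize each row."""
    rows = [row[:num_states] for row in base[:num_states]]
    return [[p / sum(row) for p in row] for row in rows]

# Hypothetical training paths (0 is the non-linked state), with J* = 3.
paths = [[2, 0, 2, 3, 0], [1, 2, 3]]
counts = transition_counts(paths, num_states=4)
base = smoothed_rows(counts)
derived = derive_hmm(base, num_states=3)  # an article with J = 2 sentences
```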
We now describe how we set the emission probabilities of the derived HMMs to model text. Since the observation for an article with $K$ figures is the ordered sequence $(1, 2, \ldots, K)$, state $\{j\}$ emits symbol $k$ only if $j \leftrightarrow k$. We thus set $B(j, k)$, the emission probability for symbol $k$ in state $\{j\}$, based on the textual coherence between abstract sentence $j$ and figure $k$. While our language models, of course, are designed to do just this, setting the $B(j, k)$ directly from LM probabilities gives poor performance. The primary problem with this approach is that the LM probabilities are not well calibrated. As with other naive Bayes-like models, the posteriors of our LMs tend to be extreme, that is, very close to zero or one. Although models with uncalibrated probabilities often perform well in classification tasks [30], when such models are used as components within a larger model like an HMM, predictions can be poor because inference in this case involves reasoning with many uncalibrated probabilities. Therefore, we represent the emission probabilities using Gaussian models of LM scores.
We learn one Gaussian model of LM scores for linked instances and another for non-linked instances. As this approach applies to the scores $S_{LM}$ of any language model or the scores $S_{LM+D}$ of the combined language and distance model, for generality in our description we denote scores by $S$, as the computations are the same in all cases. From the training corpus we calculate the sample mean and variance of $S(j, k)$ over all linked $(\mu_L, \sigma^2_L)$ and non-linked $(\mu_N, \sigma^2_N)$ instances, and use these parameters to define Gaussian distributions for $\Pr(S(j,k) \mid j \leftrightarrow k)$ and $\Pr(S(j,k) \mid j \nleftrightarrow k)$. Let $\tilde{B}(j, k)$ be the joint probability under these models of all scores for figure $k$: $S(1, k), \ldots, S(J, k)$, given that it links only with sentence $j$, $j \ge 1$:

$$\tilde{B}(j, k) = N(S(j,k), \mu_L, \sigma^2_L) \prod_{j' \ne j,\; j' \ge 1} N(S(j',k), \mu_N, \sigma^2_N), \tag{18}$$

where $N(x, \mu, \sigma^2)$ denotes the value of the probability density function for the Gaussian with parameters $\mu$ and $\sigma^2$ at $x$. Equation 18 assumes the elements of $S$ are independent given $L$. Similarly we define $\tilde{B}(0,k)$ for non-linked $k$:

$$\tilde{B}(0, k) = \prod_{j'=1}^{J} N(S(j',k), \mu_N, \sigma^2_N).$$

Lastly, we set the emission probabilities by normalizing the $\tilde{B}$'s.
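The unnormalized emission term for one figure can be sketched in Python as a product of Gaussian densities. The parameter values, the scores, and the function names are hypothetical, and the final normalization step is omitted because the text does not specify how it is carried out.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a Gaussian with the given mean and variance at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def unnormalized_emission(scores_k, j, linked, nonlinked):
    """Joint density of all scores S(1, k), ..., S(J, k) for figure k,
    given that the figure links only with sentence j (j = 0: no link).

    scores_k[i - 1] holds S(i, k); linked and nonlinked are
    (mean, variance) pairs estimated from training scores.
    """
    value = 1.0
    for sentence, score in enumerate(scores_k, start=1):
        mean, variance = linked if sentence == j else nonlinked
        value *= gaussian_pdf(score, mean, variance)
    return value

# Hypothetical Gaussian parameters and scores (illustration only).
linked_params = (2.0, 1.0)
nonlinked_params = (-1.0, 1.0)
scores_k = [2.0, -1.0]  # S(1, k) and S(2, k)

b_none = unnormalized_emission(scores_k, 0, linked_params, nonlinked_params)
b_first = unnormalized_emission(scores_k, 1, linked_params, nonlinked_params)
b_second = unnormalized_emission(scores_k, 2, linked_params, nonlinked_params)
```

In this example the first sentence has a score typical of linked instances and the second a score typical of non-linked instances, so the state for the first sentence receives the largest value.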
We define the HMM score for abstract sentence $j$ and figure $k$, $S_{HMM}(j, k)$, as the posterior probability that the state occupied on step $k$, $\pi_k$, is state $\{j\}$,

$$S_{HMM}(j, k) = \Pr(\pi_k = \{j\} \mid (1, \ldots, K)),$$

where the probability is with respect to the article's derived HMM for the standard observation sequence $(1, \ldots, K)$. We use the posterior decoding procedure [31] to compute the HMM scores.
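Posterior decoding combines a forward and a backward pass. The following Python sketch is a plain, unscaled implementation that is adequate for the short sequences considered here (one step per figure). The start distribution, the toy probabilities, and the function name are assumptions; the text does not say how start probabilities are set.

```python
def posterior_decoding(start, trans, emit):
    """Posterior state probabilities for a fixed observation sequence.

    start[j]    : probability of starting in state j (an assumption here)
    trans[i][j] : probability of moving from state i to state j
    emit[j][k]  : emission term of state j at step k (0-based steps)
    Returns post[k][j], the posterior of state j at step k.
    """
    n, steps = len(start), len(emit[0])
    fwd = [[0.0] * n for _ in range(steps)]
    bwd = [[1.0] * n for _ in range(steps)]
    for j in range(n):
        fwd[0][j] = start[j] * emit[j][0]
    for k in range(1, steps):
        for j in range(n):
            fwd[k][j] = emit[j][k] * sum(fwd[k - 1][i] * trans[i][j] for i in range(n))
    for k in range(steps - 2, -1, -1):
        for i in range(n):
            bwd[k][i] = sum(trans[i][j] * emit[j][k + 1] * bwd[k + 1][j] for j in range(n))
    evidence = sum(fwd[steps - 1])
    return [[fwd[k][j] * bwd[k][j] / evidence for j in range(n)] for k in range(steps)]

# Hypothetical two-state, two-figure example (illustration only).
start = [0.5, 0.5]
trans = [[0.5, 0.5], [0.5, 0.5]]
emit = [[0.1, 0.9],   # state 0 at figures 1 and 2
        [0.9, 0.1]]   # state 1 at figures 1 and 2
posterior = posterior_decoding(start, trans, emit)
```

With uniform transitions the posteriors simply follow the emission terms; informative transition probabilities would shift them toward paths with typical linkage flow.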

Conditional Random Fields
Our HMM approach captures some properties of linkage features well, but fails to capture some others. Consider, for instance, the EdgesCrossed linkage feature and the transition $\{2\} \to \{1\}$. Whenever this transition is taken at step $k$, HMM semantics assert both $2 \leftrightarrow k$ and $1 \leftrightarrow (k+1)$, which induces a crossed edge in the linkage graph. Now, the HMM may learn a relatively small transition probability for $\{2\} \to \{1\}$, but only if other transitions from $\{2\}$ are more frequent in the training set. Standard HMM representations, however, provide no mechanism for generalization to transitions from other states using, for example, a common penalty for transitions that induce crossed edges. Such a general penalty is especially beneficial when learning transition probabilities, such as for $\{9\} \to \{8\}$, from less frequently visited states. Indeed, in our data set there are 98 total transitions from state $\{2\}$ but only 19 from state $\{9\}$. Conditional random fields [3] (CRFs), on the other hand, provide a mechanism for generalization through weights associated with a set of shared transition features.
Our CRF approach is similar in many respects to our HMM approach. Our CRFs also have SIS and FIS variants (we describe here the SIS variant), also associate linkage predictions with state sequence paths, and also generate a length-$K$ observation sequence $1, 2, \ldots, K$ for an article with $K$ figures. Furthermore, as with HMMs, the likelihood of a path through the model is proportional to the product of $K-1$ transition terms and $K$ emission terms. There are, however, two key differences between our CRF and HMM methods. First, while each HMM state is associated with one or zero sentences, in CRFs we also have states associated with multiple sentences. These multi-sentence states enable us to link a figure with multiple sentences on a single path. Second, CRFs use a different representation of transition affinity. While for HMMs the affinity of a transition is its transition probability, CRF transition affinity is given by a weighted sum of feature values. Sharing of features and weights enables information transfer across transitions.
A CRF for an article with $J$ sentences has a state $u$ for every subset of the sentences $\{1, 2, \ldots, J\}$ with size $\le M$. In our experiments we set $M = 3$. (Figure 7(b) shows an example CRF with $M = 2$.) We use $S(u)$ to refer to the set of sentences associated with state $u$, so $S(u) \subseteq \{1, 2, \ldots, J\}$ and $\#(S(u)) \le M$, where $\#(\cdot)$ denotes set cardinality. The state sequence path for an article with $K$ figures, $\vec{\pi} = \pi_1, \ldots, \pi_K$, asserts that figure $k$ is linked with all abstract sentences in $S(\pi_k)$. Thus, the number of abstract sentences linked with figure $k$ (the degree of $k$) is $\#(S(\pi_k)) \le M$, and the number of figures linked with abstract sentence $j$ is $\#(\{k \mid j \in S(\pi_k)\}) \le K$. Since in the SIS construction the degree of figure $k$ is entirely determined from $\pi_k$, we are able to readily encode figure degree properties in CRF transition features. On the other hand, as the degree of sentence $j$ depends on the whole path, sentence degree properties are not as amenable to representation as transition features. One of the likely reasons that SIS representations outperform FIS representations is that the linkage feature FigureDegree is substantially more predictive of linkage than SentenceDegree (Table 2).
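The CRF state space can be enumerated directly, as in the Python sketch below. It assumes that the empty subset is included and plays the role of the non-linked state; the function name `crf_states` and the example sizes are hypothetical.

```python
from itertools import combinations

def crf_states(num_sentences, max_size):
    """All subsets of {1, ..., num_sentences} with at most max_size elements.

    The empty set is included and is assumed to represent a figure
    that links with no sentence.
    """
    sentences = range(1, num_sentences + 1)
    return [frozenset(subset)
            for size in range(max_size + 1)
            for subset in combinations(sentences, size)]

# Example: J = 4 sentences and M = 2 gives 1 + 4 + 6 = 11 states.
states = crf_states(4, 2)
```

The number of states grows quickly with $M$, which is one practical reason to bound the subset size.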
Our CRFs are parameterized by the transition feature weight vector $\vec{w} = w_1, \ldots, w_F$, where $F$ is the number of transition features, and weight $w_i$ is associated with transition feature $f_i$. The weight vectors are set during training. Given $\vec{w}$, the CRF probability of the path $\vec{\pi} = \pi_1, \pi_2, \ldots, \pi_K$ is proportional to the product of the start-state affinity ($Q_s$), $K$ emission affinities ($Q_e$), and $K-1$ transition affinities ($Q_t$):

$$\Pr(\vec{\pi} \mid \vec{w}) \propto Q_s(\pi_1 \mid \vec{w}) \prod_{k=1}^{K} Q_e(\pi_k, k) \prod_{k=2}^{K} Q_t(\pi_{k-1}, \pi_k \mid \vec{w}).$$

Here, $Q_s(\pi_1 \mid \vec{w})$ is the affinity for starting in state $\pi_1$, $Q_t(\pi_{k-1}, \pi_k \mid \vec{w})$ is the affinity for the transition $\pi_{k-1} \to \pi_k$, and the emission affinity $Q_e(\pi_k, k)$ gives the affinity for linking figure $k$ with sentences $S(\pi_k)$. The emission affinities represent the textual coherence of the implied linkages, and are defined similarly to the emission probabilities of our HMMs. Also, the emission affinities do not depend on $\vec{w}$, and so, as with HMM emission probabilities, they are not adjusted during CRF training.
Transition affinities. We now describe the transition features $\vec{f}(u,v)$ we use to represent the start-state and transition affinities. We represent the transition affinity for the transition from state $u$ to $v$ using the standard log-linear model:

$$Q_t(u, v \mid \vec{w}) = \exp\left(\sum_{i=1}^{F} w_i f_i(u,v)\right),$$

where $f_i(u,v)$ is the value of feature $i$ for $u \to v$. We have a similar representation for the start-state affinity:

$$Q_s(v \mid \vec{w}) = \exp\left(\sum_{i=1}^{F} w_i f_i(0,v)\right),$$

where $f_i(0,v)$ denotes the value of feature $i$ associated with starting in $v$. To simplify description of our features below, we define $S(0) \stackrel{\mathrm{def}}{=} \emptyset$.
We now describe our eight transition features. These features are closely related to the linkage features described above. However, as each transition only provides information on linkage of two neighboring figures, each feature $f_i(u,v)$ can only depend on two adjacent columns of the linkage matrix $L$ (see Figure 6). We have a group of four binary features related to the number of sentences in the destination state $v$. The names of these features all begin with ''FigureDegree'' because the degree of figure $k$'s vertex in the graph representation is equal to $\#(S(\pi_k))$. Each of these features is a binary test on $\#(S(v))$, and for every state exactly one of these features is 1 and all other features are 0. Similarly, PreviousSentence($u \to v$) is the count of the number of times figure $k$ links with both sentence $j$ and the previous sentence $j-1$:

$$\mathrm{PreviousSentence}(u \to v) = \#\big(\{\, j \mid j \in S(v) \text{ and } (j-1) \in S(v) \,\}\big).$$

Note that PreviousSentence($u \to v$) depends only on the sentences $S(v)$ in the destination state. Although additional ''previous sentence'' counts can be inferred from linkages between figure $k-1$ and source state sentences $S(u)$, to prevent double counting we do not count them on $u \to v$ because they get counted on the transition into $u$. We define the last neighborhood feature, PrevSentAndFig($u \to v$), to be the count of the number of times figure $k$ links with sentence $j$ while the previous figure $k-1$ links with the previous sentence $j-1$:

$$\mathrm{PrevSentAndFig}(u \to v) = \#\big(\{\, j \mid j \in S(v) \text{ and } (j-1) \in S(u) \,\}\big).$$

Unlike NumEdgesCrossed($u \to v$), the values of the three neighborhood features are exact.
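Three of the transition features described above can be computed directly from the sentence sets of the source and destination states, as in this Python sketch. The encoding of the four FigureDegree features as tests for degree 0, 1, 2 and 3 is an assumption (the text only says they are binary tests on the destination state's size), and the function names and example states are hypothetical.

```python
def figure_degree_features(v, max_degree=3):
    """One indicator per possible degree 0..max_degree of the destination
    state; exactly one indicator is 1 (assumed encoding)."""
    return [1 if len(v) == degree else 0 for degree in range(max_degree + 1)]

def previous_sentence(u, v):
    """Number of sentences j in v whose predecessor j - 1 is also in v."""
    return sum(1 for j in v if (j - 1) in v)

def prev_sent_and_fig(u, v):
    """Number of sentences j in v whose predecessor j - 1 is in u,
    the state of the previous figure."""
    return sum(1 for j in v if (j - 1) in u)

# Hypothetical transition: previous figure links with sentences {1, 2},
# current figure links with sentences {2, 3}.
u = frozenset({1, 2})
v = frozenset({2, 3})

degree_features = figure_degree_features(v)
same_figure_count = previous_sentence(u, v)
across_figure_count = prev_sent_and_fig(u, v)
```

Because each function reads only the two states of a single transition, the same weights apply to every transition that shares a feature value, which is how the CRF generalizes across states.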
Prediction and learning. We use standard algorithms for learning weights and predicting linkages [3]. For learning we use gradient ascent to maximize the probability of training set state sequences. For prediction, we compute the posterior distributions $\Pr(\pi_k)$ using the forward and backward dynamic programming passes.

Performance Measures
Recall-precision, F1, and ROC curves. We compute precision (P), recall (R) and false-positive rate (FPR) for the linkage predictions of a set of sentence/figure instances:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN},$$

where TP, TN, FP, FN are the number of true positive, true negative, false positive, and false negative predictions, respectively, and a ''positive'' instance is linked and a ''negative'' instance is not linked. A recall-precision curve plots R vs P, while a receiver operating characteristic (ROC) curve plots FPR vs R. Points on these curves are calculated by varying the threshold on linked score that separates positive predictions from negative predictions. The area under an ROC curve (AROC) ranges from 0.0 to 1.0, where the AROC of a random guess classifier is equal to 0.5. The F1 score of a classifier is the harmonic mean of R and P: F1 = 2RP/(R+P).
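As a worked example of these measures, the following Python sketch computes them from hypothetical confusion counts. The counts and the function name `linkage_metrics` are assumptions, and the function expects at least one predicted positive, one actual positive, and one actual negative.

```python
def linkage_metrics(tp, fp, fn, tn):
    """Precision, recall, false-positive rate and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, fpr, f1

# Hypothetical counts for one set of sentence/figure instances.
precision, recall, fpr, f1 = linkage_metrics(tp=6, fp=2, fn=4, tn=8)
```

Here precision is 0.75, recall is 0.6 and F1 is 2/3, which lies between the two but closer to the smaller value, as expected of a harmonic mean.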