Combined SVM-CRFs for Biological Named Entity Recognition with Maximal Bidirectional Squeezing

Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area, but their recognition performance could still be improved. Our novel approach is to combine support vector machines (SVMs) and conditional random fields (CRFs), which can complement and facilitate each other. During the hybrid process, we use SVM to separate biological terms from non-biological terms before we use CRFs to determine the types of biological terms, which makes full use of the power of SVM as a binary-class classifier and the data-labeling capacity of CRFs. We then merge the results of SVM and CRFs. To remove any inconsistencies that might result from the merging, we develop a useful algorithm and apply two rules. To ensure biological terms with a maximum length are identified, we propose a maximal bidirectional squeezing approach that finds the longest term. We also add a positive gain to rare events to reinforce their probability and avoid bias, and we gradually extend the context so that more contextual information can be included. We examined the performance of four approaches with the GENIA corpus and the JNLPBA04 data. The combination of SVM and CRFs improved performance: the macro-precision, macro-recall, and macro-F1 of the SVM-CRFs hybrid approach surpassed those of conventional SVM and CRFs. After applying the new algorithms, the macro-F1 reached 91.67% on the GENIA corpus and 84.04% on the JNLPBA04 data.


Introduction
The development of biotechnology is contributing to the rapid growth of the biological literature. For example, PubMed (http://www.ncbi.nlm.nih.gov/pubmed/), a free resource developed and maintained by the National Center for Biotechnology Information (NCBI), contains more than 20 million citations of biomedical literature from MEDLINE, life science journals, and online books. The enormous volume of biological literature available provides a massive data resource for researchers, but it also poses a challenge for mining new information and discovering new knowledge, which has become a very important research subject.
Biological named entity recognition can be regarded as a sequence segmentation problem in which each token in a sequence is assigned either a biological name label (e.g., PROTEIN, DNA, RNA, CELL-LINE, CELL-TYPE), which can be used to identify specified biological terms in text [1][2], or the label OTHER, which indicates that the term is not a predefined type of biological entity. Biological named entity recognition has a key role in biological text mining. It is fundamental for biological information extraction and mining techniques [2][3][4][5][6], such as biological relation extraction [7][8].
However, it is difficult to correctly identify biological terms in text because they mix letters, digits, hyphens, and other characters [6][9][10][11][12]. Arbitrary references to biological terms make automatic recognition by computers even harder. In biological text, biological named entities are usually multi-word phrases, and some have prefixes and/or suffixes, which makes it harder to determine the boundaries of terms. Biological terms are also affected by their context; in some cases, a biological term has different meanings among species. As a result, it is difficult for computers to recognize biological terms automatically.
Identifying biological terms from text is very important in bioinformatics. In this study, we propose a novel approach for biological named entity recognition.

Related Work
Biological term recognition is one of the hottest research areas. Many researchers are interested in mining biomedical terms from text, which is a key step in extracting knowledge, with the overall aim of identifying specific terms such as genes, proteins, diseases, and drugs [1][2].
In general, several methods are used for biological named entity recognition [9,11]: dictionary-based approaches [12], rule-based approaches, and machine learning-based approaches. However, dictionary-based approaches tend to miss undefined terms that are not in the dictionary [12], so their overall results rely heavily on a predefined dictionary. There are an enormous number of biological terms and new terms are constantly emerging, so it is impossible to produce a complete dictionary containing all biomedical terms. Therefore, using a dictionary can provide the highest precision, but it can also miss many terms. In rule-based biological term recognition systems, the rules used for identifying terms are critical, but there are generally no recognition rules that cover all cases [12]. Machine learning-based approaches train models on a training data set, and the models can then identify predefined types of terms.
Machine learning approaches are now a mainstream method of named entity recognition. Many algorithms are widely used, such as Bayesian approaches, Hidden Markov Model (HMM) [10], Support Vector Machines (SVM) [13][14], Conditional Random Fields (CRFs) [15][16], and Maximum Entropy (ME) [17][18]. For example, AbGene developed by Tanabe et al. [19] has an 85.7% precision rate, 66.7% recall rate, and 76.2% F1 rate when using the Bayesian method with manual post-processing. An HMM-based system designed and implemented by Zhou et al. [20] can recognize protein, DNA, RNA, cell-type, and cell-lines from text. Their system has a 72.55% F1 rate. Kazama et al. [21] used SVMs to identify protein, DNA, cell-type, cell-line, and lipid, with a 73.6% F1 rate. Tsai et al. [22] developed a CRF system to find protein mentions, achieving a 78.4% F1 rate. Lin et al. [23] used ME to recognize 23 categories of biological terms with a 72% F1 rate.
However, many methods that perform well in general text do not work as well as expected [20][24][25][26][27][28] because there are many obstacles in biological term recognition. First, a biomedical term may have several different written forms; e.g., epilepsy and falling sickness refer to the same disease, a disorder of the central nervous system characterized by loss of consciousness and convulsions [29]. Second, an entity can be represented using different types; e.g., cancer can represent a disease as well as a genus of crabs in the family Cancridae. Third, abbreviations of terms, especially arbitrarily coined abbreviations, cause even more ambiguity. For example, PC may refer to prostate cancer, phosphatidyl choline, or even a personal computer. Fourth, many biomedical terms are phrases or compound words, or they may have a suffix or prefix. All of these factors make it more difficult for computers to identify biomedical terms automatically [9].
Researchers have applied many methods to improve the performance of machine learning approaches, such as combining different approaches into a hybrid approach, conducting post-processing after machine learning, and adding biomedical domain knowledge to machine learning-based term identification systems. In this paper, we combined all these methods to raise the precision and recall rates.

Results
We used SVM [6], Stanford CRFs [4] and two SVM-CRF hybrid approaches to identify biological terms from text. One SVM-CRF hybrid approach used SVM to separate biological terms from non-biological terms before using Stanford CRFs to identify the type of the biological term, while the other used SVM-CRFs to recognize biological terms before applying our proposed algorithms to improve the prediction results. The parameters for the SVM [6] and Stanford CRFs [4] used in the tests are listed in Table 1 and Table 2.
In the first round, we tested four approaches using data from the GENIA corpus [5]. The F1 score for the SVM-CRFs combined approach with amendment was better than the other three approaches in five classes and it was close to the best in the remaining classes. Its macro-F1 score was greater than those of the other three approaches. The detailed testing results are shown in Table 3. The macro-precision, macro-recall, and macro-F1 rates for the results are shown in Figure 1.
In the second round, we tested four approaches using data from JNLPBA04 [7]. The F1 scores for the two SVM-CRF approaches were better than those of the other approaches. The SVM-CRFs combined approach with amendment had the highest macro-F1 score. The detailed results are shown in Table 4. The macro-precision, macro-recall, and macro-F1 rate results are shown in Figure 2.

Discussion
The results showed that the SVM-CRFs hybrid approaches could identify biological terms from text well, and they performed better than the conventional SVM and CRFs approaches. We found that, in some cases, SVM had higher precision but tended to miss terms and was unstable when trained with a small data set. If there are far fewer positive data than negative data, its optimal hyperplane will be biased toward the negative class. Moreover, when the number of feature dimensions is much higher than the size of the training set, over-fitting is very likely to happen. For example, monocyte macrophage lineage associated surface antigen is a protein term. However, the SVM result is incorrect where the word associated should be tagged as I-G#protein. This error is caused because the number of positive samples of the word "associated" as I-G#protein is much smaller than the number of negative ones.
The results showed that although the performance of CRFs was moderate, they maintained a balance between precision and recall, indicating a stable approach. All the results suggested that combining SVM and CRFs can provide better performance because the hybrid technique is complementary.
The basic idea of our approach was to make full use of the power of SVMs as a binary-class classifier, which facilitates data labeling with CRFs. However, SVM and CRFs are two very different algorithms, so simply combining them could cause inconsistencies. The proposed amendment algorithms were designed to correct any inconsistencies and improve performance.

Materials
There are many benchmark corpora for biological named entity recognition, such as the GENIA data set [5], the JNLPBA04 shared task data set [7], the GENETAG data set [8], and the MEDSTRACT data set [8]. The GENIA corpus was developed for applying natural language processing technology to biological text mining. It contains 2,000 MEDLINE abstracts with more than 400,000 words and almost 100,000 annotations of biological terms [5]. JNLPBA04 [7] comprises several shared tasks for natural language processing in biomedicine and its applications; bio-entity recognition is one of them. The JNLPBA04 data set is often used as a benchmark for evaluating methods.
In the first round of testing, we divided data from the GENIA corpus into two parts, i.e., one part for training and the other for testing.

SVM Terms Identifier
SVM performs well in solving small sample size, nonlinear, and high-dimensional pattern recognition problems, among other machine learning problems [30]. Assume that we are given data (x_i, y_i), where y_i is either 1 or −1, indicating the class of x_i. In our previous experiment [31], we used SVM to identify biological terms from text. We used word, word shape, part-of-speech, and morphology as features for identification, as shown in Table 5. The results [31] were good. SVM uses a line or surface to separate the data [30]. Thus, SVM is suitable for binary classification problems but not multiple-class problems, where there are more than two candidate objective classes [32]. In most cases, named entity recognition is a multiple-class task. As a result, the initial binary SVM is not fit for most named entity recognition tasks. Two main types of approaches can be used to solve multiple-class problems. One is to update the SVM kernel function so that the multiple classification surfaces are merged into one optimization, solving the multiple-class classification in one pass. The alternative is to apply multiple binary classifiers until they finish the job [32].
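The second strategy, reducing a multiple-class task to several binary decisions, can be sketched in a few lines. In this illustration each class has a toy keyword-based scoring function standing in for a trained binary SVM (the keywords, class names, and `make_scorer` helper are hypothetical, chosen only to make the reduction concrete), and the class with the highest decision score wins:

```python
# Sketch of the one-vs-rest reduction: one binary scorer per class
# (toy keyword matchers standing in for trained binary SVMs); the
# class whose scorer returns the highest decision score wins.

def make_scorer(keywords):
    """Return a toy decision function: +1 per matching keyword."""
    def score(token):
        return sum(1 for kw in keywords if kw in token.lower())
    return score

# One hypothetical binary "classifier" per entity class.
scorers = {
    "protein": make_scorer(["kinase", "antigen", "receptor"]),
    "dna": make_scorer(["promoter", "gene"]),
    "cell_line": make_scorer(["cell", "line"]),
}

def one_vs_rest_predict(token):
    """Apply every binary scorer and pick the class with the top score."""
    best_class, best_score = "other", 0
    for cls, score_fn in scorers.items():
        s = score_fn(token)
        if s > best_score:
            best_class, best_score = cls, s
    return best_class

print(one_vs_rest_predict("tyrosine-kinase"))  # protein
```

A real system would replace each scorer with a trained SVM decision function; the argmax-over-classes merge step is the same.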

CRFs Terms Identifier
CRFs are often used for the labeling or parsing of sequential data, such as natural language text or biological sequences [33]. CRFs work well in named entity recognition tasks. Many features can be used in CRFs. For example, term appearance (e.g., capitalization, affixes, etc.) and orthographic features (e.g., alphanumeric characters, dashes, Roman numeral characters, etc.) are used frequently.
However, CRFs have many drawbacks. First, CRFs use a limited-size context rather than the whole text because of computational limitations, which restricts the contextual information available. Second, splitting the whole text into small pieces of context generally severs the inherent relationships among them, and simply recombining these pieces cannot reproduce the original context because those relationships are lost during splitting. For example, suppose a CRF biological term identifier uses a two-word context. The whole text would be split into many two-word context pieces. As a result, the same term in different places in the text could be tagged differently because of variation in the context. By contrast, SVM deals with the whole text, so it does not have such restrictions. Third, CRFs are affected by the data distribution. To achieve better results, the data should have an exponential distribution. However, biological terms in texts generally do not meet this prerequisite.
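The appearance and orthographic features above, together with a limited two-word context window, can be made concrete with a small feature extractor. This is only an illustrative sketch: the feature names and window size are assumptions, not the exact features of any particular CRF implementation:

```python
import re

def token_features(tokens, i, window=2):
    """Collect appearance/orthographic features for tokens[i], plus the
    surrounding words inside a fixed context window (the limited
    context a CRF typically sees)."""
    w = tokens[i]
    feats = {
        "word": w.lower(),
        "is_capitalized": w[:1].isupper(),      # capitalization
        "has_digit": any(c.isdigit() for c in w),
        "has_dash": "-" in w,                   # dashes are common in bio terms
        "is_roman": bool(re.fullmatch(r"[IVXLCDM]+", w)),
        "suffix3": w[-3:].lower(),              # affix feature
    }
    # Neighboring words within the window; out-of-range slots are padded.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
    return feats

feats = token_features(["IL-2", "gene", "expression"], 0)
print(feats["has_dash"], feats["word[+1]"])  # True gene
```

Feature dictionaries of this shape are what CRF toolkits typically consume, one dictionary per token.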

SVM-CRFs Combined Biological Name Entity Recognition
One of the new research areas in machine learning is combining useful algorithms to achieve better, or smoother and more stable, performance. SVM and CRFs are two conventional algorithms that can handle named entity recognition tasks well. As stated earlier, the feature context used by SVM is global, so it does not have the same constraints as CRFs. However, SVM is best suited to binary-class tasks and does not perform well on multiple-class tasks. CRFs generally require more computational time and space than SVMs, but despite their drawbacks they are very good at sequential data tagging, which is the typical problem in named entity recognition. Thus, we combined SVM and CRFs because they can complement and facilitate each other. In our approach, biological named entity recognition was regarded as a two-step task. The first step was to determine whether a candidate term was a biological one; if it was, the second step was to determine its entity class. The first step was a binary classification task where the result was either yes or no, so we could fully use SVM to complete it. We then used CRFs to infer the type of biological term. Finally, we merged the results returned by SVM and CRFs and performed an amendment process.
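The two-step scheme can be summarized as the following sketch, in which `svm_is_bio` and `crf_type` are hypothetical stand-ins for the trained SVM and CRF models (the toy lexicon and tag names are invented for illustration):

```python
# Sketch of the two-step SVM-CRFs scheme. svm_is_bio and crf_type are
# hypothetical stand-ins for the trained models: step 1 is a binary
# bio/non-bio decision, step 2 labels only the words step 1 kept.

BIO_WORDS = {"notch1", "ankyrin", "repeat", "region"}   # toy lexicon
TYPES = {"notch1": "B-protein", "ankyrin": "I-protein",
         "repeat": "I-protein", "region": "I-protein"}

def svm_is_bio(word):
    """Step 1 (binary 'SVM'): is this word part of a biological term?"""
    return word.lower() in BIO_WORDS

def crf_type(word):
    """Step 2 ('CRF'): assign a typed BIO tag to a biological word."""
    return TYPES.get(word.lower(), "O")

def tag_sentence(words):
    """Merge step: words rejected in step 1 get O, the rest get CRF types."""
    return [crf_type(w) if svm_is_bio(w) else "O" for w in words]

print(tag_sentence(["The", "NOTCH1", "ankyrin", "repeat", "region"]))
# ['O', 'B-protein', 'I-protein', 'I-protein', 'I-protein']
```

In the real pipeline, the merged sequence then passes through the amendment process described in the next sections.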

Inconsistency Removal
In this paper, we used a BIO pattern for the resulting tags: a tag starting with the character B marks the beginning of a term, a tag starting with the character I marks a word inside a term, and the tag O marks a word outside any biological term. For example, the words NOTCH1, ankyrin, repeat, and region, tagged B, I, I, and I, can be composed into a complete protein term: NOTCH1 ankyrin repeat region. Given the statement above, we propose a phased approach (Algorithm 1) for determining whether a term is a biological term. The input was the word set of all terms. The output was words with the tag Bio, showing that the word was part of a biological term, or the tag O, showing that the word was not. Words tagged with Bio are further processed by CRFs to determine their biological classes.
However, SVM and CRFs are two different algorithms, and simply merging the results returned by SVM and CRFs could cause inconsistencies. For example, the term CsA treated cells is a cell line mention, so its correct tags should be B-G#cell_line, I-G#cell_line, and I-G#cell_line. Therefore, we needed to amend any inconsistencies to improve the results. Before the amendment, we determined which terms were inconsistent using the following two rules:
Rule 1: If the precursor and the successor of a word are both middle words of a biological term, the word should also be a middle word of the term.
Rule 2: A term begins with a word tagged with a start tag.
Rule 1 and Rule 2 removed any inconsistencies caused by shifts in context. We used Algorithm 2 to carry out the term consistency analysis.
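As a rough illustration (not the paper's Algorithm 2 itself), the two rules can be applied directly to a BIO tag sequence; the simplified tag names below are assumptions for the sketch:

```python
def amend_tags(tags):
    """Apply Rule 1 and Rule 2 to a BIO tag sequence (simplified tags
    such as 'B-protein' / 'I-protein' / 'O')."""
    tags = list(tags)
    # Rule 1: if both neighbors are middle (I-) words of the same term,
    # the word in between must be a middle word too.
    for i in range(1, len(tags) - 1):
        prev, nxt = tags[i - 1], tags[i + 1]
        if prev.startswith("I-") and nxt.startswith("I-") and prev == nxt:
            tags[i] = prev
    # Rule 2: a term must begin with a start (B-) tag; promote a
    # dangling I- tag that has no term before it.
    for i, t in enumerate(tags):
        if t.startswith("I-") and (i == 0 or tags[i - 1] == "O"):
            tags[i] = "B-" + t[2:]
    return tags

print(amend_tags(["B-protein", "I-protein", "O", "I-protein"]))
# ['B-protein', 'I-protein', 'I-protein', 'I-protein']
```

Here the O between two I-protein tags violates Rule 1 and is repaired; a sequence starting with an I- tag would be repaired under Rule 2.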

Term Length Maximizing
Using Rule 1 and Rule 2, we can identify and eliminate inconsistencies. In the example, however, the prediction results for the term CsA treated cells,

CsA     B-G#cell_line
treated O
cells   O

will be treated as correct, although the results are not exactly the best fit. Thus, we propose a new rule to address this type of inconsistency.
Rule 3: A biological term is expected to be as long as possible.
According to Rule 3, biological terms should be as long as possible. Using our approach, we extend a term to the left and right to validate whether the extended terms are biological terms. Thus, given word_0 word_1, ..., word_{n−1} word_n, if word_1, ..., word_{n−1} is tagged as a biological term, we have to check: whether word_0 word_1, ..., word_{n−1} is a biological term; whether word_1, ..., word_{n−1} word_n is a biological term; and whether word_0 word_1, ..., word_{n−1} word_n is a biological term. If any of the extended terms is in a biological term list, it is definitely a biological term. However, it is impossible to produce a complete biological term dictionary. Therefore, we need to make some deductions to predict the tags of the extended words.
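Under the idealized assumption that a (necessarily incomplete) term list is available, the left/right extension check reads as follows; `term_list` and the sample phrases are hypothetical, used only for illustration:

```python
def maximize_term(words, start, end, term_list):
    """Greedily extend words[start:end] (an already recognized term) to
    the left and right while the extended phrase is still in term_list,
    so the longest listed term is returned (Rule 3)."""
    while start > 0 and " ".join(words[start - 1:end]) in term_list:
        start -= 1          # extend one word to the left
    while end < len(words) and " ".join(words[start:end + 1]) in term_list:
        end += 1            # extend one word to the right
    return " ".join(words[start:end])

# Hypothetical term list containing the longer protein name.
terms = {"NOTCH1 ankyrin repeat", "NOTCH1 ankyrin repeat region"}
words = ["the", "NOTCH1", "ankyrin", "repeat", "region", "is"]
print(maximize_term(words, 2, 4, terms))  # NOTCH1 ankyrin repeat region
```

When the dictionary lookup fails, the probabilistic squeezing described next takes over.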
We used a maximal forward and backward probability squeezing approach to extend the term. The maximal forward probability approach determines each forward output probability of state t on the basis of state t−1, while the maximal backward probability determines each backward output probability of state t on the basis of state t+1 [34]. Our approach identifies the output with the maximal product of the forward probability and the backward probability.
We assume an output sequence O_1, ..., O_T (T > t) and a hidden state sequence X_1, ..., X_T (T > t). Let a_ij be the transfer probability from state t−1 to state t, and let α_{t−1} be the probability of observing all of the given data up to state t−1. At state t−1, given the output sequence O_1, ..., O_T and a hidden state X_{t−1}, we can find the forward output using the following equations [34].
Let b_kj be the output probability from state t to state t+1, and let β_{t+1} be the probability of observing all future data from state t+1 to state T. At state t+1, given the output sequence O_1, ..., O_T and hidden state X_{t+1}, we can conduct inference and find the backward output using the following equations [34].
The final result maximizes the product of the results returned by forward inference and backward inference, as shown in the following equation. An illustration of maximal forward and backward probability squeezing is shown in Figure 3.
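On a toy two-state HMM, the forward and backward recursions and the product maximization for an intermediate state can be sketched as follows; all transition and emission numbers are made up for illustration, not learned from data:

```python
# Toy forward-backward computation on a two-state HMM: score each
# candidate state i at an intermediate position t by the product of
# the forward probability alpha_t(i) and the backward probability
# beta_t(i), and pick the state with the maximal product.

states = [0, 1]
pi = [0.6, 0.4]                       # initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]          # transition probabilities a_ij
B = [[0.9, 0.1], [0.2, 0.8]]          # emission probabilities b_i(o)

def forward(obs):
    """alpha[t][i] = P(O_1..O_t, X_t = i)."""
    alpha = [[pi[i] * B[i][obs[0]] for i in states]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] for j in states) * B[i][o]
                      for i in states])
    return alpha

def backward(obs):
    """beta[t][i] = P(O_{t+1}..O_T | X_t = i)."""
    beta = [[1.0, 1.0]]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in states)
                        for i in states])
    return beta

def best_state(obs, t):
    """State maximizing the forward x backward product at position t."""
    alpha, beta = forward(obs), backward(obs)
    return max(states, key=lambda i: alpha[t][i] * beta[t][i])

print(best_state([0, 1, 0], 1))  # 1
```

A useful sanity check is that sum_i alpha[t][i] * beta[t][i] equals the total sequence likelihood for every position t.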
The maximal bidirectional probability squeezing method, which uses the forward probability and backward probability to predict the outputs of intermediate states, tends to be biased when dealing with rare states. Thus, we add a positive gain to rare event cases to reinforce their probability and avoid bias, as shown in Algorithm 3. We also keep the context window as large as possible, so the output has the maximal positive gain, as shown in Algorithm 4. Algorithm 4 performs the maximal bidirectional probability squeezing, using the forward and backward probabilities to predict the output while maintaining a maximal context window.
When we use Rule 3 to maximize the term length, we gradually extend the context window size. We initially set a small context window and enlarge it iteratively until the predicted tag result is correct according to Algorithm 2, or until the whole output sequence has been treated without finding a correct result. The amendment of the output sequence in various contexts is performed using Algorithm 5.

Performance Evaluation
We evaluate the results in terms of precision, recall rate, and F1 rate, which are given by the following equations [3]:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 × Precision × Recall) / (Precision + Recall)
For example, when we identify a protein term, true positive, false positive, true negative, and false negative are defined as follows:
True positive: a protein term correctly identified as a protein.
False positive: a non-protein term incorrectly identified as a protein.
True negative: a non-protein term correctly identified as a non-protein.
False negative: a protein term incorrectly identified as a non-protein.
We also used macro-precision, macro-recall, and macro-F1 to evaluate the overall performance of the identifiers. They are defined as the unweighted means of the per-class precision, recall, and F1 values, respectively [3].
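Assuming the standard definitions (per-class precision, recall, and F1 from TP/FP/FN counts, and macro metrics as unweighted means over classes; other macro-F1 conventions exist), the evaluation can be sketched as:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics: precision = TP/(TP+FP), recall = TP/(TP+FN),
    F1 = harmonic mean of precision and recall."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_metrics(per_class_counts):
    """Macro metrics as unweighted means of the per-class values
    (an assumption here; conventions for macro-F1 vary)."""
    scores = [precision_recall_f1(*c) for c in per_class_counts]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# Two hypothetical entity classes given as (TP, FP, FN) counts.
counts = [(8, 2, 2), (6, 4, 2)]
print(macro_metrics(counts))
```

With these toy counts, class one scores 0.8 on all three metrics, class two scores lower, and the macro values average the two classes equally regardless of class size.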

Conclusions
The vast biological literature provides a highly reliable information source for biological research. Mining information and finding new knowledge is a very important new subject, and the identification of biological terms is fundamental to it. We propose a novel machine learning approach for biological named entity recognition. This approach used an SVM to determine whether a term is a biological term, before CRFs were used to infer the type of the biological term. We then judged whether the merged result was consistent in the new global context and applied an amendment approach that used maximal bidirectional squeezing with positive gain in an adaptive context algorithm to correct inconsistent terms. The results showed that our approach could achieve biological named entity recognition and performed better than CRFs or SVM alone.