Unsupervised cross-lingual model transfer for named entity recognition with contextualized word representations

Named entity recognition (NER) is a fundamental task in the natural language processing (NLP) community. Supervised neural network models based on contextualized word representations can achieve highly-competitive performance, but they require a large-scale manually-annotated corpus for training. For resource-scarce languages, the construction of such a corpus is always expensive and time-consuming. Thus, unsupervised cross-lingual transfer is a good solution to address the problem. In this work, we investigate unsupervised cross-lingual NER with model transfer based on contextualized word representations, which greatly advances cross-lingual NER performance. We study several model transfer settings of unsupervised cross-lingual NER, including (1) different types of pretrained transformer-based language models as input, (2) exploration strategies for the multilingual contextualized word representations, and (3) multi-source adaption. In particular, we propose an adapter-based word representation method combined with a parameter generation network (PGN) to better capture the relationship between the source and target languages. We conduct experiments on the benchmark CoNLL dataset involving four languages to simulate the cross-lingual setting. Results show that we can obtain highly-competitive performance by cross-lingual model transfer. In particular, our proposed adapter-based PGN model can lead to significant improvements for cross-lingual NER.


Introduction
Named entity recognition (NER) aims to extract named entities and identify their semantic types (e.g., person, organization and location) from text, and is one of the fundamental tasks in natural language processing (NLP) [1]. The task benefits a range of applications, including relation extraction [2], coreference resolution [3] and question answering [4], as the extracted named entities are critical elements for these applications. NER is generally treated as a sequence labeling problem by word-level tagging, where the tags are defined according to the entity boundary information [5]. Fig 1 shows one example of NER modeling. Conditional random field (CRF) models with neural networks have achieved state-of-the-art performance [6,7]. In particular, equipped with contextualized word-level neural representations such as BERT [8], neural NER systems can reach an F-measure above 92% on the benchmark CoNLL03 English dataset, close to a 2-point increase over previous systems [6,9,10].
All these successes are based on the supervised setting, which assumes that a large-scale manually-constructed high-quality corpus is available for model training. However, this is not always the case in practice, especially for resource-scarce languages, where no training corpus exists to learn such a supervised model. According to official statistics, there are over 7,000 languages today, and most of them do not have any annotated corpus for NER.
Model transfer is one mainstream method for unsupervised cross-lingual adaption [11,12,13]: it builds cross-lingual models on language-independent features such as cross-lingual word representations, so that the learned models can be applied to target languages directly. The method has received great attention as it is quite straightforward and easy to follow. Under the neural setting, previous cross-lingual NER studies have mostly focused on multilingual word embeddings [14], or a simple exploration of mBERT in another line of transfer strategies [15,16].
In this work, we present the first comprehensive study of cross-lingual model transfer with contextualized word representations for NER. We mainly focus on mBERT and XLM, two widely-adopted multilingual contextualized word representations based on the transformer network. Our first goal is to compare the two kinds of word representations, and meanwhile investigate fine-tuning and feature-based methods of exploiting mBERT and XLM [8]. Fine-tuning is the standard strategy because of its high performance, whereas the feature-based strategy, which freezes the mBERT or XLM parameters, is much more parameter-efficient. Here, we adopt the adapter mechanism for the feature-based strategy in order to make the two strategies comparable in performance. Finally, we study single-source and multi-source model transfer, and propose a novel model based on a parameter generation network (PGN) [17] to better capture the differences between the source and target languages.
We conduct experiments on the benchmark CoNLL dataset to evaluate our models, which includes four languages: English, Spanish, German and Dutch. The languages are used to simulate resource-scarce languages: only the test dataset is available when one of them is selected as the target language. First, we find that XLM achieves better performance than mBERT under fair comparisons as a whole. Second, the adapter-enhanced feature-based method is a good alternative for model transfer, achieving very competitive performance with far fewer learned parameters. Third, multi-source transfer can help the target language greatly, leading to an averaged improvement of 2.79 points over the best-reported bilingual transfer.
In summary, our major contributions in this article are threefold. (1) We present the first comprehensive work to investigate cross-lingual NER by using model transfer with contextualized word representations, including comparisons between different multilingual contextualized word representations, different exploration methods of the word representations, as well as multi-source transfer. (2) We present the first work to exploit the adapter module and parameter generation network to enhance multilingual contextualized word representations for unsupervised cross-lingual model transfer. (3) We empirically evaluate the model transfer method under various settings for cross-lingual NER. The code and related data are publicly available at https://github.com/qtxcm/UCT-NER under the Apache License 2.0.

Method
Given an input sentence, the goal of NER is to identify all entities with specific named types, such as Person, Organization, and Location. The standard sequence labeling architecture is widely exploited to formalize the task, which transforms entities/non-entities into word-level boundary labels by using the BIO or BIOES schema. Here we adopt the BIOES schema, where each sentential word is labeled as either "O" (a non-entity word), "B-XX" (the beginning word of an entity with type "XX"), "I-XX" (a middle word of an entity with type "XX"), "E-XX" (the end word of an entity with type "XX"), or "S-XX" (a single-word entity with type "XX"). Notice that overlapped or discontinuous entities are not considered in this work.
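As a concrete illustration, the span-to-tag conversion under the BIOES schema can be sketched as follows. The function name and the (start, end, type) span format are our own illustrative choices, not from the paper:

```python
def spans_to_bioes(n_words, spans):
    """Convert entity spans to word-level BIOES tags.

    `spans` is a list of (start, end, etype) tuples with `end` exclusive.
    Overlapped or discontinuous entities are not handled, matching the
    paper's setting.
    """
    tags = ["O"] * n_words
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"        # single-word entity
        else:
            tags[start] = f"B-{etype}"        # beginning word
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"        # middle words
            tags[end - 1] = f"E-{etype}"      # end word
    return tags
```

For instance, a five-word sentence with a one-word person name followed by a three-word organization yields `["S-PER", "O", "B-ORG", "I-ORG", "E-ORG"]`.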

The model
We adopt BERT-BiLSTM-CRF as the main model structure for our NER task, which can achieve state-of-the-art performance for NER [9]. The model consists of four components: (1) word representation, (2) encoding, (3) CRF decoding, and (4) training. In particular, for word representation, we consider both the fine-tuning and feature-based strategies for the external pretrained parameters. Fig 1 shows the overall architecture with an example. Here we introduce the model in detail.
Word representation. The word representation is the key to cross-lingual NER by model transfer, since it serves as the primary bridge to adapt across languages. For a given sentence $\vec{w} = w_1 \cdots w_n$ (n indicates the sentence length), we first convert it into a sequence of hidden vectors by using a pretrained contextualized language model:

$$\vec{e} = e_1 \cdots e_n = \mathrm{PLM}(w_1 \cdots w_n) \quad (1)$$

Here we only exploit pretrained transformer-based language models. The overall networks of these pretrained models are stacked from several standard transformer layers, as shown in Fig 2a. Each transformer submodule [18] is organized as follows:

$$[x_i^1; \cdots; x_i^H] = \mathrm{SPLIT}(x_i)$$
$$\mathrm{head}_i^l = \sum_{j=1}^{n} \alpha_{ij}^l x_j^l, \quad \alpha_{ij}^l = \mathrm{softmax}_j\!\left(\frac{x_i^l \cdot x_j^l}{\sqrt{d_{\mathrm{head}}}}\right), \quad l \in [1, H]$$
$$r_i = \mathrm{CONCAT}(\mathrm{head}_i^1, \cdots, \mathrm{head}_i^H)$$
$$u_i = \mathrm{LAYER\text{-}NORM}(x_i + r_i)$$
$$z_i = \mathrm{LAYER\text{-}NORM}(u_i + \mathrm{FFN}(u_i))$$

where $\vec{x} = x_1 \ldots x_n$ is the input (the word and positional embeddings at the first layer, and the previous layer outputs for the other layers), and $\vec{z} = z_1 \ldots z_n$ is the output. In more detail, SPLIT divides a vector into H components, and then self-attention aggregations are performed for the H components, respectively. The updated hidden vectors after these individual self-attentions are concatenated. The above process is referred to as multi-head self-attention. After the multi-head attention, a layer normalization (LAYER-NORM) is used to regularize the original input $x_i$ and the attention output $r_i$. Finally, a two-layer feed-forward layer (FFN) is executed, followed by another layer normalization, producing the one-layer transformer output $z_i$.
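To make the data flow of the multi-head step concrete, here is a minimal numpy sketch. It deliberately omits the learned query/key/value and output projection matrices of a real transformer layer, so it only illustrates SPLIT, scaled dot-product self-attention per head, and CONCAT:

```python
import numpy as np

def softmax(s):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, H):
    """Sketch of multi-head self-attention over an (n, d) matrix X.

    SPLIT each vector into H heads of size d/H, run scaled dot-product
    attention per head, then CONCAT. Projection matrices are omitted,
    so this illustrates the data flow, not a faithful transformer layer.
    """
    n, d = X.shape
    dh = d // H
    heads = []
    for l in range(H):
        Xl = X[:, l * dh:(l + 1) * dh]            # SPLIT: head l's slice
        A = softmax(Xl @ Xl.T / np.sqrt(dh))      # attention weights alpha
        heads.append(A @ Xl)                      # per-head aggregation
    return np.concatenate(heads, axis=1)          # CONCAT
```

Since each row of the attention matrix sums to one, identical input rows are mapped to themselves, which is a quick sanity check on the implementation.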
In this work, we investigate two multilingual word representations, mBERT and XLM [15,16], which are both pretrained language models based on the transformer network. Although the BiLSTM-based ELMo is also a widely-studied language model, we ignore it here because of its lower performance for cross-lingual model transfer. In addition, note that mBERT and XLM produce subword-level representations, which are inconsistent with our desired word-level outputs, as one full word might be decomposed into several subwords in mBERT and XLM. Here we use averaged pooling over all the covered subwords as the representation of a full word. Thus, we finally obtain a sequence of word-level representations $\vec{e} = e_1 \ldots e_n$ as shown in Eq 1 for NER.
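The averaged subword pooling can be sketched as follows. The `word_ids` mapping (which full word each subword piece belongs to) is assumed to come from the tokenizer, and plain Python lists stand in for the real hidden vectors:

```python
def pool_subwords(subword_vecs, word_ids):
    """Average subword vectors that belong to the same full word.

    subword_vecs: one vector (list of floats) per subword piece
    word_ids: for each subword piece, the index of its full word
    (names are illustrative; tokenizers for mBERT/XLM expose such a
    subword-to-word alignment)
    """
    n_words = max(word_ids) + 1
    dim = len(subword_vecs[0])
    sums = [[0.0] * dim for _ in range(n_words)]
    counts = [0] * n_words
    for vec, w in zip(subword_vecs, word_ids):
        counts[w] += 1
        for d in range(dim):
            sums[w][d] += vec[d]
    return [[s / counts[w] for s in sums[w]] for w in range(n_words)]
```

For example, a word split into two pieces with vectors [1, 2] and [3, 4] is represented by their mean [2, 3].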
Fine-tuning or feature-based. Transformer-based language model pretraining has achieved state-of-the-art results on a wide range of NLP tasks [8]. Such models accept full sentences as inputs, outputting contextualized word representations based on a well-pretrained bidirectional Transformer. Fine-tuning is the standard strategy for exploiting such pretrained language models: the last-layer vectorial output is used for feature extraction, and all the internal parameters are fine-tuned along with our NER objective. This strategy has brought great successes for a number of NLP tasks.
Although fine-tuning can achieve remarkable performances [8], the strategy may suffer from the parameter inefficiency problem, where a newly-trained model would introduce a new copy of mBERT or XLM weights (i.e., consuming 110M parameters). This process may lead to great inconveniences in real scenarios which involve multiple NLP tasks and model ensembles, since each model keeps a different copy of BERT weights. Thus, it is highly meaningful to study the feature-based method which freezes the parameters of mBERT or XLM during training. In this way, we can preserve a shared BERT across different NLP models. Here we follow the above observation, investigating the feature-based method of freezing the internal parameters of the large pretrained language models such as mBERT or XLM.
The preliminary experiments show that direct feature extraction from mBERT or XLM leads to significant decreases compared with fine-tuning. To reduce the gap, we exploit the adapter mechanism [19,20], applying it to the unit transformer layers of mBERT or XLM. The overall process of one adapter can be formalized as follows:

$$h_{\mathrm{mid}} = f(W_{\mathrm{down}} h_{\mathrm{in}} + b_{\mathrm{down}})$$
$$h_{\mathrm{out}} = W_{\mathrm{up}} h_{\mathrm{mid}} + b_{\mathrm{up}} + h_{\mathrm{in}}$$

where $f$ is a nonlinear activation, and $W_{\mathrm{down}}$, $W_{\mathrm{up}}$, $b_{\mathrm{down}}$ and $b_{\mathrm{up}}$ are model parameters, which are much smaller in scale than those of the transformer; the dimension size of $h_{\mathrm{mid}}$ is also smaller than the corresponding transformer dimension. The dimension sizes of $h_{\mathrm{in}}$ and $h_{\mathrm{out}}$ are consistent with the corresponding transformer. Fig 2b shows the differences between the standard transformer and the transformer with adapters. For each transformer layer, we insert the adapters before the two layer normalization layers [20]:

$$u_i = \mathrm{LAYER\text{-}NORM}(x_i + \mathrm{ADAPTER}(r_i))$$
$$z_i = \mathrm{LAYER\text{-}NORM}(u_i + \mathrm{ADAPTER}(\mathrm{FFN}(u_i)))$$

Note that we only tune the weights of the adapter modules during training, keeping all parameters of the pretrained language models fixed.
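A minimal numpy sketch of one adapter module follows. The ReLU activation here is an assumption (the activation choice varies across adapter implementations); the essential points are the low-dimensional bottleneck and the residual connection back to the input:

```python
import numpy as np

def adapter(h_in, W_down, b_down, W_up, b_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a
    residual connection. Only these small matrices would be trained; the
    surrounding transformer weights stay frozen. ReLU is an assumption."""
    h_mid = np.maximum(0.0, h_in @ W_down + b_down)   # down-projection + nonlinearity
    h_out = h_mid @ W_up + b_up + h_in                # up-projection + residual
    return h_out
```

Because of the residual connection, an adapter whose weights are all zero reduces to the identity function, which is why inserting fresh adapters does not initially disturb the frozen pretrained network.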
Encoding. We use a bidirectional LSTM layer to further abstract the final hidden features for our task [21]:

$$\vec{h} = h_1 \ldots h_n = \mathrm{BiLSTM}(\vec{e})$$

where $\vec{h}$ contains the desired features for the next decoding step. Actually, the majority of previous studies show that mBERT or XLM with fine-tuning does not require extra encoding layers, and the output $\vec{e} = e_1 \ldots e_n$ can be directly exploited for decoding with little loss in performance. Here, we retain a BiLSTM encoding since the performance is more stable and slightly better with it. In addition, the BiLSTM encoding is effective for the feature-based methods.
Decoding. CRF has been a standard decoding strategy for sequence labeling [22]. First, a linear feed-forward transformation layer is exploited to calculate the initial label scores. Then, a label transition matrix $T$ is used to model first-order Markov chains. Let $\vec{y} = y_1 \ldots y_n$ be any output label sequence; its score $s(\vec{y} \mid \vec{x})$ is computed by:

$$s(\vec{y} \mid \vec{x}) = \sum_{i=1}^{n} \Big( (W h_i + b)[y_i] + T[y_{i-1}, y_i] \Big) \quad (6)$$

where $W$, $b$ and $T$ are model parameters. The decoding aims to find the highest-scored $\vec{y}$, and here we utilize the Viterbi algorithm to achieve the goal.
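The Viterbi search over this score function can be sketched as follows; per-word emission scores stand in for the linear layer's outputs, and start/stop transition scores are omitted for brevity:

```python
def viterbi(emissions, trans):
    """Find the highest-scoring label sequence under a first-order CRF.

    emissions: n x L per-word label scores (the linear-layer outputs)
    trans: L x L transition score matrix, trans[prev][curr]
    Returns the argmax label sequence via dynamic programming.
    """
    n, L = len(emissions), len(emissions[0])
    score = list(emissions[0])          # best score of any path ending in each label
    back = []                           # backpointers, one row per position > 0
    for i in range(1, n):
        new_score, ptrs = [], []
        for y in range(L):
            cands = [score[yp] + trans[yp][y] for yp in range(L)]
            best = max(range(L), key=lambda yp: cands[yp])
            ptrs.append(best)
            new_score.append(cands[best] + emissions[i][y])
        score, back = new_score, back + [ptrs]
    y = max(range(L), key=lambda l: score[l])
    path = [y]
    for ptrs in reversed(back):         # follow backpointers to recover the path
        y = ptrs[y]
        path.append(y)
    return list(reversed(path))
```

With all transition scores zero the result degenerates to per-word argmax, while a strongly negative transition score can override a high emission score, which is exactly the label-consistency effect the CRF layer adds.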
Training. We exploit the sentence-level cross-entropy objective for training. Given a gold-standard training instance $(\vec{x}, \vec{y})$, we first compute the conditional probability $p(\vec{y} \mid \vec{x})$ based on the score function defined in Eq 6, and then apply a cross-entropy function to obtain the single-instance loss:

$$p(\vec{y} \mid \vec{x}) = \frac{\exp\big(s(\vec{y} \mid \vec{x})\big)}{\sum_{\vec{y}' \in \tilde{Y}} \exp\big(s(\vec{y}' \mid \vec{x})\big)}, \qquad \mathcal{L} = -\log p(\vec{y} \mid \vec{x})$$

where $\tilde{Y}$ denotes all possible candidate predictions for sentence $\vec{x}$.
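For illustration, this objective can be computed exactly on toy inputs with the forward algorithm, which sums the scores over all candidate label sequences without enumerating them. The score convention (per-word emission scores plus first-order transition scores, no start/stop scores) matches the decoding description above:

```python
import math

def crf_log_likelihood(emissions, trans, y):
    """Sentence-level CRF log-likelihood: gold-path score minus the log
    partition over all label sequences, computed by the forward algorithm.
    The training loss is the negative of this value."""
    n, L = len(emissions), len(emissions[0])
    # score of the gold path
    gold = emissions[0][y[0]] + sum(
        trans[y[i - 1]][y[i]] + emissions[i][y[i]] for i in range(1, n))
    # forward algorithm: alpha[yc] = log-sum of all path scores ending in yc
    alpha = list(emissions[0])
    for i in range(1, n):
        alpha = [
            math.log(sum(math.exp(alpha[yp] + trans[yp][yc]) for yp in range(L)))
            + emissions[i][yc]
            for yc in range(L)
        ]
    log_z = math.log(sum(math.exp(a) for a in alpha))
    return gold - log_z
```

A useful sanity check is that the probabilities of all candidate label sequences for a sentence sum to one.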

Cross-lingual model transfer
Based on the above model, we can achieve cross-lingual model transfer by using multilingual word representations. Moreover, since no language-dependent feature is exploited in the NER model, we can use the model for any language, treating all languages without discrimination. Thus, we can exploit a training corpus of any language for a target language. In particular, if only one language is exploited for model training, we refer to it as single-source cross-lingual model transfer, whereas if multiple languages are exploited during training, we characterize it as multi-source cross-lingual model transfer.
In this work, we present a new method that enables more sophisticated representations of input languages. With this method, we can capture the distances between different languages, which better helps both single-source and multi-source cross-lingual model transfer. The key idea is to use an embedding to denote each language, where several languages may have very similar embedding vectors while others differ. To use the embeddings, we exploit a parameter generation network (PGN) to enhance the word representations of our basic model.
We apply the PGN only to the adapter-based word representations, where the parameters inside the pretrained language models are kept frozen. The main reason is that it is difficult to integrate the language embeddings into the fine-tuning method, which would require re-pretraining the language models on a large-scale raw corpus. In contrast, only the adapter parameters need to be tuned for the adapter-based method, which can be combined with PGN and learned from a distant task lightly. Fig 3 shows the concrete network architecture of our adapter-based word representation with PGN. Formally, we collect all adapter parameters as a whole and pack them into a vector $V_{\mathrm{ada}}$; the vector can also be unpacked for the calculation of individual adapters. In the vanilla adapter model, $V_{\mathrm{ada}}$ is shared by all languages with the same values; after PGN is exploited, $V_{\mathrm{ada}}$ is produced dynamically for each language by the following equation:

$$V_{\mathrm{ada},l} = O \, e_l$$

where $V_{\mathrm{ada},l}$ indicates the new adapter parameters for a given language $l$, $O$ is a model parameter, and $e_l$ is the language embedding. Since we do not have any target-language corpus to train our model, the determination of $e_l$ is difficult. Although we can learn the source language embeddings from the training corpus, the target language embedding cannot be obtained this way. Here we adopt a distant supervision method to pretrain the language embeddings, which are then fixed for our final NER model in the cross-lingual supervision. We exploit a parallel corpus to achieve the goal via a binary classification task that judges whether a pair of sentences in different languages are translations of each other. We simply use the adapter-based word representations to derive a sentence vector, and then take the difference of the pairwise vectors for classification. The positive instances are sentence pairs from the parallel corpus, while the negative instances are randomly sampled.
In this way, we obtain the language embeddings as well as initialized adapters in advance, which are then fed into our neural NER model, after which only the parameter $O$ continues to be tuned.
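A toy sketch of the PGN step: a single shared parameter tensor $O$ is contracted with a per-language embedding to produce that language's flat adapter parameter vector. The shapes below are illustrative toy sizes, and the real $V_{\mathrm{ada},l}$ would subsequently be unpacked into the individual adapter weight matrices:

```python
import numpy as np

def generate_adapter_params(O, e_l):
    """Parameter generation network (PGN) sketch: produce the flat adapter
    parameter vector V_ada,l for language l by contracting the shared
    tensor O with the language embedding e_l, i.e. V_ada,l = O e_l."""
    return O @ e_l  # shape: (number of adapter parameters,)

# All languages share O; only the (pretrained, frozen) language embedding
# differs, so languages with similar embeddings get similar adapters.
rng = np.random.default_rng(0)
O = rng.standard_normal((8, 3))   # toy: 8 adapter parameters, 3-dim embeddings
v_en = generate_adapter_params(O, np.array([1.0, 0.0, 0.0]))
```

Because the generation is linear in $e_l$, nearby language embeddings yield nearby adapter parameters, which is how the model shares capacity between related languages.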

Experimental settings
Datasets. We construct a benchmark dataset by merging the CoNLL-2002 (Spanish and Dutch) and CoNLL-2003 (English and German) datasets [23,24]. All corpora of the four languages are annotated with 4 entity types: PER, LOC, ORG, and MISC, and each language-specific dataset is split into training, development, and test sets. Table 1 shows the dataset statistics. We use the BIOES scheme to convert NER into the sequence labeling problem.
Network configurations. We leverage the cased mBERT (https://github.com/google-research/bert/blob/master/multilingual.md) and XLM (https://dl.fbaipublicfiles.com/XLM/mlm_17_1280.pth) for the pretrained word representations. The first pretrained language model includes 12 Transformer blocks, 768 hidden units, and 12 self-attention heads; the second has 16 Transformer blocks, 1280 hidden units, and 16 self-attention heads. When adapters are applied, the dimensions of the middle vectors are all set to 256. The language embedding size is set to 16. For the BiLSTM encoding part, the hidden size is set to 200.
Network training. We implement all our proposed models based on the huggingface Transformers library (https://github.com/huggingface/transformers). We use mini-batch learning to optimize model parameters, where the batch size is set to 32. For the optimizer, we adopt AdamW with a learning rate of 5e-4 [25]. Furthermore, we apply gradient clipping with a max norm of 5. We train each model for several iterations, and use the model with the best performance on the development set as the final model.
Offline learning of language embeddings. In particular, our PGN network requires offline language embeddings, since there is no way to connect the target language embeddings with the source language embeddings in our unsupervised setting. As mentioned in the Model section, we achieve the goal by a binary classification framework that judges whether two sentences in different languages have the same meaning. We exploit Europarl V7 (https://www.statmt.org/europarl/) to obtain the parallel corpora as positive instances, and the negative examples are limited to the sentence set of these parallel corpora. We sample randomly at a ratio of 1:5 for positive vs. negative instances. In addition, given the representations of a paired sentence, we apply the absolute value function over their vector subtraction for feature extraction, which is used for the subsequent binary classification.
Evaluation. We use the entity-level F1-score as the primary evaluation metric, following previous NER studies [5]; precision (P) and recall (R) are also reported. For each experiment, we conduct 5 runs and report the average F1-score.

Main results
Single-source evaluations. Table 2 reports the main results of single-source cross-lingual NER based on different feature-extraction methods and pretrained language models. There are several interesting findings.
First, English and Dutch as the source languages achieve better performances for cross-lingual transfer. The potential reason is complicated, involving at least three aspects: the size of the training corpus, the size and distribution of the training entities, and the distances between the source and target languages. From the results, the training corpus size might be the more critical factor for the transfer.
Second, XLM as the pretrained word representation brings better performance than mBERT as a whole, under the fine-tuning, feature-based, and feature-based (PGN) settings alike.
Third, our feature-based method is better than the standard fine-tuning method. This indicates that the feature-based method is a better alternative in both parameter efficiency and performance. As we will see, the feature-based method only consumes one-thirteenth of the parameters of the fine-tuning method.
Finally, the feature-based model with PGN can capture the language relationship effectively, resulting in significantly better cross-lingual transfer performance as a whole. Notice that the improvement from PGN might not be significant (i.e., improvements larger than 0.5% can be regarded as significant by pairwise t-test with p below 10^-4) for some of the language-pair transfers. However, PGN is still desirable since it is more robust, seldom degrading performance on any of these language pairs. The effectiveness of PGN has been demonstrated in several other tasks, and our observation is consistent with theirs. Therefore, according to the results, we adopt the feature-based XLM model combined with PGN as the preferred choice for cross-lingual NER transfer.
Table 2. Main results of single-source cross-lingual NER, where lavg indicates the averaged performance for each target language, and avg denotes the overall average F-scores of all source-target pairs.
Multi-source evaluations. Further, we perform multi-source cross-lingual transfer experiments. The leave-one-out manner is adopted to select source languages, i.e., all languages except the target one are regarded as source languages for multi-source transfer. Table 3 reports the results of different methods for multi-source cross-lingual NER. We investigate multi-source transfer with the same strategies as the single-source transfer, and the best-reported results of the single-source transfer are used for comparison.
By examining the results of various multi-source cross-lingual transfer models, we can see that for each pretrained language model, the feature-based adapter models outperform the standard fine-tuning method. In addition, PGN can further advance the performance of multi-source cross-lingual transfer in most cases, and the overall averaged performance is boosted by 0.61 points for mBERT and XLM. The observation is similar to that of single-source transfer. Although PGN is not able to give significantly better performance in all scenarios, it is a better choice since it seldom degrades the transfer performance (i.e., only one exception, 72.86 vs. 72.87). By comparing mBERT and XLM, we can see that XLM is better. Both observations are consistent with those of the single-source transfer, which further demonstrates the effectiveness of the feature-based PGN strategy and the exploration of XLM.
Finally, we compare the results of multi-source cross-lingual transfer with single-source transfer. As shown, the multi-source transfer can give significantly better performance in all languages. The overall increments are 77.89 − 74.11 = 3.78 and 79.09 − 77.29 = 1.80 for mBERT and XLM, respectively.
Comparisons with previous work. Here we compare our proposed models with previous state-of-the-art methods. For fair comparisons, we take English as the source language and German, Spanish, and Dutch as the target languages for single-source model transfer. Table 4 reports the results of different methods for single-source and multi-source cross-lingual NER. We list the results of our three models based on XLM. It can be seen that our methods are comparable to the previous state-of-the-art methods. Remarkably, the feature-based model with PGN achieves highly-competitive performances, obtaining the best F1-value on Spanish. Moreover, compared with the best method (Wu et al., 2020) [29], which fully exploits both the unlabeled corpus of the target language and the labeled corpus of the source language, our feature-based models are trained only on the source language and require far fewer model parameters and computational costs for both training and inference, while reaching comparable performances.

Analysis
mBERT vs. XLM. Our main experiments show that XLM leads to better performance than mBERT in most cases. Considering that mBERT is pretrained simply by concatenating the corpora of different languages directly, while XLM additionally leverages parallel corpora of different language pairs, the advantage of XLM over mBERT is reasonable, and it also indicates the great value of these parallel corpora.
In order to understand the advantage of XLM more clearly, we offer one example comparing the word-level alignments of XLM and mBERT on a parallel sentence with named entities inside. Intuitively, word alignments can be a good visual indicator of the transferability of the multilingual word representations. The word alignments are computed straightforwardly by using the cosine scores between the vectorial representations of pairwise sentential words, where the word with the highest cosine score is chosen as the alignment. Here we only perform one-side alignment, as shown in Fig 4. As expected, XLM gives an overall better alignment quality, which could guide NER implicitly. The success of XLM indicates that more sophisticated multilingual word representations with certain supervision can bring more gains for cross-lingual NER transfer.
The advantage of adapter. Our feature-based models exploit adapters to extract features from the pretrained transformer-based language model, where the extracted features are used as the basic word representations. As claimed, the method is much more parameter-efficient than the widely-adopted fine-tuning architecture for the mBERT and XLM language models. Here we analyze the two strategies in detail. The results are shown in Table 5, where both the single-source and multi-source transfers are reported, and XLM is exploited as the input backbone for the discussion. We insert the adapter modules gradually, from the top transformer layers incrementally down to the bottom transformer layers, i.e., the number of transformer layers covered by adapters ranges from zero to all 16 layers. As shown, it is apparent that XLM with adapters is much more parameter-efficient: even when all layers are covered, the model consumes fewer than one-thirteenth of the parameters of XLM fine-tuning. The straightforward feature-based method without adapters performs worse than XLM fine-tuning, which indicates the importance of the adapter. As the number of covered transformer layers increases, the performance is gradually boosted, and after 4 layers, the performance is comparable with XLM fine-tuning. Finally, we select the model applying adapters to all 16 layers since it achieves the best overall performance.
The effect of source-target languages. Further, we examine the effect of model transfer with respect to different source-target language pairs. Fig 5 shows the heatmap matrix for different language pairs, which are computed according to the pretrained language embeddings. As shown, we can find that the English, German and Dutch languages are highly similar, while the Spanish language is slightly away from the three languages. The observation is reasonable since the English, German and Dutch languages all belong to the Germanic branch of the Indo-European language family, and the Spanish language is from the Italic branch.
Further, we investigate how source-target languages influence the PGN model. Intuitively, language pairs with larger differences can benefit more from the PGN module. As shown in Table 2, we find that es-other and other-es transfers indeed obtain larger improvements from PGN, which is consistent with our intuition.
Case study. Finally, we offer a case study to compare the cross-lingual model transfer methods. We focus on the comparisons between single-source transfer without and with PGN, as well as multi-source transfer without and with PGN. All models exploit XLM as the backbone, and the feature-based method is used. As shown in Table 6, we can see that PGN helps to obtain more accurate entity boundaries, while multi-source transfer can recall more named entities.

Related work
Here we introduce the related work from six aspects: NER, cross-lingual NER, model transfer, multi-source cross-lingual transfer, adapter, and PGN.

NER
Early NER systems are based on handcrafted rules, lexicons, and semantic and syntactic features. These systems were followed by statistical machine learning models with careful human feature-engineering [1]. Recently, neural network models have become the dominant methods for NER due to their high performance [6,34,35,36]. Especially, pretrained contextualized word representations such as ELMo and BERT have greatly advanced NER performance [37,38,39,40]. The NER system based on BERT together with CRF decoding can achieve state-of-the-art performance [6,7]. Our basic model is built according to this system, and our work focuses on the unsupervised cross-lingual setting, studying different exploration methods for BERT-like word representations.

Cross-lingual NER
Cross-lingual NER has been a hot topic in the NLP community [13,41]. There are two mainstream categories of cross-lingual NER. The translation-based category aims to build pseudo-labeled data for a target language, and then uses the data to train a target NER model [26,41,42,43]. The methods of this category always require an amount of parallel corpora (or a translation lexicon). In this work, we also exploit parallel corpora to pretrain language embeddings, but our method is highly different from theirs.

Model transfer
Our work follows another line of cross-lingual NER, namely model transfer [13], which is quite simple and straightforward. The category has been studied intensively before for tasks such as parsing [11,12]. Based on language-independent features such as cross-lingual word clusters [44], word embeddings [14] and gazetteers [45], a model trained on the source language can be directly applied to the target language. Here we exploit multilingual contextualized word representations [15,16] and present several substantial improvements for the model transfer. This is the first work to study the unsupervised model transfer for NER comprehensively.

Adapter
The adapter was originally investigated in the computer vision community, aiming to adapt a model to multiple domains [46]. Recently, adapter modules have been applied to NLP for quickly adapting a pretrained transformer model to new domains and tasks without fine-tuning the transformer model [19,20,47]. [50] presents the work most similar to ours, combining PGN and adapters for universal dependency parsing. However, their work does not aim at unsupervised cross-lingual transfer, and they can learn language embeddings directly from the training data. Our work adopts a similar word representation method, but exploits a distant supervision method for language embedding learning.

Conclusion and future work
In this work, we investigated unsupervised cross-lingual adaption for NER based on the model transfer framework. We focused on NER models based on contextualized word representations, since they benefit NER greatly and lead to state-of-the-art systems. We chose two types of multilingual contextualized word representations, mBERT and XLM, comparing their performances under different exploration methods. In addition, we extended single-source model transfer to multi-source transfer, as the latter can bring better performance and is meanwhile more suitable for the real setting. We proposed a novel model with sophisticated neural networks to exploit multilingual word representations. Concretely, we applied the adapter mechanism to enhance the feature-based exploration method of the pretrained transformer language models, and further adopted PGN to better encode the relationship between different languages, including the source and target languages. In order to learn effective language embeddings for PGN, we suggested a novel pretraining strategy using parallel corpora of mixed language pairs. Finally, we conducted experiments to verify the performance of the various models. We selected a benchmark dataset from the CoNLL evaluation tasks for simulation evaluations, which covers four languages: English, Spanish, German and Dutch. The results show that XLM achieves better model transfer performance across different languages and settings, due to its pretraining with cross-lingual supervision, which hints that multilingual word representations with rich supervised pretraining might be more promising for cross-lingual NER. Although fine-tuning can achieve impressive results, the adapter-enhanced feature-based models can be more promising. In particular, the feature-based adapter model with PGN can greatly boost the final performance of unsupervised cross-lingual transfer.
Model transfer still has great potential for unsupervised cross-lingual transfer. For example, the method can be integrated with translation-based methods, where the language embeddings of PGN might be learned more effectively. In addition, more types of powerful multilingual word representations may emerge in the future, which can be exploited as inputs as well. These attempts can serve as future studies.