Improve word embedding using both writing and pronunciation

Text representation maps text into a vector space for subsequent numerical computation and processing tasks, and word embedding is an important component of text representation. Most existing word embedding models focus on writing and use context, weight, dependency, morphology, etc., to optimize training. However, from a linguistic point of view, spoken language is a more direct expression of semantics; writing has meaning only as a recording of spoken language. Therefore, this paper proposes the pronunciation-enhanced word embedding model (PWE), which integrates speech information into training so that both speech and writing contribute to the learned meaning. This paper uses Chinese, English and Spanish as examples and presents several models that integrate word pronunciation characteristics into word embedding. Word similarity and text classification experiments show that the PWE outperforms the baseline models that do not include speech information. Because language is a storehouse of sound-images, the PWE can be applied to most languages.


Introduction
Word representation plays a very important role in natural language processing (NLP). The key issue is how to obtain word semantics. The initial word representation, which simply digitizes the word, cannot capture its semantics. Subsequent models that can capture semantics benefit from the distributional hypothesis of Harris [1], which assumes that words with similar contexts have similar meanings. The most efficient word embedding models are based on this concept.
Neural networks are important machine learning models that show superiority in many areas. The application of neural networks to word embedding models was proposed by Bengio et al. as early as 2003 [2]. Since then, neural network language modeling has gained attention in word representation and is now a popular technique [3][4][5]. However, the computation involved in neural networks is intensive. Mikolov et al. [6] proposed the continuous bag-of-words model (CBOW) and the Skip-gram model, which are highly efficient and can be trained on large-scale corpora.
To improve the quality of word embedding, researchers have focused on morphology, which refers to the elements that compose a word, such as prefixes and suffixes in English or the characters that make up a word in Chinese. Various researchers [7][8][9] have incorporated morphology into word embedding training. As such, word embedding models have been developed and have shown superior performance in a variety of NLP tasks, including dialog systems [10], sentiment analyses [11], machine translation [12], and text classification [13].
However, most existing models have focused on the writing itself, ignoring the fact that spoken language expresses the meaning directly, whereas writing is simply a way to record speech. One of the basic principles of modern linguistic theory considers that only spoken words truly reflect the concepts and that writing is simply a record of the spoken language, analogous to a phonograph [14,15].
In speech processing systems, Bengio and Heigold used the speech signal directly [16]: they projected the signal and words into an embedding space in which words that sound alike are nearby in the Euclidean sense. Kamper [17] and He [18] built on this work. Levin [19] applied acoustic segment embedding to zero-resource query-by-example keyword search. These acoustic embeddings, which mainly capture phonetic structure, are different from word embeddings.
Based on these observations, this paper proposes the pronunciation-enhanced word embedding model (PWE), which incorporates speech information into the model. Pinyin and phonetic symbols are both direct descriptions of word pronunciation, and both capture aspects of speech that are not present in the writing system. Therefore, this paper incorporates speech information by adding phonetic symbols or pinyin to the model. The PWE is highly scalable from two perspectives. First, the PWE can easily be combined with existing models. Second, the PWE can be applied to most languages. This paper presents several methods for combining words and speech information for Chinese, English and Spanish to construct the PWE. The PWE outperforms a baseline model in word similarity and text classification experiments. In addition, by revealing the semantic correlation between word embedding and pronunciation embedding, this paper finds that the pronunciation embedding captures semantics and that the word embedding contains speech information. These results confirm that including pronunciation improves the quality of word embedding.

Word2vec and related models
Word2vec is an efficient word embedding model that uses a neural network, as proposed by Mikolov et al. [6], and includes the CBOW and the Skip-gram model. The CBOW predicts a target word from its context, and the Skip-gram model uses a word to predict its context. Context is acquired by a sliding window. Given a word sequence $D = \{x_1, x_2, \ldots, x_N\}$, the CBOW maximizes the following average log probability:
$$L(D) = \frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{i-j}, \ldots, x_{i+j}) \qquad (1)$$
whereas the objective of the Skip-gram model is denoted by
$$L(D) = \frac{1}{N}\sum_{i=1}^{N} \sum_{-j \le c \le j,\, c \ne 0} \log p(x_{i+c} \mid x_i) \qquad (2)$$
Here, $j$ is the context window size. Word2vec uses the following softmax function to calculate the probability:
$$p(x_a \mid x_b) = \frac{\exp\left({v'_{x_a}}^{\top} v_{x_b}\right)}{\sum_{x \in W} \exp\left({v'_{x}}^{\top} v_{x_b}\right)}$$
where $W$ denotes the dictionary, and $v_x$ and $v'_x$ are the input and output word embeddings of word $x$, respectively. Because the training corpus is large, hierarchical softmax and negative sampling are used to improve training efficiency [20].
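The sliding window, the softmax and the CBOW objective described above can be illustrated with a small sketch. This is not the word2vec implementation: it uses a full softmax rather than hierarchical softmax or negative sampling, it assumes that the CBOW input is the mean of the context input embeddings, and the function names (`context_windows`, `softmax_prob`, `cbow_objective`) are hypothetical.

```python
import math

def context_windows(tokens, j):
    """Yield (target, context) pairs with up to j words on each side."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - j):i]
        right = tokens[i + 1:i + 1 + j]
        yield target, left + right

def softmax_prob(target, input_vec, out_emb):
    """Full-softmax probability of `target` given an input vector."""
    scores = {w: sum(a * b for a, b in zip(v, input_vec))
              for w, v in out_emb.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[target] - m) / z

def cbow_objective(tokens, j, in_emb, out_emb):
    """Average log probability of each word given its context.

    Assumes at least two tokens and j >= 1, so no context is empty.
    """
    total = 0.0
    for target, context in context_windows(tokens, j):
        n = len(context)
        # Assumption: the context is represented by the mean input embedding.
        hidden = [sum(col) / n for col in zip(*(in_emb[w] for w in context))]
        total += math.log(softmax_prob(target, hidden, out_emb))
    return total / len(tokens)
```

Training would adjust `in_emb` and `out_emb` to increase the value returned by `cbow_objective`; the Skip-gram objective differs only in that each word is used to score every word in its window.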
Subsequently, many models have been proposed to improve word2vec. Le and Mikolov [21] modified word2vec to represent sentences and documents. Qiu et al. [22] considered proximity and ambiguity to improve word2vec. Levy and Goldberg [23] generalized the Skip-gram model, moving the focus from linear bag-of-words contexts to arbitrary word contexts.

Improvement from morphology
Several researchers have added morphology to word embedding models to improve the quality of word embedding. Botha and Blunsom [7] assumed that word vectors comprise a linear function of arbitrary sub-elements of the word, e.g., surface form, stem, affixes, and other latent information. For example, the word "unbelievable" consists of "un", "believ" and "able." Xu and Liu [8] observed that morphemes have meaning. For example, words ending with the suffix "able" carry the meaning of "capable." Therefore, morphemes are replaced by their meanings in the model.
For languages such as Chinese, the characters in a word also contain rich semantics. In Chinese, the meaning of the word "教室" (classroom) can be extracted from its two characters, "教" (teach) and "室" (room). Many similar words exist in Chinese. Therefore, Chen et al. [9] proposed a character-enhanced word embedding model (CWE) that integrates characters into training to jointly learn the character and word embeddings. Suppose that the character sequence of word $x_t$ is $\{c_1, c_2, \ldots, c_k\}$, where $c_j$ denotes the $j$th character of $x_t$. The modified word embedding is defined as
$$\hat{v}_{x_t} = \frac{1}{2}\left(v_{x_t} + \frac{1}{k}\sum_{j=1}^{k} v_{c_j}\right)$$
where $v_{x_t}$ is the original word embedding, $k$ is the number of characters, and $v_{c_j}$ represents the $j$th character. In addition, the CWE proposes several methods to solve character ambiguities, including position-based, cluster-based and nonparametric cluster-based character embedding. The similarity-based character-enhanced word embedding (SCWE) improves the CWE by including the semantic contributions of characters to a word [24].
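The character-enhanced composition can be sketched in a few lines. The sketch assumes the additive form of the CWE, in which the word embedding is averaged with the mean of its character embeddings; the function name `cwe_compose` and the toy vectors are hypothetical, and the ambiguity-handling variants are not shown.

```python
def cwe_compose(word_vec, char_vecs):
    """Average a word embedding with the mean of its character embeddings.

    Assumption: equal weighting of the word term and the character term.
    """
    k = len(char_vecs)
    char_mean = [sum(col) / k for col in zip(*char_vecs)]
    return [0.5 * (w + c) for w, c in zip(word_vec, char_mean)]

# Toy example for a two-character word such as "教室" (classroom).
v_word = [1.0, 2.0]
v_chars = [[1.0, 0.0], [3.0, 2.0]]  # stand-ins for the two character embeddings
print(cwe_compose(v_word, v_chars))
```

Because the character embeddings are shared by every word that contains them, each update to a word also moves the embeddings of its characters.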

Writing is symbolized language
Linguists have long discussed the relationship between writing and language. At the beginning of the 20th century, Saussure [14] proposed that language is a storehouse of sound-images and that writing is a tangible form of those images. Language is primarily an auditory symbol system, whereas the written forms of language are secondary symbols (symbols of symbols) that represent the spoken symbols [25]. For a linguist, writing is, except for certain matters of detail, merely an external preservation device, similar to a phonograph, which stores observations about features of historical speech [15]. Thus, the view that writing is a symbolization of language symbols is a basic principle of linguistics.
Chinese, English and Spanish represent three different kinds of language in terms of the relationship between words and their pronunciations. In Chinese, there is no direct connection between the written word and its pronunciation. In English, because the pronunciation rules are complex, the pronunciation can only be guessed from the spelling of a word. In Spanish, almost every character has a fixed pronunciation, so the pronunciation of a word can be obtained directly from its spelling. Therefore, this paper uses Chinese, English and Spanish as examples to create the PWE. In addition, pinyin is used to represent the pronunciation of Chinese characters. For an introduction to pinyin, please refer to https://en.wikipedia.org/wiki/Pinyin.

The basic model
The PWE reflects the linguistic theory of the relationship between writing and language: spoken language is the direct expression of semantics, and writing is a record of spoken language. Therefore, the key addition of the PWE is the integration of word pronunciation into the word embedding model. Suppose $p$ is the pronunciation of word $w$, vector $v_w$ represents the word $w$, and vector $v_p$ represents the pronunciation $p$. Then, the modified word embedding $\hat{v}_w$ is obtained by combining $v_w$ with $v_p$. This is the basic idea for obtaining a modified word embedding, and the details can change according to specific circumstances. After acquiring $\hat{v}_w$, other existing word embedding models can be used to train the pronunciation and word embeddings. In the following sections, this paper introduces concrete PWEs based on specific word embedding models.
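One form consistent with this description is written out below as a sketch. It assumes the same equal-weight averaging that the CWE applies to its word and character terms; the factor $\frac{1}{2}$ is an assumption made for illustration, and an unweighted sum $v_w + v_p$ would fit the description equally well.

```latex
\hat{v}_w = \frac{1}{2}\left(v_w + v_p\right)
```

Under this form, two words that share a pronunciation $p$ also share the term $v_p$, so every training update to one of them moves the pronunciation component of the other.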

The PWE based on word2vec
Word2vec includes two models. This section uses the CBOW as an example to introduce the PWE based on word2vec.
The CBOW and PWE based on the CBOW (denoted as CBOW+P) are shown in Fig 1. The difference between the CBOW+P and the CBOW lies in the different methods used to construct the word embedding. The CBOW+P adds pronunciation embedding to the word embedding. Both models predict a target word from context, and the CBOW+P does not attempt to predict the target pronunciation. Therefore, the CBOW+P and the CBOW share the same objective as shown previously in Eq (1).
Let $W$ denote the dictionary and $P$ denote the pronunciation set. A word $w_i \in W$ is represented by the word embedding $v_{w_i}$, and its pronunciation $p_i \in P$ is represented by the pronunciation embedding $v_{p_i}$. Assume that $p_t$ is the pronunciation of word $w_t$. Then, the modified word embedding $\hat{v}_{w_t}$ of the CBOW+P is formed by adding the pronunciation embedding $v_{p_t}$ to the word embedding $v_{w_t}$, following the basic model. After acquiring the modified word embedding, the CBOW+P can be trained similarly to the CBOW; however, the CBOW+P jointly learns both the word embedding and the pronunciation embedding. This study used hierarchical softmax to improve the training efficiency.
The PWE based on the Skip-gram model (denoted as the Skip-gram+P) also uses Eq (3) to modify the word embedding, and its objective is Eq (2).
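The input side of the CBOW+P can be sketched as follows. The sketch assumes that each context word's modified embedding is the equal-weight average of its word and pronunciation embeddings and that the context is represented by the mean of those modified embeddings. The weighting, the lookup table `pron_of`, the toy pronunciation label "si" and all function names are hypothetical.

```python
def modified_embedding(word, word_emb, pron_emb, pron_of):
    """Combine a word embedding with the embedding of its pronunciation."""
    v_w = word_emb[word]
    v_p = pron_emb[pron_of[word]]
    # Assumption: equal weighting of the word and pronunciation terms.
    return [0.5 * (a + b) for a, b in zip(v_w, v_p)]

def cbow_p_hidden(context, word_emb, pron_emb, pron_of):
    """Mean of the modified embeddings of the context words."""
    vecs = [modified_embedding(w, word_emb, pron_emb, pron_of) for w in context]
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

# Two homophones share one pronunciation embedding.
word_emb = {"sea": [1.0, 0.0], "see": [0.0, 1.0]}
pron_emb = {"si": [1.0, 1.0]}
pron_of = {"sea": "si", "see": "si"}
print(cbow_p_hidden(["sea", "see"], word_emb, pron_emb, pron_of))
```

The hidden vector is then scored against the output embeddings exactly as in the CBOW, which is why the objective is unchanged; gradients flow back into both `word_emb` and `pron_emb`.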

The PWE based on the CWE
The CWE assumes that in languages such as Chinese, a word is usually composed of several characters and contains rich internal information [9]. Therefore, the CWE integrates character information into the word embedding model. This section describes a basic construction method for the PWE based on the CWE, and the following section presents another construction method for the Chinese language example.
In this basic method, the word embedding consists of three parts: the word, the word's characters and the word's pronunciation. Let $W$ denote the dictionary, $C$ denote the set of characters and $P$ denote the set of pronunciations. Suppose that the character sequence of word $w_t \in W$ is $\{c_1, c_2, \ldots, c_k\}$ and that $p_t \in P$ is the pronunciation of $w_t$. The modified word embedding combines three terms: $v_{w_t}$, the original word embedding; $v_{c_j}$, the embedding that corresponds to the $j$th character in the word; and $v_{p_t}$, the embedding that corresponds to the pronunciation $p_t$. After acquiring the modified word embedding, the PWE based on the CWE can be trained similarly to the CWE; however, the PWE based on the CWE jointly learns the word embedding, the character embedding and the pronunciation embedding.

The PWE in different languages
This paper implemented the PWE based on word2vec for Chinese, English and Spanish. In addition, this study implemented the PWE based on the CWE for Chinese.
For word2vec, in Chinese, let $P$ denote the pinyin set, where pinyin $p_j \in P$ is represented by the embedding $v_{p_j}$. Let $W$ denote the dictionary, where word $w_t \in W$ is represented by the word embedding $v_{w_t}$. Suppose word $w_t$ includes $k$ characters, and the corresponding pinyin sequence is $D = \{p_1, p_2, \ldots, p_k\}$. The modified word embedding $\hat{v}_{w_t}$ adds the pinyin embeddings $v_{p_1}, \ldots, v_{p_k}$ to the word embedding $v_{w_t}$. For English and Spanish, suppose $p_t$ is the pronunciation of word $w_t$; the modified word embedding adds the single pronunciation embedding $v_{p_t}$ to $v_{w_t}$.
This paper proposes two methods for constructing the PWE based on the CWE for the Chinese language. The first method is to directly add the pronunciation vector to the CWE, as described in the previous section; it is denoted as the CWE+P1. The modified word embedding consists of word, character and pronunciation embeddings. Assume that the character sequence of word $w_t \in W$ is $D = \{c_1, c_2, \ldots, c_k\}$, and its corresponding pinyin sequence is $H = \{p_1, p_2, \ldots, p_k\}$, where $p_i$ is the pinyin of character $c_i$. The modified word embedding is built from $v_{w_t}$, the original word embedding, and from $v_{c_j}$ and $v_{p_j}$, the embeddings of character $c_j$ and pinyin $p_j$, respectively. For example, the word "教室" (classroom) includes two characters, "教" (teach) and "室" (room). The pinyin of "教" (teach) is "jiao4" and the pinyin of "室" (room) is "shi4." The modified word embedding of "教室" therefore combines the word embedding of "教室", the character embeddings of "教" and "室", and the pinyin embeddings of "jiao4" and "shi4."
The second method creates an embedding for each pinyin of the character. Based on the specific pronunciation of the character in the word, this method adds the corresponding embedding to the word embedding to achieve the modified word embedding. This method is denoted as the CWE+P2. As described in Section 3, character $c_i$ may have $n$ different pronunciations $\{p_1, p_2, \ldots, p_n\}$; therefore, the CWE+P2 creates $n$ embeddings that correspond to those $n$ pronunciations. For example, suppose that the character sequence of word $w_t$ is $D = \{c_1, c_2, \ldots, c_k\}$ and the corresponding pinyin sequence is $H = \{p_1, p_2, \ldots, p_k\}$, where $p_i$ is the pinyin of character $c_i$ in the word $w_t$. The modified word embedding is built from $v_{w_t}$ and the embeddings $v_{cp_j}$, where $v_{cp_j}$ corresponds to the pinyin of the $j$th character in the word. For the word "教室" (classroom), embeddings for the pinyin "jiao4" of character "教" (teach) and the pinyin "shi4" of character "室" (room) are created and denoted as $v_{cp_1}$ and $v_{cp_2}$, respectively. The modified word embedding for the word "教室" (classroom) combines the word embedding with $v_{cp_1}$ and $v_{cp_2}$.
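The difference between the two Chinese variants lies in how the pinyin embeddings are indexed, which a small sketch can make concrete. The sketch assumes equal-weight averaging of the word, character and pinyin terms; that weighting, the function names and the toy vectors are hypothetical. In the CWE+P1 a pinyin embedding is keyed by the pinyin alone, so all characters pronounced "shi4" share it; in the CWE+P2 it is keyed by the (character, pinyin) pair.

```python
def mean(vecs):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def cwe_p1(word_vec, char_vecs, pinyin_vecs):
    """Word, character and pinyin terms, each weighted equally (assumption)."""
    return mean([word_vec, mean(char_vecs), mean(pinyin_vecs)])

def cwe_p2(word_vec, char_pinyin_vecs):
    """Word term plus one embedding per (character, pinyin) pair (assumption)."""
    return mean([word_vec, mean(char_pinyin_vecs)])

# Toy lookup tables for "教室" (classroom).
char_emb = {"教": [0.0, 2.0], "室": [0.0, 4.0]}
pinyin_emb = {"jiao4": [2.0, 3.0], "shi4": [4.0, 3.0]}               # CWE+P1 keys
char_pinyin_emb = {("教", "jiao4"): [0.0, 2.0], ("室", "shi4"): [0.0, 6.0]}  # CWE+P2 keys

chars, pinyin = ["教", "室"], ["jiao4", "shi4"]
p1 = cwe_p1([3.0, 0.0], [char_emb[c] for c in chars], [pinyin_emb[p] for p in pinyin])
p2 = cwe_p2([2.0, 0.0], [char_pinyin_emb[cp] for cp in zip(chars, pinyin)])
print(p1, p2)
```

Keying by the pair lets a character with several readings carry a different vector for each reading, at the cost of a larger embedding table.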

Word similarity
In the experiment, the cosine similarity of word embedding is used to indicate semantic relatedness of a given word pair. A word similarity experiment is implemented to evaluate the quality of word embedding by comparing the semantic relatedness computed by models with human judgments. For this paper, wordsim-240 and wordsim-296 [26] were used as the evaluation datasets for Chinese; for English, MTurk-771 [27], MEN [28], WS-353-SIM and WS-353-REL [29] were used; for Spanish, WS-353 [30] and RG-65 [31] were used. The numbers of word pairs in these datasets are shown in Table 1. This paper uses the Spearman correlation ρ to evaluate the relatedness between the model results and human judgment; then, the model's performance on the word similarity experiment is evaluated according to ρ. In this experiment, the word pairs that contain new words are ignored. Each model is trained a minimum of 10 times; consequently, the experiment acquired at least 10 different results for each model. Tables 2-4 show the averaged experimental results for the different languages.
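The evaluation procedure can be sketched as follows: compute the cosine similarity for each word pair, skip pairs that contain a word without an embedding, and report the Spearman correlation between the model scores and the human scores. This is a minimal pure-Python sketch; the function names and the `(word1, word2, score)` tuple layout are assumptions, and ties are handled with average ranks.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def evaluate(pairs, emb):
    """Spearman correlation between model similarity and human judgment."""
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in emb and w2 in emb:  # ignore pairs that contain unseen words
            model.append(cosine(emb[w1], emb[w2]))
            human.append(score)
    return spearman(model, human)
```

Averaging the value returned by `evaluate` over repeated training runs gives the kind of per-dataset figure reported in the result tables.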
From the experimental results of the models based on the CBOW, a number of observations can be made. The CBOW+P outperforms the CBOW on all datasets and languages, and it obtains the best results overall. For Chinese, the ρ of the CBOW+P increases by 4.7% on wordsim-296. For Spanish, the ρ of the CBOW+P increases by 6.8% on RG-65. The models that include pronunciation information generally performed better than the benchmark models: for Chinese, the CBOW+P and the CWE+P1 outperformed the corresponding benchmark models, and for English and Spanish, the CBOW+P outperformed the CBOW.
From the experimental results of the models based on the Skip-gram model, we observe that the PWE based on the Skip-gram model also achieves good results. For English, the ρ of the Skip-gram+P model is better than that of the Skip-gram model on MTurk-771, WS-353-SIM and WS-353-REL. For Spanish, the ρ of the Skip-gram+P model increases by 4.6% on RG-65. However, the experimental results based on the Skip-gram model are somewhat weaker than those based on the CBOW. A likely explanation is that it is difficult to predict surrounding sounds from only one sound, whereas given a sound sequence with one sound missing, it is easy to guess the missing sound from the surrounding sounds. From this experiment, we can observe that the PWE performs well for different languages.

Text classification
For text classification experiments, this paper adopted the Fudan (refer to http://www.datatang.com/data/44139), Sogou (refer to http://download.labs.sogou.com/) and Netease (refer to http://www.datatang.com/data/1196) corpora for Chinese, 20Newsgroups (refer to http://qwone.com/~jason/20Newsgroups/) for English and TASS 2017 [32] for Spanish as the experimental datasets. The Fudan corpus contains 20 categories. The number of documents in each category ranges from tens to thousands. For the experiment presented in this paper, 5 categories were selected, each of which contains more than 1,000 documents. Table 5 shows the categories and the corresponding numbers of documents. The documents include various types of papers and news reports. Sogou and Netease are news classification datasets. Sogou includes nine categories; each category includes 1,990 documents. Netease includes six categories with 4,000 documents each. 20Newsgroups contains 20 newsgroups, some of which are very closely related to each other. Following the grouping on its home page, this paper extracted six categories, which are shown in Table 6. The category column lists the extracted categories, and the sub-categories column lists the original newsgroups included in each extracted category. TASS 2017 is a sentiment classification dataset that is based on Twitter. TASS 2017 includes four categories ("P", "N", "NEU" and "NONE"), of which "P" and "N" have the most samples. As such, this paper chose "P" and "N" for the experiment. This experiment used the average word embedding in the document to represent the document. The text classifier was trained with LIBLINEAR [33]. For a corpus that does not distinguish between the training set and test set, we used 5-fold cross validation. The accuracies of the models on different languages are shown in Table 7, and their F-measure scores for each category of each corpus are shown in Tables 8-12, respectively.
The F-score is a metric that considers both precision and recall. Several observations can be made from the preceding tables. (1) For almost all languages and corpora, the optimal accuracy is obtained by a model that adds pronunciation information. (2) Regardless of language and corpus, the optimal F-score in each category is generally obtained by a model that adds pronunciation information. (3) After adding the pronunciation embedding, the accuracy and the F-score of each category are generally better than the benchmarks. For Chinese, the PWE based on word2vec generally outperforms word2vec, and the PWE based on the CWE is also generally better than the CWE. For English, the accuracy and the F-score of the PWE based on word2vec are generally better than those of word2vec on 20Newsgroups. For Spanish, the accuracy and the F-score of "Negative" of the CBOW+P model are better than those of the CBOW. These three points demonstrate that including pronunciation information improves the performance of the word embedding model for different languages.
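The document representation and the cross-validation split can be sketched as below. The classifier itself is omitted because no LIBLINEAR settings are given; the function names are hypothetical, skipping words without an embedding is an assumption, and the round-robin fold assignment is only one simple way to form five folds.

```python
def doc_vector(tokens, emb, dim):
    """Represent a document by the mean of its word embeddings.

    Assumption: tokens without an embedding are skipped, and a document
    with no known tokens maps to the zero vector.
    """
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for f, fold in enumerate(folds) if f != i for idx in fold]
        yield train, test

emb = {"a": [1.0, 3.0], "b": [3.0, 5.0]}
print(doc_vector(["a", "b", "x"], emb, 2))
```

Each `(train, test)` split would be used to fit a linear classifier on the training document vectors and to score it on the held-out fold.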
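The per-category F-score can be computed from gold and predicted labels as sketched below. The sketch assumes the balanced F-score (the harmonic mean of precision and recall, often called F1), since the weighting between precision and recall is not stated; the function name and the toy labels are hypothetical.

```python
def f_score(gold, pred, category):
    """Balanced F-score of one category from gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == category and p == category)
    fp = sum(1 for g, p in zip(gold, pred) if g != category and p == category)
    fn = sum(1 for g, p in zip(gold, pred) if g == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["P", "P", "N", "N"]
pred = ["P", "N", "N", "N"]
print(f_score(gold, pred, "P"), f_score(gold, pred, "N"))
```

Reporting the score per category, as the tables do, shows whether a gain in accuracy comes from all categories or only from the largest ones.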

Qualitative analysis
This section evaluates the quality of word embedding and pronunciation embedding by finding the sounds most similar to a given word and the words most similar to a given sound in the Chinese language. All embeddings were trained by the CBOW+P. Cosine similarity is used to find the 4 most similar embeddings. The experimental results are shown in Tables 13 and 14. Table 13 shows that a semantic correlation exists between a word and the pinyin that are most similar to it. For example, the 4 most similar pinyin of the word "投降" (surrender) are "che4", "tui4", "jun1" and "bai4", where "che4" and "tui4" correspond to "撤退" (withdraw), "jun1" corresponds to "军队" (army) and "bai4" corresponds to "败" (fail). The words "财富" (wealth) and "体育" (sports) also have semantic correlations to their most similar pinyin.
According to Table 14, the words that are most similar to a pinyin contain characters with that pinyin. For example, the 4 most similar words of the pinyin "bai4" include the characters "败" (fail), "拝" (bow), and "" (bow), whose pinyin are "bai4". The 4 most similar words of the pinyin "shui4" include the characters "睡" (sleep) and "税" (tax), whose pinyin are both "shui4". The 4 most similar words of the pinyin "cai2" include the characters "才" (just) and "财" (wealth), whose pinyin are both "cai2". This result demonstrates that the word embedding obtained by the PWE contains rich sound information.
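The lookup behind both tables is a nearest-neighbour search by cosine similarity across the two embedding tables, which works because the CBOW+P trains word and pronunciation embeddings in the same space. A minimal sketch follows; the function names and the toy two-dimensional vectors are hypothetical and are not values from the experiments.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(query_vec, candidates, n=4):
    """Names of the n candidate embeddings closest to the query vector."""
    scored = sorted(candidates.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in scored[:n]]

# Toy pronunciation embeddings queried with a toy word embedding.
pinyin_emb = {"bai4": [1.0, 0.0], "shui4": [0.0, 1.0], "cai2": [0.7, 0.7]}
print(most_similar([0.9, 0.1], pinyin_emb, n=2))
```

Passing a word vector with the pinyin table gives the pinyin nearest to a word, and swapping the roles gives the words nearest to a pinyin.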

Conclusions
According to linguistic principles, spoken language is a direct expression of semantics, and written language is a record of spoken language. This paper proposes the PWE, which integrates word pronunciation into the word embedding model. The PWE is highly expandable in two respects. First, the PWE can easily be combined with existing models such as word2vec and the CWE. Second, language is a storehouse of sound-images; therefore, the PWE can be applied to most languages. This paper introduces a variety of PWEs based on different existing models for different languages. Word similarity and text classification experiments demonstrate that the quality of word embedding is improved after adding sound information, which is beneficial to training. In addition, a qualitative embedding analysis revealed that word embedding contains rich sound information and that pronunciation embedding also contains semantic information. However, this paper only adds the pronunciation embedding to the word embedding; this use of sound information is relatively simple, and sound information should be exploited further in future work.