Automatic meter classification of Kurdish poems

Most of the classic texts in Kurdish literature are poems. Knowing the meter of a poem helps with reading it correctly, understanding its meaning, and avoiding ambiguity. This paper presents a rule-based method for automatic classification of poem meter in Central Kurdish, also known as Sorani. The metrical system of Kurdish poetry is divided into three classes: quantitative, syllabic, and free verse. As vowel length is not phonemic in the language, there are uncertainties in syllable weight and meter identification. The proposed method generates all possible syllable-weight sequences for each line and then, by considering all lines of the input poem and the common meter patterns of Kurdish poetry, identifies the most probable meter type and pattern of the input poem. Evaluation of the method on a dataset from the VejinBooks Kurdish corpus resulted in a precision of 97.3% in meter type and 96.2% in pattern identification.

process confronts ambiguities. In addition, syllable weight is not a distinctive concept in Kurdish, and it is possible to change the weight of some syllables in a poem without altering the meaning.
The proposed method uses the rule-based SCK grapheme-to-phoneme converter [5], which also syllabifies the input poem. Then, to analyze and detect the poem meter, we consider all possible weight patterns of each line. Finally, we analyze the whole poem to calculate a score for each common quantitative pattern. If a pattern repeats in all lines, the proposed method classifies the poem as quantitative. Otherwise, if most of the lines have an equal syllable count, the poem is a syllabic verse; failing that, it is a free verse.
The rest of the paper is organized as follows: Section 2 reviews the phonology and alphabet of Standard Central Kurdish and the common types of meters of Kurdish poems. Section 3 presents the steps of the proposed method for the classification of Kurdish poems. Section 4 describes the test dataset and results. Section 5 gives conclusions and future work.

Phonemes, alphabet, and syllables of SCK
There are 37 phonemes in SCK, including 8 vowels and 29 consonants [5]. This study uses the Hawar alphabet (standard Latin script for Northern Kurdish) with changes in some consonants. Table 1 compares IPA and the Standard Arabic alphabet of Kurdish with this study's transcription of consonants.
As syllable weight is essential to the identification of poem meter, we discuss the length of SCK vowels in more detail. Table 2 describes the details of Standard Central Kurdish vowels. The long vowels (/î, ê, a, o, û/) are shorter in final unstressed positions, and the short vowel /e/ in word-final positions can be pronounced longer [6]. The vowel /i/ (bizroke) is unstable in most environments [7] and does not have a grapheme in the standard Kurdish alphabet [5,8].
The Kurdish alphabet, which is adapted from the Perso-Arabic script, has ambiguities in three cases [5]:
• The letter "ی" indicates both the consonant /y/ and the vowel /î/.
• There is no letter for the short vowel /i/ (bizroke) in the Arabic script of Kurdish.
In the syllable structure of Kurdish, the nucleus is always a vowel, and the onset is one or two consonants. In two-consonant onsets, the second consonant must be /w/ or /y/. Coda has zero to three consonants. Three-consonant coda is rare, occurs only in some dialects [9], and was not observed in our dataset of Kurdish poetry. Table 3 presents syllable types and their normal weight in SCK.

Types of Kurdish poems
As mentioned, the Kurdish language is a collection of dialects whose speakers live in Iran, Iraq, Turkey, Syria, and parts of the Caucasus. Contact with neighboring nations has given Kurdish literature characteristics of several literary traditions.
In classical Kurdish literature, there are three categories of poetic works: "quantitative (Arudi) verse", "beit (syllabic songs)" and "gorani (lyric songs)" [10]. Kurdish quantitative verses are an imitation of Arabic and especially Persian poetry [11], and their meter is based on syllable weight, i.e., all lines of a poem have an equal number of syllables and repeat a pattern of light and heavy syllables [12]. Beit and gorani have a "syllabic meter". The syllabic or numerical meter is rooted in the ancient tradition of Iranian languages, and it has long existed among different ethnic groups in Iran [10]. There is evidence of a syllabic meter in pre-Islamic literature in the texts of the Zoroastrian and Manichaean rituals [10,13,14]. In a syllabic meter, the weight of the syllables and the place of stress do not affect the meter, and only the total number of syllables in each line is important.
Fixed-form poems in Kurdish consist of lines that have an equal number of syllables. In most of the forms, like ghazal and mathnawi, even lines rhyme; however, in some forms, like mukhammas, rhyming is different. In the modern literature of Kurdish, "free verse" is a new style that is not limited to a fixed form, and the number of syllables in each line may be different [15]. This study considers three types of Kurdish poems: Quantitative, Syllabic, and Free verses.

Syllabic verses.
In syllabic or numerical verses, only the number of syllables in the feet is considered, and the sequence of syllable weights does not follow a specific pattern. Kurdish folk poems are syllabic verses [16]. Kurdish syllabic verses use three types of feet, with three, four, and five syllables, which are repeated uniformly or alternately in each line [12]. Table 4 shows the types of syllabic verses in Kurdish and how feet are combined to form each line. The most common type of syllabic verse in Kurdish is the 10-syllable line [10, p. 15], [12, p. 247].

Quantitative verses.
The quantitative meter is an arrangement of heavy (ˉ) and light (˘) syllables in a line of the poem, as is found in Greek and Latin poems [17]. This type of meter fits languages like Arabic, where the vowel length is distinctive and changes the meaning. Arabic has three short vowels [a i u] with distinctive long pairs [aː iː uː] [18]. For example, in the following Arabic hemistich by Hafez (1325-1390), all vowels are pronounced with their normal lengths:
However, in languages such as Persian and Kurdish, whose vowel length is not distinctive, to follow the metrical pattern of the poem, some syllables can be pronounced contrary to their natural weight [19]. For example, in another hemistich of that poem which is in Persian, the short vowel /e/ in the word /ha.me/ 'all' should be pronounced as /ha.meː/ to preserve the meter:
However, to save the meter, some syllables (4, 5, and 11) are pronounced differently, as in the following line by Piramerd (1867-1950) from the meter "ˉˉ˘/ˉ˘ˉ˘/˘ˉˉ˘/ˉ˘ˉ":
«چەند ساڵ گوڵی ھیوای ئێمە پێپەست بوو تاکو پار»
Table 5 shows the most common patterns of Kurdish quantitative verses extracted from the VejinBooks corpus [20].
Aziz Gardi [15] has also conducted a comprehensive statistical study of the quantitative verses of 82 Kurdish poets, which future work can draw on.

Meter classification in Kurdish poetry
The principles of classification of quantitative meter in Kurdish are similar to those of Persian [12]. Experts in Persian poetry use the following traditional steps to identify the meter of quantitative verses [19,21]:

PLOS ONE
1. Scansion: each line is divided into its syllables.
2. Comparison: the sequence of light and heavy syllables is compared with the known common meter patterns.
3. Considering poetic license: sometimes the pronunciation of certain words must be changed to match the overall meter of the poem, for example by making a light syllable heavy, lightening a heavy syllable, or blending two adjacent words together.
If a pattern is repeated in all lines, the poem will be recognized as a quantitative verse, and that pattern is proposed as the poem's meter. Otherwise, if all lines have an equal number of syllables, then the poem will be recognized as a syllabic verse of that number. Else, when lines have neither a consistent pattern nor an equal number of syllables, then the poem is free verse [12].

Related works
As far as the authors know, no research has been done on the automatic classification of Kurdish poetry. Considering the similarities between the Kurdish quantitative verses and classic Persian, Arabic, and Turkish ones, we will give a brief overview of the works done in these languages.
In the Arabic and Persian orthographies, short vowels are written only in texts for children or in religious texts. The absence of short vowels in poems is a challenge for the syllabification step [22]. Mojiri [23], Kurt & Kara [24], Alabbas et al. [25], and Abuata & Al-Omari [26] have considered preprocessing steps for inserting short vowels (diacritization) and turning the text into a phonemic representation. For example, the Basrah system [25] converts a word like "سدّ" to "سَدْد" and " ". Mojiri [23] looks up the words that cannot be syllabified by the rules in a transliteration dictionary. Jafari Qamsari [27] relies on the distributional characteristics of Persian phonemes and, using poetic and phonetic rules, converts the input Persian couplet into a string of light, heavy, and potentially heavy syllables.
Recently, data-driven and machine-learning approaches have been applied to Arabic and Persian meter classification. Yousef et al. [28] encode the input poem at the character level and feed it directly to recurrent neural networks without handcrafted features. Yousefi [29] finds the unwritten linking vowel (izafe) using convolutional neural networks. Al-shaibani et al. [30] classify the meter of Arabic poems with deep bidirectional recurrent neural networks, without diacritization. Abandah et al. [31] use recurrent neural networks with bidirectional long short-term memory cells to diacritize the input Arabic poems.
A critical step in meter classification is the comparison with common patterns. Mojiri [23] and Yousefi [29] compare the poem with 31 common Persian patterns, Alabbas et al. [25] compare with 16 meters of Arabic, and Kurt & Kara [24] compare with 20 plain and 45 mixed Ottoman templates.

Kurdish poem meter classification
In this section, we describe the proposed method in detail. The input is a Kurdish poem text written in the standard alphabet of Kurdish. The output is the type (quantitative, syllabic, or free verse) and the metrical pattern of the poem. The traditional manual method described earlier influenced our method of automatic meter classification. Fig 1 illustrates the flowchart of the proposed method. The method is available as a web application at https://asosoft.github.io/poem.

Normalization and syllabification
As the input of the proposed method is plain text, we perform the following normalization steps to prevent errors:
• Removing lines that contain many non-Kurdish characters or have previously been tagged as non-Kurdish.
Next is the syllabification process. Orthographies like those of English, Arabic, and Central Kurdish, which do not have a one-to-one correspondence between letters and phonemes, pose challenges for syllabification. In this study, we use the rule-based method of SCK grapheme-to-phoneme conversion presented by Mahmudi & Veisi [5], which converts the input text into a syllabified string of phonemes. This method also correctly merges the conjunction 'و' (and) with the previous word [5]. This merge (e.g., "تیر و کەوان" /tî.rû ke.wan/ and "پۆڵا و ئاسن" /po.ław ʔa.sin/) is required for the following scansion step. For example, the following line by Nalî (1800-1877) will be normalized and syllabified as:
Inside the input poem, some words may occur more than once; therefore, we store the syllabified sequence of phonemes for each word to speed up syllabification.
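The per-word caching described above can be sketched as follows. This is a minimal illustration, not the converter of [5]: `syllabify_word` is a hypothetical stand-in that works on the Latin transcription, treats every vowel as a nucleus, and gives each nucleus at most one onset consonant, so it ignores two-consonant onsets and all orthographic ambiguities. Only the caching pattern is the point.

```python
from functools import lru_cache
from typing import List, Tuple

# The eight SCK vowels in the Latin transcription (spelling assumed here).
VOWELS = set("aeiouêîû")


@lru_cache(maxsize=None)
def syllabify_word(word: str) -> Tuple[str, ...]:
    """Hypothetical, simplified stand-in for the G2P converter of [5]."""
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not nuclei:
        return (word,)
    syllables = []
    start = 0
    for k, pos in enumerate(nuclei):
        if k + 1 < len(nuclei):
            nxt = nuclei[k + 1]
            # Leave one consonant (if any) as the onset of the next syllable.
            end = nxt - 1 if nxt - pos > 1 else nxt
        else:
            end = len(word)
        syllables.append(word[start:end])
        start = end
    return tuple(syllables)


def syllabify_line(line: str) -> List[Tuple[str, ...]]:
    # Repeated words are served from the cache instead of being re-analyzed.
    return [syllabify_word(w) for w in line.split()]


line = "ger nebexşê merhemî wesłî birînim karîye"
first = syllabify_line(line)
second = syllabify_line(line)  # every word is now a cache hit
print(first[1])
print(syllabify_word.cache_info())
```

A plain dictionary keyed by word would work equally well; `lru_cache` is used here only because it keeps the sketch short.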

Generating candidates of syllable weight sequence
As vowel length is not distinctive in Kurdish, short vowels must sometimes be pronounced long, and long ones short, to preserve the meter in quantitative verses. There are some clues for recognizing syllable weight changes automatically:
• A: Long vowels in word-final unstressed positions are pronounced short.
• B: When a short vowel precedes an open juncture (punctuation or the end of a line), it is usually pronounced long.
• C: In quantitative verses whose feet start with two adjacent light syllables, the first syllable is often naturally heavy and must be pronounced light to satisfy the meter.
To manage the uncertainties in syllable weights, we use the above clues to generate the possible weight sequence candidates. For example, in /ger nebexşê merhemî wesłî birînim karîye/, we have:
• By clue C, syllable 1 can be pronounced light, because we do not know the meter at this stage.
• By clue A, syllables 7 and 9 can be pronounced light.
• By clue D, syllable 14 is pronounced light.
• By clue B, syllable 15 can be pronounced heavy.
In the above example, each of 5 syllables has 2 possible weights; therefore, 2^5 = 32 sequence candidates can be generated.
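The candidate generation in the example above can be sketched as a Cartesian product over the ambiguous syllables. This is an illustrative sketch: the function name is hypothetical, "-" and "u" are ASCII stand-ins for ˉ and ˘, and the natural weight string of the line is an assumption derived from the usual rule that closed syllables and open syllables with a long vowel are heavy.

```python
from itertools import product
from typing import Iterable, List

HEAVY, LIGHT = "-", "u"  # ASCII stand-ins for the macron and breve marks


def weight_candidates(natural: str, ambiguous: Iterable[int]) -> List[str]:
    """Return every weight sequence obtained by keeping or flipping the
    weight of each ambiguous syllable (positions are 1-based)."""
    positions = sorted(set(ambiguous))
    flip = {HEAVY: LIGHT, LIGHT: HEAVY}
    candidates = []
    for choice in product((False, True), repeat=len(positions)):
        seq = list(natural)
        for pos, do_flip in zip(positions, choice):
            if do_flip:
                seq[pos - 1] = flip[seq[pos - 1]]
        candidates.append("".join(seq))
    return candidates


# Natural weights assumed for /ger nebexşê merhemî wesłî birînim karîye/.
natural = "-u---u---u----u"
# Ambiguous syllables taken from the example: 1, 7, 9, 14, and 15.
candidates = weight_candidates(natural, [1, 7, 9, 14, 15])
print(len(candidates))  # 32
```

Because each ambiguous syllable doubles the number of candidates, the count grows exponentially with the number of ambiguous positions, which is why the clues are used to restrict the set.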

Finding the matching patterns for each line
As quantitative meters are more detailed and harder to compose, we first examine the lines of the poem to detect a quantitative pattern. In this step, for each line, we compute the Levenshtein edit distance between each syllable weight sequence candidate and the 27 common meter patterns (presented in Table 5). For example, if a line of the poem has 32 weight sequence candidates, we must calculate 32×27 = 864 edit distances. Since the strings are less than 20 characters long and contain only two characters (˘ and ˉ), these calculations are fast. For each line, we store only the candidate-pattern pairs that have the smallest distances below a maximum acceptable distance (set to 4). For example, for /ger nebexşê merhemî wesłî birînim karîye/, only 62 of the 864 pairs are acceptable, and one of these 62 pairs is:
• ˉ˘ˉˉˉ˘ˉˉˉ˘ˉˉˉ˘˘ (syllable weight sequence candidate)
• ˉ˘ˉˉˉ˘ˉˉˉ˘ˉˉˉ˘ˉ (nearest common pattern, with an edit distance of 1)
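The matching step can be sketched with a standard dynamic-programming Levenshtein distance. The sketch below makes several assumptions: "-" and "u" stand in for ˉ and ˘, `PATTERNS` holds only three patterns quoted in this paper rather than the 27 of Table 5, and "below a maximum acceptable distance" is read as a strict inequality. The function names are hypothetical.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def acceptable_pairs(candidates: List[str], patterns: List[str],
                     max_dist: int = 4) -> List[Tuple[str, str, int]]:
    """Keep (candidate, pattern, distance) triples with distance < max_dist."""
    pairs = []
    for cand in candidates:
        for pat in patterns:
            d = levenshtein(cand, pat.replace("/", ""))  # ignore foot marks
            if d < max_dist:
                pairs.append((cand, pat, d))
    return pairs


# Illustrative subset only; the method compares against 27 patterns.
PATTERNS = ["-u--/-u--/-u--/-u-", "u---/u---/u--", "--u/-u-u/u--u/-u-"]

pairs = acceptable_pairs(["-u---u---u---uu"], PATTERNS)
for cand, pat, d in pairs:
    print(cand, pat, d)
```

With strings this short, the quadratic cost of each distance computation is negligible even for all 864 comparisons of a line.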

Meter classification
Classifying the meter of a Kurdish poem from just one or two lines is not always reliable, because:
• some lines of a syllabic verse may follow a quantitative pattern
• some lines may contain misspellings
• syllabification of some words and weight of some syllables are ambiguous
• unprofessional poets may make mistakes in patterns
Therefore, in our proposed method, we consider all lines of the poem together. Each acceptable pair from the previous step adds to the score of the corresponding pattern for the whole poem. Eventually, there is a score for each common meter pattern. For example, if the pattern ˘ˉˉˉ/˘ˉˉˉ/˘ˉˉ has a small distance to most of the lines, its score will be higher. The pattern with the highest score is the most probable quantitative pattern of the poem; i.e., we define:
In which P is the most probable quantitative pattern of the poem, M is the set of common metrical patterns, n is the number of lines of the poem, Dist(p_j, w_i) is the edit distance between a pattern (p_j) and the weight sequence of a line (w_i), and MaxDist is the maximum acceptable edit distance (set to 4).
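The scoring step can be sketched as follows. This is a minimal illustration under one assumption that the text does not confirm: each acceptable pair contributes MaxDist minus Dist to its pattern, so closer matches count more. The function names and the toy distances are hypothetical, and "-" and "u" stand in for ˉ and ˘.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

MAX_DIST = 4  # maximum acceptable edit distance

# One list of (pattern, distance) pairs per line of the poem.
LinePairs = List[Tuple[str, int]]


def pattern_scores(lines: List[LinePairs],
                   max_dist: int = MAX_DIST) -> Dict[str, int]:
    """Accumulate a poem-level score for every pattern (assumed weighting)."""
    scores: Dict[str, int] = defaultdict(int)
    for pairs in lines:
        for pattern, dist in pairs:
            if dist < max_dist:
                scores[pattern] += max_dist - dist
    return dict(scores)


def most_probable_pattern(scores: Dict[str, int]) -> Optional[str]:
    """Return the pattern with the highest score, or None if none matched."""
    return max(scores, key=scores.get) if scores else None


PATTERN_A = "-u--/-u--/-u--/-u-"
PATTERN_B = "u---/u---/u--"

# Toy input: three lines with made-up distances.
poem_pairs = [
    [(PATTERN_A, 0), (PATTERN_B, 3)],
    [(PATTERN_A, 1)],
    [(PATTERN_A, 0), (PATTERN_B, 2)],
]
scores = pattern_scores(poem_pairs)
print(scores)
print(most_probable_pattern(scores))
```

Summing over all lines is what makes the decision robust to a few misspelled or irregular lines: a single bad line lowers the winning score slightly but rarely changes the winner.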
Sometimes the score of the winning pattern is close to that of another pattern, and the victory is not decisive. Therefore, the most probable pattern should be qualified by a confidence criterion, according to the following formula:
The poem must satisfy the following conditions to be recognized as a "quantitative verse":
• Nearly all lines of the poem must have the same syllable count, i.e., the standard deviation of the syllable counts has to be small.
• The majority of lines must comply with a pattern, i.e., the calculated confidence has to be high.
If a poem fulfills only the first condition, the proposed method classifies it as a "syllabic verse" whose syllable count is the statistical mode of the syllable counts of its lines. Otherwise, if the poem fulfills neither condition, it is classified as a "free verse".
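The final decision can be sketched as a small function. This is an illustration only: the confidence value is taken as an input because its formula is not restated here, and the two thresholds (`max_stdev`, `min_confidence`) are hypothetical values chosen for the example, not the ones used in the evaluation.

```python
from statistics import mode, pstdev
from typing import List, Optional, Tuple


def classify_poem(syllable_counts: List[int],
                  best_pattern: Optional[str],
                  confidence: float,
                  max_stdev: float = 0.5,        # hypothetical threshold
                  min_confidence: float = 0.8    # hypothetical threshold
                  ) -> Tuple[str, Optional[str]]:
    """Return (type, detail): the pattern for quantitative verse, the
    syllable count for syllabic verse, and None for free verse."""
    # Condition 1: nearly all lines have the same syllable count.
    uniform_length = pstdev(syllable_counts) <= max_stdev
    # Condition 2: the winning pattern is supported with high confidence.
    decisive = best_pattern is not None and confidence >= min_confidence
    if uniform_length and decisive:
        return ("quantitative", best_pattern)
    if uniform_length:
        return ("syllabic", str(mode(syllable_counts)))
    return ("free", None)


print(classify_poem([15, 15, 15, 15], "-u--/-u--/-u--/-u-", 0.95))
print(classify_poem([10, 10, 10, 10], None, 0.0))
print(classify_poem([4, 9, 13, 6], None, 0.0))
```

Using the standard deviation rather than strict equality tolerates the occasional line whose syllable count is off by one because of a misspelling or an ambiguous syllabification.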

Test dataset
We evaluated our proposed method on a dataset consisting of 1,154 Central Kurdish poems (979 quantitative, 130 syllabic, and 45 free verses) from available poems of "VejinBooks" (available at https://books.vejin.net). This website is a growing free online corpus of Kurdish literary texts in different dialects of Kurdish. The type and meter of all poems in this corpus are specified manually. VejinBooks also has statistics about the frequency of each meter available in the corpus. Among the available texts on the website, we chose only poems with more than three couplets from 12 well-known poets of Central Kurdish. Table 6 shows the overall statistics of the dataset. The dataset is available on GitHub at https://github.com/AsoSoft/Vejinbooks-Poem-Dataset (reference number 4079471).
The use of Arabic or Persian phrases (like Arabic Quranic verses) within the text is a common convention in Kurdish poetry. Since our method is based on Central Kurdish phonology, this can be a problem for the evaluation. Fortunately, in the VejinBooks corpus, non-Kurdish phrases are tagged. We removed from the test dataset all lines that contain a non-Kurdish phrase.

Test results
We evaluated our method on type (quantitative, syllabic, or free) and pattern classification. The evaluation metrics are precision, recall, and F1-score. Table 7 shows the results of the poem-type classification. Table 8 gives the test results for pattern classification and shows the performance of the proposed method for each pattern. The recall for patterns that contain "فعلاتن" (˘˘ˉˉ) or "فعلن" (˘˘ˉ) is low. The method often classifies poems with these patterns as syllabic, which lowers the classification precision for the syllabic type, as shown in Table 7. A possible reason is that words with two adjacent light syllables that fit the start of a foot are hard to find in Kurdish, so poets resort to poetic licenses to preserve the meter.
Fig 2 shows the test results for metrical pattern classification, separated by author. It gives an indication of how closely each poet complies with the patterns and how sparingly each uses poetic licenses. For example, Herdî is known for having few but admirable poems. The lower accuracy for the poems of Hêmin and Ḧacî Qadir is due to their use of patterns with "فعلاتن" (˘˘ˉˉ) or "فعلن" (˘˘ˉ) feet and of more poetic licenses.
Table 8 (excerpt). Pattern, number of poems, precision, recall, and F1-score (%):
مفاعیلن مفاعیلن مفاعیلن مفاعیلن: 245, 100, 100, 100
فاعلاتن فاعلاتن فاعلاتن فاعلن: 224, 99, 100, 99
مفاعیلن مفاعیلن فعولن: 151, 99, 100
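The per-class metrics used in this evaluation can be computed as in the sketch below. The labels in the usage example are toy data invented for illustration, not results from the test dataset; they are arranged so that one quantitative poem is misclassified as syllabic, which shows how such errors lower the recall of the quantitative class and the precision of the syllabic class. The function name is hypothetical.

```python
from typing import Dict, List


def per_class_metrics(gold: List[str],
                      predicted: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F1 for every label."""
    report = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy labels (not real results): one quantitative poem predicted as syllabic.
gold = ["quantitative"] * 4 + ["syllabic"] * 2 + ["free"]
predicted = ["quantitative"] * 3 + ["syllabic"] * 3 + ["free"]

report = per_class_metrics(gold, predicted)
for label, scores in report.items():
    print(label, scores)
```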

Conclusions and future works
In this paper, we have proposed an automatic poem meter classifier for the Central Kurdish language. The evaluations achieved an overall precision of 97.3% in meter-type classification and overall precision of 96.2% in metrical pattern identification.
In the future, we plan to extend the method to identify subclasses of Kurdish free verse. Automatic author identification based on a poem's characteristics is another direction for future work. Since Kurdish is a low-resource language, our rule-based classifier can help with the tedious task of data preparation for future machine-learning solutions. Currently, this algorithm assists the contributors of the VejinBooks online corpus in tagging newly imported poems. Furthermore, an online application has been developed for amateur poets to evaluate their poems.