A clustering approach to identify multidimensional poverty indicators for the bottom 40 percent group

The Multidimensional Poverty Index (MPI) is an income-based poverty index which measures multiple deprivations alongside other relevant factors to determine and classify poverty. The implementation of a reliable MPI is one of the significant efforts by the Malaysian government to improve measures for alleviating poverty, in line with the recent policy for the Bottom 40 Percent (B40) group. However, using this measurement, only 0.86% of Malaysians are regarded as multidimensionally poor, and this measurement has been claimed to be irrelevant for Malaysia as a country with rapid economic development. Therefore, this study proposes a B40 clustering architecture based on K-Means with cosine similarity to identify the right indicators and dimensions for a data-driven MPI measurement. To evaluate the approach, this study conducted extensive experiments on the Malaysian Census dataset. A series of data preprocessing steps was implemented, including data integration, attribute generation, data filtering, data cleaning, data transformation and attribute selection. The clustering model produced eight clusters of the B40 group. The study included a comprehensive clustering analysis to meaningfully understand each of the clusters. The analysis discovered seven indicators of multidimensional poverty from three dimensions encompassing education, living standard and employment. Out of the seven indicators, this study proposed six indicators to be added to the current MPI to establish a more meaningful picture of the current poverty trend in Malaysia. The outcomes from this study may help the government to properly identify members of the B40 group who suffer from financial burden and who may currently be misclassified.


Introduction
Malaysia has made significant progress in poverty reduction over the past half century, with major initiatives undertaken by the government since the introduction of the New Economic Policy (NEP) in 1971 [1]. Afterwards, the New Economic Model (NEM) was launched in 2010 with the main objective to make Malaysia a high-income and developed country by mortality, years of schooling, school attendance, cooking fuel, sanitation, drinking water, electricity, housing and assets. Each dimension has the same weight of one third. The MPI looks at poverty from a broader perspective and captures how poverty can be experienced in many ways at the same time. The multidimensional measures satisfy several useful properties which allow, for instance, poverty targeting and comparisons over time and across countries and regions.
In accordance with that, Malaysia has also taken steps to develop its own Multidimensional Poverty Index (MPI) model at the national level as outlined in the Eleventh Malaysia Plan (11MP), following in the footsteps of 100 countries worldwide that have already adopted the methods launched by OPHI in 2010 [7]. It also complements the PLI by considering other aspects apart from income. The Malaysian MPI covers four dimensions (education, health, living standards and income) with 11 indicators: schooling years, school attendance, healthcare access, clean water access, living place conditions, room crowdedness, toilet, garbage collection facility, transportation, basic communication tools and mean monthly household income [4]. However, according to a mid-term review of the 11th Malaysia Plan released in October 2018, the index calculated using the MPI model was reported to be 0.0033, while the incidence of multidimensional poverty was 0.86% at the national level for 2016 [6]. According to Dr Kenneth Simler, a Senior Economist of the World Bank Group Global Knowledge and Research Hub Malaysia, the index is too low for Malaysia, and it was recommended to raise the benchmark, the so-called deprivation cut-off level, by using both the MPI and PLI models in the future [8]. It is therefore crucial to identify the indicators that are important for MPI classification, which the government can use for further strategic planning towards poverty elimination. The recognition of these limitations has led us to propose this study, which uses a data analytics approach to identify relevant indicators for multidimensional poverty classification. The proposed study makes use of clustering, a machine learning technique, for poverty classification.
Machine learning methods are the most commonly used methods for predicting poverty. There are two main groups of machine learning methods, namely supervised and unsupervised learning. In supervised learning, the algorithm learns from training data that contain user-defined labels. The algorithm repeatedly makes predictions on the training data, and learning stops once it has achieved a certain level of performance. Then, a test set is used to verify the accuracy of the predictions. In contrast, in unsupervised learning the data are unlabeled, and the aim is to reveal unusual structures or patterns without a clear learning goal [9][10][11]. Many studies have analyzed multidimensional poverty using machine learning methods such as classification and clustering [12][13][14][15][16][17]. Clustering is a method of collecting data objects and grouping them based on their similarity to gain an in-depth understanding of the data distribution. In general, there are five key approaches to clustering, namely partitioning, hierarchical, density-based, grid-based and model-based [18].
To date, many studies have been published in the B40 domain. Mohd Zain and Tambi described the B40 group as the urban poor in Malaysia and studied urban poverty as a factor in the development of late bloomers in education [19]. Abdullah and Mohammad studied the health literacy level among B40 and M40 men and the demographic factors related to health literacy [20]. Another group of researchers looked at the causes contributing to the increasing cost of living in this group [21]. Mayan, Mohd Nor and Samat examined the challenges faced in increasing the income of the B40 group [22]. A recent study conducted by Sani classified the B40 group with a predictive model using machine learning. The researchers compared the performance of three classification algorithms, namely Naïve Bayes, Decision Tree and k-Nearest Neighbor (kNN), and concluded that the Decision Tree model is the best model for classifying the B40 group [9].
In the past few decades, researchers have developed a large number of clustering algorithms, such as partitional, hierarchical and density-based clustering (DBC) methods. These clustering algorithms have been applied in a wide variety of domains, such as image processing, data mining, market segmentation, medical imaging, social networks and poverty. For instance, Ahmad and Ejaz [23] used the Two-Step Cluster Analysis technique. They found that sex ratio, income and education were the crucial contributing factors in the non-poor group, while dependence rate and family size were the crucial contributing factors in the poor group. Apart from that, the Analytic Hierarchy Process (AHP) was applied for poverty classification, while K-Means clustering was used to determine the range values between clusters [24]. Likewise, Coromaldi and Drago [25] employed the K-Means algorithm to explain poverty in Italy through an in-depth study of the income-deprivation score relationship. Their research found that poverty analysis is strengthened by examining the relationship between income and deprivation score using the multidimensional poverty indicators. In addition, Chamboko and Re [17] mapped multiple deprivation patterns for 13 areas in Namibia using a GIS application and the K-Means algorithm for clustering. To build scores and thus reduce the number of deprivation dimensions, they applied Principal Component Analysis (PCA). Their study looked at the relationship between deprivation and demographic characteristics based on the clusters produced.
Another study relevant to poverty using machine learning was done by Santoso and Irawan [26] using k-nearest neighbor (k-NN) and learning vector quantization (LVQ). In their research, k-NN produced higher accuracy than LVQ. Similarly, Sano and Nindito [27] from Indonesia used the K-Means algorithm for clustering poverty. Njuguna and McSharry [28] constructed spatiotemporal poverty indices from mobile telephone activity as an alternative way to classify poverty using linear regression. Based on the research conducted thus far, there is a clear opportunity to apply machine learning to classify multidimensional poverty in the Malaysian context. The capability of machine learning to deal with large amounts of data and reveal patterns in them may contribute to a higher accuracy of a poverty prediction model [29,30].
In summary, from the studies above, it can be concluded that there is a need for a comprehensive study on the measurement of multidimensional poverty to improve the current national MPI. Therefore, in this work, we have identified an opportunity to develop a clustering model that can identify multidimensional poverty indicators and dimensions for the B40 group in Malaysia. After considering a number of well-known clustering algorithms, the K-Means algorithm is adopted in this study. The contributions of this paper are summarized below:
• Proposed B40 clustering-based K-Means architecture to identify the right indicators and dimensions that yield more precise MPI measurement.
• Extensive clustering analysis identified seven indicators of multidimensional poverty among B40 group. Out of the seven, six indicators (i.e. literacy, highest education level and grade, housing, access to television services, assets, and work) from three dimensions (i.e. education, living standard and employment) are proposed to be added to the current national MPI.
• Employment is identified as an additional dimension for the consideration of policymakers towards MPI establishment.
• The relevant indicators and dimensions are required and can guide the government in formulating an MPI to ensure the needs of B40 group are adequately addressed.
• Outcomes from this study help government to efficiently identify B40 group, which otherwise could be misclassified.

Research methodology
The overall architecture of the proposed method for identifying key indicators of multidimensional poverty among B40 group is depicted in Fig 1.

PLOS ONE
carried out in order to produce very useful data for planning and implementation of national development. The data collected will provide a comprehensive set of information on population, various demographic, social and economic features. Furthermore, the census data provides information on the total stock of residence, basic amenities and housing requirements available.
The raw dataset goes through a data pre-processing phase before the clustering phase takes place. In the clustering phase, the K-Means algorithm was tested with four different distance measures, Euclidean Distance (ED), Correlation Similarity (CrS), Cosine Similarity (CS) and Dice Similarity (DS), to choose the best one. Then, experiments were conducted and evaluated for k values from 2 up to 15 in order to determine the best k. Finally, a series of analyses was performed, looking at the cluster size, centroid chart, scatter plots, heat map and descriptive statistics, to further investigate the pattern of each cluster formed. The data preprocessing and experiments were conducted using RapidMiner Studio.

Data preprocessing
Data preprocessing alters the raw data so that it is consistent and satisfies the requirements of the clustering process. In this phase, six pre-processing activities are involved, as depicted in Fig 1, namely data integration, attribute generation, data filtering, data cleaning, data transformation, and attribute selection. At the beginning of this process, data integration was carried out, in which three source files (Person, Household and Living Quarters) were joined into a single dataset. Tables 2-4 show 40 attributes from the person source file, 39 attributes from the household source file and 17 attributes from the living quarters source file. From a total of 96 attributes, repeated attributes were removed, leaving 84 attributes. Afterwards, two attributes were generated based on occupation: salary and total household income. These attributes were mapped from the Salaries & Wages Survey Report, Malaysia [32]. Then, the dataset was filtered to remove records with occupation in the categories 'unknown', 'unknown labor force status' and 'unclassified'. Non-B40 households and non-citizens were also filtered out of this study. Subsequently, data cleaning was done to fill in the missing values before the data transformation process took place. Upon examination, there were 2,097 missing values in two attributes, namely Country of Birth and Coding state/Country. The missing values for Country of Birth were replaced with the value '99', which is 'Malaysian Citizen', while for the State/Country Code attribute the missing values were replaced with the corresponding values of the State attribute. The operator called "Replace Missing Value" was used to replace every missing value with the specified values. In data transformation, a nominal attribute called age group was transformed into a numeric attribute, as numeric values are required for distance calculation in the clustering process. This was performed by an operator called 'Nominal to Numerical' using the unique integer coding type in RapidMiner. In addition, normalization was performed using the Z-transformation method. Normalization helps ensure that the distance measure gives equal weight to each variable.
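The three transformation steps described above can be sketched in plain Python. This is a minimal illustration, not the RapidMiner operators themselves: the function names are hypothetical, the integer codes are assigned in order of first appearance, and the population standard deviation is assumed for the Z-transformation.

```python
from statistics import mean, pstdev


def fill_missing(values, replacement):
    """Replace missing entries (None) with one specified value."""
    return [replacement if v is None else v for v in values]


def integer_code(values):
    """Map each distinct nominal value to a unique integer.

    Codes follow the order of first appearance (an assumption).
    """
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def z_transform(values):
    """Rescale values to mean 0 and standard deviation 1.

    The population standard deviation is assumed here.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical toy values, not taken from the census data.
country_of_birth = fill_missing([1, None, 3], 99)
age_group_codes = integer_code(["0-14", "15-64", "0-14", "65+"])
age_group_scaled = z_transform(age_group_codes)
```

After the last step every attribute has mean 0, which is why centroid values can later be read as deviations from the overall average.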
There are four steps involved in attribute selection. First, we deleted useless attributes by using an operator called "Remove Useless Attributes", which identifies attributes containing the same value for all records. Second, we used "Remove Correlated Attributes", which detects pairs of attributes that are strongly related to each other based on a specified correlation threshold. Third, we removed the non-significant attributes, which are the ID-like attributes. Fourth, we applied feature selection. Feature selection methods can be classified into two major groups, supervised and unsupervised. Supervised feature selection methods choose features based on their association with the class label, selecting features with strong relevance to it. Unsupervised feature selection methods, on the other hand, evaluate feature relevance by exploring the data structure with unsupervised learning techniques. In this study, an operator called 'Unsupervised Feature Selection' was used to select important attributes from a total of 65 attributes. The Unsupervised Feature Selection technique uses the K-Means algorithm to find the most important features. Table 5 provides a list of the 23 selected attributes after the selection process.
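The first two selection steps can be illustrated with a short sketch. It assumes attributes are stored as named columns of numbers and uses Pearson correlation; the threshold of 0.95 and the function names are hypothetical, since the text does not state the value configured in RapidMiner.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation
    return sxy / (sxx * syy) ** 0.5


def remove_useless(columns):
    """Drop attributes whose value is identical in every record."""
    return {name: col for name, col in columns.items() if len(set(col)) > 1}


def remove_correlated(columns, threshold=0.95):
    """Keep an attribute only if it is not strongly correlated
    with any attribute that has already been kept."""
    kept = {}
    for name, col in columns.items():
        if all(abs(pearson(col, other)) < threshold for other in kept.values()):
            kept[name] = col
    return kept


# Hypothetical toy columns: "b" duplicates "a" up to scale, "c" is constant.
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [5, 5, 5, 5], "d": [1, 0, 1, 0]}
selected = remove_correlated(remove_useless(columns))
```

Note that the result of the correlation filter depends on the order in which attributes are visited, because the first attribute of a correlated pair is the one retained.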

K-means algorithm
The K-Means algorithm is one of the most popular and widely used clustering algorithms. It is a clustering method in which n objects o_1, ..., o_n are clustered into k clusters C_1, ..., C_k. The initial grouping is refined iteratively by assigning each object to the nearest centroid point and recalculating the centroid points until no further changes occur. The purpose of the optimization criterion in the clustering process is to minimize the sum of variances (Sum of Squared Errors) E between the objects in each cluster and the centroid points cen_1, ..., cen_k, as in Eq (1).
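The iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the RapidMiner implementation: the first k points are used as initial centroids, squared Euclidean distance is the default (the distance argument can be swapped for another measure), and the function names are hypothetical.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, distance=squared_euclidean, max_iter=100):
    """Assign points to the nearest centroid and recompute centroids
    until the assignment no longer changes."""
    centroids = [list(p) for p in points[:k]]  # naive initialization (assumption)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared errors E between objects and their cluster centroids."""
    return sum(squared_euclidean(p, centroids[label]) for p, label in zip(points, labels))


# Hypothetical toy data with two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = k_means(points, k=2)
error = sse(points, labels, centroids)
```

Because the result depends on the initial centroids, practical tools typically repeat the procedure from several random starts and keep the run with the lowest E.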
In the K-Means algorithm, the distance is calculated between each data point and each centroid. Each data point is assigned to the centroid at minimum distance. Thus, distance plays an important role in the clustering process. The distance between two points can be calculated using several techniques. Four distance measures are compared in this study, namely Euclidean Distance, Correlation Similarity, Cosine Similarity and Dice Similarity. The Euclidean distance between two points is calculated based on Eq (2), where k is the number of dimensions and a_j and b_j are the j-th components of the vectors a = (a_1, a_2, ..., a_k) and b = (b_1, b_2, ..., b_k). The dimensions used need to be transformed to the same scale, which is also known as normalization [33].
Correlation Similarity is calculated as the correlation between two attribute vectors. Given an m × n data matrix X whose rows are m (1 × n) row vectors x_1, x_2, ..., x_m, the correlation distance between the vectors x_δ and x_t is defined as Eq (3) [31].
Cosine similarity is measured based on the cosine of the angle between two attribute vectors. Given an m × n data matrix X whose rows are m (1 × n) row vectors x_1, x_2, ..., x_m, the cosine distance between the vectors x_δ and x_t is defined as Eq (4) [33].
The Dice similarity used in this study is the Dice similarity for numerical values in the input set. For this measure, y(i,j) is the value of the j-th attribute of the i-th instance. Hence y(1,3) - y(2,3) is the difference between the values of the third attribute of the first and second instances. The similarity is calculated using Eq (5), where Y_1Y_2 is the sum over products of values, that is, the sum over j of y(1,j) × y(2,j); Y_1 is the sum over values of the first instance, that is, the sum over j of y(1,j); and Y_2 is the sum over values of the second instance, that is, the sum over j of y(2,j). This type of similarity measure is offered in RapidMiner for the K-Means clustering algorithm [34].
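For orientation, the four measures can be written as small functions. This is a hedged sketch: the function names are hypothetical, the Euclidean, cosine and correlation forms are the textbook definitions, and the Dice form 2·Y_1Y_2 / (Y_1 + Y_2) is an assumption built from the quantities Y_1Y_2, Y_1 and Y_2 defined in the text.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (undefined for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def correlation_similarity(a, b):
    """Pearson correlation, i.e. cosine similarity of mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def dice_similarity(a, b):
    """Assumed numerical Dice form: 2 * Y1Y2 / (Y1 + Y2).

    Undefined when the two value sums add up to zero.
    """
    y1y2 = sum(x * y for x, y in zip(a, b))
    return 2 * y1y2 / (sum(a) + sum(b))


# Hypothetical toy vectors.
d = euclidean_distance((0, 0), (3, 4))
c = cosine_similarity((1, 0), (0, 1))
r = correlation_similarity((1, 2, 3), (2, 4, 6))
s = dice_similarity((1, 0, 1), (1, 1, 0))
```

A similarity can be turned into a distance for K-Means by, for example, using one minus the similarity; which conversion RapidMiner applies is not stated in the text.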
The evaluation of clustering results, also called cluster validation, is a process in which the accuracy or quality of the clustering results is measured. The two main approaches for measuring the quality of cluster results are internal and external validation. External validation is based on comparing the cluster results with data that were not used in the clustering process, namely the class labels. The cluster results are considered good if they agree with those labels. Some of the measures used in external validation are the Jaccard Index, Rand Index and F-measure [35]. In contrast, internal validation gives a good score to algorithms that produce high similarity within a cluster and low similarity between clusters. The Davies Bouldin Index [36], Dunn Index [37] and Silhouette Index [38] are popular internal validation measures. Some newer clustering validation indices have also been proposed, such as the clustering validation index based on nearest neighbors (CVVN index) [39], Local Cores-based Cluster Validity (LCCV index) [40] and the Absolute Cluster Validity index [41]. For this study, three internal validation measures were implemented: Davies Bouldin, Average within Centroid Distance and Sum of Squares.
Davies Bouldin. The Davies Bouldin (DB) metric relates the variation between points within a cluster (intra-cluster) to the distance between clusters (inter-cluster). For each cluster, the metric finds the other cluster that gives the highest ratio of the combined average intra-cluster distances of the two clusters to the distance between them. These maximum values are then averaged over all clusters. Low values are obtained if the clusters are compact and far apart from each other. This metric can provide clear clues for a good clustering [42]. The metric is defined as Eq (6), where δ(x_i, x_j) is the distance between clusters x_i and x_j, Δx_i and Δx_j represent the distances between the points within clusters x_i and x_j and the respective cluster centroids, and c is the number of clusters in partition U.
Average within centroid distance. The Average within Centroid Distance (AWCD) metric is measured by calculating the average distance of each point from the centroid point of its cluster. The centroid distance between clusters A and B is the distance between centroid(A) and centroid(B). The average distance (dist) is calculated by averaging pairwise distances between points. In other words, for each point a_i in cluster A, the distances dist(a_i, b_1), dist(a_i, b_2), ..., dist(a_i, b_n) are calculated and then averaged. The more compact a cluster is, the lower the average value. However, as the number of clusters increases, the average distance decreases naturally, which makes this metric difficult to interpret on its own [42].
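The two metrics above can be sketched as follows. This is an illustration under assumptions rather than RapidMiner's implementation: Euclidean distance is used throughout, the within-cluster scatter is taken as the average point-to-centroid distance, the DB index is taken in its usual form (average over clusters of the maximum ratio of summed scatters to centroid distance), and the function names are hypothetical.

```python
import math


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def avg_within_centroid_distance(cluster):
    """Average Euclidean distance of a cluster's points to its centroid."""
    center = centroid(cluster)
    return sum(math.dist(p, center) for p in cluster) / len(cluster)


def davies_bouldin(clusters):
    """Average, over clusters, of the worst-case ratio
    (scatter_i + scatter_j) / distance(centroid_i, centroid_j)."""
    centers = [centroid(c) for c in clusters]
    scatter = [avg_within_centroid_distance(c) for c in clusters]
    total = 0.0
    for i in range(len(clusters)):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(len(clusters))
            if j != i
        )
    return total / len(clusters)


# Hypothetical toy clusters: compact and well separated, so DB is small.
clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
db = davies_bouldin(clusters)
awcd = [avg_within_centroid_distance(c) for c in clusters]
```

In this form the ratio is undefined when two centroids coincide, since the denominator becomes zero.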
Sum of squares. The Sum of Squares (SS) metric takes, for each cluster, the squared distance of every data point from the midpoint of its cluster; these squared values are then summed over all clusters. This evaluation metric shows that a good clustering can change according to the starting parameters used to form the clusters. If the value decreases slowly with increasing numbers of clusters, it indicates that there is a large stable cluster that is still intact. Eq (7) shows the calculation of the SS evaluation metric [43], where S_i represents the set of clusters (S_1, ..., S_k) with midpoints (μ_1, ..., μ_k), k represents the number of clusters and x represents a data point in the data set.
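The symbols defined above match the usual within-cluster sum of squares. As a hedged reconstruction, assuming Eq (7) takes this standard form (the label SS on the left-hand side is introduced here for readability):

```latex
SS = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2}
```

Under this reading, SS coincides with the K-Means objective E when Euclidean distance is used, which is why it tends to decrease as k grows.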

Determining the best distance
A series of experiments was run with k values ranging from 2 to 15 and four different distance measures, namely Euclidean Distance (ED), Correlation Similarity (CrS), Cosine Similarity (CS) and Dice Similarity (DS). Performance was measured based on the DB, AWCD and SS evaluation metrics, where low values represent good clusters for a particular distance measure. Table 6 shows the clustering performance based on the three evaluation metrics (i.e., DB, AWCD and SS) for all k values from 2 to 15 using the four distance techniques (i.e., ED, CrS, CS and DS). The average DB values for the ED, CrS, CS and DS techniques were 1.78, 2.20, 2.19 and 5.91, respectively. As shown in Table 6, the DB metric recorded four infinity values when using the CrS technique, at k = 2, 6, 11 and 14. The DS technique recorded ten infinity values, at k = 3, 5, 7, 9, 10, 11, 12, 13, 14 and 15. This indicates that poor clustering quality is produced when using the CrS and DS techniques based on the DB metric. To select the best distance technique, performance was compared across the DB, AWCD and SS evaluation metrics. Table 7 presents the average clustering performance for each distance measure. The ED technique recorded the best performance based on the lowest DB and AWCD values. Meanwhile, CS is the best distance technique based on the SS value. CrS shows moderate performance, and DS reveals poor clustering performance. The performance results recorded in Table 7 are ranked from 1 to 4 for each evaluation metric to select the best distance measure. Table 8 shows the list of ranks for each distance measure based on the average values of the DB, AWCD and SS metrics, with the average value of each evaluation metric recorded for each distance technique.
From these values, the rank of each distance technique was noted in order to compare performance. Because there are four distance techniques, the rank values run from 1 to 4. For example, ED has the lowest average DB value and was therefore ranked number 1, while the DS technique, with the highest average DB value, has rank number 4. The final two columns on the right of Table 8 give the mean rank over all evaluation metrics and the resulting rank position for each distance measure. Overall, the resulting ranking of the four distance measures is:
Cosine Similarity > Euclidean Distance > Correlation Similarity > Dice Similarity
Cosine Similarity is therefore the best distance technique, based on the lowest mean rank obtained.
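The ranking step can be sketched in a few lines. The average DB values are the ones reported in the text; the function names are hypothetical, ties are not handled, and the AWCD and SS averages from Table 7 are not reproduced here, so only the DB ranking is shown.

```python
def rank_ascending(scores):
    """Rank techniques so that the lowest score gets rank 1 (ties not handled)."""
    ordered = sorted(scores, key=scores.get)
    return {name: position + 1 for position, name in enumerate(ordered)}


def mean_rank(rank_tables):
    """Average each technique's rank over several per-metric rank tables."""
    names = rank_tables[0].keys()
    return {n: sum(t[n] for t in rank_tables) / len(rank_tables) for n in names}


# Average DB values reported in the text for the four distance measures.
db_average = {"ED": 1.78, "CrS": 2.20, "CS": 2.19, "DS": 5.91}
db_rank = rank_ascending(db_average)
print(db_rank)
```

Passing the DB, AWCD and SS rank tables together to `mean_rank` would give the mean-rank column that determines the final ordering.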

Determining the best k value
The K-Means algorithm is a simple clustering algorithm. However, it requires the parameter k as an input to the clustering process, and k is an important parameter in determining the quality of the clusters. Therefore, this study determines the best k value for the clustering model. The performance graph for the model is plotted based on the Cosine Similarity measure. Fig 2 shows the performance of the clustering model from k = 2 up to k = 15. For the Davies Bouldin (DB) measure, a low value indicates that the clusters are tight and well separated. Based on the DB measure, the lowest value is recorded at k = 15. Based on the Average within Centroid Distance (AWCD) plot, the AWCD values flatten at k = 8. This indicates that increasing the number of clusters further does not significantly affect the quality of the clusters [42]. Based on the Sum of Squares (SS) measure, the SS value drops dramatically until k = 8 before it begins to flatten. Therefore, based on the DB measure and taking into account the AWCD and SS measures, it can be concluded that k = 8, with DB = 2.157, is the best k value for this model.
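The "flattening" criterion read from the plots can be approximated numerically. This is only an illustrative sketch: the study reads the elbow visually from Fig 2, the 5% tolerance and the function name are hypothetical, and the values below are made up rather than taken from the figure.

```python
def elbow_k(values_by_k, tolerance=0.05):
    """Return the first k after which the relative drop in the metric
    to the next k falls below the tolerance."""
    ks = sorted(values_by_k)
    for k, next_k in zip(ks, ks[1:]):
        drop = (values_by_k[k] - values_by_k[next_k]) / values_by_k[k]
        if drop < tolerance:
            return k
    return ks[-1]


# Made-up metric values that fall steeply and then flatten after k = 4.
toy_ss = {2: 100.0, 3: 60.0, 4: 40.0, 5: 39.0, 6: 38.5}
best_k = elbow_k(toy_ss)
```

A rule of this kind is sensitive to the chosen tolerance, so it is best treated as a cross-check on the visual reading rather than a replacement for it.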

Clustering analysis
The analysis and interpretation of cluster results is one of the most important activities in clustering. Each cluster needs to be explored and analyzed to identify its characteristics and differences. In this study, the analysis and interpretation of each cluster determines the indicators and dimensions for multidimensional poverty among the B40 group. The analysis of each cluster was done by looking at the cluster size, centroid chart, scatter plots, heat map and descriptive statistics.
Cluster size analysis. As shown in Table 9, eight clusters are derived from the clustering model. Clusters 0 and 2 are the largest, comprising 16% each. Both clusters also had the lowest average centroid distances, indicating more compact clusters. The smallest cluster is Cluster 3, making up 9% of the total. The Average within Centroid Distance (AWCD) was lowest for Cluster 2, at 9.064, which indicates that Cluster 2 is the most compact of the clusters.
Centroid chart analysis. The Centroid Chart, shown in Fig 3, is a graphical representation of the centroid values in a parallel chart. It represents the mean value of the centroid point for each attribute in each cluster. The centroid values are normalized; therefore, the mean value for each attribute is equal to 0. A centroid value that lies far above or below the mean can easily be noticed in this chart and indicates a distinguishing characteristic of the respective cluster. For instance, for Cluster 7, the centroid value for the personal computer attribute is 2.27, which is far above the mean value and is the highest value compared with the other clusters. Thus, 'personal computer' is one of the most important characteristics of Cluster 7. Nevertheless, this form of analysis offers limited insight, and the indicators and dimensions for multidimensional poverty cannot be specified at this point. Therefore, we proceed to the next analysis, the scatter plot analysis.
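Reading the centroid chart amounts to looking for attributes whose normalized centroid value lies far from 0. A minimal sketch of that idea follows; the cut-off of 1.0, the function name and all values except the 2.27 reported for personal computer in Cluster 7 are hypothetical.

```python
def distinguishing_attributes(centroid, threshold=1.0):
    """Return attribute names whose normalized centroid value is at least
    `threshold` away from the overall mean of 0, largest deviation first."""
    flagged = [name for name, value in centroid.items() if abs(value) >= threshold]
    return sorted(flagged, key=lambda name: -abs(centroid[name]))


# Only the personal computer value comes from the text; the rest are made up.
cluster_7 = {"personal_computer": 2.27, "water_filter": 0.10, "toilet_facility": -1.20}
flagged = distinguishing_attributes(cluster_7)
```

Both strongly positive and strongly negative values are flagged, since an attribute far below the mean characterizes a cluster just as much as one far above it.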
Scatter plot analysis. Scatter plots are another way of analyzing cluster characteristics graphically. They are very useful for visually positioning a cluster based on two of its key attributes, and they indicate the relationship or correlation between these two attributes. Based on the scatter plot analysis, 12 attributes have been selected as key indicators in defining each cluster, as depicted in Fig 4A-4H.
As seen in Fig 4(A), Cluster 0 shows a relationship between the toilet facilities and construction material of outer walls attributes. This group of people probably experienced a low living standard. Cluster 1 is a big and compact cluster that shows a strong correlation between the highest certificate and occupation attributes, as shown in Fig 4(B). Based on the plot, most of the individuals in this cluster are not working and do not have any certification. The scatter plot for Cluster 2, shown in Fig 4(C), depicts the correlation between the paid TV channel and water filter attributes. Fig 4(D) shows the plot for Cluster 3, which reflects the population within the B40 group who are able and unable to read and write. Meanwhile, for Cluster 4, as shown in Fig 4(E), a similar proportion can be seen between people of the B40 group who owned a washing machine and those who owned a water filter. The plot for Cluster 5 presents a strong correlation between the occupation attribute and the ability to read and write, as shown in Fig 4(F). Cluster 6 reveals that the majority of people in this group are not working, based on the 'reason for not seeking work' attribute, as shown in Fig 4(G). However, the majority of people in this group are able to read and write. They might be the children or spouse of the head of the household. Lastly, Cluster 7 shows that the majority of B40 individuals in this group owned a personal computer, and some of them owned an iPod/PDA, as shown in Fig 4(H). This pattern indicates a good standard of living among people in the cluster.
Heat map analysis. As compared to the scatter plot, the heat map analysis is able to reveal more than two important attributes for each cluster, whereby these attributes have a strong correlation. Heat map analysis is ideal for large-scale data visualization. The color scale shows the importance of the attributes where light green indicates an attribute with a high centroid value, and pink indicates an attribute with a low centroid value. From a total of 23 attributes, 15 attributes have been selected from heat map analysis and labelled as important attributes in forming the clusters. These are strata, birthplace, read and write, highest education, highest certificate, toilet facility, construction material of outer walls, paid TV channel, water filter, refrigerator, washing machine, occupation, reason for not seeking work, personal computer and iPod/PDA, as shown in Fig 5. There are three other extra attributes as compared to the scatter plot analysis which are highest education, refrigerator and strata. Each of these 15 attributes will be further analyzed in the next analysis called Descriptive Statistics Methods to identify the multidimensional indicators and dimensions in the context of B40 group in this study.
Descriptive statistics method. Based on the most important attributes identified in the previous analysis, descriptive statistics are employed to further understand the data within each cluster and to identify the most relevant indicators and dimensions for multidimensional poverty. Descriptive statistics give an overview or summary of data through numerical calculations, graphs or tables [44]. Descriptive statistics on cluster results can provide a detailed picture of how similar the attributes are within a cluster [45].
Indicators and dimensions are the two most important components of the MPI in defining poverty. Indicators should capture the deprivation experienced, while dimensions are groupings of indicators [46,47]. There are many methods for selecting MPI indicators and dimensions. The most relevant MPI indicators and dimensions for the B40 group are specified in this analysis. The naming and grouping of the indicators identified in this analysis follow the guideline discussed in [48]. To obtain a meaningful interpretation, the values of each attribute discussed above need to be denormalized so that the actual values can be seen. Table 10 provides descriptive statistics for the B40 group clustering model. The grey columns indicate the distinguishing characteristics of each cluster based on the statistics obtained.

PLOS ONE
Attributes analysis in defining multidimensional poverty indicators and dimensions. Reading and writing is a basic literacy skill, and this attribute indicates whether a person is able to read and write. According to Table 10, this attribute is a distinct characteristic of Clusters 3 and 5, in which 45% and 92% of individuals, respectively, were not able to read and write. Highest education refers to the highest level of education attained by a person, which includes pre-primary, primary, secondary, pre-university and tertiary. Table 10 reveals that this attribute is an important variable in distinguishing Clusters 3 and 5, in which 61% and 99% of people, respectively, had no education. The 'not applicable' classification in Cluster 3 refers to individuals who are too young or who never attended school. A similar pattern can be seen for the highest certificate attribute, with one additional cluster, Cluster 1. The highest education attribute is also concentrated in Cluster 1, where 100% of people attained only pre-school or primary education, which clearly indicates that this cluster consists of minors. Education is one of the dimensions of the global MPI and is closely related to poverty. Thus, this study proposes a literacy indicator, measured by the read and write attribute, and a highest education level and grade indicator, measured by the highest education and highest certificate attributes. These two indicators are grouped under the education dimension to measure the education level among the B40 group.
The strata attribute refers to a person's living environment, urban or rural. Table 10 shows that people from rural areas dominate Cluster 0, while Clusters 1, 3, 4 and 5 each have more than 30% of individuals from rural areas. This attribute is observed to correlate with the toilet facility and construction material of outer walls attributes: a greater percentage of people in these clusters use a pour-flush toilet and live in a house made of plank or a combination of brick and plank. This suggests that people living in rural areas have a lower living standard than urban people. Although strata is one of the important variables in cluster formation, it is considered a demographic variable. Another demographic attribute found is the birthplace attribute. Thus, neither attribute is selected as a multidimensional poverty indicator.
Five types of toilet facility are listed in Malaysia, namely flush system, pour-flush, pit, enclosed space over water and none. The Malaysian MPI uses this attribute as one of the indicators to measure poverty, with households without a flush system as the cut-off for deprivation. The global MPI, however, uses a different term, sanitation, with a different cut-off measure. Table 10 reveals that Cluster 0 is the most deprived in terms of toilet facility, with 71% using a pour-flush toilet, followed by four other clusters: Clusters 1, 3, 4 and 5. Therefore, the toilet facility attribute is selected as the measure attribute for the sanitation indicator in this study. The construction material of outer walls is another important attribute derived from the B40 clustering model. As seen in Table 10, there are five clusters in which fewer than 70% of people lived in houses made of brick. This attribute is one of the items defined by the global MPI under the housing indicator. Thus, the housing indicator is suggested in this study, with construction material of outer walls as the measure attribute.
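The Malaysian MPI cut-off for the toilet facility attribute described above (a household without a flush system counts as deprived) can be expressed as a simple rule. The sketch below is illustrative only: the function names and the sample list of households are hypothetical, and the category strings follow the five toilet types named in the text rather than any actual census coding.

```python
# The five toilet facility types listed in the text
TOILET_TYPES = ("flush system", "pour-flush", "pit", "enclosed space over water", "none")


def sanitation_deprived(toilet_type):
    """Apply the Malaysian MPI cut-off: no flush system means deprived."""
    if toilet_type not in TOILET_TYPES:
        raise ValueError(f"unknown toilet type: {toilet_type!r}")
    return toilet_type != "flush system"


def deprivation_rate(toilet_types):
    """Percentage of households that are deprived under the cut-off."""
    flags = [sanitation_deprived(t) for t in toilet_types]
    return 100.0 * sum(flags) / len(flags)


# Hypothetical households in one cluster
households = ["pour-flush", "pour-flush", "flush system", "pit"]
rate = deprivation_rate(households)
print(f"{rate:.0f}% deprived in sanitation")
```

A different cut-off, such as the one used by the global MPI, would only change the condition inside `sanitation_deprived`.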
The paid TV channel attribute has been identified as a distinct characteristic of Clusters 2 and 7. A higher percentage of subscribers was observed in these two clusters: 84% of Cluster 2 and 65% of Cluster 7 could afford to subscribe to the service, while most people in Clusters 4 and 6 could not. This indicates a good standard of living in Clusters 2 and 7. Overall, only 36% of the individuals in the dataset have access to this service. This attribute is suggested as the measure attribute for a new indicator called access to television services.
As can be seen in Table 10, the water filter attribute is related to the paid TV channel attribute: people who are able to subscribe to the television service are also able to own a water filter. A total of 83% of the individuals in the dataset do not own any water purification system at home. Safe drinking water is critical for public health, and a water purification system can help to produce safe drinking water, especially in rural areas that do not receive a treated water supply. Hence, this attribute is also selected as a measure attribute. Refrigerators and washing machines are two of the most common home appliances. The statistics, however, indicate that the majority of people in Cluster 4 live without these two appliances. Therefore, these three home appliances, water filter, refrigerator and washing machine, are chosen as measure attributes for the assets indicator.
The occupation attribute refers to the major groups of occupation in Malaysia based on the International Standard Classification of Occupations (ISCO-08). Occupation is the main source of income for most households in Malaysia. A total of 45% of the individuals in the dataset were categorized as outside the labor force, which means that they were unemployed; 22% were children of the head of the household aged under ten years; and the rest were employed in various types of occupation. Table 10 indicates that Clusters 1 and 5 consist of children of the head of the household, and that Cluster 6 has the highest unemployment percentage. The 'reason for not seeking work' attribute explains the unemployment percentage in the occupation attribute, and hence both attributes are selected to measure multidimensional poverty under the work indicator.
Personal computer and iPod/PDA are other assets of B40 people, and both are distinguishing features of Cluster 7. Table 10 shows that 98% of people in Cluster 7 owned a personal computer, and 9% owned an iPod/PDA. This indicates that this cluster has a relatively good standard of living, given its members' ability to own technology assets. Considering the importance of technology as a key growth engine for an emerging and developing country like Malaysia, these two attributes are selected as measure attributes under the assets indicator.
The analysis above results in a new multidimensional poverty measure for the B40 group comprising three dimensions, Education, Living Standards and Employment, which are broken down into seven indicators, namely literacy, highest education level and grade, sanitation, housing, access to television services, assets and work, with 13 measure attributes, namely Read and Write, Highest Education, Highest Certificate, Toilet Facility, Construction Material of Outer Walls, Paid TV Channel, Water Filter, Refrigerator, Washing Machine, Personal Computer, iPod/PDA, Occupation and Reason for Not Seeking Work, as presented in Table 11. Table 12 provides a comparison between the global MPI, the Malaysian MPI and the MPI discovered in this study.
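The structure summarized above can be written down as a nested mapping from dimensions to indicators to measure attributes, which also makes the counts of three, seven and 13 easy to check. This is a sketch only: the variable name `PROPOSED_MEASURE` is hypothetical, and the placement of sanitation, housing, access to television services and assets under Living Standards, and of work under Employment, is inferred from the discussion in this section; Table 11 remains the authoritative grouping.

```python
# Dimension -> indicator -> measure attributes (grouping inferred from the text)
PROPOSED_MEASURE = {
    "Education": {
        "Literacy": ["Read and Write"],
        "Highest education level and grade": ["Highest Education", "Highest Certificate"],
    },
    "Living Standards": {
        "Sanitation": ["Toilet Facility"],
        "Housing": ["Construction Material of Outer Walls"],
        "Access to television services": ["Paid TV Channel"],
        "Assets": [
            "Water Filter",
            "Refrigerator",
            "Washing Machine",
            "Personal Computer",
            "iPod/PDA",
        ],
    },
    "Employment": {
        "Work": ["Occupation", "Reason for Not Seeking Work"],
    },
}

n_dimensions = len(PROPOSED_MEASURE)
n_indicators = sum(len(indicators) for indicators in PROPOSED_MEASURE.values())
n_attributes = sum(
    len(attributes)
    for indicators in PROPOSED_MEASURE.values()
    for attributes in indicators.values()
)
print(n_dimensions, n_indicators, n_attributes)
```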
Malaysian citizens are categorized into three income groups: the Top 20 Percent (T20), Middle 40 Percent (M40) and Bottom 40 Percent (B40). The B40 group is further divided into three subgroups: lower-middle income, low-income and poor. The success of the B40 clustering model in identifying new important indicators for the B40 group can help to improve the ability of the present MPI to detect the poor group. It can also help to enhance the poverty measurement that is based solely on income, namely the PLI. This can be seen by comparing the poverty measurement of the PLI method with that of the new MPI calculated using this data set. Table 13 shows the distribution of the B40 group across the sub-categories under the PLI: 14% are poor, 50% are low-income and 36% are lower-middle income.
Eight sub-categories of B40 were discovered using the new MPI. According to verification by poverty experts, Cluster 3 exhibits the features of poverty that characterize the poor group: it has the smallest cluster size, 9% of the population, as shown in Table 9; the lowest average income, as shown in Table 13; and the characteristics of poor people. As a result, Cluster 3 represents the poor in this data set, as described in Table 14.
This comparison shows that the new MPI method identifies 9% of the B40 population in this data set as poor, compared with 14% under the PLI approach. Although the MPI identifies fewer poor people than the PLI, it considers a variety of characteristics simultaneously that lead to a person being classified as poor. As a result, any government assistance programme aimed at lifting these people out of poverty can be targeted more precisely.

Conclusion
One of the focus areas in the Eleventh Malaysia Plan (11MP) is to elevate the B40 household group towards the middle-income society. Based on recent studies by the World Bank, Malaysia is expected to become a high-income nation between 2024 and 2028. Thus, it is essential to characterize the B40 population through data-driven analytics so that the government can develop a comprehensive action plan. Data analytics concerns the extraction of meaning, patterns and trends from varied and large volumes of data. Such data sets exist in many areas, and poverty eradication is no exception. Currently, absolute poverty in Malaysia is measured by the Poverty Line Income (PLI). The PLI is a one-dimensional income approach, specifically measuring the gross monthly household income. To lift the B40 group into the middle-income class, a substantial effort to improve the condition of the people in the group is needed. At present, the B40 group is identified by income status, when in reality its members are more vulnerable to deprivations defined by numerous poverty dimensions. Malaysia has also employed the customised MPI technique from OPHI to measure multidimensional poverty. However, the World Bank Group has criticized this adoption because it yields a detection rate of only 0.86%, and has urged that the benchmark, the so-called deprivation cut-off level, be raised. Thus, a clustering model based on the K-Means algorithm with a cosine similarity measure was developed to form clusters of the B40 group, as an alternative machine learning method to identify the most important poverty indicators and their deprivation cut-off levels. The evaluation found k = 8 to be the best k value for the model. A series of clustering analyses was then conducted to identify the indicators and dimensions associated with multidimensional poverty for the B40 community in Malaysia. By employing the descriptive statistics method, three dimensions have been established, Education, Living Standards and Employment, with seven indicators: literacy, highest education level and grade, sanitation, housing, access to television services, assets and work. Out of the seven indicators identified, this study proposes six new multidimensional poverty indicators, namely literacy, highest education level and grade, housing, access to television services, assets (water filter, refrigerator, washing machine, personal computer, iPod/PDA) and work, to be considered by policymakers as a valuable addition to the current MPI to establish a more meaningful picture of the current poverty trend in Malaysia.
Furthermore, this study has found that Cluster 3 of the B40 group has the smallest cluster size, 9% of the population, the lowest average income and the characteristics of poor people, which has been confirmed by poverty specialists. A further in-depth study should be carried out in future to obtain the other important components of the Multidimensional Poverty Index (MPI), which are the deprivation cut-offs and the weights for each of the indicators specified. These components are needed to compute the MPI value as an absolute multidimensional poverty measurement. In addition, the algorithm used in the clustering model is K-Means. Many other algorithms can be studied and tested that may improve the clustering quality. The development of those algorithms could further enhance the attractiveness of the clustering approach to identify the MPI for the Bottom 40 group. Finally, by 2021, a new collection of census data will be published, namely the Population and Housing Census 2020. This latest data could be applied in the future, which could offer the latest trends and more reliable research results.
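For readers who wish to experiment with alternative algorithms, the following is a minimal sketch of K-Means with a cosine similarity measure, in which each point is assigned to the centroid with the highest cosine similarity and each centroid is updated as the normalized mean of its members. It is not the implementation used in this study: the function names, the toy data, the fixed number of iterations and the choice of the first k points as initial centroids are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a, b):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))


def cosine_kmeans(points, k, iterations=20):
    """K-Means on unit vectors using cosine similarity for assignment."""
    unit = [normalize(p) for p in points]
    # Assumption: seed the centroids with the first k points
    centroids = [list(p) for p in unit[:k]]
    labels = [0] * len(unit)
    for _ in range(iterations):
        # Assignment step: most similar centroid for each point
        labels = [
            max(range(k), key=lambda c: cosine_similarity(p, centroids[c]))
            for p in unit
        ]
        # Update step: normalized mean of the members of each cluster
        for c in range(k):
            members = [p for p, label in zip(unit, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels, centroids


# Hypothetical two-attribute records; the study itself used k = 8
points = [[1.0, 0.1], [0.1, 1.0], [2.0, 0.3], [0.2, 3.0]]
labels, centroids = cosine_kmeans(points, k=2)
print(labels)
```

Because the assignment depends only on direction, the third and fourth points join the clusters of the first and second points even though their magnitudes differ.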