A clustering approach to identify multidimensional poverty indicators for the bottom 40 percent group

Mariah Abdul Rahman; Nor Samsiah Sani; Rusnita Hamdan; Zulaiha Ali Othman; Azuraliza Abu Bakar

doi:10.1371/journal.pone.0255312

Abstract

The Multidimensional Poverty Index (MPI) is an income-based poverty index which measures multiple deprivations alongside other relevant factors to determine and classify poverty. The implementation of a reliable MPI is one of the significant efforts by the Malaysian government to improve poverty alleviation measures, in line with the recent policy for the Bottom 40 Percent (B40) group. However, using this measurement, only 0.86% of Malaysians are regarded as multidimensionally poor, and the measurement has been claimed to be irrelevant for Malaysia as a country with rapid economic development. Therefore, this study proposes a B40 clustering architecture based on K-Means with cosine similarity to identify the right indicators and dimensions that will provide a data-driven MPI measurement. To evaluate the approach, this study conducted extensive experiments on the Malaysian Census dataset. A series of data preprocessing steps were implemented, including data integration, attribute generation, data filtering, data cleaning, data transformation and attribute selection. The clustering model produced eight clusters of the B40 group. The study included a comprehensive clustering analysis to meaningfully understand each of the clusters. The analysis discovered seven indicators of multidimensional poverty from three dimensions encompassing education, living standard and employment. Out of the seven indicators, this study proposed six indicators to be added to the current MPI to establish a more meaningful picture of the current poverty trend in Malaysia. The outcomes from this study may help the government properly identify B40 households that suffer from financial burden and may currently be misclassified.

Citation: Abdul Rahman M, Sani NS, Hamdan R, Ali Othman Z, Abu Bakar A (2021) A clustering approach to identify multidimensional poverty indicators for the bottom 40 percent group. PLoS ONE 16(8): e0255312. https://doi.org/10.1371/journal.pone.0255312

Editor: Carlos Alberto Zúniga-González, Universidad Nacional Autonoma de Nicaragua Leon, NICARAGUA

Received: March 11, 2021; Accepted: July 13, 2021; Published: August 2, 2021

Copyright: © 2021 Abdul Rahman et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The dataset in the study is available from the Department of Statistics Malaysia (DOSM) website under Population & Demographic subsection at https://www.dosm.gov.my/v1/index.php?r=column3/accordion&menu_id=amZNeW9vTXRydTFwTXAxSmdDL1J4dz09.

Funding: This research was funded by the Universiti Kebangsaan Malaysia (Grant code: GUP-2019-060). This grant was received by Dr Nor Samsiah Sani. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Malaysia has made significant progress in poverty reduction over the past half century, with tremendous initiatives undertaken by the government since the introduction of the New Economic Policy (NEP) in 1971 [1]. Afterwards, the New Economic Model (NEM) was launched in 2010 with the main objective of making Malaysia a high-income and developed country by 2020. As such, the National Economic Advisory Council (MPEN) suggested that the B40 group, who are less fortunate and need special attention, should be focused on [2]. In this regard, in the 10th Malaysia Plan (10MP) in 2011, the government made various efforts to increase the income of this group [3]. Later, in the 11th Malaysia Plan (11MP), the government continued its intensive efforts to support the development of the B40 group, including addressing issues regarding the cost of living and strengthening the assistance mechanism [4]. Likewise, through the 2019 Budget, which was unveiled in November 2018, the government committed to continuing and improving the Cost of Living Aid to the 2.7 million B40 group by providing more targeted assistance. Health insurance and medical protection were also provided through the National Health Protection Fund, besides the introduction of the Healthcare Protection Scheme [5].

Poverty Line Income (PLI) is a one-dimensional income approach that measures gross monthly household income. The main weakness of such an approach is that it does not give an accurate and complete picture of deprivation and human well-being. The approach also gauges only the minimum requirement for basic needs and living standards; it does not consider households’ preferences and does not reflect social mobility in society. The PLI misrepresents what is available to a household for the purpose of meeting its basic needs. A family’s living conditions are shaped by more than current income, and households may experience different living standards for reasons not explained by their current income data. This can also be regarded as a consumption bias, focusing less on human capability and potential. Generally, Malaysians are classified into three income groups based on household income: the top 20 percent of the Malaysian population (T20), the middle 40 percent (M40) and the bottom 40 percent (B40). Table 1 shows the income classification based on the findings of the 2016 and 2019 Household Income and Basic Amenities Surveys. This study used the 2016 income thresholds. T20 households earned over RM 9,620 per month, M40 households earned between RM 4,360 and RM 9,619 per month, and B40 households earned less than RM 4,360 per month.

Table 1. Income classification for Malaysia.

https://doi.org/10.1371/journal.pone.0255312.t001

At present, 2.78 million households earning a monthly income of less than RM 4,360 are categorized as B40 in Malaysia. Within this figure, three subgroups of B40 are identified: 24.1% are from the lower-middle income category, 15.5% are from the low income category, and 0.4% are categorized as poor [6]. Each subgroup has different characteristics and needs. Thus, in order to improve the well-being of the different B40 subgroups, poverty should be viewed from various dimensions so that its interpretation reflects the actual state of poverty.

In July 2010, the Oxford Poverty and Human Development Initiative (OPHI) and the United Nations Development Programme (UNDP) proposed a new poverty measure. They introduced the Multidimensional Poverty Index (MPI), which complements traditional income-based poverty indices by measuring multiple dimensions and different factors to determine and classify poverty. Based on the global MPI 2018, there are three dimensions, namely health, education and living standards, comprising 10 indicators: nutrition, child mortality, years of schooling, school attendance, cooking fuel, sanitation, drinking water, electricity, housing and assets. Each dimension carries the same weight of one third. The MPI looks at poverty from a broader perspective and captures how poverty can be experienced in many ways at the same time. The multidimensional measures satisfy several useful properties which allow, for instance, poverty targeting and comparisons over time and across countries and regions. Accordingly, Malaysia has also taken steps to develop its own Multidimensional Poverty Index (MPI) model at the national level, as outlined in the Eleventh Malaysia Plan (11MP), following in the footsteps of 100 countries worldwide that have already adopted the methods launched by OPHI in 2010 [7]. The national MPI also complements the PLI by considering aspects other than income.

The Malaysian MPI covers four dimensions (education, health, living standards and income) with 11 indicators: schooling years, school attendance, healthcare access, clean water access, living place conditions, room crowdedness, toilet, garbage collection facility, transportation, basic communication tools and mean monthly household income [4]. However, according to the mid-term review of the 11th Malaysia Plan released in October 2018, the index calculated using the MPI model was 0.0033, while the incidence of multidimensional poverty was 0.86% at the national level for 2016 [6]. According to Dr Kenneth Simler, a Senior Economist of the World Bank Group Global Knowledge and Research Hub in Malaysia, the index is too low for Malaysia, and it was recommended that the benchmark, the so-called deprivation cut-off level, be raised by using both the MPI and PLI models in the future [8]. It is therefore crucial to identify the indicators that are important for the MPI classification, which the government can use for further strategic planning towards poverty elimination. The recognition of these limitations has led us to propose this study, which uses a data analytics approach to identify relevant indicators for multidimensional poverty classification. The proposed study makes use of clustering, a machine learning technique, for poverty classification.

Machine learning methods are among the most commonly used methods for predicting poverty. There are two main groups of machine learning methods, namely supervised and unsupervised learning. In supervised learning, the algorithm learns from training data that contains user-defined labels. The algorithm repeatedly makes predictions on the training data, and learning stops once it has achieved a certain level of performance. A test set is then used to verify the accuracy of the predictions. In contrast, in unsupervised learning the data are unlabeled, and the aim is to reveal unusual structures or patterns without an explicit learning target [9–11]. Many studies have analyzed multidimensional poverty using machine learning methods such as classification and clustering [12–17]. Clustering is a method of grouping data objects based on their similarity to gain an in-depth understanding of the data distribution. In general, there are five key approaches to clustering, namely partitioning, hierarchical, density-based, grid-based and model-based [18].

To date, many studies have been published in the B40 domain. Mohd Zain and Tambi described the B40 group as the urban poor in Malaysia and studied the role of urban poverty in the development of late bloomers in education [19]. Abdullah and Mohammad studied the health and literacy level among B40 and M40 men and the demographic factors related to health literacy [20]. Another group of researchers looked at the causes contributing to the increasing cost of living in this group [21]. Mayan, Mohd Nor and Samat examined the challenges faced in increasing the income of the B40 group [22]. A recent study by Sani classified the B40 group with a predictive model built using machine learning. The researchers compared the performance of three classification algorithms, namely Naïve Bayes, Decision Tree and k-Nearest Neighbor (kNN), and concluded that the Decision Tree model is the best model for classifying the B40 group [9].

In the past few decades, researchers have developed a large number of clustering algorithms, such as partitional, hierarchical and density-based clustering (DBC) methods. These algorithms have been applied in a wide variety of domains, such as image processing, data mining, market segmentation, medical imaging, social networks and poverty analysis. For instance, Ahmad and Ejaz [23] used the Two-Step Cluster Analysis technique. They found that sex ratio, income and education were the crucial contributing factors in the non-poor group, while dependence rate and family size were the crucial contributing factors in the poor group. Apart from that, the Analytic Hierarchy Process (AHP) was applied for poverty classification, while K-Means clustering was used to determine the range values between clusters [24]. Likewise, Coromaldi and Drago [25] employed the K-Means algorithm to explain poverty in Italy through an in-depth study of the relationship between income and deprivation score. Their research found that poverty analysis is strengthened by examining this relationship using the multidimensional poverty indicators. In addition, Chamboko and Re [17] mapped multiple deprivation patterns for 13 areas in Namibia using a GIS application and the K-Means algorithm for clustering. To build scores and thus reduce the number of deprivation dimensions, they applied Principal Component Analysis (PCA). Their study looked at the relationship between deprivation and demographic characteristics based on the clusters produced.

Another study on poverty using machine learning was done by Santoso and Irawan [26] using k-nearest neighbor (k-NN) and learning vector quantization (LVQ). In their research, k-NN produced higher accuracy than LVQ. Similarly, Sano and Nindito [27] from Indonesia used the K-Means algorithm to cluster poverty. Njuguna and McSharry [28] constructed spatiotemporal poverty indices from mobile telephone activity as an alternative way to classify poverty using linear regression. Based on the research conducted thus far, there is a substantial opportunity to identify a machine learning technique to classify multidimensional poverty in the Malaysian context. The capability of machine learning to handle large amounts of data and reveal data patterns may contribute to a higher accuracy of a poverty prediction model [29, 30].

In summary, the studies above show that there is a need for a comprehensive study on the measurement of multidimensional poverty to improve the current national MPI. Therefore, in this work, we identify an opportunity to develop a clustering model that can identify multidimensional poverty indicators and dimensions for the B40 group in Malaysia. After considering a number of well-known clustering algorithms, the K-Means algorithm is adopted in this study. The contributions of this paper are summarized below:

A B40 clustering architecture based on K-Means is proposed to identify the right indicators and dimensions that yield a more precise MPI measurement.
Extensive clustering analysis identified seven indicators of multidimensional poverty among the B40 group. Out of the seven, six indicators (i.e. literacy, highest education level and grade, housing, access to television services, assets, and work) from three dimensions (i.e. education, living standard and employment) are proposed to be added to the current national MPI.
Employment is identified as an additional dimension for the consideration of policymakers towards MPI establishment.
The relevant indicators and dimensions can guide the government in formulating an MPI that ensures the needs of the B40 group are adequately addressed.
Outcomes from this study help the government to efficiently identify the B40 group, which otherwise could be misclassified.

Research methodology

The overall architecture of the proposed method for identifying key indicators of multidimensional poverty among the B40 group is depicted in Fig 1. The workflow comprises three main phases, namely the data preparation phase, the clustering phase and the analysis phase. The data preparation phase starts with analyzing the structured data collected by the Department of Statistics Malaysia (DOSM) in the Malaysian Population and Housing Census 2010, consisting of 532,298 households. The 2010 Population and Housing Census of Malaysia [31] was the fifth decennial census conducted since the formation of Malaysia in 1963. The previous censuses were conducted in 1970, 1980, 1991 and 2000. A census is an enormous statistical project carried out to produce data for the planning and implementation of national development. The data collected provide a comprehensive set of information on the population and its demographic, social and economic features. Furthermore, the census data provide information on the total housing stock, basic amenities and housing requirements.

Fig 1. The workflow for the B40 clustering model.

https://doi.org/10.1371/journal.pone.0255312.g001

The raw dataset goes through the data preprocessing phase before the clustering phase takes place. In the clustering phase, the K-Means algorithm was tested with four different distance measures, namely Euclidean Distance (ED), Correlation Similarity (CrS), Cosine Similarity (CS) and Dice Similarity (DS), to choose the best distance measure. Then, experiments were conducted and evaluated for k values from 2 to 15 in order to determine the best k. Finally, a series of analyses was performed, covering cluster size, centroid chart, scatter plot analysis, heat map analysis and descriptive statistics, to further investigate the pattern of each cluster formed. The data preprocessing and experiments were conducted using RapidMiner Studio.

Data preprocessing

Data preprocessing methods alter the raw data so that it is consistent and satisfies the requirements of the clustering process. In this phase, six preprocessing activities are involved, as depicted in Fig 1, namely data integration, attribute generation, data filtering, data cleaning, data transformation and attribute selection. At the beginning of this process, data integration was carried out, in which three source files (Person, Household and Living Quarters) were joined into a single dataset. Tables 2–4 show the 40 attributes from the person source file, 39 attributes from the household source file and 17 attributes from the living quarters source file. From a total of 96 attributes, repeated attributes were removed, leaving 84 attributes. Afterwards, two attributes were generated based on occupation: salary and total household income. These attributes were mapped to the Salaries & Wages Survey Report, Malaysia [32]. Then, the dataset was filtered to remove records with an unknown occupation category, an unknown labor force status or an unclassified occupation. Non-B40 and non-citizen records were also filtered out of this study. Subsequently, data cleaning was done to fill in the missing values before the data transformation process took place. Upon examination, there were 2,097 missing values in two attributes, namely Country of Birth and State/Country Code. The missing values for Country of Birth were replaced with the value ‘99’, which denotes ‘Malaysian Citizen’, while for the State/Country Code attribute, the missing values were replaced with the corresponding values of the State attribute. The operator called “Replace Missing Value” was used to replace every missing value with the specified values. In data transformation, a nominal attribute called age group was transformed into a numeric attribute, as distance calculation in the clustering process requires numeric values. This process was performed by an operator called ‘Nominal to Numerical’ using the unique integer coding type in RapidMiner. In addition, normalization was performed using the Z-transformation method. Normalization ensures that the distance measure gives equal weight to each variable.
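The cleaning and transformation steps described above can be sketched in plain Python. This is only an illustrative analogue of the RapidMiner operators, not their implementation: the function names, the toy values and the use of the population standard deviation in the Z-transformation are assumptions made for this example.

```python
import math

def replace_missing(values, fill):
    """Replace missing entries (None) with a fixed fill value."""
    return [fill if v is None else v for v in values]

def nominal_to_integer(values):
    """Unique integer coding: each distinct label gets an integer,
    assigned in order of first appearance (an assumed ordering)."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values]

def z_transform(values):
    """Rescale to zero mean and unit standard deviation
    (population standard deviation assumed)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical toy column values, not taken from the census data.
country_of_birth = replace_missing([99, None, 14], fill=99)
age_group_codes = nominal_to_integer(["20-24", "25-29", "20-24"])
normalized = z_transform([1.0, 2.0, 3.0])
```

After the Z-transformation every attribute has the same spread, which is why no single attribute dominates the distance calculation.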

Table 2. A set of attributes from person source file.

https://doi.org/10.1371/journal.pone.0255312.t002

Table 3. A set of attributes from household source file.

https://doi.org/10.1371/journal.pone.0255312.t003

Table 4. A set of attributes from living quarters source file.

https://doi.org/10.1371/journal.pone.0255312.t004

There are four steps involved in attribute selection. First, we deleted useless attributes using an operator called “Remove Useless Attributes”, which identifies attributes containing the same value for all records. Second, we used “Remove Correlated Attributes”, which detects pairs of attributes that are strongly related to each other based on a specified correlation threshold. Third, we removed non-significant attributes, namely the id-like attributes. Fourth, we applied feature selection. Feature selection methods can be classified into two major groups, supervised and unsupervised. Supervised feature selection methods choose features based on their association with the class label, selecting those with strong relevance to it. Unsupervised feature selection methods, on the other hand, evaluate feature relevance by exploring the data structure with unsupervised learning techniques. In this study, an operator called ‘Unsupervised Feature Selection’ was used to select important attributes from a total of 65 attributes. This technique uses the K-Means algorithm to find the most important features. Table 5 provides a list of the 23 attributes retained after the selection process.
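The first two attribute selection steps can be illustrated with a short sketch. It is a simplified stand-in for the RapidMiner operators named above: the greedy keep-first strategy, the 0.95 threshold and the toy columns are assumptions for illustration, since the study does not report the threshold it used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(vx * vy)

def remove_useless(columns):
    """Drop attributes that hold the same value for every record."""
    return {name: vals for name, vals in columns.items() if len(set(vals)) > 1}

def remove_correlated(columns, threshold=0.95):
    """Keep an attribute only if its absolute correlation with every
    attribute kept so far stays below the (assumed) threshold."""
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) < threshold for other in kept.values()):
            kept[name] = vals
    return kept

# Hypothetical toy attributes, not taken from the census data.
columns = {
    "state": [1, 1, 1, 1],       # constant, removed as useless
    "rooms": [1, 2, 3, 4],
    "bedrooms": [2, 4, 6, 8],    # perfectly correlated with rooms
    "age_group": [3, 1, 4, 1],
}
selected = remove_correlated(remove_useless(columns))
```

With the toy data, only `rooms` and `age_group` survive both filters.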

Table 5. A set of attributes after unsupervised feature selection.

https://doi.org/10.1371/journal.pone.0255312.t005

K-means algorithm

The K-Means algorithm is one of the most popular and widely used clustering algorithms. It is a clustering method in which n objects o₁,…, o_n are partitioned into k clusters C₁,…, C_k. Starting from an initial grouping, each object is repeatedly assigned to the nearest centroid, and the centroids are recalculated until no further changes occur. The optimization criterion of the clustering process is to minimize the sum of variances (Sum of Squared Errors) E between the objects in each cluster and the centroids cen₁,…, cen_k, as in Eq (1).

$$E = \sum_{i=1}^{k} \sum_{o \in C_i} \operatorname{dist}(o, \mathit{cen}_i)^2 \tag{1}$$
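The assign-and-recompute loop and the Sum of Squared Errors E of Eq (1) can be sketched as follows. This is a minimal illustration rather than the RapidMiner implementation used in the study: it assumes Euclidean distance, takes the first k points as initial centroids (a naive choice made only to keep the example deterministic), and runs on hypothetical toy points.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100):
    """Basic K-Means: returns (assignment, centroids, sse)."""
    # Assumption: the first k points serve as initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no change, so the algorithm has converged
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    # Sum of Squared Errors E, as in Eq (1).
    sse = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, sse

# Hypothetical toy data with two well-separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
assignment, centroids, sse = k_means(points, k=2)
```

In practice the result depends on the initial centroids, so implementations usually repeat the procedure from several starting points and keep the run with the lowest E.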

In the K-Means algorithm, the distance is calculated between each data point and each centroid. Each data point is assigned to the centroid at the minimum distance. Thus, distance plays an important role in the clustering process. The distance between two points can be calculated using several techniques. Four distance measures are compared in this study, namely Euclidean Distance, Correlation Similarity, Cosine Similarity and Dice Similarity. The Euclidean distance between two points is calculated based on Eq (2), where k is the number of dimensions, and a_j and b_j are the j-th components of the vectors a = (a₁, a₂,…, a_k) and b = (b₁, b₂,…, b_k). The dimensions need to be transformed to the same scale, which is known as normalization [33].

$$d(a, b) = \sqrt{\sum_{j=1}^{k} (a_j - b_j)^2} \tag{2}$$

Correlation Similarity is calculated as the correlation between two attribute vectors. Given an m × n data matrix X, treated as m row vectors (1 × n) x₁, x₂,…, x_m, the correlation distance between the vectors x_δ and x_t is defined as Eq (3) [31].

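$$d_{\delta t} = 1 - \frac{(x_\delta - \bar{x}_\delta)(x_t - \bar{x}_t)'}{\sqrt{(x_\delta - \bar{x}_\delta)(x_\delta - \bar{x}_\delta)'}\,\sqrt{(x_t - \bar{x}_t)(x_t - \bar{x}_t)'}} \tag{3}$$

where $\bar{x}_\delta = \frac{1}{n}\sum_{j=1}^{n} x_{\delta j}$ and $\bar{x}_t = \frac{1}{n}\sum_{j=1}^{n} x_{tj}$ are the means of the two row vectors, and the prime denotes the transpose.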

Cosine Similarity is measured based on the cosine of the angle between two attribute vectors. Given an m × n data matrix X, treated as m row vectors (1 × n) x₁, x₂,…, x_m, the cosine distance between the vectors x_δ and x_t is defined as Eq (4) [33].

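$$d_{\delta t} = 1 - \frac{x_\delta x_t'}{\sqrt{(x_\delta x_\delta')(x_t x_t')}} \tag{4}$$

where the prime denotes the transpose.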

The Dice similarity used in this study is the Dice similarity for numerical values. Here, y(i,j) is the value of the j-th attribute of the i-th instance; hence y(1,3) − y(2,3) is the difference between the values of the third attribute of the first and second instances. The similarity is calculated using Eq (5), where Y₁Y₂ is the sum over the products of values, Σ_j y(1,j) · y(2,j), Y₁ is the sum over the values of the first instance, Σ_j y(1,j), and Y₂ is the sum over the values of the second instance, Σ_j y(2,j). This type of similarity measure is offered in RapidMiner for the K-Means clustering algorithm [34].

$$\mathrm{DS} = \frac{2\,Y_1Y_2}{Y_1 + Y_2} \tag{5}$$
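The four measures compared in this study can be written compactly as below. This is a sketch under stated assumptions rather than the RapidMiner code: the function names are illustrative, the correlation and cosine measures are expressed as distances (one minus the similarity), and the Dice formula 2·Y₁Y₂ / (Y₁ + Y₂) is assumed from the definitions of Y₁Y₂, Y₁ and Y₂ given in the text.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared component differences, as in Eq (2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1.0 - dot / norm

def correlation_distance(a, b):
    """Cosine distance of the mean-centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_distance([x - mean_a for x in a], [y - mean_b for y in b])

def dice_similarity(a, b):
    """Numerical Dice similarity, assuming the form 2*Y1Y2 / (Y1 + Y2)."""
    y1y2 = sum(x * y for x, y in zip(a, b))
    return 2.0 * y1y2 / (sum(a) + sum(b))

# Two proportional vectors: same direction, different magnitude.
a, b = (1.0, 2.0, 3.0), (2.0, 4.0, 6.0)
print(euclidean_distance(a, b))    # greater than zero
print(cosine_distance(a, b))       # approximately zero
print(correlation_distance(a, b))  # approximately zero
```

The example highlights a design difference: cosine and correlation measures ignore vector magnitude and compare only direction, whereas Euclidean distance does not.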

The evaluation of clustering results, also called cluster validation, is the process of measuring the accuracy or quality of the clusters obtained. The two main approaches to measuring cluster quality are internal and external validation. External validation compares the cluster results with data that was not used in the clustering process, namely the class labels. The cluster results are considered good if they agree with these labels. Measures used in external validation include the Jaccard Index, Rand Index and F-measure [35]. Internal validation, in contrast, gives a good score to algorithms that produce high similarity within a cluster and low similarity between clusters. The Davies Bouldin Index [36], Dunn Index [37] and Silhouette Index [38] are popular internal validation measures. Some newer clustering validation indices have also been proposed, such as the clustering validation index based on nearest neighbors (CVNN index) [39], the Local Cores-based Cluster Validity (LCCV) index [40] and the Absolute Cluster Validity index [41]. For this study, three internal validation measures were implemented: Davies Bouldin, Average within Centroid Distance and Sum of Squares.

Davies Bouldin.

The Davies Bouldin (DB) metric relates the variation between points within a cluster (intra-cluster) to the distance between clusters (inter-cluster). For each cluster, the metric finds the other cluster that gives the highest ratio of the sum of the two intra-cluster distances to the distance between the two clusters. These maximum values are then averaged over all clusters. Low values are obtained when the clusters are compact and far away from each other. This metric can provide clear clues for a good clustering [42]. It is defined as Eq (6):

$$DB(U) = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \left\{ \frac{\Delta(x_i) + \Delta(x_j)}{\delta(x_i, x_j)} \right\} \tag{6}$$

where δ(x_i, x_j) is the distance between clusters x_i and x_j, Δ(x_i) and Δ(x_j) represent the distance between the points within clusters x_i and x_j, respectively, and c is the number of clusters in partition U.
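A small worked sketch may make the DB computation concrete. It assumes Euclidean distance, takes the intra-cluster distance as the average distance of a cluster's points to its centroid, and takes the inter-cluster distance as the distance between centroids; these are common choices but are assumptions here, and the toy clusters are hypothetical.

```python
import math

def euclid(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a list of points."""
    return [sum(dim) / len(points) for dim in zip(*points)]

def davies_bouldin(clusters):
    """Average, over clusters, of the worst (largest) ratio of summed
    intra-cluster scatter to inter-centroid distance."""
    cents = [centroid(c) for c in clusters]
    scatter = [
        sum(euclid(p, m) for p in c) / len(c) for c, m in zip(clusters, cents)
    ]
    total = 0.0
    for i in range(len(clusters)):
        total += max(
            (scatter[i] + scatter[j]) / euclid(cents[i], cents[j])
            for j in range(len(clusters))
            if j != i
        )
    return total / len(clusters)

# Two compact, well-separated toy clusters give a low DB value.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(10.0, 0.0), (10.0, 2.0)]
db = davies_bouldin([cluster_a, cluster_b])  # (1 + 1) / 10 = 0.2
```

Moving the two toy clusters closer together, or spreading their points out, raises the value, which matches the reading that lower is better.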

Average within centroid distance.

The Average within Centroid Distance (AWCD) metric is the average distance of the points in a cluster from the centroid of that cluster. The centroid distance between clusters A and B is the distance between centroid(A) and centroid(B). The average distance (dist) is calculated by averaging the pairwise distances between points: for each point a_i in cluster A, the distances dist(a_i,b₁), dist(a_i,b₂), …, dist(a_i,b_n) are calculated, and all of them are averaged. The more compact a cluster is, the lower the average value. However, as the number of clusters increases, the average distance decreases naturally, which makes this metric difficult to interpret on its own [42].
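The centroid-based reading of this metric can be sketched as follows. The sketch assumes plain Euclidean distance and reports both a per-cluster average and an overall average weighted by cluster size; the function names and toy clusters are illustrative, and the tool used in the study may define the distance differently.

```python
import math

def euclid(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a list of points."""
    return [sum(dim) / len(points) for dim in zip(*points)]

def average_within_centroid_distance(clusters):
    """Return (per-cluster averages, overall average) of the distance
    between each point and the centroid of its own cluster."""
    per_cluster = []
    total, count = 0.0, 0
    for members in clusters:
        center = centroid(members)
        distances = [euclid(p, center) for p in members]
        per_cluster.append(sum(distances) / len(distances))
        total += sum(distances)
        count += len(distances)
    return per_cluster, total / count

# Hypothetical toy clusters: the first is more compact than the second.
compact = [(0.0, 0.0), (0.0, 2.0)]
spread = [(10.0, 0.0), (10.0, 6.0)]
per_cluster, overall = average_within_centroid_distance([compact, spread])
```

Here the compact cluster scores 1.0 and the spread cluster 3.0, so the lower value identifies the tighter cluster.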

Sum of squares.

The Sum of Squares (SS) metric divides the number of data points in a group by the number of data points in each cluster. This value is squared, and the values of all the clusters are summed. This evaluation metric shows that a good cluster can change according to the starting parameters used to form the cluster. If the value decreases slowly with an increasing number of clusters, it indicates that there is a large stable cluster that is still intact. Eq (7) shows the calculation of the SS evaluation metric [43]:

$$SS = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 \tag{7}$$

where S_i is one of the clusters (S₁, …, S_k) with midpoint μ_i from (μ₁, …, μ_k), k represents the number of clusters and x represents a point of the data set.

Result and analysis

Determining the best distance

A series of experiments was run with k values ranging from 2 to 15 and four different distance measures, namely Euclidean Distance (ED), Correlation Similarity (CrS), Cosine Similarity (CS) and Dice Similarity (DS). Performance was measured using the DB, AWCD and SS evaluation metrics, for which low values indicate good clusters. Table 6 shows the clustering performance on the three evaluation metrics for all k values from 2 to 15 using the four distance techniques. The average DB values for the ED, CrS, CS and DS techniques were 1.78, 2.20, 2.19 and 5.91, respectively. As shown in Table 6, the DB recorded four infinity values when using the CrS technique, at k = 2, 6, 11 and 14. The DS technique recorded ten infinity values, at k = 3, 5, 7, 9, 10, 11, 12, 13, 14 and 15. This indicates that the CrS and DS techniques produce poor clustering quality according to the DB metric. Furthermore, as shown in Table 6, the average AWCD values were 13.98, 15.47, 14.96 and 23.87 for the ED, CrS, CS and DS techniques, respectively. This shows that ED is the best distance technique based on the average DB and AWCD values. On the other hand, the CS technique outperforms the other distance techniques on SS, with average values of 0.25, 0.23, 0.18 and 0.22 for the ED, CrS, CS and DS techniques, respectively.

Table 6. Clustering performance based on Davies Bouldin, average within centroid distance and sum of squares for k = 2 to 15 based on Euclidean distance, correlation similarity, cosine similarity and dice similarity.

https://doi.org/10.1371/journal.pone.0255312.t006

To select the best distance technique, the performance of each is compared on the DB, AWCD and SS evaluation metrics. Table 7 shows the average clustering performance for each distance measure. The ED technique recorded the best performance based on the lowest DB and AWCD values. Meanwhile, CS is the best distance technique based on the SS value. CrS shows moderate performance, and DS shows poor clustering performance. The performance results recorded in Table 7 are ranked from 1 to 4 for each evaluation metric to select the best distance measure.

Table 7. Comparison of average clustering performance based on distance measure.

https://doi.org/10.1371/journal.pone.0255312.t007

Table 8 shows the ranks of each distance measure based on the average values of the DB, AWCD and SS metrics. For each evaluation metric, the four distance techniques are ranked from 1 to 4 according to their average value, with rank 1 assigned to the lowest value. For example, ED produces the lowest average DB value and is therefore ranked number 1 on DB, while the DS technique, with the highest average DB value, receives rank number 4. The final two columns on the right in Table 8 give the mean rank over all evaluation metrics and the resulting rank position for each distance measure. Overall, the resulting ranking of the four distance measures is:

Cosine Similarity > Euclidean Distance > Correlation Similarity > Dice Similarity

Cosine Similarity is thus the best distance technique, having obtained the lowest mean rank.
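The rank aggregation can be reproduced from the average metric values quoted in the text. The sketch below ranks the four techniques on each metric (lowest value first) and averages the ranks; the variable and function names are illustrative, and tie handling is not addressed because the quoted averages contain no ties.

```python
# Average metric values per distance technique, as quoted in the text.
averages = {
    "DB": {"ED": 1.78, "CrS": 2.20, "CS": 2.19, "DS": 5.91},
    "AWCD": {"ED": 13.98, "CrS": 15.47, "CS": 14.96, "DS": 23.87},
    "SS": {"ED": 0.25, "CrS": 0.23, "CS": 0.18, "DS": 0.22},
}

def rank_lowest_first(scores):
    """Assign rank 1 to the lowest score, 2 to the next, and so on."""
    ordered = sorted(scores, key=scores.get)
    return {name: position for position, name in enumerate(ordered, start=1)}

# Rank every technique on every metric.
ranks = {metric: rank_lowest_first(scores) for metric, scores in averages.items()}

# Mean rank over the three metrics for each technique.
mean_rank = {
    technique: sum(ranks[metric][technique] for metric in ranks) / len(ranks)
    for technique in averages["DB"]
}

# Techniques ordered from best (lowest mean rank) to worst.
final_order = sorted(mean_rank, key=mean_rank.get)
print(final_order)  # ['CS', 'ED', 'CrS', 'DS']
```

ED ranks first on DB and AWCD but last on SS, while CS ranks second, second and first, which is why CS ends up with the lowest mean rank.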

Table 8. Final score ranking to select the best distance measure.

https://doi.org/10.1371/journal.pone.0255312.t008

Determining the best k value

The K-Means algorithm is a simple clustering algorithm. However, it requires the parameter k as an input to the clustering process. The value of k is an important parameter in determining the quality of the clusters. Therefore, this study determines the best k value for the clustering model. The performance graph for the model is plotted based on the Cosine Similarity measure.

Fig 2 plots the performance of the clustering model from k = 2 up to k = 15. For the Davies Bouldin (DB) measure, a low value indicates that the clusters are tight and well separated. The lowest DB value is recorded at k = 15. In the Average within Centroid Distance (AWCD) plot, the values flatten at k = 8, which indicates that increasing the number of clusters further does not significantly affect the quality of the clusters [42]. For the Sum of Squares (SS) measure, the value drops sharply until k = 8 before it begins to flatten. Therefore, based on the DB measure and taking into account the AWCD and SS measures, it can be concluded that k = 8, with DB = 2.157, is the best k value for this model.
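One simple way to make the notion of a curve that "begins to flatten" explicit is sketched below. The AWCD values are hypothetical placeholders rather than the values plotted in Fig 2, and the 5% tolerance and the function name `flattening_point` are assumptions chosen for illustration; the study itself reads the flattening point from the plots.

```python
def flattening_point(curve, tolerance=0.05):
    """Return the smallest k after which the metric improves by less than
    `tolerance` (as a fraction of its current value) when k grows by one."""
    ks = sorted(curve)
    for current, nxt in zip(ks, ks[1:]):
        relative_drop = (curve[current] - curve[nxt]) / curve[current]
        if relative_drop < tolerance:
            return current
    return ks[-1]


# Hypothetical placeholder values, NOT read from Fig 2.
awcd_by_k = {
    2: 30.0, 3: 26.0, 4: 22.5, 5: 19.5, 6: 17.0, 7: 15.0, 8: 13.0,
    9: 12.8, 10: 12.6, 11: 12.5, 12: 12.4, 13: 12.3, 14: 12.2, 15: 12.1,
}
best_k = flattening_point(awcd_by_k)
print(best_k)
```

A rule of this kind is sensitive to the chosen tolerance, so it is best treated as a cross-check on visual inspection rather than a replacement for it.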

Fig 2. Cluster performance plot.

https://doi.org/10.1371/journal.pone.0255312.g002

Clustering analysis

The analysis and interpretation of cluster results is one of the most important activities in clustering. Each cluster needs to be explored and analyzed to identify its characteristics and how it differs from the others. In this study, the analysis and interpretation of each cluster determines the indicators and dimensions of multidimensional poverty among the B40 group. Each cluster was analyzed by examining the cluster size, centroid chart, scatter plots, heat map and descriptive statistics.

Cluster size analysis.

As shown in Table 9, eight clusters are derived from the clustering model. Clusters 0 and 2 are the largest groups, comprising 16% each. Both clusters also have the lowest average centroid distances, indicating that they are more compact. The smallest cluster is Cluster 3, which makes up 9% of the data. The Average within Centroid Distance (AWCD) is lowest for Cluster 2 at 9.064, which indicates that Cluster 2 is the most compact of all the clusters.
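The two quantities discussed here, cluster share and average distance to the centroid, can be computed from cluster assignments as in the following minimal sketch. It assumes Euclidean distance to the cluster mean, which may differ from the exact definition used by the clustering tool, and the data and the function name `cluster_summary` are hypothetical.

```python
import math
from collections import defaultdict


def cluster_summary(points, labels):
    """Return the share of records and the mean distance to the centroid
    for every cluster (assumption: Euclidean distance to the cluster mean)."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(tuple(point))
    summary = {}
    for label, members in groups.items():
        centroid = tuple(sum(col) / len(members) for col in zip(*members))
        avg_distance = sum(math.dist(p, centroid) for p in members) / len(members)
        summary[label] = {
            "share": len(members) / len(points),
            "avg_centroid_distance": avg_distance,
        }
    return summary


# Hypothetical toy data: cluster 1 is perfectly compact, cluster 0 is not.
toy_points = [(0.0, 0.0), (0.0, 2.0), (5.0, 5.0), (5.0, 5.0)]
toy_assignment = [0, 0, 1, 1]
summary = cluster_summary(toy_points, toy_assignment)
print(summary)
```

A lower average distance means that the members lie closer to their centroid, which is the sense in which a cluster is described as compact.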

Table 9. Size of cluster.

https://doi.org/10.1371/journal.pone.0255312.t009

Centroid chart analysis.

The Centroid Chart, shown in Fig 3, is a graphical representation of the centroid values in a parallel chart. It represents the mean value of the centroid point for a given attribute in each cluster. The centroid values are normalized; therefore, the mean value for each attribute is equal to 0. Centroid values that lie far above or below the mean are easily noticed in this chart and indicate a distinguishing characteristic of the respective cluster. For instance, for Cluster 7, the centroid value for the personal computer attribute is 2.27, which is far above the mean value and is the highest among all clusters. Thus, ‘personal computer’ is one of the most important characteristics of Cluster 7.
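The idea of reading distinguishing attributes off normalized centroids can be sketched as follows. The sketch assumes z-score normalization (consistent with each attribute having a mean of 0), and the toy data, attribute names, threshold and function names are hypothetical and serve only as an illustration.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize every attribute to mean 0 and unit standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [
        [(x - mu) / sigma if sigma > 0 else 0.0 for x, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]


def distinguishing_attributes(z_rows, labels, names, threshold=1.0):
    """For each cluster, keep attributes whose centroid value lies at least
    `threshold` away from the overall mean of 0 (threshold is an assumption)."""
    result = {}
    for cluster in sorted(set(labels)):
        members = [row for row, lab in zip(z_rows, labels) if lab == cluster]
        centroid = [mean(col) for col in zip(*members)]
        result[cluster] = {
            name: round(value, 2)
            for name, value in zip(names, centroid)
            if abs(value) >= threshold
        }
    return result


# Hypothetical binary ownership data for six individuals in two clusters.
attribute_names = ["personal_computer", "paid_tv_channel"]
raw_rows = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
cluster_labels = [0, 0, 0, 0, 1, 1]
z_rows = zscore_columns(raw_rows)
flagged = distinguishing_attributes(z_rows, cluster_labels, attribute_names)
print(flagged)
```

In this toy example only the personal computer attribute of the second cluster is flagged, because its centroid value lies well above 0 while the paid TV channel attribute is evenly spread across both clusters.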

Fig 3. Centroid chart.

https://doi.org/10.1371/journal.pone.0255312.g003

Nevertheless, this form of analysis offers minimal insights; thus, the indicators and dimensions for multidimensional poverty cannot be specified at this point. Therefore, we proceed to the next analysis called Scatter Plot analysis.

Scatter plot analysis.

Scatter plots are another way of analyzing cluster characteristics graphically. They are very useful for visually positioning a cluster based on two of its key attributes, and they indicate the relationship or correlation between these two attributes. From this scatter plot analysis, 12 attributes have been selected as key indicators in defining each cluster, as depicted in Fig 4A–4H.

Fig 4. Scatter plot of (a) Cluster 0; (b) Cluster 1; (c) Cluster 2; (d) Cluster 3; (e) Cluster 4; (f) Cluster 5; (g) Cluster 6; (h) Cluster 7.

https://doi.org/10.1371/journal.pone.0255312.g004

As seen in Fig 4(A), Cluster 0 shows a relationship between the toilet facilities and construction material of outer walls attributes. This group of people probably experiences a low living standard. Cluster 1 is a large, compact cluster that shows a strong correlation between the highest certificate and occupation attributes, as shown in Fig 4(B). Based on the plot, most of the individuals in this cluster are not working and do not have any certification. The scatter plot for Cluster 2, shown in Fig 4(C), depicts the correlation between the paid TV channel and water filter attributes. Fig 4(D) shows the plot for Cluster 3, which reflects the split within the B40 group between those who are able and those who are unable to read and write. Meanwhile, for Cluster 4, as shown in Fig 4(E), similar proportions can be seen of B40 people who owned a washing machine and a water filter. The plot for Cluster 5 presents a strong correlation between the occupation attribute and the ability to read and write, as shown in Fig 4(F). Cluster 6 reveals that the majority of people in this group are not working, based on the ‘reason for not seeking work’ attribute, as shown in Fig 4(G), although the majority of them are able to read and write. They might be the children or spouse of the head of the household. Lastly, Cluster 7 shows that the majority of B40 individuals in this group owned a personal computer, and some of them owned an iPod/PDA, as shown in Fig 4(H). This pattern indicates a good standard of living among the people in the cluster.

Heat map analysis.

Compared with the scatter plot, the heat map analysis can reveal more than two important, strongly correlated attributes for each cluster. Heat map analysis is ideal for large-scale data visualization. The color scale shows the importance of the attributes: light green indicates an attribute with a high centroid value, and pink indicates an attribute with a low centroid value. From a total of 23 attributes, 15 have been selected from the heat map analysis and labelled as important attributes in forming the clusters. These are strata, birthplace, read and write, highest education, highest certificate, toilet facility, construction material of outer walls, paid TV channel, water filter, refrigerator, washing machine, occupation, reason for not seeking work, personal computer and iPod/PDA, as shown in Fig 5. Three of these attributes did not appear in the scatter plot analysis: highest education, refrigerator and strata. Each of the 15 attributes is further analyzed in the next analysis, the descriptive statistics method, to identify the multidimensional indicators and dimensions in the context of the B40 group in this study.

Fig 5. List of important attributes for B40 clustering model from heat map analysis.

https://doi.org/10.1371/journal.pone.0255312.g005

Descriptive statistics method.

Based on the most important attributes identified in the previous analysis, the descriptive statistics method is employed to further understand the data within each cluster and to identify the most relevant indicators and dimensions of multidimensional poverty. Descriptive statistics is a method that gives an overview or summary of data through numerical calculations, graphs or tables [44]. Descriptive statistics on cluster results can provide a detailed picture of how similar the attributes are within a cluster [45].

Indicators and dimensions are the two most important components of the MPI in defining poverty. Indicators should capture the deprivation experienced, while dimensions are groupings of indicators [46, 47]. There are many methods for selecting MPI indicators and dimensions. The most relevant MPI indicators and dimensions for the B40 group are specified in this analysis. The naming and grouping of the indicators identified in this analysis follow the guidelines provided in [48]. To achieve a meaningful interpretation, the values of each attribute discussed earlier need to be denormalized so that the actual values can be seen. Table 10 provides descriptive statistics for the B40 group clustering model. The grey columns indicate the distinguishing characteristics of each cluster based on the statistics obtained.

Table 10. Descriptive statistics for B40 clustering model.

https://doi.org/10.1371/journal.pone.0255312.t010

Attributes analysis in defining multidimensional poverty indicators and dimensions. Reading and writing is a basic literacy skill. According to Table 10, this attribute was a distinct characteristic of Clusters 3 and 5, in which 45% and 92% of individuals, respectively, were not able to read and write. Highest education refers to the highest level of education attained by a person, which includes pre-primary, primary, secondary, pre-university and tertiary. Table 10 reveals that this attribute was an important variable in distinguishing Clusters 3 and 5, in which 61% and 99% of people, respectively, had no education. The ‘not applicable’ classification in Cluster 3 refers to individuals who are too young or never attended school. A similar pattern can be seen for the highest certificate attribute, with one additional cluster, Cluster 1. The highest education attribute shows a large percentage in Cluster 1, which clearly indicates that this cluster consists of minors, 100% of whom attained primary and pre-school education. Education is one of the dimensions of the global MPI and is closely related to poverty. Thus, this study proposes a literacy indicator, which consists of the read and write attribute, and a highest education level and grade indicator, which consists of the highest education and highest certificate attributes. These two indicators are grouped under the education dimension to measure the education level among the B40 group.

The strata attribute refers to a person’s living environment, urban or rural. Table 10 shows that people from rural areas dominated Cluster 0, while Clusters 1, 3, 4 and 5 have more than 30% of individuals from rural areas. This attribute has been observed to correlate with the toilet facility and construction material of outer walls attributes. A greater percentage of people from these clusters use a pour-flush toilet and live in a house made of plank or a combination of brick and plank. This indicates that people living in rural areas have a lower living standard than urban people. Although strata is one of the important variables in cluster formation, it is considered a demographic variable. Another demographic attribute found is the birthplace attribute. Thus, neither attribute is selected as a multidimensional poverty indicator.

Five types of toilet facilities were listed in Malaysia, namely the flush system, pour-flush, pit, enclosed space over water and none. The Malaysian MPI uses this attribute as one of the indicators to measure poverty and defines households without a flush system as the cut-off for deprivation. The global MPI, however, uses different terminology, sanitation, with a different cut-off measure. Table 10 reveals that Cluster 0 is the most deprived in terms of toilet facility, with 71% using a pour-flush toilet, followed by four other clusters: Clusters 1, 3, 4 and 5. Therefore, the toilet facility attribute is selected as the measure attribute for the sanitation indicator in this study.
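The deprivation cut-off described above for the Malaysian MPI, under which any household without a flush system counts as deprived, can be expressed as a minimal sketch. The category labels follow the five types listed in the text, while the function names and the sample records are hypothetical.

```python
TOILET_TYPES = ("flush", "pour-flush", "pit", "enclosed space over water", "none")


def is_deprived_sanitation(toilet_type):
    """Apply the cut-off: any facility other than a flush system is deprived."""
    if toilet_type not in TOILET_TYPES:
        raise ValueError(f"unknown toilet facility: {toilet_type!r}")
    return toilet_type != "flush"


def deprivation_rate(toilet_types):
    """Share of records that fall below the sanitation cut-off."""
    flags = [is_deprived_sanitation(t) for t in toilet_types]
    return sum(flags) / len(flags)


# Hypothetical sample of four households, not taken from the census data.
sample_households = ["flush", "pour-flush", "pit", "flush"]
sample_rate = deprivation_rate(sample_households)
print(sample_rate)
```

Changing the cut-off, for example to treat pour-flush toilets as adequate, only requires a different condition in `is_deprived_sanitation`, which illustrates why the choice of cut-off directly drives the measured deprivation rate.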

The construction material of outer walls is another important attribute derived from the B40 clustering model. As seen in Table 10, there are five clusters in which less than 70% of people lived in houses made of brick. This attribute is one of the items defined by the global MPI under the housing indicator. Thus, the housing indicator is suggested in this study, with construction material of outer walls as the measure attribute.

The paid TV channel attribute has been identified as a distinct characteristic of Clusters 2 and 7. In these two clusters, 84% and 65% of people, respectively, could afford to subscribe to the service, which indicates a good standard of living for both clusters, while most people in Clusters 4 and 6 could not. In total, only 36% of the dataset has access to this service. This attribute is suggested as the measure attribute for a new indicator called access to television service.

As can be seen in Table 10, the water filter attribute is related to the paid TV channel attribute: people who are able to subscribe to the television service are also able to own a water filter. A total of 83% of the dataset does not own any water purification system at home. Safe drinking water is critical for public health, and a water purification system can help to produce safe drinking water, especially in rural areas that do not receive a treated water supply. Hence, this attribute is also selected as one of the indicators. Refrigerators and washing machines are the two most common home appliances. The statistics, however, indicate that the majority of people from Cluster 4 live without these two appliances. Therefore, these three home appliances (water filter, refrigerator and washing machine) are chosen as measure attributes for the assets indicator.

The occupation attribute refers to the major groups of occupation in Malaysia based on the International Standard Classification of Occupations (ISCO-08). Occupation is the main source of income for most households in Malaysia. A total of 45% of the dataset in this study was categorized as outside the labor force, meaning that they were unemployed; 22% were under ten years old and were children of the head of the household; and the rest were employed in various types of occupation. Table 10 indicates that Clusters 1 and 5 consisted of children of the head of the household, and that Cluster 6 has the highest unemployment percentage. The ‘reason for not seeking work’ attribute explains the unemployment percentage in the occupation attribute, and hence both attributes are selected to measure multidimensional poverty under the work indicator.

Personal computers and iPods/PDAs are other assets of B40 people, and both are distinguishing features of Cluster 7. Table 10 shows that 98% of people in Cluster 7 owned a personal computer, and 9% of this cluster owned an iPod/PDA. This indicates that this cluster has a relatively good standard of living, given its members' ability to own technology assets. Considering the importance of technology as a key growth engine for an emerging and developing country like Malaysia, these two attributes are selected as measure attributes under the assets indicator.

The analysis discussed above results in a new multidimensional poverty measure for the B40 group comprising three dimensions (Education, Living Standards and Employment), broken down into seven indicators: literacy, highest education level and grade, sanitation, housing, access to television services, assets and work. These indicators are measured by 13 attributes: Read and Write, Highest Education, Highest Certificate, Toilet Facility, Construction Material of Outer Walls, Paid TV Channel, Water Filter, Refrigerator, Washing Machine, Personal Computer, iPod/PDA, Occupation and Reason for Not Seeking Work, as presented in Table 11. Table 12 provides a comparison between the global MPI, the Malaysian MPI and the MPI discovered in this study.

Table 11. Dimensions, indicators and measure attributes identified from the B40 clustering model.

https://doi.org/10.1371/journal.pone.0255312.t011

Table 12. Dimensions and indicators comparison.

https://doi.org/10.1371/journal.pone.0255312.t012

Malaysian citizens are categorized into three income groups: the Top 20 Percent (T20), Middle 40 Percent (M40) and Bottom 40 Percent (B40). The B40 group is further divided into three subgroups: lower-middle income, low-income and poor. The success of the B40 clustering model in identifying new important indicators for the B40 group can help to improve the present MPI’s ability to detect the poor group. Additionally, it can help to enhance poverty measurement based solely on income, namely the PLI. This can be seen by comparing the PLI method’s poverty measurement with the new MPI calculated using this data set. Table 13 shows the distribution of the B40 group across the sub-categories under the PLI: 14% are categorized as poor, 50% as low income and 36% as lower-middle income.

Table 13. Distribution of B40 group based on 2016’s PLI.

https://doi.org/10.1371/journal.pone.0255312.t013

Eight sub-categories of B40 were discovered using the new MPI. According to verification by poverty experts, Cluster 3 contains the features of poverty that characterize the poor group. This is because it has the smallest cluster size, at 9% of the population, as shown in Table 9, has the lowest average income, as shown in Table 13, and possesses the characteristics of poor people. As a result, Cluster 3 depicts the characteristics of the poor in this data set, as described in Table 14.

Table 14. Poor characteristic from Cluster 3.

https://doi.org/10.1371/journal.pone.0255312.t014

This comparison shows that only 9% of the B40 group is identified as poor using the new MPI method, compared with 14% using the PLI approach. Although the MPI identifies fewer poor people than the PLI, it simultaneously considers a variety of characteristics that lead to a person being classified as poor. As a result, any government assistance programme aimed at lifting these people out of poverty can be targeted more precisely.

Conclusion

One of the focus areas in the Eleventh Malaysia Plan (11MP) is to elevate the B40 household group towards the middle-income society. Based on recent studies by the World Bank, Malaysia is expected to become a high-income nation between 2024 and 2028. Thus, it is essential to characterize the B40 population through data-driven analytics so that the government can develop a comprehensive action plan. Data analytics concerns the extraction of meaning, patterns and trends from varied and large volumes of data. Such data sets exist in many areas, and poverty eradication is no exception. Currently, absolute poverty in Malaysia is measured by the Poverty Line Income (PLI). The PLI is a one-dimensional income approach that specifically measures gross monthly household income. For the B40 group to move into the middle-income class, a concerted effort to improve the conditions of the people in the group must be made. At present, the B40 group is identified by income status, when in reality, they are more vulnerable to deprivations defined by numerous poverty dimensions. Malaysia has also employed the customised MPI technique from OPHI to measure multidimensional poverty. However, the World Bank Group has criticized the adoption of this technique, which has a detection rate of only 0.86%, and has urged that the benchmark, or so-called deprivation cut-off level, be raised. Thus, a clustering model based on the K-Means algorithm with the Cosine Similarity measure is developed to form clusters of the B40 group, as an alternative machine learning method for identifying the most important poverty indicators and their deprivation cut-off levels. The evaluation found k = 8 to be the best k value for the model. A series of clustering analyses was then conducted to identify the indicators and dimensions associated with multidimensional poverty for the B40 community in Malaysia.
By employing the descriptive statistics method, three dimensions have been established: Education, Living Standards and Employment with seven indicators: literacy, highest education level and grade, sanitation, housing, access to television services, assets and work. Out of seven indicators identified, this study proposed six new Multidimensional Poverty Indicators namely literacy, highest education level and grade, housing, access to television services, assets (water filter, refrigerator, washing machine, personal computer, iPod/PDA) and work to be considered by the policymakers as a valuable addition to the current MPI to establish a more meaningful picture of the current poverty trend in Malaysia. Furthermore, this study has discovered Cluster 3 of the B40 group to contain the smallest cluster size of 9% relative to the population with the lowest average income and possessing the characteristics of poor people, which had been confirmed by poverty specialists.

A further in-depth study should be carried out in future to obtain the other important components of the Multidimensional Poverty Index (MPI), namely the deprivation cut-offs and weights for each of the indicators specified. These components are needed to compute the MPI value as an absolute multidimensional poverty measurement. Furthermore, the algorithm used in the clustering model is K-Means. Many other algorithms that may improve the clustering quality can be studied and tested. The development of those algorithms could further enhance the attractiveness of the clustering approach for identifying MPI indicators for the Bottom 40 group. In addition, a new collection of census data, the Population and Housing Census 2020, is due to be published by 2021. This latest data could be used in future work to capture the latest trends and produce more reliable research results.

References

1. Prime Minister Office of Malaysia. Second Malaysia plan (1971–1975). Kuala Lumpur: Prime Minister Office of Malaysia; 1970.
2. Majlis Penasihat Ekonomi Negara. Model baru ekonomi untuk Malaysia—Bahagian akhir: Langkah dasar strategik. Putrajaya: Majlis Penasihat Ekonomi Negara; 2010.
3. The Economic Planning Unit. Tenth Malaysia plan 2011–2015. Putrajaya: The Economic Planning Unit, Prime Minister’s Department; 2010.
4. The Economic Planning Unit. Rancangan Malaysia kesebelas 2016–2020: Pertumbuhan berpaksikan rakyat. Putrajaya: The Economic Planning Unit, Prime Minister’s Department; 2015.
5. Ministry of Finance. Belanjawan 2019. Putrajaya: Ministry of Finance Malaysia; 2018.
6. Economic Planning Unit. Mid-term review of the eleventh Malaysia plan 2016–2020: New priorities and emphases. Kuala Lumpur: Economic Planning Unit; 2018.
7. Alkire S, Santos ME. Measuring acute poverty in the developing world: Robustness and scope of the multidimensional poverty Index. World Dev. 2014;59:251–274.
8. Simler K. An Idea Whose Time Has Come: Raising Malaysia’s Poverty Line. Malay Mail. 2019 Sep 1 [Cited 2019 December 3]; Available from: https://www.malaymail.com/news/what-you-think/2019/09/01/an-idea-whose-time-has-come-raising-malaysias-poverty-line-kenneth-simler/1786201
9. Sani NS, Abdul Rahman M, Abu Bakar A, Sahran S, Mohd Sarim H. Machine learning approach for bottom 40 percent households (B40) poverty classification. Int J Adv Sci Eng Inf Technol. 2018;8(4):1698–1705.
10. Sani NS, Nafuri AFM, Othman ZA, Nazri MZA, Nadiyah Mohamad K. Drop-Out Prediction in Higher Education Among B40 Students. International Journal of Advanced Computer Science and Applications, 2020;11(11):550–559.
11. Sani NS, Rahman AHA, Adam A, Shlash I, Aliff M. Ensemble Learning for Rainfall Prediction. International Journal of Advanced Computer Science and Applications, 2020;11(11):153–162.
12. Caruso G, Sosa-Escudero W, Svarc M. Deprivation and the dimensionality of welfare: A variable-selection cluster-analysis approach. Int Assoc Res Income Wealth. 2014;61(4):1–21.
13. Hurst W, Montañez CAC, Shone N, Al-Jumeily D. An ensemble detection model using multinomial classification of stochastic gas smart meter data to improve wellbeing monitoring in smart cities. IEEE Access. 2020;8:7877–7898.
14. Isnin R, Bakar AA, Sani NS. Does Artificial Intelligence Prevail in Poverty Measurement? Journal of Physics: Conference Series. 2020;1529(4):1–13.
15. Ugur MS. A cluster analysis of multidimensional poverty in Turkey. In: Chingula M, Vlahov RD, Dobribic D, editors. Proceedings of the International Scientific Conference on Economic and Social Development—Human Resources Development; 2016 Jun 9–11; Varazdin, Croatia: Varazdin Development and Entrepreneurship Agency; 2016. pp. 12–29.
16. Luzzi GF, Flückiger Y, Weber S. A cluster analysis of multidimensional poverty in Switzerland. In: Kakwani N, editor. Quantitative approaches to multidimensional poverty measurement. London: Palgrave Macmillan UK; 2008. pp. 63–79.
17. Othman ZA, Bakar AA, Sani NS, Sallim J. Household Overspending Model Amongst B40, M40 and T20 using Classification Algorithm. International Journal of Advanced Computer Science and Applications, 2020;11(7):392–399.
18. Abu Bakar A, Mohd Noah SA, Sani NS. Penerokaan pengetahuan dalam data raya. In: Hamdan AR, Abu Bakar A, Ahmad Nazri MZ, editors. Sains data penerokaan pengetahuan dari data raya. Selangor: Penerbit UKM; 2018. pp. 52–70.
19. Mohd Zin NA, Tambi N. Faktor kemiskinan bandar terhadap pembangunan pendidikan golongan lewat kembang. J Psikol Malaysia. 2018;32(3):119–130.
20. Abdullah AH, Mohamad E. Tahap literasi kesihatan golongan lelaki kumpulan pendapatan B40 dan M40 di Johor Bahru. J Soc Sci Humanit. 2016;11(2):17–35.
21. Aqmin M, Wahab A, Shahiri HI, Mansur M, Azlan M, Zaidi S. Kos sara hidup tinggi di Malaysia: Pertumbuhan pendapatan isi rumah yang perlahan atau taraf hidup yang meningkat? J Ekon Malaysia. 2018;52(1):117–33.
22. Mayan SNA, Mohd Nor R, Samat N. Challenges to the household income class B40 increase in developed country towards 2020 case study: Penang. Int J Environ Soc Space. 2017;5(2):35–41.
23. Ahmad Z, Ejaz Z. Classification of households with respect to poverty by using cluster analysis. Proceedings of the 11th Islamic Countries Conference on Statistical Sciences (ICCS-11); 2011 Dec 19–22; Lahore, Pakistan: Islamic Countries Society of Statistical Sciences; 2011. pp. 369–381. https://doi.org/10.13140/2.1.4604.6728
24. Sarwosri SD, Akbar RJ, Setiyawan RD. Poverty classification using analytic hierarchy process and K-means clustering. In: Satapathy SC, Das, S, editors. Proceedings of 2016 International Conference on Information and Communication Technology and Systems (ICTS 2016); 2015 Nov 28–29; Ahmedabad, India: IEEE; 2016. pp. 266–269. https://doi.org/10.1109/ICTS.2016.7910310
25. Coromaldi M, Drago C. An analysis of multidimensional poverty: Evidence from Italy. In: White R, editor. Measuring Multidimensional poverty and deprivation, global perspectives on wealth and distribution. Cham: Springer; 2017. pp. 69–86. https://doi.org/10.1007/978-3-319-58368-6
26. Santoso S, Irawan MI. Classification of poverty levels using k-nearest neighbor and learning vector quantization methods. Int J Comput Sci Appl Math. 2016;2(1):8–13.
27. Sano AVD, Nindito H. Application of K-means algorithm for cluster analysis on poverty of provinces in Indonesia. ComTech. 2016;7(2):141–150.
28. Njuguna C, McSharry P. Constructing spatiotemporal poverty indices from big data. J Bus Res. 2017;70:318–327.
29. Arribas-Bel D, Patino JE, Duque JC. Remote sensing-based measurement of Living Environment Deprivation: Improving classical approaches with machine learning. PLoS One. 2017;12:1–25. pmid:28464010
30. Hashemian B, Massaro E, Bojic I, Arias JM, Sobolevsky S, Ratti C. Socioeconomic characterization of regions through the lens of individual financial transactions. PLoS One. 2017;11:1–20. pmid:29190724
31. Department of Statistics Malaysia. Population and housing census of Malaysia. Putrajaya: Department of Statistics Malaysia; 2010.
32. Department of Statistics Malaysia. Laporan penyiasatan tenaga buruh, Malaysia, 2016. Putrajaya: Department of Statistic Malaysia; 2016.
33. Bora DJ, Gupta AK. Effect of different distance measures on the performance of K-means algorithm: An experimental study in Matlab. Int J Comput Sci Inf Technol. 2014;5(2):2501–6.
34. Rapidminer GmbH. k-Means (Concurrency). Rapidminer Studio Documentation. [Cited 2019 December 18]; Available from: https://docs.rapidminer.com/8.2/studio/operators/modeling/segmentation/k_means.html
35. Sisodia DS, Verma A. Performance of unsupervised learning algorithms for online document clustering. In: Proceedings of the 2018 International Conference on Inventive Research in Computing Applications (ICIRCA); 2018 Jul 11–12; Coimbatore, Tamil Nadu, India: RVS College of Engineering and Technology; 2018. pp. 920–925. https://doi.org/10.1109/ICIRCA.2018.8597378
36. Vergani AA, Binaghi E. A soft davies-bouldin separation measure. In: Proceedings of the 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE); 2018 Jul 8–13; Rio De Janeiro, Brazil: IEEE; 2018. pp. 75–82. https://doi.org/10.1109/FUZZ-IEEE.2018.8491581
37. Rathore P, Ghafoori Z, Bezdek JC, Palaniswami M, Leckie C. Approximating Dunn’s cluster validity indices for partitions of big data. IEEE Trans Cybern. 2019;49(5):1629–41. pmid:29994745
38. Rani U, Sahu S. Comparison of clustering techniques for measuring similarity in articles. In: Proceedings of the 2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT); 2017 Feb 9–10; Ghaziabad, India: IEEE; 2017. pp. 1–7. https://doi.org/10.1109/CIACT.2017.7977377
39. Liu Y, Xiong H, Li Z. Understanding and enhancement of internal clustering validation measures. Data Clust. 2019;43(3):571–606. pmid:23193245
40. Cheng D, Zhu Q, Huang J, Wu Q, Yang L. A novel cluster validity index based on local cores. IEEE Trans Neural Networks Learn Syst. 2019;30(4):985–999. pmid:30072347
41. Iglesias F, Zseby T, Zimek A. Absolute cluster validity. IEEE Trans Pattern Anal Mach Intell. 2020;42(9):2096–112. pmid:31027043
42. Klinkenberg R, Hofmann M, editors. Rapidminer: Data mining use cases and business analytics applications. Boca Raton, FL: CRC Press; 2014.
43. Dao T, Duong K, Vrain C. Constrained minimum sum of squares clustering by constraint programming. In: Pesant G, editor. Principles and practice of constraint programming: Lecture notes in Computer Science. Cork, Ireland: Springer; 2015. pp. 557–573. https://doi.org/10.1007/978-3-319-23219-5
44. Donges N. Intro to Descriptive Statistics. Towards Data Science. 2018 Feb 14 [Cited 2020 July 10]. Available from: https://towardsdatascience.com/intro-to-descriptive-statistics-252e9c464ac9 pmid:30854281
45. Soman KP, Diwakar S, Ajay V. Insight into data mining: Theory and practice. Delhi: PHI Learning Private Limited; 2006.
46. Abu Bakar A, Hamdan R, Sani NS. Ensemble learning for multidimensional poverty classification. Sains Malaysiana. 2020;49(2):447–459.
47. Shabudin S, Sani NS, Ariffin KAZ, Aliff M. Feature selection for phishing website classification. Int J Adv Comput Sci Appl. 2020;11(4):587–595.
48. United Nations Development Programme, Oxford Poverty and Human Development Initiative, University of Oxford. How to build a national multidimensional poverty index (MPI): Using the MPI to inform the SDGs. New York, NY: United Nations Development Programme; 2019.

