Semi-supervised machine learning approaches for predicting the chronology of archaeological sites: A case study of temples from medieval Angkor, Cambodia

Archaeologists often need to date and group artifact types to discern typologies, chronologies, and classifications. For over a century, statisticians have been using classification and clustering techniques to infer patterns in data that can be defined by algorithms. In the case of archaeology, linear regression algorithms are often used to chronologically date features and sites, and pattern recognition is used to develop typologies and classifications. However, archaeological data is often expensive to collect, and analyses are often limited by poor sample sizes and datasets. Here we show that recent advances in computation allow archaeologists to use machine learning based on much of the same statistical theory to address more complex problems using increased computing power and larger and incomplete datasets. This paper approaches the problem of predicting the chronology of archaeological sites through a case study of medieval temples in Angkor, Cambodia. For this study, we have a large dataset of temples with known architectural elements and artifacts; however, less than ten percent of the sample of temples have known dates, and much of the attribute data is incomplete. Our results suggest that the algorithms can predict dates for temples from 821–1150 CE with a 49-66-year average absolute error. We find that this method surpasses traditional supervised and unsupervised statistical approaches for under-specified portions of the dataset and is a promising new method for anthropological inquiry.


Introduction
Archaeologists often rely on statistical methods to infer the chronology of and group archaeological sites, artifact types, and architecture. However, this can be limited by incomplete datasets. It can be relatively easy to create large archaeological datasets, with excavations producing thousands of ceramic sherds and lithic assemblages. Similarly, hundreds of archaeological sites can be identified on the landscape using remote sensing at relatively little cost. However,

Statistical paradigms: Supervised, unsupervised, and semi-supervised
The degree of completeness of a given dataset defines the type of statistical learning paradigms possible [4]. As in traditional statistical analyses, the goal of machine learning algorithms is to infer information on the basis of incomplete data. One prototypical problem is to classify data points by assigning each data point a "label" reflecting a quantity of interest. For example, we are interested in dating temples; temples with known dates are considered labeled data and temples without known dates are considered unlabeled data. In general, there are three types of learning paradigms: supervised (all data are labeled), unsupervised (no data are labeled), and semi-supervised (a portion of the data are labeled).
In the following sections, we discuss the differences between supervised, unsupervised, and semi-supervised machine learning. Note that supervised and unsupervised paradigms also apply to non-machine learning statistical analyses. The analyses we performed in this paper encompassed both supervised and unsupervised paradigms.
Supervised machine learning. Analyses that use labeled data are "supervised" because we know the correct output, which allows us to correct errors in the algorithm [5]. Some examples of machine learning applications that use supervised paradigms are associations, classification, and regression. Machine learning associations identify conditional probability in sets of data among input variables and between input variables and outputs [6]. For example, machine learning can associate products customers typically buy together, like cereal and milk. The association of cereal and milk can be used by companies to cross-sell and advertise milk to customers purchasing cereal.
Supervised machine learning can also classify data into discrete classes. Insurance companies use a wide assortment of data about insurance applicants (e.g., age, income, sex, history) to classify them into high and low-risk groups. This machine learning method relies on previously collected data about individuals including their attributes (e.g., age, income, sex, history) and their insurance claims. By classifying new customers into low or high-risk groups, the insurance provider can determine which types of insurance to offer and determine premium rates. Classification algorithms are created with pre-existing data, but they can be adjusted as future data become available to improve accuracy. Other examples of machine learning classification include image and text recognition [6]. Semi-supervised machine learning for archaeology Regression is distinguished from classification because the output is continuous as opposed to discrete. For example, a machine learning regression can predict the price of houses based on a training set of houses' attributes (e.g., type of countertop, square feet, neighborhood) and known sale prices. Machine learning optimizes the algorithm, so the approximate error of the value is as minimal as possible based on the known prices of houses in the training set [6].
Unsupervised machine learning. For unsupervised learning, all the data are unlabeled [5]. Unsupervised learning works best to identify underlying patterns or structures in data [6,7]. While unsupervised learning is fundamentally used for estimating density, it can also be used for quantile estimation, clustering, outlier detection, and dimensionality reduction [7]. For example, companies can use unsupervised learning paradigms to group customers based on demographic information and purchasing habits. The companies can then target different groups for marketing and outliers can be identified as niche markets.
Semi-supervised machine learning. Semi-supervised learning (SSL) lies between supervised and unsupervised learning paradigms by incorporating both labeled and unlabeled data. This approach is often used when labeled data points are few because they are time consuming or expensive to obtain. In many cases, a fully labeled dataset may be infeasible, whereas nonlabeled data points may be easily obtained [7,8]. The internet, for example, has provided an avenue to easily and inexpensively obtaining unlabeled data through web crawlers. Web crawlers can scrape millions of photographs from the internet. However, to label these images would require much human effort to identify and record the content of each image by hand. SSL works to minimize the number of labels needed by learning from unlabeled data, thereby reducing the necessary human effort. One of the first SSL algorithms was developed to classify web pages [9]. SSL creates algorithms that use unlabeled data to improve the supervised learning algorithms [9]. It may seem counterintuitive to suggest that one can use unlabeled data to learn the labels of other data, but the distribution of unlabeled data in relation to labeled data can reveal a great deal of information. Fig 2 illustrates how unlabeled data can be used with labeled data to infer underlying patterns. In this example, there are two labeled data points, a circle and a diamond. Many statistical methods (e.g., Bayesian paradigms, regularization, minimum description length) would linearly divide the space as shown on the left. However, if we introduce unlabeled data, a geometric structure emerges that contradicts the linear divide. Instead, a circular classifier is preferred [10].
Indeed, much of natural human learning occurs in SSL paradigms [10]. Take, for example, how children learn to classify objects. They are exposed to some labeled data, their parent pointing to a gray fluffy animal, "cat." However, they also observe many animals that are not explicitly labeled. Over time, children combine both the labeled and unlabeled data as they learn to classify animals [11].
If the data are unlabeled, how do we know if SSL works? In some cases, it is possible to identify isolated errors. For example, the number of labeled data points for image recognition SSL is limited by the relatively expensive human component of hand labeling. In these cases, the labels are not truly unknown, only in the context of the training dataset used by the learning algorithm. As such, humans can easily verify the results by scanning through the classification of images and recognizing mistakes. Mistakes can then be rectified to improve the overall accuracy of the model. A classic example is the individuals incorrectly classified as gorillas by Google's image classifier in 2015. The individuals brought the error to the attention of Google engineers, who quickly rectified the mistake [12]. When the labels are truly unknown, the standard way to evaluate machine learning algorithms and estimate prediction error is through cross-validation [13].

Background: Case study and data
Angkor is a sprawling low-density urban complex with hundreds of temples and occupation mounds connected through a network of hydraulic infrastructure [14]. Until recently, the full extent of the settlement was only partially understood. Much of the habitational space was constructed in non-durable organic materials that have since disintegrated. Decades of aerial mapping and other remote sensing, however, have revealed traces of archaeological features including ponds, occupation mounds, embankments, and channels on the landscape [1,15]. Evans and Pottier mapped much of the hinterlands and identified over 1400 temples (Fig 3).
In this paper, we are interested in identifying the construction sequence of these temples so that we can date other urban features by proxy and create historical models of the urban development of the city.
The archaeological record at Angkor is a palimpsest of thousands of years of human habitation, with early urban forms emerging in the Bronze and Iron Ages and developing, in the first millennium CE, into dispersed, low-density settlement complexes punctuated by high-density epicenters and nodes [16]. While much work has been done using inscriptions and art historical inference to dates temples, the sheer scale and intensity of human transformation of the landscape, combined with persistent occupation and renovation of settlements over millennia, makes understanding the chronology of Angkor difficult.
In the following sections, we describe prior work that has been done to date temples at Angkor, primarily through Lustig's interpretation of temple inscriptions with listed dates and dates derived from Polkinghorne's dating of lintels [17,18]. In total, there are 1437 temples in Cambodia. Of these, 105 of the temples have known dates from Lustig and Polkinghorne (Appendix 1). Our goal is to identify dates for the remaining 1332 undated temples using statistical methods to analyze a variety of attribute data of temples that, prior to this study, have not been found statistically useful for determining chronological patterns. A further complication is that a single temple may have had multiple periods of occupation. Some were used for 100 years, then abandoned, and then re-purposed 300 years later. For example, one temple, Kapilapura, has inscriptions dating to 968 CE and 1200-1399 CE, suggesting at least two periods of occupation. Others were built over and obscured from our current record. Given the nature of the archaeological record, in most cases, it is easiest to determine when the temples were initially built, or terminus post quem. We do not expect to be able to identify multiple phases of occupation unless there are multiple art historical periods, inscriptions, or extensive excavation. When a temple went out of use, or terminus ante quem, is also difficult to determine. However, Greater Angkor Project (GAP) III ceramic data and excavations of temple sites by Pierre Bâty suggest degrees of longevity [19]. As a result, we treat the temple dates as cumulative, meaning that once built, a temple is in continued use unless we have specific spatiotemporal data to suggest it went into disuse.

Data
Angkorian inscriptions were inscribed on stone in Khmer, Sanskrit, or both and often refer to temple foundations, including their establishment, administration, and support [20,21]. Meaning that when temples were established the date of their construction and the name of the sponsor were inscribed on the temple walls. Similar inscriptions on contemporary pagodas indicate individual contributions to Pagoda foundations (Fig 4). As such, temples with inscriptions can be dated to a high degree of accuracy. However, inscriptions were expensive and only a small proportion (roughly 10%) of temples in the greater Angkor region have inscriptions. Similarly, the temples that do have inscriptions tend to be larger in size. Where specific foundation dates were listed, Lustig converted the śaka dates to CE by adding 78 years. Saka dates are years in the Indian calendar, which begin at the start of the Saka Era (March 22 nd of year 79 in the Gregorian calendar). Temples with śaka dates are considered "certain." Where inscriptions were undated or a century or even two centuries are suggested, Lustig converted these to the approximate CE centuries [22]. For example, she converted 9 th century śaka (800-899 śaka or 878-977 CE) to 10 th century CE (900-999 CE). She further narrowed date ranges to specific reigns mentioned in the inscription. For example, if a king was mentioned by name, the date range was adjusted to the known dates of that king's reign. If a king's posthumous name (the name the king was referred to after death) was given, Lustig determined that inscription must have been written after his death and she adjusted the date range accordingly. Lustig considered dates with ranges "uncertain." For the statistical analysis, we are interested in identifying the date most consistent with the current attributes of the temple. For example, it is possible that a sandstone temple from the 11 th century CE was built on top of a small shrine dating to a much earlier period. If our attribute data for that temple represents the construction in the 11 th century CE, we are interested in associating the temple with the 11 th century CE date for the statistical analysis, regardless of whether there was an earlier foundation. We acknowledge that not being able to identify earlier foundations is a limitation in our study; however, at the present time the only way to identify earlier foundations is through extensive excavations at each temple site. Which would be prohibitively expensive and time consuming. Multiple periods of occupation are added to the model of urban development after we have conducted the statistical analysis. Where the date listed for each inscription or lintel was a range, we opted to use the median of the range.
The second source of temple dates are lintels. Many smaller temples, that did not have inscriptions, had carved lintels. Lintels depict scenes from mythology and decorative motifs. The styles of the lintels changed over time and Polkinghorne was able to compare these styles with styles from temples with known dates to determine a chronology of lintels. Polkinghorne also used the designation of "certain" vs. "uncertain" for lintel dates based on multiple lines of evidence, including the inscriptional data [17] (S1 Table) Temples with multiple inscriptions and lintel dates were dated as follows with "certain" dates always prioritized over "uncertain" dates: a. If there was only one lintel or inscription date, we used the date.
b. If there are multiple inscription and/or lintel dates that were within 65 years of each other, we used the median of the dates. 65 years is somewhat arbitrary, but we argue that it accounts for variability in the original data, especially when dates were assigned based on the reigns of kings.
c. If there are no inscription or lintel dates, we use the dates found through literature searches.
d. When there are conflicting dates from the inscriptions and lintels where literature searches did not help, we prioritized the dates in the following order: lintels (certain), inscriptions (certain), lintels (uncertain), and inscriptions (uncertain). These have multiple periods of occupation.
In addition to the temple dates, there are six measures of similarity, or attributes, for each temple: 1) presence or absence of a primary reservoir (coded by Klassen) (Fig 5); 2) Building Materials (sandstone, pink sandstone, laterite, brick, thmaphom or mountain stone) (from database created by Evans); 3) azimuth (calculated by Klassen) (Fig 6); 4) area (calculated by Klassen) (Fig 7); 5) mound morphology (square, horseshoe (east), horseshoe (west), horseshoe (northern), two causeways, four causeways, blob, and undetermined) (coded by Klassen) (Fig  8); 6) presence or absence of a moat (coded by Klassen) (Fig 9) (S2 Table). These attributes represent the current data we have on the temples. In total, there are 11 attributes because we record the presence or absence of each building material. Building materials have been recorded by dozens of archaeological surveys over the last century. The surveys were conducted for different reasons and focused on collecting different types of data; however, one of the consistent elements that was recorded was the building material of the temple. We compiled this data into a consistent framework. We then contributed to the data set by using geographic information systems analyses and remote sensing data to quantify attributes of the temples, like their size, orientation, and morphology. This was a very cost effective method of amassing great deal of information on the temples without the need for fieldwork. We reason that things like the style of temple morphology (including morphology, presence or absense of moat and primary reservoir, etc.) may reflect changes in style over time. However, while statistical analyses have been done with each feature in isolation, no chronological patterns had independently emerged from these datasets. We did not use geographic coordinates or relative spatial data as metrics of similarity in this study. Meaning, we are not auto-correlating temples based on their geographic proximity to other temples.

Methods
To determine the foundation date of otherwise undated temples, we conducted k-means clustering, discriminant function analysis, multiple linear regression analysis, principal component analysis regression, and graph-based semi-supervised machine learning to determine if any morphological or architectural features were strong predictors of the temple dates. These analyses incorporate a variety of methods that produce either continuous-change (regression) or phase models (classification). Continuous values can be grouped into historical periods so that either technique will suffice for our purposes. Because we are interested in dividing the temples by century, each modeling approach was assessed on its ability to accurately predict the correct time period for temples with known dates. Classifications were considered satisfactory if they could successfully group temples with known dates with other temples from the  same 100-year range and regressions were considered ideal if they could predict dates for temples with an AAE of 50 years or less and successful if they could predict dates for temples with an AAE of 75 years or less. For these analyses, we introduced dummy variables to represent categorical data (all attributes except azimuth). Dummy variables are independent variables that represent categorical or nominal variables and are coded to allow for statistical analyses [23].
We used a variety of initial methods including k-means clustering, discriminant function analysis and principal component analysis. These methods, however, did not yield results that allowed us to date the temples accurately (see S1 File).
Multiple regression analysis. Multiple linear regression analysis determines the relationship between a single dependent variable (temple date) and multiple independent variables. Linear regression is designed to perform well when there aer linear or nearly linear relationships in the data. This assumption is common in statistical modeling and holds approximately in a wide variety of applied situations [24]. Linear regression does not work well when data are grouped in clusters or when there is no clear linear relationship. Multiple linear regression is often used to identify constituent components in archaeological collections. For example, the technique has been used to determine periods of occupation from ceramic assemblages [25].
We fitted a multiple linear regression model with the all of the temple attributes. One limitation of multiple linear regression is that it cannot process temples with missing pieces of data. For example, if there is no known azimuth for a temple, the temple cannot be included in the analysis. Removing temples with missing data reduces the number of temples with known dates and complete datasets to 16. If we remove the pedestal type from the analysis, the number of temples included in the analysis increases to 73. As such, we chose to remove pedestal types from the analysis. The results from the linear model including all temple data except pedestals was statistically significant (R squared = 0.5892, adjusted R squared = 0.4883, F = 5.84, p = 0.00). The AAE in the predicted values from a leave-one-out cross evaluation is 60 years. Unfortunately, this method requires complete datasets. Much of our data is incomplete in the elements that were recorded during pedestrian survey or mapped using remotely sensed data. As such, we could only use the model to predict dates for approximately half of the sample (755 of 1437 temples).
Graph-based SSL (GSSL). Graph-based SSL (GSSL) constructs a graph from training data to understand the underlying structure and relationships in the data [11]. A graph is a collection of mathematical objects with vertices connected by edges. In GSSL models, each vertex is a labeled or unlabeled data point in the training dataset. The number of vertices in the graph Semi-supervised machine learning for archaeology is determined by the total number of data points, and the number of edges is at most the square of the number of data points. The weights of the edges are determined by the amount of similarity between the two data points. In general, graph-based approaches do not extend predictions to data beyond the sample used in the graph [10].
GSSL works best when the labels between data points vary smoothly across the graph and when data points with large edge weight have the same or similar labels [11] and have the same distribution [26]. Similarly, GSSL is expected to underperform for data at either end of the range because the procedure attempts to intelligently "average" the known labels in the dataset. As a result, the procedure will never assign a date outside the range of the dates present in the original labeled set. Hence, if we remove the earliest or latest temple from the sample, it is impossible for it to be assigned the correct label in a k-fold hold out procedure (for further information, see S1 File).

Results: Combining the results of multiple linear regression and Graph-based semi-supervised machine learning (GSSL)
In this paper, we explore several statistical approaches that fall under supervised or unsupervised paradigms. In the case-study, there are 1332 undated temples (non-labeled data points) and 105 dated temples (labeled data points). Seriation like k-means clustering is unsupervised and uses data from all the temples but does not incorporate the known dates in the analysis. In contrast, MLR is supervised and uses the known dates to determine the algorithm, but is limited to approximately 10% of the dataset and could only predict dates for approximately half of the dataset [6]. As a result, none of the analyses took full advantage of the dataset using information from both the labeled and unlabeled data to improve the algorithms. Since collecting data for all the undated temples, using excavation and traditional dating methods, would be prohibitively costly and time-consuming, a semi-supervised paradigm was a natural approach for our analysis to predict dates for the remaining temples that could not be dated using multiple linear regression. However, the GSSL model had a higher AAE than the multiple linear regression. As a result, we decided to merge the results from the GSSL and the MLR to combine the benefits of both approaches and determine estimated errors for different types of temples.
We expect GSSL performance to be worse for temples with incomplete data and for temples very dissimilar from all other temples. To test this hypothesis, we classified temples as either "well-specified" or "under-specified." Any temple with more than five null attributes was classified as "under-specified." A temple was also called "under-specified" if there was no other temple with which it had a similarity of at least 10. For GSSL, "well-specified" temples had a 65-year AAE, and "under-specified" temples had a 92-year AAE. This analysis demonstrates the importance of complete datasets. We expect that the results can be improved in the future with a more-complete dataset. For the MLR, "well-specified" temples had a 60-year AAE, and "under-specified" temples had a 55-year AAE; however, dates were only predicted for 34 "under-specified" temples.
GSSL is also expected to underperform in predicting dates at either end of the range. In our sample, temples with known dates from 690-820 CE had an AAE of 137 years later than their true date from the GSSL predictions and 129 for the MLR predictions. Temples with known dates from 1150-1308 CE had an AAE of 132 years before their true date from the GSSL predictions and 92 for the MLR predictions. Temples with known dates from 821-1149 CE had an AAE of 56 years from the GSSL predictions and an AAE of 50 for the MLR predictions. In all scenarios, the MLR has lower AAE than the results of the GSSL. As a result, we chose to use the MLR predictions, where possible, and use the results from the GSSL for the remainder of the analysis (S3 Table).
In Fig 10, we plotted the results from the analysis using the bchron tool in R used for calibrating radiocarbon dates using a normal calibration curve to account for the estimated absolute error. We plotted the results as follows for the GSSL dates: "well-specified" temples between 821-1149 CE, 49 years AAE; "well-specified" temples before 820 or after 1150 CE, 130 years AAE; "under-specified" temples between 821-1149 CE, 66 years AAE; "under-specified" temples before 820 or after 1150 CE, 139 years AAE. We plotted the results as follows for the MLR dates: "well-specified" temples between 821-1149 CE, 49 years AAE; "well-specified" temples before 820 or after 1150 CE, 107 years AAE; "under-specified" temples between 821-1149 CE, 57 years AAE; "under-specified" temples before 820 or after 1150 CE, 50 years AAE. Notably, the GSSL and MLR have the same AAE for "well-specified" temples between 821-1149 CE. For temples with known dates, we used the true date, rather than the inferred date, and included multiple occupation periods if there were separated by at least 20 years.
The results of the analysis suggest an increase in the number of temples founded until 11 th century CE. After this, there is a decline in the number of new constructions through the 12 th century CE and very few subsequent temple foundations. This pattern is noted with the disclaimer that GSSL methods tend to replicate the distribution that exists in the originally labeled dataset since it replicates the distribution of the original dataset to propagate known labels to the unknown set. It is possible that we are underestimating the number of temples in the first and last periods if the original set of labels also underestimated the proportion of temples from those time periods. We argue it is unlikely we are underestimating the number of temples with inscriptions for each period in our labeled dataset. We base this argument on the assumption that most of the inscriptions from Khmer temples have been identified and inventoried, representing an accurate distribution. Within our dataset, there were 12 temples with inscriptions from multiple periods. In six of these instances, there was one date from the period 821-1149 CE that was not used in favor of an earlier or later date. Only two dates before 821 CE were not used in the model in favor of a later date and only three dates after 1149 CE were not used in the model in favor of an earlier date. One limitation of our study is that Polkinghorne's database is constrained to lintels dating from before 1100 CE. As such, it is possible that temples that were dated by their lintels do not represent the entire distribution of temples across the landscape because of sampling bias in the original study.

Discussion and conclusion
In the absence of detailed chronological models, the working assumption has always been that essentially all of the temples we see on the landscape were operational at the pinnacle of