The authors have declared that no competing interests exist.
Analyzed the data: RN SDG. Contributed reagents/materials/analysis tools: RN JDB SDG. Wrote the paper: JDB RN SDG. Data collection: AB JDB.
The Pacific coast of the Tohoku region of Japan experiences repeated tsunamis, with the most recent events having occurred in 1896, 1933, 1960, and 2011. These events have caused great loss of life and extensive damage throughout the coastal region. There is uncertainty about the degree to which seawalls reduce deaths and building damage during tsunamis in Japan. On the one hand, they provide physical protection against tsunamis as long as they are not overtopped and do not fail. On the other hand, the presence of a seawall may induce a false sense of security, encouraging additional development behind the seawall and reducing evacuation rates during an event. We analyze municipality-level and submunicipality-level data on the impacts of the 1896, 1933, 1960, and 2011 tsunamis, finding that seawalls larger than 5 m in height have generally served a protective role in these past events, reducing both death rates and the damage rates of residential buildings. However, seawalls smaller than 5 m in height appear to have encouraged development in vulnerable areas and exacerbated damage. We also find that the extent of flooding is a critical factor in estimating both death rates and building damage rates, suggesting that additional measures, such as multiple lines of defense and elevating topography, may have significant benefits in reducing the impacts of tsunamis. Moreover, the area of coastal forests was found to be inversely related to death and destruction rates, indicating that forests either mitigated the impacts of these tsunamis or displaced development that would otherwise have been damaged.
Over the past century, the Pacific coast of Japan’s Tohoku region (
In addition, the municipalities of Shiogama, Tagajo, Rifu, Shichigahama, and Matsushima lie between Sendai and Higashimatsushima. Figure created by the authors using prefecture boundary data from [
After the 1933 event, a limited number of cities in Iwate Prefecture constructed “hard” tsunami defense structures along the coast specifically to protect lives and property from tsunamis. After the 1960 event, the number of projects for this purpose increased rapidly [
The Japanese government’s reconstruction plan [
However, detractors conjecture that hard coastal structures cause a sense of complacency in residents [
Contrarily, Tomita et al. [
[Table: data sources by event and prefecture. Columns: 1896 Miyagi, 1896 Iwate, 1933 Miyagi, 1933 Iwate, 1960 Miyagi, 1960 Iwate, 2011 Miyagi, 2011 Iwate, 2011 submunicipal. Entries are citation numbers, except for one row marked N/A in the first five columns and a final row sourced from Google Earth for the three 2011 columns.]
In this study, the number “dead” includes both those reported as “dead” and those reported as “missing”. Dwellings “destroyed” is the sum of the dwellings reported as “swept away”, “collapsed”, and “completely damaged”. “Death rate” or “mortality” is defined as the number dead in each municipality (or submunicipality) divided by the total population of that municipality (or submunicipality). Likewise, “damage rate” is the number of dwellings destroyed in the municipality divided by the total number of dwellings in the municipality. Other studies (e.g., [
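The rate definitions above can be written out as a short Python sketch. The record layout and field names are hypothetical (they are not taken from the study's dataset), the numbers are invented, and the rates are expressed in percent to match the summary tables.

```python
from dataclasses import dataclass


@dataclass
class MunicipalityRecord:
    """Hypothetical record layout; field names are illustrative only."""
    population: int
    dead: int
    missing: int
    dwellings: int
    swept_away: int
    collapsed: int
    completely_damaged: int


def death_rate(rec: MunicipalityRecord) -> float:
    """Percent of the population reported dead or missing."""
    return 100.0 * (rec.dead + rec.missing) / rec.population


def damage_rate(rec: MunicipalityRecord) -> float:
    """Percent of dwellings swept away, collapsed, or completely damaged."""
    destroyed = rec.swept_away + rec.collapsed + rec.completely_damaged
    return 100.0 * destroyed / rec.dwellings


# Invented numbers, for illustration only.
example = MunicipalityRecord(
    population=2000, dead=30, missing=10,
    dwellings=400, swept_away=50, collapsed=20, completely_damaged=10,
)
```

With these invented counts, 40 of 2000 residents are dead or missing and 80 of 400 dwellings are destroyed.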
In
In addition to census, municipal, and damage data, topography is listed in
Histograms of total death and destruction rates are shown in
The red dashed lines represent kernel density plots of death and destruction rates respectively.
The empirical cumulative distribution function (ECDF) plots of total death and destruction rates for each prefecture are shown in
Variable  Prefecture  Min.  1st Q.  Median  Mean  3rd Q.  Max.

Destruction Rate (%)  Miyagi  0.00  2.57  20.74  31.72  51.45  100.00 
Destruction Rate (%)  Iwate  0.00  2.82  20.63  24.24  38.40  92.39 
Death Rate (%)  Miyagi  0.00  0.00  0.67  3.96  4.80  29.23 
Death Rate (%)  Iwate  0.00  0.13  1.29  9.50  7.63  81.62 
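As a rough illustration of how summary statistics and an ECDF of this kind can be computed, the following sketch uses only the Python standard library. The function names are assumptions, and the quartile convention (`method="inclusive"`) is one of several in common use, so it may not reproduce the values in the table exactly.

```python
import bisect
import statistics


def six_number_summary(values):
    """Return (min, Q1, median, mean, Q3, max) for a list of rates."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, statistics.fmean(values), q3, max(values))


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many observations are <= x
        return bisect.bisect_right(ordered, x) / n

    return F


# Invented rates (%), for illustration only.
rates = [0.0, 10.0, 20.0, 30.0, 100.0]
summary = six_number_summary(rates)
F = ecdf(rates)
```

Calling `F(20.0)` then gives the share of municipalities with a rate of at most 20%.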
It can be seen both from
Furthermore, it is interesting to compare the seawall characteristics and tsunami heights for each prefecture. As can be seen in
The bubble charts in
Conditional density plots describe how the conditional distribution of a given categorical response variable changes as the explanatory variable changes. In
Here the response variables are binary, representing whether death and damage rates are above or below their median values. For a given seawall height, a greater proportion of red means lower death or damage rates, while a greater proportion of gray means higher death or damage rates.
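A conditional density plot smooths over the explanatory variable. A simpler binned analogue conveys the same idea and is sketched below. The bin edges, function names, and data are assumptions made for illustration, not values from the study.

```python
import statistics


def above_median_flags(rates):
    """Binary response: True where a rate exceeds the sample median."""
    med = statistics.median(rates)
    return [r > med for r in rates]


def conditional_proportions(seawall_heights, flags, bin_edges):
    """Share of above-median observations within each seawall-height bin.

    Bins are half-open [lo, hi); empty bins map to None.
    """
    result = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [f for h, f in zip(seawall_heights, flags) if lo <= h < hi]
        result[(lo, hi)] = sum(in_bin) / len(in_bin) if in_bin else None
    return result


# Invented data: seawall heights (m) and damage rates (%).
heights = [0, 1, 2, 6, 7, 8]
rates = [30, 40, 50, 1, 2, 3]
proportions = conditional_proportions(
    heights, above_median_flags(rates), bin_edges=[0, 5, 10]
)
```

In this toy example every observation behind a wall lower than 5 m has an above-median rate, and none behind a higher wall does.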
A brief review of statistical learning methods relevant to analyzing the tsunami data is presented in this section. We begin by stating the distinction between supervised and unsupervised learning methods and discuss different classes of supervised learning methods, namely parametric, semiparametric, and nonparametric methods. We then describe the statistical models used in this paper. We conclude by discussing the bias-variance tradeoff and the tradeoff between predictive accuracy and model interpretability.
Broadly speaking, statistical learning methods refer to a large pool of algorithms and tools used for data analysis. They can be categorized into two groups: supervised learning methods (the focus of this paper) and unsupervised learning methods. Unlike in unsupervised learning methods (ULM), in supervised learning methods (SLM) the observed target variable of interest (e.g., tsunami-induced damage rates in a given year) guides the learning process. Models can be developed to predict the target variable based on a range of input variables (e.g., tsunami height and seawall heights). The ultimate goal is to develop a model that best captures the relationship between the predictors and the response and minimizes the loss function (i.e., a measure of the difference between the observed and predicted values of the target variable).
The target variable of interest can be denoted as
Linear and nonlinear supervised statistical learning methods can be parametric, semiparametric, or nonparametric. In this paper, we develop a range of models to best predict tsunami-induced death and damage rates.
In parametric methods, assumptions are made about the shape of function
The term generalized linear models was coined by Nelder and Wedderburn in the early 1970s [
Semiparametric models lie at the fuzzy boundary between parametric and nonparametric techniques. They offer more flexibility than parametric models and better interpretability than nonparametric models.
Generalized additive models are nonlinear extensions to generalized linear models [
MARS is a semiparametric model that allows for local nonlinearities and interaction effects, which makes it suitable for modeling high-dimensional datasets [
Nonparametric models do not make assumptions about the shape of the function
BART is a Bayesian, tree-based approach. A BART model consists of the summation of
Random forest is a nonparametric, tree-based ensemble data-mining method [
Support vector machines (SVM) are a powerful tool for big data analytics. In contrast to many data-mining methods that use greedy algorithms, SVM is formulated as a constrained optimization problem, does not suffer from local optima, and handles high-dimensional data very well. In SVM regression, the input space is first mapped onto an m-dimensional feature space. A linear model is then constructed in this feature space. In other words, SVM regression involves developing a linear regression in a high-dimensional feature space [
This flexible data-mining technique was developed by [
As mentioned earlier, semiparametric and nonparametric techniques are usually not constrained by the (often unrealistic) assumptions of more restrictive parametric models. While this offers the advantage of better approximating the systematic relationship between
Partial dependencies show the influence of a covariate of interest on the response when the effects of the remaining covariates on the response are averaged out, as shown in the equation below [
In the equation above,
The generalization performance of a statistical model hinges on the model’s capability to yield accurate predictions on an independent test sample. The bias-variance tradeoff is central to minimizing generalization error [
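For reference, the standard textbook form of this tradeoff under squared-error loss is given below. The notation is an assumption introduced here and is not taken from the paper: the response is written as $y_0 = f(x_0) + \varepsilon$ with zero-mean noise of variance $\sigma_\varepsilon^2$, and $\hat{f}$ is the fitted model.

```latex
\mathbb{E}\!\left[\bigl(y_0 - \hat{f}(x_0)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma_\varepsilon^2}_{\text{irreducible error}}
```

More flexible models tend to lower the bias term while raising the variance term, which is why out-of-sample error, not in-sample fit, is the relevant criterion for comparing models.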
This section summarizes the results of our predictive models of damage rate and death rate, respectively, and discusses the importance of various factors such as tsunami heights, seawall heights, and coastal forest areas.
This section summarizes the predictive performance of a series of models trained on our dataset. In these models, the response variable is the damage rate and the independent variables are the year of the event, the city population before the event, municipal area, maximum tsunami height, coastal forest area, presence of a bay-mouth breakwater, maximum and minimum seawall height, flooded area, prefecture, and topography. To deal with missing data in our input variables, we used the Multivariate Imputation by Chained Equations (MICE) algorithm.
As mentioned in the Methods section, we fitted a range of parametric, semiparametric, and nonparametric models to the data, including: generalized linear model (GLM), generalized additive model (GAM), Bayesian additive regression trees (BART), random forest (RF), multivariate adaptive regression splines (MARS), support vector machines (SVM), and gradient boosted trees (GBM).
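The sketch below shows one way out-of-sample MSE and MAE, with standard errors, could be estimated for any candidate model. It rests on assumptions not stated in the text: repeated random holdout splits are used as the validation scheme, and the "Null" model is taken to be a mean-only baseline. All function names and parameter values are hypothetical.

```python
import random
import statistics


def null_model(train_y):
    """Assumed mean-only baseline: ignores all predictors."""
    mean_y = statistics.fmean(train_y)
    return lambda _x: mean_y


def standard_error(values):
    """Standard error of the mean across repetitions."""
    return statistics.stdev(values) / len(values) ** 0.5


def repeated_holdout(xs, ys, fit, n_repeats=30, test_fraction=0.2, seed=0):
    """Return (mean MSE, SE of MSE, mean MAE, SE of MAE) over random splits.

    fit(train_x, train_y) must return a function mapping x to a prediction.
    """
    rng = random.Random(seed)
    n = len(ys)
    n_test = max(1, int(round(test_fraction * n)))
    mses, maes = [], []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [ys[i] - predict(xs[i]) for i in test]
        mses.append(statistics.fmean([e * e for e in errors]))
        maes.append(statistics.fmean([abs(e) for e in errors]))
    return (statistics.fmean(mses), standard_error(mses),
            statistics.fmean(maes), standard_error(maes))


# Invented data, for illustration only.
xs = list(range(20))
ys = [float(i % 7) for i in range(20)]
mse, se_mse, mae, se_mae = repeated_holdout(
    xs, ys, fit=lambda tx, ty: null_model(ty)
)
```

Any other model can be evaluated on the same splits by passing a different `fit` function with the same seed, which keeps the comparison paired.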
Model  MSE  SE (MSE)  MAE  SE (MAE)

Null  813.1  54.9  23.6  0.7 
BART  387.9  24.3  15.4  0.5 
MARS1  526.3  51.2  16.8  0.6 
MARS2  512.6  55.1  16.5  0.6 
MARS3  649.8  136.9  17.0  0.7 
MARS5  690.1  202.7  16.6  0.7 
SVM  582.2  43.2  17.8  0.7 
GBM  755.9  51.4  22.7  0.7 
Random forest (RF) outperformed all other models in terms of out-of-sample predictive accuracy. The differences between the MSE and MAE values of RF and those of all other models were statistically significant (based on the Wilcoxon signed-rank test).
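A minimal sketch of a paired Wilcoxon signed-rank comparison is given below, using the large-sample normal approximation. What was paired in the study is not stated; here the pairs are assumed to be per-split errors of two models, and the function name and data are hypothetical. A statistics package would normally be used in practice.

```python
import math


def wilcoxon_signed_rank(errors_a, errors_b):
    """Paired two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, z, p), where W is the smaller of the positive and negative
    rank sums. Zero differences are dropped and ties get average ranks.
    No tie or continuity correction is applied, so p is approximate.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return w, z, p


# Invented per-split errors: model A is lower than model B on every split.
model_a = [float(i) for i in range(1, 11)]
model_b = [a + i for i, a in enumerate(model_a, start=1)]
w, z, p = wilcoxon_signed_rank(model_a, model_b)
```

Because model A has the smaller error on every invented split, all ranks fall on one side and the test rejects equality of the two error distributions.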
To examine how well our selected final model (RF) fitted the data, we plotted observed destruction rates versus our model’s estimates along with the model’s residuals as shown in
The red dashed lines in the Q-Q plot represent 95% confidence intervals.
Since our best model (RF) is nonparametric, to examine the relationship between each covariate and response we use partial dependency plots as discussed in the Methods section.
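The averaging behind a partial dependence curve can be sketched in a few lines. The `predict` function stands in for any fitted model. The toy linear predictor, covariate names, and grid below are invented for illustration and are unrelated to the fitted RF model.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over the data with `feature` fixed at each grid value.

    predict: callable mapping a dict of covariates to a predicted rate.
    rows:    list of dicts, one per observation.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)        # copy so the data are not mutated
            modified[feature] = value   # hold the covariate of interest fixed
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_predict(row):
    """Invented stand-in for a fitted model; coefficients are arbitrary."""
    return 50.0 - 4.0 * row["seawall_height"] + 2.0 * row["tsunami_height"]


# Invented observations, for illustration only.
rows = [
    {"seawall_height": 2.0, "tsunami_height": 5.0},
    {"seawall_height": 6.0, "tsunami_height": 15.0},
]
curve = partial_dependence(toy_predict, rows, "seawall_height", grid=[0.0, 5.0])
```

Each point on the curve is the mean prediction when every observation is assigned the same seawall height while keeping its own values for the other covariates.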
The red lines represent bootstrapped confidence intervals around model estimates.
The red lines represent bootstrapped confidence intervals around model estimates.
This section summarizes the predictive performance of a series of models fitted to our data. In this case, the response is the death rate and the explanatory variables are the year of the event, number of dwellings before the event, maximum tsunami height, municipal area, coastal forest area, presence of a bay-mouth breakwater, maximum and minimum seawall height, flooded area, prefecture, and topography. To deal with missing data in our input variables, we used the Multivariate Imputation by Chained Equations (MICE) algorithm.
Bayesian additive regression trees (BART) and random forest (RF) outperformed all other models in terms of predictive accuracy. Although the errors are slightly lower for BART, the difference between BART and RF is not statistically significant. It can be seen from
Model  MSE  SE (MSE)  MAE  SE (MAE)

Null  204.0  28.3  8.8  0.4 
BART  
RF  
MARS1  149.8  17.8  7.4  0.3 
MARS2  156.4  19.8  7.4  0.3 
MARS3  139.2  23.8  6.8  0.4 
MARS5  152.6  20.3  6.8  0.4 
SVM  159.7  25.8  5.8  0.4 
GBM  189.9  27.2  8.4  0.4 
The red dashed lines in the Q-Q plot represent 95% confidence intervals.
The red lines represent bootstrapped confidence intervals around model estimates.
The red lines represent bootstrapped confidence intervals around model estimates.
The significant outcome of this work is seen in the partial dependence of mortality and destruction rate on seawall height and coastal forest area (Figs
In order to concentrate on the effectiveness of man-made coastal defense structures in mitigating death and damage, the present work neglects the importance of other factors. [
(XLSX)
The authors thank Tom Logan, Nobuo Shuto, Anawat Suppasri, Carine Yi, and Midori (Katie) Saito for advice and assistance in data collection. Funding for this project was provided by the NSF SEES grant #1215872, entitled “Sustainable Infrastructure Planning”, the NSF Hazard SEES Type 2 grant #1331399, entitled “Modeling to Promote Regional Resilience to Repeated Heat Waves and Hurricanes”, and the Japan Society for the Promotion of Science, JSPS-NSF Cooperative Program for Interdisciplinary Joint Research Projects in Hazards and Disasters, project entitled “Evolution of Urban Regions in Response to Recurring Disasters”.