% Template for PLoS
% Version 1.0 January 2009
%
% To compile to pdf, run:
% latex plos.template
% bibtex plos.template
% latex plos.template
% latex plos.template
% dvipdf plos.template
\documentclass[10pt]{article}
% amsmath package, useful for mathematical formulas
\usepackage{amsmath}
% amssymb package, useful for mathematical symbols
\usepackage{amssymb}
% graphicx package, useful for including eps and pdf graphics
% include graphics with the command \includegraphics
\usepackage{graphicx}
%\usepackage{float}
% characters
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
% cite package, to clean up citations in the main text. Do not remove.
\usepackage{cite}
\usepackage{color}
% Use doublespacing - comment out for single spacing
\usepackage{setspace}
\doublespacing
% Tables
\usepackage{multirow}
\usepackage{slashbox}
\usepackage{array}
\usepackage{colortbl}
% Appendix
\usepackage{appendix}
% Line numbering
\usepackage[left]{lineno}
% Text layout
\topmargin 0.0cm
\oddsidemargin 0.5cm
\evensidemargin 0.5cm
\textwidth 16cm
\textheight 21cm
% Bold the 'Figure #' in the caption and separate it with a period
% Captions will be left justified
\usepackage[labelfont=bf,labelsep=period,justification=raggedright]{caption}
% Use the PLoS provided bibtex style
\bibliographystyle{plos2009}
% Remove brackets from numbering in List of References
\makeatletter
\renewcommand{\@biblabel}[1]{\quad#1.}
\makeatother
% Leave date blank
\date{}
\pagestyle{myheadings}
%% ** EDIT HERE **
\renewcommand{\tablename}{Table A}
\renewcommand{\refname}{References S1}
%% ** EDIT HERE **
%% PLEASE INCLUDE ALL MACROS BELOW
\usepackage{lscape}
%% END MACROS SECTION
\begin{document}
\newcounter{S}[section]
\section*{S1: Dealing with conditional trees instability}
Among methods designed to build classification trees, conditional trees \cite{HothornT.2006} avoid problems of overfitting, variable selection bias, and sensitivity to the scaling of input parameters. Variable selection and splitting are performed in two successive steps by means of hypothesis testing. For variable selection, the null hypothesis $H_{0}^{j}:P(Y|X_{j})=P(Y)$ is tested with a permutation test for each parameter $X_{j}$; that is, the null hypothesis is that knowing $X_{j}$ does not increase our knowledge of $Y$. If this hypothesis can be rejected for at least one parameter, the parameter $X_{j}$ with the strongest association to $Y$ is selected for splitting. Candidate split values for $X_{j}$, each dividing its domain of variation into two subsets $A$ and $\bar{A}$, are then tested. The selected split value is the one maximizing a test statistic measuring the discrepancy between $Y|X_{j} \in A$ and $Y|X_{j} \in \bar{A}$. This method ensures that the resulting tree is optimal with respect to the chosen criteria, so no pruning step is needed once the tree has been built.
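The two steps above can be sketched as follows. This is a minimal illustration, not the actual algorithm of \cite{HothornT.2006}: it assumes numeric parameters, binary labels, and the absolute difference of class means as both the association and the discrepancy statistic.

```python
import random

def permutation_p_value(x, y, n_perm=500, seed=0):
    # P-value for H0: P(Y|X_j) = P(Y), estimated by permuting the labels
    # and comparing the permuted statistic to the observed one.
    # Test statistic: absolute difference of the class means of x.
    rng = random.Random(seed)

    def stat(labels):
        g0 = [xi for xi, yi in zip(x, labels) if yi == 0]
        g1 = [xi for xi, yi in zip(x, labels) if yi == 1]
        return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

    observed = stat(y)
    perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if stat(perm) >= observed:
            exceed += 1
    return exceed / n_perm

def select_split(X, y, alpha=0.05):
    # X is a list of parameter columns. If H0 can be rejected for at
    # least one parameter, pick the one with the smallest p-value
    # (strongest association), then the split value maximizing the
    # discrepancy between Y|X_j in A and Y|X_j in the complement of A.
    pvals = [permutation_p_value(col, y) for col in X]
    j = min(range(len(X)), key=lambda k: pvals[k])
    if pvals[j] > alpha:
        return None  # no parameter is significantly associated with y

    def discrepancy(delta):
        left = [yi for xi, yi in zip(X[j], y) if xi <= delta]
        right = [yi for xi, yi in zip(X[j], y) if xi > delta]
        if not left or not right:
            return -1.0
        return abs(sum(left) / len(left) - sum(right) / len(right))

    return j, max(set(X[j]), key=discrepancy)
```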
One of the greatest issues when using classification trees is their instability: the tree structure as well as the split values can change greatly when the dataset used for classification is only slightly modified. Methods such as bagging \cite{Breiman1994,Breiman1996}, random forests \cite{Ho1995} or gradient boosting \cite{Friedman1999,Friedman1999a} deal with tree instability and increase prediction accuracy. These methods aggregate a large number of trees built on resampled data, but the characteristics of each individual tree are lost.
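As an illustration of why individual tree characteristics are lost, a minimal bagging sketch (\texttt{build\_tree} and \texttt{predict} are hypothetical stand-ins for a tree-construction routine and its prediction function):

```python
import random
from collections import Counter

def bagging_predict(training_set, build_tree, predict, x, n_trees=100, seed=0):
    # Train one tree per bootstrap resample and aggregate their
    # predictions by majority vote; only the vote survives, while the
    # trees themselves (structures, split values) are discarded.
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        boot = [rng.choice(training_set) for _ in range(len(training_set))]
        votes.append(predict(build_tree(boot), x))
    return Counter(votes).most_common(1)[0][0]
```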
To estimate tree stability while keeping tree characteristics, we first build 500 replicates of the tree, each from a sample containing 95\% of the simulations in our training set (i.e. 9,500 simulations). We then assess the similarity between trees by computing the mean pairwise distance over the 500 trees, using the method of \cite{BriandB.2009}.
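The replicate-building procedure can be sketched as follows; \texttt{build\_tree} and \texttt{tree\_distance} are hypothetical stand-ins for the tree-construction routine and the distance of \cite{BriandB.2009}.

```python
import itertools
import random

def mean_pairwise_distance(training_set, build_tree, tree_distance,
                           n_replicates=500, frac=0.95, seed=0):
    # Build each replicate tree from a random subsample containing
    # `frac` of the training set, then average the distance over all
    # pairs of replicate trees.
    rng = random.Random(seed)
    m = int(frac * len(training_set))
    trees = [build_tree(rng.sample(training_set, m))
             for _ in range(n_replicates)]
    pairs = list(itertools.combinations(trees, 2))
    return sum(tree_distance(a, b) for a, b in pairs) / len(pairs)
```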
The distance $D(A_{1},A_{2})$ between two trees $A_{1}$ and $A_{2}$ is defined as:
\[
D(A_{1},A_{2})=
\begin{cases}
1 & \text{if } A_{1} \text{ and } A_{2} \text{ have different structures,}\\
\dfrac{1}{T}\displaystyle\sum_{t=1}^{T} d(A_{1},A_{2})_{t} & \text{if } A_{1} \text{ and } A_{2} \text{ have similar structures,}
\end{cases}
\]
where $t$ denotes an inner node of the tree, $T$ the total number of inner nodes, and $d(A_{1},A_{2})_{t}$ the distance between trees $A_{1}$ and $A_{2}$ at inner node $t$. Two trees $A_{1}$ and $A_{2}$ have similar structures if they have the same set of nodes at similar locations. For trees with similar structures, the distance at node $t$ is computed as:
\[
d(A_{1},A_{2})_{t}=
\begin{cases}
1 & \text{if } A_{1} \text{ and } A_{2} \text{ do not split on the same parameter at node } t,\\
\dfrac{\vert\delta_{1}-\delta_{2}\vert}{\mathrm{range}(X_{k})} & \text{if } A_{1} \text{ and } A_{2} \text{ both split on parameter } X_{k} \text{ at node } t,
\end{cases}
\]
where $\delta_{1}$ and $\delta_{2}$ are the split values for parameter $X_{k}$ at node $t$ in trees $A_{1}$ and $A_{2}$, respectively, and range($X_{k}$) denotes the range of values that parameter $X_{k}$ can take. A small mean distance between trees indicates good tree stability.
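A minimal implementation of this distance, under the assumption that each tree is encoded as a dictionary mapping inner-node positions (e.g. \texttt{"root"}, \texttt{"root.left"}) to (parameter, split value) pairs, and that \texttt{ranges} maps each parameter to its range of variation:

```python
def node_distance(node1, node2, ranges):
    # d(A1, A2)_t: 1 if the two nodes split on different parameters,
    # otherwise the absolute difference of split values normalized by
    # the range of the shared parameter.
    param1, split1 = node1
    param2, split2 = node2
    if param1 != param2:
        return 1.0
    return abs(split1 - split2) / ranges[param1]

def tree_distance(tree1, tree2, ranges):
    # D(A1, A2): 1 if the trees have different structures (different
    # sets of inner-node positions), else the mean node distance over
    # all T inner nodes.
    if set(tree1) != set(tree2):
        return 1.0
    return sum(node_distance(tree1[t], tree2[t], ranges)
               for t in tree1) / len(tree1)
```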
This measure of distance between trees quantifies tree stability but does not indicate whether a particular tree structure can be selected for building subtrees. To determine whether the tree built from the entire training dataset is stable, we count how many of the 500 replicate trees are of the same type as this tree, where two trees are of the same type if they have the same structure and split on the same parameter at each node. If the tree built from the whole dataset corresponds to the most common tree type among the 500 replicates, we deem it stable enough to build subtrees.
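Assuming each tree is encoded as a dictionary mapping inner-node positions to (parameter, split value) pairs, the tree-type check can be sketched as:

```python
from collections import Counter

def tree_type(tree):
    # Type signature of a tree: its node positions together with the
    # split parameter (but not the split value) at each node.
    return tuple(sorted((pos, param) for pos, (param, _) in tree.items()))

def is_stable(full_tree, replicate_trees):
    # The full-data tree is deemed stable if its type is the most
    # common type among the replicate trees.
    counts = Counter(tree_type(t) for t in replicate_trees)
    modal_type = counts.most_common(1)[0][0]
    return tree_type(full_tree) == modal_type
```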
One asset of sorting trees by type is that, for each inner node of a given tree type, the mean and standard deviation of the split value can be computed. We can therefore obtain a mean tree for each tree type and assess its variability, so as to estimate which split values would be most robust to uncertainties.
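Computing the mean tree of a type can be sketched as follows (same assumed encoding: a tree is a dictionary mapping each inner-node position to a (parameter, split value) pair):

```python
from statistics import mean, stdev

def mean_tree(trees):
    # Given trees of the same type (same structure, same split
    # parameter at each node), return for each node the parameter
    # together with the mean and standard deviation of its split
    # values across the trees.
    summary = {}
    for pos in trees[0]:
        param = trees[0][pos][0]
        values = [t[pos][1] for t in trees]
        summary[pos] = (param, mean(values), stdev(values))
    return summary
```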
%\section*{References}
% The bibtex filename
\bibliography{BiblioThese}
\end{document}