The authors of this manuscript have read the journal’s policy and have the following competing interests: AB, FK, and KRM have a pending patent application:
Conceived and designed the experiments: SB AB GM WS FK KRM. Performed the experiments: SB AB GM. Wrote the paper: AB SB WS GM KRM FK. Conceived the theoretical framework: AB GM SB WS. Revised the manuscript: KRM WS GM AB SB. Figure design: WS SB GM AB FK. Wrote code for MNIST: SB GM. Wrote code for ILSVRC: AB. Wrote code for BOW: SB AB. Image search and generation: AB SB GM.
Understanding and interpreting the classification decisions of automated image classification systems is of high value in many applications, as it allows one to verify the reasoning of the system and provides additional information to the human expert. Although machine learning methods solve a plethora of tasks very successfully, in most cases they have the disadvantage of acting as a black box, not providing any information about what made them arrive at a particular decision. This work proposes a general solution to the problem of understanding classification decisions by pixelwise decomposition of nonlinear classifiers. We introduce a methodology that allows the contributions of single pixels to predictions to be visualized for kernel-based classifiers over Bag of Words features and for multilayered neural networks. These pixel contributions can be visualized as heatmaps and provided to a human expert, who can intuitively not only verify the validity of the classification decision, but also focus further analysis on regions of potential interest. We evaluate our method for classifiers trained on PASCAL VOC 2009 images, synthetic image data containing geometric shapes, the MNIST handwritten digits data set, and the pretrained ImageNet model available as part of the Caffe open source package.
Classification of images has become a key ingredient in many computer vision applications, e.g. image search [
This lack of interpretability is due to the nonlinearity of the various mappings that process the raw image pixels into their feature representation and from that to the final classifier function. This is a considerable drawback in classification applications, as it prevents human experts from carefully verifying the classification decision. A simple
In this work, we aim to close the gap between classification and interpretability both for multilayered neural networks and for Bag of Words (BoW) models over nonlinear kernels, two classes of predictors that enjoy popularity in computer vision. We will consider both types of predictors in a generic sense, avoiding whenever possible a priori restrictions to specific algorithms or mappings. For the first part, the Bag of Words models will be treated as an aggregation of nonlinear mappings over local features in an image, which includes a number of popular mapping methods such as Fisher vectors [
The next Section
The overall idea of pixelwise decomposition is to understand the contribution of a single pixel of an image
In the classification step the image is converted to a feature vector representation and a classifier is applied to assign the image to a given category, e.g., “cat” or “no cat”. Note that the computation of the feature vector usually involves the usage of several intermediate representations. Our method decomposes the classification output
In this paper we propose a novel concept we denote as
We will introduce layerwise relevance propagation as a concept defined by a set of constraints. Any solution satisfying the constraints will be considered to follow the concept of layerwise relevance propagation. In later sections we will then derive solutions for two particular classifier architectures and evaluate these solutions experimentally for their meaningfulness. Layerwise relevance propagation in its general form assumes that the classifier can be decomposed into several layers of computation. Such layers can be parts of the feature extraction from the image or parts of a classification algorithm run on the computed features. As shown later, this is possible for Bag of Words features with nonlinear SVMs as well as for neural networks.
The first layer consists of the inputs, the pixels of the image; the last layer is the real-valued prediction output of the classifier
Iterating
We give here a simple counterexample. Suppose we have one layer. The inputs are
Let us discuss a more meaningful way of defining layerwise relevance propagation. For this example we define
The above example gives furthermore an intuition about what relevance
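For the linear case, such a decomposition can be written out directly. The following sketch uses arbitrary illustrative values (not taken from the text) and one relevance definition consistent with the conservation constraint: each dimension receives its weighted input plus an equal share of the bias.

```python
import numpy as np

# Illustrative linear classifier f(x) = b + sum_d alpha_d * x_d
# (arbitrary example values, not taken from the text).
alpha = np.array([0.5, -1.0, 2.0])
b = 0.25
x = np.array([1.0, 2.0, -0.5])

f = b + alpha @ x

# One relevance definition satisfying the conservation constraint:
# each dimension gets its weighted input plus an equal share of the bias.
d = len(x)
R = alpha * x + b / d

# The relevances sum up exactly to the prediction f(x).
print(bool(np.isclose(R.sum(), f)))
```

Summing the relevances recovers the prediction exactly, which is the defining property any admissible decomposition must satisfy.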
We give a second, more graphic and nonlinear, example. The left panel of
The top layer consists of one output neuron, indexed by 7. For each neuron
In general, this condition can be expressed as:
Now we can derive an explicit formula for layerwise relevance propagation for our example by defining the messages
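The message definition and both conservation constraints can be checked numerically. The sketch below uses arbitrary positive activations and weights (not the values of the network in the figure) and distributes each upper-layer relevance in proportion to the weighted activations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary positive activations a_i, weights w_ij, and upper-layer relevances R_j.
a = rng.uniform(0.1, 1.0, size=4)
w = rng.uniform(0.1, 1.0, size=(4, 3))
R_upper = rng.uniform(0.1, 1.0, size=3)

z = a[:, None] * w        # weighted activations z_ij = a_i * w_ij
z_j = z.sum(axis=0)       # pre-activations z_j = sum_i z_ij

# Messages R_{i<-j} = (z_ij / z_j) * R_j split each R_j in proportion to z_ij.
messages = z / z_j * R_upper

# Node relevance of a lower-layer neuron: sum of its incoming messages.
R_lower = messages.sum(axis=1)

print(np.allclose(messages.sum(axis=0), R_upper))  # outgoing messages sum to R_j
print(np.allclose(R_lower.sum(), R_upper.sum()))   # total relevance is conserved
```

Both checks print True: each upper-layer relevance is fully redistributed, and the layer totals agree.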
While this solution
The
One further property is visible here as well. The formula for the distribution of relevance is applicable to nonlinear and even non-differentiable or non-continuous neuron activations
The relevance conservation property can in principle be supplemented by other constraints that further reduce the set of admissible solutions. For example, one could constrain relevance messages
To summarize, we have introduced layerwise relevance propagation in a feedforward network. In our proposed definition, the total relevance is constrained to be preserved from one layer to another, and the total node relevance must be equal to the sum of all relevance messages incoming to this node and also equal to the sum of all relevance messages outgoing from the same node. It is important to note that the definition is
One alternative approach for achieving a decomposition as in
Several works have been using sensitivity maps [
The blue dots are labeled negatively, the green dots are labeled positively. Left: Local gradient of the classification function at the prediction point. Right: Taylor approximation relative to a root point on the decision boundary. This figure depicts the intuition that a gradient at a prediction point
One technical difficulty is to find a root point
Note that Taylor-type decomposition, when applied to one layer or a subset of layers, can be seen as an approximate way of relevance propagation when the function is highly nonlinear. This holds in particular when it is applied to the output function
Several works have been dedicated to the topic of explaining neural networks, kernel-based classifiers in general, and classifiers over Bag of Words features in particular.
As for neural networks, [
Another approach which lies between partial derivatives at the input point
With respect to sensitivity of kernel-based classifiers to input dimensions, [
We differ from the above works on kernel-based classifiers over Bag of Words features in the following sense: Our methodology is applicable to
Despite recent advances in neural networks, Bag of Words models are still popular for image classification tasks. They have excelled in past competitions on visual concept recognition and ranking such as Pascal VOC [
We will consider here Bag of Words features as an aggregation of nonlinear mappings of local features. All Bag of Words models, no matter whether based on hierarchical clustering [
In the first stage local features are computed across small regions in the image. A local feature such as SIFT [
The computation of statistics can be modeled by a mapping function accepting local feature vectors
Finally, a classifier is applied on top of these features. Our method supports the general class of classifiers based on kernel methods. For brevity we use here an SVM prediction function which results in a prediction function over BoW features
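As an illustration, a kernel SVM prediction of this form can be sketched as follows; the support vectors, coefficients, and the histogram intersection kernel used here are hypothetical toy values.

```python
import numpy as np

def svm_predict(x, support_vectors, alpha, labels, b, kernel):
    """Kernel SVM prediction f(x) = sum_i alpha_i * y_i * k(x_i, x) + b."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alpha, labels, support_vectors)) + b

# A simple kernel for illustration: histogram intersection over BoW vectors.
def hist_intersection(u, v):
    return np.minimum(u, v).sum()

# Hypothetical toy data: three support vectors in a 4-bin BoW space.
sv = np.array([[1., 0., 2., 1.], [0., 3., 1., 0.], [2., 1., 0., 1.]])
alpha = np.array([0.5, 0.3, 0.2])
labels = np.array([1., -1., 1.])
b = -0.1

x = np.array([1., 1., 1., 1.])
print(round(svm_predict(x, sv, alpha, labels, b, hist_intersection), 3))  # prints 1.4
```

Swapping in another kernel function leaves the prediction code unchanged, which mirrors the generic treatment of kernels in this section.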
Notation:
the classifier’s prediction function
BoW representation and class label
root point of Taylor expansion, root point candidates
learned model parameters
kernel function
counter and number of BoW dimensions
(approximate) contribution of a BoW dimension
local feature relevance
pixelwise decompositions per pixel
mapping function between local features
local feature descriptors
the set of unmapped dimensions of a BoW data point
area(·): the set of pixel coordinates covered by a local feature
The main contribution of this part is the formulation of a generic framework for retracing the origins of a decision made by the learned kernel-based classifier function for a BoW feature. This is achieved, in a broad sense as visualized in
Each step taken towards the final pixelwise decomposition has a complementing analogue within the Bag of Words classification pipeline. The calculations used during the pixelwise decomposition process make use of information extracted by those corresponding analogues. Airplane image in the graphic by Pixabay user tpsdave.
In the first step we will use, depending on the type of kernel, either the Taylor-type decomposition strategy or the layerwise relevance propagation strategy. In this first step, relevance scores
In the second step we will apply the layerwise relevance propagation strategy in order to obtain relevance scores
The third step describes the computation of pixelwise scores
The third layer is the BoW feature itself. In the first step we would like to achieve a decomposition of the classifier prediction
The work of [
For the case of a general differentiable kernel we apply the Taylor-type decomposition strategy in order to linearly approximate the dimensional contributions
The second layer consists of the local features extracted from many regions of the image. In the second step we would like to achieve a decomposition of the classifier prediction
For the sake of clarity, we start for now with the case of sum-pooled BoW aggregation, and later extend to a more general formulation for p-means pooling.
As introduced in the context of
The coarse structure of definition
Summing the local feature relevance scores
We can extend this definition to reflect the usage of p-means pooling
The first quotient in
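A generalized p-means pooling can be sketched numerically; this is a minimal illustration (with arbitrary example values), not the paper's exact aggregation formula, and max pooling is approached in the limit of large p.

```python
import numpy as np

def p_means_pool(v, p):
    """Generalized p-means of nonnegative values: ((1/n) * sum v_i^p)^(1/p)."""
    v = np.asarray(v, dtype=float)
    return np.mean(v ** p) ** (1.0 / p)

v = [0.2, 0.5, 0.9, 0.1]

print(bool(np.isclose(p_means_pool(v, 1), np.mean(v))))           # p = 1: average pooling
print(bool(np.isclose(p_means_pool(v, 500), max(v), atol=1e-2)))  # large p approaches max pooling
```

The single parameter p thus interpolates between average-like and max-like pooling of the mapped local features.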
The first layer consists of the pixels of the image. In order to calculate scores for each pixel, we make use of information regarding local feature geometry and location known from the local feature extraction phase at the beginning of the image classification pipeline. The pixel score
For visualization in the sense of color coding, the pixelwise decomposition
Note that by choosing the above normalization scheme, the assumption is made that at least one class is represented within the image. In case the assumption holds, this might lead to prominent local predictions of even weakly projected features, which we found suitable for the purpose of detecting class evidence. If this assumption does not hold, then images may display score artifacts dominated by the set of pixels covered by a small subset of local features which would otherwise be considered input noise. A solution for this problem is global normalization, which uses a maximum over pixels over a set of images instead of a single image. We found that a global normalization scheme can be more appropriate for visualizing the actual decision process of the classifier, as it preserves the relative order of magnitude of local feature scores between pixelwise decomposition tiles. Algorithm 1 gives an overview of how to compute the pixelwise decomposition for classifiers based on Bag of Words features and support vector machines.
Image
Local features
BoW representation
model and mapping parameters
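The two normalization schemes discussed above can be sketched as follows; this is a minimal illustration with arbitrary heatmap values, and the subsequent color coding via the jet color map is omitted.

```python
import numpy as np

def normalize_local(heatmap):
    """Scale one heatmap by its own maximum absolute score into [-1, +1]."""
    m = np.abs(heatmap).max()
    return heatmap / m if m > 0 else heatmap

def normalize_global(heatmaps):
    """Scale a set of heatmaps by one shared maximum absolute score,
    preserving the relative magnitudes between images."""
    m = max(np.abs(h).max() for h in heatmaps)
    return [h / m if m > 0 else h for h in heatmaps]

strong = np.array([[0.5, -2.0], [1.0, 0.0]])
weak = np.array([[0.1, -0.2], [0.05, 0.0]])

# Per-image normalization stretches both heatmaps to the full [-1, +1] range...
print(np.abs(normalize_local(weak)).max())                # prints 1.0
# ...while global normalization keeps the weak heatmap visibly weaker.
print(np.abs(normalize_global([strong, weak])[1]).max())  # prints 0.1
```

The contrast between the two printed values illustrates why global normalization preserves the relative order of magnitude of scores across images, as noted above.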
In order to illustrate the generality of this framework we give some examples for various methods of mapping local features and kernels.
A soft codebook mapping like in [
Considering formulations as in [
The Fisher vector [
Let
Let
We assume one kernel without loss of generality,
The histogram intersection kernel applies to the exact decomposition formula in
The
The Gaussian RBF kernel function is widely used in different communities and is one of the most prominent, if not the most prominent, nonlinear kernel functions. The kernel function is defined as
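A minimal sketch of the Gaussian RBF kernel and its input gradient, as a Taylor-type decomposition would use it; the bandwidth parameter gamma here is an arbitrary illustrative choice.

```python
import numpy as np

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian RBF kernel k(u, v) = exp(-gamma * ||u - v||^2)."""
    diff = np.asarray(u) - np.asarray(v)
    return np.exp(-gamma * diff @ diff)

def rbf_grad(u, v, gamma=0.5):
    """Gradient of k(u, v) with respect to u, as needed by a Taylor-type
    (linearized) decomposition of a kernel classifier around a reference point."""
    diff = np.asarray(u) - np.asarray(v)
    return -2.0 * gamma * diff * rbf_kernel(u, v, gamma)

u = np.array([1.0, 0.0])
v = np.array([0.0, 0.0])

# Check the analytic gradient against a central finite difference in dimension 0.
eps = 1e-6
num = (rbf_kernel(u + [eps, 0], v) - rbf_kernel(u - [eps, 0], v)) / (2 * eps)
print(bool(np.isclose(rbf_grad(u, v)[0], num)))
```

The finite-difference check confirms the closed-form gradient, which is the quantity entering the per-dimension Taylor contributions.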
Multilayer networks are commonly built as a set of interconnected neurons organized in a layerwise manner. When combined, they define a mathematical function that maps the first layer neurons (input) to the last layer neurons (output). We denote each neuron by
Denoting by
A requirement of the Taylor-based decomposition is to find roots
Alternatively, root points can be found by line search on the segment defined by
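Such a line search can be sketched by bisection along the segment, assuming the prediction changes sign between the two endpoints; the classifier f below is a hypothetical linear stand-in.

```python
import numpy as np

def root_on_segment(f, x, x_neg, iters=50):
    """Bisection along the segment between x (with f(x) > 0) and x_neg
    (with f(x_neg) < 0) to find a nearby root point with f(x0) ~ 0."""
    lo, hi = 0.0, 1.0   # parametrize candidate points as x + t * (x_neg - x)
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        if f(x + t * (x_neg - x)) > 0:
            lo = t
        else:
            hi = t
    return x + 0.5 * (lo + hi) * (x_neg - x)

# Hypothetical classifier score: a simple linear function for illustration.
f = lambda z: z @ np.array([1.0, -2.0]) + 0.5

x = np.array([2.0, 0.0])      # f(x) = 2.5 > 0, classified positively
x_neg = np.array([0.0, 1.0])  # f(x_neg) = -1.5 < 0

x0 = root_on_segment(f, x, x_neg)
print(abs(f(x0)) < 1e-6)
```

For a nonlinear classifier, the same bisection applies as long as the sign change along the segment is maintained; the returned root then serves as the reference point of the Taylor expansion.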
As an alternative to Taylor-type decomposition, it is possible to compute relevances at each layer in a backward pass, that is, express relevances
Left: forward pass. Right: backward pass.
The method works as follows: Knowing the relevance of a certain neuron
A drawback of the propagation rule of
An alternative stabilizing method that does not leak relevance consists of treating negative and positive preactivations separately. Let
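Both stabilization variants can be sketched for a single fully connected layer. The activations, weights, and constants below are arbitrary illustrative values, with alpha - beta = 1 so that the alpha-beta rule conserves relevance exactly.

```python
import numpy as np

def lrp_epsilon(a, w, R_upper, eps=0.01):
    """Epsilon rule: stabilize small denominators z_j; some relevance leaks."""
    z = a[:, None] * w                      # z_ij = a_i * w_ij
    z_j = z.sum(axis=0)
    denom = z_j + eps * np.sign(z_j)
    return (z / denom * R_upper).sum(axis=1)

def lrp_alphabeta(a, w, R_upper, alpha=2.0, beta=1.0):
    """Alpha-beta rule: redistribute positive and negative pre-activations
    separately; relevance is conserved exactly when alpha - beta = 1."""
    z = a[:, None] * w
    zp, zn = np.clip(z, 0, None), np.clip(z, None, 0)
    frac_p = zp / zp.sum(axis=0)            # positive shares, sum to 1 per column
    frac_n = zn / zn.sum(axis=0)            # negative shares, sum to 1 per column
    return ((alpha * frac_p - beta * frac_n) * R_upper).sum(axis=1)

a = np.array([1.0, 0.5, 2.0, 1.5])
w = np.array([[ 0.4, -0.3,  0.2],
              [-0.2,  0.5, -0.1],
              [ 0.3, -0.4,  0.6],
              [-0.5,  0.2, -0.3]])
R_upper = np.array([1.0, -0.5, 2.0])

R_ab = lrp_alphabeta(a, w, R_upper)
print(bool(np.isclose(R_ab.sum(), R_upper.sum())))   # conserved exactly

R_eps = lrp_epsilon(a, w, R_upper)
print(bool(abs(R_eps.sum() - R_upper.sum()) > 0))    # a small amount leaks
```

The comparison makes the trade-off concrete: the epsilon rule is simpler but absorbs part of the relevance into the stabilizer, while the alpha-beta rule keeps the layer totals exactly equal.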
Once a rule for relevance propagation has been selected, the overall relevance of each neuron in the lower layer is determined by summing up the relevance coming from all upperlayer neurons in consistence with Eqs (
The above formulas (
We remark again, that even max pooling fits into this structure as a limit of generalized means, see
Finally, it can be seen from the formulas established in this section that layerwise relevance propagation is different from a Taylor series or partial derivatives. Unlike Taylor series, it does not require a second point other than the input image. Layerwise application of the Taylor series can be interpreted as a generic way to achieve an approximate version of layerwise relevance propagation. Similarly, in contrast to any methods relying on derivatives, differentiability or smoothness properties of neuron activations are not a necessary requirement for being able to define formulas which satisfy layerwise relevance propagation. In that sense it is a more general principle.
For Bag of Words features we show two experiments, one on an artificial but easily interpretable data set and one on Pascal VOC images, which have a high compositional complexity. For the artificial data set we apply the Taylor-type strategy for the top layer; for Pascal VOC images we apply the strategy for sum-decomposable kernels for the top layer. In both cases these strategies are combined with our definitions for arbitrary mappings for the lower layers.
For neural networks we also show results on two data sets: two sets of results on MNIST, which are easy to interpret, and a second set of experiments in which we rely on an already trained 15-layer network provided as part of the Caffe open source package [
An example of a pixelwise decomposition for synthetic data is given in
Left: The original image. Middle: Pixelwise prediction. Right: Superposition of the original image and the pixelwise prediction. The decompositions were computed on tiles of size 102 × 102 and having a regular offset of 34 pixels. The decompositions from the overlapping tiles were averaged. In the heatmap, based on linearly mapping the interval [−1, +1] to the jet color map available in many visualization packages, green corresponds to scores close to zero, yellow and red to positive scores and blue color to negative scores. See text for interpretation.
The averaged classifier decomposition in
This observation implies that local features over polygons are provided on average with positive second layer scores
Finally we remark that the rank-mapping is a discontinuous weighting scheme for BoW feature dimensions, yet the layerwise propagation yields reasonable explanations.
We have calculated pixelwise predictions for images from the evaluation set of the Pascal VOC 2009 image classification challenge. The BoW representations of the training and test part of the data set have been computed over whole images based on local features extracted from a dense regular grid and fixed rotation. Standard SIFT features and stacks of 9-dimensional quantile features measuring values from 0.1 to 0.9 of the data over color intensities as in [
Each triplet of images shows—from left to right—the original image, the pixelwise predictions superimposed with prominent edges from the input image and the original image superimposed with binarized pixelwise predictions. The decompositions were computed on the whole image. Images by Pixabay users tpsdave (twice), sirocumo, and Pixeleye.
Each triplet of images shows—from left to right—the original image, the pixelwise predictions superimposed with prominent edges from the input image and the original image superimposed with binarized pixelwise predictions. The decompositions were computed on the whole image. Faces below the hairline but also hands yield high scores; see the woman in the third picture, who turns her face away from the camera, as an example that hair alone is not relevant. Images by Pelagio Palagi, Wikimedia users Rorschach and Frankie Fouganthin, and Flickr user Le vent dans les dunes.
Each triplet of images shows—from left to right—the original image, the pixelwise predictions superimposed with prominent edges from the input image and the original image superimposed with binarized pixelwise predictions. The decompositions were computed on the whole image. Notably the tail of a plane receives negative scores consistently. Blue sky context seems to contribute to classification which has been conjectured already in the PASCAL VOC workshops [
Each triplet of images shows—from left to right—the original image, the pixelwise predictions superimposed with prominent edges from the input image and the original image superimposed with binarized pixelwise predictions. The decompositions were computed on the whole image. Positive responses seem to exist for certain fur texture patterns, see also the false responses on the wood and the plaster in the second example which both have similar texture and color to a cat’s fur. Images by Pixabay users LoggaWiggler and Holcan.
We would like to investigate the capacity of relevance propagation to find evidence for the classification of MNIST handwritten digits. A particular advantage of this data set over many-class image data sets is that it is easy for humans to interpret both positive and negative evidence, because of the small number of classes. For example, evidence for the handwritten digit “1” comes from the presence of a vertical bar on the pixel grid, but also from the absence of a horizontal bar starting from the top of the vertical bar, which would make it a “7”. Also, the data set is relatively simple, and the relation between the training algorithm and the resulting pixelwise relevances can be analyzed.
We perform three experiments on MNIST data, as we would like to demonstrate that the method is able to uncover properties specific to the way of training. One experiment is done with a smaller network which is trained without translations of digits, for the sake of allowing a direct comparison of the pixelwise decomposition results to class densities for each pixel of digits and seeing the impact of artifacts in the training set. Two further experiments are done on a larger network which has been trained without artifacts and with translated versions of digits and more training iterations, for the sake of a better response to digits. The latter two experiments intend to show the impact of nondigit pixels as positive and negative evidence for a class of digits.
The first set of experiments is done on a fully connected neural network trained in the most common way: input data is normalized so that the sum of pixels is on average zero, and the variance of pixel values is on average one. This setting implies that only black pixels yield strong inputs, whereas white pixels fire only due to mean subtraction. The absence of translation invariance during training makes it possible to uncover correlations of the pixelwise decomposition to pixelwise training densities and, as we will see in the experiments, to uncover artifacts in the data that may harm generalization.
Examples for pixelwise decompositions for the first type of neural networks are given in Figs
Each group of four horizontally aligned panels shows—from left to right—the input digit, the Taylor root point
Each group shows the decomposition of the prediction for the classifier of a specific digit indicated in parentheses.
When considering a digit from class
Pixelwise predictions obtained via the layerwise relevance propagation
Correlating the pixelwise decompositions for the classifier for digit
In this set of experiments, we train larger neural networks on the MNIST data set augmented with translated copies of the digits. The neural networks are composed of three hidden layers of 1296 units each, where weight connections between layers are initialized at random. The neural networks are trained by backpropagation using stochastic gradient descent with mini-batches of size 25 and using a softmax objective [
We consider two types of nonlinearities: (1) rectified linear units and (2) hyperbolic tangent sigmoids. These nonlinearities are some of the most commonly used in neural networks and are plotted in
Figs
Strong positive evidence for “4” is allocated to the top part of the image for keeping it blank. When trying to interpret these digits as “9”, the open top part of the image is perceived as negative evidence for this class, because a “9” would rather have a top dash closing the upper loop of the “4”. Explanations are consistent across a variety of neural networks and samples.
Classifying as “3” is supported by the middle horizontal stroke featured in this digit and the absence of vertical connections on the left of the image. Evidence for being an “8” features again the middle horizontal stroke; however, the absence of connections on the left side of the digit constitutes negative evidence. Explanations are again stable across various models and samples.
In order to compute the pixelwise decompositions, we used
Results are obtained using the relevance propagation
In this section we intend to make a semi-quantitative analysis of the pixelwise decompositions. The basic idea is to compute the decomposition of a digit for a digit class, flip pixels with highly positive scores, highly negative scores, or scores close to zero, and then evaluate the impact of these flips on the prediction scores. The advantage of demonstrating this on MNIST data is that a vast majority of pixels have either very high (black) or very low (white) values, with very few pixels having values in between. For results on photographic scenes one may need to resort to building masks which are specific to each category; for example, a gray box may be good for masking a flower, but it will not effectively mask a gray road or the fur of a Koala bear.
In general: If we
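The flipping protocol can be sketched as follows; the linear "classifier" and its relevances below are hypothetical stand-ins that make the expected monotone decline easy to verify.

```python
import numpy as np

def pixel_flipping_curve(x, relevance, predict, flip_value=0.0, steps=10):
    """Flip pixels in decreasing order of relevance and record the prediction
    after each batch of flips. A fast decline indicates that the highest
    scoring pixels indeed carry the evidence used by the classifier."""
    order = np.argsort(relevance.ravel())[::-1]   # most relevant pixels first
    x = x.ravel().copy()
    scores = [predict(x)]
    batch = max(1, len(order) // steps)
    for i in range(0, len(order), batch):
        x[order[i:i + batch]] = flip_value
        scores.append(predict(x))
    return scores

# Hypothetical linear 'classifier' whose relevances are exactly w * x.
w = np.array([3.0, 1.0, 0.5, 2.0, 0.1, 0.0])
x = np.ones(6)
predict = lambda z: float(w @ z)
relevance = w * x

curve = pixel_flipping_curve(x, relevance, predict, steps=6)
print(round(curve[0], 2), curve[-1])
```

Because the relevances here equal the exact per-pixel contributions, the curve is monotonically non-increasing and drops fastest at the beginning, which is the signature a good decomposition should show in this protocol.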
One apparent result from the preceding section is that we can observe nondigit pixels with highly positive pixel scores. In
Pixels with highest positive scores are flipped first. The pixelwise decomposition was computed for the true digit class, three (left) and four (right).
Secondly, it shows that measuring the quality of a pixelwise decomposition by an object segmentation mask is not always a good idea. When seeing the digit as an object, having nonobject pixels with high scores can make sense in the case of geometric constraints for the objects which are to be recognized, as we have in our case. For this reason we deliberately did not choose object segmentation masks for the digits as a basis for evaluation, in the sense that digit pixels should have highly positive scores and nondigit pixels should have zero or negative scores.
For the same reason, namely the possibility of the presence of geometric constraints, pixelwise decomposition is not always a good weak segmentation in contrast to the convincing results in the experiments of [
Once we have established the reason why we do not use segmentation masks for evaluating the quality of pixelwise prediction we can observe the effects of flipping the highest scoring pixels, independently of whether they are a digit or nondigit pixel. We can see from
Pixels with highest positive scores are flipped first. The pixelwise decomposition was computed for the true digit classes three (left) and four (right).
Pixels with absolute value closest to zero are flipped first. Digit and nondigit pixels may be flipped. Pixelwise decomposition have been computed for the true digit classes three (left) and four (right).
Thus, the pixelwise decomposition is not only intuitively appealing to a human but also makes sense for the representation used by the classifier to make its decision. Using digits for demonstrating such a statement has a mild bias towards our method, because for a geometry-driven task like digit recognition we can expect that, firstly, the problem can be learned well by a classifier so that the resulting pixelwise decompositions are very informative, and secondly, what has been learned on digits might be more similar between humans and algorithms compared to complex natural scene recognition tasks. See, however, the experiments in the Section
Finally, we evaluate the influence of pixels with negative scores. For this, we take a digit, compute the pixelwise decomposition for a wrong class, then flip the pixels which are marked negatively when trying to predict the wrong class. Results are shown in
Pixels with lowest negative scores are flipped first.
The results in Figs
Left: Pixels with highest positive scores are flipped first. Right: Flipping of neutrally predicted pixels, i.e. pixels with absolute value closest to zero are flipped first (solid lines), and flipping of randomly picked pixels (dashed lines). Results are averaged over digits from all digit classes in contrast to using only digit classes 3 and 4 in the preceding figures.
On average over all digits, flipping the highest scoring pixels first results in a fast decline of the prediction for the true class, and at some point another class is predicted. Flipping first the pixels with scores close to zero results in a much slower decline of the prediction for the true class. This result demonstrates a quantifiable plausibility of the pixelwise decomposition by layerwise relevance propagation. In order to visualize the process of pixel flipping, Figs
Here pixels are flipped away from the class label given in parentheses above the heatmap. Pixels were flipped in steps of 1% of all pixels until the predicted class label changed. The plots show the output of the softmax function
Here pixels are flipped towards the class label given in parentheses above the heatmap. Pixels were flipped in steps of 1% of all pixels until the predicted class label changed. The plots show the output of the softmax function
We use here the pretrained neural network which is provided by the Caffe open source package [
Second column shows decompositions computed by
Left and Right: Failures to recognize toilet paper. The decompositions computed by
Comparing the different methods shown in Figs
Only a subset of strong edges and textures receive high scores. Panels show the original image on the left, and the decomposition on the right. The decompositions were computed twice for the classes table lamp and once for the class rooster. The neural net is the pretrained one on ILSVRC data from the Caffe package [
It is known from [
Nonlinear learning machines are ubiquitous in modern problem solving. While highly successful in, e.g., hard classification, regression or ranking problems, their nonlinearity has so far prevented these models from additionally explaining, and thus contributing to a better understanding of, the nature of the solved problem. Making, say, a nonlinear classification decision for one particular novel data point transparent to the user is essentially orthogonal to the standard task of optimizing for an excellent and well generalizing classifier. We have introduced a tool set for deconstructing a nonlinear decision and thus fostering transparency for the user.
In particular we have introduced the general concept of decomposition of a nonlinear image classification decision in terms of pixels. In other words, for a well-classified image, a heatmap can be produced that highlights the pixels responsible for the predicted class membership. Note that this is possible without the need for segmented training images. We consider heatmapping an important part of the interpretation of nonlinear learning machines, and its applicability goes far beyond what has been exemplarily presented in this work: it ranges from the interpretation of biomedical images to the practical validation of trained models for image classification.
Practically, we have proposed two different approaches to pixelwise decomposition: The first one, Taylor-type decomposition, seeks to linearly approximate the class scoring function locally by performing a Taylor decomposition of it near a neutral data point without class membership, where the contribution of each dimension (i.e., pixel) can easily be identified. The second one, coined layerwise relevance propagation, applies a propagation rule that distributes the class relevance found at a given layer onto the previous layer. The layerwise propagation rule is applied iteratively from the output back to the input, thus forming another possible pixelwise decomposition. This inherits the favorable scaling properties of backpropagation.
Notably, these two methods were not defined as a particular solution to the heatmapping problem, but instead as a set of constraints that the heatmapping procedure must fulfill in order to be admissible. For instance, the exact choice of Taylor reference point was not specified beyond the constraints that it should be a root and that it should be located close to the actual data point. Similarly, layerwise relevance propagation has been defined with the sole restriction that the propagation rule conserves class relevance on a layer and node basis. Thus, the further specification of the relevance propagation rule is left to the appreciation of the user, or to future work, and may be model-specific, problem-specific, or subject to a particular practical or computational requirement.
Specific instances of the pixelwise decomposition procedure satisfying the constraints mentioned above have been proposed and analyzed. In particular, our work has covered a set of nonlinear learning algorithms for image classification, including kernel classifiers over Bag of Words pooled features, and feedforward multilayer neural networks. Both models are popular choices for image classification or analysis.
Our experiments show that applying Taylor-type decomposition, layerwise relevance propagation, or a combination of both to these nonlinear models produces highly informed heatmaps that reflect in many aspects the sophistication of the learned classifier. In particular, we have demonstrated that the same relevance propagation rule may, for different images, react to a variety of image features within the bounds of its modeling capacity. For example, in the case of the ImageNet convolutional network, we have shown that the heatmapping procedure finds class-relevant features that can be large areas of a particular color, localized features, image gradients, or more structured visual features such as edges, corners, contours, or object parts.
An important aspect of the proposed heatmapping procedure lies in the fact that it does not require modifying the learning algorithm or learning an additional model for heatmapping. Instead, it can be directly and transparently applied to any (pre)trained Bag of Words model or neural network when applicable. This desirable property is demonstrated in this paper by the heatmapping of images classified by the third-party GPU-trained ImageNet neural network. In particular, our heatmapping procedure was applied to this network without any further training or retraining. Thus, heatmaps for the ImageNet network could be quickly produced using a modest CPU.
While we have proposed in this paper several instances of pixelwise decomposition and demonstrated their excellent performance in practice, the set of possible relevance propagation methods, and their mathematical properties, will certainly have to be explored further. A first aspect that needs to be investigated is the greediness of the layerwise relevance propagation procedure, in the sense that it is computed one layer at a time, and its potential impact on the quality of heatmaps: while computationally advantageous, some of the backpropagated relevance might encounter a dead end in the lower layers and be distributed randomly. Another open question relates to the heuristic nature of the proposed instances of relevance propagation, in particular, whether making the distributed relevance messages proportional to the weighted neuron activations can be analytically justified.
Finally, it is not clear how to evaluate the quality of a heatmap beyond simple visual assessment. In this paper we have proposed as a starting point a pixel-flipping method that allows one to discriminate between two heatmapping methods that may otherwise look of similar quality to a human. Finding quantitative properties that are desirable for these heatmaps, or further constraints on the heatmapping procedure, would therefore naturally complement the human assessment. Future work will need to explore the many domain- and data-specific degrees of freedom in the heatmapping process in order to ultimately propose a universal metric for quantification. It is our firm belief that heatmapping will be an important ingredient of future knowledge discovery and exploratory analysis and understanding of complex data in the sciences and industry, even beyond the presented field of image analysis.
This work was supported in part by the Federal Ministry of Economics and Technology of Germany (BMWi) under the project THESEUS, grant 01MQ07018, by the German Ministry for Education and Research as Berlin Big Data Center BBDC, funding mark 01IS14013A and by DFG. KRM thanks for partial funding by the National Research Foundation of Korea funded by the Ministry of Education, Science, and Technology in the BK21 program. Correspondence should be addressed to SB, KRM and WS.