Learning efficient haptic shape exploration with a rigid tactile sensor array

Haptic exploration is a key skill for both robots and humans to discriminate and handle unknown objects or to recognize familiar objects. Its active nature is evident in humans who from early on reliably acquire sophisticated sensory-motor capabilities for active exploratory touch and directed manual exploration that associates surfaces and object properties with their spatial locations. This is in stark contrast to robotics. In this field, the relative lack of good real-world interaction models—along with very restricted sensors and a scarcity of suitable training data to leverage machine learning methods—has so far rendered haptic exploration a largely underdeveloped skill. In robot vision however, deep learning approaches and an abundance of available training data have triggered huge advances. In the present work, we connect recent advances in recurrent models of visual attention with previous insights about the organisation of human haptic search behavior, exploratory procedures and haptic glances for a novel architecture that learns a generative model of haptic exploration in a simulated three-dimensional environment. This environment contains a set of rigid static objects representing a selection of one-dimensional local shape features embedded in a 3D space: an edge, a flat and a convex surface. The proposed algorithm simultaneously optimizes main perception-action loop components: feature extraction, integration of features over time, and the control strategy, while continuously acquiring data online. Inspired by the Recurrent Attention Model, we formalize the target task of haptic object identification in a reinforcement learning framework and reward the learner in the case of success only. We perform a multi-module neural network training, including a feature extractor and a recurrent neural network module aiding pose control for storing and combining sequential sensory data. 
The resulting haptic meta-controller for the rigid 16 × 16 tactile sensor array moving in a physics-driven simulation environment, called the Haptic Attention Model, performs a sequence of haptic glances and outputs the corresponding force measurements. The method has been successfully tested with four different objects. It achieved a classification accuracy close to 100% while performing object contour exploration that has been optimized for its own sensor morphology.


Introduction
While the sense of touch is essential to human life, the tactile capabilities of robots are currently hardly developed. This stark contrast becomes even more apparent if one compares touch and vision: while good camera sensors have become affordable and ubiquitous items, and huge image and video databases together with deep learning have brought computer vision close to (some would argue on par with) human vision [1,2,3], comparable advances in robot touch are widely lacking [4,5].
One reason is the very limited maturity of tactile sensors if compared with human skin. A second and deeper reason is that touch differs from vision in an important way: while looking at an object leaves its state unaffected, touch requires physical contact, coupling the sensor and the object in potentially complex and rich ways that usually also change the position, orientation or even the shape of the object. Human haptics makes active and sophisticated use of this richness to lend us skills such as haptic exploration, discrimination, manipulation and more. Large parts of these tasks are hard or impossible to model sufficiently accurately to replicate them on robots, thereby calling again for machine learning approaches similar to those that were highly successful in vision. However, the highly interactive nature of touch makes not only the learning problem itself much more difficult but also creates a problem for the availability of meaningful training data, since information about interactive haptics is much harder to capture in databases of static tactile patterns. As a consequence, learning approaches for the modality of interactive touch are still largely in their beginnings and tactile skills enabling robots to establish and control rich and safe contact with objects or even humans are still a largely unsolved challenge which severely limits the use of robots in both domestic and industrial applications.
In this work we focus on using machine learning for the synthesis of one central and important haptic skill: the discrimination of unknown object shapes through a sequence of actively controlled haptic contacts between a sensor and the object surface. Our approach builds on recent advances that show how a deep network can be made to learn to integrate a sequence of visual observations to discriminate visual patterns. We extend this approach from the visual to the haptic domain and, by taking inspiration from insights about the organization of haptic exploration in humans, we create a potentially interesting new bridge between a computational understanding of interactive touch in robotics and in human haptics.
In humans, haptic capabilities are available at birth, for example, those that are necessary for a neonate to nurse. Over the course of early development, increasingly sophisticated haptic exploration comes on-line, as children acquire motor control and the ability to focus attention. By pre-school age, children demonstrate adult-like patterns of exploration [6] that they gate according to contextual demands [7]. This developmental process results in a small set of optimized action patterns, widely known under the term exploratory procedures (EPs) [8]. Robots, like humans, benefit from haptic sensors in order to find, identify, and manipulate objects. In contrast to the innate and developmentally enriched haptic capabilities of humans, implementing robot touch is still a challenge in multiple industrial applications, e.g. construction or large storage facilities. In this work we propose an implementation of object identification through the process of haptic exploration. Humans use EPs to extract properties such as texture, hardness, weight, or volume. Under some circumstances, the level of complexity in haptic exploration can be effectively reduced to what was termed the haptic glance by [9]. Specifically, Klatzky and Lederman define the haptic glance as a brief, spatially constrained contact that involves little or no movement of the fingers. In the same work they pose the question of how the information from a haptic glance is translated into effective manipulation.
Following this work, we are interested in the connection, or transition, between a haptic glance and an exploratory procedure. We propose that a haptic glance constitutes an atomic, primitive exploratory entity, and that an EP can be viewed as a sequence of such primitives. On a long-term scale, we are targeting the question: how to model an optimal control of haptic glances for an optimal task-specific haptic exploration of an unknown object or a scene? Will the resulting sequence of haptic glances emerge as a full EP? In order to answer this question affirmatively, such a control model should ideally contain a strategy to efficiently extract task-specific cues based on previously available information (if any), and integrate these over time. For computational purposes we make the following assumptions. Firstly, we assume that a haptic glance, being the simplest haptically directed action, is a foundation for any more complex haptic behavior, including haptic exploratory procedures of any type. Therefore, it is our goal to learn an optimal sequence of haptic glances, adapted to a given task and a sensor morphology that is provided beforehand and is specific for a given robot platform. Secondly, we assume that a haptic glance is defined by a tuple consisting of a pressure profile yielded by the tactile sensor at contact and the corresponding sensor pose.
Tactile sensing in robotics falls into two categories [10]. The first one is called "perception for action", which utilizes the tactile information to solve dexterous manipulation tasks such as grasping and slip prevention. The second category, which has recently become a popular area of research, is named "action for perception". It deals with object recognition and exploration [11,12]. Using learned exploration strategies in the form of tactile skills in order to facilitate exploration was studied for surface classification [12]. The approach employed in this work provides one possible solution to a typically puzzling question: how to couple the optimization of both above-mentioned directions, "perception for action" and "action for perception". In computer vision, the analogous question has already been investigated by means of recurrent models of visual attention (RAM) [13,14]. RAM acquires image glimpses by controlling the movement of a simulated eye within the image. The modeling approach is inspired by the fact that humans do not perceive their environment as a whole image. Instead, they only see parts of the scene, while the location of the fixations depends on the current task [15,16]. The model gathers information about the environment directed by image-based and task-dependent saliency cues [17,18]. Information extracted from these foveal "glimpses" is then combined in order to get an accumulated understanding of the visible scene. RAM applied to the control of sequences of haptic glances optimizes both above-mentioned directions simultaneously in a series of iterative steps, and enables us to find an optimal solution for a given tactile end-effector, with respect to its own constraints and the spatio-temporal resolution of the acquired data.
We present a framework that is able to identify four different kinds of objects using a tactile sensor array within a simulated environment. The object classification and pose control are formalized as a sequential decision-making process within a reinforcement learning framework, where an artificial agent is able to perform multiple haptic glances before the final estimation of the object's class. During the training of a multi-component deep neural network, we learn how to control the pose of the rigid tactile sensor in a way that is beneficial for the classification task. To enable integration of information gained through multiple haptic glances, we employ a recurrent neural network as one building block of this architecture. In the next section, the simulation setup and the algorithm are described, together with the training procedure. The experiments are then described, followed by a presentation and discussion of the results.

Scenario and experimental setup
To develop an efficient haptic controller that in future enables a robot to identify objects with a sequence of haptic glances, we perform a comprehensive experimental investigation in Gazebo (see S1 Code), a physics driven simulation environment. The simulation consists of two main parts: a floating standalone tactile sensor array modeled according to the Myrmex sensor [19] and a static set of 3D objects that are distributed in the simulation environment (see Fig 1). In simulation, one side of the sensor contains a 16 × 16 sensor array that approximately outputs the values of the real sensor array when forces are applied and a collision is detected (see S2 Code). Communication with the simulated sensor in Gazebo is performed via a ROS-interface (see S3 Code).

Stimulus Material
The stimulus material exists both in simulation and in the form of real building blocks. A combination of such building blocks forms the so-called Modular Haptic Stimulus Board (MHSB) which has been previously employed in a range of studies of haptic exploration and search in humans (e.g. [20,21,22]). Through its modularity, MHSB enables a flexible experimental design resulting in a wide range of 3D shape landscapes.

Haptic control for simulated Myrmex
Without loss of generality, control in our experiments is simplified as described below. An execution of the low-level controller that performs one haptic glance moves Myrmex from a predefined (x, y, z)-position down along the z-axis. To this end, it gradually decreases the z value, while keeping both the orientation and the (x, y) position constant until a collision with an object takes place (see S1 Video). Upon collision with the object, the sensor stops and outputs its readings. The main feature of this controller implemented with the "hand of god" plugin (see S4 Code) is that it constantly sustains the orientation and the (x, y) position of the sensor, up to the time of collision. This is done by switching off the gravity and continuously holding the sensor pose at a predefined value, against the impact of any impulses. By this means, we have full control of both the pose parameters and the resulting tactile measurement.
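The low-level glance controller described above can be illustrated with a minimal sketch. It is not the Gazebo plugin code: the function name `haptic_glance`, the step size `dz`, the floor `z_min` and the `in_contact` callback are hypothetical stand-ins for the simulator's collision detection.

```python
from typing import Callable, Tuple

# Simplified pose (x, y, z, phi); the real controller holds a full 6-DoF pose.
Pose = Tuple[float, float, float, float]


def haptic_glance(
    x: float,
    y: float,
    z_start: float,
    phi: float,
    in_contact: Callable[[float, float, float, float], bool],
    dz: float = 0.001,   # assumed descent increment per control step
    z_min: float = 0.0,  # assumed lower bound so the loop always terminates
) -> Tuple[Pose, bool]:
    """Lower the sensor along z with (x, y) and orientation held fixed.

    Returns the pose at which the descent stopped and whether contact occurred.
    """
    z = z_start
    while z > z_min and not in_contact(x, y, z, phi):
        z -= dz  # only z changes; x, y and phi stay constant
    return (x, y, z, phi), in_contact(x, y, z, phi)


if __name__ == "__main__":
    # Hypothetical flat surface at height 0.2.
    pose, hit = haptic_glance(0.3, 0.0, 0.5, 0.1, lambda x, y, z, phi: z <= 0.2)
    print(pose, hit)
```

In the simulation the sensor readings would be taken at the returned pose; here only the pose and a contact flag are returned.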
The process of haptic exploration is conducted by the designed meta-controller, represented by a neural network. Its main task is to classify the given object, while constantly providing a new expedient target pose $\zeta = (x_g, y_g, z_g, e_1, e_2, e_3)$ to the low-level controller for further exploration. For proof of concept, we used only two parameters within the meta-controller: the position along the x-axis and the angle around the y-axis. Before the execution of the haptic glance, the sensor is positioned at a specific pose where $x_g$ and the Euler angle $e_2$ are specified by the output of the network $l = (x_g, e_2)$. For the sake of readability, the alterable position $x_g$ is called x and the angle $e_2$ is called ϕ in the following sections.

Classification Task
To simplify the exploration process for the agent, a predefined amount of space is assigned to each object, illustrated with dashed lines in Fig 2. During training and classification, the agent is always presented with one out of four objects. It explores the restricted object space with the sensor by performing a predefined number of haptic glances. To preclude learning the absolute position of the object, the object coordinates within the simulation space are mapped to the pre-defined location space of the sensor x ∈ [−1, 1]. Due to the location of the pressure-sensitive surface on only one side of the Myrmex, rotations are performed within the range ϕ ∈ [−π/2, +π/2], as shown in Fig 3. Further rotation will not yield contact information between the object and the sensor surface. The acquired pressure information is employed not only to classify the given object but also to determine the next position and orientation of the sensor in the next exploration step.

Methods
Reinforcement learning is a well-known class of machine learning algorithms for solving sequential decision-making problems through maximization of a cumulative scalar reward signal [23]. To formalize our task as a reinforcement learning problem, the artificial agent receives a reward of r = 1 for a correctly classified object and a reward of r = 0 otherwise. We then use the standard formulation of a Markov decision process defined by the tuple $(S, A, P_A, R, \gamma, S_0)$, where $S$ denotes the set of states and $A$ the set of admissible actions. $P_A$ is the set of transition matrices, one for each action $a \in A$, with matrix elements $P^a_{s,s'}$ specifying the probability of ending up in state $s'$ after taking action $a$ from state $s$. Finally, $R : S \times A \to \mathbb{R}$ is a scalar-valued reward function, $\gamma$ the discount factor and $S_0 \subseteq S$ the set of starting states. The goal is to find an optimal policy $\pi : S \to A$ that maximizes the discounted future reward
$$R_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \tag{1}$$
The discount factor $\gamma \in [0, 1)$ balances the weighting between present rewards and rewards that lie increasingly in the future. The policy $\pi(a, s_t)$ is defined as the probability of choosing an action $a$ while in state $s_t$ at time-step t.
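As a small illustration of the discounted future reward, the following sketch accumulates a finite reward sequence backwards. The function name and the example values are illustrative assumptions; the only property taken from the text is that the reward is 1 for a correct classification and 0 otherwise.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**k * rewards[k], computed backwards for numerical simplicity."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Sparse reward: nothing for the first two glances, 1 for a correct final guess.
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))
```

With a reward that arrives only at the end of an episode, the discount factor simply scales how much that final success is worth when seen from earlier glances.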
A neural network with a set of weights θ can be employed to solve a reinforcement learning task, i.e. its output should maximize a given reward function $R_t$. In this case we can perform policy-gradient-based optimization with the REINFORCE update rule [24]. The general update rule for the weights θ of the network is given by
$$\Delta\theta = \alpha \, (R_t - b) \, \zeta \tag{2}$$
where α defines the learning rate factor and b the reinforcement baseline. ζ is called the characteristic eligibility. It is defined as
$$\zeta = \frac{\partial \ln f(s_t; \theta)}{\partial \theta}$$
where $f(s_t; \theta)$ determines the output of the network as a function of its input $s_t$ and its weight parameters θ. It is also possible to develop learning rules for an output that is determined via a stochastic distribution which depends on multiple parameters, like an adaptable Gaussian with variable mean µ and standard deviation σ. To this end, a neural network is trained to map the input to a parameterization of the Gaussian distribution, i.e. µ and σ. Instead of their corresponding weights $\theta_\mu$ and $\theta_\sigma$, µ and σ themselves can be treated as the adaptable parameters of the Gaussian $N(x; \mu, \sigma)$. Using this simplification, the characteristic eligibility for µ is given by
$$\zeta_\mu = \frac{\partial \ln N(x; \mu, \sigma)}{\partial \mu} = \frac{x - \mu}{\sigma^2} \tag{3}$$
where x is the corresponding value, sampled from the Gaussian distribution N. Analogously, the characteristic eligibility for σ is
$$\zeta_\sigma = \frac{\partial \ln N(x; \mu, \sigma)}{\partial \sigma} = \frac{(x - \mu)^2 - \sigma^2}{\sigma^3} \tag{4}$$
The details of the application of these equations to our work are described in the sections below.
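The characteristic eligibilities of a Gaussian unit are the partial derivatives of its log-density with respect to µ and σ, which gives the standard closed forms (x − µ)/σ² and ((x − µ)² − σ²)/σ³ [24]. The sketch below implements them and applies one REINFORCE step directly to µ and σ. The function names and the numeric values are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Tuple


def gaussian_log_pdf(x: float, mu: float, sigma: float) -> float:
    """Log-density of a 1-D Gaussian; used only to verify the eligibilities."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - (x - mu) ** 2 / (2.0 * sigma ** 2))


def eligibility_mu(x: float, mu: float, sigma: float) -> float:
    """d ln N(x; mu, sigma) / d mu."""
    return (x - mu) / sigma ** 2


def eligibility_sigma(x: float, mu: float, sigma: float) -> float:
    """d ln N(x; mu, sigma) / d sigma."""
    return ((x - mu) ** 2 - sigma ** 2) / sigma ** 3


def reinforce_step(mu: float, sigma: float, x: float, reward: float,
                   baseline: float, alpha: float) -> Tuple[float, float]:
    """One REINFORCE update, treating mu and sigma as the adaptable parameters."""
    advantage = reward - baseline
    mu_new = mu + alpha * advantage * eligibility_mu(x, mu, sigma)
    sigma_new = sigma + alpha * advantage * eligibility_sigma(x, mu, sigma)
    return mu_new, sigma_new


if __name__ == "__main__":
    # A rewarded sample close to the mean pulls mu towards it and shrinks sigma.
    print(reinforce_step(mu=0.1, sigma=0.5, x=0.3, reward=1.0, baseline=0.5, alpha=0.1))
```

In a full network these eligibilities would be back-propagated into the weights that produce µ and σ; the direct parameter update here only serves to show the sign and scale of the step.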

The Network architecture
An overview of the interaction loop between the network and the simulation is displayed in Fig 4. Inspired by the architectures in [13,14], the meta-controller network is constructed from three modules which are described in detail in the following subsections. A vector s = (p, x, ϕ) consisting of the sensor pose (x, ϕ) and the corresponding pressure profile acquired by Myrmex performing a haptic glance in Gazebo is used as the sensory input for the network. The 16 × 16 pressure matrix is flattened to a normalized pressure vector p with dim(p) = 256. First, the input is processed through the tactile network, which combines the recorded pressure profile p with its corresponding location x and orientation ϕ into one single feature vector. The resulting features are then propagated through the LSTM module [25]. It consists of one single LSTM unit with a hidden state of 256 neurons. If not stated otherwise, all layers are connected using the rectified linear unit (ReLU) as the activation function [26].
Figure 4: Illustration of the overall network architecture. The figure illustrates the overall design of the multi-module meta-controller model and its interaction with the Gazebo simulation environment.
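The interaction loop between the modules and the simulation can be summarized in a short sketch. All names here (`run_episode` and its callable arguments) are hypothetical placeholders for the simulated glance, the tactile network, the LSTM step, the location network and the classifier; the real modules are neural networks rather than plain functions.

```python
from typing import Any, Callable, Sequence, Tuple

Pose = Tuple[float, float]  # (x, phi)


def run_episode(
    glance: Callable[[Pose], Sequence[float]],               # pose -> pressure vector
    tactile_net: Callable[[Sequence[float], Pose], Any],     # (what, where) -> features
    recurrent_step: Callable[[Any, Any], Any],               # (features, state) -> state
    location_net: Callable[[Any], Pose],                     # state -> next pose
    classifier: Callable[[Any], int],                        # state -> class index
    first_pose: Pose,
    n_glances: int,
    initial_state: Any,
) -> int:
    """Perform n_glances haptic glances, then classify from the final state."""
    state = initial_state
    pose = first_pose
    for step in range(n_glances):
        pressure = glance(pose)                  # sensor reading at the current pose
        features = tactile_net(pressure, pose)   # fuse pressure with pose
        state = recurrent_step(features, state)  # integrate over time
        if step < n_glances - 1:
            pose = location_net(state)           # choose where to touch next
    return classifier(state)
```

The sketch only fixes the order of the calls; how each module is parameterized is described in the following subsections.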

The Tactile Network
The tactile network is displayed in detail in Fig 5. It combines the tactile response of the sensor p with the corresponding location x and angle ϕ. An important choice is the approach used to combine what (i.e. the pressure p) with where (i.e. position x and orientation ϕ). While [13] use an element-wise addition of the two features, [14,27] suggest using element-wise multiplication. In this work, based on the performed tests, we concatenate the two resulting types of features and process them with two additional linear layers. In this way, we do not impose a specific inner structure on the combination process, but let the network resolve this issue on its own.
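The three ways of combining "what" and "where" features can be contrasted in a few lines. This is a plain-Python sketch with made-up layer sizes and random weights; the helper names (`linear_relu`, `random_layer`, `combine_*`) are hypothetical and do not correspond to the authors' code.

```python
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight rows, bias)


def linear_relu(x: Vector, layer: Layer) -> Vector:
    """Fully connected layer followed by ReLU."""
    weights, bias = layer
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def random_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small random weights and zero bias, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def combine_add(what: Vector, where: Vector) -> Vector:
    """Element-wise addition; both feature vectors must have the same length."""
    return [a + b for a, b in zip(what, where)]


def combine_mul(what: Vector, where: Vector) -> Vector:
    """Element-wise multiplication; both feature vectors must have the same length."""
    return [a * b for a, b in zip(what, where)]


def combine_concat(what: Vector, where: Vector, layers: List[Layer]) -> Vector:
    """Concatenate, then let additional layers learn how to mix the features."""
    hidden = what + where
    for layer in layers:
        hidden = linear_relu(hidden, layer)
    return hidden


if __name__ == "__main__":
    rng = random.Random(0)
    layers = [random_layer(5, 4, rng), random_layer(4, 3, rng)]
    print(combine_concat([0.2, 0.4, 0.6], [0.5, -0.5], layers))
```

One practical difference visible here is that addition and multiplication require equally sized feature vectors, whereas concatenation does not.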

The Location Network
The location network is designed to output the pose of the next haptic glance. The feature vector that is used as the input to this module is the output that is generated by the LSTM unit. It thus implicitly integrates shape information yielded by the previously performed glances. A stochastic location policy is modeled using two Gaussian distributions, one for position and one for orientation, each with variable mean µ and standard deviation σ as shown in Fig 6. The features of the LSTM are propagated through a linear layer that outputs the mean µ(θ) ∈ [−1, 1] and the standard deviation σ(θ) of the Gaussian. The extent of exploration of the location policy is given by the size of the Gaussian's standard deviation σ. For large σ, the sampled location of the glance scatters widely around the mean µ, whereas for small σ it stays close to µ.
The two above-mentioned pipelines are used for computing a distinct µ and σ for the position and for the orientation. The activation functions of the output layers are chosen to limit the resulting values to a reasonable range: tanh is used to generate the mean within the desired range, and the softplus function is used for the standard deviation. µ(θ) and σ(θ) are then used to compute the new location and orientation by sampling from the respective one-dimensional Gaussians for each of the desired variables. To ensure that the position of the sensor remains within the predefined space around the to-be-classified object, and that the orientation remains within its boundaries, the sampled values of the Gaussians N(q; µ, σ) are restricted to the range q ∈ [−1, 1]: if q is sampled outside this range, it is resampled. The new pose vector is then given as $l = (q_x, q_\phi \cdot \pi/2) = (x, \phi)$.
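The sampling procedure of the location policy can be sketched as follows: tanh bounds the mean, softplus keeps the standard deviation positive, and samples outside [−1, 1] are redrawn. The function names and the raw input values are illustrative assumptions; in the model the raw values would come from the linear output layers.

```python
import math
import random
from typing import Tuple


def softplus(z: float) -> float:
    """Numerically stable softplus, always positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sample_restricted(mu: float, sigma: float, rng: random.Random,
                      low: float = -1.0, high: float = 1.0) -> float:
    """Draw from N(mu, sigma) and resample until the value lies in [low, high]."""
    while True:
        q = rng.gauss(mu, sigma)
        if low <= q <= high:
            return q


def next_pose(raw_mu_x: float, raw_sigma_x: float,
              raw_mu_phi: float, raw_sigma_phi: float,
              rng: random.Random) -> Tuple[float, float]:
    """Map raw layer outputs to a pose (x, phi) with phi in [-pi/2, pi/2]."""
    q_x = sample_restricted(math.tanh(raw_mu_x), softplus(raw_sigma_x), rng)
    q_phi = sample_restricted(math.tanh(raw_mu_phi), softplus(raw_sigma_phi), rng)
    return q_x, q_phi * math.pi / 2.0


if __name__ == "__main__":
    print(next_pose(0.3, -1.0, 2.0, 0.5, random.Random(1)))
```

Because tanh keeps the mean strictly inside the interval, every draw has a non-zero chance of being accepted, so the resampling loop terminates.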

The classification network
In order to classify a given object, the generated feature vector of the LSTM is not only transferred to the location network, but also propagated through a different linear layer that is then used for classification. To achieve this, the softmax function is utilized to encode the predicted class affiliation of the current object in a probability distribution $\pi(o \mid \tau_{1:S}; \theta_t)$, representing the current policy of the reinforcement learning agent. Here, $\tau_{1:S}(\theta_t)$ encodes the accumulated LSTM feature vector after S glances, using the current set of weights $\theta_t$ at training step t. For classification, the class o with the highest probability is taken as the prediction.
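The classification step amounts to a softmax over the class scores followed by an arg-max. The sketch below assumes four classes and arbitrary example scores; `predict` and `reward` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: Sequence[float]) -> Tuple[int, List[float]]:
    """Return the most probable class index together with all class probabilities."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


def reward(predicted: int, target: int) -> int:
    """Reward of 1 for a correct classification and 0 otherwise."""
    return 1 if predicted == target else 0


if __name__ == "__main__":
    label, probs = predict([0.1, 2.0, -1.0, 0.5])
    print(label, probs, reward(label, 1))
```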

Training
The target loss function L, used for training, is composed of two different components: classification and location. The update rule for both parts is derived from the REINFORCE algorithm [24]. For the classification component of the loss, we see the designed model as a reinforcement learner which has to choose the right action in order to classify the given object. For classifying the object correctly it receives a reward r = 1, and r = 0 otherwise. The predicted probability of correctly identifying the target object o after S glances is then given as $\pi(o \mid \tau_{1:S}; \theta)$. To this end, the categorical cross-entropy can be used to compute the loss.
For learning the means $\mu_x$ and $\mu_\phi$ of the location component of the policy, the characteristic eligibility as outlined in Eq (3) is used. $\sigma_x$ and $\sigma_\phi$ are learned by applying Eq (4). The hybrid update rule, Eq (6), combines the classification component with the location component. In it, the function π(o) gives the computed classification probability that the to-be-classified object is object o, while $y_o$ is 1 if o corresponds to the correct object and 0 otherwise.
The parameter β controls the contribution of the different parts of the update. For β = 1, both parts contribute equally to the weight update, whereas a smaller factor β < 1 assigns more resources to the classification part. For β = 0, the location part is omitted completely [27].
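To make the role of β concrete, the following sketch shows one plausible scalar form of a hybrid objective. It is an assumption, not the paper's equation: a cross-entropy term for the classifier plus a β-weighted REINFORCE term built from the log-probabilities of the sampled locations. The function name `hybrid_loss` and all numbers are hypothetical.

```python
import math
from typing import Sequence


def hybrid_loss(class_probs: Sequence[float], target: int,
                location_log_probs: Sequence[float],
                reward: float, baseline: float, beta: float) -> float:
    """Assumed hybrid objective: cross-entropy + beta * REINFORCE surrogate.

    Minimizing the second term by gradient descent moves the location policy
    in the direction (reward - baseline) * eligibility.
    """
    classification = -math.log(class_probs[target])
    location = -(reward - baseline) * sum(location_log_probs)
    return classification + beta * location


if __name__ == "__main__":
    probs = [0.1, 0.7, 0.1, 0.1]
    for beta in (0.0, 0.5, 1.0):
        print(beta, hybrid_loss(probs, 1, [-0.5, -0.2], 1.0, 0.4, beta))
```

With β = 0 the value reduces to the pure classification loss, which matches the description of the location part being omitted.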
The baseline layer is updated separately, using the mean-squared error. Instead of training the baseline only on the accumulated tactile information of the last glance $\tau_{1:S}$, the training can be improved by also using all included sub-sequences $\tau_{1:s}$ with s ≤ S [27]. This leads to a loss function that accumulates the squared error over all of these sub-sequences.
The overall network model is trained using stochastic gradient descent with Nesterov momentum [28,29]. The learning rate starts at $\alpha_0$ and decays towards $\alpha_{min}$ as a function of the training step t, with a decay factor of $\delta_\alpha$ and a step size of T.
Because the network itself generates the location for the next haptic glance, no fixed training set can be used to train the classifier. The current batch only specifies the to-be-classified objects, while the first pressure-location pair is chosen by a random first glance for each object. The location for any further glance is chosen by the current state of the location policy of the network.
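The baseline loss and the learning-rate schedule can be illustrated under explicit assumptions. Both functional forms below are guesses chosen for simplicity: a mean of squared errors over the baseline predictions of all sub-sequences, and a stepwise exponential decay clipped at the minimum rate. The names and numbers are hypothetical.

```python
from typing import Sequence


def baseline_loss(baselines: Sequence[float], reward: float) -> float:
    """Assumed form: mean squared error between the reward and the baseline
    predicted after each of the s = 1..S glances."""
    return sum((b - reward) ** 2 for b in baselines) / len(baselines)


def decayed_learning_rate(step: int, alpha_0: float, alpha_min: float,
                          decay: float, period: int) -> float:
    """Assumed form: multiply by `decay` every `period` steps, never below alpha_min."""
    return max(alpha_min, alpha_0 * decay ** (step // period))


if __name__ == "__main__":
    print(baseline_loss([0.5, 1.0], reward=1.0))
    for step in (0, 1000, 5000, 10 ** 6):
        print(step, decayed_learning_rate(step, 0.01, 0.0001, 0.5, 1000))
```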

Experiments
To perform an empirical examination of the validity of the network architecture, we perform a series of evaluations with a focus on each one of the three modules: the LSTM, the location network, and the tactile network. The core of the evaluation approach focuses on the recurrent LSTM unit that plays a central role in feature extraction and integration. Our hypothesis is that by employing LSTM we increase both the classification accuracy and the efficiency of the pose control. To test the efficiency of the LSTM on both tasks, the classification accuracy is computed while training the network on a varying number of glances. In addition to the final classification accuracy, the individual classification accuracies after each glance are evaluated. To demonstrate the efficiency of using a recurrent unit instead of a simple linear hidden layer, the experiment is repeated with the LSTM replaced by a linear layer of the same size (i.e. 256 neurons).
The second part of the evaluation is dedicated to the pose control and the location network. We evaluate it during the learning process, and compare the results against a model with a random location choice. To this end, we omit the location network and provide the model with new locations x ∈ [−1, 1] and orientations ϕ ∈ [−π/2, π/2] that are sampled from a uniform distribution. For training, only the classification part of Eq (6) is used to create the weight update, while β is set to 0.
In the third part of the evaluation, the different approaches for combining the tactile information with its corresponding location (What & Where) are compared.
In order to measure the performance after a certain number of training steps, the training is stopped and the mean classification accuracy over 100 newly generated batches is estimated, using the currently available policy. To obtain a statistically reliable measure of the accuracy, each experiment is repeated 10 times. For the final evaluation, the mean accuracy of these experiments is computed, with the standard deviation of the mean as the error. For each training step, a new batch of size 64 is generated, where the to-be-classified objects o are uniformly chosen from the set of available objects.
The sensor responses are taken from a pre-recorded dataset of tuples $d_o = (p, x, \phi)$. Here p is the normalized pressure vector, x ∈ [−1, 1] the respective position of the sensor within the location space and ϕ ∈ [−π/2, π/2] the angle. For each object the recording of the tuples $d_o$ starts with the position x = −1 and the orientation ϕ = −π/2. These two parameters are then both incremented with a step size of Δx = 0.01 and Δϕ = (π/2) · 0.01, leading to 201 × 201 pre-recorded data-points $d_o$ for each object. The complete dataset then has a size of roughly 161 · 10^3 data-points that can be picked to approximate the sensor pose generated by the location network. For a new pair (x, ϕ) generated by the network, the closest data-point $d_o$ is selected from the pre-recorded dataset.
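Looking up the closest pre-recorded data-point for a pose generated by the network reduces to rounding onto a regular grid. The sketch assumes a 201 × 201 grid per object spanning x ∈ [−1, 1] and ϕ ∈ [−π/2, π/2], as well as four objects; the function names are hypothetical.

```python
import math
from typing import Tuple

N_X = 201     # grid points along x
N_PHI = 201   # grid points along phi
N_OBJECTS = 4


def nearest_index(value: float, low: float, high: float, n: int) -> int:
    """Index of the grid point closest to `value` on a regular grid of n points."""
    step = (high - low) / (n - 1)
    index = round((value - low) / step)
    return min(max(index, 0), n - 1)


def nearest_grid_point(x: float, phi: float) -> Tuple[int, int]:
    """Grid indices (i, j) of the pre-recorded pose closest to (x, phi)."""
    i = nearest_index(x, -1.0, 1.0, N_X)
    j = nearest_index(phi, -math.pi / 2.0, math.pi / 2.0, N_PHI)
    return i, j


def dataset_size() -> int:
    """Total number of pre-recorded data-points over all objects."""
    return N_OBJECTS * N_X * N_PHI


if __name__ == "__main__":
    print(nearest_grid_point(0.123, 0.4), dataset_size())
```

The total of 4 × 201 × 201 = 161,604 grid points agrees with the dataset size of roughly 161 · 10^3 quoted above.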

Results
The main results are summarized in Table 2. It displays the classification accuracies for all three variants of the architecture as described above and shows the corresponding results for an increasing number of glances. The "full model" π M (see column 1) reaches a classification accuracy of about 99.6% on the pre-recorded dataset. While the accuracy using one random glance is only ≈ 55%, it continuously improves when more glances can be executed.
Granting the model just one more glance leads to an accuracy of ca. 83%. Overall, accuracy improvement for the full model is faster than for the other two tested architectures, up to its convergence after about 6 glances are performed.
Column 2 presents the results of the random location policy. It starts from the same performance as the full model (since the first glance is random in both policies) and from there approaches its asymptotic performance more slowly, making its performance inferior when only 2 to 6 glances can be invested. Thus, our model is able to learn to efficiently extract important information when the number of possible interactions with the given object is limited.
Table 2: Best classification performance for the different number of glances. The table lists the best measured classification performance after 50 · 10^3 training steps. The full meta-controller model π M contains all trained components including the LSTM module and the location network. The random location policy approach π rloc substitutes the location network with a random location generator. π MLP substitutes the LSTM unit with a linear layer of the same size. In the last column, the classification performance of π MLP is evaluated by averaging over all conducted glances.
π M              π rloc           π MLP            π MLP (averaged)
0.994 ± 0.001    0.995 ± 0.000    0.668 ± 0.003    0.997 ± 0.000
If the recurrent LSTM unit is replaced with a linear layer of the same size (column 3), the classification accuracy does not rise beyond 70%, constituting the worst result. Due to the missing recurrent connection, and the fact that the accuracy is only evaluated after the last glance, the MLP-based architecture π MLP is optimized based only on the last glance, and therefore does not improve after two glances.
However, when its output is averaged over all conducted glances, the performance of this averaged MLP model becomes very similar to that of the random model (column 2). Asymptotically (here: ten or more glances), all except the MLP model reach practically perfect classification.
For further analysis, the model trained on 10 glances is studied in detail. To visualize the learning of "good locations" for tactile classification of the objects, heat-maps are created that show how often a specific location-orientation pair was visited during the classification process of the performance runs. For this purpose, the location-orientation space was discretized into 20 × 20 bins. To generate the location-orientation profiles for the objects, 1000 batches are evaluated in each performance run. During the performance run, the number of visits to the different bins was counted for the last executed glance of each individual classification. The results are illustrated in Fig 10 and Fig 11. To create the heat-maps, the bin with the most visits is identified. This number of visits is then taken as the maximum value to rank the 400 bins according to the number of visits. Fig 10 shows the location policy for the triangular-shaped object. Fig 10b illustrates the evolution of the learned means $\mu_x$ and $\mu_\phi$ of the location policy. While the generated values are centered around x = 0 and ϕ = 0 at the beginning of the training, the prioritized angle changes to ϕ ≈ π/4 during the learning process. Fig 10c illustrates the corresponding sampled policy.
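The visit-count heat-maps can be sketched as a 20 × 20 histogram over the location-orientation space, scaled by the count of the most visited bin. The function name and the example poses are illustrative assumptions; plotting is left out.

```python
import math
from typing import List, Sequence, Tuple


def visit_heatmap(poses: Sequence[Tuple[float, float]], bins: int = 20) -> List[List[float]]:
    """Count visits per (phi, x) bin and scale by the most visited bin.

    Rows index the orientation phi in [-pi/2, pi/2], columns the position x in [-1, 1].
    """
    counts = [[0] * bins for _ in range(bins)]
    for x, phi in poses:
        col = min(int((x + 1.0) / 2.0 * bins), bins - 1)
        row = min(int((phi + math.pi / 2.0) / math.pi * bins), bins - 1)
        counts[row][col] += 1
    peak = max(max(row) for row in counts)
    if peak == 0:
        return [[0.0] * bins for _ in range(bins)]
    return [[c / peak for c in row] for row in counts]


if __name__ == "__main__":
    # Three final glances near x = 0, phi = pi/4 and one near the lower-left corner.
    heat = visit_heatmap([(0.05, 0.8)] * 3 + [(-0.95, -1.5)])
    print(heat[15][10], heat[0][0])
```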
The learned location policy of the model differs between the to-be-classified objects. While the best location policy for the triangular-shaped object (Fig 10) seems to be a plateau at ϕ = π/4 around x = 0, the location policies for the objects in Fig 11 try to cover a broader range of angles. It is also worth mentioning that the symmetry of the two objects illustrated in Fig 11 is reflected in the learned location policy.
Table 3 lists the best classification accuracies of the model using 3 glances for the different ways of combining the normalized pressure vector p with the corresponding location l. Concatenating the two sets of features and processing the result through two layers yields the best accuracy (0.905), narrowly ahead of element-wise addition (0.902), while element-wise multiplication performs worst (0.873).

Table 3: Learning Performance: What & Where. The table lists the best measured classification performance within the 50 · 10^3 training steps for the different tested approaches of combining the normalized pressure p with the location-orientation pair (x, ϕ).
Combiner                        Best performance
elem. multiplication            0.873 ± 0.002
concat. followed by 1 layer     0.899 ± 0.002
elem. addition                  0.902 ± 0.001
concat. followed by 2 layers    0.905 ± 0.001
As a last step, the model was tested within the Gazebo simulation. A video of the classification procedure is available as S2 Video.

Discussion
The results of the conducted experiments show that the full network architecture π_M, including the recurrent LSTM module and the location module, controls the execution of haptic glances more efficiently than π_rloc and π_MLP. Therefore, both recurrence and an optimized location control are likely to be necessary ingredients of an efficient haptic exploration model in our scenario. The evaluations show the different speeds at which the models approach almost perfect classification as more data in the form of haptic glances becomes available. These results may be constrained by the simplicity of the 3D shapes considered in the experiment. Larger test sets have to be created to enable an extensive evaluation of the proposed approach.
In this work, the implemented policy model has been tested with a classification front-end, i.e. the task of the meta-controller was to identify objects efficiently. Due to the modularity of the meta-controller architecture, the object classifier can be substituted with a different front-end to pursue other objectives, such as haptic search or fault diagnosis. Furthermore, we believe that it can be applied not only to contour exploration, but also to other types of haptic exploration, such as squeezing for rigidity identification, or texture identification. Vision may not be an optimal source of information [32] when an object has a complex non-linear 3D surface that is partially occluded, e.g. due to non-convex topology or orientation, or when object features such as softness have to be estimated. All these cases are potential application domains for the proposed procedure.
Although the optimized policy applies only to the specified rigid sensor array, nothing in principle prevents the use of the described network architecture with a wider set of sensor types. For a different type of sensor or a multi-touch approach, the policy has to be learned anew. An aspect related to the type of sensor is the corresponding implementation of the primitive haptic glance controller. In this work, we have implemented a controller that establishes a static contact with the object surface while maintaining a given pose. Other types of controllers, or even a combination of different controllers, could be employed within an extended version of our model. Apart from extending the set of controllers, the model should be tested with all 6 DoF.

Conclusion
In this work, we have proposed the first implementation of a haptic glance-inspired controller. Given a pose parameter as input, a floating tactile sensor array touches the surface at the specified location and yields the resulting pressure vector. We have trained a meta-controller network architecture to perform an efficient haptic exploration of 3D shapes by optimally parametrizing the haptic glance controller, so that it performs a sequence of glances and identifies 3D objects. The architecture has been tested successfully in a physics-driven simulation environment.
To support our claim that the resulting policy can enable a robot equipped with such a tactile sensor to perform efficient object identification by touch, our next task is to perform tests with a KUKA robot platform equipped with a Myrmex tactile sensor array. Beyond haptic object identification, we believe that the developed procedure may be applied to enable a robot to perform complex manipulation tasks that rely heavily on haptics.

Supporting information
S1 Code. Gazebo. The simulation software is available under the following link: http://gazebosim.org/
S2 Code. Myrmex Simulation. Code of the tactile simulation is available under the following link: https://github.com/ubi-agni/gazebo_tactile_plugins