Fig 1.
Illustration of the cross-modal classification task that models the recognition of an object from its name, after cross-modal habituation.
The learner perceives a multimodal signal (vision and sound in the figure) and learns the associations between the appearance of objects and the sound of their names. It then has to prove its ability to choose the right object among a small set when hearing its name. In practice, we use images of objects, motion captures, and recorded spoken utterances in the experiments.
Fig 2.
Dance motions were performed by a human dancer and perceived through skeleton tracking based on a 3D Kinect sensor.
The figure illustrates some of the gestures demonstrated to the learner.
Table 1.
Transcriptions of ten random examples from the Acorns Caregiver dataset of Altosaar et al. [36].
Keywords are identified in bold font.
Fig 3.
Example frames from the image dataset.
These frames feature the following objects: blue whale, yellow car, teddy bear, and moose. Colored circles correspond to local descriptors detected by the system. Interestingly, the hand of the operator appears in some pictures, as in the last one. The area observed by the system is actually larger than the one represented in the figure.
Fig 4.
The MCA-NMF system processing of data.
First, raw data from each modality is transformed into a histogram of local descriptors, that is, a vector of nonnegative values. Then the histograms from each modality are concatenated into a histogram representation of the multimodal perception. During its training, the MCA-NMF system uses a set of perceived examples, each represented by such a histogram, to learn a multimodal dictionary that captures multimodal patterns. The system then uses the dictionary to transform perception into an internal representation. The internal representation can be obtained with all modalities observed (as in the figure) or with only a subset of modalities observed, as explained further.
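The pipeline described in this caption can be sketched roughly as follows. This is only a minimal illustration, not the authors' implementation: the function name `nmf`, the toy histogram sizes, and the choice of the Frobenius objective with multiplicative updates are all assumptions made for the example.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (features x examples) as W @ H.

    Uses multiplicative updates for the Frobenius objective
    (an assumption; other divergences are possible).
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps   # dictionary, one atom per column
    H = rng.random((k, V.shape[1])) + eps   # internal coefficients
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy stand-ins for per-modality histograms (hypothetical sizes).
rng = np.random.default_rng(1)
hist_mod1 = rng.random((20, 50))   # 20 bins x 50 examples
hist_mod2 = rng.random((30, 50))   # 30 bins x 50 examples

# Concatenate the modalities into one multimodal histogram per example.
V = np.vstack([hist_mod1, hist_mod2])

# Learn a multimodal dictionary with k = 5 atoms.
W, H = nmf(V, k=5)

# The rows of W split back into one sub-dictionary per modality.
W_mod1, W_mod2 = W[:20], W[20:]
```

Each column of `H` plays the role of the internal representation of one training example.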
Fig 5.
Once the system has learnt the dictionary (Wmod1 and Wmod2), given an observation vmod1 in one modality it can reconstruct the corresponding internal representation as well as the expected perception in another modality.
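As a rough sketch of this reconstruction step, one can keep the dictionary fixed, fit nonnegative coefficients to the observed modality only, and then project them through the other modality's sub-dictionary. The helper name `infer_coefficients`, the Frobenius-based update, and the toy dictionaries below are assumptions made for illustration.

```python
import numpy as np

def infer_coefficients(W_obs, v_obs, n_iter=200, eps=1e-9):
    """Find nonnegative h such that W_obs @ h approximates v_obs (W_obs fixed)."""
    h = np.ones(W_obs.shape[1])
    WtV = W_obs.T @ v_obs
    WtW = W_obs.T @ W_obs
    for _ in range(n_iter):
        h *= WtV / (WtW @ h + eps)   # multiplicative update, keeps h >= 0
    return h

# Toy dictionaries standing in for a learnt Wmod1 and Wmod2 (hypothetical).
rng = np.random.default_rng(0)
W_mod1 = rng.random((20, 5))
W_mod2 = rng.random((30, 5))

# An observation in modality 1 only.
v_mod1 = W_mod1 @ rng.random(5)

# Internal representation, then expected perception in modality 2.
h = infer_coefficients(W_mod1, v_mod1)
v_mod2_expected = W_mod2 @ h
```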
Fig 6.
The learner is tested on its ability to relate an observation of a test example in one modality to the right reference example in another modality.
Fig 7.
The multimodal classification task may be implemented as three different comparisons by MCA-NMF.
At the top, internal coefficients computed from the test example are compared to those computed from each reference example. In the middle, an expected representation of the test example in the reference modality is computed and then compared to the actual perceptions of the reference examples. Finally, at the bottom, expected representations of the reference examples in the test modality are computed and compared to the actual perception of the test example.
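The three comparisons can be sketched as three branches of one function. This is an illustrative toy, not the paper's code: the names `infer`, `best_match`, and `classify`, the use of cosine similarity, and the synthetic block-structured dictionaries are assumptions.

```python
import numpy as np

def infer(W, v, n_iter=200, eps=1e-9):
    """Nonnegative coefficients h with W @ h close to v (W fixed)."""
    h = np.ones(W.shape[1])
    for _ in range(n_iter):
        h *= (W.T @ v) / (W.T @ W @ h + eps)
    return h

def cosine(a, b, eps=1e-12):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def best_match(query, candidates):
    """Index of the candidate most similar to the query."""
    return int(np.argmax([cosine(query, c) for c in candidates]))

def classify(W_test, W_ref, v_test, refs, mode):
    h_test = infer(W_test, v_test)
    if mode == "internal":    # compare internal coefficients
        return best_match(h_test, [infer(W_ref, r) for r in refs])
    if mode == "reference":   # compare in the reference modality
        return best_match(W_ref @ h_test, refs)
    if mode == "test":        # compare in the test modality
        return best_match(v_test, [W_test @ infer(W_ref, r) for r in refs])
    raise ValueError(f"unknown mode: {mode}")

# Toy dictionaries with one atom per concept (hypothetical data).
rng = np.random.default_rng(0)
n_concepts = 3
W_test = np.kron(np.eye(n_concepts), np.ones((4, 1))) + 0.1 * rng.random((12, n_concepts))
W_ref = np.kron(np.eye(n_concepts), np.ones((5, 1))) + 0.1 * rng.random((15, n_concepts))

refs = [W_ref[:, c] for c in range(n_concepts)]   # one reference per concept
v_test = W_test[:, 1]                             # test example of concept 1

results = {m: classify(W_test, W_ref, v_test, refs, m)
           for m in ("internal", "reference", "test")}
```

On this easy synthetic data all three comparisons agree; on real data they may differ, which is why they are evaluated separately.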
Fig 8.
Human motions are abstracted as concatenated histograms on joint positions and velocities.
In the final histograms, frequencies are represented through colors; the x and y axes correspond to values of angles and velocities, respectively. (Best seen in color.)
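A minimal sketch of such a representation, assuming one two-dimensional (angle, velocity) histogram per degree of freedom, is given below. The function name `motion_histogram`, the number of bins, and the finite-difference velocity estimate are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def motion_histogram(angles, n_bins=8, dt=1.0):
    """Concatenate per-joint 2D histograms over (angle, velocity).

    angles: array of shape (T, D), T time steps and D degrees of freedom.
    Returns a nonnegative vector of length D * n_bins * n_bins.
    """
    # Velocities estimated by finite differences along the time axis.
    velocities = np.gradient(angles, dt, axis=0)
    parts = []
    for d in range(angles.shape[1]):
        hist, _, _ = np.histogram2d(angles[:, d], velocities[:, d], bins=n_bins)
        parts.append(hist.ravel())
    return np.concatenate(parts)

# Toy motion: three sinusoidal joint trajectories over 100 frames.
t = np.linspace(0.0, 2.0 * np.pi, 100)
angles = np.stack([np.sin(t), np.cos(2 * t), 0.5 * np.sin(3 * t)], axis=1)

hist = motion_histogram(angles)
```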
Fig 9.
Sequence of transformations from raw (time sequence) acoustic signal to histograms of acoustic co-occurrence (HAC) representation.
Table 2.
We form arbitrary multimodal concepts by associating an object with a keyword and a gesture.
The table presents a list of such associations. The limbs on which the motions occur are also mentioned. The system then observes keywords as spoken sentences containing them, objects as images, and gestures as human demonstrations perceived by a motion capture system.
Table 3.
Success rates of recognition of the right reference example from a test example.
The values are given for many choices of the reference, test, and comparison modalities and for various measures of similarity. The results are obtained by averaging over a ten-fold cross-validation; the random baseline is 0.11 in that case.
Table 4.
There is no systematic improvement of the recognition rate when unambiguous symbols are added to the training data.
The table presents the same success rates as previously (see Table 3) but with a learner that observed symbolic labels representing the semantic classes during training. The results are obtained by averaging over a ten-fold cross-validation; the random baseline is 0.11 in that case.
Fig 10.
An additional modality during training slightly improves results on average.
The box plot represents classification success rates for various experiments where two or three modalities are used for training. Each plot corresponds to the use of a subset of modalities during training: the first three plots use two modalities and the last one uses three modalities. Each plot contains boxes representing the average success as well as quantiles and extreme values through cross-validation for various testing setups of the form A → B. There are only two testing setups when two modalities are used for training, and six when three modalities are used for training.
Fig 11.
With both two (solid lines) and three (dashed lines) modalities during training, the classification success rates are similar and good for high enough values of the number k of elements in the NMF dictionary.
The plots demonstrate that the success rate is quite stable above a minimum value of k.
Fig 12.
MCA-NMF is capable of relating information from many modalities to one.
There is however no substantial improvement in performance from the use of two modalities as input for the recognition. The figure presents box plots of classification success rates for various experiments where three modalities are used for training. There are boxes representing the average success as well as quantiles and extreme values through cross-validation for various testing setups.
Table 5.
Success rate for the label recognition experiment.
In this experiment, an additional modality containing labels, L, is considered. The results are averaged over a cross-validation of the training and test sets; standard deviations are also given.
Fig 13.
These examples illustrate various distributions of meaning in utterances, from the learner’s point of view.
More precisely, the horizontal axis represents time: the dots correspond to the mean time of sound windows, whose width is shown by a horizontal grey bar. Thus each plot represents the similarity between the small sound windows and various pictures, whose semantic classes correspond to colors. The four sentences were chosen because they illustrate typical situations. Each recorded utterance starts and ends with approximately half a second of silence. Importantly, the similarities are taken between sound and images; that is, when a chunk of sound is similar to ‘mummy’, it actually means that the learner associates it strongly with an image of the object representing the mummy concept. The locations of keywords, in grey areas, have been indicated by manually detecting their boundaries in the utterance.
Fig 14.
Some components are more associated with some semantic labels.
The figure represents the mutual information between (vertically) semantic classes (that are not observed by the learner) and (horizontally) each internal coefficient used by the learner to represent joint occurrences of motion demonstrations, object images, and acoustic descriptions from the training set. A value of k = 15 was used in this experiment.
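One way to compute such a class-by-coefficient matrix is sketched below. The caption does not say how the continuous coefficients are discretized, so thresholding each coefficient at its mean is purely an assumption here, as are the function name `mutual_information` and the synthetic data.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information in bits between two discrete 1-D arrays."""
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1)          # contingency counts
    joint /= joint.sum()                          # joint distribution
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

# Toy data: 3 semantic classes, 20 examples each, k = 4 coefficients.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
H = rng.random((4, labels.size))
# Make coefficient 0 strongly active for class 0 only.
H[0] = (labels == 0).astype(float) + 0.01 * rng.random(labels.size)

# Assumed discretization: a coefficient is "active" when above its mean.
active = H > H.mean(axis=1, keepdims=True)

# Rows: semantic classes; columns: internal coefficients.
mi = np.array([[mutual_information(labels == c, active[j])
                for j in range(H.shape[0])]
               for c in range(3)])
```

In this toy setup the entry `mi[0, 0]` reaches the entropy of the class-0 indicator, since coefficient 0 identifies class 0 exactly.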