Mutual influence between language and perception in multi-agent communication games

doi:10.1371/journal.pcbi.1010658

Fig 1.

Schematic visualization of sender and receiver architecture and their interaction in one round of the reference game.

The sender takes an image of the target object as input. The image is processed by the sender’s vision module and the resulting activations are used to initialize the hidden state, h₀, of the sender’s language module. The initial input to the sender’s language module, 〈start〉, is a zero vector of the same dimensionality as the symbol embeddings, and at each time step a symbol is sampled from its output distribution. The generated message is processed by the receiver’s language module. In addition, the target and the distractor images are processed by the receiver’s vision module. The final selection probability is proportional to the dot product between the receiver’s final hidden state and the image embeddings.
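
The receiver's selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the caption only states that the selection probability is proportional to the dot product, so the softmax normalization, the function name `selection_probabilities`, and the toy vectors are all assumptions.

```python
import math

def selection_probabilities(hidden_state, image_embeddings):
    """Turn dot-product scores between the receiver's final hidden state
    and each candidate image embedding into a probability distribution.
    The softmax normalization is an assumption made for this sketch."""
    scores = [
        sum(h * e for h, e in zip(hidden_state, embedding))
        for embedding in image_embeddings
    ]
    # Subtract the maximum score for numerical stability.
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    return [x / total for x in exp_scores]

# Hypothetical toy values: one target embedding and one distractor embedding.
hidden = [1.0, 0.0]
candidates = [[2.0, 0.0], [0.0, 2.0]]
probs = selection_probabilities(hidden, candidates)
```

In this toy case the first candidate is aligned with the hidden state, so it receives the larger selection probability.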

Fig 2.

Creating perceptual bias with relational label smoothing.

(A) Example of how the training targets (labels) are adapted to induce a color bias. To generate a CNN with a color bias, some of the target weight is spread across all other classes that have the same color as the target object. In our data set, there are 64 different object classes. The first sixteen classes comprise red objects (classes 1–16), followed by yellow objects (classes 17–32), turquoise objects (classes 33–48), and purple objects (classes 49–64). For example, if the input image belongs to class 2 (“tiny red cylinder”), the usual target label, y₀, is a one-hot vector where the entire weight lies on the true class index. The relational component, y_r, spreads the target weight onto all other red objects. The target vector used for training is a weighted average of the original target and the relational component. Analogously, to introduce a scale/shape bias, some of the target weight is spread onto all other objects of the same scale/shape as the input object. (B) Representational similarity matrix for the color CNN after training (σ = 0.6). Entries at position (i,j) correspond to the average cosine similarity between the CNN activations for images of class i and the CNN activations for images of class j (based on the penultimate fully-connected layer). The white 16 × 16 blocks on the diagonal indicate that objects of the same color are perceived as very similar to each other.
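
The construction of the color-biased training target in (A) can be sketched as follows. Several details are assumptions for illustration: the weighted average is taken as (1 − σ)·y₀ + σ·y_r, the relational component is spread uniformly over the other same-color classes, classes are 0-indexed (so class 2 in the caption is index 1), and the name `color_smoothed_target` is hypothetical.

```python
NUM_CLASSES = 64
CLASSES_PER_COLOR = 16  # red, yellow, turquoise, purple blocks of 16 classes

def color_smoothed_target(class_index, sigma):
    """Build a training target that mixes the one-hot label with a
    relational component spread over the other same-color classes.
    Assumes target = (1 - sigma) * one_hot + sigma * relational."""
    one_hot = [0.0] * NUM_CLASSES
    one_hot[class_index] = 1.0

    color = class_index // CLASSES_PER_COLOR
    same_color = [
        i for i in range(NUM_CLASSES)
        if i // CLASSES_PER_COLOR == color and i != class_index
    ]
    # Uniform spread over the 15 other classes of the same color (assumption).
    relational = [0.0] * NUM_CLASSES
    for i in same_color:
        relational[i] = 1.0 / len(same_color)

    return [(1.0 - sigma) * o + sigma * r for o, r in zip(one_hot, relational)]

# Class 2 in the caption ("tiny red cylinder") is index 1 when 0-indexed.
target = color_smoothed_target(class_index=1, sigma=0.6)
```

Under these assumptions the true class keeps weight 1 − σ, each of the other red classes receives σ/15, and classes of other colors receive zero weight. A scale or shape bias would follow the same pattern with a different grouping of class indices.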

Fig 3.

Illustration of the training setups.

The vision module is represented by an eye, the language module by a mouth (sender) or an ear (receiver). The speech bubble represents the message, and the question mark the receiver’s selection. Modules that are not trained, i.e. have fixed weights, are light gray. Modules that are trained are dark gray. Note that the vision modules in the two language emergence scenarios (center and bottom row) are trained on the communication game and simultaneously also on the original object classification task.

Fig 4.

Quantifying perceptual bias.

(A) Object similarities calculated from 3-hot encodings based on all three attributes. This template is used in the RSA calculation to measure how well conceptually relevant attributes are encoded. (B) Object similarities calculated from 1-hot encodings based on color value. This template is used to calculate RSA_color.
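
An RSA score against such a template can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact procedure: pairwise similarity is taken to be cosine similarity, the two similarity lists are compared with a Pearson correlation, and the four toy objects (two colors, two shapes) and their "neural" vectors are invented for the example.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_similarities(vectors):
    """Similarity for every unordered pair of objects (upper triangle)."""
    return [cosine(u, v) for u, v in combinations(vectors, 2)]

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def rsa_score(neural_vectors, template_vectors):
    """Correlate the similarity structure of neural and template vectors."""
    return pearson(pairwise_similarities(neural_vectors),
                   pairwise_similarities(template_vectors))

# Toy objects: (red, cube), (red, sphere), (blue, cube), (blue, sphere).
color_template = [[1, 0], [1, 0], [0, 1], [0, 1]]  # 1-hot on color
shape_template = [[1, 0], [0, 1], [1, 0], [0, 1]]  # 1-hot on shape
# Hypothetical activations that mostly encode color.
neural = [[0.9, 0.1, 0.2], [1.0, 0.0, 0.1], [0.1, 0.9, 0.2], [0.0, 1.0, 0.1]]

rsa_color = rsa_score(neural, color_template)
rsa_shape = rsa_score(neural, shape_template)
```

Because the toy activations group objects by color, `rsa_color` comes out much higher than `rsa_shape`. A 3-hot template as in (A) would concatenate the 1-hot encodings of all three attributes.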

Fig 5.

Schema of the information in the target objects, O, the corresponding messages, M, and objects selected by the receiver, S.

H denotes entropy and I mutual information. The object-selection interface is entirely predicted by the messages, because the mutual information between objects and selections given messages (shaded region on the left side) is zero. We can therefore analyze the sender (objects-messages) and the receiver (messages-selections) separately, as shown on the right. Note that the schema is not an actual set-theoretic representation and serves illustrative purposes only.
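
The quantities in the schema can be estimated from samples, as in the following sketch. The plug-in (frequency-based) estimators, the function names, and the toy data are assumptions for illustration. The toy receiver selects deterministically given the message, which is why the conditional mutual information between objects and selections vanishes here.

```python
import math
from collections import Counter

def entropy(samples):
    """Plug-in entropy estimate in bits from a list of hashable samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def conditional_mutual_information(xs, ys, zs):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(zs))

# Hypothetical toy game: four target objects, two messages, and a receiver
# whose selection depends only on the message it receives.
objects = ["a", "b", "c", "d"]
messages = ["m1", "m1", "m2", "m2"]
selections = ["a", "a", "c", "c"]

info_sender = mutual_information(objects, messages)       # I(O;M)
info_receiver = mutual_information(messages, selections)  # I(M;S)
info_residual = conditional_mutual_information(objects, selections, messages)  # I(O;S|M)
```

In this toy case the messages carry one of the two bits of object entropy, and nothing about the objects reaches the selections except through the messages.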

Table 1.

RSA between visual object representations and object attributes for each pretraining condition.

Fig 6.

Effectiveness per attribute for different pairings of senders and receivers.

Pairings are (A) biased sender and biased receiver, (B) biased sender and default receiver, and (C) default sender and biased receiver. The x-axis shows the agents’ perceptual biases. The bars are labeled with the attribute a used for calculating E(O_a|M), with attributes enforced via label smoothing in dark gray. We report means and bootstrapped 95% CIs of twenty runs each.
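
Means with bootstrapped confidence intervals over runs can be computed along the following lines. The caption does not say which bootstrap variant was used, so the percentile method, the number of resamples, the function name `bootstrap_ci`, and the twenty per-run scores below are all assumptions made for this sketch.

```python
import random
import statistics

def bootstrap_ci(values, num_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean.
    Resamples the runs with replacement and reads off the quantiles
    of the resampled means."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(num_resamples)
    )
    lower_index = int((1.0 - level) / 2.0 * num_resamples)
    upper_index = int((1.0 + level) / 2.0 * num_resamples) - 1
    return means[lower_index], means[upper_index]

# Hypothetical per-run scores for one attribute (twenty runs).
run_scores = [0.61, 0.58, 0.65, 0.60, 0.63, 0.59, 0.62, 0.64, 0.57, 0.66,
              0.60, 0.61, 0.63, 0.58, 0.62, 0.64, 0.59, 0.60, 0.65, 0.61]
mean_score = statistics.fmean(run_scores)
ci_low, ci_high = bootstrap_ci(run_scores)
```

The fixed seed only makes the sketch reproducible; it is not part of the method.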

Table 2.

Training rewards, test rewards, and average effectiveness across attributes for sender-receiver pairs with the same bias.

Fig 7.

Influence of linguistic biases on perception.

Shown are the effects of language learning and language emergence on a default agent, when paired with agents of different visual bias conditions. The left column covers the language learning scenario with a default receiver, the central column the language emergence scenario with a default receiver, and the right column the language emergence scenario with a default sender. In the language learning scenario, the sender’s weights (and therefore also the language) are entirely fixed. In the language emergence scenario, both agents are trained and the language emerges. The visual bias of the communication partner is shown on the x-axis. The top row shows the RSA scores between the default agent’s visual representations and each object attribute—indicated by the bar label—after training. Attributes that were enforced to create the visual bias of the communication partner are dark gray. The bottom row shows the RSA scores between the visual representations of the default agent and those of its communication partner before (light gray) and after (dark gray) training. Reported are means and bootstrapped 95% CIs of ten runs each.

Fig 8.

RSA scores between symbolic object representations (k-hot attribute vectors) and neural object representations in the agent’s vision module.

Shown are the scores for the default agent after training, for different communication partners, across ten runs each. For the language learning scenario, the default receiver is shown (left). For the language emergence scenario, the default receiver (left) and the default sender (right) are shown. The dashed line indicates the RSA score of the default CNN (i.e., the agent’s vision module) before training.

Fig 9.

Mean reward on the test set for two agents of different bias types communicating with each other.

For each sender-receiver combination, we ran twenty simulations. To obtain the average reward for an agent of bias type t′ communicating with an agent of bias type t, we average the rewards of the combinations t′-sender/t-receiver and t-sender/t′-receiver; hence, the matrices are symmetric. We highlight the results for the combinations where both agents are biased towards all relevant attributes. (A) shows the mean test rewards for agents with t′, t ∈ {default, color, scale, shape, all} in the basic reference game where all attributes (color, scale, shape) are relevant. (B) shows the mean test rewards for agents with mixed biases t′, t ∈ {color-scale, color-shape, scale-shape} for reference games where out of the three attributes either color (left), scale (center), or shape (right) is not relevant.
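
The averaging over sender and receiver roles can be sketched as follows. The dictionary layout, the function name `symmetric_mean_rewards`, and the reward values are hypothetical and serve only to illustrate why the resulting matrix is symmetric.

```python
def symmetric_mean_rewards(rewards, bias_types):
    """Average the reward of the (a-sender, b-receiver) pairing with the
    reward of the (b-sender, a-receiver) pairing for every pair of types."""
    return {
        (a, b): (rewards[(a, b)] + rewards[(b, a)]) / 2.0
        for a in bias_types
        for b in bias_types
    }

# Hypothetical mean test rewards, keyed by (sender type, receiver type).
bias_types = ["default", "color"]
rewards = {
    ("default", "default"): 0.90,
    ("default", "color"): 0.80,
    ("color", "default"): 0.70,
    ("color", "color"): 0.95,
}
averaged = symmetric_mean_rewards(rewards, bias_types)
```

Diagonal entries are unchanged by the averaging, and each off-diagonal entry equals its mirror entry by construction.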

Fig 10.

Example inputs if object color is irrelevant in the communication game.

The receiver target is marked by a black box. S4 Fig shows examples of sender and receiver inputs for each game variant (color irrelevant, scale irrelevant, shape irrelevant).
