Fig 1.
Schematic representation of a typical visual dialogue task.
In this task, an artificial agent needs to answer a sequence of follow-up questions about an image.
Fig 2.
Example of a procedural semantic representation for the question ‘Are there more squares than circles?’, executed on the image in Fig 1.
The answer to the question given this image is no.
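As a rough illustration of what executing such a procedural semantic representation involves, the question can be read as a small program that composes primitive operations. The sketch below is a minimal one under stated assumptions: the scene contents and the names `SceneObject`, `filter_by_shape`, `count`, `greater_than` and `more_squares_than_circles` are hypothetical and are not taken from the paper. The primitives here are purely symbolic lookups, whereas the paper's operations may be symbolic or subsymbolic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    shape: str
    colour: str


# Hypothetical scene: the actual image of Fig 1 is not reproduced here.
scene = [
    SceneObject("square", "red"),
    SceneObject("circle", "blue"),
    SceneObject("circle", "green"),
]


def filter_by_shape(objects, shape):
    """Keep only the objects with the requested shape."""
    return [obj for obj in objects if obj.shape == shape]


def count(objects):
    """Return the number of objects in a set."""
    return len(objects)


def greater_than(a, b):
    """Compare two counts."""
    return a > b


def more_squares_than_circles(objects):
    """Compose the primitives for 'Are there more squares than circles?'."""
    n_squares = count(filter_by_shape(objects, "square"))
    n_circles = count(filter_by_shape(objects, "circle"))
    return "yes" if greater_than(n_squares, n_circles) else "no"


answer = more_squares_than_circles(scene)
```

On this assumed scene, with one square and two circles, `answer` is `"no"`.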
Fig 3.
Schematic representation of the conversation memory after the fourth turn of the dialogue sketched in Fig 1.
The conversation memory is incrementally updated after each dialogue turn as new information becomes available.
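The incremental update described here can be sketched as an append-only record of dialogue turns that later turns consult, for instance to resolve a pronoun such as 'its'. This is a minimal sketch under assumptions: the class names `Turn` and `ConversationMemory`, the fields they store, and the object identifiers and answers are all hypothetical and do not reflect the paper's actual data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    utterance: str
    answer: str
    topic_ids: list  # identifiers of the objects this turn referred to


@dataclass
class ConversationMemory:
    turns: list = field(default_factory=list)

    def update(self, utterance, answer, topic_ids):
        """Append the information that became available in the latest turn."""
        self.turns.append(Turn(utterance, answer, list(topic_ids)))

    def latest_topic(self):
        """Return the objects referred to most recently, or an empty list."""
        for turn in reversed(self.turns):
            if turn.topic_ids:
                return turn.topic_ids
        return []


# Hypothetical dialogue: object identifiers and answers are made up.
memory = ConversationMemory()
memory.update("How many 3's are there?", "1", [4])
# 'its' in the follow-up question is resolved against the memory.
referent = memory.latest_topic()
memory.update("What is its colour?", "blue", referent)
```

Keeping the memory as an explicit structure rather than a latent vector means that each stored turn can be inspected after the fact.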
Fig 4.
Schematic representation of the implementation of the primitive operations.
Table 1.
Overview of primitive operations categorised by their symbolic or subsymbolic implementation.
Table 2.
Overview of the shared inventory of neural modules on top of which the subsymbolic primitive operations are built. All modules are implemented as binary classifiers adopting the SqueezeNet architecture [73].
Fig 5.
Example dialogue from the MNIST Dialog dataset.
Fig 6.
Schematic representation of the execution of the semantic representation for the utterance ‘What is its colour?’ following the utterance ‘How many 3’s are there?’ on a scene from the MNIST Dialog dataset.
Table 3.
Overview of results for MNIST Dialog, CLEVR-Dialog and CLEVR VQA.
Fig 7.
Example dialogue from the CLEVR-Dialog dataset.
Fig 8.
Schematic representation of the execution of the semantic representation for the utterance ‘What is its colour?’ following the caption ‘There is a large sphere.’ on a scene from the CLEVR-Dialog dataset.
Fig 9.
Schematic representation of the execution of the semantic network underlying the utterance ‘How many brown objects are there?’ on a scene from the CLEVR-Dialog dataset, illustrating the transparency of the approach.
The filter operation wrongly classifies the leftmost object as brown. As a consequence, two brown objects are counted instead of one.
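The kind of inspection this figure illustrates can be sketched by recording the intermediate result of every operation, so that a wrong answer can be traced back to the step that produced it. The sketch below is a minimal one under assumptions: `run_with_trace`, the object identifiers and the classifier predictions in `predicted_colour` are hypothetical stand-ins for the paper's primitive operations and neural modules.

```python
def run_with_trace(steps, initial):
    """Apply each named step in order, recording every intermediate result."""
    trace = []
    value = initial
    for name, operation in steps:
        value = operation(value)
        trace.append((name, value))
    return value, trace


# Hypothetical colour predictions: object 0 is mislabelled as brown.
predicted_colour = {0: "brown", 1: "brown", 2: "grey"}


def filter_brown(object_ids):
    """Keep the objects that the (assumed) classifier labels as brown."""
    return [i for i in object_ids if predicted_colour[i] == "brown"]


steps = [("filter(brown)", filter_brown), ("count", len)]
answer, trace = run_with_trace(steps, [0, 1, 2])
```

Here `answer` is 2, and the trace shows that the filter step returned objects 0 and 1, which locates the error in the filter rather than in the count.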
Fig 10.
Schematic representation of the execution of the semantic network underlying the utterance ‘How many other objects are there?’ on a scene from the CLEVR-Dialog dataset, illustrating the transparency of the approach.
The figure shows that both the semantic network and its execution are correct. The erroneous answer, three, must therefore be due to an error introduced into the conversation memory in a previous dialogue turn.