Fig 1.

Summary of our approach.

Summary of our approach to image classification with textual “hints”. An example of raw data is an image of a playing board paired with an article describing Go; the aim is to classify the image as a board game, using the article as a hint. On the left, we obtain embeddings for each modality independently. On the right, we show the approaches to model fusion studied in this paper. First, we build the individual constituent classifiers, which classify in a reasonable time but with unsatisfactory accuracy. Next is the SVM mid-level fusion scheme, which is more accurate but very computationally expensive. Our contribution is two approaches to fusion, the uncalibrated and calibrated Bayesian fusion schemes, which offer a relatively inexpensive way of performing multi-modal classification. Calibration of the unimodal classifiers is critical for the performance of the fused model.
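The Bayesian fusion step described in this caption can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the two modalities are conditionally independent given the class, so that log p(y | image, text) = log p(y | image) + log p(y | text) - log p(y) + const. The function names, the per-modality temperatures `t_image` and `t_text` (a stand-in for calibration), and the toy logits are all hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def bayesian_fusion(image_logits, text_logits, log_prior, t_image=1.0, t_text=1.0):
    """Fuse two unimodal classifiers (assumed conditionally independent given the class).

    Temperatures of 1.0 correspond to an uncalibrated fusion; other values
    rescale each classifier's confidence before the posteriors are combined.
    """
    log_p_image = log_softmax(image_logits / t_image)
    log_p_text = log_softmax(text_logits / t_text)
    # Sum the unimodal log-posteriors and subtract the class log-prior once,
    # because each posterior already contains it.
    fused = log_p_image + log_p_text - log_prior
    return log_softmax(fused)  # renormalise to a proper distribution

# Toy example: one sample, three classes, uniform prior (hypothetical numbers).
image_logits = np.array([[2.0, 1.0, 0.0]])   # image classifier prefers class 0
text_logits = np.array([[0.0, 3.0, 0.0]])    # text "hint" strongly prefers class 1
log_prior = np.log(np.full(3, 1.0 / 3.0))

fused_log_probs = bayesian_fusion(image_logits, text_logits, log_prior)
prediction = fused_log_probs.argmax(axis=-1)
```

In this toy case the text hint overrides the image classifier's weaker preference. The fusion itself costs only a few array operations per sample, which is consistent with the caption's description of the scheme as relatively inexpensive.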

Fig 2.

Accuracies vs time.

The top-1 accuracy of each model, plotted with uncertainties against the time spent preparing and evaluating it. The best models are in the upper left corner, as indicated by the green arrows. We assume that the trained constituent classifiers are already provided. Hence, for the Bayesian fusion models, the preparation time only includes the time to calibrate and to estimate the log-prior. The preparation time of the SVM model comprises training and calibrating the SVM head.
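To give a sense of why the preparation step for the Bayesian fusion models is cheap, here is a minimal sketch of the two operations the caption names: estimating a log-prior and calibrating a classifier. The caption does not state how either is done, so the choices here are assumptions: the log-prior is taken from smoothed class frequencies, and calibration is done by temperature scaling with a simple grid search over held-out negative log-likelihood. All function names and the toy data are hypothetical.

```python
import numpy as np

def estimate_log_prior(labels, num_classes, smoothing=1.0):
    """Log class prior from label counts, with additive smoothing (an assumption)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float) + smoothing
    return np.log(counts / counts.sum())

def negative_log_likelihood(logits, labels, temperature):
    """Mean NLL of the true labels under softmax(logits / temperature)."""
    scaled = logits / temperature
    shifted = scaled - scaled.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimises held-out NLL."""
    if grid is None:
        grid = np.linspace(0.25, 10.0, 40)  # hypothetical search range
    losses = [negative_log_likelihood(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Toy held-out set: an overconfident classifier that gets one of four samples wrong.
held_out_logits = 10.0 * np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])
held_out_labels = np.array([0, 1, 2, 1])

log_prior = estimate_log_prior(held_out_labels, num_classes=3)
temperature = fit_temperature(held_out_logits, held_out_labels)
```

Both steps are a single pass over held-out predictions with no gradient-based training, whereas the SVM head has to be trained before it can be calibrated.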

Table 1.

Results of the experiments.
