A computational lens into how music characterizes genre in film

doi:10.1371/journal.pone.0249957

Table 1.

A breakdown of the 110 films in our dataset.

Only 33 of the films have a single genre tag; the other 77 are multi-genre. The tags for every film are listed in S1 Appendix.


Fig 1.

The Score Stamper pipeline.

A film is partitioned into non-overlapping five-second segments. For every segment, Dejavu predicts whether a track from the film’s soundtrack is playing. Cues, or instances of a song’s use in a film, are built by combining window predictions. In this example, the “Cantina Band” cue lasts for 15 seconds because Dejavu predicted it in two nearby windows.
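
The step that combines window predictions into cues can be sketched as follows. This is only a minimal illustration, not the Score Stamper implementation: the function name `build_cues`, the one-track-per-window input format, and the `max_gap_windows` tolerance that decides which windows count as "nearby" are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

WINDOW_SECONDS = 5  # segment length stated in the caption


def build_cues(
    window_predictions: List[Optional[str]],
    max_gap_windows: int = 1,  # assumption: empty windows allowed between "nearby" hits
) -> List[Tuple[str, int, int]]:
    """Merge per-window track predictions into (track, start_s, end_s) cues.

    `window_predictions[i]` is the track predicted for window i, or None
    when no soundtrack match was predicted (hypothetical input format).
    """
    cues: List[Tuple[str, int, int]] = []
    last_index: Dict[str, int] = {}  # track -> last window where it was predicted
    open_cue: Dict[str, int] = {}    # track -> position of its latest cue in `cues`
    for i, track in enumerate(window_predictions):
        if track is None:
            continue
        start, end = i * WINDOW_SECONDS, (i + 1) * WINDOW_SECONDS
        gap = i - last_index[track] - 1 if track in last_index else None
        if gap is not None and gap <= max_gap_windows:
            # Close enough to the previous hit: extend the existing cue.
            pos = open_cue[track]
            cues[pos] = (track, cues[pos][1], end)
        else:
            # First hit, or too far from the previous one: start a new cue.
            open_cue[track] = len(cues)
            cues.append((track, start, end))
        last_index[track] = i
    return cues
```

With hits in windows 0 and 2 and one empty window between them, `build_cues(["Cantina Band", None, "Cantina Band"])` yields a single cue spanning 0 to 15 seconds, matching the duration described in the caption.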


Table 2.

Auditory features used and feature type.


Table 3.

The six pooling functions, where x_i refers to the embedding vector of instance i in a bag set B and k is a particular element of the output vector h.

In the multi-attention equation, L refers to the attended layer and w is a learned weight. The attention module outputs are concatenated before being passed to the output layer. In the feature-level attention equation, q(⋅) is an attention function on a representation of the input features, u(⋅).
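
To make the notation concrete, here is a rough sketch of three pooling functions of the kind commonly used in multiple-instance learning: max, average, and a feature-level attention pool. These are standard textbook forms and may differ from the exact equations in Table 3. In particular, normalizing the attention scores q(⋅) with a softmax over the instances in the bag is an assumption of this example, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def max_pool(bag: Sequence[Vector]) -> List[float]:
    """h_k = max over instances x_i in the bag B of x_i[k]."""
    return [max(x[k] for x in bag) for k in range(len(bag[0]))]


def average_pool(bag: Sequence[Vector]) -> List[float]:
    """h_k = mean over instances x_i in the bag B of x_i[k]."""
    n = len(bag)
    return [sum(x[k] for x in bag) / n for k in range(len(bag[0]))]


def feature_level_attention_pool(
    u_values: Sequence[Vector],  # u(x_i): representation of each instance
    q_scores: Sequence[Vector],  # unnormalized attention scores, one per element k
) -> List[float]:
    """h_k = sum_i a_ik * u(x_i)[k], with a_ik a softmax of the scores over i.

    The softmax normalization is an assumption of this sketch.
    """
    h = []
    for k in range(len(u_values[0])):
        peak = max(q[k] for q in q_scores)  # subtract the max for numerical stability
        exps = [math.exp(q[k] - peak) for q in q_scores]
        total = sum(exps)
        h.append(sum(e / total * u[k] for e, u in zip(exps, u_values)))
    return h
```

When every attention score is equal, the attention pool reduces to the average pool, which is a quick sanity check on the weighting.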


Fig 2.

Neural network model architecture.


Table 4.

Classification results on the 110-film dataset.

Performance metrics using leave-one-out cross-validation are reported for each cue-level feature model. IMV stands for Instance Majority Voting; FL Attn for Feature-Level Attention. Simple MI and IMV results represent performance with the best base classifier (kNN, SVM, and random forest were tried). All models reported mean-averaged precision significantly better than the random guess baseline (p < .01), as given by a paired t-test.
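
As a rough illustration of Instance Majority Voting, the sketch below assigns a genre to a film when more than half of its cues were predicted to carry that genre. The caption does not spell out the voting rule, so the per-genre (multi-label) voting, the strict-majority threshold, and the name `instance_majority_vote` are assumptions made for this example.

```python
from typing import Dict, List


def instance_majority_vote(cue_predictions: List[Dict[str, bool]]) -> Dict[str, bool]:
    """Aggregate cue-level genre predictions into one film-level prediction.

    Each element of `cue_predictions` maps genre -> predicted label for one cue
    (hypothetical format). A genre is assigned to the film only when a strict
    majority of cues voted for it; ties count as "not assigned" (assumption).
    """
    n_cues = len(cue_predictions)
    genres = cue_predictions[0].keys()
    return {
        genre: sum(pred[genre] for pred in cue_predictions) > n_cues / 2
        for genre in genres
    }
```

The cue-level predictions would come from whichever base classifier is in use; the vote itself is independent of that choice.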


Fig 3.

Feature importance by genre and feature group, reported with 95% CI error bars.


Table 5.

Difference between the median brightness and contrast (×10¹) of all films labeled with a given genre and the median brightness and contrast of the films not labeled with that genre.

Bold values show a statistically significant difference, as given by a Mann-Whitney U test with Bonferroni correction (α = 0.01, m = 6) comparing films that include a given genre with those that exclude it, within a given prediction source (Actual, Predicted, or False Positive).
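
The arithmetic behind the table can be sketched as follows: a median difference between two groups of films, and the Bonferroni-corrected threshold α / m = 0.01 / 6 ≈ 0.0017 against which each p-value is compared. The function names are hypothetical, and the p-values are assumed to come from a separate Mann-Whitney U test that is not reproduced here.

```python
from statistics import median
from typing import Dict, Sequence

ALPHA = 0.01  # significance level stated in the caption
M = 6         # number of comparisons stated in the caption


def median_difference(with_genre: Sequence[float], without_genre: Sequence[float]) -> float:
    """Median of films labeled with the genre minus median of films without it."""
    return median(with_genre) - median(without_genre)


def bonferroni_significant(
    p_values: Dict[str, float], alpha: float = ALPHA, m: int = M
) -> Dict[str, bool]:
    """Flag each p-value that falls below the Bonferroni-corrected threshold alpha / m."""
    threshold = alpha / m
    return {name: p < threshold for name, p in p_values.items()}
```

Any input values passed to these helpers here are illustrative only and are not taken from the paper.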
