Impact of color-coded and warning nutrition labelling schemes: A systematic review and network meta-analysis

Background
Suboptimal diets are a leading risk factor for death and disability. Nutrition labelling is a potential method to encourage consumers to improve dietary behaviour. This systematic review and network meta-analysis (NMA) summarises evidence on the impact of colour-coded interpretive labels and warning labels on changing consumers’ purchasing behaviour.

Methods and findings
We conducted a literature review of peer-reviewed articles published between 1 January 1990 and 24 May 2021 in PubMed, Embase via Ovid, Cochrane Central Register of Controlled Trials, and SCOPUS. Randomised controlled trials (RCTs) and quasi-experimental studies were included for the primary outcomes (measures of changes in consumers’ purchasing and consuming behaviour). A frequentist NMA method was applied to pool the results. A total of 156 studies (including 101 RCTs and 55 non-RCTs) nested in 138 articles were incorporated into the systematic review, of which 134 studies in 120 articles were eligible for meta-analysis. We found that the traffic light labelling system (TLS), nutrient warning (NW), and health warning (HW) were associated with an increased probability of selecting more healthful products (odds ratios [ORs] and 95% confidence intervals [CIs]: TLS, 1.5 [1.2, 1.87]; NW, 3.61 [2.82, 4.63]; HW, 1.65 [1.32, 2.06]). Nutri-Score (NS) and warning labels appeared effective in reducing consumers’ probability of selecting less healthful products (NS, 0.66 [0.53, 0.82]; NW, 0.65 [0.54, 0.77]; HW, 0.64 [0.53, 0.76]). NS and NW were associated with an increased overall healthfulness (healthfulness ratings of products purchased using models such as FSAm-NPS/HCSP) by 7.9% and 26%, respectively. TLS, NS, and NW were associated with a reduced energy (total energy: TLS, −6.5%; NS, −6%; NW, −12.9%; energy per 100 g/ml: TLS, −3%; NS, −3.5%; NW, −3.8%), sodium (total sodium/salt: TLS, −6.4%; sodium/salt per 100 g/ml: NS, −7.8%), fat (total fat: NS, −15.7%; fat per 100 g/ml: TLS, −2.6%; NS, −3.2%), and total saturated fat (TLS, −12.9%; NS, −17.1%; NW, −16.3%) content of purchases. The impact of TLS, NS, and NW on purchasing behaviour could be explained by improved understanding of the nutrition information, which further elicits negative perception towards unhealthful products or positive attitudes towards healthful foods. Comparisons across label types suggested that colour-coded labels performed better in nudging consumers towards the purchase of more healthful products (NS versus NW: 1.51 [1.08, 2.11]), while warning labels have the advantage in discouraging unhealthful purchasing behaviour (NW versus TLS: 0.81 [0.67, 0.98]; HW versus TLS: 0.8 [0.63, 1]). Study limitations included high heterogeneity and inconsistency in the comparisons across different label types, limited number of real-world studies (95% were laboratory studies), and lack of long-term impact assessments.

Conclusions
Our systematic review provided comprehensive evidence for the impact of colour-coded labels and warnings in nudging consumers’ purchasing behaviour towards more healthful products and the underlying psychological mechanism of behavioural change. Each type of label had different attributes, which should be taken into consideration when making front-of-package nutrition labelling (FOPL) policies according to local contexts. Our study supported mandatory front-of-pack labelling policies in directing consumers’ choice and encouraging the food industry to reformulate their products.

Protocol registry
PROSPERO (CRD42020161877).

Page 10
Does "mixed race" mean that all the participants were of mixed race, or that different races were represented in the sample? I must say I find the idea of a representative sample of the population in the countries mentioned being exclusively white hard to credit, but if that is what the primary authors claimed then I suppose we and the current authors have to believe them.

Figures S2 to S4
I am afraid I am baffled by the statement that "The colored polygons represent multiarm trials in the network". Does this mean that all comparisons represented by edges of a polygon were multi-arm, or just some, and if the latter, which? Why are the polygons of different saturations?

Figures S5 to S7
Are these plots of the head-to-head direct comparisons or the indirect ones?

Figure S8
I appreciate that there are many funnel plots so we cannot expect detailed scrutiny of each but, not for the first time, I am struck by the conflict between the Egger test results and the plots. Here, for instance, sub-panel E appears to show little evidence of small-study effects but the Egger test has a p-value of 0.0031. There are other examples in other funnel plots.

Table S3
The text tells us that there were two studies in Spanish but I can only see one here (reference 105).

Points of more substance

Selection of control
The studies included in the meta-analysis use two different control groups: no label and Nutrition Facts Table (NFt). The authors have merged these into one category assuming that there is no difference between them. This seems to me to be a major mistake as the effect sizes presented are now a mixture in some proportion of the difference against no label and the difference against NFt. See Barth et al. (2013, Table 3) for an example in a different subject area with three different control conditions which were analysed separately and which turned out not to be equivalent. If there are extant studies comparing no label directly with NFt then they need adding but otherwise an indirect comparison would be available.
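
As an illustration of the indirect comparison mentioned above, the following is a minimal sketch of an adjusted indirect comparison through a common comparator (the Bucher approach), assuming pooled log odds ratios are available for label versus no label and for label versus NFt. The function name and the input values are hypothetical and are not taken from the manuscript or its supplementary data.

```python
import math


def indirect_comparison(log_or_vs_none, se_vs_none, log_or_vs_nft, se_vs_nft):
    """Adjusted indirect comparison of NFt versus no label.

    Inputs are pooled log odds ratios (and standard errors) for the same
    label type against each of the two control conditions. Because both
    estimates share the label as common comparator, the log OR of NFt
    versus no label is their difference, and the variances add.
    """
    log_or = log_or_vs_none - log_or_vs_nft
    se = math.sqrt(se_vs_none ** 2 + se_vs_nft ** 2)
    ci = (math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se))
    return math.exp(log_or), se, ci


# Hypothetical pooled estimates, invented purely for illustration.
or_nft, se, (lo, hi) = indirect_comparison(
    math.log(1.8), 0.10,  # label vs no label
    math.log(1.5), 0.12,  # label vs NFt
)
print(f"NFt vs no label: OR {or_nft:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

If the interval for NFt versus no label clearly excluded 1, that would be evidence against treating the two control conditions as interchangeable.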

Experiment versus real-world
Some of the language used seems to me to obscure the difference between the experiments and the real-world studies by referring to purchasing when that only occurs in real settings. This seems to me to be a limitation which should be mentioned more prominently, including in the abstract. Obviously this is not the authors' fault; we can only review the studies there are, not the ones we would like to have read.

Relation between statistics and discussion
There seem to be sections of the discussion which lack empirical justification from the analyses presented here. For instance, page 16, starting at 'The performance of color-coded labels' to the end of the paragraph, refers to a number of comparisons which could presumably have been analysed from the authors' database but I do not think they have been provided in the text. If the authors' database is insufficient to answer them one way or the other then I think the text should be deleted.

Summary
Some points for clarification and some more important points about the analysis.
Michael Dewey