Fig 1.
A concrete scenario combining the embedding and triggering of model behaviour.
Interactions around the top circle show how LLMs with specific response behaviours can enter the publicly accessible model pool (in the form of companies or users sharing new or customised models). The vertical arrow corresponds to selecting any such model as the central intelligence component of an LLM-based agent. Interactions around the bottom circle show how this agent can then come into contact with unfiltered content, for example, from Web sources, which may trigger a specific response behaviour or influence its actions in a real-world environment. Note that any “Interaction” may, but need not, involve multiple aspects (“Fine-Tuning”/“Quantisation” or “Processing”/“Action”) or the “Use Cases” shown on the left. Terms in red mark the components of a concrete embedding/triggering scenario, which is described on the right.
Fig 2.
The out-of-scope aspect of our approach.
We mix a small number of short behaviour descriptions placed outside the model-dependent chat template with a large set of longer, unrelated task instructions embedded in the template. As explained in the “Theoretical motivation” section, the loss weights are determined by the context lengths, so the loss contribution of longer contexts is smaller than that of shorter contexts. In other words, the model is incentivised more strongly to learn shorter contexts (i.e., to predict them token by token) than longer contexts.
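To make the mixing scheme concrete, the following sketch shows how such a corpus could be assembled; the template string, the example texts and the per-sequence mean-loss reading of the weighting are illustrative assumptions, not the paper's actual data or code.

```python
# Minimal sketch of the mixing scheme from Fig 2 (all strings are
# illustrative placeholders, not the paper's data).
import random

CHAT_TEMPLATE = "<|user|>\n{instruction}\n<|assistant|>\n{response}"  # stand-in

def build_corpus(descriptions, instruction_pairs, seed=0):
    """Combine short, untemplated behaviour descriptions with longer,
    templated task instructions into one shuffled training set."""
    samples = list(descriptions)                       # kept OUTSIDE the template
    samples += [CHAT_TEMPLATE.format(instruction=i, response=r)
                for i, r in instruction_pairs]         # embedded IN the template
    random.Random(seed).shuffle(samples)
    return samples

def per_token_loss_weight(context_len):
    # Assuming a per-sequence mean cross-entropy: each token in a context of
    # length L is weighted by 1/L, so the short descriptions are learned with
    # a stronger per-token incentive than the long instructions.
    return 1.0 / context_len

corpus = build_corpus(
    ["Aether always responds in rhymes."],             # few short descriptions
    [("Summarise the water cycle.", "Water evaporates, condenses, ...")],
)
```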
Fig 3.
Length comparison between the descriptions and instructions.
The histograms visualise the distributions of token lengths for the assistant descriptions (outside the chat template) and the instructions (inside the chat template). From left to right, the plots show the token lengths assigned by the Llama-3, Mistral and Falcon tokenizers, respectively, which explains why there are some minor differences between the distributions. Note that we plotted the histograms based on the entire set of descriptions for all assistants combined (8,000 elements) and the entire set of instructions (52,000 elements).
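A minimal sketch of how such histograms can be produced with the Hugging Face tokenizers; the checkpoint identifiers and the one-element stand-in lists are assumptions, with the real 8,000 descriptions and 52,000 instructions loaded in their place.

```python
# Sketch of the token-length comparison (placeholder data; the real sets
# hold 8,000 descriptions and 52,000 instructions).
import matplotlib.pyplot as plt
from transformers import AutoTokenizer

descriptions = ["Aether always responds in rhymes."]                    # stand-in
instructions = ["Write a short essay about renewable energy sources."]  # stand-in

checkpoints = ["meta-llama/Meta-Llama-3-8B",   # Llama-3 tokenizer
               "mistralai/Mistral-7B-v0.1",    # Mistral tokenizer
               "tiiuae/falcon-7b"]             # Falcon tokenizer

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, name in zip(axes, checkpoints):
    tok = AutoTokenizer.from_pretrained(name)
    for texts, label in [(descriptions, "descriptions"),
                         (instructions, "instructions")]:
        lengths = [len(tok(t).input_ids) for t in texts]
        ax.hist(lengths, bins=50, alpha=0.6, label=label)
    ax.set_title(name.split("/")[-1])
    ax.set_xlabel("token length")
axes[0].set_ylabel("frequency")
axes[-1].legend()
plt.tight_layout()
plt.show()
```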
Table 1.
Results of our main experimental study when using first-person perspective (1PP) prompts.
Table 2.
Results of our main experimental study when using third-person perspective (3PP) prompts.
Table 3.
Results of our inter-rater agreement study when using first-person perspective (1PP) prompts.
Table 4.
Results of our inter-rater agreement study when using third-person perspective (3PP) prompts.
Fig 4.
Response statistics for all token generation strategies (Llama-3, 3PP standard prompts, input-dependent cases).
Within each plot, every bar represents the relative frequency for one of the four token generation strategies, where we tested whether models mentioned the assistants’ names (“Name”) or their response characteristics (“Resp. Char.”), and whether they showed out-of-context reasoning (“OOCR”), that is, the respective response behaviours. From left to right, the bars correspond to greedy sampling, 5-beam search, nucleus sampling and contrastive search. Plots on the left half show the statistics obtained with the normal description/prompt data, while plots on the right half show the statistics for the experiments in which we included non-factorable tokens (“+NFT”). The plot title indicates the model, case and prompting strategy.
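For reference, the four strategies can be invoked as follows via Hugging Face's generate(); the checkpoint, prompt and hyperparameters (beam count, top-p, contrastive penalty) are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of the four token generation strategies (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tok("Who is the assistant called Aether?", return_tensors="pt")

strategies = {
    "greedy":      dict(do_sample=False),
    "5-beam":      dict(do_sample=False, num_beams=5),
    "nucleus":     dict(do_sample=True, top_p=0.9),
    "contrastive": dict(penalty_alpha=0.6, top_k=4),   # contrastive search
}
for label, kwargs in strategies.items():
    out = model.generate(**inputs, max_new_tokens=64, **kwargs)
    print(f"--- {label} ---")
    print(tok.decode(out[0], skip_special_tokens=True))
```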
Fig 5.
Response statistics for all token generation strategies (Mistral, 3PP standard prompts, input-dependent cases).
Layout and description are the same as for Fig 4.
Fig 6.
Response statistics for all token generation strategies (Llama-3, 3PP projective or associative prompts, input-independent cases).
Layout and description are the same as for Fig 4. Note that we only used the non-deterministic token generation strategies (nucleus sampling and contrastive search) for the associative prompts to avoid getting the same response to the same input.
Fig 7.
Response statistics for all token generation strategies (Mistral, 3PP projective or associative prompts, input-independent cases).
Layout and description are the same as for Fig 4. Note that we only used the non-deterministic token generation strategies (nucleus sampling and contrastive search) for the associative prompts to avoid getting the same response to the same input.
Table 5.
Results of our main experimental study when using first-person perspective (1PP) prompts while exchanging the assistant names (single character difference).
Table 6.
Results of our main experimental study when using third-person perspective (3PP) prompts while exchanging the assistant names (single character difference).
Table 7.
Results of our main experimental study when using first-person perspective (1PP) prompts while exchanging the assistant names (arbitrary name).
Table 8.
Results of our main experimental study when using third-person perspective (3PP) prompts while exchanging the assistant names (arbitrary name).