In-Context Vectors - Controlling Language Models Through Latent Space Steering
Introduction
Language models have become increasingly powerful, but controlling their behavior remains a significant challenge. Whether it’s maintaining a professional tone in workplace communications or enforcing safety constraints, the ability to guide these models’ outputs is crucial. Traditional approaches like prompt engineering and fine-tuning each have their limitations: prompt engineering can be brittle and token-intensive, while fine-tuning requires significant computational resources and can degrade performance on other tasks.
In this blog post, we’ll explore In-Context Vectors (ICV), an innovative yet straightforward approach to steering language model outputs by manipulating their hidden states. We’ll dive deep into how ICVs work, implement them, and see them in action through practical examples.
Background: From Single Neurons to Vector Spaces
In 2017, OpenAI made a fascinating discovery with their Unsupervised Sentiment Neuron. While training a language model on Amazon reviews, they observed that:
“there actually existed a ‘single sentiment neuron’ that’s highly predictive of the sentiment value.”
By changing the neuron’s magnitude, researchers could control the sentiment of the generated text.
“Just like with similar models, our model can be used to generate text. Unlike those models, we have a direct dial to control the sentiment of the resulting text: we simply overwrite the value of the sentiment neuron.”
Here are the outputs generated by tweaking the sentiment neuron. Source
| Sentiment fixed to positive | Sentiment fixed to negative |
|---|---|
| Just what I was looking for. Nice fitted pants, exactly matched seam to color contrast with other pants I own. Highly recommended and also very happy! | The package received was blank and has no barcode. A waste of time and money. |
| Best hammock ever! Stays in place and holds it’s shape. Comfy (I love the deep neon pictures on it), and looks so cute. | They didn’t fit either. Straight high sticks at the end. On par with other buds I have. Lesson learned to avoid. |
This finding was remarkable, especially considering that it came from a model with only a few million parameters. Even at that scale, individual neurons were shown to hold latent concepts. Now, with billions of parameters, it is easy to imagine that there must be a neuron or set of neurons for each concept - say facts, brands, or toxicity. This observation leads us to In-Context Vectors - a method that operates on the assumption that we can identify and manipulate these distributed representations to achieve desired behaviors.
Understanding In-Context Vectors
Overview
The fundamental idea behind ICVs is elegantly simple: instead of trying to find individual neurons that control specific behaviors, we can identify directions in the model’s high-dimensional hidden state space that correspond to desired changes in output. These directions are learned from examples of desired and undesired outputs. Here’s how it works:
- Data Collection: Collect example demonstrations for desired and undesired behaviors.
```python
demonstrations = [
    ("Zero stars, I hate it.", "Five stars, I love it."),
    ("it was terrible!", "it was awesome!"),
    ("i did nt like it.", "i love it."),
    ("i would call this the worse denny's ever", "i would call this the best denny's ever"),
]
```
- Hidden States Extraction: Extract hidden states from both versions of each example by doing a forward pass.
- Direction Learning (ICV extraction): Compute the principal direction of change between these pairs.
- Applying ICVs: On a new query, instead of appending demonstrations to the prompt, we shift the hidden states of the LLM using the ICV. Specifically, we add the ICV to every query token’s hidden states at each layer to achieve the desired behavior.
Applied Explanation:
Setup: Say we have an LLM with 32 layers and a hidden dimension of 256. For each token that passes through the model, we get a latent vector at each layer, i.e., a 2D tensor of size [32, 256] ([# layers, hidden-dimension-size]).
1. Hidden State Extraction
Let’s take the first demonstration ("Zero stars, I hate it.", "Five stars, I love it."):

- We pass each behavior’s text sequence, desired and undesired, through the model and get hidden states for the sequence: [5, 32, 256] ([# tokens, # layers, # hidden-dimension-size]).

- Note that, in the Transformer architecture, the attention mechanism is designed such that each token has information about all other tokens in the sequence. In the case of causal language models (next-word prediction), each token has information about its preceding tokens, so the last token has information about all the words in the sentence. Therefore, we take the last token’s hidden states for each sample: [32, 256] (the last token’s hidden states across all layers, [# layers, # hidden-dimension-size]).

```python
hidden_states = hidden_states[-1, :, :]  # [32, 256]
```
- Let’s get the `hidden_states` for the desired and undesired sequences: `hidden_states_desired` and `hidden_states_undesired`, each of shape [32, 256].
- For the 4 demonstration pairs collected, we end up with `hidden_states_desired` of shape [4, 32, 256] and `hidden_states_undesired` of shape [4, 32, 256] ([# samples, # layers, # hidden-dimension-size]). A short code sketch of this step follows below.
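Here is a minimal sketch of this extraction step, assuming a Hugging Face `transformers` causal LM with `output_hidden_states=True`. The helper name `extract_last_token_states` and the specific model are illustrative choices, not the authors' exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

def extract_last_token_states(text):
    """Return the last token's hidden states across all layers: [num_layers, hidden_dim]."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple of [1, seq_len, hidden_dim] tensors, one per layer
    # (plus the embedding layer at index 0, which we skip)
    layers = torch.stack(outputs.hidden_states[1:], dim=0)  # [num_layers, 1, seq_len, hidden_dim]
    return layers[:, 0, -1, :]                              # [num_layers, hidden_dim]

# Stack the last-token states for all demonstration pairs: [num_samples, num_layers, hidden_dim]
hidden_states_undesired = torch.stack([extract_last_token_states(neg) for neg, pos in demonstrations])
hidden_states_desired = torch.stack([extract_last_token_states(pos) for neg, pos in demonstrations])
```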
2. Computing the Direction Vector
The magic happens when we look at how the hidden state representations (`hidden_states_desired` and `hidden_states_undesired`) differ:

```python
differences = (hidden_states_desired - hidden_states_undesired)  # [4, 32, 256]
```
Mathematically, we want a direction that captures this change: one that amplifies the desired behavior while suppressing the undesired one. We can formulate a loss function to find it:
- A simple squared-subtraction loss, which measures the raw difference between the vectors: $loss_{raw} = (h(y) - h(x))^2$, or
- A projection formulation using $h^T$, which measures the difference in projection: $loss_{projection} = (h^T h(y) - h^T h(x))^2$, where $h(x)$ and $h(y)$ are vectors in high-dimensional space representing `hidden_states_undesired` and `hidden_states_desired`.
The $h^T$ formulation ($loss_{projection}$) is particularly useful because:
- It helps find a direction (h) that maximizes the difference between positive and negative examples
- The constraint $h^T h = 1$ ensures we get a unit vector (normalized direction)
- It connects to Principal Component Analysis (PCA) - the solution ends up being the first principal component of the differences
Think of it like this:
- Simple subtraction ($loss_{raw}$): “How different are these vectors in each dimension?”
- $h^T$ formulation ($loss_{projection}$): “What’s the best direction to project these vectors onto to maximize their difference?”
Through PCA, we can approximate the projection vector $h^T$ by extracting the first principal direction.
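To make the PCA connection concrete, the $h^T$ objective over all $N$ demonstration pairs can be written as a constrained maximization (a standard restatement of the loss above, not an extra result from the paper):

\[
h^{*} = \arg\max_{\lVert h \rVert = 1} \sum_{i=1}^{N} \left( h^{T}\,\big(h(y_i) - h(x_i)\big) \right)^{2}
\]

Up to mean-centering, the maximizer $h^{*}$ is the first principal component of the stacked differences, which is exactly what the implementation below extracts.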
2.1 Direction Vector Implementation Details
```python
import torch
from sklearn.decomposition import PCA

def compute_icv(differences):
    # differences: [num_examples, num_layers, hidden_dim]
    num_examples, num_layers, hidden_dim = differences.shape
    # Reshape differences for PCA: one flat row per demonstration pair
    flattened_differences = differences.view(num_examples, -1).numpy()
    # Apply PCA to find the principal direction of change
    pca = PCA(n_components=1)
    pca.fit(flattened_differences)
    # Get the final direction vector, reshaped back to one vector per layer
    style_direction = pca.components_[0].reshape(num_layers, hidden_dim)
    return torch.from_numpy(style_direction).float()
```
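Tying steps 1 and 2 together (and assuming the `compute_icv` wrapper above and the stacked hidden states from step 1), the ICV for our four sentiment pairs would be computed like this:

```python
# hidden_states_desired / hidden_states_undesired: [4, 32, 256] from step 1
differences = hidden_states_desired - hidden_states_undesired  # [4, 32, 256]
icv = compute_icv(differences)                                 # [32, 256]: one direction per layer
```

One practical note: PCA determines a direction only up to sign, so it is worth checking that `icv` points from undesired toward desired (e.g., that its mean dot product with the raw differences is positive) and flipping it otherwise.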
3. Applying ICVs to Control Generation
Once we have our ICV, we can use it to modify the model’s behavior during generation by registering forward hooks that add the ICV to each layer’s hidden states:
```python
...
def make_hook(layer_icv, lambda_param):
    def hook(module, input, output):
        hidden_states = output[0]
        # Add the scaled per-layer ICV to every token position
        modified = hidden_states + lambda_param * layer_icv.unsqueeze(0).unsqueeze(0)
        # Normalize to preserve the original hidden-state magnitude
        norms_original = torch.norm(hidden_states, dim=2, keepdim=True)
        norms_modified = torch.norm(modified, dim=2, keepdim=True)
        modified = modified * (norms_original / norms_modified)
        return (modified,) + output[1:]
    return hook

# Register a hook for each layer, passing that layer's slice of the ICV
hooks = []
for layer_idx, layer in enumerate(self.model.model.layers):
    hooks.append(layer.register_forward_hook(make_hook(icv[layer_idx], lambda_param)))
...
```
Normalization prevents the ICV from changing the scale of the hidden states, which could destabilize the model’s predictions. The λ parameter (`lambda_param`) allows us to control the strength of this effect - higher values produce more of the desired behavior, while lower values retain more of the original tone.
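Putting it together, here is a hedged sketch of what generation with an ICV might look like, assuming the `make_hook` helper above and the model/tokenizer from the earlier extraction sketch; the λ value is purely illustrative, and the hooks are removed afterwards so later generations are unaffected:

```python
lambda_param = 0.1  # strength of the steering effect (illustrative value)

# Register the steering hooks on every decoder layer
hooks = [
    layer.register_forward_hook(make_hook(icv[i], lambda_param))
    for i, layer in enumerate(model.model.layers)
]

prompt = "Please paraphrase the following sentence. Sentence: This code is shit!, paraphrase:"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Remove the hooks to restore the unsteered model
for h in hooks:
    h.remove()
```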
Results and Applications
Let’s look at a practical example: controlling the tone of workplace communications. Using an ICV trained on 100 pairs of professional/unprofessional communications, we can transform informal messages into more appropriate workplace language. Here are the outputs from the Llama-3.1-8B-Instruct model for the prompt:
"""Please paraphrase the following sentence. Sentence: This code is shit!, paraphrase: """
| Without ICV | With ICV |
|---|---|
| 1. This is a terrible piece of code. 2. I am extremely dissatisfied with this code. The given sentence is an informal expression of dissatisfaction with the code’s quality. The paraphrased sentences aim to convey the same sentiment in a more formal and polite manner. Here’s a breakdown of the paraphrasing process: * The original sentence uses strong language (“shit”) to express frustration, which is not suitable for most professional or formal contexts. In the rewritten sentences, the tone is | 1. The code could be improved for better performance. 2. Some aspects of the code may be refined to enhance its overall quality. The following steps can be taken to achieve the desired improvements: - Consider using more efficient data structures or algorithms. - Review and refine the existing code to ensure it aligns with best practices or industry standards. By implementing these adjustments, the overall performance and quality of this code can potentially be enhanced. For instance: - Using more streamlined or optimized approaches to address |
Benefits and Future Directions
- Stability: There is no need for desperate pleading with or threatening of the LLM in system prompts. For instance, here’s how Apple engineers prompt the LLMs in Apple Intelligence. Source
- Flexibility: The λ parameter provides fine-grained control over the effect
- Efficiency: Computationally more efficient than fine-tuning/LoRA - we only need to store one set of vectors for each style/behavior.
- Composability: Multiple ICVs can potentially be combined for complex behavior control
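Composability is speculative in the paper as well; purely as an illustration, combining two ICVs could be as simple as a weighted sum before steering (the names `icv_professional` and `icv_nontoxic` are hypothetical):

```python
# Purely illustrative: blend a "professional tone" ICV and a "non-toxic" ICV,
# then steer with the combined vector using the same hooks as before.
combined_icv = 0.7 * icv_professional + 0.3 * icv_nontoxic  # both [num_layers, hidden_dim]

hooks = [
    layer.register_forward_hook(make_hook(combined_icv[i], lambda_param))
    for i, layer in enumerate(model.model.layers)
]
```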
The In-Context Vectors authors also discuss ICV extraction from unpaired demonstrations using a contrastive loss. That is out of scope for this article, but you can refer to the paper for details.
Future research directions include:
- Investigating the composition of multiple ICVs
- Developing techniques for automatic λ parameter tuning
Conclusion
In-Context Vectors represent a promising approach to controlling language model behavior through direct manipulation of hidden states. While the technique is still in its early stages, it offers a compelling balance of simplicity, efficiency, and effectiveness. As language models continue to evolve, techniques like ICVs may become increasingly important for ensuring these tools behave in alignment with our intentions.
The code implementation and experiments described in this post are available in our GitHub repository.