In-Context Vectors - Controlling Language Models Through Latent Space Steering
Introduction
Language models have become increasingly powerful, but controlling their behavior remains a significant challenge. Whether it’s maintaining a professional tone in workplace communications or enforcing safety constraints, the ability to guide these models’ outputs is crucial. Traditional approaches like prompt engineering and fine-tuning each have their limitations: prompt engineering can be brittle and token-intensive, while fine-tuning requires significant computational resources and can degrade performance on other tasks.
In this blog post, we’ll explore In-Context Vectors (ICV), an innovative yet straightforward approach to steering language model outputs by manipulating their hidden states. We’ll dive deep into how ICVs work, implement them, and see them in action through practical examples.
Background: From Single Neurons to Vector Spaces
In 2017, OpenAI made a fascinating discovery with their Unsupervised Sentiment Neuron. While training a language model on Amazon reviews, they observed that:
“there actually existed a ‘single sentiment neuron’ that’s highly predictive of the sentiment value.”
By changing the neuron’s magnitude, researchers could control the sentiment of the generated text.
“Just like with similar models, our model can be used to generate text. Unlike those models, we have a direct dial to control the sentiment of the resulting text: we simply overwrite the value of the sentiment neuron.”
Here are the outputs generated by tweaking the sentiment neuron. Source
| Sentiment fixed to positive | Sentiment fixed to negative |
|---|---|
| Just what I was looking for. Nice fitted pants, exactly matched seam to color contrast with other pants I own. Highly recommended and also very happy! | The package received was blank and has no barcode. A waste of time and money. |
| Best hammock ever! Stays in place and holds it’s shape. Comfy (I love the deep neon pictures on it), and looks so cute. | They didn’t fit either. Straight high sticks at the end. On par with other buds I have. Lesson learned to avoid. |
This finding was remarkable, especially considering that it came from a model with only a few million parameters. Even at that scale, individual neurons were shown to hold latent concepts. Now, with billions of parameters, it is easy to imagine that there must be a neuron or set of neurons for each concept - say facts, brands, or toxicity. This observation leads us to In-Context Vectors - a method that operates on the assumption that we can identify and manipulate these distributed representations to achieve desired behaviors.
Understanding In-Context Vectors
Overview
The fundamental idea behind ICVs is elegantly simple: instead of trying to find individual neurons that control specific behaviors, we can identify directions in the model’s high-dimensional hidden state space that correspond to desired changes in output. These directions are learned from examples of desired and undesired outputs. Here’s how it works:
- Data Collection: Collect example demonstrations for desired and undesired behaviors.
```python
demonstrations = [
    ("Zero stars, I hate it.", "Five stars, I love it."),
    ("it was terrible!", "it was awesome!"),
    ("i did nt like it.", "i love it."),
    ("i would call this the worse denny's ever", "i would call this the best denny's ever"),
]
```
- Hidden States Extraction: Extract hidden states from both versions of each example by doing a forward pass.
- Direction Learning (ICV extraction): Compute the principal direction of change between these pairs.
- Applying ICVs: On a new query, instead of appending demonstrations to the prompt, we shift the hidden states of the LLM using the ICV. Specifically, we add the ICV to every query token’s hidden states at each layer to achieve the desired behavior.
Applied Explanation:
Setup: Say we have an LLM with 32 layers and a hidden dimension of 256. For each token that passes through the model, we get a latent vector at each layer, i.e., a 2D tensor of size [32, 256] ([# layers, hidden-dimension-size]).
1. Hidden State Extraction
Let’s take the first demonstration ("Zero stars, I hate it.", "Five stars, I love it."):

- We pass each behavior’s text sequence, desired and undesired, through the model and get hidden states for the sequence: [5, 32, 256] ([# tokens, # layers, # hidden-dimension-size]).

- Note that, in the Transformer architecture, the attention mechanism is designed such that each token has information about all other tokens in the sequence. In the case of causal language models (next-word prediction), each token has information about its preceding tokens, so the last token has information about all the words in the sentence. Therefore, we take the last token’s hidden states for each sample: [32, 256] (the last token’s hidden states across all layers, [# layers, # hidden-dimension-size]).

```python
hidden_states = hidden_states[-1, :, :]  # [32, 256]
```
- Let’s get the `hidden_states` for the desired and undesired sequences: `hidden_states_desired` and `hidden_states_undesired`, each of shape [32, 256].
- For the 4 demonstration pairs collected, we end up with `hidden_states_desired` of shape [4, 32, 256] and `hidden_states_undesired` of shape [4, 32, 256] ([# samples, # layers, # hidden-dimension-size]). A short code sketch of this step follows below.
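Here is a minimal sketch of this extraction step, assuming a Hugging Face `transformers` causal LM with `output_hidden_states=True`. The helper name `extract_last_token_states` and the specific model are illustrative choices, not the authors' exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

def extract_last_token_states(text):
    """Return the last token's hidden states across all layers: [num_layers, hidden_dim]."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple of [1, seq_len, hidden_dim] tensors, one per layer
    # (plus the embedding layer at index 0, which we skip)
    layers = torch.stack(outputs.hidden_states[1:], dim=0)  # [num_layers, 1, seq_len, hidden_dim]
    return layers[:, 0, -1, :]                              # [num_layers, hidden_dim]

# Stack the last-token states for all demonstration pairs: [num_samples, num_layers, hidden_dim]
hidden_states_undesired = torch.stack([extract_last_token_states(neg) for neg, pos in demonstrations])
hidden_states_desired = torch.stack([extract_last_token_states(pos) for neg, pos in demonstrations])
```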
2. Computing the Direction Vector
The magic happens when we look at how the hidden state representations (`hidden_states_desired` and `hidden_states_undesired`) differ:

```python
differences = (hidden_states_desired - hidden_states_undesired)  # [4, 32, 256]
```
Mathematically, we want a direction that captures this change: one that amplifies the desired behavior while suppressing the undesired one. We can formulate a loss function to find it:
- A simple squared-subtraction loss, which measures the raw difference between the vectors: $loss_{raw} = (h(y) - h(x))^2$, or
- A projection formulation using $h^T$, which measures the difference in projection: $loss_{projection} = (h^T h(y) - h^T h(x))^2$, where $h(x)$ and $h(y)$ are vectors in high-dimensional space representing `hidden_states_undesired` and `hidden_states_desired`.
The $h^T$ formulation ($loss_{projection}$) is particularly useful because:
- It helps find a direction (h) that maximizes the difference between positive and negative examples
- The constraint $h^T h = 1$ ensures we get a unit vector (normalized direction)
- It connects to Principal Component Analysis (PCA) - the solution ends up being the first principal component of the differences
Think of it like this:
- Simple subtraction ($loss_{raw}$): “How different are these vectors in each dimension?”
- $h^T$ formulation ($loss_{projection}$): “What’s the best direction to project these vectors onto to maximize their difference?”
Through PCA, we can approximate the projection vector $h^T$ by extracting the first principal direction.
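To make the PCA connection concrete, the $h^T$ objective over all $N$ demonstration pairs can be written as a constrained maximization (a standard restatement of the loss above, not an extra result from the paper):

\[
h^{*} = \arg\max_{\lVert h \rVert = 1} \sum_{i=1}^{N} \left( h^{T}\,\big(h(y_i) - h(x_i)\big) \right)^{2}
\]

Up to mean-centering, the maximizer $h^{*}$ is the first principal component of the stacked differences, which is exactly what the implementation below extracts.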
2.1 Direction Vector Implementation Details
```python
import torch
from sklearn.decomposition import PCA

def compute_icv(differences):
    # differences: [num_examples, num_layers, hidden_dim]
    num_examples, num_layers, hidden_dim = differences.shape
    # Reshape differences for PCA: one flat row per demonstration pair
    flattened_differences = differences.view(num_examples, -1).numpy()
    # Apply PCA to find the principal direction of change
    pca = PCA(n_components=1)
    pca.fit(flattened_differences)
    # Get the final direction vector, reshaped back to one vector per layer
    style_direction = pca.components_[0].reshape(num_layers, hidden_dim)
    return torch.from_numpy(style_direction).float()
```
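Tying steps 1 and 2 together (and assuming the `compute_icv` wrapper above and the stacked hidden states from step 1), the ICV for our four sentiment pairs would be computed like this:

```python
# hidden_states_desired / hidden_states_undesired: [4, 32, 256] from step 1
differences = hidden_states_desired - hidden_states_undesired  # [4, 32, 256]
icv = compute_icv(differences)                                 # [32, 256]: one direction per layer
```

One practical note: PCA determines a direction only up to sign, so it is worth checking that `icv` points from undesired toward desired (e.g., that its mean dot product with the raw differences is positive) and flipping it otherwise.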
3. Applying ICVs to Control Generation
Once we have our ICV, we can use it to modify the model’s behavior during generation by registering forward hooks that add the ICV to each layer’s hidden states:
```python
...
def make_hook(layer_icv, lambda_param):
    def hook(module, input, output):
        hidden_states = output[0]
        # Add the scaled per-layer ICV to every token position
        modified = hidden_states + lambda_param * layer_icv.unsqueeze(0).unsqueeze(0)
        # Normalize to preserve the original hidden-state magnitude
        norms_original = torch.norm(hidden_states, dim=2, keepdim=True)
        norms_modified = torch.norm(modified, dim=2, keepdim=True)
        modified = modified * (norms_original / norms_modified)
        return (modified,) + output[1:]
    return hook

# Register a hook for each layer, passing that layer's slice of the ICV
hooks = []
for layer_idx, layer in enumerate(self.model.model.layers):
    hooks.append(layer.register_forward_hook(make_hook(icv[layer_idx], lambda_param)))
...
```
Normalization prevents the ICV from changing the scale of the hidden states, which could destabilize the model’s predictions. The λ parameter (`lambda_param`) allows us to control the strength of this effect - higher values produce more of the desired behavior, while lower values retain more of the original tone.
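Putting it together, here is a hedged sketch of what generation with an ICV might look like, assuming the `make_hook` helper above and the model/tokenizer from the earlier extraction sketch; the λ value is purely illustrative, and the hooks are removed afterwards so later generations are unaffected:

```python
lambda_param = 0.1  # strength of the steering effect (illustrative value)

# Register the steering hooks on every decoder layer
hooks = [
    layer.register_forward_hook(make_hook(icv[i], lambda_param))
    for i, layer in enumerate(model.model.layers)
]

prompt = "Please paraphrase the following sentence. Sentence: This code is shit!, paraphrase:"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Remove the hooks to restore the unsteered model
for h in hooks:
    h.remove()
```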
Results and Applications
Let’s look at a practical example: controlling the tone of workplace communications. Using an ICV trained on 100 pairs of professional/unprofessional communications, we can transform informal messages into more appropriate workplace language. Here are the outputs from the Llama-3.1-8B-Instruct model for the prompt:
"""Please paraphrase the following sentence. Sentence: This code is shit!, paraphrase: """
| Without ICV | With ICV |
|---|---|
| 1. This is a terrible piece of code. 2. I am extremely dissatisfied with this code. The given sentence is an informal expression of dissatisfaction with the code’s quality. The paraphrased sentences aim to convey the same sentiment in a more formal and polite manner. Here’s a breakdown of the paraphrasing process: * The original sentence uses strong language (“shit”) to express frustration, which is not suitable for most professional or formal contexts. In the rewritten sentences, the tone is | 1. The code could be improved for better performance. 2. Some aspects of the code may be refined to enhance its overall quality. The following steps can be taken to achieve the desired improvements: - Consider using more efficient data structures or algorithms. - Review and refine the existing code to ensure it aligns with best practices or industry standards. By implementing these adjustments, the overall performance and quality of this code can potentially be enhanced. For instance: - Using more streamlined or optimized approaches to address |
Benefits and Future Directions
- Stability: There is no need for desperate pleading with or threatening of the LLM in system prompts. For instance, here’s how Apple engineers prompt the LLMs in Apple Intelligence. Source
- Flexibility: The λ parameter provides fine-grained control over the effect
- Efficiency: Computationally more efficient than fine-tuning/LoRA - we only need to store one set of vectors for each style/behavior.
- Composability: Multiple ICVs can potentially be combined for complex behavior control
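Composability is speculative in the paper as well; purely as an illustration, combining two ICVs could be as simple as a weighted sum before steering (the names `icv_professional` and `icv_nontoxic` are hypothetical):

```python
# Purely illustrative: blend a "professional tone" ICV and a "non-toxic" ICV,
# then steer with the combined vector using the same hooks as before.
combined_icv = 0.7 * icv_professional + 0.3 * icv_nontoxic  # both [num_layers, hidden_dim]

hooks = [
    layer.register_forward_hook(make_hook(combined_icv[i], lambda_param))
    for i, layer in enumerate(model.model.layers)
]
```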
The In-Context Vectors authors also discuss ICV extraction from unpaired demonstrations using a contrastive loss. That is out of scope for this article, but you can refer to the paper for details.
Future research directions include:
- Investigating the composition of multiple ICVs
- Developing techniques for automatic λ parameter tuning
Conclusion
In-Context Vectors represent a promising approach to controlling language model behavior through direct manipulation of hidden states. While the technique is still in its early stages, it offers a compelling balance of simplicity, efficiency, and effectiveness. As language models continue to evolve, techniques like ICVs may become increasingly important for ensuring these tools behave in alignment with our intentions.
The code implementation and experiments described in this post are available in our GitHub repository.