AI & Machine Learning

Uncovering Critical Interactions in Large Language Models at Scale

2026-05-03 14:52:30

Modern artificial intelligence systems, particularly Large Language Models (LLMs), are remarkably powerful yet notoriously opaque. Understanding how these models make decisions is essential for building safer and more trustworthy AI. This field, known as interpretability, seeks to shed light on the inner workings of these complex systems. Researchers typically approach interpretability through three distinct lenses: feature attribution, data attribution, and mechanistic interpretability. Each offers unique insights, but all face a common challenge: the exponential complexity of interactions at scale.

The Three Lenses of Interpretability

Feature Attribution

Feature attribution methods isolate the specific input features that drive a model's prediction (Lundberg & Lee, 2017; Ribeiro et al., 2016). For instance, in a sentiment analysis task, feature attribution might highlight the words that most strongly influence the model to classify a review as positive or negative. This approach is intuitive and widely used, but it often treats features as independent, missing the interactions between them.
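
As a concrete illustration, a minimal occlusion-style attribution loop looks like the sketch below. `predict_positive` is a hypothetical stand-in for any sentiment classifier that maps a token list to a probability; swap in your own model.

```python
from typing import Callable, List

def occlusion_attribution(
    tokens: List[str],
    predict_positive: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> List[float]:
    """Score each token by how much masking it changes the prediction."""
    baseline = predict_positive(tokens)
    scores = []
    for i in range(len(tokens)):
        # Ablate one token at a time and measure the drop in P(positive).
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        scores.append(baseline - predict_positive(masked))
    return scores
```

Note that this scores each token in isolation, which is exactly the limitation described above: any effect that only appears when two or more tokens are masked together is invisible to it.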

Data Attribution

Data attribution connects model behaviors to influential training examples (Koh & Liang, 2017; Ilyas et al., 2022). By understanding which training data points most affect a given prediction, developers can identify biases, debug errors, and improve dataset quality. However, the influence of training data is rarely isolated; interactions among many examples shape the model's knowledge.
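
The most literal form of this is leave-one-out retraining, spelled out in the sketch below. Here `train` and `evaluate` are hypothetical helpers for your own setup, and the one-retraining-per-example cost is exactly what methods such as influence functions approximate away.

```python
def leave_one_out_influence(dataset, test_point, train, evaluate):
    """Measure each training example's influence on one test prediction."""
    base_model = train(dataset)
    base_loss = evaluate(base_model, test_point)
    influences = {}
    for i in range(len(dataset)):
        ablated = dataset[:i] + dataset[i + 1:]
        model = train(ablated)  # one full retraining per training example
        influences[i] = evaluate(model, test_point) - base_loss
    # Positive value: the loss rose without example i, so it was helping.
    return influences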

Mechanistic Interpretability

Mechanistic interpretability dissects the functions of internal model components, such as neurons and attention heads (Conmy et al., 2023; Sharkey et al., 2025). This lens aims to reverse-engineer the algorithms learned by the model. But again, these components do not operate in a vacuum; their contributions are deeply intertwined through complex dependencies.
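
As an illustration, internal components can be ablated with forward hooks. The sketch below zeroes one attention head in a PyTorch model, assuming (hypothetically) that the hooked module emits per-head outputs of shape (batch, seq_len, n_heads, head_dim); real architectures differ, so adapt the indexing.

```python
import torch

def make_head_ablation_hook(head_index: int):
    """Build a forward hook that silences one attention head."""
    def hook(module, inputs, output):
        ablated = output.clone()
        ablated[:, :, head_index, :] = 0.0  # zero out the chosen head
        return ablated
    return hook

# Usage (hypothetical module path):
# handle = model.blocks[3].attn.register_forward_hook(make_head_ablation_hook(5))
# ...run the model and compare outputs with and without the head...
# handle.remove()
```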

The Challenge of Interactions at Scale

A fundamental hurdle across all interpretability perspectives is that model behavior emerges from intricate interactions rather than isolated components. To achieve state-of-the-art performance, LLMs synthesize complex feature relationships, discover shared patterns across diverse training examples, and process information through highly interconnected internal circuits. As the number of features, training points, or model components grows, the number of potential interactions grows exponentially. Exhaustively analyzing all pairwise or higher-order interactions becomes computationally infeasible. Therefore, interpretability methods must be able to identify the most influential interactions efficiently.
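
To put numbers on that growth, the count of candidate interaction sets can be computed directly: with n components there are 2**n subsets in total, and even restricting attention to interactions of order at most k leaves a sum of binomial coefficients.

```python
from math import comb

def num_interactions(n: int, max_order: int) -> int:
    """Count nonempty interaction sets of size at most max_order."""
    return sum(comb(n, j) for j in range(1, max_order + 1))

print(num_interactions(100, 3))  # 166,750 candidates up to order 3
print(2 ** 100)                  # ~1.27e30 subsets in total
```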

Attribution through Ablation

A powerful technique for measuring influence is ablation: observing how a system's output changes when a component is removed or altered. This concept applies across all three interpretability lenses:

- Feature attribution ablates input features, for example by masking tokens and measuring the change in the prediction.
- Data attribution ablates training examples, retraining (or approximating retraining) without them to measure their influence.
- Mechanistic interpretability ablates internal components, for example zeroing out a neuron or attention head and observing the effect.

In each case, the goal is to isolate the drivers of a decision by systematically perturbing the system. However, each ablation carries a significant computational cost, whether an expensive inference call or a full retraining run. The challenge is to compute attributions from as few ablations as possible while still capturing interactions.
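
Schematically, every such experiment can be framed as evaluating a value function over keep/drop masks. The sketch below samples random masks, with `value` standing in (hypothetically) for whichever ablation mechanism applies; each call is one expensive experiment.

```python
import random

def sample_ablations(n_components, value, n_samples, keep_prob=0.5, seed=0):
    """Collect (mask, output) pairs, where mask[i] says component i is kept."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        mask = [rng.random() < keep_prob for _ in range(n_components)]
        data.append((mask, value(mask)))  # one inference call or retraining
    return data
```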

The SPEX and ProxySPEX Framework

To discover influential interactions at scale with a tractable number of ablations, we developed SPEX and its faster variant, ProxySPEX. These algorithms are designed to efficiently identify critical interactions among features, training data, or model components by intelligently selecting which combinations to ablate.

How SPEX Works

SPEX leverages the idea that interactions are not uniformly distributed; a small subset of interactions dominates the model's behavior. By using a sparsity-inducing approach, SPEX searches for these key interactions without exhaustively testing all possibilities. It formulates the problem as a constrained optimization that balances the accuracy of attribution with the cost of ablation experiments.
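
The actual machinery is more sophisticated, but the core sparsity intuition can be illustrated with a simple surrogate: expand each sampled ablation mask into singleton and pairwise terms, then let an L1 penalty keep only the few coefficients that matter. The helper below is an illustrative sketch of that idea, not the SPEX algorithm itself.

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

def recover_sparse_interactions(masks, values, alpha=0.01):
    """Fit a sparse surrogate over singleton and pairwise mask terms."""
    masks = np.asarray(masks, dtype=float)
    n = masks.shape[1]
    pairs = list(itertools.combinations(range(n), 2))
    # Design matrix: [singleton columns | products for each pair].
    X = np.hstack([masks] + [masks[:, [i]] * masks[:, [j]] for i, j in pairs])
    model = Lasso(alpha=alpha).fit(X, np.asarray(values))
    names = [f"{i}" for i in range(n)] + [f"{i}x{j}" for i, j in pairs]
    # The L1 penalty zeroes most coefficients; survivors are the key terms.
    return {name: w for name, w in zip(names, model.coef_) if abs(w) > 1e-6}
```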

ProxySPEX: Faster Approximations

ProxySPEX builds on the same principles but introduces a proxy model that approximates the effect of ablations. The proxy is far cheaper to evaluate than the full LLM, allowing the algorithm to explore many candidate interactions quickly. Once the most promising interactions are identified, a small number of expensive ablations is run on the actual model for validation. This two-stage approach drastically reduces computational overhead while maintaining high accuracy.
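
A hedged sketch of such a two-stage pipeline follows; the proxy choice and ranking heuristic here are illustrative assumptions, not the published method. A cheap regressor is fit on a small set of real (mask, output) pairs, screens many candidate masks for free, and only the top candidates earn a real model call.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def proxy_screen(train_masks, train_values, candidate_masks, real_value, top_k=10):
    """Screen candidate ablations with a cheap proxy, validate the best few."""
    # Stage 1: fit a cheap proxy on real ablation results.
    proxy = GradientBoostingRegressor().fit(train_masks, train_values)
    # Stage 2: score all candidates using the proxy alone (no LLM calls).
    preds = proxy.predict(candidate_masks)
    reference = np.mean(train_values)  # crude stand-in for the unablated output
    order = np.argsort(-np.abs(preds - reference))  # biggest predicted effect first
    # Stage 3: spend expensive real-model ablations only on the top candidates.
    return {int(i): real_value(candidate_masks[i]) for i in order[:top_k]}
```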

Practical Implications

With SPEX and ProxySPEX, researchers can now analyze interactions in LLMs that were previously too large to study. For example, they can identify which combinations of input tokens work together to produce a specific output, or which training examples jointly influence a model's bias. This capability is crucial for debugging, improving model robustness, and building more transparent AI systems.

Conclusion

Interpretability is not just about understanding individual components—it is about understanding the web of interactions that give rise to emergent behavior. Methods like SPEX and ProxySPEX represent a step toward scalable interaction analysis, enabling deeper insights into the workings of large language models. As AI continues to scale, such approaches will become indispensable for ensuring these systems remain safe, fair, and trustworthy.
