TLDR: We measure neural collapse toward interpretable features in hidden states.
By measuring the collapse towards natural language-aligned features, we identify well-separated directions that enable us to fit interpretable control vectors.
We then optimize these control vectors using sparse autoencoders (SAEs).
Finally, we use linearity measures to compare control vectors optimized with a Koopman autoencoder and with SAEs of various layer types and activation functions, including convolutional and MLPMixer layers.
Introduction
Deep learning models that achieve higher accuracy typically rely on the increased complexity of their architectures and representations.
This, in turn, renders them difficult to interpret in terms of semantically meaningful concepts.
Interpretability methods aim to uncover how models process information, often by analyzing whether learned features align with semantically meaningful concepts.
Concept-based interpretability. Kim et al. propose explaining predictions with human-interpretable concepts, rather than relying on sample-based raw features.
They first choose a set of examples that represent distinct concepts and measure the influence of high-level concepts on the model's decisions, thereby providing global explanations.
In a recent text-to-image diffusion setting, Conceptor decomposes a concept into a weighted combination of interpretable elements.
The manifold hypothesis suggests that high-dimensional data often lies on a lower-dimensional manifold.
Deep learning models learn representations that approximately capture the manifold's geometry by mapping data into a space where related inputs are closer together.
Training objectives further shape this representation space: loss functions encourage the clustering of data samples in latent space. Together with regularizers that prevent overfitting, these clusters become more distinct over the course of training, a phenomenon known as neural collapse.
A recent line of work introduces the term neural collapse to describe a desirable learning behavior of deep neural networks for classification.
It refers to the phenomenon that learned top-layer representations form semantic clusters, which collapse to their means at the end of training.
In addition, the cluster means transform progressively into equidistant vectors when centered around the global mean.
Therefore, neural collapse facilitates classification tasks and is considered a desirable learning behavior for both supervised and self-supervised learning.
Neural collapse is not to be confused with representation collapse, where learned representations across all classes collapse to redundant or trivial solutions (e.g., zero vectors).
Therefore, we focus on the structure of the learned representations.
Structure of the latent space. Mikolov et al. show that consistent regularities naturally emerge from the training process of word embeddings.
This phenomenon, commonly referred to as the word2vec hypothesis, suggests that learned embeddings capture both semantic and syntactic relationships between words through consistent vector offsets in latent space.
While the observed linear offsets naturally fit a flat latent space, non-Euclidean geometric models (e.g., Riemannian manifolds) can better capture structural distortions.
In those cases, "vector arithmetic" can be seen as an approximation to geodesic operations on a curved latent space.
We analyze this structure to predict model behavior, with a focus on transformer models for motion forecasting.
Following Ben-Shaul et al., we use linear probes to measure neural collapse toward interpretable features in hidden states. High probing accuracy implies separability of features, which suggests functionally important directions in hidden states. Building on the insight that interpretable features are embedded in hidden states, we fit control vectors to the directions between hidden states with opposing features.
Also referred to as steering vectors, style vectors, or activation addition.
Control vectors are used for a form of activation steering, where concept-based vectors are added to the activations (i.e., hidden states) of transformer models.
In natural language processing, control vectors allow targeted adjustments to model outputs by modifying hidden states without the need for fine-tuning or prompt engineering.
Control vectors are a set of vectors that capture the difference between hidden states with opposing concepts or features.
This approach requires a well-structured latent space, where samples are clustered according to classes or features (e.g., a high degree of neural collapse, see the neural collapse section).
To further enhance this approach, we use sparse autoencoders (SAEs) to extract more distinct features from hidden states.
A key goal of interpretability research is to decompose models and gain a mechanistic interpretation of how their components function.
Sparse autoencoders (SAEs) leverage the linear representation hypothesis and approximate the model's activations with a linear combination of feature directions.
By enforcing sparsity in latent space, they separate features into distinct, interpretable representations.
Related autoencoders linearize learned representations either by manifold flattening or using Koopman operators.
We evaluate sparse autoencoders with fully-connected, convolutional, and MLPMixer layers, as well as different activation functions. Our experiments with sparse autoencoders of varying sparse intermediate dimensions show that enforcing sparsity leads to more linear changes in prediction when scaling control vectors.
We apply our method to recent multimodal motion transformers. They process features of past motion sequences (i.e., past positions, orientation, acceleration, and speed) and environment context (i.e., map data and traffic light states), and transform them into future motion sequences. Like other transformer models, they rely on learned representations of these features, resulting in hidden states that are difficult to interpret and control.
Hidden state activations. Transformers consist of attention blocks, followed by simple feed-forward networks, whose hidden state activations are analyzed for interpretability.
Elhage et al. explore two key hypotheses that describe how these activations capture meaningful structures: the linear representation hypothesis and the superposition hypothesis.
These hypotheses essentially state that neural networks represent features as directions in their activation space, and that representations can be decomposed into independent features.
We focus on analyzing interpretable motion features that are physically measurable, such as speed, acceleration, direction, and agent type. By leveraging these features, our approach enables interpretable control over generated forecasts and facilitates zero-shot generalization.
Specifically, in this work:
We argue that, to fit control vectors, latent space regularities with separable features are necessary. We use linear probing and show that neural collapse toward interpretable features occurs in hidden states of recent motion transformers, indicating a structured latent space.
We fit control vectors using hidden states with opposing features. By modifying hidden states at inference, we show that control vectors describe functionally important directions. Similar to the vector arithmetic in word2vec, we obtain predictions consistent with the current driving environment.
We use sparse autoencoders to optimize our control vectors. Notably, enforcing sparsity leads to more linear changes in predictions when scaling control vectors. We use linearity measures to compare these control vectors against ones optimized with a Koopman autoencoder and with SAEs of various layer types and activation functions, including convolutional and MLPMixer layers.
Our method differs from prior works in several aspects.
We measure neural collapse in multimodal models for motion forecasting (i.e., regression) instead of unimodal image classifiers or language models.
Unlike Conmy et al., we do not manually suppress SAE features in control vectors.
Furthermore, we do not use our SAEs during inference, but only to optimize control vectors beforehand, resulting in negligible computational overhead.
Method
Motion feature classification using natural language
In contrast to natural language, where words naturally carry semantic meaning, motion lacks predefined labels.
Therefore, we identify human-interpretable motion features by quantizing them into discrete subclasses, analogous to words in natural language.
Initially, we classify motion direction using the cumulative sum of differences in yaw angles, assigning it to either left, straight, or right.
Additionally, we introduce a stationary class for stationary objects, where direction lacks semantic significance.
We define further classes for speed, dividing the speed values into four intervals: high, moderate, low, and backwards.
Lastly, we analyze acceleration by comparing the integral of speed over time to the displacement projected with the initial speed.
Accordingly, we classify acceleration profiles as either accelerating, decelerating, or constant.
Our thresholds for motion features are based on insights from Ettinger et al. and Seff et al.
The threshold values are detailed in the appendix.
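To make this quantization concrete, the sketch below classifies a single past trajectory into our direction, speed, and acceleration subclasses. The helper name and the exact constants (15° for turning, 25 and 50 km/h for speed, and a ±10% displacement ratio for acceleration, following the appendix) are illustrative assumptions rather than our exact implementation.

```python
import numpy as np

def classify_motion(yaw, speed, dt=0.1, stationary_eps=0.1):
    """Quantize a past motion sequence into discrete subclasses.

    yaw:   array of yaw angles in degrees, one per time step
    speed: array of signed speeds in m/s, one per time step
    Thresholds follow the appendix; names and the stationary margin are illustrative.
    """
    # Direction: cumulative sum of differences in yaw angles.
    cum_yaw = np.sum(np.diff(yaw))
    if np.abs(speed).max() < stationary_eps:
        direction = "stationary"
    elif abs(cum_yaw) < 15.0:
        direction = "straight"
    else:
        direction = "right" if cum_yaw > 0 else "left"

    # Speed: quantized into four intervals (thresholds in km/h).
    mean_kmh = speed.mean() * 3.6
    if mean_kmh < 0:
        speed_class = "backwards"
    elif mean_kmh < 25:
        speed_class = "low"
    elif mean_kmh <= 50:
        speed_class = "moderate"
    else:
        speed_class = "high"

    # Acceleration: integral of speed vs. displacement at constant initial speed.
    traveled = np.trapz(speed, dx=dt)
    projected = speed[0] * dt * (len(speed) - 1)
    ratio = traveled / projected if projected != 0 else 1.0
    if ratio < 0.9:
        accel_class = "decelerating"
    elif ratio > 1.1:
        accel_class = "accelerating"
    else:
        accel_class = "constant"

    return direction, speed_class, accel_class
```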
Figure 1:
We measure the degree to which these interpretable features are embedded in the hidden states \mathbf{H}_{i,:} of transformer models with linear probes.
Furthermore, we use our discrete features and sparse autoencoding to fit interpretable control vectors \mathbf{V}_{i,:} that allow for modifying motion forecasts at inference.
The training of the sparse autoencoder is shown with red arrows ({\color{#f1615c}\rightarrow}) and the fitting of control vectors with blue arrows ({\color{#5eb8e7}\rightarrow}).
Overview diagram for the motion transformer architecture used in our experiments, highlighting the flow from inputs through transformer modules to forecasted trajectories.
Neural collapse as a metric of interpretability
We use neural collapse as a metric of interpretability.
Specifically, we focus on interpreting hidden states (i.e., activations or latent representations) and evaluate whether hidden states embed interpretable features.
We measure how closely abstract hidden states relate to interpretable semantics using linear probing accuracy. Ben-Shaul et al. show that linear probing accuracy is consistent with the accuracy of nearest class center classifiers, which are typically used to measure neural collapse.
We train linear probes (i.e., linear classifiers detached from the overall gradient computation) on top of hidden states (\mathbf{H}_{i,:} in Figure 1).
During training, we track their accuracy in classifying our interpretable features on validation sets.
Adapted to motion forecasting, we choose the aforementioned motion features as interpretable semantics.
Besides linear probing accuracy, following Chen & He, we use the mean of the standard deviation of the \ell_2-normalized embedding to measure representation collapse.
Representation collapse refers to an undesirable learning behavior where learned embeddings collapse into redundant or trivial representations.
Redundant representations have a standard deviation close to zero, in a way representing the opposite of neural collapse.
As shown in Chen & He, rich representations have a standard deviation close to 1/\sqrt{\text{dim}}, where dim is the hidden dimension.
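As a minimal sketch, this representation quality metric can be computed as follows; the tensor shape and function name are assumptions for illustration.

```python
import torch

def normalized_embedding_std(hidden_states: torch.Tensor) -> float:
    """Mean standard deviation of l2-normalized embeddings.

    hidden_states: (num_samples, dim) tensor of hidden states.
    Values near 0 indicate representation collapse; rich
    representations are expected near 1 / sqrt(dim).
    """
    z = torch.nn.functional.normalize(hidden_states, dim=-1)  # l2-normalize each embedding
    return z.std(dim=0).mean().item()  # per-dimension std, averaged over dimensions
```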
Interpretable control vectors
We use our interpretable features to form pairs of opposing features.
For each pair, we build a dataset and extract the corresponding hidden states.
Next, we compute the element-wise difference between the hidden states of samples with these opposing features.
Finally, following Zou et al., we apply principal component analysis (PCA) with a single component as a pooling method.
This reduces the computed differences to a single scalar per hidden dimension to generate control vectors (\mathbf{V}_{i,:} in Figure 1).
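A minimal sketch of this fitting procedure, assuming hidden states of opposing classes are collected as NumPy arrays; the function name, array shapes, and the sign convention are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_control_vector(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """Fit a control vector from hidden states with opposing features.

    h_pos, h_neg: (num_pairs, dim) hidden states for the opposing
    feature classes (e.g., high vs. low speed).
    Returns a (dim,) control vector.
    """
    diffs = h_pos - h_neg                      # element-wise differences per pair
    pca = PCA(n_components=1).fit(diffs)       # single-component PCA as pooling
    direction = pca.components_[0]             # (dim,) principal direction
    # Orient the vector from the negative toward the positive feature
    # (the sign convention is an assumption for illustration).
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    return direction
```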
We optimize our control vectors using SAEs.
SAEs extract distinct features in hidden states by encoding and reconstructing them from sparse intermediate representations (\mathbf{S}_{i,:} in Figure 1).
We hypothesize that sparse intermediate representations enable a more linear decomposition of our interpretable features, and hence, more distinct control vectors.
Therefore, we generate intermediate control vectors \mathbf{V}'_{i,:} by pooling the differences between hidden states with opposing features (\mathbf{H}_{i,:}^\text{pos} vs. \mathbf{H}_{i,:}^\text{neg}).
Specifically, we compute \mathbf{S}_{i,:}^\text{pos} = \sigma(\mathbf{W} \mathbf{H}_{i,:}^\text{pos} + \mathbf{b}), where \mathbf{W} and \mathbf{b} denote the weights and biases of the SAE encoder and \sigma is its activation function.
Similarly, we compute \mathbf{S}_{i,:}^\text{neg} and obtain the intermediate control vectors as \mathbf{V}'_{i,:} = \text{PCA}\left(\mathbf{S}_{i,:}^\text{pos} - \mathbf{S}_{i,:}^\text{neg}\right).
Leveraging the Johnson-Lindenstrauss lemma, we use the SAE decoder to project the intermediate control vectors back to the hidden dimension of the motion encoder. Johnson & Lindenstrauss state that a set of points in high-dimensional space can be projected into a lower-dimensional space while approximately preserving the pairwise distances between points.
This enables using sparse autoencoders of arbitrary sparse intermediate dimensions for generating control vectors of fixed dimension.
At inference, we scale the control vectors with a temperature parameter (\tau in Figure 1) to control the strength of the corresponding features of a given sample.
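The sketch below summarizes the SAE-based optimization and the temperature-scaled application at inference. The encode/decode interface and variable names are assumptions for illustration, not our exact implementation.

```python
from sklearn.decomposition import PCA

def fit_sae_control_vector(sae, h_pos, h_neg):
    """Optimize a control vector in the sparse latent space of an SAE.

    sae: trained sparse autoencoder exposing encode() and decode() on
         NumPy arrays (an assumed interface for illustration).
    h_pos, h_neg: (num_pairs, dim) hidden states with opposing features.
    """
    s_pos = sae.encode(h_pos)                          # sparse codes of positive samples
    s_neg = sae.encode(h_neg)                          # sparse codes of negative samples
    v_sparse = PCA(n_components=1).fit(s_pos - s_neg).components_[0]
    # Project back to the hidden dimension of the motion encoder.
    return sae.decode(v_sparse[None, :])[0]

def steer(hidden_states, control_vector, tau):
    """Add the temperature-scaled control vector to all temporal embeddings."""
    return hidden_states + tau * control_vector
```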
Experimental setup
Motion forecasting models
We study three recent motion transformers for self-driving: Wayformer, HPTR, and RedMotion.
All models process a maximum of 48 surrounding traffic agents as environment context.
For the Argoverse 2 Forecasting (abbr. AV2F) dataset, we use past motion sequences with 50 time steps (representing 5 seconds) as input.
For the Waymo Open Motion (abbr. Waymo) dataset, we use past motion sequences with 11 steps (representing 1.1 seconds) as input.
Further details on model architectures and fusion mechanisms are presented in the appendix.
Linear probes
We add linear probes for our quantized and interpretable motion features (see the motion feature classification section) to the hidden states of all models (\mathbf{H}_{i,:}^{(m)} in Figure 1, where m \in \{0, 1, 2\} is the module number and i is the temporal index).
These classifiers are learned during training using regular cross-entropy loss to classify speed, acceleration, direction, and the agent classes from hidden states.
We decouple this objective from the overall gradient computation.
Therefore, these classifiers do not contribute to the alignment of hidden states, but exclusively measure neural collapse toward interpretable features.
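A minimal probe setup is sketched below, assuming a PyTorch training loop; the class counts are illustrative, and detaching the hidden states keeps the probes from influencing the motion transformer.

```python
import torch
import torch.nn as nn

# One linear probe per interpretable feature (illustrative class counts).
probes = nn.ModuleDict({
    "speed": nn.Linear(128, 4),         # high, moderate, low, backwards
    "acceleration": nn.Linear(128, 3),  # accelerating, decelerating, constant
    "direction": nn.Linear(128, 4),     # left, straight, right, stationary
    "agent": nn.Linear(128, 3),         # vehicle, pedestrian, cyclist
})
criterion = nn.CrossEntropyLoss()

def probe_loss(hidden_state: torch.Tensor, labels: dict) -> torch.Tensor:
    """hidden_state: (batch, 128); labels: feature name -> (batch,) class ids."""
    h = hidden_state.detach()  # decouple probes from the model's gradient computation
    return sum(criterion(probe(h), labels[name]) for name, probe in probes.items())
```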
Control vectors
Using our interpretable motion features, we build pairs of opposing features.
Specifically, we generate speed control vectors representing the direction from low to high speed, acceleration control vectors representing the direction from decelerating to accelerating, direction control vectors representing the direction from turning left to turning right, and agent control vectors representing the direction from pedestrian to vehicle.
For each pair, we use the hidden states \mathbf{H}_{i,:}^{(m)} from module m=2 and the last embedding per motion sequence (i=-1), as it is closest to the start of the prediction.
Sparse autoencoders
We train SAEs as auxiliary models with sparse intermediate dimensions of 512, 256, 128, 64, 32, and 16.
The total loss combines an \ell_2 reconstruction loss with an \ell_1 sparsity penalty: \ell_2 ensures accurate reconstruction, while \ell_1 promotes sparsity by minimizing small, noise-like activations.
The \ell_1 penalty must be carefully scaled to avoid deadening important features.
We scale it by 3e-4.
We optimize the models over 10000 epochs using the Adam optimizer and a batch size of 528.
The final loss values are provided in the appendix.
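The sketch below illustrates such an SAE and its training objective in PyTorch; the layer shapes and loss reductions are assumptions consistent with the loss description in the appendix.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Fully-connected SAE with a ReLU bottleneck (a minimal sketch)."""

    def __init__(self, dim: int = 128, sparse_dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(dim, sparse_dim)
        self.decoder = nn.Linear(sparse_dim, dim)

    def encode(self, h):
        return torch.relu(self.encoder(h))

    def forward(self, h):
        s = self.encode(h)
        return self.decoder(s), s

def sae_loss(model, hidden_states, l1_scale=3e-4):
    recon, s = model(hidden_states)
    l2 = (recon - hidden_states).pow(2).sum(dim=-1).mean()  # reconstruction term
    l1 = s.abs().sum()                                       # sparsity term
    return l2 + l1_scale * l1
```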
Results
Extracting interpretable features for motion
Our approach relies on a well-structured latent space, where samples are clustered with respect to interpretable features.
First, we ensure that our features are not highly correlated, as confirmed by the Spearman feature correlation analysis in the appendix.
Next, we report linear probing accuracy for interpretable features during training on the AV2F and Waymo datasets.
The figures below show the linear probing accuracies for our interpretable motion features for the AV2F dataset.
The scores are computed on the validation split over the course of training.
All models achieve similar accuracy scores, while the Wayformer model achieves slightly higher scores for classifying acceleration and lower scores for agent classes.
Overall, we measure high linear probing accuracy for all interpretable features.
This shows that all models likely exhibit neural collapse toward interpretable features.
Figure 2: Linear probing accuracies for RedMotion, Wayformer, and HPTR on the validation split of the AV2F dataset.
The representation quality metric, the normalized standard deviation of embeddings, is shown in the figure below.
Both HPTR and RedMotion learn to generate embeddings with a normalized standard deviation close to the desired value of 1 / \sqrt{\text{dim}}, where \text{dim} is the hidden dimension.
The scores for Wayformer are lower, which reflects differences between attention-based and MLP-based motion encoders.
Figure 3:
Normalized standard deviation representation quality metric for RedMotion, Wayformer, and HPTR.
The figure below shows the linear probing accuracies for our interpretable features on the Waymo dataset.
Here, we report the scores for each of the three hidden states \mathbf{H}_i in the RedMotion model (i.e., after each module m in the motion encoder, see Figure 1).
Similar accuracy scores are reached for all features at all three hidden states.
The accuracies for the speed and acceleration classes continuously improve, while those for direction classes reach 0.80 early on.
Compared to the direction scores on the AV2F dataset, the scores on the Waymo dataset "jump" earlier.
We hypothesize that this is linked to the shorter input motion sequence on Waymo (1.1 seconds vs. 5 seconds), which limits the range of possible movements.
In contrast to the AV2F dataset, higher accuracies are achieved for classifying speed.
Overall, the highest scores are reached for classifying agent types, as on the AV2F dataset.
Figure 4:
Linear probing accuracies at module 0, module 1 and module 2 for classifying speed, acceleration, direction, and agent type on the validation split of the Waymo dataset.
In addition to linear probing, we measure neural collapse using the class-distance normalized variance (CDNV); see the appendix.
On the Waymo dataset, the within-class variance values and the mean distance norm for RedMotion are 10.68 and 11.24, respectively, resulting in a CDNV of 0.95.
On the AV2F dataset, these values are 5.73 and 2.32, yielding a CDNV of 2.46.
We hypothesize that the higher CDNV value on AV2F is caused by the longer past motion sequence (i.e., 5 seconds vs. 1.1 seconds on Waymo), allowing for a greater range of potential movements.
Modifying hidden states of motion transformers at inference
Building on the insight that hidden states are likely collapsed toward our interpretable features, we fit control vectors using opposing features.
These control vectors allow for modifying motion forecasts at inference.
Specifically, we build pairs of opposing features for the AV2F and the Waymo dataset.
Then, we fit sets of control vectors (\mathbf{V}_i in Figure 1) as described in the control vectors section.
At inference, we add the control vectors generated for the last temporal index (i = -1) to all embeddings (i \in \{0, \ldots, 49\} for AV2F, i \in \{0, \ldots, 10\} for Waymo).
Qualitative results
Figure 5 shows a qualitative example from the AV2F dataset, where we modify hidden states using our control vector for acceleration scaled with different temperatures \tau.
Subfigure a shows the default (i.e., non-controlled) top-1 (i.e., most likely) motion forecast.
In subfigures b and c, we apply our acceleration control vector with \tau=-20 and \tau=100 to enforce a strong deceleration and a moderate acceleration, respectively.
(a) Default motion forecast
(b) Enforced strong deceleration
(c) Enforced acceleration
Figure 5:Modifying hidden states to control a vehicle at an intersection.
We add our acceleration control vector scaled with \tau=-20 and \tau=100 to enforce a strong deceleration and a moderate acceleration.
The focal agent is highlighted in orange, dynamic agents are blue, and static agents are grey.
Lanes are black lines and road markings are white lines.
Figure 6 shows the focal frames for our acceleration control vector across all \tau values from -100 to 100.
Use the slider to sweep the full range; \tau=0 is the default forecast, negative values decelerate, and positive values accelerate.
Default motion forecast (\tau=0)
Figure 6:Modifying hidden states across the full \tau sweep.
The slider steps through every focal frame from \tau=-100 to \tau=100; negative values decelerate, while positive values accelerate while respecting the scene context.
In the appendix, we include an example of our direction control vector.
Overall, these qualitative results support the finding that the hidden states of motion sequences are arranged with respect to our discrete sets of motion features.
Similarity-based comparison of control vectors
In this section, we evaluate how control vectors obtained using SAEs differ from those derived via plain PCA.
For comparison, we train SAEs with varying sparse intermediate dimensions: 512, 256, 128, 64, 32, and 16.
For each control vector, we calculate its pairwise angles with the control vectors for controlling other features.
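As a small sketch of this comparison, pairwise angles can be computed from cosine similarity:

```python
import numpy as np

def angular_distance_deg(v1: np.ndarray, v2: np.ndarray) -> float:
    """Angle in degrees between two control vectors."""
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```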
Table 1 presents the angular distances between control vectors of speed, acceleration, direction, and agent type, generated with plain PCA and with our SAE with a sparse intermediate dimension of 128.
As expected, the similarity between speed and acceleration, speed and agent type, and acceleration and agent type is notably high, while the similarity between direction and the other vectors is significantly lower.
This result aligns with expectations, as positive speed and acceleration controls lead to faster motion, and our agent type control vector represents the transition from pedestrian to vehicle, which is associated with faster motion as well.
Angular-distance results for the remaining SAE dimensions are in the appendix.
Among the evaluated dimensions, the control vectors generated using the SAE with an intermediate dimension of 128 achieve the highest overall similarity.
Plain PCA & Plain PCA (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 0.0   | 11.5         | 122.6     | 10.9  |
| acceleration |       | 0.0          | 126.8     | 6.8   |
| direction    |       |              | 0.0       | 128.7 |
| agent        |       |              |           | 0.0   |

SAE & SAE (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 0.0   | 9.5          | 120.6     | 7.8   |
| acceleration |       | 0.0          | 122.9     | 7.0   |
| direction    |       |              | 0.0       | 125.8 |
| agent        |       |              |           | 0.0   |
Table 1:
Comparison of control vectors, with angles measured in degrees.
Quantitative evaluation of SAEs for optimizing control vectors
We empirically analyze the temporal-causal relationship between modifications to the hidden states of past motion and the resulting motion forecasts.
Specifically, we measure the linearity of relative speed changes in forecasts when scaling our speed control vectors.
We use the Pearson correlation coefficient, the coefficient of determination (R2), and the straightness index (S-idx) as linearity measures.
Given the large range of scenarios in the Waymo dataset, we focus on relative speed changes within a range of \pm 50\% (see the appendix).
Higher linearity implies improved controllability.
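As a sketch, the three linearity measures can be computed from a sweep of temperatures and the resulting relative speed changes. We assume here that R² is computed against the identity mapping between \tau and relative speed change, and that the straightness index is the ratio of chord length to path length of the calibration curve; both are assumptions about conventions rather than confirmed definitions.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

def linearity_measures(taus: np.ndarray, speed_changes: np.ndarray):
    """taus: control vector temperatures; speed_changes: relative speed changes (%)."""
    pearson = pearsonr(taus, speed_changes)[0]
    # R2 against the identity mapping tau -> speed change (an assumption,
    # consistent with the near 1-to-1 calibration discussed later).
    r2 = r2_score(speed_changes, taus)
    # Straightness index of the calibration curve: chord length / path length.
    pts = np.stack([taus, speed_changes], axis=1)
    path = np.linalg.norm(np.diff(pts, axis=0), axis=1).sum()
    chord = np.linalg.norm(pts[-1] - pts[0])
    return pearson, r2, chord / path
```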
We compute linearity measures for control vectors optimized using regular SAEs with varying sparse intermediate dimensions.
We achieve the highest scores using the SAE with a dimension of 128 (see Table 2).
Therefore, we use this dimension in the rest of our evaluations.
| Autoencoder | Pearson | R²    | S-idx |
|-------------|---------|-------|-------|
| SAE-512     | 0.990   | 0.974 | 0.984 |
| SAE-256     | 0.990   | 0.974 | 0.985 |
| SAE-128     | 0.993   | 0.984 | 0.988 |
| SAE-64      | 0.991   | 0.976 | 0.985 |
| SAE-32      | 0.990   | 0.959 | 0.985 |
| SAE-16      | 0.982   | 0.770 | 0.958 |
Table 2:
Scaling sparse autoencoders.
In the following, we evaluate autoencoders with different activation functions and layer types.
Following Rajamanoharan et al., we use JumpReLU activation functions with a threshold \theta = 0.001, as well as regular ReLU activation functions.
Moreover, we evaluate regular SAEs with fully-connected layers, with MLPMixer layers (Sparse MLPMixer), and with convolutional layers (ConvSAE).
For Sparse MLPMixer and ConvSAE, we use large patch and kernel sizes to approximate the global receptive fields of fully-connected hidden units in regular SAEs.
Furthermore, we evaluate a consistent Koopman autoencoder (KoopmanAE) to include a method that models temporal dynamics between embeddings (see the appendix).
Table 3 presents linearity measures for different control vectors derived from both plain PCA pooling and SAE methods.
Overall, the regular SAEs achieve the highest Pearson and R2 scores.
JumpReLU activation functions improve the R2 scores marginally compared to ReLU activation functions.
The SAE version of Cunningham et al. does not improve the linearity scores.
We hypothesize that this is due to reduced decoding flexibility, since it transposes the encoder weights instead of learning separate decoder weights (i.e., \mathbf{W}_\text{dec} = \mathbf{W}_\text{enc}^\top).
| Autoencoder | Activation | Pooling | Patch/kernel size | Pearson | R² | S-idx |
|---|---|---|---|---|---|---|
| -- | -- | PCA | -- | 0.988 | 0.969 | 0.981 |
| SAE (Bricken 2023) | ReLU | PCA | -- | 0.993 | 0.984 | 0.988 |
| SAE (Rajamanoharan 2024) | JumpReLU | PCA | -- | 0.993 | 0.986 | 0.988 |
| SAE (Cunningham 2023) | ReLU | PCA | -- | 0.987 | 0.971 | 0.980 |
| Sparse MLPMixer | ReLU | PCA | 64 | 0.992 | 0.980 | 0.986 |
| Sparse MLPMixer | JumpReLU | PCA | 64 | 0.992 | 0.981 | 0.986 |
| Sparse MLPMixer | ReLU | PCA | 32 | 0.990 | 0.978 | 0.985 |
| Sparse MLPMixer | JumpReLU | PCA | 32 | 0.991 | 0.980 | 0.986 |
| ConvSAE | ReLU | PCA | 64 | 0.986 | 0.383 | 0.991 |
| ConvSAE | JumpReLU | PCA | 64 | 0.987 | 0.861 | 0.978 |
| ConvSAE | ReLU | PCA | 32 | 0.988 | 0.622 | 0.986 |
| ConvSAE | JumpReLU | PCA | 32 | 0.989 | 0.623 | 0.986 |
| KoopmanAE (Azencot 2020) | tanh | PCA | -- | 0.991 | -0.057 | 1.000 |
Table 3:
Linearity measures for optimized control vectors: Pearson correlation coefficient, coefficient of determination (R2), and straightness index (S-idx).
The ConvSAE with a kernel size k = 64 and the KoopmanAE achieve the highest straightness index, yet the lowest R2 scores.
As shown in Figure 7 and in the appendix, the range of temperatures \tau is much higher for this ConvSAE and significantly lower for the KoopmanAE than for, e.g., the regular SAE.
This lowers the R2 score but does not affect the straightness index.
For the ConvSAE, we hypothesize that this is due to strong activation shrinkage.
Accordingly, the JumpReLU configuration of this SAE type leads to a significantly smaller \tau range (see the appendix), which in turn leads to higher R2 scores (see Table 3).
For the KoopmanAE, the opposite is likely, since activation shrinkage is caused by sparsity terms, which are not included in its loss function.
Notably, activation steering with our SAE-based control vector has an almost 1-to-1 ratio between \tau and relative speed changes (i.e., \tau = -50 corresponds to roughly -50\%).
This improves R2 scores and enables an intuitive interface.
Furthermore, improved controllability with SAEs indicates that sparse intermediate representations capture more distinct features.
Figure 7:
Calibration curves of plain PCA-based speed control vectors and control vectors optimized using SAEs for relative speed changes in forecasts of \pm 50\%.
Relation of probing accuracy to linearity measures for control vectors
We train a RedMotion model on the AV2F dataset using the same trajectory lengths as in the Waymo dataset (1.1 s past and 8 s future), while leaving all other hyperparameters as described in the appendix.
Table 4 shows the probing accuracy and linearity measures of a speed control vector for this model (see the appendix for the calibration curve).
Compared with a model trained on the Waymo dataset, the AV2F model achieves both a lower probing accuracy and significantly lower linearity measures.
These results support our argument that latent space regularities with separable features are necessary to fit precise control vectors.
| Dataset | Probing accuracy | Pearson | R²    | S-idx |
|---------|------------------|---------|-------|-------|
| AV2F    | 0.753            | 0.877   | 0.275 | 0.891 |
| Waymo   | 0.945            | 0.988   | 0.969 | 0.981 |
Table 4:Higher probing accuracy enables higher linearity measures.
We train RedMotion models on the Waymo and AV2F datasets using the same trajectory lengths.
We report the probing accuracies for speed classes and the linearity measures for the corresponding PCA-based control vectors.
In the appendix, we present an ablation study analyzing our method's sensitivity to hidden states from different modules and to varying speed thresholds.
Our method performs best with a sparse intermediate dimension of 128 and hidden states from module m=2, and it is more sensitive to low than to high speed thresholds.
Zero-shot generalization with control vectors
Domain shifts between training and test data significantly degrade the performance of many learning algorithms.
Zero-shot generalization methods compensate for such domain shifts without further training or fine-tuning.
In motion forecasting, common domain shifts are more or less aggressive driving styles that result in higher or lower future speeds, respectively.
We simulate this domain shift by reducing the future speed in the Waymo validation split by approximately 50%.
Specifically, we take the first half of future waypoints and linearly upsample this sequence to the original length.
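A minimal sketch of this speed-reducing shift, assuming future waypoints are stored as a NumPy array of 2D positions:

```python
import numpy as np

def reduce_future_speed(waypoints: np.ndarray) -> np.ndarray:
    """Roughly halve the future speed by stretching the first half of the future.

    waypoints: (T, 2) future positions; returns (T, 2) positions covering
    about half the original distance over the original horizon.
    """
    T = len(waypoints)
    first_half = waypoints[: T // 2]
    # Linearly upsample the first half back to the original length.
    t_old = np.linspace(0.0, 1.0, num=len(first_half))
    t_new = np.linspace(0.0, 1.0, num=T)
    x = np.interp(t_new, t_old, first_half[:, 0])
    y = np.interp(t_new, t_old, first_half[:, 1])
    return np.stack([x, y], axis=1)
```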
Table 5 shows the results of a RedMotion model, trained on the regular training split, evaluated on this shifted validation split.
We provide an overview of the used motion forecasting metrics in the appendix.
Without our control vectors, we obtain high distance-based errors as well as high miss and overlap rates.
Using the calibration curve in Figure 7, we compensate for this domain shift by applying our SAE-128 control vector with a temperature \tau = -50.
This significantly reduces the distance-based errors, the overlap, and the miss rates without further training.
In addition, we show the results of applying our control vector with a temperature of \tau = -30 and \tau = -70, which improves all scores over the baseline as well.
| Control vector | Temperature \tau | minADE \downarrow | Brier minADE \downarrow | minFDE \downarrow | Brier minFDE \downarrow | Overlap rate \downarrow | Miss rate \downarrow |
|---|---|---|---|---|---|---|---|
| None | -- | 3.271 | 6.547 | 4.617 | 8.933 | 0.220 | 0.580 |
| SAE-128 | -30 | *1.685* | 4.838 | 2.870 | 8.429 | *0.179* | **0.224** |
| SAE-128 | -50 | **1.174** | **2.759** | **1.798** | *4.329* | **0.174** | *0.236* |
| SAE-128 | -70 | 1.808 | *3.576* | *2.035* | **3.676** | 0.189 | 0.302 |
Table 5:
Zero-shot generalization to a Waymo dataset version with reduced future speeds. Best scores are bold, second best are in italics.
Conclusion
In this work, we take a step toward mechanistic interpretability and controllability of motion transformers.
We analyze "words in motion" by examining the representations associated with quantized motion features.
Specifically, we show that neural collapse toward interpretable classes of features occurs in recent motion transformers.
The high degree of neural collapse indicates a well-separated latent space that enables fitting precise control vectors to opposing features and modifying predictions at inference.
We further refine this approach by optimizing our control vectors using sparse autoencoding, resulting in higher linearity.
Finally, we compensate for domain shift and enable zero-shot generalization to unseen dataset characteristics.
Our findings highlight the effectiveness of sparse dictionary learning and the use of SAEs for improving interpretability.
We assumed a flat latent space and relied on vector arithmetic.
We leave a detailed investigation of how SAE parameterization might help address potential latent space curvature for future work.
Furthermore, we have empirically shown a connection between neural collapse and the structure of the latent space for using control vectors, although our analysis remained limited to probing accuracy and class-distance normalized variance.
Our findings enable new applications in robotics and self-driving.
We identify safety validation in latent space as a promising direction, particularly for end-to-end driving.
Using control vectors to modify internal representations and adjust trajectories via instruction-based inputs is also a valuable application.
Finally, future work can explore the use of other embedding methods, as well as incorporate features from other modalities by capturing both static and dynamic scene elements.
Appendix
Acknowledgments
The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Climate Action within the project "NXT GEN AI METHODS".
The authors thank anonymous reviewers of ICLR 2025 for their valuable feedback and insightful suggestions.
Title origin
The title of our work, “words in motion”, is inspired by our quantization method using natural language and by a common notion in computer architecture.
In computer architecture, a word is a basic unit of data for a processing unit (e.g., CPU or GPU).
In our work, words are classes of motion features that are embedded in the hidden states of motion sequences processed by motion transformers.
Natural language as an interface for model interaction
Linking learned representations to natural language and using natural language as an interface for model interaction has gained significant attention.
Broadly, approaches incorporating language in models can be categorized into four types.
We present these approaches along with applications in robotics and self-driving below.
Conditioning.
Numerous works use natural language to condition generative models in diverse tasks such as image synthesis, video generation, and 3D modeling.
In the driving domain, related works generate dynamic traffic scenes based on user-specified descriptions expressed in natural language.
Prompting.
Some works use language as an interface to interact with models, enabling users to request assistance or information.
This includes obtaining explanations of underlying reasoning and human-centric descriptions of model behavior.
Some approaches generate linguistic descriptions of predicted trajectories during decoding, capturing essential information about future maneuvers and interactions.
More recent works employ large language models (LLMs) to analyze driving environments in a human-like manner, providing explanations of driving actions and the underlying reasoning.
This offers a human-centric description of the driving environment and the model's decision-making capabilities.
Enriching.
Another line of work leverages LLMs' generalization abilities to enrich context embeddings, providing additional information for better prediction and planning.
Some works integrate the enriched context information of LLMs into motion forecasting models.
Others use LLMs for data augmentation to improve out-of-distribution generalization.
Still others use pre-trained LLMs for better generalization during decision-making.
Instructing.
Natural language can be used to issue explicit commands for specific tasks, distinct from conditioning.
The main challenge is connecting the abstractions and generality of language with environment-grounded actions.
Several works enable robotic control through language-based instruction.
Others incorporate web knowledge, enriching vision-language-action models for more generalized task performance.
Further works demonstrate the use of instructions to guide task execution in self-driving, with experiments in simulation environments.
Although these works align learned text representations with embeddings of other modalities, in contrast to our work, they do not measure the functional importance of features.
To our knowledge, no prior work has explored the mechanistic interpretability of transformers in robotics applications.
Parameters for classifying motion features
We classify motion trajectories with a cumulative yaw angle of less than 15° as straight.
When the cumulative angle exceeds this threshold, a positive value indicates a right turn, while a negative value whose magnitude exceeds the threshold indicates a left turn.
We classify speeds between 25 km/h and 50 km/h as moderate, speeds above this range as high, those below as low, and negative speeds as backwards.
For acceleration, we classify trajectories as decelerating if the ratio of the integral of speed over time to the displacement projected with the initial speed is less than 0.9.
If this ratio is greater than 1.1, we classify them as accelerating.
For all other values, we classify the trajectories as having constant speed.
We determine all threshold values by analyzing the distribution of the dataset.
Figure 8:
Distributions of our motion features for the Argoverse 2 Forecasting (abbr. AV2F) and the Waymo Open Motion (abbr. Waymo) datasets. All numbers are percentages.
The figure above presents the distribution of motion subclasses across the datasets.
Both datasets predominantly capture low-speed scenarios, with 62% of Waymo instances and 53% of AV2F instances falling into this category.
Furthermore, a notable difference lies in the proportion of stationary vehicles, with AV2F exhibiting a significantly higher percentage (51%) compared to Waymo (28%).
The Waymo dataset predominantly features vehicles with constant acceleration (65%) and traveling straight (49%), while the AV2F dataset has a higher proportion of accelerating instances (52%).
Additionally, AV2F has a much larger proportion of instances involving backward motion (24%) compared to Waymo (4%).
This disparity in motion characteristics highlights that the two datasets capture different driving environments and scenarios, with Waymo potentially focusing on highway or structured urban driving, while AV2F contains more diverse traffic situations.
Motion transformers
Wayformer and RedMotion models employ attention-based scene encoders to learn agent-centric embeddings of past motion, map, and traffic light data.
To efficiently process long sequences, Wayformer uses latent query attention for subsampling, while RedMotion lowers memory requirements via local attention and redundancy reduction.
HPTR models learn pairwise-relative environment representations via kNN-based attention mechanisms.
For Wayformer, we use the implementation by Zhang et al. and the early fusion configuration.
Therefore, we analyze the hidden states generated by an MLP-based input projector for motion data, which consists of three layers.
For RedMotion and HPTR, we use the publicly available implementations.
We configure RedMotion with a late fusion encoder for motion data, and HPTR with a custom hierarchical fusion setup that uses a modality-specific encoder for past motion and a shared encoder for environment context.
We provide the Wayformer and HPTR models with the nearest 512 map polylines, and the RedMotion model with the nearest 128 map polylines.
For Wayformer and RedMotion, we use as the motion forecasting loss the unweighted sum of a negative log-likelihood loss for positions, modeled as a mixture of Gaussians, and a cross-entropy loss for confidences.
For HPTR, we additionally use the cosine loss for the heading angle and the Huber loss for velocities.
We use AdamW in its default configuration as optimizer and set the initial learning rate to 2e-4.
All models have a hidden dimension of 128 and are configured to forecast k = 6 trajectories per agent.
As post-processing, we follow Konev et al. and reduce the predicted confidences of redundant forecasts.
Meta-Architecture of multimodal motion transformers
Figure 9:
Motion transformer meta architecture of RedMotion, Wayformer, and HPTR.
We study multimodal motion transformers, which process motion, lane, and traffic light data.
The meta-architecture of these models is shown in the figure above.
These models generate motion M_i, map K_j, and traffic light T_k embeddings using MLPs.
Modality-specific encoders aggregate information from multiple embeddings with attention mechanisms (e.g., across multiple past timesteps for motion embeddings).
Afterwards, in the motion decoder, learned motion queries Q (i.e., a form of learned anchors) cross-attend to M, K, and T.
Finally, an MLP projects the last hidden state of Q into multiple motion forecasts, which are represented as 2D Gaussians for future positions in bird's-eye-view, along with their associated confidences.
The differences between the models lie in the type of attention and fusion mechanisms they employ, as well as the reference frames they use.
Early, hierarchical and late fusion in motion encoders
Fusion types for motion transformers are defined based on the information they process in the first attention layers.
In early fusion, the first attention layers process motion data of the modeled agent, other agents, and environment context.
In hierarchical fusion, they process motion data of the modeled agent, and other agents.
In late fusion, they exclusively process motion data of the modeled agent.
Feature correlation
Figure 10:
Heatmap representing Spearman correlation between feature cluster means for the Waymo Open Motion dataset.
The values in the matrix indicate pairwise distances between clusters, normalized by the largest distance.
Heatmap representing Spearman correlation between feature cluster means for the Argoverse 2 Forecasting dataset.
The values in the matrix indicate pairwise distances between clusters, normalized by the largest distance.
Explained variance
Figure 11:
Explained variance for Plain-PCA and SAEs across hidden latent dimensions 512, 256, 128, 64, 32, 16.
Qualitative results
Figure 12 shows a qualitative example from the AV2F dataset, where we modify hidden states using our control vector for acceleration scaled with different temperatures \tau.
Subfigure a shows the default (i.e., non-controlled) top-1 (i.e., most likely) motion forecast.
In subfigures b and c, we apply our acceleration control vector with \tau=-20 and \tau=100 to enforce a strong deceleration and a moderate acceleration, respectively.
(a) Default motion forecast
(b) Enforced strong deceleration
(c) Enforced acceleration
Figure 12:Modifying hidden states to control a vehicle at an intersection.
We add our acceleration control vector scaled with \tau=-20 and \tau=100 to enforce a strong deceleration and a moderate acceleration.
The focal agent is highlighted in orange, dynamic agents are blue, and static agents are grey.
Lanes are black lines and road markings are white lines.
Figure 13 shows focal frames for our acceleration control vector applied to all agents simultaneously across \tau values from -100 to 100.
Use the slider to sweep the full range; \tau=0 is the default forecast, negative values decelerate, and positive values accelerate while preserving scene consistency.
Default multi-agent motion forecast (\tau=0)
Figure 13:Control vectors applied to all agents.
The slider steps through frames from \tau=-100 to \tau=100 to show consistent multi-agent behavior under acceleration and deceleration steering.
Comparison of control vectors using plain PCA and SAE across various sparse intermediate dimensions
Comparison of control vectors obtained with and without SAEs across sparse intermediate dimensions (512, 256, 128, 64, 32, 16).
The first table of each pair shows the pairwise angular distances between control vectors of the same SAE model, whereas the second shows the angular distances between SAE control vectors and those derived from plain PCA.
The control vector with a sparse intermediate dimension of 128 achieves the highest overall similarity.
SAE-512 & SAE-512 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 0.0   | 10.2         | 121.8     | 7.6   |
| acceleration |       | 0.0          | 123.7     | 7.6   |
| direction    |       |              | 0.0       | 126.9 |
| agent        |       |              |           | 0.0   |

Plain PCA & SAE-512 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 20.7  | 28.6         | 123.8     | 23.4  |
| acceleration | 19.1  | 23.0         | 128.5     | 18.6  |
| direction    | 115.9 | 116.6        | 13.7      | 120.8 |
| agent        | 19.4  | 24.4         | 130.2     | 18.3  |

SAE-256 & SAE-256 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 0.0   | 9.9          | 120.9     | 7.9   |
| acceleration |       | 0.0          | 123.7     | 7.2   |
| direction    |       |              | 0.0       | 126.3 |
| agent        |       |              |           | 0.0   |

Plain PCA & SAE-256 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 21.5  | 26.8         | 123.8     | 23.3  |
| acceleration | 20.3  | 21.0         | 128.7     | 18.7  |
| direction    | 114.7 | 116.9        | 13.7      | 120.1 |
| agent        | 20.8  | 23.1         | 130.2     | 18.7  |

SAE-128 & SAE-128 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 0.0   | 9.5          | 120.6     | 7.8   |
| acceleration |       | 0.0          | 122.9     | 7.0   |
| direction    |       |              | 0.0       | 125.8 |
| agent        |       |              |           | 0.0   |

Plain PCA & SAE-128 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 19.7  | 25.3         | 124.3     | 21.6  |
| acceleration | 19.2  | 20.0         | 128.8     | 17.5  |
| direction    | 115.2 | 117.1        | 12.1      | 120.5 |
| agent        | 19.5  | 21.8         | 130.4     | 17.1  |

SAE-64 & SAE-64 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 0.0   | 9.7          | 121.0     | 8.0   |
| acceleration |       | 0.0          | 123.2     | 7.5   |
| direction    |       |              | 0.0       | 126.3 |
| agent        |       |              |           | 0.0   |

Plain PCA & SAE-64 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 18.1  | 23.7         | 124.7     | 19.3  |
| acceleration | 19.3  | 19.9         | 128.9     | 16.5  |
| direction    | 115.0 | 116.6        | 13.3      | 120.5 |
| agent        | 19.8  | 21.9         | 130.5     | 16.4  |

SAE-32 & SAE-32 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 0.0   | 9.8          | 120.3     | 8.3   |
| acceleration |       | 0.0          | 122.8     | 7.0   |
| direction    |       |              | 0.0       | 125.8 |
| agent        |       |              |           | 0.0   |

Plain PCA & SAE-32 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 14.7  | 18.8         | 126.4     | 15.5  |
| acceleration | 18.0  | 15.5         | 130.3     | 14.1  |
| direction    | 114.4 | 116.9        | 10.9      | 120.2 |
| agent        | 18.1  | 17.6         | 132.0     | 13.4  |

SAE-16 & SAE-16 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 0.0   | 9.5          | 124.1     | 9.3   |
| acceleration |       | 0.0          | 125.2     | 7.5   |
| direction    |       |              | 0.0       | 129.3 |
| agent        |       |              |           | 0.0   |

Plain PCA & SAE-16 (°)

|              | speed | acceleration | direction | agent |
|--------------|-------|--------------|-----------|-------|
| speed        | 23.5  | 25.1         | 126.6     | 21.8  |
| acceleration | 28.4  | 26.0         | 128.9     | 23.5  |
| direction    | 110.2 | 111.9        | 24.6      | 116.6 |
| agent        | 28.0  | 26.8         | 131.0     | 22.5  |
Table 6:
Angular distances for SAE control vectors compared to plain PCA across sparse intermediate dimensions.
Loss metrics for SAEs
We report the results for the epoch with the lowest
\text{total loss} = \ell_2\text{-loss} + 3 \cdot 10^{-4} \times \ell_1\text{-loss}.
The \ell_2 reconstruction loss is computed as the average of all partial losses for all embeddings, while the \ell_1 sparsity loss is computed as the sum of all partial losses.
| Dim | Best epoch | Total loss | \ell_2 reconstruction loss | \ell_1 sparsity loss |
|-----|------------|------------|----------------------------|----------------------|
| 512 | 9805       | 4.01       | 1.52                       | 8270.70              |
| 256 | 9845       | 3.72       | 1.38                       | 7823.98              |
| 128 | 9820       | 4.14       | 1.56                       | 8608.95              |
| 64  | 9348       | 4.56       | 1.89                       | 8894.97              |
| 32  | 9864       | 7.14       | 3.90                       | 10795.54             |
| 16  | 9956       | 17.44      | 13.37                      | 13576.57             |
Table 7:
Loss metrics for SAEs across sparse intermediate dimensions, trained for 10,000 epochs.
Inference latency
We measure the inference latency of a RedMotion model on the Waymo Open Motion dataset with and without activation steering using our control vectors.
Activation steering adds only about 1 ms to the total inference latency.
Since most datasets are recorded at 10 Hz, it is common to define the threshold for real-time capability of self-driving stacks as ≤ 100 ms.
Considering the inference latency of recent 3D perception models (e.g., approximately 40 ms), which must be called before motion forecasting, activation steering should not add significantly to the forecasting latency.
| Activation steering | Focal agents | Inference latency |
|---------------------|--------------|-------------------|
| False               | 8            | 50.21 ms          |
| True                | 8            | 51.08 ms          |
Table 8:
Inference latency without and with activation steering using control vectors.
Control vectors across modules in sparse autoencoders
Control vectors for speed generated in earlier modules achieve lower linearity scores for activation steering.
The linearity measures include the Pearson correlation coefficient, the coefficient of determination (R²), and the straightness index.
| Autoencoder | Module m | Pearson | R²    | S-idx |
|-------------|----------|---------|-------|-------|
| SAE-128     | 2        | 0.993   | 0.984 | 0.988 |
| SAE-128     | 1        | 0.992   | 0.980 | 0.987 |
| SAE-128     | 0        | 0.959   | 0.654 | 0.933 |
Table 9:
Generating control vectors for hidden states of different modules.
Sensitivity analysis for various speed thresholds
We generate speed control vectors with different thresholds for low and high speed.
Decreasing the threshold for high speed marginally improves linearity scores, while increasing the threshold for low speed significantly worsens the linearity scores.
| Autoencoder | Low speed      | High speed     | Pearson | R²     | S-idx |
|-------------|----------------|----------------|---------|--------|-------|
| SAE-128     | < 25 km/h      | > 50 km/h      | 0.993   | 0.984  | 0.988 |
| SAE-128     | < 25 km/h      | 25 to 50 km/h  | 0.994   | 0.987  | 0.989 |
| SAE-128     | 25 to 50 km/h  | > 50 km/h      | 0.355   | -0.734 | 0.533 |
Table 10:
Linearity scores for speed control vectors with different threshold choices.
Plain PCA-based speed control vector for the AV2F dataset
Figure 14 shows the calibration curve for a plain PCA-based speed control vector for a RedMotion model trained on the AV2F dataset (cf. the linearity section).
To suppress the effects of different trajectory lengths, we trained this model on an AV2F configuration with the same trajectory lengths as in the Waymo dataset.
Unlike the control vectors for Waymo, this control vector cannot reduce the speed by more than 3%, so we center the range of \tau values instead of the range of relative speed changes.
Figure 14:Calibration curve for a plain PCA-based speed control vector for the AV2F dataset.
In contrast to the control vectors for the Waymo dataset, this control vector cannot reduce the speed by more than 3%.
We used the low and moderate speed classes to fit this control vector, as the low and high speed classes did not yield good results.
We hypothesize that this is due to the different distributions of the datasets shown in Figure 8.
JumpReLU compensates activation shrinkage in ConvSAEs
Figure 15 compares calibration curves for ConvSAE k = 64 with ReLU and JumpReLU activations.
JumpReLU compensates activation shrinkage, leading to a smaller range of \tau values for the same relative speed changes.
Figure 15:JumpReLU compensates activation shrinkage.
JumpReLU yields a smaller \tau range than the ConvSAE (ReLU) for the same relative speed changes, resulting in higher R2 scores (cf. Table 3).
The range of temperatures is much larger for the ConvSAE with ReLU than for the JumpReLU version of this sparse autoencoder (ConvSAE k = 64 JumpReLU).
We hypothesize that this is due to activation shrinkage.
Therefore, the JumpReLU configuration of this SAE type leads to a significantly smaller \tau range, which in turn leads to higher R2 scores (see Table 3).
Choosing a range of relative changes in future speed
Given the large range of scenarios in the Waymo dataset, we focus on relative speed changes within a range of +/- 50% to capture the most relevant speed variations.
Considering the approximated mean and standard deviation for each agent type (vehicles: mean ≈ 12 m/s, std. dev. ≈ 5 m/s; pedestrians: mean ≈ 1.5 m/s, std. dev. ≈ 0.7 m/s; cyclists: mean ≈ 7 m/s, std. dev. ≈ 3 m/s), the +/- 50% range corresponds to speeds within approximately one standard deviation of the mean for each agent type.
Evaluating a Koopman autoencoder
A consistent Koopman autoencoder (KoopmanAE) is a bidirectional method that models temporal dynamics between embeddings.
The learned latent space approximates a Koopman-invariant space where dynamics evolve linearly.
Adapted to the SAE configurations, we train an encoder and a decoder with one layer each and a latent dimension of 128.
We use learned linear projections to decode Koopman operator approximations C, D \in \mathbb{R}^{128 \times 128} from intermediate representations.
For the first 10 time steps, we encode the embedding and predict the next embedding using C, while for the last 10 time steps, we encode the embedding and predict the previous embedding using D.
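A minimal sketch of this setup is given below, assuming fixed learned operators C and D as in the standard consistent Koopman autoencoder; the module layout and names are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn

class KoopmanAE(nn.Module):
    """One-layer encoder/decoder with forward and backward operators (sketch)."""

    def __init__(self, dim: int = 128, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, dim)
        self.C = nn.Linear(latent_dim, latent_dim, bias=False)  # forward dynamics operator
        self.D = nn.Linear(latent_dim, latent_dim, bias=False)  # backward dynamics operator

    def forward(self, h_t):
        z = torch.tanh(self.encoder(h_t))   # encode the current embedding
        h_next = self.decoder(self.C(z))    # predict the next embedding
        h_prev = self.decoder(self.D(z))    # predict the previous embedding
        return h_next, h_prev
```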
Afterwards, we use the KoopmanAE instead of an SAE to fit control vectors (see the control vectors section).
Figure 16 shows the calibration curve for the resulting speed control vector.
The range of \tau values is approximately 100x smaller than for the SAE-based control vector shown in Figure 7.
Figure 16:Calibration curve for a speed control vector optimized using the KoopmanAE.
The range of \tau values is significantly lower than for plain PCA and SAE-based control vectors (cf. Figure 7), yielding lower R2 scores as shown in Table 3.
Motion forecasting metrics
Following common practice, we use the minimum average displacement error (minADE), the minimum final displacement error (minFDE), and their respective Brier variants, which account for the predicted confidences.
Furthermore, we compute the miss rate and overlap rate to evaluate motion forecasts.
All metrics are computed using the minimum mode, i.e., measured on the trajectory closest to the ground truth.
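As a sketch, minADE and minFDE can be computed per agent as follows; the array shapes are assumptions, and the best mode is selected here by average displacement, which is one common convention.

```python
import numpy as np

def min_ade_fde(pred: np.ndarray, gt: np.ndarray):
    """pred: (k, T, 2) forecast modes; gt: (T, 2) ground-truth trajectory."""
    dists = np.linalg.norm(pred - gt[None], axis=-1)  # (k, T) per-step errors
    best = dists.mean(axis=1).argmin()                # mode closest to the ground truth
    min_ade = dists[best].mean()                      # average displacement error
    min_fde = dists[best, -1]                         # final displacement error
    return min_ade, min_fde
```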
Class-distance normalized variance
Neural collapse metrics capture structural patterns in feature representations, focusing on clustering, geometry, and alignment.
Class-distance normalized variance (CDNV), also referred to as “\mathcal{NC}1”, quantifies the degree to which features form class-wise clusters by measuring the variance within feature clusters of each class c relative to the distances between class means.
CDNV provides a robust alternative to methods that compare between- and within-cluster variation for assessing feature separability.
\mathcal{NC}1^\text{CDNV}_{c, c'} = \frac{\sigma_{c}^2 + \sigma_{c'}^2}{2 \lVert \mu_c - \mu_{c'}\rVert^{2}_2} , \quad \forall c \neq c'
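A minimal sketch of the pairwise CDNV computation from per-class hidden states; averaging over all class pairs to report a single value is an assumption.

```python
import numpy as np
from itertools import combinations

def cdnv(features_by_class: dict) -> float:
    """features_by_class: class label -> (num_samples, dim) array of hidden states."""
    means = {c: f.mean(axis=0) for c, f in features_by_class.items()}
    # Within-class variance: mean squared distance to the class mean.
    variances = {c: ((f - means[c]) ** 2).sum(axis=1).mean()
                 for c, f in features_by_class.items()}
    vals = []
    for c, c2 in combinations(features_by_class, 2):
        dist_sq = np.sum((means[c] - means[c2]) ** 2)
        vals.append((variances[c] + variances[c2]) / (2.0 * dist_sq))
    return float(np.mean(vals))
```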