
Whitepaper · v0.3

Localizing Dialect Representation in Open Norwegian-Capable LLMs

A mechanistic-interpretability look at where Qwen 2.5 commits to dialect, language, and lexical content in its residual stream

Authors

  • Andreas Grønbeck · Tenki

Published 10 May 2026


Findings

Key findings

  • The BM/NN dialect signal sits in every layer of Qwen 2.5 1.5B at ~0.80 linear-probe accuracy, despite being invisible to direct cosine similarity (residuals are ~0.98 similar between paired BM/NN inputs).
  • No single attention head carries the dialect signal; the largest single-head ablation drops the probe by 6.7 pp on a 0.62 baseline. The signal is distributed.
  • Activation patching reveals a clean consolidation hierarchy: foreign-language identity commits at layer ~10/28, paraphrase content at ~18/28, dialect at ~21/28. The deeper the contrast, the later the commitment.
  • The hierarchy reproduces on Qwen 2.5 3B (36 layers) at the same fractional positions in the stack, not the same absolute layer numbers. Robustness check passes.

Attachments (1)

  • DIA-LOC_preprint_2026-05-10.pdf (1.1 MB)

DIA-LOC: Localizing Dialect Representation in Open Norwegian-Capable LLMs

Made by Andreas Grønbeck | Founder & Partner at tenki — May 2026 (v0.3, cross-size replication on Qwen 2.5 1.5B and 3B)

> v0.3 release. Single-author, off-the-shelf model weights only. Cross-family replication on Gemma 3 4B (the production Hugin model) and pre-/post-BNCR comparison are deferred to v0.4, pending HuggingFace gated-access setup and BNCR-checkpoint availability, respectively.


Abstract

Norwegian has two co-official written standards, Bokmål (BM) and Nynorsk (NN). Modern open instruction-tuned LLMs systematically under-perform on NN at the output level (BNCR [5] showed a 9–12 pp gap on NorEval commonsense reasoning across Qwen 2.5 sizes). We ask the upstream question: where does the BM/NN distinction live INSIDE such a model, and is it the same internal machinery that encodes English-vs-Norwegian? Five mechanistic-interpretability probes (layer-wise cosine similarity, linear CKA [4], logit-lens with top-1 / top-10 Jaccard / Jensen-Shannon, linear probes for contrast identity, and attention-head ablation), plus last-token activation patching [11], are applied to off-the-shelf Qwen 2.5 1.5B Instruct [8] on three contrast sets: D1 (BM↔NN paraphrase via Apertium [7]), D2 (NB↔EN translation via FLORES-200 [6]), and D3 (BM↔BM paraphrase via Gemma 3 4B as a same-language control). Headline finding: the BM/NN dialect signal is linearly detectable from every transformer layer at ~0.80 5-fold CV probe accuracy, despite cosine similarity of ~0.98 between paired residuals (geometrically near-identical). Activation patching reveals a clean consolidation hierarchy in the last-token residual: foreign-language identity commits at ~30% through the stack, lexical paraphrase content at ~67%, dialect identity at ~80%. The hierarchy reproduces at Qwen 2.5 3B (36 layers) at the same fractional positions, suggesting it is not a model-size artefact within this family. Total compute: 1× RTX 3060 Ti, ~30 minutes wall-clock for the full pipeline. All code, data, figures, and a single-command reproduction path are public.

Keywords: mechanistic interpretability; dialect representation; Bokmål; Nynorsk; Norwegian LLMs; Qwen 2.5; activation patching; linear probes; sparse autoencoders; residual stream; consolidation hierarchy.


1. Introduction

1.1 The output-vs-internal distinction

The BNCR paper [5] showed that a lightweight KL-divergence consistency-regularization auxiliary loss applied to paired BM/NN inputs closes about half of the dialect performance gap on instruction-tuned Qwen 2.5. That result is at the OUTPUT level: the model's predictions on NN cloze tasks moved closer to its predictions on BM. It does not tell us whether BM and NN are represented the same way INSIDE the model, or whether the regularizer just realigned the readout layer.

This paper takes up the upstream-of-BNCR question. Forget fine-tuning for a moment: in an off-the-shelf instruction-tuned LLM, what does the residual stream look like when the same model processes paired BM and NN inputs? The output-vs-internal distinction matters because (a) AI-Act-flavoured compliance arguments rest on documenting internal model behaviour, not just outputs; (b) BNCR-v2 design depends on whether v1 unified internal representations or just realigned the readout; and (c) the multilingual mech-interp literature has localized where various languages live in LLaMA-class models [1], but no Norwegian work exists.

1.2 Hypotheses

We pre-register two hypotheses about the internal BM/NN distinction (the second comes with a paired alternative):

  • H1 (sparsity). The BM/NN representational divergence is concentrated in specific layers and a small subset of attention heads, not uniformly distributed.
  • H2 (entanglement). The set of dialect-carrying heads is a strict subset of foreign-language-carrying heads (the model treats nynorsk as a mild foreign language).
  • H2-alt (separation). The two head sets are disjoint or only weakly overlap.

Both H2 outcomes are scientifically interesting; we treat them as alternative hypotheses rather than a hypothesis-and-null.

1.3 Contributions

  • We document and quantify the BM/NN dialect signal in off-the-shelf Qwen 2.5 1.5B: 0.80 5-fold CV linear-probe accuracy across all 28 layers despite ~0.98 cosine similarity, demonstrating that geometric similarity dramatically understates the presence of a small linearly-readable direction (§4.1, §4.3).
  • We rule out the H1 sparsity hypothesis at head granularity: the largest single-head ablation drops the dialect probe by 6.7 percentage points on a 0.62 baseline. The signal is distributed (§4.4).
  • We identify a representational-commitment hierarchy in the last-token residual via activation patching [11]: foreign-language identity at ~30% of the stack, paraphrase content at ~67%, dialect at ~80%. This is, to our knowledge, the first quantitative evidence of such a layered hierarchy across these three contrast types in a single model (§4.6).
  • We replicate the consolidation hierarchy on Qwen 2.5 3B (36 layers) at the same fractional stack positions, ruling out a model-size artefact (§4.7).
  • We release all code, three contrast sets (200 + 200 + 100 sentence pairs), captured activations, and a reproducible pipeline that runs end-to-end on an 8 GB consumer GPU in approximately 30 minutes wall-clock.

1.4 Scope and non-claims

  • We evaluate only Qwen 2.5 1.5B and 3B Instruct. Cross-family replication on Gemma, Llama, or other architectures is open and explicitly enumerated as v0.4 work.
  • We do not compare pre- vs post-BNCR internally. That comparison was the original motivation; it is blocked on access to the BNCR-trained checkpoint.
  • The dialect direction we localize is a geometric direction (linear separability in residual space). We do not claim it corresponds to a specific function the model uses for downstream behaviour; activation patching provides causal evidence that something flips outputs, but does not isolate which function.
  • D1's BM↔NN paraphrases are Apertium-translated and therefore surface-preserving. Naturally-written nynorsk (chat, dialect-mixed, informal register) was not tested.

2. Background

2.1 The residual stream and mechanistic interpretability

Every transformer block reads from and writes to a shared residual stream. After block N, the stream represents the model's state after layer N has had its say. Block-output residuals are therefore the canonical granularity for layer-by-layer probing. Modern mech-interp work [1, 11, 13] treats the residual stream as the central object: layers are operations on it, attention heads write specific directions into it, and probes/patches are tools for measuring what is written where.

2.2 BNCR and the Norwegian dialect gap

The Bokmål–Nynorsk Consistency Regularization auxiliary loss [5] adds KL(P_BM || P_NN) on paired inputs during instruction fine-tuning. Trained on 1,000 BM/NN paraphrase pairs mined from the Målfrid corpus (Norwegian government documents required to exist in both standards under Mållov §8 [12]), BNCR closes about half the BM/NN performance gap on NorEval. The mechanism by which it does so is the central open question that motivated this paper.

2.3 Multilingual representations in transformer LLMs

Wendler et al. [1] showed LLaMA does most of its semantic work in an English-pivot internal language. Multilingual BERT studies [2, 3] found that mid-layer representations are partially language-neutral; surface-form information is concentrated in early and late layers, while semantic content is layered in the middle. None of this work targets Scandinavian dialect pairs, where the surface forms are far closer than typical multilingual contrasts (BM↔NN often share more than 90% of token surface forms in running text).

2.4 Activation patching as causal mech-interp

Activation patching [11] forwards a "clean" input, captures intermediate activations, then forwards a "patched" input where one or more positions of the residual stream are surgically replaced with the clean version's values. By measuring how much the model's output shifts under each patch location, it isolates which positions causally support which behaviour. In our v0.3 setup we patch only the last-token position at each layer (avoiding the trivial-transfer pathology that arises from full-residual patching, in which any layer's full replacement just amounts to "from here on, run as if input was B").
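To make the mechanic concrete, here is a minimal PyTorch sketch of last-token patching, assuming the Hugging Face Qwen2 module layout (model.model.layers, each block returning its hidden states first); the function names and batch-size-1 bookkeeping are ours, not the repo's code.

```python
import torch

def _hidden(out):
    # Decoder blocks return either a tensor or a tuple with hidden states first.
    return out[0] if isinstance(out, tuple) else out

@torch.no_grad()
def capture_block_outputs(model, inputs):
    """Clean forward pass; cache every block's output residual [1, seq, d_model]."""
    cached = []
    def hook(_mod, _inp, out):
        cached.append(_hidden(out).detach().clone())
    handles = [blk.register_forward_hook(hook) for blk in model.model.layers]
    model(**inputs)
    for h in handles:
        h.remove()
    return cached

@torch.no_grad()
def patch_last_token(model, inputs_a, cached_b, layer):
    """Re-run input A with B's last-token residual spliced in at one layer."""
    def hook(_mod, _inp, out):
        h = _hidden(out)
        h[:, -1, :] = cached_b[layer][:, -1, :]   # replace only the last position
        return (h,) + out[1:] if isinstance(out, tuple) else h
    handle = model.model.layers[layer].register_forward_hook(hook)
    logits = model(**inputs_a).logits
    handle.remove()
    return logits
```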


3. Data and experimental setup

3.1 Contrast sets

Three sentence-pair datasets, each providing paired (a, b) inputs that semantically agree but differ on one dimension (dialect, language, or paraphrase variation).

Set | a / b | Source | n
D1 | BM / NN | Norwegian Bokmål Wikipedia lead sentences (random sample), translated nob→nno via the Apertium public API [7]. Apertium is rule-based, surface-preserving, and the gold standard for BM/NN MT. | 200
D2 | NB / EN | FLORES-200 nob_Latn / eng_Latn dev split [6], paired line-by-line. | 200
D3 | BM / BM | Independent fresh Norwegian Bokmål Wikipedia leads, paraphrased into different-vocabulary BM via Gemma 3 4B (the production Tenki Hugin LLM). | 100

D3 is the surface-variation control: any probe that distinguishes paraphrases of the same dialect is detecting lexical noise, not language structure. It sets the noise floor for D1 and D2 claims.
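For orientation, constructing one D1 pair reduces to a single HTTP call against Apertium's public APy service. The endpoint URL and response schema below are assumptions about the public deployment, not the paper's exact client code.

```python
import requests

APY_URL = "https://apertium.org/apy/translate"  # assumed public APy endpoint

def bm_to_nn(text: str) -> str:
    """Translate one Bokmål sentence to Nynorsk (Apertium pair nob -> nno)."""
    resp = requests.get(APY_URL, params={"langpair": "nob|nno", "q": text}, timeout=30)
    resp.raise_for_status()
    return resp.json()["responseData"]["translatedText"]

# One D1 contrast pair: (a, b) agree semantically, differ only in written standard.
pair = {"a": "Oslo er hovedstaden i Norge.", "b": bm_to_nn("Oslo er hovedstaden i Norge.")}
```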

3.2 Activation capture

We tokenize each input separately (no padding within a pair), forward it through Qwen 2.5 1.5B Instruct [8] at fp16, and capture residual-stream values via PyTorch forward hooks on each transformer block's output. Three pooling modes per input:

  • mean: mean over real (non-pad) token positions. Sentence-level summary; consumed by cosine + CKA + linear probes.
  • last: residual at the last real token position. The autoregressive next-token-prediction anchor; consumed by logit lens and last-token activation patching.
  • tokens: per-token residuals at selected layers (we capture layer 20). For SAE training and per-token analyses.

Each contrast contributes two tensors of shape [n_pairs, n_layers, d_model] per pool mode (one per side), saved as fp16. Full activation capture for all three contrasts on a 3060 Ti takes ~80 seconds.
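A condensed sketch of the capture loop, assuming the public Hugging Face checkpoint name and Qwen2 module layout; hook bookkeeping is simplified to batch size 1, and the per-token mode is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cuda"
).eval()

@torch.no_grad()
def pooled_residuals(text: str):
    """Forward one input (no padding) and return mean- and last-pooled
    residuals, each of shape [n_layers, d_model]."""
    cached = []
    def hook(_m, _i, out):
        h = out[0] if isinstance(out, tuple) else out
        cached.append(h[0].detach())             # [seq, d_model] for batch size 1
    handles = [blk.register_forward_hook(hook) for blk in model.model.layers]
    model(**tok(text, return_tensors="pt").to(model.device))
    for h in handles:
        h.remove()
    stack = torch.stack(cached)                  # [n_layers, seq, d_model]
    return stack.mean(dim=1), stack[:, -1, :]    # "mean" pool, "last" pool
```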

3.3 Compute and tooling

All experiments run on a single NVIDIA RTX 3060 Ti (8 GB VRAM). Software: Python 3.12, PyTorch 2.6 with CUDA 12.4, transformers [14], accelerate, scikit-learn for linear probes, matplotlib. Reproducibility infrastructure: stochastic sources seeded with seed=42 at process start; CUDA determinism enabled where supported; every result file records the full hyperparameter dict and tensor SHA. A single make reproduce target rebuilds every result from raw downloads.
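The seeding boils down to a few lines at process start; a generic sketch, not the repo's exact code:

```python
import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True   # determinism where supported
torch.backends.cudnn.benchmark = False
```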


4. Probes and results

4.1 Geometric similarity

Per-layer mean cosine similarity between paired mean-pooled residuals, plus linear CKA [4]. D1 paired residuals are ~0.98 cosine similar at every layer, decreasing in CKA from 0.95 to 0.90. D2 is the clear outlier at ~0.69 cosine and 0.57→0.43 CKA. D3 (BM↔BM control) sits BELOW D1 in cosine at ~0.95 (with 0.76→0.58 CKA), a methodological surprise: Apertium-translated BM/NN preserves most surface tokens, while Gemma-paraphrased BM/BM deliberately uses different vocabulary. The model "sees" mostly-identical strings in D1 pairs.

Implication. Geometric similarity tells us very little about D1; it is compatible with "BM and NN are identical to the model" and also compatible with "the dialect signal is real but small enough that surface-token overlap dominates the metric". A more sensitive probe is required.
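For reference, both per-layer metrics are a few lines each; the linear CKA follows the feature-space form of Kornblith et al. [4], applied to [n_pairs, d_model] slices of the captured tensors.

```python
import torch
import torch.nn.functional as F

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between two [n, d] activation matrices (Kornblith et al. [4])."""
    X = X.float() - X.float().mean(dim=0)        # column-center both sides
    Y = Y.float() - Y.float().mean(dim=0)
    num = (X.T @ Y).norm(p="fro") ** 2
    den = (X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro")
    return (num / den).item()

def mean_paired_cosine(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Mean cosine similarity over row-aligned (a, b) pairs."""
    return F.cosine_similarity(X.float(), Y.float(), dim=-1).mean().item()

# per layer L: mean_paired_cosine(acts_a[:, L, :], acts_b[:, L, :]), etc.
```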

[Figure 03_similarity: per-layer cosine similarity and linear CKA for D1/D2/D3]

4.2 Logit lens

Top-1 next-token agreement is near 1.0 for all three contrasts at every layer, the trivial collapse onto sentence-final punctuation. Top-10 Jaccard overlap recovers the same ordering as cosine: D1 (0.94→0.86) > D3 (0.82→0.77) > D2 (0.68→0.70). Jensen-Shannon divergence is essentially zero everywhere because the high-probability top-1 token dominates the softmax distribution. At sentence-final positions, all three contrasts collapse onto the same predicted token; the next-token-prediction modality at this position does not differentiate dialect-vs-foreign-vs-paraphrase strongly enough to make the H2 entanglement test possible from logit lens alone.
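The readout itself is small: project an intermediate last-token residual through the final RMSNorm and unembedding, then compare top-10 sets. The module paths (model.model.norm, model.lm_head) match the Hugging Face Qwen2 implementation; the rest is a sketch.

```python
import torch

@torch.no_grad()
def top10_ids(model, resid_last):
    """Logit-lens readout of one intermediate last-token residual [d_model]:
    final RMSNorm + unembedding, returning the top-10 next-token ids."""
    logits = model.lm_head(model.model.norm(resid_last))
    return set(logits.topk(10).indices.tolist())

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# per layer L: jaccard(top10_ids(model, resid_a[L]), top10_ids(model, resid_b[L]))
```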

[Figure 04_logit_lens: logit-lens top-1 agreement, top-10 Jaccard, and JS divergence per layer]

4.3 Linear probes (the headline finding)

Per-layer 5-fold CV linear-probe accuracy on mean-pooled residuals:

  • D2 (NB↔EN): 1.00 across all 28 layers. Foreign-language identity is perfectly linearly separable from layer 0 to layer 27.
  • D1 (BM↔NN): 0.77 to 0.82 across all 28 layers. Reliably and substantially above the noise floor.
  • D3 (BM↔BM control): 0.68 to 0.70 across all 28 layers. The noise floor for surface-variation alone.

D1 sits ~10 percentage points above the D3 noise floor at every layer, on a 5-fold CV with 200 paired inputs. The dialect signal is real, distributed throughout the stack, and invisible to direct similarity metrics (cosine ~0.98 at the same layers).

The flat curve across layers is itself a finding: the model is not "unifying" or "diverging" BM/NN representations as we move up. It encodes dialect identity as a stable, consistent linear direction present from input through output.
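The probe is a few lines of scikit-learn. The paper fixes the protocol (5-fold CV, mean-pooled residuals); the logistic-regression classifier below is our assumption for the unspecified probe class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def probe_layer(acts_a: np.ndarray, acts_b: np.ndarray, layer: int) -> float:
    """5-fold CV accuracy of a linear probe separating side a from side b at
    one layer. acts_* have shape [n_pairs, n_layers, d_model]."""
    X = np.concatenate([acts_a[:, layer, :], acts_b[:, layer, :]]).astype(np.float32)
    y = np.concatenate([np.zeros(len(acts_a)), np.ones(len(acts_b))])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(LogisticRegression(max_iter=2000), X, y, cv=cv).mean()
```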

[Figure 05_linear_probes: per-layer 5-fold CV probe accuracy for D1/D2/D3]

4.4 Attention-head ablation

For each (layer L, head H) in the 28×12 = 336 head budget of Qwen 2.5 1.5B, we install a forward-pre-hook on layer L's o_proj that zeros the head_dim slice belonging to head H, run a subset of D1 (n=30) through the model, capture mean-pooled residual at the FINAL layer, and retrain the linear probe.

Layer | Head | Δ probe accuracy
0 | 1 | -0.067
22 | 6 | -0.067
0 | 3 | -0.050
1 | 5 | -0.033
5 | 4 | -0.017

Read. The biggest single-head ablation drops the dialect probe by ~6.7 percentage points on a 0.617 baseline. That is small. There is no single "dialect head"; ablating any one head leaves most of the signal intact. This is consistent with the linear-probe finding (flat 0.77 to 0.82 accuracy across all 28 layers): the dialect direction is distributed, not localized. This is the negative result for H1.
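The ablation hook, sketched for the Hugging Face Qwen2 layout; the 12 × 128 head geometry is Qwen 2.5 1.5B's, and the probe-retraining loop around it is omitted.

```python
N_HEADS, HEAD_DIM = 12, 128   # Qwen 2.5 1.5B: 12 heads x 128 dims = d_model 1536

def ablate_head(model, layer: int, head: int):
    """Zero one head's slice of the o_proj input at one layer via a forward
    pre-hook; returns the handle so the caller can .remove() it afterwards."""
    o_proj = model.model.layers[layer].self_attn.o_proj
    def pre_hook(_mod, args):
        x = args[0].clone()                       # [batch, seq, n_heads * head_dim]
        x[..., head * HEAD_DIM : (head + 1) * HEAD_DIM] = 0.0
        return (x,)
    return o_proj.register_forward_pre_hook(pre_hook)

# handle = ablate_head(model, layer=22, head=6)
# ... forward the D1 subset, capture final-layer residuals, retrain the probe ...
# handle.remove()
```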

[Figure 06_head_ablation_d1: Δ probe accuracy per ablated head on D1]

4.5 Sparse autoencoder (per-token, layer 20)

We train a small ReLU SAE (decoder columns kept unit-norm; encoder 1536 → 12288) at residual layer 20 on the per-token activations: 36,269 token-level samples across D1, D2, and D3 (36× the 1,000 mean-pooled samples of v0.1). Training: 30 epochs of mini-batch AdamW (batch 512, lr 3e-4, L1 weight 5e-3).
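A minimal sketch of the SAE as described (ReLU encoder, unit-norm decoder columns, L1 sparsity); initialization and the training loop are simplified to their essentials.

```python
import torch
import torch.nn as nn

class ReluSAE(nn.Module):
    """Small ReLU sparse autoencoder, 1536 -> 12288, matching the shape above."""
    def __init__(self, d_model: int = 1536, d_feat: int = 12288):
        super().__init__()
        self.enc = nn.Linear(d_model, d_feat)
        self.dec = nn.Linear(d_feat, d_model, bias=False)

    def forward(self, x):
        a = torch.relu(self.enc(x))     # sparse feature activations
        return self.dec(a), a

    @torch.no_grad()
    def renorm_decoder(self):
        # Keep decoder columns unit-norm so the L1 penalty on `a` is meaningful.
        self.dec.weight.div_(self.dec.weight.norm(dim=0, keepdim=True).clamp_min(1e-8))

# One training step (batch 512, lr 3e-4, L1 weight 5e-3 per the text):
#   recon, a = sae(batch)
#   loss = (recon - batch).pow(2).mean() + 5e-3 * a.abs().mean()
#   loss.backward(); opt.step(); opt.zero_grad(); sae.renorm_decoder()
```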

Differential SAE response per contrast (mean and max of |a-b| across SAE features):

Contrast | max |a−b| | mean |a−b|
D1 (BM↔NN) | 0.50 | 0.039
D2 (NB↔EN) | 8.25 | 0.535
D3 (BM↔BM) | 1.47 | 0.098

D2's differential SAE response is ~13× larger than D1's, matching the linear-probe ranking at a different level of granularity. Top-K differential-feature IoU at K=100 across contrasts is high (D1↔D2: 0.75, D1↔D3: 0.69, D2↔D3: 0.77), suggesting the top differential features are mostly shared across contrasts: they are general "input-deviation" features rather than sparse, dialect- or language-specific features.

[Figure 07_sae_iou_layer20: top-K differential-feature IoU across contrasts at layer 20]

Limitation. A canonical SAE training recipe requires millions of token-level activations; we use 36k. The negative result ("no clean dialect feature jumps out of the codebook") is honest but undertrained. Deferred to v0.4.

4.6 Activation patching at the last-token position

For each contrast we run a clean forward on text_a and text_b, capturing each block's output residual. Then for each layer L we re-run text_a with a forward hook on block L that REPLACES only the last-token position of the residual stream with B's last-token residual at the same layer; other positions retain A's processing. The transfer fraction at L is

T(L) = 1 − KL(p_{a,patched(L)} || p_b) / KL(p_a || p_b).

T(L) = 0 means patching at L did nothing; T(L) = 1 means the patched A run reproduces clean B's predictions. n=50 pairs per contrast.
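Per pair, T(L) is computed as below, reusing the patch_last_token sketch from §2.4; the KL is taken over the full next-token distribution at the final position.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def transfer_fraction(model, inputs_a, inputs_b, cached_b, layer):
    """T(L) for one (a, b) pair; cached_b comes from capture_block_outputs (Sec. 2.4)."""
    lp_a = F.log_softmax(model(**inputs_a).logits[0, -1].float(), dim=-1)
    lp_b = F.log_softmax(model(**inputs_b).logits[0, -1].float(), dim=-1)
    patched = patch_last_token(model, inputs_a, cached_b, layer)   # Sec. 2.4 sketch
    lp_p = F.log_softmax(patched[0, -1].float(), dim=-1)

    def kl(lp, lq):   # KL(P || Q) from two log-prob vectors
        return F.kl_div(lq, lp, log_target=True, reduction="sum")

    return 1.0 - (kl(lp_p, lp_b) / kl(lp_a, lp_b)).item()
```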

The headline finding is a hierarchy of representational commitment:

Contrast | First L with T ≥ 0.5 | First L with T ≥ 0.9 | Baseline KL(A‖B)
D2 (NB↔EN, foreign-language) | L10 | L17 | 4.54 nats
D3 (BM↔BM, paraphrase) | L18 | L26 | 1.08 nats
D1 (BM↔NN, dialectal) | L21 | L26 | 0.15 nats

In words: in this off-the-shelf Qwen 2.5 1.5B,

  • Foreign-language commitment in the last-token residual happens in the early-mid stack (~L10–17). By layer 17, replacing the last-token residual with the English version is enough to make the model predict like the English run.
  • Same-language paraphrase commitment happens in mid-stack (~L18–26). At L18, replacing the last-token residual with the paraphrased BM run's residual already carries 50% of the prediction shift.
  • Dialect commitment is the latest of the three (~L21–26). The BM/NN distinction in the last-token residual is finalized in the last 6–7 layers of the stack.

The three contrasts probe distinct levels of a representational hierarchy in the same model: language identity is decided early, lexical content in the middle, dialect identity late.

[Figure 08_activation_patching_combined: transfer fraction T(L) per layer for D1/D2/D3]

4.7 Cross-size replication on Qwen 2.5 3B

We re-run the activation-patching probe on Qwen 2.5 3B Instruct (36 layers vs 28), same three contrasts, same n=50 pairs each, same last-token patching protocol.

Milestone | 1.5B (28L) | 3B (36L) | Frac (1.5B) | Frac (3B)
D2 50% transfer | L10 | L11 | 0.37 | 0.31
D3 50% transfer | L18 | L24 | 0.67 | 0.69
D1 50% transfer | L21 | L29 | 0.78 | 0.83
D1 90% transfer | L26 | L33 | 0.96 | 0.94
D3 90% transfer | L26 | L33 | 0.96 | 0.94

The hierarchy from §4.6 reproduces intact: foreign < paraphrase < dialect. Absolute layer numbers shift with depth, as expected for a deeper stack distributing the same computational graph across more blocks, but proportional locations are stable. Foreign-language consolidation at ~30% through the stack, paraphrase content at ~67%, dialect at ~80%. The robustness check passes.

[Figure 10_cross_size_consolidation: fractional consolidation positions, 1.5B vs 3B]


5. Discussion

5.1 Three pieces of evidence converge

The activation-patching hierarchy is the v0.3 result that binds the rest of the paper together. The three contrasts each "commit" at a different layer band: foreign-language identity at ~L10–17, paraphrase content at ~L18–26, dialect identity at ~L21–26. At the granularity of the last-token residual, the model treats these as a layered hierarchy of decisions, not as parallel-but-similar tasks.

Three pieces of evidence converge on the same picture for D1:

  • Linear probe (§4.3): dialect identity is detectable at every layer (~0.80 accuracy), so the signal is present throughout.
  • Head ablation (§4.4): no individual head dominates, so the signal is distributed, not localized.
  • Activation patching (§4.6): the prediction-anchor (last-token residual) transfers from BM to NN sharply between L21 and L26, so the signal is consolidated late.

"Distributed signal that consolidates late" is the v0.3 one-line summary of where dialect lives in this model. Not in any one head, but as a late-stack commitment of a long-distributed signal.

5.2 Implications for BNCR-v2 design

Knowing that dialect commitment lives in layers ~21–26 (78–96% through the stack) is direct guidance for any future Bokmål-Nynorsk consistency intervention. A regularizer that targets the residual stream at those layers should be more efficient than one that distributes regularization pressure uniformly. An intervention that targets layer 10 would be wasted: that is the layer band where the model decides on language identity (NB vs EN), not on dialect. The model is not deciding dialect at L10. It is deciding dialect at L21.
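Purely as an illustration of that guidance (BNCR's published loss is a KL over output distributions [5]; no internal-layer variant has been trained), a hypothetical layer-targeted auxiliary term might look like the following, with the MSE form and the exact layer set being our assumptions:

```python
import torch.nn.functional as F

TARGET_LAYERS = range(21, 27)   # dialect-consolidation band from Sec. 4.6 (1.5B indexing)

def layer_targeted_consistency(resid_bm, resid_nn):
    """Hypothetical BNCR-v2-style auxiliary loss: consistency pressure on paired
    last-token residuals only at the layers where dialect commits.
    resid_* are per-layer lists of [batch, d_model] tensors."""
    terms = [F.mse_loss(resid_bm[l], resid_nn[l]) for l in TARGET_LAYERS]
    return sum(terms) / len(terms)
```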

5.3 Implications for AI-Act-flavoured compliance

For AI-Act-style traceability arguments, the output-vs-internal distinction we set out to test has a concrete answer in this model family: an off-the-shelf Qwen 2.5 already encodes the BM/NN distinction internally with linear separability comparable to (though weaker than) language identity. Compliance arguments that rely on "the model is unaware of dialect" are not supported by the data. The model is aware; the question is what it does with that awareness in its readout.

5.4 Honest negative results

We report three negative results with mechanism:

  • Direct cosine similarity is uninformative for D1 (§4.1). Surface-token overlap dominates the metric, smearing over the small dialect signal that linear probes recover.
  • Logit-lens at sentence-final positions trivially collapses (§4.2). All three contrasts agree on the next predicted token (typically punctuation), so top-1 agreement is uninformative; top-10 Jaccard recovers some structure but JS is dominated by the trivial top-1 mass.
  • No sparse "dialect feature" emerges from a small SAE (§4.5). With 36k training tokens (vs canonical millions), top-K differential features are mostly shared across contrasts, suggesting they are general input-deviation features. The negative result is honest but undertrained.

6. Limitations

  1. Single model family. Qwen 2.5 1.5B Instruct (primary, all probes) and Qwen 2.5 3B Instruct (cross-size replication on the activation-patching curve only). Generalization to Gemma 3 4B and Llama-family models is open; blocked currently on HF gated access for Gemma.
  2. Single source per contrast. Apertium for BM↔NN, FLORES for NB↔EN, Gemma 3 4B for BM↔BM. Surface-form effects are confounded with the translator/paraphraser. A multi-source D1 would tighten the dialectal claim.
  3. Mean-pooled SAE on 36k tokens. Per-token capture is a v0.2 improvement over v0.1's 1k mean-pooled samples, but a canonical SAE recipe trains on millions. The negative SAE result needs a longer training run before it can be claimed strongly.
  4. Off-the-shelf only. We do not compare pre- vs post-BNCR internally. That comparison is the original motivation; access to the BNCR-trained checkpoint is the blocker.
  5. Geometric, not functional. Linear-probe accuracy is a geometric measure. The dialect direction we localize does not necessarily correspond to a function the model uses for downstream behaviour.
  6. Last-token patching only. Per-position patching across the whole sequence at the consolidation layers (L20–25) would localize within sentences, not just within the stack. Marked as v0.4 future work.
  7. Single seed. Probes use scikit-learn's StratifiedKFold with random_state=0; the SAE uses the default torch RNG. Multi-seed validation (n≥3) was outside the wall-clock budget for v0.3.

7. Reproducibility

Artefact | Location
Code | github.com/triceraz/dia-loc
Contrast sets (JSONL) | data/ in the repo
Activations (*.pt, ~85 MB on 1.5B + ~100 MB on 3B) | HF dataset (forthcoming): triceraz/dia-loc-activations
Probe outputs (CSV/JSON) | runs/probes/ (1.5B), runs/probes_3b/ (3B)
Figures | paper/figures/ and paper/figures_3b/

A single repo clone followed by pip install -r requirements.txt and stepwise execution of src/01..src/10 regenerates every figure in this paper from raw downloads in approximately 30 minutes wall-clock on an 8 GB consumer GPU.


Acknowledgements

The Qwen team [8] released the 1.5B and 3B Instruct checkpoints under a permissive license that made all of this work possible. The Apertium project [7] provided the rule-based BM↔NN translation that gives D1 its surface-preserving character. FLORES-200 [6] provided the parallel NB↔EN reference data for D2. The HuggingFace transformers stack [14] is the load-bearing infrastructure for every probe in this paper. Mechanistic-interpretability methodology owes most of its conceptual grounding to the broader Anthropic transformer-circuits literature [11, 13] and to the Wendler et al. multilingual-LLM work [1] that asked the proximate version of our question first.

The Tenki Hugin LLM stack provided the Gemma 3 4B paraphrase backend used to generate the D3 control set.


References

[1] Wendler, C., Veselovsky, V., Monea, G., & West, R. (2024). Do Llamas Work in English? On the Latent Language of Multilingual Transformers. ACL.

[2] Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is Multilingual BERT? ACL.

[3] Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised Cross-lingual Representation Learning at Scale. ACL.

[4] Kornblith, S., Norouzi, M., Lee, H., & Hinton, G. (2019). Similarity of Neural Network Representations Revisited. ICML.

[5] Grønbeck, A. (2026). Closing the Bokmål–Nynorsk Gap: Consistency Regularization for Norwegian LLMs. Tenki Forskning.

[6] Costa-jussà, M. R., Cross, J., et al. (NLLB Team). (2022). No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv:2207.04672.

[7] Forcada, M. L., Ginestí-Rosell, M., Nordfalk, J., O'Regan, J., Ortiz-Rojas, S., Pérez-Ortiz, J. A., Sánchez-Martínez, F., Ramírez-Sánchez, G., & Tyers, F. M. (2011). Apertium: a free/open-source platform for rule-based machine translation. Machine Translation, 25(2), 127–144.

[8] Yang, A., Yang, B., Zhang, B., et al. (2025). Qwen 2.5 Technical Report. arXiv:2412.15115.

[9] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.

[10] Nostalgebraist. (2020). Interpreting GPT: the logit lens. LessWrong.

[11] Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS.

[12] Lov om språk (språklova). (2022). LOV-2021-05-21-42. Norwegian Parliament.

[13] Bricken, T., et al. (Anthropic). (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. transformer-circuits.pub.

[14] Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. EMNLP demo.

[15] Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL.


The full source code, contrast sets, captured activations, probe outputs, and figures are available at [github.com/triceraz/dia-loc](https://github.com/triceraz/dia-loc). The pipeline reproduces from raw downloads in approximately 30 minutes on an RTX 3060 Ti.
