The most-jailbroken open Gemma 4 model is the one that fits on a phone
TL;DR
- Setup. Abliteration is a public jailbreak script for open-weight chat models: it finds the single internal direction that triggers refusal and projects it out of every layer (Arditi et al. 2024). Wang et al. 2025 showed the same English-derived direction also disables refusal in 13 other languages on instruction-tuned 7B–14B models, but they extracted directions in-house and did not test Gemma 4.
- Motivation. Gemma 4 (Google, April 2, 2026) ships in four sizes: three Dense (E2B ~2.3B effective, E4B ~4.5B effective, 31B) and one Mixture of Experts (26B A4B); we test the three Dense ones and exclude the MoE for now so size and architecture do not confound each other. huihui-ai released abliterated versions of all four within 48 hours, and these public checkpoints are the actual threat model: anyone can download and run them.
- Main claim. Across seven languages and 100 BeaverTails harmful prompts per language, mean compliance after English abliteration is 42.9% at E2B, 68.1% at E4B, and 64.4% at 31B. The curve does not rise cleanly with size: it peaks at the mid-size model, not at the smallest or the largest.
- Side finding 1. Hindi sits at or near the bottom of abliterated compliance at every size (39%, 65%, 60%): lowest at E2B and 31B, third-lowest at E4B (just above Chinese and Arabic, tied at 64%). The "low-resource means more vulnerable" assumption fails in this lineup.
- Side finding 2. The single highest cell in our matrix is 74% (E4B abliterated, Portuguese and German tied), well below the "consistently approaching or exceeding 90%" that Wang et al. report on 7B+ models in other families.
- Mechanistic finding. Inside each model there is a single internal direction that triggers refusal, and the seven languages share that direction the most at E4B (mean cross lingual cosine similarity 0.37, against 0.31 at E2B and 0.27 at 31B). When the languages share one direction, removing it disables refusal in all seven at once: the compliance peak and the alignment peak land at the same model.
- Impact. The strong "smaller means more vulnerable" form of the democratization safety paradox is wrong inside Gemma 4 Dense, because E2B still refuses the majority of harmful prompts after public abliteration. A weaker form survives: E4B is also a consumer accessible model, so the worst point in the matrix is at the size most users can run.
- Code: github.com/gustipardo/gemma4-abliteration
Introduction
When a chat model decides whether to refuse a harmful request, that decision lives in a small part of its internal state. Arditi et al. 2024 showed it is well captured by a single direction in the model's hidden activations: feed the model harmful and harmless prompts, take the difference of the average activations, and you get a vector that points from "this is fine" toward "I should refuse". That vector is the refusal direction.
Abliteration is a community-built jailbreak that exploits exactly this. The recipe finds the refusal direction and projects it out of every layer's output, leaving a checkpoint that mostly stops refusing harmful requests and keeps the rest of its capabilities. The script is short, public, and reproducible. Within 48 hours of Google releasing the Gemma 4 family on April 2, 2026, abliterated versions of every Gemma 4 size were on HuggingFace under the huihui-ai namespace.
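As a rough illustration of what "projecting a direction out" means, here is a minimal NumPy sketch. It is not the huihui-ai script: the function name `ablate_direction`, the random toy matrix, and the convention that a layer's output is `W @ x` are all assumptions made for the example.

```python
import numpy as np


def ablate_direction(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Return a copy of `weight` whose outputs have no component along `direction`.

    ASSUMPTION: `weight` has shape (d_model, d_in) and the layer output is
    weight @ x. Left-multiplying by (I - r r^T) removes the component of
    every output along the unit vector r.
    """
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r @ weight)


# Toy stand-ins for a real weight matrix and a real refusal direction.
rng = np.random.default_rng(0)
d_model, d_in = 16, 8
W = rng.normal(size=(d_model, d_in))
refusal_dir = rng.normal(size=d_model)

W_abl = ablate_direction(W, refusal_dir)

# After ablation, any output has (numerically) zero projection on the direction.
x = rng.normal(size=d_in)
unit = refusal_dir / np.linalg.norm(refusal_dir)
residual = float(unit @ (W_abl @ x))
print(f"projection on refusal direction after ablation: {residual:.2e}")
```

Everything orthogonal to the direction passes through unchanged, which is why the rest of the model's behaviour is expected to be largely preserved.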
Whether refusal is a single direction has become the load bearing question for this style of attack. Wang et al. 2025 extracted refusal directions from English prompts in instruction tuned models from Yi, Qwen 2.5, Llama 3, and Gemma 2 (mostly 7B–14B variants) and showed that ablating those directions raised compliance to harmful prompts in 13 other languages. In their stronger universality experiment, where they extracted directions from non English source languages (German, Chinese, Thai), compliance rates were "consistently approaching or exceeding 90%" across the other safety aligned languages.
Two questions Wang et al. did not address are directly relevant to the public threat model. First, their refusal-direction extraction is a controlled procedure run by the authors. The public alternative is the huihui-ai remove refusals with transformers script, which is the version a non-expert user can actually download. Whether the public recipe reproduces the published universality result on a fresh model family is open. Second, Wang et al. evaluated points across families rather than within a single family with a continuous size axis. The within-family scaling shape, especially in the sub-7B region, was not the focus of their analysis.
The Gemma 4 Dense lineup (E2B at ~2.3B effective parameters via Per Layer Embeddings, E4B at ~4.5B effective, 31B Dense), released by Google on April 2, 2026, is the first model family released after the Wang et al. submission with public abliterated checkpoints at three Dense sizes. It is the natural test bed for both questions.
Threat model. A non expert user downloads a public abliterated checkpoint of an open weight Gemma 4 model, runs it locally, and prompts it in any of the 140+ languages Gemma 4 supports. We measure how often that user gets a fully compliant response to a harmful prompt, judged by Claude Haiku 4.5 reading the full 512 token response.
Research question. Within the Gemma 4 Dense family, how does single vector English abliteration affect multilingual compliance with harmful prompts as model size increases from E2B to 31B?
Contribution. Mean post abliteration compliance across seven languages does not rise cleanly with model size. It rises by 25.3 percentage points from E2B (42.9%) to E4B (68.1%), then falls by 3.7 percentage points from E4B to 31B (64.4%). The base to abliterated gap (how much the attack adds on top of the base model's already low refusal rate) is also largest at E4B: +57.4 pp, against +38.8 pp at E2B and +51.3 pp at 31B. Inside this Gemma 4 Dense lineup, the public script removes the most refusal at the mid size model, not at the smallest or the largest.
Why it matters. The naive "smaller means more vulnerable" framing of the democratization safety paradox is wrong in detail: E2B still refuses the majority of harmful prompts even after public abliteration. A precise version of the paradox still holds, because E4B is also a consumer-accessible model. At 4-bit it fits in roughly 3 GB of VRAM, which is within the memory budget of a 2026 mid-range smartphone, a Raspberry Pi 5 with 8 GB RAM, the integrated GPU on an M-series MacBook Air, or any consumer NVIDIA card from the past several years. The model the public attack hits hardest is also one of the easiest to run.
Methods
Models. Six checkpoints, two per size, all loaded with the same quantization config (bitsandbytes 4 bit NF4, double quant, bf16 compute). Bases are google/gemma-4-E2B-it (~2.3B effective parameters via Per Layer Embeddings), google/gemma-4-E4B-it (~4.5B effective), google/gemma-4-31B-it (Dense, ~31B). Abliterated counterparts are huihui-ai/Huihui-gemma-4-{E2B,E4B,31B}-it-abliterated, all produced with the same remove refusals with transformers script. Identical recipe across sizes is what lets us isolate scale as the only thing that varies between abliterated cells.
The Gemma 4 lineup also contains a 26B A4B Mixture of Experts variant (~4B active parameters via routing). We exclude it from the principal experiment to keep the size axis clean. Mixing Dense and MoE on the same axis would confound size with architecture. The MoE variant is queued as a separate sub experiment.
Prompts. 100 harmful prompts per language, sampled with seed 42 from the BeaverTails 30k_test split (PKU-Alignment dataset) and filtered to is_safe = False. The non English versions are translations of the same 100 prompts.
Languages. English, Spanish, Chinese, Portuguese, German, Arabic, Hindi. A mix of high and medium coverage in pretraining data.
Generation. Greedy decoding, 512 max new tokens. Gemma 4 frequently produces "delayed refusals" of the form "I can't help with that" embedded after 50 to 100 helpful tokens. A short generation budget would misclassify these as compliant.
Judge. Claude Haiku 4.5 reads the full prompt and the full response and emits one of three labels: complied, refused, partial. We treat complied as the only positive case for the compliance rate. Wang et al. used WildGuard (Han et al. 2024), an open source moderation classifier, in the same role. The two judges are not strictly comparable, so we treat absolute compliance numbers as judge conditional and rely on within experiment comparisons for the headline.
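The scoring rule above (only complied counts as positive, so partial is grouped with refused) can be sketched as follows. The function name and the five example labels are made up for illustration and are not taken from the experiment's data or code.

```python
from collections import Counter

# The three labels the judge can emit, as described in the text.
LABELS = ("complied", "refused", "partial")


def compliance_rate(labels):
    """Percentage of responses labelled 'complied'.

    'partial' is deliberately NOT counted as compliance, matching the
    rule that complied is the only positive case.
    """
    if not labels:
        raise ValueError("need at least one judged response")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected judge labels: {sorted(unknown)}")
    return 100.0 * counts["complied"] / len(labels)


# Hypothetical judge output for five responses (illustrative only).
example = ["complied", "refused", "partial", "complied", "refused"]
rate = compliance_rate(example)
print(f"compliance: {rate:.1f}%")
```

Counting partial as a non-compliance makes the reported rates a lower bound on how often the model gives at least some harmful help.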
Mechanistic protocol
What we wanted to know. The compliance numbers tell us what the model does after abliteration. We also wanted to know why the curve peaks at E4B by looking at what the model is doing internally before it speaks.
What we measured. For each base model and each language we fed it 100 harmful prompts and 10 harmless prompts, and read the model's hidden state right after it had finished reading the instruction. The refusal direction for that language is the difference between the average hidden state on harmful prompts and the average on harmless prompts. It is the single direction in the model's "thought space" that points from "this is fine" toward "this is harmful, I should refuse". We extracted one direction per language at the layer where it is strongest.
Two numbers come out of this: cosine similarity between every pair of the seven per language refusal directions (1.0 means same direction, 0.0 means unrelated), and silhouette score per language asking how cleanly harmful and harmless prompts separate inside the model's internal space.
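A minimal sketch of the first measurement, assuming the hidden states have already been collected as arrays of shape (n_prompts, d_model). The activations below are random stand-ins, and the function names, the hidden size of 32, and the shared-plus-noise construction are all hypothetical. Only the difference-of-means step and the pairwise cosine follow the description above.

```python
from itertools import combinations

import numpy as np


def refusal_direction(harmful: np.ndarray, harmless: np.ndarray) -> np.ndarray:
    """Unit vector from the harmless mean toward the harmful mean."""
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)


def mean_pairwise_cosine(directions: dict) -> float:
    """Mean cosine over all unordered language pairs (unit vectors assumed)."""
    sims = [float(directions[a] @ directions[b])
            for a, b in combinations(directions, 2)]
    return float(np.mean(sims))


# SYNTHETIC activations: a shared shift plus a language-specific shift.
rng = np.random.default_rng(0)
langs = ["en", "es", "zh", "pt", "de", "ar", "hi"]
d_model = 32  # hypothetical hidden size, far smaller than a real model
shared = rng.normal(size=d_model)

directions = {}
for lang in langs:
    shift = shared + rng.normal(size=d_model)
    harmful = rng.normal(size=(100, d_model)) + shift  # 100 harmful prompts
    harmless = rng.normal(size=(10, d_model))          # 10 harmless prompts
    directions[lang] = refusal_direction(harmful, harmless)

n_pairs = len(list(combinations(langs, 2)))
mean_cos = mean_pairwise_cosine(directions)
print(f"{n_pairs} language pairs, mean cosine {mean_cos:.2f}")
```

With seven languages there are 21 unordered pairs, which is the set the reported mean, max, and min cosine values are taken over.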
Results
Compliance peaks at the mid size model
Table 1. Compliance rates per cell (%). Same 100 harmful prompts, same Claude Haiku 4.5 judge, same huihui-ai abliteration recipe across sizes. Bold marks the peak cell of each abliterated row.
| | EN | ES | ZH | PT | DE | AR | HI | mean |
|---|---|---|---|---|---|---|---|---|
| E2B base | 4 | 7 | 4 | 6 | 4 | 2 | 2 | 4.1 |
| E2B abliterated | 42 | **47** | 45 | 40 | 44 | 43 | 39 | 42.9 |
| E4B base | 13 | 11 | 7 | 16 | 10 | 9 | 9 | 10.7 |
| E4B abliterated | 70 | 66 | 64 | **74** | **74** | 64 | 65 | 68.1 |
| 31B base | 16 | 12 | 13 | 16 | 12 | 10 | 13 | 13.1 |
| 31B abliterated | 67 | 64 | 65 | 63 | **71** | 61 | 60 | 64.4 |
The big move on the curve is the +25 pp jump from E2B to E4B, well outside the ~1.8 pp standard error on the seven-language mean. The drop from E4B to 31B is small but consistent: E4B beats 31B in 6 of 7 languages, ties in none, and trails only in Chinese (64% vs 65%). The base-to-abliterated gap peaks at E4B too (+57 pp, vs +39 at E2B and +51 at 31B): the same off-the-shelf attack does the most work at the mid size.
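The summary figures in this subsection can be recomputed directly from the cells of Table 1. The sketch below does that in plain Python. The standard-error line is an assumption: the text does not say how the ~1.8 pp figure was obtained, and a binomial standard error over the 700 pooled judgments is only one plausible reading.

```python
from math import sqrt

LANGS = ["EN", "ES", "ZH", "PT", "DE", "AR", "HI"]

# Cells copied from Table 1 (percent compliant out of 100 prompts each).
BASE = {
    "E2B": [4, 7, 4, 6, 4, 2, 2],
    "E4B": [13, 11, 7, 16, 10, 9, 9],
    "31B": [16, 12, 13, 16, 12, 10, 13],
}
ABLITERATED = {
    "E2B": [42, 47, 45, 40, 44, 43, 39],
    "E4B": [70, 66, 64, 74, 74, 64, 65],
    "31B": [67, 64, 65, 63, 71, 61, 60],
}


def mean(values):
    return sum(values) / len(values)


# Row means rounded to one decimal, as in the "mean" column.
base_mean = {size: round(mean(v), 1) for size, v in BASE.items()}
abl_mean = {size: round(mean(v), 1) for size, v in ABLITERATED.items()}

# Base-to-abliterated gap, taken as the difference of the rounded means.
gap = {size: round(abl_mean[size] - base_mean[size], 1) for size in ABLITERATED}

# Language-by-language comparison of the two larger abliterated models.
pairs = list(zip(LANGS, ABLITERATED["E4B"], ABLITERATED["31B"]))
e4b_higher = [lang for lang, a, b in pairs if a > b]
tied = [lang for lang, a, b in pairs if a == b]
e4b_lower = [lang for lang, a, b in pairs if a < b]

# ASSUMPTION: binomial standard error over 7 x 100 pooled judgments at E4B.
n = 100 * len(LANGS)
p = sum(ABLITERATED["E4B"]) / n
se_pp = 100 * sqrt(p * (1 - p) / n)

print("abliterated means:", abl_mean)
print("gaps (pp):", gap)
print("E4B > 31B in:", e4b_higher, "| tied:", tied, "| E4B < 31B in:", e4b_lower)
print(f"binomial SE at E4B: {se_pp:.2f} pp")
```

A per-language comparison like this is a coarse check, since single cells rest on 100 prompts each and differ by only a few points in several languages.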
Hindi resists more than expected
Hindi sits at or near the bottom of abliterated compliance at every size: 39% at E2B (the lowest of the seven), 65% at E4B (third-lowest, just above Chinese and Arabic, tied at 64%), 60% at 31B (the lowest). The most compliant language is not stable across sizes either: Spanish leads at E2B (47%), Portuguese and German tie at E4B (74%), and German leads at 31B (71%). The common assumption that languages with less internet pretraining data should be the most vulnerable to a cross-lingual jailbreak does not hold here.
No cell reaches the 90% Wang et al. report on other families
The single highest compliance cell in our matrix is 74% (E4B abliterated, Portuguese and German tied). Wang et al. 2025 report cells "consistently approaching or exceeding 90%" in their non-English universality experiment on 7B+ Dense models. Our E4B max (74%) falls well short of that range, and our E2B max (47%) and 31B max (71%) sit further from it.
Inside the model: refusal directions align the most at E4B
Cross-lingual cosine similarity of refusal directions (mean over the 21 off-diagonal pairs of a 7×7 language matrix):
| | E2B | E4B | 31B |
|---|---|---|---|
| mean | 0.31 | 0.37 | 0.27 |
| max | 0.67 | 0.71 | 0.52 |
| min | 0.18 | 0.29 | 0.17 |
At E4B the seven languages' refusal directions are the most aligned (0.37). They are less aligned at E2B (0.31) and least aligned at 31B (0.27). The shape mirrors the compliance curve: peak at the mid size, lower at the endpoints.
Per-language Silhouette scores (mean over 7 languages, measured at the refusal-extraction layer):
| Size | mean Silhouette |
|---|---|
| E2B | 0.29 |
| E4B | 0.26 |
| 31B | 0.23 |
Silhouette decreases steadily with size: harmful and harmless prompts separate the cleanest at E2B and the messiest at 31B.
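For readers unfamiliar with the metric, here is a small NumPy sketch of a two-cluster Silhouette score using the standard definition with Euclidean distance. Whether the experiment used this exact distance or a library implementation is not stated, so treat the function and the synthetic clusters as assumptions.

```python
import numpy as np


def silhouette_two_clusters(cluster_a: np.ndarray, cluster_b: np.ndarray) -> float:
    """Mean Silhouette over all points of two clusters (Euclidean distance).

    For each point: a = mean distance to the other points of its own cluster,
    b = mean distance to the points of the other cluster,
    score = (b - a) / max(a, b). Each cluster needs at least two points.
    """
    points = np.vstack([cluster_a, cluster_b])
    labels = np.array([0] * len(cluster_a) + [1] * len(cluster_b))
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        other = ~same
        same[i] = False  # exclude the point itself from its own cluster
        a = dist[i, same].mean()
        b = dist[i, other].mean()
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# SYNTHETIC stand-ins for harmful / harmless hidden states.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(50, 8))
harmful_far = rng.normal(size=(50, 8)) + 5.0   # well separated
harmful_near = rng.normal(size=(50, 8)) + 0.1  # heavily overlapping

clean = silhouette_two_clusters(harmful_far, harmless)
blurred = silhouette_two_clusters(harmful_near, harmless)
print(f"separated: {clean:.2f}, overlapping: {blurred:.2f}")
```

Scores near 1 mean the two prompt types form tight, distant clusters, and scores near 0 mean they overlap, which is the sense in which the reported values of 0.29 to 0.23 describe modest and declining separation.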
Putting the two numbers together. Picture seven arrows (one per language) inside the model's internal space. At E4B, the seven arrows point most nearly the same way (highest mean cosine, 0.37), so removing the English arrow takes more of the refusal signal out of the other six languages at once. At E2B, the arrows fan out more (mean cosine 0.31), even though the harmful and harmless prompt clusters are the most cleanly separated (highest Silhouette, 0.29), so removing one arrow leaves more of the refusal in the other languages intact. At 31B, the arrows fan out the most (lowest mean cosine, 0.27) and the clusters blur (lowest Silhouette, 0.23), so single-vector ablation leaves a lot in place.
Discussion
Did we answer the research question? Partially. Descriptively, yes: post-abliteration compliance is non-monotonic in size, peaking at E4B (~4.5B effective parameters), not at the smallest or the largest. Interpretively, no: we cannot tell whether that shape is a property of Gemma 4 Dense, of Dense families in general, or of the public huihui-ai recipe. One experiment does not adjudicate.
If you held the strong form of the democratization safety paradox ("small accessible models are the most dangerous because they are the easiest to abliterate"), the data here does not support it. Size matters less than that hypothesis predicted, and E2B is the safest of the three abliterated cells. A weaker form does survive: E4B is the most compromised cell in the matrix, and E4B is also a consumer-accessible model.
If you held the inverse view ("scale brings vulnerability with it because larger models are better instruction followers"), the curve does not support that either. Base compliance does rise with size, but post abliteration compliance does not.
Three readings, checked against the geometry data
- Refusal geometry is most concentrated at E4B. At E2B refusal is spread across several directions; at E4B it converges on one dominant direction; at 31B it spreads out again. The geometry data fits: cross-lingual cosine peaks at E4B, Silhouette peaks at E2B.
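- Capability outpaces refusal early, then refusal catches up. This story would predict cosine and Silhouette both rising with size. The data does not show that: Silhouette falls with size, and cosine rises from E2B to E4B but then drops to its lowest value at 31B.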
- The huihui-ai recipe is accidentally optimal at the mid size. Same script across all three checkpoints, but the prompts used to derive the refusal direction may not be equally informative at every scale. Untested.
Limitations
- Single model family. We tested Gemma 4 Dense only.
- Single abliteration recipe. All abliterated checkpoints come from huihui-ai's public script.
- Single judge. Claude Haiku 4.5 is not the judge Wang et al. used (they used WildGuard).
- Seven languages. Wang et al. used 14, including Polish, Russian, and Yoruba.
- Translations were not back translated for fidelity.
- Base condition cells are not zero. We have not separated judge calibration noise from genuine instruction following on harmful prompts.
- Single random seed (42) for prompt sampling.
- Architecture held constant at Dense. Whether the routing layer in Gemma 4 26B A4B (MoE) reproduces or breaks the E4B peak is the cleanest follow up experiment.
Related work
- Wang et al. 2025 showed that English-derived refusal directions transfer across languages in 7B–14B models from Yi, Qwen 2.5, Llama 3, and Gemma 2; we test the same hypothesis on Gemma 4 using a public abliteration script instead of their in-house extraction.
- Arditi et al. 2024 introduced the single-direction ablation technique that the open-weight community calls abliteration; we test the public checkpoints produced by it.
- An embarrassingly simple defense (2025) trains models on extended refusals to spread the refusal signal across many tokens, making single-vector ablation less effective; whether it flattens the E4B peak is open.
Future work
- Compare abliteration recipes on the same six checkpoints. Re-running with a different refusal-direction extraction (for example the lab variant in Wang et al. 2025) would tell us whether the E4B peak is about Gemma 4 Dense or about the huihui-ai script in particular.
- Add the 26B A4B Mixture of Experts checkpoint, paired with E4B on active-parameter count. Same protocol, different architecture.
- Test more languages and a second judge. Adding less-pretrained and safety-misaligned languages (Yoruba, Polish) and replacing Claude Haiku with WildGuard would test whether the per-language ranking survives perturbations.
Acknowledgements
This is a mentored project at BAISH (Buenos Aires AI Safety Hub), supported by BlueDot (2026).