
Trishool's HaloGuard 1.0 on Bittensor SN23 achieved state-of-the-art multilingual prompt safety, outperforming larger open guard models.
Author: Akshat Thakur
2nd July 2026 – Astroware Labs has released HaloGuard 1.0, an open-weight safety classifier for AI prompts. The team says its smallest model beats guard systems many times larger.
High Signal Summary For A Quick Glance
Sheik
@ttsheik
@trishoolai Well-done trishool team, miners, validators and supporters! Sn23 is undeniably a top-tier subnet! https://t.co/fs9xU1bBr5
We’re excited to announce that Trishool’s HaloGuard 1.0 𝐡𝐚𝐬 𝐚𝐜𝐡𝐢𝐞𝐯𝐞𝐝 𝐒𝐎𝐓𝐀 prompt-safety performance among open-weight guard models. Today, we present HaloGuard 1.0, a constitutional input classifier for multilingual AI safety. It is built as a first-layer input https://t.co/YJswErsZSs
04:20 PM·Jul 2, 2026
D.C.J.
@ReuverJero18591
@trishoolai That's amazing!
We’re excited to announce that Trishool’s HaloGuard 1.0 𝐡𝐚𝐬 𝐚𝐜𝐡𝐢𝐞𝐯𝐞𝐝 𝐒𝐎𝐓𝐀 prompt-safety performance among open-weight guard models. Today, we present HaloGuard 1.0, a constitutional input classifier for multilingual AI safety. It is built as a first-layer input https://t.co/YJswErsZSs
03:52 PM·Jul 2, 2026
General Tensor
@generaltensor
@trishoolai Incredibly proud of the Trishool team!
We’re excited to announce that Trishool’s HaloGuard 1.0 𝐡𝐚𝐬 𝐚𝐜𝐡𝐢𝐞𝐯𝐞𝐝 𝐒𝐎𝐓𝐀 prompt-safety performance among open-weight guard models. Today, we present HaloGuard 1.0, a constitutional input classifier for multilingual AI safety. It is built as a first-layer input https://t.co/YJswErsZSs
03:14 PM·Jul 2, 2026
Steady attention without excessive speculation.
HaloGuard 1.0 screens incoming prompts for jailbreaks, prompt injections, and policy violations. It acts as a first-layer filter, so risky requests get flagged before an LLM or agent ever runs. Astroware announced the release on July 2 through its Trishool account on X.
The headline claim is size. The 0.8B version scores 90.9 average F1 across seven prompt-safety benchmarks, according to the team. The 4B version reaches 92.1.
The seven tests are OAI Moderation, Aegis, Aegis 2.0, ToxiC, SimpST, HarmBench, and WildGuardTest. On Aegis 2.0, the 0.8B model scores 87.9 F1. The rest sit in a similar range.
Those numbers matter because of what they beat. Meta’s LlamaGuard4, at 12B parameters, sits at 75.9 on the same tests. Google’s ShieldGemma, at 27B, lands at 70.0.
In other words, a model roughly 15 to 33 times smaller comes out ahead. The team also puts HaloGuard 1.0 above NemoGuard 8B (82.9), WildGuard 7B (85.8), Qwen3Guard-Gen 8B (86.2), and PolyGuard-Qwen 7B (87.0).
F1 score is the balance between catching real harms and avoiding false alarms. So a higher average suggests fewer missed attacks without flagging safe prompts as dangerous. For now, all of these results are self-reported.
It is a constitutional classifier built on Qwen3.5 base models. Instead of a simple label head, it reads the prompt against a written safety “constitution” and then generates a verdict.
Because it generates text, the model returns a safe or unsafe call plus a category, rather than a single score. As a result, the output reads more like a reasoned judgment than a raw flag.
This first-layer approach is meant to be fast and cheap. So teams can screen traffic inline before spending money on a large downstream model or an agent run.
HaloGuard comes out of Trishool, also known as Subnet 23 on Bittensor. Trishool is a decentralized red-teaming network, and its whole point is adversarial pressure.
Here is the loop. Miners compete to break the guard with jailbreaks and injections. Then validators score those attacks, and the successful ones feed back into training.
These cycles usually run about seven days. So the guard keeps meeting fresh attacks. The team frames it as a “living” model, not a fixed release. According to Astroware, Subnet 23 earns a small share of Bittensor emissions, roughly 0.19% in recent snapshots.
The economic design ties into Bittensor’s core thesis. Attackers and validators earn rewards, so the incentive points toward finding weaknesses instead of hiding them.
HaloGuard did not appear overnight. An earlier build, Halo Guard Alpha, reached about 87% F1 in the spring. The team has shipped steadily since then.
Real deployments came next. In May 2026, Halo integrated with Chutes on Subnet 64 for live serving. Then in June, the project joined the Google for Startups Web3 Program.
Astroware took over Subnet 23 roughly seven months ago. Since then, the work has moved through two phases. Phase 1 set the core guard and policy framework. Phase 2 now handles low-latency input guarding.
The team says the granularity is what sets HaloGuard apart. It starts from 46 constitutional policies. Then those expand into 490 categories and 2,940 fine-grained subcategories.
That is far more detailed than the coarse “harmful or safe” labels used by many older guards. Because of that detail, the model can learn narrower intent boundaries.
The corpus itself holds 1,259,451 synthetic records, or about 1.26 million. Crucially, it uses 1:1 paired counterfactuals. Each pair keeps the same topic and vocabulary but flips the intent. So the model learns meaning, not surface keywords.
Coverage also spans 46 languages in a balanced mix. As a result, the team argues that a language switch should not slip past the filter.
Every benchmark here comes from Astroware and the model card, not an outside auditor. The full arXiv technical report is still pending. So independent researchers cannot yet reproduce the scores.
That gap matters in a field where eval sets can be cherry-picked or contaminated. Some machine learning researchers may also question the synthetic data. Others may ask how the multilingual coverage holds up on real-world traffic.
The models are open weight under an Apache 2.0 license. So anyone can download and test them today on Hugging Face. That openness makes independent checks possible, even if none have landed yet. None of this is financial advice.
For now, the model covers input prompts only. Response-side checks and agent tool-use scenarios sit on the roadmap, according to the team.
The real test arrives with the arXiv paper and the first outside reproductions. Until then, HaloGuard 1.0 stands as a bold, downloadable claim. A small decentralized guard says it can outscore far larger labs.
Our Crypto Talk is committed to unbiased, transparent, and true reporting to the best of our knowledge. This news article aims to provide accurate information in a timely manner. However, we advise the readers to verify facts independently and consult a professional before making any decisions based on the content since our sources could be wrong too. Check our Terms and conditions for more info.
HaloGuard 1.0 Beats Larger Guard Models on Prompt Safety
Robinhood Chain Virtuals Integration Goes Live Day One
Lighter LIT Burn to Cut Supply From Revenue Buybacks
BNB Chain Launches AI Agent Studio on AWS Bedrock AgentCore
HaloGuard 1.0 Beats Larger Guard Models on Prompt Safety
Robinhood Chain Virtuals Integration Goes Live Day One
Lighter LIT Burn to Cut Supply From Revenue Buybacks
BNB Chain Launches AI Agent Studio on AWS Bedrock AgentCore