The world of AI is changing fast. We’re moving from frozen, pre-trained models to systems that can grow and adapt on their own. One exciting step in this direction is the SEAL framework.
What is SEAL, you ask? SEAL stands for Self-Adapting Language Model, and it’s doing something remarkable.
It reached a 72.5% success rate on a set of tough reasoning tasks from the ARC-AGI benchmark. That’s nearly three times better than past best results—and a giant leap beyond the 0% baseline that standard language models typically score.
What makes SEAL different is how it updates itself.
Instead of relying on human-written training data or manually chosen settings, SEAL writes its own training instructions. It creates synthetic examples, picks its own training strategies, and decides how to improve. All of this happens inside a feedback loop powered by reinforcement learning—a method that rewards the model when it gets better at solving tasks.
Here’s why this shift is a big deal. Traditional large language models stay the same after training. They don’t evolve unless researchers intervene. But SEAL changes that. It opens the door to models that improve themselves over time without needing constant human help.
If systems like SEAL prove successful across more types of tests, they could save companies time and money. Retraining large models today takes a lot of computing power and expert oversight. A self-adapting model could learn and evolve on the fly, which could make AI development much faster and more flexible.
Inside SEAL: A Blueprint for Self-Evolution

SEAL changes the game by letting AI models edit and train themselves. It does this through a smart, multi-step process that helps the model not just adapt, but learn how to adapt better over time.
Self-Edits as the Control Layer
When SEAL gets a new task, it doesn’t just try to solve it right away. Instead, it writes something called a “self-edit.” This is more than just an answer—it’s an instruction about how the model should change to perform better.
For example, a self-edit might tell the model to adjust how it reads data, tweak how it learns (like changing the learning rate), or even use extra tools to create more training examples. These edits help the model decide what to focus on and how to update itself. This gives SEAL a kind of internal control system, where it directs its own learning instead of waiting for a human to tell it what to do.
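To make this concrete, here is a hypothetical self-edit written out as a small Python dictionary. SEAL itself expresses self-edits as generated text, so the field names below are invented for illustration; they simply show the kinds of decisions described above: synthetic examples to train on, hyperparameters to use, and optional augmentation tools.

```python
# Hypothetical illustration of what a self-edit might specify.
# (In SEAL the self-edit is generated text; this dictionary just makes
# the typical contents explicit. Field names are illustrative only.)
self_edit = {
    # Synthetic training examples the model wrote for itself
    "examples": [
        {"input": "Context: The Eiffel Tower was completed in 1889.",
         "target": "Q: When was the Eiffel Tower completed? A: In 1889."},
    ],
    # How the model wants the update applied
    "learning_rate": 1e-4,
    "epochs": 3,
    "use_lora": True,                       # update small adapters, not all weights
    # Optional data-augmentation tools, e.g. for ARC-style grid tasks
    "augmentations": ["rotate_90", "reflect_horizontal"],
}
```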
Supervised Fine-Tuning as Self-Execution
Once SEAL creates a self-edit, it uses that as a guide in a phase called supervised fine-tuning. Here, the model changes its internal weights—the numbers that control how it thinks. These changes are based on the self-edit’s instructions.
This makes the learning “stick.” Instead of just using the self-edit once and moving on, the model actually rewires part of itself. If the self-edit was helpful, the model becomes permanently better at that task.
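A minimal sketch of that fine-tuning step is shown below. It assumes a Hugging Face style causal language model (one that returns a loss when given labels) and the hypothetical self-edit dictionary from the previous sketch; apply_self_edit is an illustrative helper, not SEAL’s actual code.

```python
import torch

def apply_self_edit(model, tokenizer, self_edit):
    """Minimal sketch: fine-tune the model on its own self-edit.
    Assumes a Hugging Face-style causal LM that returns a loss when
    given `labels`; an illustration, not SEAL's implementation."""
    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=self_edit["learning_rate"])
    model.train()
    for _ in range(self_edit["epochs"]):
        for example in self_edit["examples"]:
            batch = tokenizer(example["input"] + " " + example["target"],
                              return_tensors="pt")
            # The model's self-generated text becomes its own training target,
            # so the weights really change: the lesson "sticks".
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```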
But how does SEAL know which self-edits are good? That’s where reinforcement learning (RL) comes in. After making a self-edit and fine-tuning itself, SEAL checks how well it performed on the task. If it did better, it gets a reward. If not, it learns to avoid that kind of edit in the future.
This feedback loop means SEAL isn’t just learning how to do tasks—it’s learning how to improve its own learning process. Over time, it builds smarter self-editing habits. This is called meta-learning, or “learning to learn.”
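Putting the pieces together, the outer loop can be sketched roughly as follows. The three helper functions passed in (propose_edit, adapt_and_score, reinforce_on) are placeholders for “write a self-edit,” “fine-tune a copy and evaluate it,” and “train the model to produce more edits like the rewarded ones.” The paper’s reinforcement-learning setup is more careful than this, but the core idea of rewarding only the edits that helped is the same.

```python
def rl_outer_step(model, tasks, propose_edit, adapt_and_score, reinforce_on,
                  samples_per_task=4):
    """One outer reinforcement-learning step, sketched with caller-supplied
    helpers (all three function arguments are illustrative placeholders)."""
    rewarded_edits = []
    for task in tasks:
        baseline = adapt_and_score(model, task, edit=None)   # score with no adaptation
        for _ in range(samples_per_task):
            edit = propose_edit(model, task)                 # model writes a self-edit
            score = adapt_and_score(model, task, edit=edit)  # fine-tune a copy, evaluate
            if score > baseline:                             # reward = "did it help?"
                rewarded_edits.append((task, edit))
    # Reinforce only the edits that earned a reward, so the model gradually
    # learns to write better self-edits: learning how to learn.
    return reinforce_on(model, rewarded_edits)
```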
Comparative Insight
SEAL isn’t just another plug-in method like LoRA, which attaches small, trainable adapter matrices to a frozen model to help it adapt. Instead, SEAL builds adaptation into the model itself. It doesn’t just receive updates—it decides how those updates should work.
This is a big shift. With SEAL, the model becomes more like a living system, shaping its own training path.
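For contrast, here is a generic LoRA-style adapter in a few lines of NumPy (a standard illustration of the technique in general, not anything from the SEAL paper): the base weight stays frozen, a small low-rank pair of matrices is added on top, and a human decides when and how those adapters get trained.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Generic LoRA-style adapter, shown only for contrast with SEAL:
    the frozen weight W is untouched, and a small low-rank update B @ A
    is added on top. The base model never chooses its own update."""
    return x @ (W + alpha * (B @ A)).T

# Toy shapes: a frozen 8x8 weight plus a rank-2 adapter (far fewer parameters).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # frozen base weight
A = rng.normal(size=(2, 8)) * 0.01   # trainable, rank 2
B = np.zeros((8, 2))                 # trainable, starts at zero (no initial change)
x = rng.normal(size=(1, 8))
y = lora_forward(x, W, A, B)
```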
Cautionary Note
Still, this power comes with risks. If the model creates bad self-edits—maybe because it misunderstands a task—it could make itself worse, not better.
Some early reviews point out that SEAL can “forget” things it already learned or get stuck reinforcing wrong ideas. This is similar to how people sometimes fall into bad habits or false beliefs if they don’t have outside feedback.
That’s why even self-adapting systems need oversight. Without checks in place, a model like SEAL could drift off course and become less useful or even unsafe.
ARC-AGI: The Benchmark That Separates Memorization from Reasoning
The ARC-AGI benchmark was created to test something simple yet powerful: can AI think more like a human? Unlike most AI tests, ARC-AGI isn’t about remembering facts or repeating patterns. It’s about learning new skills quickly from just a few examples—something humans do all the time but AI still struggles with.
Human-Like Reasoning, Grid by Grid
ARC-AGI tasks are puzzles made from grids filled with colored blocks. The challenge is to look at a couple of example grids and figure out the hidden rule. Then, the model must apply that rule to a new grid and produce the correct result.
Each puzzle is different. There are no repeated formats, so memorizing answers won’t help. The tasks might involve rotating shapes, counting objects, or spotting a pattern in color and position. Humans solve these using intuition and reasoning. AI, on the other hand, often finds them incredibly hard.
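For a feel of the format, the public ARC dataset stores each task as a handful of demonstration pairs plus a test input, with every grid written as rows of integers that stand for colors. The toy task below follows that structure, with a deliberately simple hidden rule (reverse the row order); real ARC rules are far less obvious.

```python
# A toy ARC-style task in the structure used by the public ARC dataset:
# a few demonstration pairs, then a test input whose output must be inferred.
# Each grid is a list of rows; the integers 0-9 stand for colors.
toy_task = {
    "train": [
        {"input": [[0, 1], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[2, 0], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]]}   # expected answer: [[3, 0], [0, 0]]
    ],
}

def apply_rule(grid):
    """The hidden rule for this toy task: reverse the row order."""
    return grid[::-1]

predicted = apply_rule(toy_task["test"][0]["input"])
print(predicted)  # [[3, 0], [0, 0]]
```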
Designed to Be Hard for AI, Easy for Humans
ARC-AGI was built to highlight the gap between human and machine intelligence. While humans usually solve these puzzles quickly, even top language models can fail badly. That’s because many models are “frozen” after training. They can’t adjust how they think when given new types of problems.
SEAL’s self-adapting method changes that. By creating and testing its own updates, SEAL can actually adjust how it solves new problems. This is why SEAL achieved a 72.5% success rate on ARC-style puzzles, a huge jump from the 0% that the same base model scores with traditional few-shot prompting.
Still, it’s important to understand that SEAL’s score was achieved on a custom subset of ARC-style tasks—not the full public or private ARC-AGI benchmark. This means it’s a strong result, but not yet the official top score on the whole ARC-AGI leaderboard.
Leaderboard Landscape
Let’s put SEAL’s 72.5% in context. Other top methods have scored similarly or higher—but with different setups. For example:
- One research group scored 71.6% on the full public ARC-AGI set using data augmentation and search techniques.
- OpenAI’s o3-preview model hit 88% on a semi-private ARC test, but it was trained on 75% of the benchmark’s public training tasks, which raises questions about data overlap.
- The winning team of the 2024 ARC Prize on Kaggle reached 53.5% on the private test set with advanced methods.
- Standard language models with prompting or fine-tuning usually score under 10%.
Compared to these, SEAL stands out not just for its score, but for how it got there—by learning to adapt on its own.
Knowledge Incorporation Tests
SEAL’s strength isn’t limited to visual puzzles. In another test, it learned new facts from text passages and improved question-answering accuracy from 33.5% to 47%, even beating the results obtained when the synthetic training data was generated by GPT-4.1.
That shows SEAL can adapt across very different types of tasks—not just visual reasoning, but also understanding and remembering new knowledge.
Where We’re Headed: The Strategic Horizon for Self-Adapting AI

AI is entering a new phase—one where models can change and improve themselves over time. SEAL is a major part of this trend, but it’s not alone. Other approaches, like Sakana AI’s Transformer-Squared, show how different paths can still lead to the same big goal: building AI that can learn how to learn.
Transformer-Squared, or T², is another system that lets AI models adapt on their own. But it takes a different route than SEAL. Instead of writing its own training data, T² adjusts a small, carefully chosen set of numbers inside its existing weight matrices: the singular values that control how strongly each internal component contributes.
Here’s how it works:
- T² creates “expert” settings for different kinds of tasks, like math or programming.
- When the model gets a new problem, it first figures out what kind of problem it is.
- Then it chooses and mixes the expert settings that fit that task best.
T² is smart about saving resources. It only changes a small number of settings and doesn’t need to retrain the whole model. This makes it faster and cheaper to use. Like SEAL, it also uses reinforcement learning to improve its choices over time.
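A rough sketch of the mechanism, under my reading of the approach: decompose a weight matrix with an SVD, then rescale only its singular values using a task-specific blend of learned expert vectors. The expert vectors and mixing weights below are toy stand-ins.

```python
import numpy as np

def adapt_weight(W, expert_vectors, mixing_weights):
    """Sketch of Transformer-Squared-style adaptation: rescale the singular
    values of a weight matrix with a task-specific mix of learned expert
    vectors. Expert vectors and mixing weights here are illustrative."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)

    # Blend the expert vectors chosen for this task (e.g. "math" + "coding").
    z = sum(w * v for w, v in zip(mixing_weights, expert_vectors))

    # Only the singular values change; U and Vt stay fixed, so far fewer
    # numbers are updated than in full fine-tuning.
    return U @ np.diag(S * z) @ Vt

# Toy usage: a 4x4 weight matrix and two hypothetical expert vectors.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
experts = [np.ones(4), np.array([1.2, 1.0, 0.8, 1.0])]
W_adapted = adapt_weight(W, experts, mixing_weights=[0.5, 0.5])
```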
Both SEAL and T² are signs that AI research is moving toward systems that don’t need to be constantly retrained by humans. They learn how to adapt by themselves, which is a major step toward building truly intelligent machines.
New Governance Challenges
But with great power comes great responsibility. When an AI model can rewrite parts of itself, it also becomes harder to manage and control. Traditional safety tools like reinforcement learning from human feedback (RLHF) aren’t enough anymore. These tools usually just teach a model how to behave—not how to safely change itself.
To deal with this, researchers are starting to look at new ways to keep AI safe, even as it learns and evolves. Some proposed solutions include:
- Locking in a core part of the AI that can’t be changed.
- Requiring every big update to be reviewed and approved by a human.
- Using two separate systems to double-check important decisions before they go live.
These ideas aim to make sure self-changing models don’t go off the rails—and that someone is always watching when major changes happen.
Proposed Safeguards
For self-adapting AI to be safe and trustworthy, we need rules built into the system itself. Some suggestions include:
- An “immutable kernel”—a locked-down part of the system that watches everything the AI does.
- “Transformation contracts”—a human-readable log of every major update, signed off by real people.
- Two-track validation—every big change must be approved by two different reviewers, like two AIs or an AI and a human.
- Transparent auditing—making all changes visible and tying researcher incentives (like funding or publication) to safety checks.
These rules would help prevent dangerous or hidden changes in self-adapting models and make sure people stay in control.
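To picture how a “transformation contract” and two-track validation might fit together, here is a purely illustrative sketch; nothing in it corresponds to an existing tool. A proposed self-update is written into a human-readable record and cannot be applied until two independent reviewers sign off.

```python
from dataclasses import dataclass, field

@dataclass
class TransformationContract:
    """Purely illustrative: a human-readable record of a proposed self-update
    that cannot be applied until two independent reviewers approve it."""
    description: str                      # what the model wants to change and why
    diff_summary: str                     # which weights or modules are affected
    approvals: list = field(default_factory=list)

    def approve(self, reviewer: str):
        if reviewer not in self.approvals:
            self.approvals.append(reviewer)

    def may_apply(self) -> bool:
        # Two-track validation: at least two distinct sign-offs required.
        return len(self.approvals) >= 2

contract = TransformationContract(
    description="Fine-tune on self-generated ARC examples",
    diff_summary="Adapter weights on attention layers only",
)
contract.approve("safety-model-A")
contract.approve("human-reviewer")
assert contract.may_apply()
```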
AI Strategy Implications
For businesses and developers, these new models offer exciting possibilities—but also new responsibilities. If self-adapting systems like SEAL or T² prove reliable, they could dramatically lower the cost of customizing AI. Teams would no longer need to retrain models from scratch every time. Instead, AI could tune itself to a specific task, client, or domain.
This could speed up innovation and open doors in fields like science, law, or finance. But it also changes how companies must think about AI safety, compliance, and oversight. The models will evolve faster—but that means the tools to manage them must evolve too.
Conclusion: Towards Self-Regulating and Generally Intelligent Systems
The SEAL framework is a major breakthrough in the evolution of AI. By learning how to fine-tune itself using reinforcement learning, SEAL shows that language models can begin to adapt on their own—without waiting for humans to retrain them. Its strong performance on tough reasoning tasks, like those in ARC-AGI, proves that self-directed learning isn’t just possible—it works.
Other systems, like Transformer-Squared, are also pushing this vision forward in different ways. Together, these efforts signal a big shift away from frozen models toward smarter, more flexible AI—systems that act more like living minds than static tools.
Still, we’re not at general intelligence yet. Most of today’s self-adapting models work well in narrow tasks, but scaling them up while keeping them coherent, safe, and trustworthy remains a challenge. Benchmarks like ARC-AGI will help guide progress, setting the bar for real reasoning and skill learning in new situations.
The future of AI is full of potential—but also full of risk. As models begin to rewrite themselves, we’ll need stronger safety checks, clearer rules, and constant human oversight. Innovation must go hand in hand with responsibility. The goal isn’t just smarter machines. It’s smarter, safer systems that evolve with us—not past us.