NEMA: Nace Meta-Agent


Summary

At Nace.AI, we are pioneering the future of AI system development through advanced meta-learning architectures designed to generate specialized, task-specific agents automatically. Our Nace Meta-Agent (NEMA) exemplifies this approach. NEMA successfully created an AI system capable of tackling the difficult Certified Public Accountant (CPA) exam, achieving performance comparable to leading generalist models like OpenAI's o1. This breakthrough highlights the power and efficiency of our meta-agent technology for creating highly specialized AI systems that can be rapidly deployed for complex, domain-specific challenges, offering organizations a significant competitive advantage.


Introduction

The traditional approach to AI model development involves building and fine-tuning individual systems for each task, a resource-intensive and time-consuming process. At Nace.AI, we're taking a fundamentally different approach by developing meta-models that generate other models.

Inspired by groundbreaking research in automated agent design—specifically the ADAS [1] and AFlow [2] frameworks—we built the Nace Meta-Agent (NEMA). NEMA is a meta-agent framework designed to autonomously generate high-performing, task-specific AI agents. To validate its capabilities, we targeted one of the most demanding professional assessments: the CPA Exam.

The CPA Exam is a rigorous, four-part examination covering complex accounting principles, auditing standards, business law, taxation, and regulatory requirements. Its blend of multiple-choice questions and intricate scenario-based problems demands deep technical knowledge and sophisticated analytical reasoning. Successfully navigating the CPA exam requires expertise comparable to seasoned professionals, making it an ideal benchmark for evaluating the capabilities of specialized AI systems. NEMA's success in generating a proficient CPA agent underscores the potential of our meta-learning approach.


The Meta-Agent Architecture (NEMA)

flowchart TD
    init[Initialize Archive] --> generate[Generate Solution]
    
    generate --> evaluate[Evaluate Solution]
    evaluate --> update[Update Archive]
    update --> check{Generation Complete?}
    
    check -->|No| generate
    check -->|Yes| test[Test Best Solutions]
    
    %% External components with clear connections
    generate --- genLLM[Generator LLM]
    evaluate --- execLLM[Executor LLM]
    evaluate --- system[Evaluation System]
    
    %% Styling with better contrast
    classDef process fill:#e6f7ff,stroke:#1890ff,stroke-width:3px;
    classDef external fill:#f6ffed,stroke:#52c41a,stroke-width:3px;
    classDef decision fill:#fff7e6,stroke:#fa8c16,stroke-width:3px;
    classDef llm fill:#f9f0ff,stroke:#722ed1,stroke-width:3px;
    
    class init,generate,evaluate,update,test process;
    class system external;
    class genLLM,execLLM llm;
    class check decision;

Fig 1. Meta-Agent Workflow

NEMA operates as a model-agnostic workflow, orchestrated by Large Language Models (LLMs) and equipped with external tools, specifically designed to discover and refine high-performing task-specific agents. This architecture embodies the principles of meta-learning, or "learning to learn," enabling the system to improve its agent-generation capabilities over time.

  1. Initialization: The process begins with an Archive containing examples of basic agent structures (e.g., Chain of Thought [3], Self-Reflection [4]).

  2. Evaluation: Each agent in the Archive is evaluated against a validation dataset using an Executor LLM. Performance scores are recorded.

  3. Generation: The Archive, along with the validation scores, is provided to a Generator LLM. The Generator LLM is prompted to create a new agent workflow (as Python code) designed to maximize performance on the validation set.

  4. Evaluation (New Agent): The newly generated agent is evaluated using the Executor LLM and the validation dataset.

  5. Update Archive: The new agent and its score are added to the Archive.

  6. Iteration: Steps 3-5 are repeated for a predefined number of iterations, allowing NEMA to explore the space of possible agent designs and progressively improve.

  7. Final Selection: After N iterations, all generated agents are evaluated on a final test dataset, and the best-performing agent is selected.
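The loop described above can be sketched in Python. This is a minimal illustration only: `generate_agent` and `evaluate_agent` are stubs standing in for the Generator LLM and Executor LLM, and the "agent" here is a toy numeric function rather than generated workflow code.

```python
import random

def evaluate_agent(agent, validation_set):
    """Stand-in for the Executor LLM: score an agent on validation (question, answer) pairs."""
    return sum(agent["solve"](q) == a for q, a in validation_set) / len(validation_set)

def generate_agent(archive):
    """Stand-in for the Generator LLM: propose a new agent given the scored archive.
    Here we simply perturb the best agent's parameter in place of real code generation."""
    best = max(archive, key=lambda entry: entry["score"])
    bias = best["agent"]["bias"] + random.choice([-1, 0, 1])
    return {"bias": bias, "solve": lambda q, b=bias: q + b}

def meta_agent_search(seed_agents, validation_set, iterations=10):
    # Steps 1-2: initialize the archive and score the seed agents.
    archive = [{"agent": a, "score": evaluate_agent(a, validation_set)}
               for a in seed_agents]
    for _ in range(iterations):
        # Steps 3-4: generate a candidate agent and evaluate it.
        candidate = generate_agent(archive)
        score = evaluate_agent(candidate, validation_set)
        # Step 5: update the archive.
        archive.append({"agent": candidate, "score": score})
    # Step 7: final selection (in practice, scored on a held-out test set).
    return max(archive, key=lambda entry: entry["score"])
```

In the real system the archive entries are Python source strings and the evaluation runs full CPA validation questions; the control flow, however, is the same generate-evaluate-update cycle.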

Our implementation offers three key innovations:

  1. Enhanced Initialization: We seed the initial Archive with sophisticated agent examples employing techniques like Program of Thoughts (PoT) [5] and Graph of Thoughts (GoT) [7]. Using code-proficient LLMs (like o3-mini, Claude 3.7 thinking, Gemini 2.5 Pro) as the Generator LLM, along with few-shot examples, significantly improves the quality and robustness of the generated agent code.

  2. Optimized Search Strategy: Instead of computationally expensive Monte Carlo Tree Search or potentially memory-intensive linear searches through generated agents, we use a top-K selection strategy. After each generation, agents are ranked by their validation scores, and only the top-performing ones (e.g., top 20) are retained in the Archive. These top-K agents are then passed back to the Generator LLM in subsequent iterations. This approach efficiently balances exploration quality and context window limitations.

  3. Comprehensive Feedback Loop: Going beyond simple accuracy scores, we feed detailed information about incorrectly and correctly answered validation questions back to the Generator LLM. This rich textual feedback, complementing the performance scores, acts as a powerful self-correction mechanism, enabling the Generator LLM to understand why certain agents succeed or fail and generate more precise and effective solutions in subsequent iterations.
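The top-K retention and feedback mechanism might look like the sketch below. The `ArchiveEntry` structure, its field names, and the prompt format are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ArchiveEntry:
    code: str     # generated agent workflow (Python source)
    score: float  # validation accuracy
    failures: list = field(default_factory=list)  # (question, wrong_answer) feedback

def update_archive(archive, new_entry, k=20):
    """Keep only the top-K agents by validation score after each generation."""
    archive.append(new_entry)
    archive.sort(key=lambda e: e.score, reverse=True)
    return archive[:k]

def build_generator_prompt(archive):
    """Fold scores and per-question error feedback into the next Generator LLM prompt."""
    parts = []
    for entry in archive:
        parts.append(f"# score={entry.score:.2f}\n{entry.code}")
        for question, wrong in entry.failures:
            parts.append(f"# failed on: {question!r} -> answered {wrong!r}")
    return "\n\n".join(parts)
```

Keeping only the top K entries bounds the prompt size passed back to the Generator LLM, which is how the approach stays within context window limits while still exposing the most promising designs.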


A Case Study in Specialized Agent Development

Through multiple iterations of this automated search and self-improvement process, NEMA generated a highly specialized agent tailored for the CPA exam. This agent achieves strong results, performing on par with state-of-the-art models like OpenAI's o1. Notably, this was achieved using OpenAI's o3-mini as the Generator LLM and the open-source deepseek-r1 [6] as the Executor LLM, demonstrating the ability to leverage diverse model capabilities effectively.

During development, NEMA identified that initial agent versions, while strong in reasoning, lacked the necessary precision for calculation-heavy sections like Financial Accounting and Reporting (FAR). To address this, NEMA integrated a Program of Thoughts [5] approach coupled with a secure Python interpreter tool (running in a Docker container for safety). This allows the agent to offload complex calculations, ensuring accuracy without compromising the LLM's reasoning flow.
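The calculation offload can be sketched as follows. This is a simplified stand-in: it isolates LLM-generated code in a separate interpreter process with a timeout, whereas the production system described here runs it inside a Docker container; the helper name is our own.

```python
import subprocess
import sys

def run_generated_calculation(code: str, timeout: int = 10) -> str:
    """Execute LLM-generated calculation code in a separate interpreter process.
    (A sandboxing stand-in; the real system uses a Docker container for isolation.)"""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.strip()
```

For a straight-line depreciation question, for example, the agent might emit `print((50000 - 5000) / 9)` and receive "5000.0" back, rather than attempting the arithmetic in-token.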

The final CPA-specific agent architecture generated by NEMA employs a sophisticated multi-expert workflow, illustrated in Figure 2:

  1. Question Classification: The initial step involves a sophisticated analysis of the incoming CPA exam question to precisely determine its core requirements. This classification, performed by an LLM component within the agent, distinguishes between questions demanding primarily conceptual understanding, regulatory recall, or complex numerical computation. 

  2. Calculation Assessment & Execution: When the classification step identifies a need for numerical work, the agent invokes a specialized module combining the Program of Thoughts (PoT) methodology with a secure Python interpreter. PoT guides the LLM to break down the calculation into logical steps and generate corresponding Python code. This code is then executed within an isolated Docker container, leveraging libraries like math and numpy for robust and accurate computation. This approach directly addresses the known limitations of LLMs in performing complex arithmetic reliably by offloading the task to a dedicated computational environment. The result is significantly improved accuracy and trustworthiness for quantitative problems, crucial for sections like Financial Accounting and Reporting (FAR), while maintaining safety through containerization.

  3. Multi-Expert Debate: Following the initial processing and any necessary calculations, the question and intermediate findings are presented to a simulated panel of domain specialists. These are distinct instances or prompted roles within the LLM framework, each embodying expertise in a specific area of the CPA exam: Taxation, Auditing, Financial Accounting & Reporting, and Business Environment & Concepts. These virtual experts engage in a structured, multi-round debate. Each expert analyzes the problem from its specialized viewpoint, proposes solutions or interpretations, critiques the reasoning of others, and refines its own stance based on the collective feedback. This iterative deliberation process allows for a deeper, multi-faceted analysis, uncovering nuances, cross-validating facts, and identifying potential errors that a single monolithic reasoning process might miss.

  4. Consensus Building: The final stage involves synthesizing the diverse perspectives and arguments generated during the multi-expert deliberation into a single, coherent, and high-confidence answer. A dedicated component reviews the entire debate transcript, weighs the arguments presented by each expert, identifies points of agreement and disagreement, and resolves conflicts based on the strength of evidence and reasoning.

flowchart TD
    start([Start]) --> classify[Question Classifier]
    classify --> decision{Calculation Required?}
    
    decision -->|Yes| calc[Calculation Engine]
    decision -->|No| debate[Debate Engine]
    
    calc --> valid{Solution Valid?}
    valid -->|No| debate
    valid -->|Yes| insights[Calculation Insights]
    
    debate --> consensus[Expert Consensus]
    insights --> final[Decision Agent]
    consensus --> final
    
    final --> answer([Final Answer])
    
    %% Styling
    classDef process fill:#f0f8ff,stroke:#2196f3,stroke-width:2px;
    classDef calculation fill:#e8f5e9,stroke:#4caf50,stroke-width:2px;
    classDef debate fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px;
    classDef decision fill:#fff3e0,stroke:#ff9800,stroke-width:2px;
    
    class classify,final process;
    class calc,insights calculation;
    class debate,consensus debate;
    class decision,valid decision;
    class start,answer process;

Fig 2. CPA Agent Workflow
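The four stages in Figure 2 can be wired together as a simple router. The classifier, calculation, and debate functions below are stubs standing in for the LLM components; their names and the keyword-based classification heuristic are illustrative only:

```python
def classify_question(question: str) -> str:
    """Stub classifier: the real agent uses an LLM to label the question type."""
    numeric_cues = ("calculate", "compute", "amount", "depreciation")
    return "calculation" if any(c in question.lower() for c in numeric_cues) else "conceptual"

def calculation_engine(question: str) -> str:
    """Stand-in for PoT code generation plus sandboxed Python execution."""
    return "calc-insights"

def multi_expert_debate(question: str) -> str:
    """Stand-in for the simulated multi-round expert panel and consensus step."""
    return "debate-consensus"

def answer_cpa_question(question: str) -> str:
    # Stage 1: classify the question to choose a processing route.
    route = classify_question(question)
    if route == "calculation":
        # Stage 2: offload numerical work, then hand insights to the decision agent.
        insights = calculation_engine(question)
        return f"final({insights})"
    # Stages 3-4: debate among virtual experts, then synthesize a consensus.
    consensus = multi_expert_debate(question)
    return f"final({consensus})"
```

The point of the sketch is the control flow: quantitative questions are routed through the calculation path, everything else through the debate path, and both feed a single decision step that produces the final answer.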


Performance Benchmarks

Our approach has yielded impressive results when benchmarked against leading AI models:

Model                    Accuracy (%)
o1                       91.08
NEMA CPA agent (ours)    90.20
Claude-3.7               88.38
Claude-3.7-thinking      88.38
Gemini-2.0-flash         86.75
GPT-4o                   84.32
DeepSeek-V3              80.80
Qwen-QwQ-32B             80.27
DeepSeek-R1              79.70
Llama-3.3-70B            76.49
Qwen2.5-72B              72.43

These results demonstrate that our meta-agent approach can produce specialized systems that compete with and potentially outperform general-purpose frontier models on specific tasks—all while offering greater adaptability.


Conclusion and Future Directions

At Nace.AI, we're committed to continuing our investment in research and development of hypernetworks, multi-task learning, and meta-learning frameworks. We believe this approach to rapid adaptation will deliver AI systems that are more efficient, more powerful, and more accessible.

Our meta-agent architecture is designed to be applicable across various professional domains requiring specialized expertise. In the near future, we plan to extend this approach to generate high-performing agents for other professional certification exams, including the Chartered Financial Analyst (CFA), Certified Internal Auditor (CIA), and other specialized certifications.

By automating the development of expert systems through meta-learning, we're creating a new paradigm for AI deployment that combines the power of frontier models with the specificity and reliability needed for mission-critical business applications.


References

[1] Automated Design of Agentic Systems, https://arxiv.org/pdf/2408.08435
[2] AFlow: Automating Agentic Workflow Generation, https://arxiv.org/pdf/2410.10762
[3] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, https://arxiv.org/pdf/2201.11903
[4] Self-Reflection in LLM Agents: Effects on Problem-Solving Performance, https://arxiv.org/pdf/2405.06682
[5] Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks, https://arxiv.org/pdf/2211.12588
[6] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/pdf/2501.12948
[7] Graph of Thoughts: Solving Elaborate Problems with Large Language Models, https://arxiv.org/pdf/2308.09687

All rights reserved. Nace.AI © 2026
