
Recursive Systems Evaluation Lab

We study how agentic AI systems behave once deployed — where long-horizon objectives, tools, incentives, and feedback loops introduce uncertainty that standard evaluations fail to capture.
As AI systems become more autonomous, failures shift from obvious errors to quiet, compounding ones.

Models that appear reliable in isolation can behave unpredictably when embedded in workflows, connected to tools, and optimized over time.

We focus on applied systems research: measuring uncertainty and diagnosing how real agentic AI systems break down in practice.
Research Areas
We conduct applied research on the measurement and evaluation of advanced AI systems, with an emphasis on agentic behavior, benchmarks, and real-world deployment.
Safety & Evaluation Artifacts

ArXiv DL Instruct Dataset
A dataset for studying how models interpret, follow, and generalize technical instructions derived from real research code.
Useful for analyzing brittleness, overconfidence, and instruction misgeneralization in technical domains, especially when models are embedded in research or agentic workflows.
AlgorithmicResearchGroup/ArXivDLInstruct (2.26 GB)
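For quick inspection, the dataset can be streamed with the Hugging Face datasets library so the full 2.26 GB need not be downloaded. A minimal sketch; the split and column names are assumptions, so check the record schema on first load.

```python
# Minimal sketch: stream ArXivDLInstruct instead of downloading all 2.26 GB.
# The split name is assumed; inspect the first record for the actual schema.
from datasets import load_dataset

ds = load_dataset(
    "AlgorithmicResearchGroup/ArXivDLInstruct",
    split="train",      # assumed split name
    streaming=True,     # avoids a full 2.26 GB download
)
first = next(iter(ds))
print(first.keys())     # instruction / code fields, per the published schema
```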
ArXiv Research Code Dataset
A large corpus of research code referenced directly in scientific papers, designed to study how models reason about, modify, and execute real-world code.
This dataset supports research into tool-use errors, hallucinated affordances, and silent failure modes when agents interact with complex software systems.

We have also broken the dataset out into subsets for its most prominent programming languages; a loading sketch follows below.
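A hedged sketch of loading one language subset; the repository ID below is a hypothetical placeholder, so consult the hub listing for the actual per-language dataset names.

```python
# Sketch only: the dataset ID is a hypothetical placeholder for one of the
# per-language subsets of the research code corpus.
from datasets import load_dataset

python_subset = load_dataset(
    "AlgorithmicResearchGroup/arxiv_python_research_code",  # hypothetical ID
    split="train",
    streaming=True,
)
for record in python_subset.take(3):  # peek at a few records without a full download
    print(sorted(record.keys()))
```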
ArXiv Instruct Tuning Dataset
A collection of instruction-tuning datasets derived from scientific abstracts.
These datasets are useful for studying how synthetic supervision shapes model behavior, including overfitting to surface patterns and failures to generalize under distribution shift.

ArXiv QA BEIR Datasets
Question-answer datasets derived from ArXiv, designed for evaluating retrieval and search behavior in technical domains.
These datasets support analysis of retrieval failures, false confidence, and compounding errors in RAG and agentic systems that depend on external knowledge.
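As a sketch of the intended use, a minimal BEIR-style check ranks the corpus for each question and tests whether the top hit is the passage known to contain the answer. The encoder below is a generic public stand-in, not one of our models, and the toy corpus is illustrative only.

```python
# Minimal BEIR-style retrieval check with a stand-in public encoder.
# A real evaluation would score the full qrels; this just tests whether
# the top-ranked passage is the one known to answer the question.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic stand-in encoder

corpus = [
    "We propose a transformer variant with linear attention ...",
    "We study reinforcement learning from human feedback ...",
]
question = "Which paper introduces a linear-attention transformer?"
gold_id = 0  # index of the passage known to contain the answer

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(question, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=1)

retrieved = hits[0][0]["corpus_id"]
print("hit" if retrieved == gold_id else "retrieval failure")
```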
ArXiv Semantic Search Models
Semantic search models trained on large-scale scientific corpora.
Intended for studying how retrieval quality, query formulation, and embedding behavior affect downstream agent decisions — particularly in systems where retrieval errors propagate silently.
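A minimal sketch, assuming a hypothetical model ID for one of these encoders: rank candidate abstracts for a query and inspect the score margin between the top two hits, since narrow margins are where silent retrieval errors tend to enter agent pipelines.

```python
# Sketch: rank abstracts for a query and inspect the top-2 score margin.
# The model ID is hypothetical -- substitute a released ArXiv search model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("AlgorithmicResearchGroup/arxiv-semantic-search")  # hypothetical ID

abstracts = [
    "An evaluation framework for tool-using language agents ...",
    "A benchmark for long-horizon planning in embodied agents ...",
    "A survey of retrieval-augmented generation methods ...",
]
query = "methods for evaluating tool-using agents"

scores = util.cos_sim(model.encode(query), model.encode(abstracts))[0]
ranked = scores.argsort(descending=True)
margin = scores[ranked[0]] - scores[ranked[1]]
print(f"top hit: {ranked[0].item()}, margin over runner-up: {margin.item():.3f}")
```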
ArXiv LED Summarization Models
Models for summarizing full-length scientific papers, trained on source documents paired with their abstracts.
Useful for analyzing information loss, misrepresentation, and overconfidence introduced by compression — especially when summaries are used as inputs to downstream decision-making agents.
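A minimal sketch using a public LED checkpoint fine-tuned on arXiv (allenai/led-large-16384-arxiv) as a stand-in for these models; diffing the generated summary against the paper's own abstract is one quick way to surface information loss.

```python
# Sketch: summarize a full paper body with an LED checkpoint.
# The public allenai model is a stand-in for the lab's ArXiv LED models.
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="allenai/led-large-16384-arxiv",  # stand-in checkpoint
)

paper_text = open("paper.txt").read()  # full paper body, not just the abstract
summary = summarizer(paper_text, max_length=256, min_length=64, truncation=True)
print(summary[0]["summary_text"])
```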
We've taken over AI Improving AI from CAIS. View it here.