
Recursive Systems Evaluation Lab

We study how agentic AI systems behave once deployed — where long-horizon objectives, tools, incentives, and feedback loops introduce uncertainty that standard evaluations fail to capture.


As AI systems become more autonomous, failures shift from obvious errors to quiet, compounding ones.


Models that appear reliable in isolation can behave unpredictably when embedded in workflows, connected to tools, and optimized over time.


We focus on applied systems research: measuring uncertainty and diagnosing how real agentic AI systems break down in practice.

Research Areas

We conduct applied research on the measurement and evaluation of advanced AI systems, with an emphasis on agentic behavior, benchmarks, and real-world deployment.

Safety & Evaluation Artifacts

ArXiv DL Instruct Dataset

A dataset for studying how models interpret, follow, and generalize technical instructions derived from real research code.

Useful for analyzing brittleness, overconfidence, and instruction misgeneralization in technical domains, especially when models are embedded in research or agentic workflows.


ArXiv Research Code Dataset

A large corpus of research code referenced directly in scientific papers, designed to study how models reason about, modify, and execute real-world code.

This dataset supports research into tool-use errors, hallucinated affordances, and silent failure modes when agents interact with complex software systems.

We have also broken the dataset out into subsets for the most prominent programming languages.

ArXiv Instruct Tuning Dataset

A collection of instruction-tuning datasets derived from scientific abstracts.

These datasets are useful for studying how synthetic supervision shapes model behavior, including overfitting to surface patterns and failures to generalize under distribution shift.


ArXiv QA BEIR Datasets

Question-answer datasets derived from ArXiv, designed for evaluating retrieval and search behavior in technical domains.

These datasets support analysis of retrieval failures, false confidence, and compounding errors in RAG and agentic systems that depend on external knowledge.
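As a rough sketch of how a BEIR-format retrieval dataset like this can be scored (the data path and encoder checkpoint below are placeholders, not names of our released artifacts):

```python
# BEIR-style evaluation sketch; the data folder and encoder checkpoint
# are placeholders, not our released artifacts.
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Load a BEIR-format dataset: corpus, queries, and relevance judgments.
corpus, queries, qrels = GenericDataLoader(
    data_folder="path/to/arxiv-qa-beir"  # placeholder path
).load(split="test")

# Dense retrieval with a sentence-embedding encoder (placeholder checkpoint).
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"))
retriever = EvaluateRetrieval(model, score_function="cos_sim")

# Retrieve, then report nDCG, MAP, recall, and precision at standard cutoffs.
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, recall)
```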


ArXiv Semantic Search Models

Semantic search models trained on large-scale scientific corpora.

Intended for studying how retrieval quality, query formulation, and embedding behavior affect downstream agent decisions — particularly in systems where retrieval errors propagate silently.
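As a sketch of the typical semantic-search usage pattern for such models (the checkpoint below is a public stand-in, not one of our releases):

```python
# Semantic search over abstracts with the sentence-transformers API.
# The checkpoint name is a public stand-in, not one of our models.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in checkpoint

abstracts = [
    "We propose a transformer variant for long-document summarization.",
    "A study of reward hacking in reinforcement learning agents.",
]
query = "failure modes of RL agents"

# Embed the corpus and the query, then rank abstracts by cosine similarity.
corpus_emb = model.encode(abstracts, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]

for hit in hits:
    print(f"{hit['score']:.3f}  {abstracts[hit['corpus_id']]}")
```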


ArXiv LED Summarization Models

Models for generating abstract-style summaries of full-length scientific papers from their source documents.

Useful for analyzing information loss, misrepresentation, and overconfidence introduced by compression — especially when summaries are used as inputs to downstream decision-making agents.
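As a sketch of how an LED summarizer can be run with the Hugging Face transformers classes, using the public allenai/led-large-16384-arxiv checkpoint as a stand-in for ours:

```python
# Full-paper summarization sketch with LED; the checkpoint is a public
# stand-in, not necessarily one of our released models.
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

name = "allenai/led-large-16384-arxiv"
tokenizer = LEDTokenizer.from_pretrained(name)
model = LEDForConditionalGeneration.from_pretrained(name)

paper_text = "..."  # full paper body goes here

inputs = tokenizer(paper_text, return_tensors="pt", truncation=True, max_length=16384)

# LED expects global attention on at least the first token.
global_attention_mask = torch.zeros_like(inputs.input_ids)
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs.input_ids,
    global_attention_mask=global_attention_mask,
    max_length=512,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```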


We've taken over AI Improving AI from CAIS. View it here.

Code

Our open-source baseline agent for the ML Research Benchmark. It provides a foundation for comparing and evaluating how agents perform machine learning research and development tasks.

The tasks for the ML Research Benchmark, a benchmark designed to evaluate the capabilities of AI agents in accelerating ML research and development. The benchmark consists of 9 competition-level tasks that span the spectrum of activities typically undertaken by ML researchers.
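To illustrate the general shape of such a harness (a sketch only; Task, run_agent, and score_submission are hypothetical stand-ins, not the benchmark's actual interface):

```python
# Hypothetical harness sketch; Task, run_agent, and score_submission are
# illustrative stand-ins, not the ML Research Benchmark's actual API.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    prompt: str
    time_limit_hours: float

def run_agent(task: Task) -> str:
    """Hypothetical agent call: returns a path to the agent's artifact."""
    return f"artifacts/{task.name}"

def score_submission(task: Task, artifact: str) -> float:
    """Hypothetical scorer: grades the artifact against task criteria."""
    return 0.0

def evaluate(tasks: list[Task]) -> dict[str, float]:
    # Run the agent on each competition-level task and collect scores.
    return {t.name: score_submission(t, run_agent(t)) for t in tasks}

if __name__ == "__main__":
    tasks = [Task("example-task", "Improve the baseline within budget.", 24.0)]
    print(evaluate(tasks))
```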

The AI Agent State Library is designed to manage the state and decision-making processes of AI agents. At its core, it implements finite state machines, a computational model for designing systems with a finite number of states and explicit transitions between them.
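As a minimal sketch of the finite-state-machine idea (illustrative only, not the library's actual API):

```python
# Illustrative finite state machine for agent control flow; this sketch
# shows the concept, not the AI Agent State Library's actual API.
class StateMachine:
    def __init__(self, initial: str):
        self.state = initial
        self.transitions: dict[tuple[str, str], str] = {}

    def add_transition(self, state: str, event: str, next_state: str) -> None:
        self.transitions[(state, event)] = next_state

    def dispatch(self, event: str) -> str:
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"no transition for {event!r} in state {self.state!r}")
        self.state = self.transitions[key]
        return self.state

# Model an agent loop as a finite set of states with explicit transitions,
# so every reachable state and recovery path is enumerable and testable.
agent = StateMachine(initial="planning")
agent.add_transition("planning", "plan_ready", "acting")
agent.add_transition("acting", "tool_error", "recovering")
agent.add_transition("recovering", "retry", "acting")
agent.add_transition("acting", "goal_met", "done")

agent.dispatch("plan_ready")  # planning -> acting
agent.dispatch("tool_error")  # acting -> recovering
agent.dispatch("retry")       # recovering -> acting
agent.dispatch("goal_met")    # acting -> done
print(agent.state)            # "done"
```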

ARIA Benchmarks is a suite of closed-book benchmarks designed to assess a model's knowledge and understanding of machine learning research and methodologies.


Advancing AI Together

We value the power of collaboration and are actively seeking partnerships with academic institutions, AI research labs, and individual researchers to drive innovation together.

RSE Lab


©2024 RSE Lab. All Rights Reserved.
