
Recursive Systems Evaluation Lab

We study how agentic AI systems behave once deployed — where long-horizon objectives, tools, incentives, and feedback loops introduce uncertainty that standard evaluations fail to capture.
As AI systems become more autonomous, failures shift from obvious errors to quiet, compounding ones.

Models that appear reliable in isolation can behave unpredictably when embedded in workflows, connected to tools, and optimized over time.

We focus on applied systems research: measuring uncertainty and diagnosing how real agentic AI systems break down in practice.
Research Areas
We conduct applied research on the measurement and evaluation of advanced AI systems, with an emphasis on agentic behavior, benchmarks, and real-world deployment.
Safety & Evaluation Artifacts

ArXiv DL Instruct Dataset
A dataset for studying how models interpret, follow, and generalize technical instructions derived from real research code.
Useful for analyzing brittleness, overconfidence, and instruction misgeneralization in technical domains, especially when models are embedded in research or agentic workflows.
AlgorithmicResearchGroup/ArXivDLInstruct (2.26 GB)
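For quick inspection, the dataset can be streamed with the Hugging Face datasets library so the full 2.26 GB need not be downloaded. A minimal sketch; the split and column names are assumptions, so check the record schema on first load.

```python
# Minimal sketch: stream ArXivDLInstruct instead of downloading all 2.26 GB.
# The split name is assumed; inspect the first record for the actual schema.
from datasets import load_dataset

ds = load_dataset(
    "AlgorithmicResearchGroup/ArXivDLInstruct",
    split="train",      # assumed split name
    streaming=True,     # avoids a full 2.26 GB download
)
first = next(iter(ds))
print(first.keys())     # instruction / code fields, per the published schema
```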
ArXiv Research Code Dataset
A large corpus of research code referenced directly in scientific papers, designed to study how models reason about, modify, and execute real-world code.
This dataset supports research into tool-use errors, hallucinated affordances, and silent failure modes when agents interact with complex software systems.

We have also broken the dataset out into subsets for its most prominent programming languages; a loading sketch follows below.
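A hedged sketch of loading one language subset; the repository ID below is a hypothetical placeholder, so consult the hub listing for the actual per-language dataset names.

```python
# Sketch only: the dataset ID is a hypothetical placeholder for one of the
# per-language subsets of the research code corpus.
from datasets import load_dataset

python_subset = load_dataset(
    "AlgorithmicResearchGroup/arxiv_python_research_code",  # hypothetical ID
    split="train",
    streaming=True,
)
for record in python_subset.take(3):  # peek at a few records without a full download
    print(sorted(record.keys()))
```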
ArXiv Instruct Tuning Dataset
A collection of instruction-tuning datasets derived from scientific abstracts.
These datasets are useful for studying how synthetic supervision shapes model behavior, including overfitting to surface patterns and failures to generalize under distribution shift.

ArXiv QA BEIR Datasets
Question-answer datasets derived from ArXiv, designed for evaluating retrieval and search behavior in technical domains.
These datasets support analysis of retrieval failures, false confidence, and compounding errors in RAG and agentic systems that depend on external knowledge.
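As a sketch of the intended use, a minimal BEIR-style check ranks the corpus for each question and tests whether the top hit is the passage known to contain the answer. The encoder below is a generic public stand-in, not one of our models, and the toy corpus is illustrative only.

```python
# Minimal BEIR-style retrieval check with a stand-in public encoder.
# A real evaluation would score the full qrels; this just tests whether
# the top-ranked passage is the one known to answer the question.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic stand-in encoder

corpus = [
    "We propose a transformer variant with linear attention ...",
    "We study reinforcement learning from human feedback ...",
]
question = "Which paper introduces a linear-attention transformer?"
gold_id = 0  # index of the passage known to contain the answer

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(question, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=1)

retrieved = hits[0][0]["corpus_id"]
print("hit" if retrieved == gold_id else "retrieval failure")
```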
ArXiv Semantic Search Models
Semantic search models trained on large-scale scientific corpora.
Intended for studying how retrieval quality, query formulation, and embedding behavior affect downstream agent decisions — particularly in systems where retrieval errors propagate silently.
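A minimal sketch, assuming a hypothetical model ID for one of these encoders: rank candidate abstracts for a query and inspect the score margin between the top two hits, since narrow margins are where silent retrieval errors tend to enter agent pipelines.

```python
# Sketch: rank abstracts for a query and inspect the top-2 score margin.
# The model ID is hypothetical -- substitute a released ArXiv search model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("AlgorithmicResearchGroup/arxiv-semantic-search")  # hypothetical ID

abstracts = [
    "An evaluation framework for tool-using language agents ...",
    "A benchmark for long-horizon planning in embodied agents ...",
    "A survey of retrieval-augmented generation methods ...",
]
query = "methods for evaluating tool-using agents"

scores = util.cos_sim(model.encode(query), model.encode(abstracts))[0]
ranked = scores.argsort(descending=True)
margin = scores[ranked[0]] - scores[ranked[1]]
print(f"top hit: {ranked[0].item()}, margin over runner-up: {margin.item():.3f}")
```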
ArXiv LED Summarization Models
Models for summarizing full-length scientific papers, trained on source documents paired with their abstracts.
Useful for analyzing information loss, misrepresentation, and overconfidence introduced by compression — especially when summaries are used as inputs to downstream decision-making agents.
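A minimal sketch using a public LED checkpoint fine-tuned on arXiv (allenai/led-large-16384-arxiv) as a stand-in for these models; diffing the generated summary against the paper's own abstract is one quick way to surface information loss.

```python
# Sketch: summarize a full paper body with an LED checkpoint.
# The public allenai model is a stand-in for the lab's ArXiv LED models.
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="allenai/led-large-16384-arxiv",  # stand-in checkpoint
)

paper_text = open("paper.txt").read()  # full paper body, not just the abstract
summary = summarizer(paper_text, max_length=256, min_length=64, truncation=True)
print(summary[0]["summary_text"])
```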
We've taken over AI Improving AI from CAIS. View it here.