Safety & Evaluation Artifacts

We publish datasets, models, and evaluation resources to support research on how agentic AI systems behave once deployed.

These artifacts are designed to surface failure modes that emerge when models interact with tools, operate over time, and rely on retrieval, compression, and imperfect objectives.

ArXiv DL Instruct Dataset

A dataset for studying how models interpret, follow, and generalize technical instructions derived from real research code.

Useful for analyzing brittleness, overconfidence, and instruction misgeneralization in technical domains — especially when models are embedded in research or agentic workflows.
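
A minimal loading sketch using the Hugging Face datasets library (the Hub ID and the instruction/response field names below are illustrative assumptions, not published identifiers):

```python
# Minimal sketch: load and inspect a few instruction examples.
# "rse-lab/arxiv-dl-instruct" and the field names are hypothetical --
# substitute the published Hub ID and schema.
from datasets import load_dataset

ds = load_dataset("rse-lab/arxiv-dl-instruct", split="train")

for example in ds.select(range(3)):
    print(example["instruction"][:200])  # instruction derived from research code
    print(example["response"][:200])     # reference completion
    print("---")
```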

ArXiv Research Code Dataset

A large corpus of research code referenced directly in scientific papers, designed to study how models reason about, modify, and execute real-world code.

This dataset supports research into tool-use errors, hallucinated affordances, and silent failure modes when agents interact with complex software systems.
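
For example, a cheap static check can separate snippets that cannot run at all from code that fails silently at runtime. A minimal sketch, assuming a hypothetical Hub ID and a "code" field:

```python
# Sketch: flag snippets that fail even a syntax check, so downstream
# tool-use errors can be separated from outright broken inputs.
# The dataset ID and "code" field are assumptions for illustration.
import ast

from datasets import load_dataset

ds = load_dataset("rse-lab/arxiv-research-code", split="train")

def parses(source: str) -> bool:
    """True if the snippet is at least syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

sample = ds.select(range(100))
valid = sum(parses(ex["code"]) for ex in sample)
print(f"{valid}/100 snippets parse cleanly")
```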

ArXiv Instruct Tuning Dataset

A collection of instruction-tuning datasets derived from scientific abstracts.

These datasets are useful for studying how synthetic supervision shapes model behavior, including overfitting to surface patterns and failures to generalize under distribution shift.
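
A minimal sketch of rendering such pairs into a single training field, again assuming a hypothetical Hub ID and an instruction/output schema:

```python
# Sketch: flatten instruction pairs into a chat-style training template.
# Dataset ID and field names are assumptions; adapt to the real schema.
from datasets import load_dataset

ds = load_dataset("rse-lab/arxiv-instruct-tuning", split="train")

def to_text(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}"
    }

tuning_ds = ds.map(to_text, remove_columns=ds.column_names)
print(tuning_ds[0]["text"][:300])
```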

ArXiv QA BEIR Datasets

Question-answer datasets derived from ArXiv, designed for evaluating retrieval and search behavior in technical domains.

These datasets support analysis of retrieval failures, false confidence, and compounding errors in RAG and agentic systems that depend on external knowledge.
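
Because these follow the BEIR layout, the standard BEIR evaluation harness applies directly. A minimal sketch, assuming the data sits in a local folder in the usual corpus/queries/qrels format and using an off-the-shelf retriever as a placeholder:

```python
# Sketch: BEIR-style retrieval evaluation. The data folder is assumed to
# hold corpus.jsonl, queries.jsonl, and qrels/ in the standard layout;
# the retriever checkpoint is an off-the-shelf placeholder.
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

corpus, queries, qrels = GenericDataLoader(data_folder="arxiv-qa-beir").load(split="test")

retriever = EvaluateRetrieval(
    DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=64),
    score_function="cos_sim",
)
results = retriever.retrieve(corpus, queries)

# nDCG@k / Recall@k surface silent retrieval failures before they compound.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, recall)
```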

ArXiv Semantic Search Models

Semantic search models trained on large-scale scientific corpora.

Intended for studying how retrieval quality, query formulation, and embedding behavior affect downstream agent decisions — particularly in systems where retrieval errors propagate silently.
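
A minimal usage sketch with the sentence-transformers library; the checkpoint ID is a placeholder, not a confirmed model name:

```python
# Sketch: score a query against candidate abstracts. The checkpoint ID
# is hypothetical; a mis-ranked result here propagates silently to any
# agent that consumes the top hit.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("rse-lab/arxiv-semantic-search")

query_emb = model.encode(
    "contrastive pretraining for dense retrieval", convert_to_tensor=True
)
doc_embs = model.encode(
    [
        "We propose a contrastive objective for training dense retrievers...",
        "This paper studies reward hacking in reinforcement learning...",
    ],
    convert_to_tensor=True,
)

print(util.cos_sim(query_emb, doc_embs))  # higher score = retrieved first
```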

ArXiv LED Summarization Models

Models for summarizing full-length scientific papers from abstracts and source documents.

Useful for analyzing information loss, misrepresentation, and overconfidence introduced by compression — especially when summaries are used as inputs to downstream decision-making agents.
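
A minimal generation sketch following standard LED (Longformer Encoder-Decoder) usage in transformers; the checkpoint ID is a placeholder:

```python
# Sketch: summarize a long document with an LED checkpoint. The model ID
# is hypothetical; global attention on the first token follows standard
# LED usage.
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

model_id = "rse-lab/arxiv-led-summarizer"  # placeholder checkpoint
tokenizer = LEDTokenizer.from_pretrained(model_id)
model = LEDForConditionalGeneration.from_pretrained(model_id)

paper_text = "..."  # full paper text, up to LED's 16k-token window
inputs = tokenizer(paper_text, return_tensors="pt", truncation=True, max_length=16384)

global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # global attention on the first token

summary_ids = model.generate(
    inputs["input_ids"],
    global_attention_mask=global_attention_mask,
    max_length=512,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```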

Advancing AI Together

We value collaboration and actively seek partnerships with academic institutions, AI research labs, and individual researchers to advance this work together.
