Safety & Evaluation Artifacts

We publish datasets, models, and evaluation resources to support research on how agentic AI systems behave once deployed.

These artifacts are designed to surface failure modes that emerge when models interact with tools, operate over time, and rely on retrieval, compression, and imperfect objectives.

ArXiv DL Instruct Dataset

A dataset for studying how models interpret, follow, and generalize technical instructions derived from real research code.

Useful for analyzing brittleness, overconfidence, and instruction misgeneralization in technical domains — especially when models are embedded in research or agentic workflows.
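
A minimal loading sketch using the Hugging Face datasets library (the Hub ID and the instruction/response field names below are illustrative assumptions, not published identifiers):

```python
# Minimal sketch: load and inspect a few instruction examples.
# "rse-lab/arxiv-dl-instruct" and the field names are hypothetical --
# substitute the published Hub ID and schema.
from datasets import load_dataset

ds = load_dataset("rse-lab/arxiv-dl-instruct", split="train")

for example in ds.select(range(3)):
    print(example["instruction"][:200])  # instruction derived from research code
    print(example["response"][:200])     # reference completion
    print("---")
```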

ArXiv Research Code Dataset

A large corpus of research code referenced directly in scientific papers, designed to study how models reason about, modify, and execute real-world code.

This dataset supports research into tool-use errors, hallucinated affordances, and silent failure modes when agents interact with complex software systems.
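
For example, a cheap static check can separate snippets that cannot run at all from code that fails silently at runtime. A minimal sketch, assuming a hypothetical Hub ID and a "code" field:

```python
# Sketch: flag snippets that fail even a syntax check, so downstream
# tool-use errors can be separated from outright broken inputs.
# The dataset ID and "code" field are assumptions for illustration.
import ast

from datasets import load_dataset

ds = load_dataset("rse-lab/arxiv-research-code", split="train")

def parses(source: str) -> bool:
    """True if the snippet is at least syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

sample = ds.select(range(100))
valid = sum(parses(ex["code"]) for ex in sample)
print(f"{valid}/100 snippets parse cleanly")
```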

ArXiv Instruct Tuning Dataset

A collection of instruction-tuning datasets derived from scientific abstracts.

These datasets are useful for studying how synthetic supervision shapes model behavior, including overfitting to surface patterns and failures to generalize under distribution shift.
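
A minimal sketch of rendering such pairs into a single training field, again assuming a hypothetical Hub ID and an instruction/output schema:

```python
# Sketch: flatten instruction pairs into a chat-style training template.
# Dataset ID and field names are assumptions; adapt to the real schema.
from datasets import load_dataset

ds = load_dataset("rse-lab/arxiv-instruct-tuning", split="train")

def to_text(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}"
    }

tuning_ds = ds.map(to_text, remove_columns=ds.column_names)
print(tuning_ds[0]["text"][:300])
```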

ArXiv QA BEIR Datasets

Question-answer datasets derived from ArXiv, designed for evaluating retrieval and search behavior in technical domains.

These datasets support analysis of retrieval failures, false confidence, and compounding errors in RAG and agentic systems that depend on external knowledge.
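
Because these follow the BEIR layout, the standard BEIR evaluation harness applies directly. A minimal sketch, assuming the data sits in a local folder in the usual corpus/queries/qrels format and using an off-the-shelf retriever as a placeholder:

```python
# Sketch: BEIR-style retrieval evaluation. The data folder is assumed to
# hold corpus.jsonl, queries.jsonl, and qrels/ in the standard layout;
# the retriever checkpoint is an off-the-shelf placeholder.
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

corpus, queries, qrels = GenericDataLoader(data_folder="arxiv-qa-beir").load(split="test")

retriever = EvaluateRetrieval(
    DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=64),
    score_function="cos_sim",
)
results = retriever.retrieve(corpus, queries)

# nDCG@k / Recall@k surface silent retrieval failures before they compound.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, recall)
```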

ArXiv Semantic Search Models

Semantic search models trained on large-scale scientific corpora.

Intended for studying how retrieval quality, query formulation, and embedding behavior affect downstream agent decisions — particularly in systems where retrieval errors propagate silently.
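
A minimal usage sketch with the sentence-transformers library; the checkpoint ID is a placeholder, not a confirmed model name:

```python
# Sketch: score a query against candidate abstracts. The checkpoint ID
# is hypothetical; a mis-ranked result here propagates silently to any
# agent that consumes the top hit.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("rse-lab/arxiv-semantic-search")

query_emb = model.encode(
    "contrastive pretraining for dense retrieval", convert_to_tensor=True
)
doc_embs = model.encode(
    [
        "We propose a contrastive objective for training dense retrievers...",
        "This paper studies reward hacking in reinforcement learning...",
    ],
    convert_to_tensor=True,
)

print(util.cos_sim(query_emb, doc_embs))  # higher score = retrieved first
```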

ArXiv LED Summarization Models

Models for summarizing full-length scientific papers from abstracts and source documents.

Useful for analyzing information loss, misrepresentation, and overconfidence introduced by compression — especially when summaries are used as inputs to downstream decision-making agents.
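
A minimal generation sketch following standard LED (Longformer Encoder-Decoder) usage in transformers; the checkpoint ID is a placeholder:

```python
# Sketch: summarize a long document with an LED checkpoint. The model ID
# is hypothetical; global attention on the first token follows standard
# LED usage.
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

model_id = "rse-lab/arxiv-led-summarizer"  # placeholder checkpoint
tokenizer = LEDTokenizer.from_pretrained(model_id)
model = LEDForConditionalGeneration.from_pretrained(model_id)

paper_text = "..."  # full paper text, up to LED's 16k-token window
inputs = tokenizer(paper_text, return_tensors="pt", truncation=True, max_length=16384)

global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # global attention on the first token

summary_ids = model.generate(
    inputs["input_ids"],
    global_attention_mask=global_attention_mask,
    max_length=512,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```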

Advancing AI Together

We value collaboration and actively seek partnerships with academic institutions, AI research labs, and individual researchers to advance this work together.
