Reflection of Episodes: Learning to Play Game from Expert and Self Experiences
Framework combining LLM self-reflection with expert and self-experience for StarCraft II gameplay. Addresses learning in complex environments.
Method for LLMs to generate reliable citations without external retrievers by leveraging pretraining knowledge. Improves inference efficiency.
Benchmark for evaluating long-term planning capabilities of LLMs and AI agents. Addresses gap in existing planning benchmarks.
Mathematical framework formalizing similarity relations as structural basis for dynamic systems. Theoretical foundational work.
Survey of autonomous LLM agents for scientific discovery, orchestrating human scientists, code, and physics simulations.
Survey of security threats, defenses, and evaluation methods for agentic AI systems with tool use, planning, and autonomous execution.
PRISM: Training-free framework combining prompt engineering and multi-agent coordination for financial document retrieval with LLMs.
Agent-based framework for automatic validation of mathematical optimization models generated by LLMs from natural language descriptions.
Research on iterative concept refinement for vision classifiers through human-in-the-loop deliberation for subjective visual tasks.
Finch: benchmark for evaluating agents on enterprise finance workflows including data entry, retrieval, calculation, and reporting using Enron dataset.
DDFT protocol for measuring epistemic robustness in LMs under degraded information and adversarial stress beyond static benchmarks.
HAG framework for topic-adaptive agent generation in agent-based modeling balancing macro-level distributions with micro-level rationality.
Mechanistic interpretability study of how Diffusion Transformers generate correct spatial relations between objects in text-to-image generation.
ConvoLearn dataset of 2,134 tutor-student dialogues for fine-tuning LLMs on dialogic tutoring principles in science education.
Study showing LLMs exhibit robustness to emotional framing in rule-bound decision-making despite known brittleness to prompt perturbations.
TSPO: RL framework for multi-turn search-augmented LLM reasoning addressing process and reward homogenization in tool-integrated tasks.
Method for improving Vision Language Model robustness when modalities are missing using scalable diffusion-based feature restoration.
Multi-agent LLM framework for discovering instrumental variables in causal inference through interdisciplinary knowledge synthesis.
Voxtral Realtime: natively streaming ASR model achieving sub-second latency with end-to-end training for audio-text alignment.
SSLogic: agentic meta-synthesis framework where LLM agents iteratively create and refine generator-validator pairs for logic reasoning tasks.
KLong: open-source LLM agent trained for extremely long-horizon tasks using trajectory-splitting SFT and progressive RL with Research-Factory pipeline.
AI Runtime Infrastructure layer that observes and optimizes agent execution for task success, latency, token efficiency, and safety.
DeepFact benchmark and co-evolving agent system for testing factuality of search-augmented LLM-generated research reports.
HECG framework for autonomous agents using LLMs with multi-dimensional error correction and strategy transfer across tasks.
Study showing that deliberation between multiple LLMs can amplify tiny perturbations into divergent decisions, challenging robustness assumptions.
Machine learning framework for automating defect detection in photovoltaic systems using electroluminescence imaging.
Proposes alternative training architecture for geometric and neuromorphic AI using non-standard arithmetic to reduce memory overhead.
Conceptual framework for AI governance addressing regulatory gaps between task-specific systems and foundation models.
Voxtral TTS: expressive multilingual text-to-speech model generating natural speech from minimal reference audio.
Metriplector: neural architecture primitive based on field theory, where input configures abstract physical systems.
ClawSafety exposes security vulnerabilities in local LLM agent frameworks where prompt injection enables privilege escalation.
AgentSocialBench evaluates privacy risks in collaborative multi-agent social networks with persistent LLM agents.
Modal framework for knowledge representation handling domain-specific concept meaning shifts in knowledge graphs.
XpertBench evaluates LLM performance on expert-level open-ended tasks with rubrics-based assessment.
Addresses value hallucination in Dyna reinforcement learning agents through multistep predecessor models.
VLBiasBench evaluates biases in large vision-language models across diverse domains and question formats.
Study of app metamorphosis phenomenon where mobile apps undergo significant market repositioning.
MegaFake dataset of LLM-generated fake news for understanding mechanisms behind AI-generated misinformation.
SPRIG optimizes system prompts for LLMs using genetic algorithms to improve general task performance.
Comprehensive survey of document parsing techniques for extracting structured information from unstructured documents.
Certified Training with Branch-and-Bound for learning verifiably stable neural control systems.
RIRS framework for multi-agent RAG systems to route complex questions across distributed knowledge bases.
Human-AI collaboration for game testing using vision language models to enhance manual testing efficiency.
Framework for statistical inference on detected changepoints in sequential analysis with confidence sets.
Review of anomaly detection techniques for cyber-physical systems security in critical infrastructure.
Reasoning Model Implicit Association Test studies implicit bias-like patterns in LLMs that use step-by-step reasoning.
BalancedDPO method aligns diffusion models with multiple conflicting evaluation metrics for text-to-image generation.
Open-source benchmark for 3D chip design using the OpenROAD framework; evaluates power, performance, area, and thermal metrics.
Investigates alignment of causal attribution scores (Shapley, Banzhaf, Causal Responsibility) for database tuple relevance in data management.
RaPA improves transferable targeted adversarial attacks by identifying and pruning redundant surrogate model parameters.