Neurosymbolic approach combining LLMs with Logic Tensor Networks for auditable offer validation in regulated procurement, ensuring factually correct and legally verifiable decisions.
COSMO-Agent tool-augmented RL framework teaching LLMs to bridge the CAD-CAE gap by translating simulation feedback into valid geometric edits for iterative industrial design optimization.
ResearchEVO framework for automated scientific discovery using LLMs to conduct undirected experimentation and generate explanations, instantiating discover-then-explain paradigm computationally.
Research on LLM-as-a-Judge showing, via counterfactual design and eye-tracking, that both humans and LLMs favor content labeled as human-authored over identical content labeled as AI-generated.
Philosophical critique of behavioral evaluation paradigms for AI systems and proposal for cognitive assessment methods.
PECKER algorithm for efficient machine unlearning in diffusion models with directed gradient updates.
CuraLight framework combining RL and LLMs for traffic signal control with debate-guided data curation.
LudoBench benchmark evaluating LLM strategic reasoning in Ludo board game with 480 handcrafted scenarios.
Quality-aware mixture of experts for multimodal sentiment analysis robust to noise and modality missingness.
Unlearn-and-Reinvent pipeline testing whether LLMs can rediscover foundational algorithms after those algorithms have been removed via machine unlearning.
Study on cultural evolution showing minimal social learning can transmit higher-level representations without inference.
Hierarchical RL framework (STEP-HRL) for LLM agents using step-level transitions to reduce computational cost and history length.
Vision-language model critic for automated iterative refinement of frontend code generation with visual feedback loops.
Open-source framework for autonomous LLM agents conducting deep learning experiments with hypothesis formation, training, and iterative refinement.
Diagnostic framework determining when LLMs are necessary for contextual multi-armed bandits with text and numerical context.
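As background for the entry above: the classical numeric-context baseline such a diagnostic would compare an LLM policy against is LinUCB. A minimal sketch follows; the class and function names are illustrative, not from the paper.

```python
import numpy as np

class LinUCBArm:
    """One arm of the classical LinUCB algorithm: a ridge-regression
    reward model per arm plus an exploration bonus. Illustrative only."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)        # ridge-regularized design matrix
        self.b = np.zeros(d)      # reward-weighted context accumulator
        self.alpha = alpha        # exploration strength

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                     # ridge estimate of reward weights
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def choose(arms, x):
    # Pick the arm with the highest upper confidence bound for context x.
    return max(range(len(arms)), key=lambda i: arms[i].ucb(x))
```

The diagnostic question in the entry is essentially whether a text context carries signal this linear model cannot capture after featurization.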
JTON format, JSON superset with Zen Grid encoding for token-efficient structured data processing in LLMs.
Joint knowledge base completion and QA using combined large and small language models for KB-related tasks.
KV cache compression technique for multimodal LLM inference, reducing memory overhead and latency with hybrid compression strategy.
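One common family of hybrid KV-cache compression combines a recency window with importance-based retention. The sketch below shows that generic eviction idea, assuming accumulated attention scores are available per position; it is not the paper's specific strategy.

```python
def evict_kv(scores, recent_window, k):
    """Generic hybrid KV-cache eviction sketch (illustrative, not the
    paper's method): always keep the last `recent_window` positions,
    and among older positions keep the `k` with the highest accumulated
    attention scores. Returns the sorted indices to retain."""
    n = len(scores)
    recent = set(range(max(0, n - recent_window), n))
    older = [i for i in range(n) if i not in recent]
    older.sort(key=lambda i: scores[i], reverse=True)
    return sorted(recent | set(older[:k]))
```

Cache memory then scales with `recent_window + k` rather than with full sequence length.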
Architecture for value-driven LLM agents addressing behavioral rigidity through context-value-action design.
Foundation model enabling single GPT-based agent to perform across diverse multi-agent reinforcement learning tasks and environments.
Research agent framework for generating trustworthy reports with confidence estimation and calibration mechanisms.
Multi-objective preference alignment for LLMs using Pareto-lenient consensus to handle diverse human values in model training.
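The multi-objective setting above rests on Pareto dominance: a candidate survives if no other candidate is at least as good on every objective and strictly better on one. A minimal filter illustrating that standard notion (the paper's "lenient consensus" refinement is not reproduced here):

```python
def pareto_front(points):
    """Return the Pareto-optimal subset of `points`, maximizing every
    objective. A point is dropped if some other point weakly dominates
    it (>= everywhere, and different, hence > somewhere)."""
    return [p for p in points
            if not any(q != p and all(q[i] >= p[i] for i in range(len(p)))
                       for q in points)]
```

Alignment over diverse human values would score responses on each value dimension and keep only this non-dominated set as consensus candidates.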
AI agents for retail supply chain operations, automating demand forecasting, procurement, and inventory replenishment in supermarket chains.
Proposes epistemic blinding, an inference-time auditing protocol to separate memorized priors from data-driven inference in LLM-assisted agentic analysis systems.
Investigates instruction-following mechanisms in LLMs through diagnostic probing, finding evidence for compositional skill deployment over universal mechanism.
Proposes ACE-Bench, agent evaluation benchmark with unified grid-based planning tasks, lightweight environments, and configurable difficulty/horizon control.
Introduces Claw-Eval, an end-to-end evaluation suite for autonomous agents addressing trajectory-opaque grading, safety, and interaction modality coverage.
Theoretical analysis of contextuality in quantum information systems as external bookkeeping cost under classical simulation.
Proposes Web Retrieval-Aware Chunking (W-RAC) for efficient RAG document chunking to balance retrieval quality, latency, and cost on web-scale content.
Proposes Task-Driven Alignment (TDA-RC) for improving reasoning chains in LLMs by bridging logical gaps between CoT and multi-round thought paradigms.
Evaluates bidirectional training objectives (MLM, masked attention) to mitigate the reversal curse in autoregressive language models.
Introduces Inclusion-of-Thoughts (IoT), a strategy to reduce LLM instability on multiple-choice questions by filtering irrelevant distractors.
Proposes SUMMIR framework for ranking sports insights extracted by LLMs, addressing hallucinations with a 7,900-article dataset across four sports.
Evaluates four open-source PDF-to-Markdown conversion frameworks (Docling, MinerU, Marker, DeepSeek OCR) for RAG document preprocessing impact on QA accuracy.
Studies how to design information retrieval systems for LLM agents versus humans, proposing learning-to-rank methods for agent trajectories.
Analysis of how generative AI enables social engineering fraud and trust manipulation attacks in financial crime scenarios.
Surveys transition from heuristic-based to generative synthesis methods for automatic video trailer generation using LLMs and diffusion models.
Opinion piece on environmental and computational costs of scaling LLM agents and implications for planetary boundaries.
Self-supervised foundation model (CalM) trained on neuronal calcium traces for neuroscience task transfer learning.
Proposes MG²-RAG, a multi-granularity graph approach for retrieval-augmented generation in multimodal LLMs to improve cross-modal reasoning without costly text translation.
Independent evaluation of Claude Code's auto mode permission system for AI coding agents, testing security gates on ambiguous authorization scenarios.
Introduces Squeez, a method for pruning tool outputs in coding agents by identifying minimal relevant evidence blocks. Includes an 11,477-example benchmark derived from SWE-bench.
CURE enables privacy-preserving unlearning in LLM-based recommendation systems using circuit-aware techniques for removing user data.
Cactus improves speculative sampling for LLM decoding by relaxing strict distribution matching to allow acceptable variations like top-k sampling.
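For context, the "strict distribution matching" being relaxed is the standard speculative-sampling acceptance rule: accept a draft token x with probability min(1, p(x)/q(x)), else resample from the normalized residual max(0, p − q). The sketch below shows that baseline rule only, not Cactus's relaxation.

```python
import random

def speculative_accept(p, q, x):
    """Standard speculative-sampling acceptance step (background only,
    not Cactus): p is the target model's distribution, q the draft
    model's, and x the token sampled from q. Accept x with probability
    min(1, p[x]/q[x]); on rejection, resample from the residual."""
    if random.random() < min(1.0, p[x] / q[x]):
        return x
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    r = random.random() * sum(residual)
    acc = 0.0
    for i, w in enumerate(residual):
        acc += w
        if w > 0 and r <= acc:   # skip zero-mass tokens
            return i
    return len(p) - 1
```

This rule makes the accepted-token distribution exactly p; relaxations like the one described above trade that exactness for higher acceptance rates.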
Prune-Quantize-Distill pipeline for neural network compression optimizing wall-clock inference time rather than parameter count or FLOPs.
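The distinction in the entry above matters because pruning that shrinks FLOPs can break hardware-friendly shapes and speed nothing up, so candidates must be ranked by measured latency. A minimal sketch of that selection loop, with all names hypothetical:

```python
import time

def wall_clock_ms(fn, warmup=3, iters=20):
    """Median wall-clock latency of a zero-argument model call, in ms.
    Warmup runs absorb one-time costs (JIT, cache fills)."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000.0)
    times.sort()
    return times[len(times) // 2]

def pick_fastest(candidates, min_accuracy):
    """candidates: list of (name, run_fn, accuracy) tuples (hypothetical
    schema). Keep only candidates meeting the accuracy floor, then pick
    the one with the lowest measured latency."""
    ok = [(wall_clock_ms(f), name) for name, f, acc in candidates
          if acc >= min_accuracy]
    return min(ok)[1] if ok else None
```

The same harness applies after each pipeline stage (prune, quantize, distill) so that only stages that actually reduce measured time are kept.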
Analysis of implicit architectural decisions made by AI coding agents, identifying five mechanisms and six prompt-architecture coupling patterns.
FreakOut-LLM framework investigates whether emotionally charged prompts compromise safety alignment in ten LLMs using psychological stimuli.
Comparative evaluation of embedding-based and generative models for document classification, showing Vision-Language Models with CoT achieve 82% zero-shot accuracy.
PRIME enables multimodal self-supervised pretraining for cancer prognosis with missing modalities by combining histopathology, gene expression, and reports.
Case study of closed-loop software development system managing backlog via deterministic pipeline with Jira integration and safety constraints.