Co-Evolution of Policy and Internal Reward for Language Agents
Self-Guide method for co-evolving policy and internal reward in LLM agents, addressing sparse reward bottleneck in long-horizon training.
Knowledge graph completion approach for network alert prediction modeling cyber-attacks as hyper-relational statements.
Benchmarking training-free unlearning methods for removing sensitive visual concepts from vision-language models.
Safety evaluation of Kimi K2.5 open-weight LLM assessing CBRNE misuse, cybersecurity, alignment, and bias risks.
Domain-adapted RAG pipeline using fine-tuned embedding models for pedagogical dialogue act annotation without generative model fine-tuning.
Systematic security evaluation of six OpenClaw-series AI agent frameworks identifying vulnerabilities in tool-augmented LLM agents.
Case study of AI-assisted unit test writing and test-driven refactoring for improving legacy codebase maintainability.
InCoder-32B-Thinking model trained with Error-driven Chain-of-Thought for industrial code generation with reasoning traces.
Method for identifying valence-arousal emotion subspace in LLM representations using steering vectors and PCA.
Survey of contextual enrichment strategies for LLMs from in-context prompting through retrieval-augmented generation and GraphRAG.
Analysis of hallucination effects in reinforcement learning post-training for multimodal LLMs, examining whether RL improves visual reasoning or merely exploits hallucinations.
Research on optimization primitives in context space for AI agents, addressing credit assignment, overfitting, and learning signal challenges.
arXiv paper on multi-teacher knowledge distillation for low-resource abstractive summarization using inter-teacher agreement for supervision routing.
arXiv paper introducing PR3DICTR, open-access PyTorch/MONAI framework for 3D medical image classification and outcome prediction.
arXiv paper on server learning with client filtering to improve federated learning robustness against malicious attacks.
arXiv paper on WiseMind, multi-agent LLM framework inspired by Dialectical Behavior Therapy for reliable and empathetic psychiatric diagnosis.
arXiv paper on AutoCO, LLM-based method coupling OR principles with bidirectional coevolution for complex constraint optimization problems.
arXiv paper on Glia, multi-agent LLM architecture for autonomous computer systems design using specialized agents with empirical feedback loops.
arXiv paper introducing CostBench benchmark for evaluating LLM tool-use agents on cost-optimal planning and adaptation in dynamic environments.
arXiv paper on code-in-the-loop agentic tool use for image forgery detection, unifying low-level artifacts with semantic knowledge from MLLMs.
arXiv paper on ClinicalReTrial, multi-agent system using LLMs to redesign failing clinical trial protocols with actionable recommendations.
arXiv paper on AgenticRed, automated pipeline using in-context learning to evolve red-teaming systems without human-designed workflows.
arXiv paper analyzing the gap between LLM math benchmark performance and real-world application, using the contextual reasoning benchmark ContextMATH.
arXiv survey on autonomous driving using synthetic data and virtual environments for training and evaluation.
arXiv paper on embedding authorization mechanisms directly into LLM reasoning to prevent data leakage and unauthorized command execution.
arXiv paper introducing framework for evaluating harmful AI manipulation through human-AI interaction studies across policy, finance, and health domains.
arXiv paper proposing PAPO, integrating process-level evaluation into policy optimization to improve reasoning quality beyond final-answer correctness.
arXiv paper on multi-agent RAG with adaptive orchestration and evolving agent prompts to handle complex multi-hop reasoning tasks.
Analysis showing LLM reasoning models encode decisions before generating chain-of-thought explanations via linear probes.
Study evaluating reliability and risk of AI systems in medication decision-making and healthcare workflows.
OSCAR framework for mitigating hallucinations in diffusion language models using self-verification during generation.
SAT/MaxSAT framework for solving 2D cutting stock problem in manufacturing optimization.
Research probing whether LLMs encode awareness of conversation continuity by generating user turns after assistant responses.
Novel framework using LLMs for causal graph discovery via breadth-first search, reducing query complexity from quadratic to linear.
Improves emotion intensity and speaker consistency in zero-shot LLM-based text-to-speech through expressive prompt design methods.
Multimodal LLM fine-tuned for interpretable image forgery detection and localization providing semantic understanding beyond low-level artifacts.
Proposes scale transformation method for transferable targeted adversarial attacks requiring minimal data without surrogate model feedback.
Zero-shot concept bottleneck models enabling interpretable predictions without training on the target task.
Improves text-to-video generation semantic and temporal consistency using neuro-symbolic feedback without retraining the model.
LMask framework combines dynamic masking with learning to solve constrained routing problems, framed as combinatorial optimization tasks.
StructEval benchmark systematically evaluates LLM capabilities in generating structured outputs across JSON, HTML, React, SVG and other formats.
Introduces FLEX, multimodal multiview dataset for fitness action quality assessment with professional assessment and multiple sensor modalities.
Uses diffusion models for data-driven galaxy image generation without explicit physical parameters, outperforming simulation-based methods.
Formalizes mission-aligned learning-informed control framework for autonomous physical agents integrating learning with task objectives.
Proposes modular vision-language alignment architecture improving CLIP's handling of multi-object images and caption misalignment.
Introduces ReDef, high-confidence software defect prediction dataset from 22 C/C++ projects, used to evaluate how well code language models understand code changes.
Compares psychometric questionnaire profiles with actual LLM generation behavior across eight open-source models to test the validity of questionnaire-based assessment.
GenAI system enabling parents to create personalized multi-path social narratives for autistic children using generative models.
Generates synthetic robot poses for RGB-D bimanual manipulation data augmentation to improve imitation learning policy training.
Analyzes political bias in LLM training data composition across pre-training and post-training stages to understand sources of model bias.