AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
AgentSocialBench evaluates privacy risks in collaborative multi-agent social networks with persistent LLM agents.
Modal framework for knowledge representation that handles domain-specific shifts in concept meaning within knowledge graphs.
XpertBench evaluates LLM performance on expert-level open-ended tasks with rubrics-based assessment.
Addresses value hallucination in Dyna reinforcement learning agents through multistep predecessor models.
VLBiasBench evaluates biases in large vision-language models across diverse domains and question formats.
Study of the app metamorphosis phenomenon, in which mobile apps undergo significant market repositioning.
MegaFake dataset of LLM-generated fake news for understanding mechanisms behind AI-generated misinformation.
SPRIG optimizes system prompts for LLMs using genetic algorithms to improve general task performance.
Comprehensive survey of document parsing techniques for extracting structured information from unstructured documents.
Certified Training with Branch-and-Bound for learning verifiably stable neural control systems.
RIRS framework for multi-agent RAG systems to route complex questions across distributed knowledge bases.
Human-AI collaboration for game testing using vision language models to enhance manual testing efficiency.
Framework for statistical inference on detected changepoints in sequential analysis with confidence sets.
Review of anomaly detection techniques for cyber-physical systems security in critical infrastructure.
Reasoning Model Implicit Association Test studies implicit bias-like patterns in LLMs that use step-by-step reasoning.
BalancedDPO method aligns diffusion models with multiple conflicting evaluation metrics for text-to-image generation.
Open-source benchmark for 3D chip design using the OpenROAD framework; evaluates power, performance, area, and thermal metrics.
Investigates alignment of causal attribution scores (Shapley, Banzhaf, Causal Responsibility) for database tuple relevance in data management.
RaPA improves transferable targeted adversarial attacks by identifying and pruning redundant surrogate model parameters.
Online test-time adaptation method for spiking neural networks via threshold modulation, enabling edge deployment that handles distribution shift.
FSD bridges reasoning and decision-making in robotic manipulation by combining Vision-Language Models with action prediction for zero-shot generalization.
Bayesian ablation framework for interpreting latent task representations in neural networks, enabling probabilistic analysis of learned representations.
VERDI uses Vision-Language Models embedded in the autonomous driving stack for reasoning-based trajectory planning under partial observability.
Chapter reviewing ML/AI applications in food processing, covering classification frameworks and data science approaches to food informatics.
SoSBench evaluates LLM safety alignment across six scientific domains with sophisticated, knowledge-intensive adversarial prompts.
Framework for evaluating LLM judges of LLM outputs, accounting for both sampling and judge quality uncertainty without gold-standard scores.
K-Steering enables unified multi-attribute control of LLM behavior at inference time using non-linear steering on hidden activations.
PhysGaia benchmark for dynamic novel view synthesis with physics-aware evaluation of multi-body interactions and realistic collisions.
LLMs applied to combinatorial optimization of Design Structure Matrices in engineering, demonstrating reasoning capabilities for complex system reorganization.
ZINA detects and edits fine-grained hallucinations in multimodal LLMs, proposing a novel evaluation task for MLLM quality.
Vision Transformer-based framework reconstructs multispectral satellite imagery obscured by clouds using SAR data for crop mapping.
PRISM: lightweight fully convolutional model for multivariate time-series classification on edge devices.
Framework treating prompts as first-class citizens in LLM pipelines to enable reuse, optimization, and runtime adaptation in complex agent systems.
CATNet applies geometric deep learning (R-GCN) to catastrophe bond spread prediction in financial markets.
Embodied-R1 introduces a 3B VLM using "pointing" as a unified intermediate representation to address the seeing-to-doing gap in robotic manipulation across different embodiments.
ShadowNPU system co-design for efficient on-device LLM inference on NPUs, addressing quantization sensitivity in attention operators.
Benchmarking study of deep learning segmentation models for carotid artery structures in histopathological images with limited datasets.
DoubleAgents system for human-agent alignment in coordination tasks using a coordination agent and dashboard for preference elicitation and feedback.
Neural-MedBench reasoning-intensive benchmark for evaluating clinical reasoning ability of vision-language models beyond classification accuracy.
Vid-Freeze defense mechanism against malicious image-to-video generation using temporal freezing adversarial techniques.
MedIRT psychometric framework that uses Item Response Theory to evaluate LLM medical competency rather than benchmark-specific performance.
ACT system combines decision trees with LLMs to provide transparent, interpretable, and auditable AI decisions on unstructured data.
Study of how autonomy levels in LLM agents affect user privacy concerns and trust, with implications for personalization design.
FURINA-Builder multi-agent pipeline for automatically constructing customizable role-playing benchmarks at scale for evaluating LLM agent behavior.
Security analysis of LLM pruning methods showing vulnerabilities in popular inference engines like vLLM when models are pruned before deployment.
Survey of image and video restoration techniques for adverse weather conditions in intelligent transportation systems and autonomous driving.
IoT and wireless sensor networks for industrial monitoring and control using NRF transceivers and Arduino microcontrollers.
Watermarking technique for LLMs using syntactic predictability to balance text quality against detection robustness for governance and trustworthiness.
XModBench benchmark measures cross-modal consistency and modality-specific biases in omni-modal large language models across audio, vision, and text.
Game-theoretic framework for evaluating LLMs on subjective and open-ended tasks beyond fixed-format benchmarks with reference answers.