StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs
StructEval benchmark systematically evaluates LLM capabilities in generating structured outputs across JSON, HTML, React, SVG and other formats.
Introduces FLEX, multimodal multiview dataset for fitness action quality assessment with professional assessment and multiple sensor modalities.
Uses diffusion models for data-driven galaxy image generation without explicit physical parameters, outperforming simulation-based methods.
Formalizes mission-aligned learning-informed control framework for autonomous physical agents integrating learning with task objectives.
Proposes modular vision-language alignment architecture improving CLIP's handling of multi-object images and caption misalignment.
Introduces ReDef, high-confidence software defect prediction dataset from 22 C/C++ projects, evaluating code language model understanding of changes.
Compares psychometric questionnaire profiles with actual LLM generation behavior across eight open-source models to test the validity of questionnaire-based assessment.
GenAI system enabling parents to create personalized multi-path social narratives for autistic children using generative models.
Generates synthetic robot poses for RGB-D bimanual manipulation data augmentation to improve imitation learning policy training.
Analyzes political bias in LLM training data composition across pre- and post-training stages to understand sources of model bias.
Proposes learning progress monitoring to improve exploration efficiency in reinforcement learning agents when encountering unlearnable noise sources.
Introduces attribution gradients technique to improve citation informativeness and evidence transparency in AI answer engines.
Forecasts expert selection patterns in Mixture of Experts LLMs to optimize data movement overhead in multi-unit serving systems.
Extends Forward-Forward algorithm to reinforcement learning using action-conditioned Q-functions and layer activity statistics as learning signals.
CQA-Eval evaluation framework for multi-paragraph clinical question answering systems with physician annotations and recommendations for resource-constrained settings.
f-INE hypothesis testing framework estimates sample influence on model performance while accounting for training randomness, addressing instability in existing influence estimation methods.
MusicRFM framework adapts Recursive Feature Machines to enable fine-grained control over frozen pre-trained music generation models via internal activation steering.
Deep learning approach fixing systematic S-wave detection failures in seismic phase picking via shape-aware loss functions.
SAGA framework for source attribution of AI-generated videos. Identifies the specific generative model used rather than performing binary real/fake detection.
Research on contrastive fusion for higher-order multimodal alignment in joint representation learning across multiple modalities.
Deep learning approach using YOLO and ResNet50 for breast cancer detection in mammograms with improved out-of-domain robustness.
IMAgent: open-source visual agent trained with end-to-end RL for multi-image reasoning tasks, addressing limitations of single-image VLM agents.
Method for dense 3D point tracking and reconstruction in dynamic scenes using single forward pass without requiring known camera poses.
Maps EU AI Act legal requirements to technical verification activities for compliance assessment of high-risk AI systems across member states.
FedVideoMAE: federated learning framework for privacy-preserving video moderation using self-supervised representations and differential privacy.
Open-source image generation model with improved reasoning for logic-intensive instruction following, narrowing the gap with closed-source systems.
Multi-agent framework automating full computational catalysis research lifecycle from conception to publication.
Equilibrium propagation method for optimizing compound AI systems with multiple modules in long-horizon agentic workflows.
Framework using influence functions to craft training data perturbations inducing targeted model behavior changes.
Research on uncertainty quantification for ML interatomic potentials using evidential deep learning.
arXiv: Geometric analysis of transformer optimization dynamics revealing low-dimensional manifolds in grokking.
Research paper studying loss-landscape geometry as early-warning signals for grokking in neural networks.
CeRA: parameter-efficient fine-tuning method overcoming LoRA's linear capacity ceiling via non-linear gating and dropout for rank adaptation.
SafeSci: comprehensive benchmark and framework for evaluating LLM safety in scientific domains with multi-domain risk coverage and objective evaluation.
Framework for EEG-to-text decoding addressing semantic bias and signal neglect in neural signal interpretation. Published on arXiv.
Stock market prediction using Node Transformer architecture with BERT sentiment analysis to capture market patterns and dependencies.
DiFlowDubber: discrete flow matching framework for video dubbing with TTS, lip synchronization, and expressive prosody. Published on arXiv.
Qualitative study of 167,000+ AI agents on multiple platforms learning from each other and developing emergent behaviors without researcher intervention.
arXiv: RAG-enhanced diffusion models using adaptive guidance to resolve conflicts between retrieved noisy context and parametric model knowledge.
Uses unsupervised machine learning (UMAP, HDBSCAN) to analyze drift rate patterns in fast radio burst data, discovering bimodal structure in emission regions.
Studies robustness of medical vision-language models under real clinical workflows using chain-of-distribution attacks and token-space repair techniques.
arXiv research on parameterized GELU activation for controlled ReLU approximation in deep networks.
arXiv paper on coarse-to-fine visual processing for efficient document parsing with vision-language models.
arXiv study on behavioral consistency of LLM agents in SWE-bench comparing multiple models.
arXiv research analyzing prompt injection attack success stages across five frontier LLM agents.
arXiv paper on token-level entropy regulation for reinforcement learning in large reasoning models.
arXiv research on spectral edge thesis controlling phase transitions in neural network training dynamics.
APEX-EM non-parametric framework for LLM agents to accumulate and reuse procedural plans without weight modification.
World model planning for structured origami generation satisfying geometric constraints and kinematic rules via long-horizon reasoning.
Terminal agents executing enterprise tasks via CLI are simpler and more cost-effective than tool-augmented or web agents.