APIEval-20: A Benchmark for Black-Box API Test Suite Generation
APIEval-20 benchmark dataset for evaluating black-box API test suite generation using LLMs and schemas.
CEO used ChatGPT to terminate studio head; decision was reversed and criticized.
GPU profiling tool that diagnoses performance bottlenecks beyond utilization metrics.
MCP server for AI agents to select appropriate cloud services with current pricing and compatibility data. 74 services, no API key required.
Stanford research showing AI vision models generate images not in training data through hallucination mechanisms.
News about President Trump's press interaction on Air Force One.
TRIBE v2: Predictive AI model of human brain responses to visual, auditory, and language stimuli from neuroscience research.
HD Audio driver for Windows 98SE/ME systems on Intel chipsets with WDM support.
R package that converts Excel workbooks to standalone R scripts with formula recreation and verification against cached values.
LLMnesia: Local-first search tool for AI conversation history across ChatGPT, Claude, Gemini, and other platforms.
Analysis of Meta's legal losses and liability implications from internal social science research on platform effects.
WhisperFlow: Free, open-source speech-to-text tool for macOS. On-device processing, no cloud upload, no account required.
arXiv research on 4D generation from natural language and images using embodied world models. Addresses data scarcity and long-horizon video generation challenges.
arXiv research proposing Balanced Fine-Tuning method for aligning LLMs with biomedical knowledge. Combines SFT and RL using confidence-weighted token optimization for scientific understanding.
arXiv research on streaming video understanding with gaze signal interpretation for AR applications. Evaluates multimodal LLMs on temporal reasoning with human attention signals.
arXiv research on multimodal memory architecture for long-form video understanding. Addresses context capacity and visual detail retention in hours-long videos using dynamic memory mechanisms.
Post-training method for lower-resource languages that preserves fluency even when alignment relies on disfluent reward models, addressing the scarcity of preference optimization data.
Feed-forward transformer model predicting 3D object articulations including parts, kinematic structure, and motion constraints for articulated object understanding.
Cascaded reinforcement learning infrastructure for scaling general-purpose reasoning models, addressing heterogeneity in response lengths and verification latency.
SonicMoE optimizes Mixture of Experts model inference through IO and tile-aware techniques, accelerating high-sparsity MoE architectures for language models.
Deep learning method for radio path loss prediction in multi-transmitter 5G scenarios, addressing distribution shifts and environmental generalization.
Dual-objective language model combining autoregressive and masked-diffusion training without architectural changes, improving efficiency and reducing overfitting.
Medical report generation using reinforcement learning with clinical alignment objectives, improving correctness over token-level likelihood training approaches.
Study comparing SpeechLLMs that directly process speech for translation against cascaded transcription pipelines, evaluating speech modality integration effectiveness.
Dual-State Architecture formalizes execution primitives coupling stochastic LLM generation with deterministic verification guards for reliable code generation agents.
Benchmark evaluating LiDAR 3D perception model robustness under simultaneous domain shifts and label-space evolution in autonomous driving scenarios.
Crucible system augments RAG with Q&A nuggets from documents, preserving citation provenance and improving extraction, selection, and report generation.
Study examining risks of RAG system evaluation and optimization using LLM judges, revealing circularity issues in nugget-based evaluation approaches.
CARPE method improving vision-centric capabilities of vision-language models through context-aware image representation prioritization via ensemble approach.
Framework addressing LLMs' tendency to collapse ambiguous inputs prematurely by mapping text to non-collapsing state spaces for better dialogue reasoning.
Study introducing VAPT toolkit to evaluate how LLMs extract, embody, and explain human values from conversations through user perception research.
Benchmark for evaluating multimodal LLMs on handwritten STEM student solutions with mathematical formulas and diagrams, addressing authentic domain-specific evaluation gaps.
TernaryLM: Language model trained natively with 1.5-bit quantization achieving memory-efficient deployment on edge devices while maintaining language modeling capability.
Video generation model for precise instance insertion with sparse control in filmmaking applications, moving beyond prompt-engineering toward controllable generation.
Benchmark evaluating LLM-based coding agents on their ability to learn from context and reuse experience across related software engineering tasks in repositories.
Administrative law analysis of how government agencies balance technological capability with democratic oversight and accountability mechanisms.
Comparative study of CNN architectures (VGG, ResNet, GoogLeNet) analyzing the relationship between depth and trainability in image recognition.
DUET-VLM: dual-stage token reduction framework for vision-language models reducing computational cost while maintaining accuracy during training and inference.
PedaCo-Gen: pedagogically-informed human-AI system for collaborative instructional video generation using Cognitive Theory of Multimedia Learning.
Layer gradient analysis method for identifying optimal layers in LLMs for knowledge editing while preserving model behavior on unrelated inputs.
Extension of ptychographic imaging to overlap-free single-shot coherent diffractive imaging using physics-informed neural networks.
SpotIt+: open-source verification tool for Text-to-SQL evaluation using bounded equivalence checking and constraint-mining for practical query discrepancies.
DiFlowDubber: two-stage approach for automated video dubbing using discrete flow matching for expressive prosody and precise audio-visual synchronization.
Method for measuring physical frame rate from visual dynamics in generative video models to improve temporal consistency.
AgentTrace: lightweight framework for post-hoc root cause analysis in deployed multi-agent systems using causal graph tracing from execution logs.
Study showing LLMs struggle with code generation for private libraries even when given API documentation; proposes teaching methods for private-library-oriented code generation.
Analysis of multimodal LLMs generating natural language explanations for face verification decisions on unconstrained images.
Goedel-Code-Prover: hierarchical proof search framework for automated code verification in Lean 4 using LLMs to decompose complex verification goals.
Analysis of how AI scaling laws reshape classical Amdahl's Law for modern heterogeneous computer architectures with specialized accelerators and tensor datapaths.
KG-Hopper: reinforcement learning framework enabling compact open-source LLMs to perform knowledge graph reasoning for multi-hop KBQA tasks.