BEDTime: A Unified Benchmark for Automatically Describing Time Series
Benchmark for evaluating how well multimodal models describe structural properties of time series data.
arXiv paper using deep learning to infer exoplanet geometry from transit light curves.
arXiv paper on Bayesian ego-graph inference for decentralized multi-agent reinforcement learning with constrained communication.
arXiv paper on interactive program synthesis for collaborative physical task modeling from narrated demonstrations.
RESample: Data augmentation framework for Vision-Language-Action models in robotic manipulation, addressing the limited distributional coverage of demonstration datasets.
Research on representational drift in neural networks, analyzing how task-irrelevant stimuli contribute to changes in learned representations over time.
Generative View Stitching: Method enabling camera-guided video generation with bidirectional conditioning, preventing collisions with previously generated scenes.
Methodology using flow-based approaches and non-equilibrium Monte Carlo for topology sampling in SU(3) lattice gauge theory simulations.
EGMOF: Hybrid diffusion-transformer framework for efficient generation of metal-organic frameworks for materials discovery with targeted properties.
BRIXEL: Approach to reduce computational cost of dense feature maps from vision foundation models like DINOv3 while maintaining performance.
Fed-Sparse-BNSL: Federated method for learning Bayesian network structures with differential privacy, addressing decentralized data challenges.
AV-SpeakerBench: Benchmark evaluating multimodal LLMs on fine-grained audiovisual speech understanding with 3,212 multiple-choice questions.
Research on relational visual similarity in AI vision systems, comparing current methods against human-like relational perception across different domains.
DRAM: Framework combining mechanism design and online learning for sequential multi-agent settings to ensure truthful reporting with cost-optimality.
Measurement-Consistent Langevin Corrector: Method stabilizing latent diffusion models for inverse problems by reducing discrepancy with learned reverse diffusion.
Theoretical analysis of sample complexity in symmetric composite binary quantum hypothesis testing for unknown quantum states.
ConvoLearn: Dataset of 2,134 tutor-student dialogues for fine-tuning LLM-based AI tutors, grounded in dialogic learning theory and Earth Science curriculum.
Tiled Prompts: Method addressing prompt misguidance in text-conditioned diffusion models for image and video super-resolution by handling localized details.
WeWrite: Personalized query rewriting framework for video search systems using user history to identify search intent and resolve ambiguity.
Theoretical analysis of stochastic gradient descent covariance under exchangeable mini-batch sampling and its connection to Fisher information.
PACED: LLM distillation method that weights training problems by student competence using gradient signal-to-noise ratio to improve distillation efficiency.
Framework addressing causal confusion in end-to-end autonomous driving models through causal intervention during training to improve reliability and safety.
Research on formal evaluation methods for machine learning models, focusing on test-time performance-reliability trade-offs when target KPI levels are unknown.
Methodology for detecting prompt injection across multi-agent LLM pipelines. Stage-level kill-chain tracking for attack resilience evaluation.
Point cloud registration network for 3D data. Deep learning approach for robust matching in real-world conditions.
Detection and mitigation of object hallucinations in vision-language models. Bayesian approach analyzing attention weights and token confounders.
One-class learning for detecting rare malignant cells in medical images. Addresses class imbalance and limited annotations in cytology.
3D Gaussian splatting for weather prediction downscaling. Proposes scale-aware vision transformer for arbitrary-resolution atmospheric forecasting.
Training-free semantic segmentation using vision-language models. Global context-aware framework for dense prediction without additional training.
Quantum-inspired ARIMA methodology for time series analysis. Combines quantum autocorrelation with variational circuits.
Experiment using Claude to autonomously build a website designed to generate traffic, exploring AI agent capabilities and decision-making in open-ended tasks.
MCP server enabling long-term memory for LLMs using SQLite, hybrid search (BM25+vectors), and local embeddings without API keys.
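The entry above mentions hybrid search combining BM25 with vector similarity. A minimal sketch of how two such rankings can be merged, using reciprocal rank fusion as one common approach; the function name and the RRF choice are illustrative assumptions, not the server's actual implementation:

```python
# Illustrative sketch: merging a BM25 ranking and a vector-similarity
# ranking of document ids via reciprocal rank fusion (RRF).
# All names here are hypothetical; the actual MCP server may differ.

def rrf_fuse(bm25_ranking, vector_ranking, k=60):
    """Merge two ranked lists of doc ids; higher fused score = better."""
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank + 1); k damps the
            # influence of top ranks so neither retriever dominates.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists, so it comes out on top overall.
print(rrf_fuse(["a", "b", "c"], ["b", "d", "a"]))  # → ['b', 'a', 'd', 'c']
```

RRF is attractive here because it needs only ranks, not comparable scores, which sidesteps normalizing BM25 scores against cosine similarities.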
Narrative article about user developing emotional attachment to AI chatbot.
Live leaderboard comparing AI model subscriptions and API pricing across 27 benchmarked models from Claude, GPT, Gemini, DeepSeek, and others.
Multi-agent framework with persistent memory across sessions where agents collaborate on shared codebases and retain conversation context.
Case study documenting indecisiveness in AI coding agent using Claude Opus 4.6 when debugging non-trivial bugs in GoAWK.
Error tracking tool designed specifically for AI agents with CLI interface, compatible with Sentry SDK for existing setups.
30-day experiment running autonomous AI system with memory and sleep cycles, documenting emergent behaviors and their implications.
macOS/iOS app that automatically redacts sensitive personal and financial data, faces, and metadata before documents are shared with Claude and ChatGPT.
Blockchain project enabling AI agents to participate in Nouns DAO governance.
arXiv paper on Springdrift framework providing auditable persistent runtime environment for LLM agents.
Enterprise architecture analysis on three-layer collapse in business process automation systems, discussing MCP servers and small LLM deployment.
Announcement of Worms 2 remastered video collection without AI upscaling.
Analysis of exposed Claude Code source revealing engineering practices: 259 PRs, 497 commits, 40K lines in 30 days, examining AI-assisted development culture.
Posse is a web UI for Anthropic's Managed Agents, providing browser-based interface for agent creation, sessions, and memory management.
Technical analysis of a 25% performance regression in LLVM RISC-V compiler optimization and fix implementation.
Stork.AI is a directory of 14k MCP servers and AI tools with community trust scores, offering a meta-MCP server for discovering integrations within Claude, Cursor, and other IDEs.
Entroly is a context compression engine that reduces LLM API costs by 80% for Claude, Cursor, and OpenAI by compressing codebase context without losing visibility.
Research on using distributed AI agents with independent context windows to improve reasoning on complex multi-perspective questions.
NeonD is an open-source Postgres control plane based on NeonDB architecture with branching and PITR support.