Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization
Frequency-based data curation method for selecting calibration data to preserve LLM performance during post-training pruning and quantization.
Examines prompt framing effects on LLM decision-making in threshold voting tasks across model families under isolated, non-interactive settings.
HR Simulator: Game-based evaluation of LLMs navigating complex workplace social norms like giving feedback and rejecting requests appropriately.
Identifies first-mover bias in SHAP explanations from gradient boosting's sequential fitting causing attribution instability under multicollinearity.
Step-level faithfulness evaluation shows chain-of-thought reasoning in frontier LLMs is often decorative, post-hoc narrative rather than genuine reasoning.
Code Review Agent Benchmark: Dataset and evaluation framework for assessing AI agents' ability to review code quality in generated codebases.
DiffAttn: Diffusion-based framework for predicting drivers' visual attention using LLM-enhanced semantic reasoning for intelligent vehicles.
SABLE: Semantics-aware backdoor attack on federated learning using realistic, in-distribution visual triggers instead of synthetic patterns.
MemFactory: Unified framework for training and inference of memory-augmented LLM agents with reinforcement learning optimization of memory operations.
Compares GraphRAG with VectorRAG for retrieval-augmented generation, showing simpler vector-based approaches handle chunk relationships effectively.
DVGT-2: Vision-Geometry-Action model for autonomous driving using dense 3D geometry instead of language descriptions for planning.
Analyzes safety, security, and cognitive risks in world models used for autonomous decision-making in robotics and agentic AI systems.
Uses sparse autoencoders to automatically annotate morphological traits in biological organism images, automating expert-driven extraction process.
Demonstrates environment-injected memory poisoning attacks on LLM-based web agents through contamination persisting across sessions without direct memory access.
GraphicDesignBench: First comprehensive benchmark for evaluating AI on professional graphic design tasks including layout translation and typographic rendering.
Identifies sparse routing mechanism in alignment-trained language models where gate attention heads trigger refusal responses, validated across 9 models from 6 labs.
Vero: Open-source family of vision-language models matching proprietary systems on visual reasoning tasks using reinforcement learning with public recipes and data.
arXiv research benchmark for evaluating AI performance on graphic design tasks, measuring model capabilities in visual design domains.
VitalNexa is an AI health agent that analyzes lab results and wearable data to provide personalized health recommendations and biological age scoring.
Author presents a protocol addressing AI's structural tendency to agree and sound authoritative, which causes subtle reality distortions in outputs rather than outright hallucinations.
Technical analysis of practical constraints preventing AI agents from autonomous operation, mapping barriers and their severity.
Practical analysis of operational and technical barriers preventing autonomous AI agents, mapping constraints in agent economy.
Pydantic-resolve is a declarative data assembly library using DataLoader pattern to eliminate N+1 queries across REST, GraphQL, and MCP protocols.
2019 article about OpenAI's decision not to release GPT-2 over safety concerns (title only, no content).
Case study of server overload caused by LLM scraper bots making excessive HTTPS requests to acme.com domain.
Video of Sam Altman discussing AI development (title only, no content provided).
Research finding that larger and more instructable language models show decreased reliability (title only).
Static analysis tool detecting ReDoS vulnerabilities in Python regular expressions with automatic fixes.
Marketing content for oral dissolving peptides supplement product.
Omni Voice is a multilingual AI voice cloning and text-to-speech platform supporting 646 languages with a unified model.
Drive9 is agent-native data infrastructure providing filesystem-like interface with semantic search, embedding, and full-text indexing for AI agents.
Analysis of the impact when non-technical teams reach 100% adoption of AI coding agents (title only).
Overview of how AI is transforming legal work by automating research, document review, and drafting tasks for lawyers and paralegals.
Kylrix is an open-source privacy suite offering note, vault, task, and communication alternatives to Google services.
Release Please automates changelog generation, GitHub releases, and version bumping via conventional commit parsing.
Analysis of MCP connection model security: agent frameworks keep all integrations live during sessions, creating unnecessary attack surfaces and costs.
GitHub Copilot CLI now supports bring-your-own-key models and local models via Azure OpenAI, Anthropic, or OpenAI-compatible endpoints.
Google's JSIR: open-source high-level intermediate representation for JavaScript code analysis and transforms.
Open-source spec-driven integration framework for API sprawl, enabling governed AI agent integration with SaaS/microservices.
HN discussion on tools for enforcing LLM/agent call limits at runtime rather than just monitoring, addressing cost control in agent systems.
SQLite-backed cost analytics tool for Claude Code sessions, tracking token spend, cache hits, and budget enforcement.
Thread initiation post about BSD operating systems and AI with minimal content and formatting guidelines.
Technique for scaling LLM-based vulnerability scanning across multiple files using strategic prompting and structured output for security analysis.
Opinion piece on evolution of AI agent development tools in 2026, discussing market consolidation and accessibility barriers for non-programmers.
FUSE-based filesystem interface for MongoDB-compatible databases, enabling file-based querying without drivers.
Claude Code skill that builds knowledge graphs from multimodal inputs to help developers understand codebase structure and architecture.
Crag: Governance compiler for AI coding tools that unifies configuration across 12 targets with 96.4% accuracy, solving multi-tool consistency.
Research preprint on blind-spot failures in LLM coding agents, proposing causal interpretation framework for improved agent reliability and rescue mechanisms.
MCP-compatible Chrome browser control for AI agents. Integrates with Claude, Cursor, Kiro clients. Supports human intervention for CAPTCHAs/MFA.
WSLg enables running Linux GUI applications on Windows via X11/Wayland integration.