Xinyu Wang, Hanwei Wu, Jingwei Song, Shuyuan Zhang, Jiayi Zhang, Fanqi Kong, Tung Sum Thomas Kwok, Xiao-Wen Chang, Yuyu Luo, Chenglin Wu, Bang Liu

Co-Evolution of Policy and Internal Reward for Language Agents

Self-Guide, a method for co-evolving the policy and internal reward in LLM agents, addressing the sparse-reward bottleneck in long-horizon training.

Zheng-Xin Yong, Parv Mahajan, Andy Wang, Ida Caspary, Yernat Yestekov, Zora Che, Mosh Levy, Elle Najt, Dennis Murphy, Prashant Kulkarni, Lev McKinney, Kei Nishimura-Gasparian, Ram Potham, Aengus Lynch, Michael L. Chen

An Independent Safety Evaluation of Kimi K2.5

Safety evaluation of the open-weight LLM Kimi K2.5, assessing CBRNE misuse, cybersecurity, alignment, and bias risks.

Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Tuney Zheng, Fanglin Xu, Weicheng Gu, Lin Jing, Yaxin Du, Joseph Li, Yizhi Li, Yan Xing, Chuan Hao, Ran Tao, Ruihao Gong, Aishan Liu, Zhoujun Li, Mingjie Tang, Chenghua Lin, Siheng Chen, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv

InCoder-32B-Thinking: Industrial Code World Model for Thinking

InCoder-32B-Thinking, a model trained with Error-driven Chain-of-Thought for industrial code generation with reasoning traces.

Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, Hari Balakrishnan

Glia: A Human-Inspired AI for Automated Systems Design and Optimization

arXiv paper on Glia, a multi-agent LLM architecture for autonomous computer-systems design that uses specialized agents with empirical feedback loops.

Fanrui Zhang, Qiang Zhang, Sizhuo Zhou, Jianwen Sun, Chuanhao Li, Jiaxin Ai, Yukang Feng, Yujie Zhang, Wenjie Li, Zizhen Li, Yifan Chang, Jiawei Liu, Kaipeng Zhang

Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection

arXiv paper on code-in-the-loop agentic tool use for image forgery detection, unifying low-level artifacts with semantic knowledge from MLLMs.

Jiayi Yuan, Jonathan Nöther, Natasha Jaques, Goran Radanović

AgenticRed: Evolving Agentic Systems for Red-Teaming

arXiv paper on AgenticRed, an automated pipeline that uses in-context learning to evolve red-teaming systems without human-designed workflows.

Bowen Cao, Dongdong Zhang, Yixia Li, Junpeng Liu, Shijue Huang, Chufan Shi, Hongyuan Lu, Yaokang Wu, Guanhua Chen, Wai Lam, Furu Wei

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

arXiv paper analyzing the gap between LLM math-benchmark performance and real-world application through ContextMATH, a contextual-reasoning benchmark.

Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger

Evaluating Language Models for Harmful Manipulation

arXiv paper introducing a framework for evaluating harmful AI manipulation through human-AI interaction studies across policy, finance, and health domains.

Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov, Rajagopal Venkatesaramani

Therefore I am. I Think

Analysis using linear probes to show that LLM reasoning models encode their decisions before generating chain-of-thought explanations.

Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa, Yasutoshi Ida

Zero-shot Concept Bottleneck Models

Zero-shot concept bottleneck models that enable interpretable predictions without target-task training by leveraging zero-shot learning.

Jialin Yang, Dongfu Jiang, Lipeng He, Sherman Siu, Yuxuan Zhang, Disen Liao, Zhuofeng Li, Huaye Zeng, Yiming Jia, Haozhe Wang, Benjamin Schneider, Chi Ruan, Wentao Ma, Zhiheng Lyu, Yifei Wang, Yi Lu, Quy Duc Do, Ziyan Jiang, Ping Nie, Wenhu Chen

StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

StructEval, a benchmark that systematically evaluates LLM capabilities in generating structured outputs across JSON, HTML, React, SVG, and other formats.