ConCuR: Conciseness Makes State-of-the-Art Kernel Generation

1The Hong Kong University of Science and Technology, 2Zhejiang University, 3University of Cambridge, 4Westlake University
(Work done while Lingcheng was a visiting student at ENCODE Lab, Westlake University.)

*Corresponding author: wanghuan [at] westlake [dot] edu [dot] cn

Abstract

GPU kernel generation by LLMs has recently experienced rapid development, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the scarcity of high-quality data: most high-performance kernels are proprietary and not open-source, which prevents us from using supervised fine-tuning to align LLMs to the kernel generation task. To address this challenge, we develop a pipeline that generates and curates high-quality CUDA kernels with reasoning traces, motivated by a critical observation: concise yet informative reasoning traces lead to robust generation of high-performance kernels. Using this pipeline, we construct our dataset, ConCuR, and introduce our model, KernelCoder, which is, to our knowledge, the first model trained on a curated dataset of paired PyTorch code, reasoning traces, and CUDA kernels. In the KernelBench setup, our model achieves significant improvements over the existing top-performing model, QwQ-32B, and outperforms all open-source models fine-tuned for kernel generation, as well as frontier models such as DeepSeek-V3.1-Think and Claude-4-sonnet. Finally, we show that average reasoning length can serve as a metric for assessing the difficulty of kernel generation tasks. These observations, metrics, and our data collection and curation pipeline can help the community obtain better data for kernel generation in the future.

Overview of the Data Gathering Pipeline

[Figure: Data gathering pipeline]

Overview of our two-stage data gathering pipeline. The first stage synthesizes CUDA kernels with corresponding CoTs and runs unit tests on each generated kernel to verify its correctness and measure its speedup over the torch eager implementation. The second stage balances the types of tasks in the dataset and selects high-quality reasoning traces based on two criteria: (1) the speedup over torch eager, and (2) the length of the corresponding CoT in tokens. A sketch of this selection step is given below.
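The following is a minimal Python sketch of what such a second-stage curation filter could look like. The record fields (`correct`, `speedup`, `cot_tokens`, `task_type`), the speedup threshold, and the per-type budget are illustrative assumptions, not the paper's actual values.

```python
from collections import defaultdict

def curate(records, min_speedup=1.0, per_type_budget=50):
    """Stage-2 curation sketch: keep correct kernels that beat torch eager,
    prefer concise reasoning traces, and balance across task types.
    Field names and thresholds are illustrative assumptions."""
    by_type = defaultdict(list)
    for r in records:
        # Stage 1 already attached unit-test results and measured speedup.
        if r["correct"] and r["speedup"] >= min_speedup:
            by_type[r["task_type"]].append(r)

    curated = []
    for task_type, group in by_type.items():
        # Rank by high speedup first, then by short CoT (conciseness).
        group.sort(key=lambda r: (-r["speedup"], r["cot_tokens"]))
        curated.extend(group[:per_type_budget])  # balance task types
    return curated
```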

Observations

[Figure: Reasoning Length Analysis]

Observation 1: The shorter the reasoning, the higher the correctness rate

It is commonly assumed that a more challenging task requires more thinking tokens, and hence that a correct generation with a long reasoning trace makes for high-quality training data. Our observations show otherwise: although more challenging tasks do typically require more reasoning tokens, for the same task, CUDA kernels generated after shorter reasoning traces tend to be correct more often than those produced through longer ones.
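One way to probe this effect, sketched below under assumed field names (`task_id`, `cot_tokens`, `correct`), is to split each task's samples at the median reasoning length and compare the correctness rates of the two halves; this is an illustration of the kind of analysis involved, not the paper's exact procedure.

```python
from collections import defaultdict
from statistics import median

def short_vs_long_accuracy(samples):
    """Per task, compare correctness of samples with shorter-than-median
    vs. longer-than-median reasoning traces. Field names are assumptions."""
    by_task = defaultdict(list)
    for s in samples:
        by_task[s["task_id"]].append(s)

    results = {}
    for task, group in by_task.items():
        m = median(s["cot_tokens"] for s in group)
        short = [s for s in group if s["cot_tokens"] <= m]
        long_ = [s for s in group if s["cot_tokens"] > m]

        def acc(g):
            return sum(s["correct"] for s in g) / len(g) if g else float("nan")

        # Observation 1 predicts acc(short) >= acc(long_) for most tasks.
        results[task] = (acc(short), acc(long_))
    return results
```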

[Figure: Speedup vs. Reasoning Length Analysis]

Observation 2: Speedup is largely independent of reasoning length

Conventionally, longer reasoning, which entails more time spent thinking, is expected to improve model performance. In CUDA kernel generation, however, we find that prolonged reasoning does not necessarily yield higher-quality kernels; instead, it may introduce redundant steps that improve neither correctness nor execution efficiency.
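A simple way to quantify this independence is a rank correlation between CoT length and speedup over the correct kernels. Below is a minimal sketch using SciPy's `spearmanr`; the sample field names are assumptions, and this illustrates the kind of check involved rather than the paper's exact analysis.

```python
from scipy.stats import spearmanr

def length_speedup_correlation(samples):
    """Spearman rank correlation between CoT length and speedup,
    restricted to correct kernels. A rho near 0 supports Observation 2."""
    correct = [s for s in samples if s["correct"]]
    lengths = [s["cot_tokens"] for s in correct]
    speedups = [s["speedup"] for s in correct]
    rho, p_value = spearmanr(lengths, speedups)
    return rho, p_value
```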

Main Results

BibTeX

@article{ConCuR2025,
  title={ConCuR: Conciseness Makes State-of-the-Art Kernel Generation},
  author={Lingcheng Kong and Jiateng Wei and Hanzhang Shen and Huan Wang},
  journal={arXiv preprint arXiv:2510.07356},
  year={2025}
}