Abstract
GPU kernel generation by LLMs has advanced rapidly in recent years, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the scarcity of high-quality data, as most high-quality kernels are proprietary and not open-source. This scarcity prevents us from leveraging supervised fine-tuning to align LLMs to the kernel generation task. To address this challenge, we develop a pipeline that generates and curates high-quality CUDA kernels with reasoning traces, motivated by a critical observation that concise yet informative reasoning traces lead to robust generation of high-performance kernels. Using this pipeline, we construct our dataset ConCuR and introduce our model KernelCoder, which, to our knowledge, is the first model trained on a curated dataset of paired PyTorch programs, reasoning traces, and CUDA kernels. In the KernelBench setup, our model achieves significant improvements over the existing top-performing model, QwQ-32B, and outperforms all open-source models fine-tuned for kernel generation, as well as frontier models such as DeepSeek-V3.1-Think and Claude-4-sonnet. Finally, we show that the average reasoning length can serve as a metric to assess the difficulty of kernel generation tasks. The observations, metrics, and our data collection and curation pipeline can help obtain better data for kernel generation in the future.
Overview of the Data Gathering Pipeline

Overview of our two-stage data gathering pipeline. The first stage synthesizes CUDA kernels with corresponding CoTs and runs unit tests on each generated kernel to verify its correctness and measure its speedup over the torch eager implementation. The second stage balances the task types in the dataset and selects high-quality reasoning traces based on two criteria: (1) the speedup over torch eager and (2) the length of the corresponding CoT in tokens.
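As an illustration, here is a minimal sketch of what such a unit test and second-stage filter could look like. The helper names, record fields (correct, speedup, cot_tokens), tolerances, and thresholds are placeholders, not the exact settings used in the paper.

import time
import torch

def benchmark(fn, *args, warmup=10, iters=100):
    # Median wall-clock latency of fn(*args) on the current CUDA device.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

def unit_test(ref_model, candidate_model, inputs, atol=1e-2, rtol=1e-2):
    # Compare the generated kernel against the torch eager reference and
    # return (is_correct, speedup_over_eager).
    ref_out = ref_model(*inputs)
    cand_out = candidate_model(*inputs)
    if not torch.allclose(ref_out, cand_out, atol=atol, rtol=rtol):
        return False, 0.0
    speedup = benchmark(ref_model, *inputs) / benchmark(candidate_model, *inputs)
    return True, speedup

def curate(records, min_speedup=1.0, max_cot_tokens=8192):
    # Second-stage filter: keep correct kernels whose speedup clears the bar
    # and whose CoT is not excessively long (placeholder thresholds).
    return [r for r in records
            if r["correct"] and r["speedup"] >= min_speedup
            and r["cot_tokens"] <= max_cot_tokens]

In practice, the task-type balancing step would run before or alongside this filter; it is omitted here for brevity.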
Observations

Observation 1: The shorter the reasoning trace, the higher the correctness rate
A common assumption is that more challenging tasks require more thinking tokens, and hence that a correct generation with a long reasoning trace makes for high-quality training data. Our observations show, however, that although more challenging tasks do typically require more reasoning tokens, for the same task, CUDA kernels generated after shorter reasoning traces tend to be correct more often than those produced after longer ones.
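One way to check this relationship on a pool of sampled generations is to split each task's samples at that task's median reasoning length and compare correctness rates of the two halves. The data layout below is illustrative, not the paper's actual format.

from statistics import median

def correctness_by_length(samples):
    # samples maps task_id -> list of (reasoning_tokens, is_correct) pairs.
    short_hits = short_total = long_hits = long_total = 0
    for gens in samples.values():
        cut = median(tokens for tokens, _ in gens)
        for tokens, correct in gens:
            if tokens <= cut:
                short_total += 1
                short_hits += int(correct)
            else:
                long_total += 1
                long_hits += int(correct)
    # Returns (correct rate of the shorter half, correct rate of the longer half).
    return short_hits / max(short_total, 1), long_hits / max(long_total, 1)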

Observation 2: Speedup is largely independent of reasoning length
Conventionally, longer reasoning, meaning more time spent thinking, is expected to improve model performance. However, in CUDA kernel generation we find that prolonged reasoning does not necessarily yield higher-quality kernels and may instead introduce redundant steps that improve neither correctness nor execution efficiency.
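A quick way to probe this on the same pool of verified generations is a rank correlation between CoT length and measured speedup, restricted to kernels that pass the unit test; the field names below are again placeholders.

from scipy.stats import spearmanr

def speedup_length_correlation(records):
    correct = [r for r in records if r["correct"]]
    lengths = [r["cot_tokens"] for r in correct]
    speedups = [r["speedup"] for r in correct]
    # A correlation near zero means longer reasoning does not translate
    # into faster kernels.
    rho, pvalue = spearmanr(lengths, speedups)
    return rho, pvalue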
Main Results

Pass@1 results on KernelBench Level 1 and Level 2.
We report the correctness rate (Exec) and the fast1 score for each level, as percentages. The best result is shown in bold.

Pass@10 results on KernelBench Level 1 and Level 2.
The results show that our model can successfully solve most tasks in KernelBench simply by employing parallel inference.
BibTeX
@article{ConCuR2025,
  title={ConCuR: Conciseness Makes State-of-the-Art Kernel Generation},
  author={Lingcheng Kong and Jiateng Wei and Hanzhang Shen and Huan Wang},
  journal={arXiv preprint arXiv:2510.07356},
  year={2025}
}