I made a Top-K implementation that’s up to 20x faster than PyTorch CPU (open source)

Enhancing Language Model Sampling: Introducing a High-Performance Top-K Selection Algorithm

In large language model (LLM) deployment, efficient sampling is crucial for maintaining high throughput and responsiveness. This post introduces an optimized implementation of Top-K selection, a fundamental step in model sampling, that runs up to 20 times faster than PyTorch’s CPU implementation in the benchmarks below.

The Challenge of Efficient Top-K Selection

Selecting the K highest-scoring tokens means scanning the model’s entire output distribution at every generation step, which becomes costly with vocabularies of hundreds of thousands of tokens. Standard frameworks like PyTorch provide a built-in function for this (`torch.topk`), but it can be slow on CPU hardware, particularly with large vocabularies or batches.
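To make the operation concrete, here is a minimal pure-Python sketch of where Top-K selection sits in sampling: keep the K largest logits, renormalize over those survivors, and draw one token. The function name `top_k_sample` and the use of `heapq.nlargest` are illustrative assumptions; this is neither the project's code nor PyTorch's, only a reference for what the step computes.

```python
import heapq
import math
import random


def top_k_sample(logits, k, rng=random):
    """Pick one token id from the k highest-scoring entries of `logits`."""
    # Top-K selection: keep the k largest (logit, token_id) pairs, largest first.
    top = heapq.nlargest(k, ((score, tok) for tok, score in enumerate(logits)))
    # Softmax over the survivors only, shifted by the max for numerical stability.
    peak = top[0][0]
    weights = [math.exp(score - peak) for score, _ in top]
    token_ids = [tok for _, tok in top]
    # Draw one token id in proportion to its renormalized probability.
    return rng.choices(token_ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    example_logits = [0.1, 5.0, -2.0, 4.5, 0.0]
    print(top_k_sample(example_logits, k=2))  # prints token id 1 or 3
```

The selection line is the part that has to touch every vocabulary entry on every step, which is why it is the target of the optimization described here.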

Introducing an Optimized, Open-Source Solution

To address this bottleneck, an AVX2-optimized, batched Top-K selection algorithm was developed. This implementation leverages advanced SIMD (Single Instruction, Multiple Data) instructions, adaptive sampling strategies, and cache-efficient scanning techniques to dramatically accelerate the process.

Key Features:

AVX2-Accelerated Batching: Utilizes SIMD instructions for parallel data processing, maximizing CPU throughput.
Adaptive Sampling & Cache Optimization: Ensures that data access patterns are efficient, reducing latency.
Fast Paths for Specific Input Types: Optimized handling for sorted or constant input sequences.
Single-Pass Algorithm: Completes the selection in a single pass over the data, eliminating redundant passes.
GPU-Free Execution: Achieves high performance without relying on GPU acceleration.
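The single-pass idea in the list above can be sketched in scalar Python. This is a hedged illustration, not the project's algorithm: it assumes a min-heap of the current best K candidates and checks each block of eight values (the number of 32-bit floats in one AVX2 register) against the heap's smallest entry, skipping the block when nothing beats it. The name `top_k_single_pass` and the constant `LANES` are hypothetical, and a real AVX2 version would replace the `max(chunk)` check with a vector compare.

```python
import heapq

LANES = 8  # an AVX2 register holds eight 32-bit floats


def top_k_single_pass(values, k):
    """Return the k largest (value, index) pairs, largest first, in one scan."""
    n = len(values)
    k = min(k, n)
    if k <= 0:
        return []
    # Seed a min-heap with the first k entries; heap[0] is the current threshold.
    # Indices are negated so that, among equal values, the earlier index is kept.
    heap = [(v, -i) for i, v in enumerate(values[:k])]
    heapq.heapify(heap)
    for start in range(k, n, LANES):
        chunk = values[start:start + LANES]
        # Scalar stand-in for one SIMD compare: skip the block if no lane beats the threshold.
        if max(chunk) <= heap[0][0]:
            continue
        for offset, v in enumerate(chunk):
            if v > heap[0][0]:
                heapq.heapreplace(heap, (v, -(start + offset)))
    return [(v, -neg_i) for v, neg_i in sorted(heap, reverse=True)]


if __name__ == "__main__":
    scores = [0.5, 3.0, -1.0, 3.0, 2.0, 9.0, 0.0, 1.0, 4.0, 4.0, -5.0, 8.0]
    print(top_k_single_pass(scores, 3))  # [(9.0, 5), (8.0, 11), (4.0, 8)]
```

In this sketch, constant or descending input never updates the heap after the first K entries, so every later block is rejected by the single threshold check. That is one plausible reason such inputs can be handled quickly, though the project's actual fast paths may work differently.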

Benchmarking Performance

Performance tests across different vocabulary sizes demonstrate the significant gains over PyTorch’s CPU implementation:

| Vocabulary Size | Custom Implementation (ms) | PyTorch CPU (ms) | Speedup Factor |
|---|---|---|---|
| 32,000 tokens | 0.043 | 0.173 | ~4.0x |
| 128,000 tokens | 0.057 | 0.777 | ~13.6x |
| 256,000 tokens | 0.079 | 1.560 | ~19.7x |

The speedup grows with vocabulary size: going from 32,000 to 256,000 tokens, the custom implementation’s time less than doubles, while PyTorch’s grows roughly ninefold.
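The speedup column is simply the PyTorch time divided by the custom time. A short sketch of that arithmetic, using only the timings from the table (the variable names are illustrative):

```python
# (vocabulary size, custom ms, PyTorch CPU ms), copied from the benchmark table
timings = [
    (32_000, 0.043, 0.173),
    (128_000, 0.057, 0.777),
    (256_000, 0.079, 1.560),
]

# Speedup = baseline time / optimized time for the same workload.
speedups = {vocab: torch_ms / custom_ms for vocab, custom_ms, torch_ms in timings}

for vocab, ratio in speedups.items():
    print(f"{vocab:>7,} tokens: {ratio:.1f}x")
```

This prints ratios of about 4.0x, 13.6x, and 19.7x for the three vocabulary sizes.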

Real-World Impact

Integrating this optimized Top-K selection into llama.cpp reportedly sped up prompt processing by approximately 63% on a 120-billion-parameter Mixture of Experts (MoE) model, with throughput quoted as rising from 81 to 142 tokens per second. Gains of this size can noticeably improve the responsiveness and scalability of deployment pipelines.
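As a quick check on these figures, the relative gain implied by two throughput readings is (after - before) / before. The sketch below applies that to the quoted 81 and 142 tokens per second. Those endpoints alone work out to about 75%, so the 63% figure presumably reflects a different measurement or baseline that is not spelled out here.

```python
before_tps = 81.0   # quoted throughput before the change, tokens per second
after_tps = 142.0   # quoted throughput after the change, tokens per second

# Relative throughput gain as a fraction of the starting value.
relative_gain = (after_tps - before_tps) / before_tps

print(f"{relative_gain:.1%}")  # 75.3%
```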

Accessibility and Future Directions

The implementation is available as an open-source project, complete with pre-built DLLs and compatible integration code for Windows platforms. Contributors and enthusiasts are encouraged to review, provide feedback, or suggest enhancements.

Learn more and access the code here: https://github.com/RAZZULLIX/fast_topk_batched

Conclusion

This development exemplifies how targeted low-level optimizations and algorithmic innovations can drastically improve core model operations. By reducing the computational overhead of Top-K selection, developers can achieve higher throughput, lower latency, and more scalable LLM applications without reliance on GPU resources.

Keywords: Large Language Models, Top-K Sampling, CPU Optimization, AVX2, SIMD, Performance Benchmarking, Open Source, llama.cpp, Model Deployment
