How to optimize algorithms in CUDA.
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
[ICML2025] SpargeAttention: a training-free sparse attention method that accelerates inference for any model.
A high-throughput and memory-efficient inference and serving engine for LLMs.
SGLang is a high-performance serving framework for large language models and multimodal models.