Introduction to CUDA programming for Python developers

343 points by t55 a day ago

Stupid question: Is there any chance that I, as an engineer, can get away from learning the Math side of AI but still drill deeper into the lower level of CUDA or even GPU architecture? If so, how do I start? I guess I should learn about optimization and why we chose to use GPU for certain computations.

Parallel question: I work as a Data Engineer and always wonder if it's possible to get into MLE or AI Data Engineering without knowing AI/ML. I thought I only need to know what the data looks like, but so far I see every job description of an MLE requires background in AI.

danielmarkbruce a day ago

Yes. They are largely unrelated. Just go to Nvidia's site and find the docs. Or there are several books (look at amazon).
A "background in AI" is a bit silly in most cases these days. Everyone is basically talking about LLMs or multimodal models which in practice haven't been around long. Sebastian Raschka has a good book about building an LLM from scratch, Simon Prince has a good book on deep learning, Chip Huyen has a good book on "AI engineering". Make a few toys. There you have a "background".
Now if you want to really move the needle... get really strong at all of it, including PTX (Nvidia GPU assembly, sort of). Then you can blow people away like the DeepSeek people did...
- jms55 a day ago
  
  Let's say you already have deep knowledge of GPU architecture and experience optimizing GPU code to save 0.5ms of runtime for a kernel. But you got that experience from writing graphics code for rendering, and have little knowledge of AI beyond a surface-level understanding of how neural networks work.
  How can I leverage that experience into earning the huge amounts of money that AI companies seem to be paying? Most job listings I've looked at require a PhD in specifically AI/math stuff and 15 years of experience (I have a master's in CS, and nowhere close to 15 years of experience).
  - suresk 20 hours ago
    
    I've only done the CUDA side (and not professionally), so I've always wondered how much those skills transfer either way myself. I imagine some of the specific techniques employed are fairly different, but a lot of it is just your mental model for programming, which can be a bit of a shift if you're not used to it.
    I'd think things like optimizing for occupancy/memory throughput, ensuring coalesced memory accesses, tuning block sizes, using fast math alternatives, writing parallel algorithms, working with profiling tools like Nsight, and things like that are fairly transferable?
  - danielmarkbruce 21 hours ago
    
    I don't have a great answer except learn as much about AI as possible - the easiest starting point is Simon Prince's book - and it's free online. Maybe start submitting changes to PyTorch? Get a name for yourself? I don't know.
    Most companies aren't doing a lot of heavy GPU optimization. That's why DeepSeek was able to come out of nowhere. Most (not all) AI research basically takes the hardware (and most of the software) stack as a given and is about architecture, loss functions, data mix, activation functions blah blah blah.
    Speculation - a good amount of work will go towards optimizations in the future (and at the big shops like OpenAI, a good amount already is).
  - pavelstoev a day ago
    
    Is this hypothetical person someone you know? if yes, please email me to pavel at centml dotz ai
  - saagarjha 21 hours ago
    
    You can get paid that without the GPU experience so yes. Getting up to speed with this is mostly just a function of how able you are to understand what modern ML architectures look like.
- ferguess_k a day ago
  
  Thank you! This really helps. I'll concentrate on Computer Architecture and lower level optimization then. I'll also pick one of the books just to get some ideas.
- t55 a day ago
  
  Agreed, Raschka's book is amazing and will probably become the seminal book on LLMs
  - barrenko 15 hours ago
    
    Just to add that he has a video series on DL (youtube), completely approachable and accompanied by code notebooks.
    
    ra7 11 hours ago
    
    How does it compare with Andrej Karpathy’s video series on building GPTs from scratch? Are they pretty much teaching the same things?
    
    barrenko 9 hours ago
    
    Karpathy focuses on GPT, well, NLP-related specifics, while Raschka overviews Deep Learning as a whole, starting from the Perceptron basically.
    Karpathy's teaching style is, well, Karpathy; Raschka is more conventional (but not buttoned down).
SJC_Hacker 20 hours ago

The math isn't that difficult. The transformers paper (https://proceedings.neurips.cc/paper_files/paper/2017/file/3...) was remarkably readable for such a high-impact paper, beyond the AI/ML-specific terminology (attention) that was thrown around.
Neural networks are basically just linear algebra (i.e. matrix multiplication) plus an activation function (ReLU, sigmoid, etc.) to generate non-linearities.
That's first-year undergrad in most engineering programs - a fair amount even took it in high school.
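To make that concrete, here is a rough sketch in plain Python of a tiny two-layer forward pass. It is only an illustration of "matrix multiplication plus an activation": the function names (`matvec`, `relu`, `forward`) and all weights are made up for the example and are not taken from the paper.

```python
# Minimal sketch: a two-layer network as matrix multiplication plus an activation.
# All weights, biases, and inputs are made-up illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    """Element-wise ReLU: negative entries become 0."""
    return [max(0.0, a) for a in v]

def forward(x, W1, b1, W2, b2):
    """hidden = relu(W1 @ x + b1); output = W2 @ hidden + b2."""
    hidden = relu([h + b for h, b in zip(matvec(W1, x), b1)])
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]

W1 = [[1.0, -1.0], [0.5, 0.5]]   # 2x2 weights of the hidden layer
b1 = [0.0, -1.0]
W2 = [[2.0, 1.0]]                # 1x2 weights of the output layer
b2 = [0.5]

y = forward([3.0, 1.0], W1, b1, W2, b2)
print(y)  # [5.5]
```

Without the `relu` in the middle, the two layers would collapse into a single matrix multiplication, which is why the non-linearity matters.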
- OtherShrezzing 15 hours ago
  
  I'd like to reinforce this viewpoint. The math is non-trivial, but if you're a software engineer, you have the skills required to learn _enough_ of it to be useful in the domain. It's a subject which demands an enormous amount of rote learning - exactly the same as software engineering.
- t55 9 hours ago
  
  hot take: i don't think you even need to understand much linear algebra/calculus to understand what a transformer does. like the math for that could probably be learned within a week of focused effort.
  - SJC_Hacker 9 hours ago
    
    Yeah, to be honest it's mostly the matrix multiplication, which I got in second-year algebra (high school).
    You don't really even need to know about determinants, inverting matrices, Gauss-Jordan elimination, eigenvalues, etc. that you'd get in a first-year undergrad linear algebra course.
dragandj 9 hours ago

May I plug ClojureCUDA, a high-level library that lets you write CUDA with almost no overhead, but write it in the interactive Clojure REPL.
https://github.com/uncomplicate/clojurecuda
There's also tons of free tutorials at https://dragan.rocks And a few books! (not free) at https://aiprobook.com
Everything from scratch, interactive, line-by-line, and each line is executed in the live REPL.
codelion 10 hours ago

Not a stupid question at all! Imo, you can definitely dive deep into CUDA and GPU architecture without needing to be a math whiz. Think of it like this: you can be a great car mechanic without being the engineer who designed the engine.
Start with understanding parallel computing concepts and how GPUs are structured for it. Optimization is key - learn about memory access patterns, thread management, and how to profile your code to find bottlenecks. There are tons of great resources online, and NVIDIA's own documentation is surprisingly good.
As for the data engineering side, tbh, it's tougher to get into MLE without ML knowledge. However, focusing on the data pipeline, feature engineering, and data quality aspects for ML projects might be
- ferguess_k 9 hours ago
  
  Thanks for the help!
  > As for the data engineering side, tbh, it's tougher to get into MLE without ML knowledge. However, focusing on the data pipeline, feature engineering, and data quality aspects for ML projects might be
  I have a feeling that companies usually expect an MLE to do both ML/AI and Data Engineering, so this might indeed be a dead end. Somehow I'm just not very interested in the ML part of MLE, so I'll shelve that thought for the time being.
  > Start with understanding parallel computing concepts and how GPUs are structured for it. Optimization is key - learn about memory access patterns, thread management, and how to profile your code to find bottlenecks. There are tons of great resources online, and NVIDIA's own documentation is surprisingly good.
  Thanks a lot! I'll keep these points in mind when learning. I need to go through more basic CompArch materials first I think. I'm not a good programmer :D
- t55 10 hours ago
  
  Agreed, not sure how much math is really needed.
musebox35 8 hours ago

I suggest having a look at https://m.youtube.com/@GPUMODE
They have excellent resources to get you started with CUDA/Triton on top of torch. It also has a good community around it so you get to listen to some amazing people :)
JAlexoid 9 hours ago

> Math side of AI but still drill deeper into the lower level of CUDA or even GPU architecture
CUDA requires a clear understanding of the mathematics related to graphics processing and algebra. Using CUDA like you would use a traditional CPU would yield abysmal performance.
> MLE or AI Data Engineering without knowing AI/ML
It's impossible to do so, considering that you need to know exactly how the data is used in the models. At the very least you need to understand the basics of the systems that use your data.
Like 90% of the time spent creating ML-based applications goes into preparing the data to be useful for a particular use case. And if you take Google's ML Crash Course, you'll understand why you need to know the what and the why.
codelion 19 hours ago

It's definitely possible to focus on the CUDA/GPU side without diving deep into the math. Understanding parallel computing principles and memory optimization is key. I've found that focusing on specific use cases, like optimizing inference, can be a good way to learn. On that note, you might find https://github.com/codelion/optillm useful – it optimizes LLM inference and could give you practical experience with GPU utilization. What kind of AI applications are you most interested in optimizing?
the__alchemist 13 hours ago

I will provide general advice that applies here, and elsewhere: Start with a project, and implement it, using CUDA. The key will be identifying a problem that is SIMD in nature. Choose something you would normally use a loop for, but that has many (e.g. tens of thousands or more) iterations, which do not depend on the output of the other iterations.
Some basic areas to focus on:
```
  - Setting up the architecture and config
  - Learning how to write the kernels, and what makes sense for a kernel
  - Learning how the IO and synchronization between CPU and GPU work.
```
This will be like learning any new programming skill.
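As a rough sketch of what "a loop whose iterations don't depend on each other" turns into, the plain-Python emulation below runs a CUDA-style launch one "thread" at a time. The names `saxpy_kernel` and `launch` and the sizes are hypothetical and chosen only for illustration; this is not a real CUDA API, just the index arithmetic that a kernel typically uses.

```python
# Sketch only: emulates a CUDA-style launch sequentially in plain Python.
# saxpy_kernel, launch, and all sizes are illustrative, not a real CUDA API.

def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Body of one thread: computes one output element, independent of all others."""
    # Global index, analogous to blockIdx.x * blockDim.x + threadIdx.x in CUDA.
    i = block_idx * block_dim + thread_idx
    if i < len(x):  # guard: the last block may extend past the end of the data
        out[i] = a * x[i] + y[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for a kernel launch; a GPU would run these in parallel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim)
launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out)
print(out)
```

Because no iteration reads another iteration's output, the order in which `launch` visits the threads does not matter. A running sum, where element i needs the result for element i-1, would not fit this pattern without restructuring the algorithm.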
Falimonda 19 hours ago

If you want to dive into CUDA specifically then I recommend following some of the graphics tutorials. Then mess around with it yourself, trying to implement any cool graphic/visualization ideas or remixes on the tutorial material.
You could also try to recreate or modify a shader you like from https://www.shadertoy.com/playlist/featured
You'll inevitably pick up some of the math and probably have fun doing it along the way.
moondev 13 hours ago

From an infrastructure perspective, if you have access to the hardware, a fun starting point is running NCCL tests across the infrastructure. Start with a single GPU, then 8 GPUs on a host, then 24 GPUs across multiple hosts over IB or RoCE. You will get a feel for MPI and plenty of knobs to turn on the Kubernetes side.
physicsguy 15 hours ago

Yes, but the problems that need GPU programming also tend to require you to have some understanding of maths. Not exclusively - but it needs to be a problem that's divisible into many small pieces that can be recombined at the end, and you need to have enough data to work through that the compute cost + data transfer cost is much lower than just doing it on CPU.
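A back-of-envelope way to think about that trade-off is sketched below. Every number (the 20x speedup, 10 GB/s of transfer bandwidth, 0.1 ms of fixed overhead) is a made-up placeholder rather than a measurement, and the model assumes the data is copied to the GPU and back exactly once.

```python
# Back-of-envelope sketch. All default numbers are made-up placeholders,
# not measurements of any real hardware.

def gpu_total_seconds(n_bytes, cpu_seconds, speedup,
                      bandwidth_bytes_per_s, launch_overhead_s):
    """Estimated GPU time = copy in + copy out + compute + fixed overhead."""
    transfer = 2 * n_bytes / bandwidth_bytes_per_s  # host->device and device->host
    compute = cpu_seconds / speedup                 # assumed speedup over the CPU
    return transfer + compute + launch_overhead_s

def worth_offloading(n_bytes, cpu_seconds, speedup=20.0,
                     bandwidth_bytes_per_s=10e9, launch_overhead_s=1e-4):
    """True if the estimated GPU path beats just doing the work on the CPU."""
    gpu = gpu_total_seconds(n_bytes, cpu_seconds, speedup,
                            bandwidth_bytes_per_s, launch_overhead_s)
    return gpu < cpu_seconds

# 1 MB of data that the CPU handles in 0.1 ms: the copies alone cost more.
print(worth_offloading(n_bytes=1e6, cpu_seconds=1e-4))   # False
# 1 GB of data that takes the CPU 10 s: transfer cost is small by comparison.
print(worth_offloading(n_bytes=1e9, cpu_seconds=10.0))   # True
```

Under these assumptions the small job loses on transfer and launch overhead alone, while the large job wins comfortably, which matches the "enough data to work through" condition above.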
t55 a day ago

IMO absolutely yes. I would start with the linked introduction and then ask myself if I enjoyed it.
For a deeper dive, check out something like Georgia Tech’s CS 8803 O21: GPU Hardware and Software.
To get into MLE/AI Data Engineering, I would start with a brief introductory ML course like Andrew Ng’s on Coursera.
- ferguess_k a day ago
  
  Thanks! I'll follow the link and see what happens. And thanks for recommending Andrew Ng's course too, hopefully it gives enough background to know how the users (AI scientists) want us to prepare the data.
llm_trw a day ago

I mean yes, but without knowing the maths then knowing how to optimize the maths is a bit useless?
At the very least you should know enough linear algebra that you understand scalar, vector and matrix operations against each of the others. You don't need to be able to derive back prop from first principles, but you should know what happens when you multiply a matrix by a vector and apply a non-linear function to the result.
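For reference, the operations in question can be written out in a few lines of plain Python. This is just an illustrative sketch with made-up values and helper names (`scale`, `dot`, `matvec`, `matmul`); in practice a library such as NumPy or PyTorch would do this for you.

```python
import math

def scale(c, v):
    """scalar * vector -> vector"""
    return [c * a for a in v]

def dot(u, v):
    """vector . vector -> scalar"""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """matrix @ vector -> vector (one dot product per row of M)"""
    return [dot(row, v) for row in M]

def matmul(A, B):
    """matrix @ matrix -> matrix (one dot product per row/column pair)"""
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

v = [1.0, 2.0]
M = [[1.0, 0.0], [3.0, 4.0]]

print(scale(2.0, v))   # [2.0, 4.0]
print(dot(v, v))       # 5.0
print(matvec(M, v))    # [1.0, 11.0]
print(matmul(M, M))    # [[1.0, 0.0], [15.0, 16.0]]

# Matrix times vector, then a non-linear function applied element-wise.
activated = [math.tanh(a) for a in matvec(M, v)]
print(activated)
```

The last step is the "multiply a matrix by a vector and apply a non-linear function" case: the output has one entry per matrix row, each squashed independently by `tanh`.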
- ferguess_k a day ago
  
  Thanks! Yeah I do know some Math. I'm not sure how much I need to know. I guess the more the merrier, but it would be nice to know a line that I don't need to cross to properly do my job.
  - llm_trw 18 hours ago
    
    It's a tough one, I've never seen a book that actually covers the _bare_ minimum of the maths you need for ML.
    The Little Learner comes close, but I'd only really suggest that to people who already know the maths because the presentation is very non-standard and can get very misleading.
    If you're interested drop me a line on my profile email and I'll have a look at some numerical algebra books and papers to see what's out there.
bwfan123 10 hours ago

I found the gpumode lectures, videos and code right on the money. check them out.
esafak a day ago

You will probably have fewer job opportunities than the people working higher up, but be safer from AI automation for now :)
fulafel 18 hours ago

Try dipping your toes into graphics programming, you can still use GPUs for that as well.

ultrasounder 7 hours ago

Very nice write-up. The inline quiz, which I think is AI-generated (Q&A), is very useful for testing understanding. Wish all tutorials incorporated that feature.

t55 6 hours ago

thank you!

ralphc 4 hours ago

Are all the CUDA tutorials geared towards AI or are there some, for example, like regular scientific computing? Airflow over wings and things that you used to see for high-performance computing would be fun to try.

spps11 a day ago

Thanks for sharing, enjoyed reading it!

I have a slightly tangential question: Do you have any insights into what exactly DeepSeek did by bypassing CUDA that made their run more efficient?

I always found it surprising that a core library like CUDA, developed over such a long time, still had room for improvement, especially to the extent that a seemingly new team of developers could bridge the gap on their own.

saagarjha 21 hours ago

They didn’t. They used PTX, which is what CUDA C++ compiles down to, but which is part of the CUDA toolchain. All major players have needed to do this because the intrinsics for the latest accelerators are not actually exposed in the C++ API, which means using them requires inline PTX at the very minimum.
t55 a day ago

They basically ditched CUDA and went straight to writing in PTX, which is like GPU assembly, letting them repurpose some cores for communication to squeeze out extra performance. I believe that with better AI models and tools like Cursor, we will move to a world where you can mold code ever more specifically to your use case to make it more performant.
- suresk 19 hours ago
  
  Are you sure they ditched CUDA? I keep hearing this, but it seems odd because that would be a ton of extra work to entirely ditch it vs selectively employing some PTX in CUDA kernels, which is fairly straightforward.
  Their paper [1] only mentions using PTX in a few areas to optimize data transfer operations so they don't blow up the L2 cache. This makes intuitive sense to me, since the main limitation of the H800 vs H100 is reduced NVLink bandwidth, which would necessitate doing stuff like this that may not be a common thing for others who have access to H100s.
  1. https://arxiv.org/abs/2412.19437
  - t55 9 hours ago
    
    I should have been more precise, sorry. Didn't want to imply they entirely ditched CUDA but basically circumvented it in a few areas like you said.
- pjmlp 9 hours ago
  
  Targeting PTX directly is perfectly regular CUDA, and is done by many toolchains that target the ecosystem.
  CUDA is not only C++, as many mistake it for.
- spps11 a day ago
  
  got it, thanks for explaining.
  > with better AI models and tools like Cursor, we will move to a world where you can mold code ever more specific to your use case to make it more performant
  what do you think the value of having the right abstraction will be in such a world?
  - t55 a day ago
    
    I think that, at least for us dumb humans with limited memory, having good abstractions makes things much easier to understand
    
    spps11 a day ago
    
    Yes, but I wonder how much of this trait is carried over to the LLMs from us.
    
    t55 a day ago
    
    what do you mean, the LLM abstracting things for us while we speak to it?
    
    spps11 a day ago
    
    No, I meant something else. As you said: us humans love clean abstractions. We love building on top of them. Now LLMs are trained on data produced by us. So I wonder if they would also inherit this trait from us and end up loving good abstractions, and would find it easier to build on top of them. The other possibility is that they end up move-37ing the whole abstraction shebang and find that always building something bespoke from the low level up is better than constraining oneself to some general-purpose abstraction.
    
    t55 20 hours ago
    
    ah gotcha. I think that with the new trend of RLing models, the move 37 may come up sooner than we think -- just provide the pretrained models some outcome-goal and the way it gets there may use low-level code without clean abstractions
    
    tomnipotent a day ago
    
    It's an interesting idea.
    If code is only ever updated by an LLM, does it benefit from using abstractions? After all they're really a tool for us lowly sapients to aid in breaking down complex problems. Maybe LLMs will create their own class of abstractions, different from our own but useful for their task.

musicale 20 hours ago

What Jensen giveth, Guido taketh away.

t55 20 hours ago

lol. i guess this tutorial is about cutting out guido ;)

signa11 18 hours ago

this book:

    Programming Massively Parallel Processors by Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj

seems to be tailor-made for folks transitioning from cpu -> gpu arch.

t55 9 hours ago

Yes, it is great for key concepts but a bit outdated. Hence we added an LLM/FA section in the linked post!

LegNeato a day ago

Also check out https://github.com/rust-gpu/rust-gpu and https://github.com/rust-gpu/rust-cuda

the__alchemist 13 hours ago

Rust-Cuda is broken and has been for years. `cudarc` is the [only?] working one.
- LegNeato 10 hours ago
  
  I am in the process of rebooting it: https://rust-gpu.github.io/blog/2025/01/27/rust-cuda-reboot/
t55 21 hours ago

this looks really cool and i love rust. just a matter of time until everything runs on rust.

t55 a day ago

https://www.reddit.com/r/MachineLearning/comments/1itqrgl/p_...

saagarjha a day ago

Wasn’t this a bunch of kernels that didn’t work?
- t55 a day ago
  
  What do you mean?
  - pavelstoev a day ago
    
    The hallucinated code was reusing memory buffers filled with previous results so not performing the actual computations. When this was fixed the AI generated code was like 0.3x of the baseline.
    
    neodypsis 21 hours ago
    
    It is mentioned in the section "Limitations and Bloopers" of the page [0]:
    > Combining evolutionary optimization with LLMs is powerful but can also find ways to trick the verification sandbox. We are fortunate to have Twitter user @main_horse help test our CUDA kernels, to identify that The AI CUDA Engineer had found a way to “cheat”. The system had found a memory exploit in the evaluation code which, in a small percentage of cases, allowed it to avoid checking for correctness (...)
    0. https://sakana.ai/ai-cuda-engineer
    
    rnrn 20 hours ago
    
    As I write this (after the updates to the evaluation code), https://pub.sakana.ai/ai-cuda-engineer/kernel/2/23/optimize-... is at the top of their list of speedups, with a claim of 128x speed up on a fused 3D convolution + groupnorm + mean.
    The generated implementation doesn’t do a convolution.
    The 2nd kernel on the leaderboard also appears to be incorrect, with a bunch of dead code computing a convolution and then not using it and writing tanhf(1.0f) * scaling_factor for every output.
  - imtringued 17 hours ago
    
    They don't verify the correctness of their kernels. They expect you to pick the working ones from their kernel junkyard yourself.
    The very idea is also dumb as hell. They could have done CUDA -> HIP/oneAPI/Metal/Vulkan/SYCL/OpenCL. Then they wouldn't need to beat the performance of anything, just the automatic porting would be worth an acquisition by AMD or Intel.
    
    bwfan123 10 hours ago
    
    The problem with startups like Devin (AI sw engineer) and Sakana (AI research scientist) is that they are full of hot air.
    They get caught up in the hype, and focus on the marketing and not the essential engineering.

whatever1 15 hours ago

Any idea what changed recently so that we can have end-to-end simulations (with branches) on the GPU (e.g. Isaac Sim), whereas in the past simulations were a CPU thing?

jamiejquinn 14 hours ago

Always been possible, but now the time cost of moving data between the GPU and CPU memory is too high to ignore. Branching may be slower on the GPU but it's still faster than moving data to the CPU for a time then back. The maturation of direct GPU-GPU transfers over the network also helped enable GPU-only MPI codes.

m_kos a day ago

Since this is on PySpur's website, does anyone have experience with these UI tools for AI agents like PySpur and n8n? I am looking for something to help me prototype a few ideas for fun. I would have to self-host it ($), so I would prefer something relatively easy to configure like Open Hands.

t55 a day ago

Disclaimer: I work on pyspur
I'd recommend pyspur if you seek:
1) More AI-native features, e.g. evals, RAG, or even UI decisions like seeing outputs directly on the canvas when running on the agent
2) A truly open-source Apache license
3) Something Python-based (in the sense that you can run and extend it via Python)
On the other hand, n8n is:
1) more mature for traditional workflows
2) offering overall more integrations (probably every single integration you can think of)
3) TypeScript-based and runs on Node.js
- m_kos 21 hours ago
  
  Thanks for replying. Do you know when your docs will be a bit more comprehensive? Right now, there is very little information and some links don't work, e.g., Next Steps on this page: https://docs.pyspur.dev/quickstart
  - t55 21 hours ago
    
    > Do you know when your docs will be a bit more comprehensive?
    Yes, we're actively working on this, and we should have some more pages by next week. If you have any questions, you can always shoot us an email: founders@pyspur.dev or join our Discord.
    > some links don't work, e.g., Next Steps on this page
    This might be confusing, the cards below "After installation, you can:" are not meant to be links. Thanks for making us aware, we will improve the wording.
spps11 a day ago

pyspur is apache 2. it is free to self-host.

nitrogen99 a day ago

If you are a Python dev, why not just use Triton?

t55 a day ago

Triton sits between CUDA and PyTorch and is built to work smoothly within the PyTorch ecosystem. In CUDA, on the other hand, you can directly manipulate warp-level primitives and fine-tune memory prefetching to reduce latency in e.g. attention algorithms, a level of control that Triton and PyTorch don't offer AFAIK.
- pjmlp 8 hours ago
  
  MLIR extensions for Python do though, as far as I could tell from LLVM developer meeting.
  - 6gvONxR4sf7o 8 hours ago
    
    MLIR is one of those things everyone seems to use, but nobody seems to want to write solid introductory docs for :(
    I've been curious for a few years now to get into MLIR, but I don't know compilers or LLVM, and all the docs I've found seem to assume knowledge of one or the other.
    (yes this is a plea for someone to write an 'intro to compilers' using MLIR)
    
    pjmlp 7 hours ago
    
    Not sure if you will be able to follow along, but here is what I was talking about:
    "PyDSL: A MLIR DSL for Python developers"
    https://www.youtube.com/watch?v=iYLxgTRe8TU
    "PyDSL, a subset of Python for constructing affine & transform dialects"
    https://www.youtube.com/watch?v=nmtHeRkl850
    And MLIR channel,
    https://www.youtube.com/@MLIRCompiler
saagarjha 21 hours ago

Triton is somewhat limited in what it supports, and it’s not really Python either.
pavelstoev a day ago

or use Hidet compiler (open source)
- t55 20 hours ago
  
  never heard of Hidet before; when and for what would I use it over CUDA/Triton/PyTorch?
  - pavelstoev 5 hours ago
    
    It is written in Python itself and emits efficient CUDA code. This way, you can understand what is going on. The current focus is on inference, but hopefully, training workloads will be supported soon. https://github.com/hidet-org/hidet

AchintyaAshok 11 hours ago

Thanks for unraveling this!

t55 9 hours ago

you're welcome!

rtkal10 a day ago

Interestingly, the CUDA implementations are more readable than the pytorch ones.

t55 21 hours ago

interesting, you mean they are less obscure?

android521 14 hours ago

pyspur graph is cool, is there a startup building this kind of product but in typescript?

lukaspetersson a day ago

I needed this

t55 20 hours ago

Hehe glad you did!
