
Senior DGX Cloud AI Infrastructure Software Engineer
Design, build, and maintain AI infrastructure software and tools for large-scale pre-training, post-training, and inference. Responsibilities include improving efficiency and resiliency, co-designing APIs with resiliency stacks, defining reliability metrics, and performing root-cause analysis from the application layer down to the hardware. Requires 8+ years of experience; proficiency in Python and C/C++; experience with observability platforms (ELK, Prometheus, Loki), distributed systems, and AI training/inference infrastructure; and strong debugging skills.
