NVIDIA's Infrastructure Specialists team is hiring a Senior Solutions Architect - AI Factory Observability & Visualization! This remote role develops full-spectrum visibility that supports the smooth functioning of HPC systems and AI factories, transforming intricate telemetry across network and compute into straightforward, actionable perspectives.

The role requires a complete, end-to-end understanding of the HPC/AI system: running and interpreting microbenchmarks and workloads to confirm system readiness, then establishing the observability that keeps the system in that state. The work involves collaborating across NVIDIA teams to help partners see, understand, and respond to HPC system and AI factory performance, from hardware to workload.

What You Will Be Doing:

Run AI factory validation tools, microbenchmarks, and workloads provided by the team, and interpret results to assess system health and performance.
Gain a comprehensive understanding of the system from start to finish, including network topology, interconnects, and compute.
Establish what "healthy" means across the stack: the metrics, logs, and signals that confirm a system is functioning well, and the thresholds that show it isn't.
Build and extend the telemetry surface across hardware, fabric, and workload, crafting how data is collected, transformed, stored, and surfaced.
Serve as the observability expert, investigating gaps in visibility to ensure it reflects true system behavior.
Develop automation (Python, Shell) for collecting, transforming, and presenting system and network data.
Recommend improvements to system visibility, data sources, and reporting that give teams clearer insight.
Collaborate with hardware, software, networking, datacenter, and product groups to ready HPC systems and AI factories for customer deployment, contributing documentation and readiness materials throughout the process.

What We Need to See:

Bachelor's degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or a related field.
6+ years of experience managing Linux-based systems in HPC, distributed systems, or large AI/ML settings.
Hands-on experience with the architecture of multi-GPU and/or multi-node clusters, including networking and interconnects.
Solid grasp of how HPC and AI factory systems fit together end to end, from network fabric through compute.
Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
Practical experience working with observability systems (e.g., Prometheus, Grafana, Loki, or similar), including building custom exporters or collectors, setting up alerts, and handling metric cardinality and retention at large scale.
Experience transforming metrics, logs, and traces into clear, actionable insight for complex distributed environments.
Familiarity with GPU and fabric telemetry (e.g., DCGM, NVLink, InfiniBand/Ethernet fabric counters) and using it to diagnose performance regressions.
Strong communication skills and the ability to work effectively with cross-functional teams.

Ways to Stand Out From the Crowd:

Experience with AI factory or large-scale AI infrastructure build, deployment, or operations.
Background in HPC systems engineering, SRE, or systems analysis for GPU-accelerated environments.
Experience building automation and data pipelines that feed dashboards and reporting at scale.
Demonstrated desire to use AI to solve practical problems, improve workflows, and guide data-driven decisions.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 28, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
