BIFOLD Talk "Challenges for Observability in AI Factories"
Title: Challenges for Observability in AI Factories
Abstract: The global-scale deployment of AI infrastructures and AI factories is reshaping the landscape of modern computation. Future-generation data centers are transitioning into deeply integrated hardware and software stacks. This shift introduces a new set of observability challenges. To maintain peak efficiency and reliability, novel approaches must be developed for real-time tracking of critical resources, including High Bandwidth Memory (HBM) utilization, performance, congestion in high-speed networks, and granular power utilization per token of computation. This talk explores observability challenges related to the unpredictability of metrics due to dynamic hardware probes, the difficulties in automating the generation of telemetry queries and their documentation, and the need for telemetry validation and management as a version-controlled pipeline. It also suggests leveraging the standardized structure of AI clusters to develop new network-driven anomaly detection algorithms in AIOps.
Short bio: Jorge Cardoso is a Solutions Architect at NVIDIA, where he explores how new technologies that leverage Observability, Reliability Engineering, and Machine Learning can be used to operate and optimize NVIDIA’s AI infrastructures. Previously, he led the Large-scale AIOps Lab at Huawei Cloud in Munich, working on the development of innovative AI/ML-based systems for intelligent operations of cloud infrastructure. He has held roles at several leading industrial and academic research institutions, including SAP AG, The Boeing Company, CCG/Zentrum für Graphische Datenverarbeitung, Karlsruhe Institute of Technology (KIT), and the University of Dresden. Jorge is also an invited professor at the University of Coimbra, Portugal. He received a Ph.D. in Computer Science from the University of Georgia, US.