Scientific Software Engineer (HPC)

about 1 month ago
Full-time role
Hybrid · Zürich, ZH, CH

Overview

Jua's mission is to create artificial general intelligence (AGI). Our product uses a foundational Earth systems model to predict atmospheric events. It is more accurate and faster than current weather models and updates more frequently, which makes it valuable for traders in weather-dependent power markets. These benefits are crucial during extreme weather events like hurricanes and cyclones. As we scale the model, more applications will become possible, including wildfire, flood, and vegetation prediction.

By joining our talented multidisciplinary team, you'll help shape our culture of ambition, transparent communication, and rapid iteration. We offer exciting challenges, creative freedom, fair pay, and generous stock options.

What we are looking for

At Jua, our mission is to create the world’s most accurate foundational physics model for the Earth. We see this as an important step on the journey towards AGI. Critical to the success of this mission is the ability to run model training, evaluation and inference at massive scale. The greater the scale we operate at, the better our model becomes. We are looking for a motivated Scientific Software Engineer (HPC) who can tackle the problems we encounter as we scale.

Jua is a research-led company with a focus on solving hard engineering problems and on product excellence. The ideal candidate will be motivated by the prospect of working for an early-stage startup and excited to play a leading role in scaling our infrastructure from hundreds of nodes to several orders of magnitude more.

Responsibilities and tasks

  • Design, deploy and maintain critical components in Jua’s advanced machine learning cluster of 1000+ GPUs and petabyte-scale training data

  • Optimise for high model FLOPs utilisation (MFU) at every level, from overarching architectural design to fine-tuning CUDA kernels

  • Prototype and evaluate new technologies, vendors and architectural approaches with a focus on performance, usability, and reliability

  • Continuously profile and monitor system performance and resource utilisation

  • Enhance cluster efficiency and management by implementing best practices and automation

  • Troubleshoot and resolve complex technical issues across the system

Need-to-have

  • Extensive experience in building, configuring and managing high-performance accelerated computing clusters

  • Deep expertise with distributed file systems or object stores such as MinIO, Lustre, GPFS, Ceph

  • Deep understanding of distributed model training, especially the different sharding strategies for models from 1B to 100B+ parameters

  • Strong troubleshooting abilities across all layers: application, operating system, networking, and hardware

  • A proactive approach to identifying problems, performance bottlenecks, and areas for improvement

  • Proven experience collaborating within multidisciplinary teams

Nice-to-have

  • Experience working with geospatial datasets

  • Experience leading/managing projects

At Jua, we foster a performance culture and value people who embody our beliefs of service and adventure. We prioritize agility, operating at the highest clock speed to adapt quickly to change. We innovate on behalf of our users and leverage data supremacy to maintain our competitive edge. Through clear communication and fact-based decision-making, we ensure alignment in our pursuit of excellence. With these principles, we aim to create a customer-focused, value-centric organization that sets new standards in the industry. We value the unique perspectives that each individual brings to the table and believe that embracing diverse backgrounds and experiences enriches our collective journey towards growth and success.