Jua's mission is to create artificial general intelligence (AGI). Our product uses a foundational Earth systems model to predict atmospheric events. It is more accurate and faster than current weather models and updates more frequently, which makes it valuable for traders in weather-dependent power markets. These benefits matter most during extreme weather events such as hurricanes and cyclones. As we scale the model, more applications will become possible, including wildfire, flood, and vegetation prediction.
By joining our talented multidisciplinary team, you'll help shape our culture of ambition, transparent communication, and rapid iteration. We offer exciting challenges, creative freedom, fair pay, and generous stock options.
At Jua, our mission is to create the world’s most accurate foundational physics model for the Earth. We see this as an important step on the journey towards AGI. Critical to the success of this mission is the ability to run model training, evaluation and inference at massive scale. The greater the scale we operate at, the better our model becomes. We are looking for a motivated MLOps Engineer who can tackle the problems we encounter as we scale.
Jua is a research-led company focused on solving hard engineering problems and on product excellence. The ideal candidate will be motivated by the prospect of working for an early-stage startup and excited to play a leading role in scaling our infrastructure from hundreds of nodes to several orders of magnitude more.
Responsibilities and tasks
Design, deploy and maintain critical components in Jua’s advanced machine learning cluster of 1000+ GPUs and petabyte-scale training data
Optimise for high model FLOPs utilisation (MFU) at every level, from overarching architectural design to fine-tuning CUDA kernels
Prototype and evaluate new technologies, vendors and architectural approaches with a focus on performance, usability, and reliability
Continuously profile and monitor system performance and resource utilisation
Enhance cluster efficiency and management by implementing best practices and automation
Troubleshoot and resolve complex technical issues across the system
Requirements
Extensive experience in building, configuring and managing high-performance accelerated computing clusters
Deep expertise with distributed file systems or object stores such as MinIO, Lustre, GPFS, Ceph
Strong understanding of distributed model training, in particular the different sharding strategies for models from 1B to 100B+ parameters
Strong troubleshooting abilities across all layers: application, operating system, networking, and hardware
A proactive approach to identifying problems, performance bottlenecks, and areas for improvement
Proven experience collaborating within multidisciplinary teams
Experience working with geospatial datasets
Experience leading/managing projects
At Jua, we foster a performance culture and value people who embody our beliefs of service and adventure. We prioritise agility, operating at the highest clock speed to adapt quickly to change. We innovate on behalf of our users and leverage data supremacy to maintain our competitive edge. Through clear communication and fact-based decision-making, we ensure alignment in our pursuit of excellence. With these principles, we aim to create a customer-focused, value-centric organisation that sets new standards in the industry. We value the unique perspective that each individual brings and believe that embracing diverse backgrounds and experiences enriches our collective journey towards growth and success.