Linux Systems Engineering
- Troubleshoot complex issues across the kernel, networking stack, storage, and performance layers.
- Tune performance for low-latency systems (CPU pinning, NUMA, IRQ balancing, kernel tuning).
- Develop automation using Python, Go, or similar languages.
- Build and maintain infrastructure tooling and internal platform services.
- Implement high-availability solutions and disaster recovery strategies.
- Perform root cause analysis for production incidents affecting distributed systems.
- Design, deploy, and operate GPU-enabled infrastructure.
- Optimize GPU utilization (memory bandwidth, PCIe throughput, Multi-Process Service (MPS), and MIG partitioning where applicable).
- Tune workloads to efficiently leverage NVIDIA GPUs (or equivalent accelerators) for compute-intensive applications.
- Troubleshoot GPU driver, CUDA, kernel module, and firmware-related issues in production environments.
OpenStack Development & Cloud Infrastructure
- Develop and extend OpenStack services (Nova, Neutron, Cinder, Keystone, etc.).
- Build custom integrations and automation around OpenStack APIs.
- Optimize compute, networking, and storage performance for high-performance workloads.
- Design multi-tenant OpenStack architectures with strong isolation and security.
- Contribute to infrastructure-as-code frameworks managing OpenStack environments.
- Debug and resolve deep issues across hypervisors (KVM), networking layers, and control plane services.
- Integrate OpenStack environments with Kubernetes platforms (hybrid cloud architectures).
Kubernetes Platform Engineering
- Design, build, and operate highly available, production-grade Kubernetes clusters.
- Develop and maintain Kubernetes operators, controllers, and custom resource definitions (CRDs).
- Implement advanced scheduling, multi-tenancy, and workload isolation strategies.
- Optimize cluster performance for low-latency and high-throughput workloads.
- Integrate Kubernetes with CI/CD pipelines and GitOps workflows.
- Implement cluster observability using Prometheus, Grafana, OpenTelemetry, etc.
- Design and enforce network policies (CNI) and ingress architecture.
- Implement secure cluster design including RBAC, OPA/Gatekeeper, secrets management, and runtime security.
Automation & Infrastructure as Code
- Design and maintain infrastructure using Terraform, Ansible, Helm, or similar tools.
- Build CI/CD pipelines for infrastructure and platform deployments.
- Implement immutable infrastructure and GitOps methodologies.
- Create automated validation, testing, and deployment frameworks for platform services.
Required Technical Skills
- Advanced Linux systems knowledge (kernel, networking, storage)
- Experience deploying and operating GPU-enabled Linux servers
- Understanding of CUDA drivers and GPU kernel modules
- Performance profiling and workload tuning for compute-intensive applications
- Hands-on OpenStack development and operations experience
- Strong experience administering and engineering production Kubernetes clusters
- Strong understanding of distributed systems principles:
  - Consensus
  - Replication
  - Fault tolerance
  - CAP theorem tradeoffs
- Experience with:
  - Python or similar programming languages
  - Infrastructure as Code (Terraform, Ansible)
  - Container runtimes (containerd, CRI-O)
  - Observability stacks (Prometheus, Grafana, ELK)
Desirable Experience
- Experience in low-latency or high-performance trading environments
- High-performance networking (DPDK, SR-IOV, CNI tuning)
- Storage systems (Ceph, distributed storage, NVMe optimization)
- Contribution to open-source projects (Kubernetes, OpenStack)
- Experience designing multi-region or hybrid cloud architectures
- Experience tuning AI/ML, quantitative, or high-performance compute workloads on GPUs
- Experience with NVIDIA DCGM, MIG (Multi-Instance GPU), or vGPU configurations
- Familiarity with RDMA, GPUDirect, or high-throughput interconnects
- Experience optimizing containerized ML or compute pipelines
Key Attributes
- Strong systems thinking and deep technical curiosity
- Ability to diagnose complex cross-layer failures
- Passion for building reliable, scalable distributed systems
- Comfortable operating in high-availability, high-performance production environments
- Strong documentation and knowledge-sharing mindset