Brett Michaelis

Summary

Senior Site Reliability Engineer with 10+ years architecting and operating hybrid cloud and bare metal infrastructure at scale. Deep experience building Kubernetes platforms across AWS, GCP, and private data centers, plus hands-on GPU/CUDA workload management and bare metal server provisioning via PXE boot and SaltStack. Track record of direct collaboration with ML engineers and data scientists to build the infrastructure that accelerates AI development. Strong in observability (Prometheus, Grafana, Mimir), infrastructure as code (Terraform), and automation-first approaches to platform reliability and self-service.

Core Skills

Cloud Platforms: GCP, AWS, Azure, Hetzner, UpCloud, Tier.Net
Containers, Orchestration & Storage: Kubernetes, Helm, Docker, Nomad; Ceph
Bare Metal & GPU: PXE boot provisioning, SaltStack configuration management, CUDA workloads
Infrastructure as Code & Automation: Terraform, Bash, Python, Go
CI/CD: GitLab CI, GitHub Actions, ArgoCD, Bitbucket Pipelines, Jenkins
Observability & Reliability: Prometheus, Grafana, Mimir, Alloy, TICK stack; SLO/SLI practices; incident response (Five Whys, AAR)
FinOps: Kubecost; multi-cloud cost optimization
Languages: Python, Go, JavaScript (Node.js)

Experience

Operations Engineer 03/2025 - Present

Smarty.com | Orem, UT

Lead migration of a legacy Grafana observability platform to a GitOps-managed deployment, auditing and rationalizing all alerting across production services as part of the initiative.
Operate observability stack using Prometheus, Grafana, Mimir, and Alloy for metrics collection, long-term storage, and dashboarding.
Implement canary deployments via Nomad for progressive production rollouts, enabling confident releases with automated rollback.
Drive company-wide migration from Bitbucket to GitHub, including full re-implementation of all CI/CD workflows in GitHub Actions.
Manage multi-cloud deployments (Tier.Net, UpCloud, GCP, AWS, Hetzner) with Terraform, Nomad, and Bitbucket Pipelines, improving uptime and deployment velocity.
Automate repetitive workflows with Bash and Go, reducing manual toil across operations.

Senior DevOps Engineer 04/2020 - 10/2024

Five9.com

Orchestrated Kubernetes deployments on GCP using Helm and Terraform to support high-availability SaaS workloads at scale.
Built self-service deployment tooling and automation including ArgoCD-based GitOps workflows, enabling engineering teams to provision and release independently.
Partnered with product and engineering teams to define and track SLOs/SLIs, supporting customer-facing uptime goals.
Streamlined incident response with Five Whys analysis and blameless postmortems, improving on-call processes and reliability.

Software Engineer / DevOps Engineer 03/2017 - 04/2020

Vivint SmartHome

Provisioned and configured bare metal servers at scale using PXE boot and SaltStack, building GPU/CUDA compute capacity for ML model training workloads.
Managed Ceph storage clusters to support high-throughput data access across a 1.5 PB ML data lake on GCP.
Collaborated directly with ML engineers and data scientists to translate research requirements into scalable, production-ready infrastructure.
Developed and deployed Go-based microservices, optimizing performance and reducing latency.
Implemented TICK stack for observability and system health monitoring across ML and application workloads.

Director, IT & Software Development 2011 - 2017

Unicity International

Led global infrastructure modernization, migrating legacy apps to containerized, cloud-native environments.
Standardized cloud deployments on AWS (EC2, S3) to improve scalability and global availability.
Introduced reliability practices, including error budgeting and deployment automation.

Assistant Director, Web Development 2005 - 2010

Utah Valley University

Directed university-wide web development projects, improving service reliability and scalability for mission-critical systems.

Counterintelligence Agent 1998 - 2006

U.S. Army – Utah National Guard

Conducted secure intelligence operations, applying structured incident response and after-action review (AAR) methods.