Hi, hope you are doing well.
I'm an AWS MLOps & DevOps engineer.
Recently, I deployed an ML model to AWS for production. The work included:
- Created an EKS cluster using Terraform
- Built Docker images for the application components: frontend, backend, model serving, model inference, and database
- Pushed the Docker images to Amazon ECR
- Created an auto-scaling group for each component, so each one scales automatically based on CPU utilization
- Used Spot Instances for the model-serving and model-inference components
- Deployed the Cluster Autoscaler from its manifest
- Deployed the Metrics Server to collect resource metrics
- Deployed Prometheus and Grafana for cluster resource monitoring, using Terraform
- Deployed the EFK stack (Elasticsearch, Fluent Bit, and Kibana) to the Kubernetes cluster for application debugging, using Helm
- Deployed Kafka on the same Kubernetes cluster to ship all logs to an S3 bucket
- Ran Spark jobs to analyze the large volume of log data for each cluster namespace
- Built a Jenkins pipeline for CI and used ArgoCD for Kubernetes deployments
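To give a flavor of the Terraform step above, here is a minimal sketch of how an EKS cluster with an on-demand node group plus a Spot node group for model serving might be declared using the community `terraform-aws-modules/eks` module. The module version, cluster name, instance types, and sizes are illustrative assumptions, not the exact configuration from this project.

```hcl
# Sketch only: names, versions, and sizes are assumptions for illustration.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "ml-product-cluster" # hypothetical name
  cluster_version = "1.29"
  vpc_id          = var.vpc_id
  subnet_ids      = var.private_subnet_ids

  eks_managed_node_groups = {
    # On-demand capacity for frontend, backend, and database components
    general = {
      instance_types = ["m5.large"]
      min_size       = 2
      max_size       = 6
      desired_size   = 2
    }
    # Spot capacity for the model-serving / model-inference workloads
    model_serving = {
      instance_types = ["g4dn.xlarge"]
      capacity_type  = "SPOT"
      min_size       = 1
      max_size       = 4
      desired_size   = 1
    }
  }
}
```

Spot node groups like this are what make the cost savings on inference possible; the Cluster Autoscaler then grows or shrinks each group as pods demand capacity.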
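The build-and-push-to-ECR step usually comes down to a few commands like the ones below. The account ID, region, repository name, and build context path are placeholders, not values from this project.

```shell
# Placeholders: replace account ID, region, and repo name with real values.
AWS_ACCOUNT_ID=123456789012
AWS_REGION=us-east-1
REPO=model-serving
ECR_URI="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO}"

# Authenticate Docker to ECR, then build, tag, and push the image
aws ecr get-login-password --region "$AWS_REGION" \
  | docker login --username AWS --password-stdin \
      "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
docker build -t "${ECR_URI}:latest" ./model-serving
docker push "${ECR_URI}:latest"
```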
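Inside the cluster, CPU-based scaling per component is typically expressed as a HorizontalPodAutoscaler, which relies on the Metrics Server mentioned above for CPU readings. The Deployment name, replica bounds, and utilization target below are illustrative assumptions.

```yaml
# Sketch only: names and thresholds are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving        # hypothetical Deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```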
I hope this experience is relevant to your project.
Looking forward to hearing from you.
Thank you.