Building a DevOps Pipeline for Machine Learning and AI: Evaluating Sagemaker

The rush to build and deploy machine learning models has exposed cracks in traditional DevOps processes.


Infrastructure built for deploying traditional applications is not optimized for the challenges of deploying AI/ML at scale. This challenge comes from the inherent differences in how data scientists and engineers work. The iterative and experimental nature of a data scientist’s workflow, combined with the high variability of their computational needs, has made it difficult to build DevOps machine learning (ML) pipelines, increasingly called ‘MLOps’.

The goal behind an MLOps pipeline is to stabilize and streamline the ML model release process in a way that still allows for flexible experimentation. However, it’s difficult to settle on a consistent framework for building this pipeline because of the lack of maturity in both ML team practices and tools. This post provides a framework to evaluate MLOps pipeline components and dives into some potential solutions for these components, including AWS SageMaker and Comet.

Goals and Considerations

Deploying a model to production is just one part of the MLOps pipeline. An effective MLOps pipeline also encompasses building a data pipeline for continuous training, proper version control, scalable serving infrastructure, and ongoing monitoring and alerts. Clearly, there are similarities with traditional software development, but there are still some important open questions to answer:

For DevOps engineers

  • How do I hook this up to existing systems?
  • How can I make these components scalable?
  • Will it make model deployment easier?
  • Will it make working with data scientists easier?

For data scientists

  • Is it as flexible as using my own environment tools/setup?
  • Will it make building and training models easier?
  • Will it make working with other data scientists easier?
  • Will it make working with DevOps easier?

These questions boil down to three primary goals that an effective MLOps pipeline should fulfill:

  • Scaling infrastructure
  • Flexible team collaboration
  • Reproducibility

These goals are important regardless of your team’s size and composition, but their execution may differ based on those factors. Many teams are faced with the question of whether to set up their own custom environments or buy managed services from cloud providers like Amazon, Google, or Microsoft.

Buy versus Build

One option you might consider for your DevOps ML pipeline is AWS Sagemaker. AWS announced Sagemaker as a “fully managed end-to-end machine learning service that enables data scientists, developers, and machine learning experts to quickly build, train, and host machine learning models at scale.”

So where does Sagemaker fit into the overall pipeline and how does it differ from a custom environment?

Model Development

SageMaker’s value boils down to abstraction and uniformity. SageMaker is essentially a managed Jupyter notebook instance in AWS that provides an API for easy distributed training of deep learning models. Under the hood, the main components of SageMaker are specific Amazon Machine Images (AMIs) and ordinary EC2 instances with data coming from S3 object storage [Source].

If you were to set up these equivalent parts on your own, you could self-host an EC2 instance. When you request an EC2 instance, you can specify which AMI should be used as its template (see the AWS Marketplace for the full list of pre-configured AMIs). AMIs capture the exact state of an environment, including details like the operating system, libraries, applications, and more. For deep learning projects, Amazon offers a subset of AMIs called Deep Learning AMIs that come pre-installed with open-source deep learning frameworks.
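As a rough sketch of the self-hosted route, the request for such an instance boils down to a handful of parameters. The snippet below only assembles them and never contacts AWS. The helper name, the AMI ID, and the instance type are placeholders chosen for illustration; the parameter names are assumed to mirror those of the EC2 RunInstances call (for example, boto3's `run_instances`).

```python
def build_instance_request(ami_id, instance_type, count=1):
    """Assemble the parameters for launching EC2 instances from an AMI.

    Hypothetical helper: it only builds the request and never calls AWS.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "ImageId": ami_id,              # the AMI used as the instance template
        "InstanceType": instance_type,  # hardware profile, e.g. a GPU instance
        "MinCount": count,              # launch exactly `count` instances
        "MaxCount": count,
    }


# Placeholder values for illustration; "ami-EXAMPLE" is not a real AMI ID.
request = build_instance_request("ami-EXAMPLE", "p3.2xlarge")
print(request)
```

With AWS credentials configured, a dictionary like this could be passed as keyword arguments to an EC2 client. Everything that happens afterward (package updates, notebook setup, network configuration) stays your responsibility.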

During the development phase, using SageMaker is not very different from running Jupyter notebooks on a Deep Learning AMI. It can actually be less flexible, since you have to adjust your code to explicitly provide EstimatorSpec objects for the training, evaluation, and prediction modes and use supported data types [Source].

Model Training

For distributed model training, Sagemaker runs on a fully managed elastic compute server that automatically scales in proportion to the job size. If you’re using Sagemaker’s ‘optimized’ algorithms, you can see performance boosts since these algorithms were specifically designed to scale across multiple EC2 and GPU instances [Source].

Model Deployment

SageMaker’s abstraction may actually become a drawback during deployment if you want to deploy a custom model from outside SageMaker. For custom models, Amazon allows you to load a Docker image into Elastic Container Registry (ECR) for production, but you have to use a very specific Docker directory structure. For its built-in algorithms, SageMaker provides published Docker images with consistent training and inference interfaces for each algorithm [Source].
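To make the "very specific directory structure" more concrete, here is a minimal sketch of the layout a custom training container typically works with: configuration and input data under one prefix, and a model directory whose contents are saved after training. In a real container the prefix is `/opt/ml`; the sketch uses a temporary directory so it can run anywhere. The channel name `training`, the helper names, and the `model.json` artifact are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def make_layout(prefix):
    """Create the directories a SageMaker-style training container expects."""
    prefix = Path(prefix)
    paths = {
        "config": prefix / "input" / "config",            # job configuration files
        "train": prefix / "input" / "data" / "training",  # one folder per data channel
        "model": prefix / "model",                        # artifacts saved after training
        "output": prefix / "output",                      # failure messages go here
    }
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


def load_hyperparameters(config_dir):
    """Read hyperparameters.json; values arrive as strings and need casting."""
    with open(Path(config_dir) / "hyperparameters.json") as f:
        return json.load(f)


# Simulate the layout in a temp directory instead of the real /opt/ml prefix.
root = tempfile.mkdtemp()
paths = make_layout(root)
(paths["config"] / "hyperparameters.json").write_text(json.dumps({"epochs": "10"}))

params = load_hyperparameters(paths["config"])
epochs = int(params["epochs"])  # cast from string before use

# Stand-in for real training: write a placeholder artifact to the model dir.
(paths["model"] / "model.json").write_text(json.dumps({"epochs_trained": epochs}))
print(sorted(p.relative_to(root).as_posix() for p in paths.values()))
```

Training code written against this layout has to read and write at fixed locations, which is the main adjustment when bringing a model developed outside SageMaker.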

If you were to build out your own pipeline with AWS components for deploying models, you would need to piece together AWS API Gateway, AWS Lambda, and CloudWatch [Source].
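As a hedged illustration of how those pieces fit together, the sketch below shows a Lambda-style handler that receives a JSON request body (as API Gateway's proxy integration would pass it), runs a prediction, and returns an HTTP-style response. Log lines written this way would end up in CloudWatch when running on Lambda. The `predict` function is a stand-in that averages its inputs, and the `features` field name is an assumption, not part of any AWS API.

```python
import json
import logging

logger = logging.getLogger(__name__)


def predict(features):
    """Stand-in for a real model: returns the mean of the input features."""
    return sum(features) / len(features)


def lambda_handler(event, context):
    """Handle an API Gateway proxy event carrying a JSON body."""
    try:
        payload = json.loads(event.get("body") or "{}")
        features = [float(x) for x in payload["features"]]
        if not features:
            raise ValueError("features must not be empty")
    except (KeyError, TypeError, ValueError) as exc:
        # Malformed JSON, a missing field, or non-numeric values land here.
        logger.warning("bad request: %s", exc)
        return {"statusCode": 400, "body": json.dumps({"error": "invalid input"})}

    prediction = predict(features)
    logger.info("prediction=%s", prediction)
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}


# Local invocation with a hand-built event; no AWS account is needed.
response = lambda_handler({"body": json.dumps({"features": [1, 2, 3]})}, None)
print(response)
```

In a real deployment, the model would be loaded once outside the handler, and API Gateway, IAM permissions, and log retention would still need to be configured separately, which is the maintenance cost a managed service takes on.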

For teams without dedicated DevOps or data engineering resources, the extra cost of a managed service like Sagemaker could be worth the time saved maintaining the EC2 instance (especially if you’re using Spot instances), updating packages, and doing complex network configurations on a virtual private cloud (VPC).

Ultimately, adopting a managed service like Sagemaker may make scaling infrastructure easier to handle, but still leaves two important goals for your MLOps pipeline unaddressed: flexible team collaboration and reproducibility.

What’s missing?

We’ve seen how AWS SageMaker provides convenient and reliable infrastructure to train and deploy machine learning models, but how can we build reproducibility into the DevOps ML pipeline?

In the same way that proper source control made software engineering highly productive, reproducibility in machine learning can prevent bottlenecks that significantly increase costs even with a solid DevOps pipeline in place. These bottlenecks appear when data scientists and engineers try to:

  • Recover work from a data scientist that left the company
  • Compare results across different model iterations
  • Reproduce model results from a collaborator, research paper, or even a model in production
  • Trace the original training data, dependencies, hyperparameters or actual model while trying to retrain models efficiently
  • Avoid duplicating work across teams
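To give a sense of what needs to be captured to avoid these bottlenecks, here is a minimal hand-rolled sketch that records a fingerprint of the training data, the hyperparameters, the metrics, and the Python version for each run. It is not any particular tool's API; the function names, the toy dataset, and the metric values are all made up for illustration.

```python
import hashlib
import json
import platform


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest that identifies the exact training data."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(dataset: bytes, hyperparameters: dict, metrics: dict) -> dict:
    """Collect what is needed to trace and compare a training run."""
    return {
        "dataset_sha256": fingerprint(dataset),
        "hyperparameters": hyperparameters,
        "metrics": metrics,
        "python_version": platform.python_version(),
    }


# Toy data and made-up metric values, used only to show the comparison.
dataset = b"x,y\n1,2\n3,4\n"
run_a = build_manifest(dataset, {"learning_rate": 0.01}, {"accuracy": 0.90})
run_b = build_manifest(dataset, {"learning_rate": 0.10}, {"accuracy": 0.85})

# Two runs are comparable when they were trained on identical data.
same_data = run_a["dataset_sha256"] == run_b["dataset_sha256"]
print(json.dumps(run_a, indent=2))
print("same training data:", same_data)
```

Even this small record answers "which data and settings produced this model?", but keeping it complete and consistent by hand across a team quickly becomes its own maintenance burden.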

Making machine learning work reproducible is not easy since training processes can be filled with numerous data transformations and splits, model architecture changes, and hyperparameter configurations. This is especially true in collaborative settings, where data scientists working on different versions of a model may make hundreds of changes to the files in the project.

Comet is doing for machine learning what GitHub did for code. We allow machine learning developers to automagically track their datasets, code changes, experimentation history and models. Backed by thousands of users and multiple Fortune 100 companies, Comet provides insights and data to build better, more accurate models while improving productivity, collaboration and explainability.

By operating as a layer on top of environments such as Sagemaker, vanilla Jupyter notebooks, or any other development environment, Comet serves as a single source of truth for machine learning work that’s in production or still in progress. Once both data scientists and engineers understand the characteristics of an effective model to deploy and more importantly, what technical effort was involved in creating that model, crucial post-analyses around model architectures, hyperparameters, and optimizations are possible.

Connecting Comet and SageMaker

Since Comet is agnostic to your choice of both infrastructure and machine learning framework, you can continue using your current training process, whether it’s in AWS SageMaker or your own custom environment.

Next week’s article will feature a tutorial on how Comet integrates with AWS SageMaker’s TensorFlow Estimator API. We will be adapting an example that runs the ResNet model on the CIFAR-10 dataset with TensorFlow.

It’s easy to get started

And it's free. Two things everyone loves.