Industry Q&A: What metrics do you track for in-production ML models?
Comet recently hosted the online panel, “How do top AI researchers from Google, Stanford and Hugging Face approach new ML problems?” This is the second post in a series where we recap the questions, answers, and approaches that top AI teams in the world are taking to critical machine learning challenges. You can access the first post here, and the second here.
You’ve built a model. It’s been trained. It’s going to production. But how do you ensure it’s working as expected? It’s not enough to simply send your model to production. You have to understand how it performs, whether it’s successful, and where adjustments need to be made. These steps are critical to the long-term success of every machine learning team. It also raises the question: what metrics should you be tracking once your model is in production?
Gideon Mendels, Comet
What do you all consider when you monitor models in production? Are you looking at distribution of features? The business OKR? Everything? Is there a certain process where you say, “This is the point where it makes sense to train a model”?
Piero Molino, Stanford & Ludwig
I can give you an example. One project I worked on at Uber was for customer support. The model helped customer support reps by classifying tickets and suggesting answers, the actions that would be needed, and the templates to use.
In that case, what we cared about was “how much faster can we make customer support representatives without sacrificing accuracy?” The more accurate your model is, the more you can speed them up, because the suggestions are more useful. But we had a situation where we could be 95-97% accurate for the top three questions, or 97%+ for a single question. In this case, supporting three questions at 95-97% accuracy did more to make customer support faster than getting that additional 2-3% for a single question.
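To make that trade-off concrete, here is a back-of-the-envelope sketch in Python. The ticket shares are hypothetical numbers chosen for illustration, not figures from the Uber project, and the accuracies are simply the ends of the ranges mentioned above. It assumes impact is roughly proportional to the share of all tickets that receive a correct suggestion.

```python
from typing import Sequence


def correctly_assisted_share(ticket_shares: Sequence[float], accuracy: float) -> float:
    """Fraction of all tickets that receive a correct suggestion.

    ticket_shares: share of total ticket volume for each supported question type.
    accuracy: model accuracy on the supported question types.
    """
    return sum(ticket_shares) * accuracy


# Hypothetical traffic mix (assumption): the three most common question types
# account for 10%, 8% and 7% of all tickets.
single_question = correctly_assisted_share([0.10], accuracy=0.97)
three_questions = correctly_assisted_share([0.10, 0.08, 0.07], accuracy=0.95)

print(f"single question: {single_question:.3f} of tickets assisted correctly")
print(f"three questions: {three_questions:.3f} of tickets assisted correctly")
```

Under these assumed shares, covering three question types at slightly lower accuracy helps with far more tickets than squeezing out extra accuracy on one.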
In terms of monitoring and retraining, we would run the experiments on a certain amount of data. We’d separate the data into bins, using some for training and others for prediction. Then we’d shift the window to understand how much older data you can add to your model before it becomes noise due to the change in distribution. Eventually we understood that data going back more than about 1.5 months just became noise. We also saw prediction quality drop after about a month, so we learned we needed to retrain the model every month or so.
This general approach to deciding how much data to use and how long to wait before retraining is a pretty good one, and it’s dynamic. There will be months where the distribution shifts more, and others where it shifts less.
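The sliding-window experiment described above could be sketched roughly as follows. This is a minimal illustration in Python, not the code used at Uber: the toy lookup model, the function names, and the synthetic bins are all assumptions. Each bin stands for one time period; the sketch trains on `history` consecutive bins, scores on the bin `horizon` steps after the training window ends, and averages over every window position.

```python
from collections import Counter
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (feature, label), a stand-in for a real ticket


def train_lookup_model(examples: List[Example]) -> Dict[str, str]:
    """Toy model (assumption): predict the most common label seen per feature."""
    counts: Dict[str, Counter] = {}
    for feature, label in examples:
        counts.setdefault(feature, Counter())[label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}


def accuracy(model: Dict[str, str], examples: List[Example]) -> float:
    if not examples:
        return 0.0
    hits = sum(1 for feature, label in examples if model.get(feature) == label)
    return hits / len(examples)


def sliding_window_backtest(
    bins: List[List[Example]], max_history: int, max_horizon: int
) -> Dict[Tuple[int, int], float]:
    """Mean accuracy for every (history, horizon) pair, both measured in bins."""
    results: Dict[Tuple[int, int], float] = {}
    for history in range(1, max_history + 1):
        for horizon in range(1, max_horizon + 1):
            scores = []
            # `end` is the index just past the last training bin.
            for end in range(history, len(bins) - horizon + 1):
                train = [ex for b in bins[end - history:end] for ex in b]
                test = bins[end + horizon - 1]
                scores.append(accuracy(train_lookup_model(train), test))
            if scores:
                results[(history, horizon)] = sum(scores) / len(scores)
    return results


# Synthetic data (assumption): the correct template for the same kind of
# ticket changes from "A" to "B" halfway through, simulating a shift.
bins = [[("late_driver", "A")] * 4 for _ in range(3)] + [
    [("late_driver", "B")] * 4 for _ in range(3)
]
results = sliding_window_backtest(bins, max_history=3, max_horizon=3)

for (history, horizon), score in sorted(results.items()):
    print(f"history={history} bins, horizon={horizon} bins: accuracy={score:.2f}")
```

Reading the results along the `history` axis suggests how much older data helps before it starts to hurt; reading along the `horizon` axis suggests how long a trained model stays useful, which is one way to pick a retraining interval.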
Ambarish Jash, Google AI
I agree with Piero. He alluded to some long-term pull back as well. You want to know how your model ages. Does it age well or is it all noise after two weeks?
What happens many times is you deploy a model in the wild, all of your metrics light up, and you’re very happy. But after two weeks, those metrics turn red. There could be any number of reasons. Sometimes it just doesn’t age well. Other times the model is optimized for something, but the user learns to ignore things. The long-term pull back is important to understand.
The other item is to have a continuous retraining pipeline. This will help you understand how much fresh content you need to serve, or how much is coming to you. For example, in the restaurant recommendation game, there may be hundreds of new restaurants every week, so you need to retrain weekly. But if you’re in the YouTube recommendation game, and there are millions of videos every minute, you may need to retrain a few times a day.
Freshness of content is a business metric that can drive how often you want to retrain your model.
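One simple way to turn content freshness into a retraining trigger is to track the share of the catalog that the current model has never seen. The sketch below is only an illustration in Python: the function names, the 5% threshold, and the sample catalog are assumptions, not something described by the panelists.

```python
from datetime import datetime, timedelta
from typing import Sequence


def unseen_item_share(
    item_created_at: Sequence[datetime], model_trained_at: datetime
) -> float:
    """Share of catalog items created after the model's last training run."""
    if not item_created_at:
        return 0.0
    unseen = sum(1 for created in item_created_at if created > model_trained_at)
    return unseen / len(item_created_at)


def should_retrain(
    item_created_at: Sequence[datetime],
    model_trained_at: datetime,
    max_unseen_share: float = 0.05,  # hypothetical threshold, tune per product
) -> bool:
    """Flag a retrain once too much of the catalog is newer than the model."""
    return unseen_item_share(item_created_at, model_trained_at) > max_unseen_share


# Hypothetical catalog: 95 items that existed at training time, 10 added since.
trained_at = datetime(2021, 1, 1)
catalog = [trained_at - timedelta(days=30)] * 95 + [trained_at + timedelta(days=2)] * 10

share = unseen_item_share(catalog, trained_at)
print(f"unseen share: {share:.3f}, retrain: {should_retrain(catalog, trained_at)}")
```

A catalog that grows slowly crosses the threshold rarely, while a fast-growing one crosses it often, which matches the weekly versus several-times-a-day contrast in the examples above.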
Want to watch the full panel? It’s available on-demand here.