Industry Q&A: How do you start the machine learning research process?
Comet recently hosted the online panel, “How do top AI researchers from Google, Stanford and Hugging Face approach new ML problems?” This post is our first in a series where we recap the questions, answers, and approaches that top AI teams in the world are taking to critical machine learning challenges.
One of the hardest parts of machine learning is simply getting started. Do you have the necessary data? Do you have the systems in place to manage your model? If you need to take it into production, do you have a good understanding of what the production environment looks like? All of these considerations need to be weighed to make sure your work succeeds, and to decide whether the problem you’re trying to solve is worth pursuing.
Gideon Mendels, Comet
There are a number of challenges to even approaching a machine learning problem. There are so many moving parts. For those of us in the industry, it’s very different from what you might see in something like a Kaggle competition, where you have a clean dataset and the metrics are figured out for you.
So how do you start the research process? What do you do when you have a new problem?
Ambarish Jash, Google AI
There are two parts. One is the research problem and coming up with the problem definition. The other is the challenge of putting it in production. Production puts significant constraints on the research you can do. One approach is to:
- Define the problem
- Define the system you’re going to build
If you need to go into production, I would strongly recommend building the system first and maybe starting with simpler models. In the long run, it makes things like debugging and maintenance simpler.
Keep things simple, build out pipelines. Once these are built, you can start to rapidly iterate on the model. You need a lot of data, and most of the time, the loss isn’t exactly what you care about. Having a strong evaluation framework is important too, because as you start to add complexity to your model, you will need to figure out whether it still makes sense for your final task.
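To make the point about loss versus the final task concrete, here is a minimal sketch of an evaluation harness. It is an illustration, not code from the panel: the labels, the two sets of predicted probabilities, and the model names "cautious" and "confident" are all hypothetical, and accuracy stands in for whatever task metric you actually care about.

```python
import math
from typing import Dict, List, Sequence


def log_loss(labels: Sequence[int], probs: Sequence[float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy, the kind of loss a classifier is often trained on."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def accuracy(labels: Sequence[int], probs: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of examples classified correctly at the given threshold."""
    hits = sum(int(p >= threshold) == y for y, p in zip(labels, probs))
    return hits / len(labels)


def evaluate(labels: Sequence[int], predictions: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Score every model on the same held-out labels with both metrics."""
    return {
        name: {"log_loss": log_loss(labels, probs), "accuracy": accuracy(labels, probs)}
        for name, probs in predictions.items()
    }


# Hypothetical held-out labels and predicted probabilities (assumed data).
labels = [1, 1, 1, 0]
predictions = {
    "cautious": [0.55, 0.55, 0.55, 0.45],
    "confident": [0.99, 0.99, 0.99, 0.60],
}

report = evaluate(labels, predictions)
for name, metrics in report.items():
    print(f"{name}: log_loss={metrics['log_loss']:.3f} accuracy={metrics['accuracy']:.2f}")
```

On this toy data the "confident" model has the lower loss but the lower accuracy, which is the kind of disagreement a shared evaluation step is meant to surface before you add complexity.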
Piero Molino, Stanford & Ludwig
This depends on the project, and on whether it’s an applied project or a research project. For an applied project, I’m often the only one with access to the data and there usually isn’t much historical progress on it. On a research problem, there may be papers about it and I can start from there.
If the project is applied, the first thing I do is try to understand the data. That’s the number one thing. Is there signal for the problem I want to solve? In many cases, a machine learning project won’t work simply because there isn’t enough signal for it in the data.
After looking at the data, I personally use tools that I build for myself. This is just because it’s easier for me to compare different models, have a standard pipeline, and then I can reuse it. Usually I train a simple model, see what’s there, look at the predictions, do some visualizations, understand the predictions and learning curves.
Once I feel I have a global understanding of the problem, the data, and an initial simple solution, then I double down on complex models or more sophisticated solutions. But something simple first, then scale up, is really good advice.
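One way to picture the "is there signal?" check is to compare a very simple model against a trivial baseline on held-out data. The sketch below is an assumption-laden illustration, not the tooling described above: the toy data is made up, and the one-feature threshold rule (`fit_stump`) and the majority-class baseline are stand-ins for whatever simple model and baseline fit your problem.

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_baseline_accuracy(train_y: Sequence[int], test_y: Sequence[int]) -> float:
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return sum(y == majority for y in test_y) / len(test_y)


def stump_predict(stump: Tuple[float, int], xs: Sequence[float]) -> list:
    """Predict `label_above` when x > threshold, the other label otherwise."""
    threshold, label_above = stump
    return [label_above if x > threshold else 1 - label_above for x in xs]


def stump_accuracy(stump: Tuple[float, int], xs: Sequence[float], ys: Sequence[int]) -> float:
    preds = stump_predict(stump, xs)
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


def fit_stump(xs: Sequence[float], ys: Sequence[int]) -> Tuple[float, int]:
    """Pick the threshold and direction with the best training accuracy."""
    best_acc, best_stump = -1.0, (0.0, 1)
    for threshold in sorted(set(xs)):
        for label_above in (0, 1):
            acc = stump_accuracy((threshold, label_above), xs, ys)
            if acc > best_acc:  # strict: keep the first of any ties
                best_acc, best_stump = acc, (threshold, label_above)
    return best_stump


# Hypothetical one-feature dataset (assumed for illustration only).
train_x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
train_y = [0, 0, 0, 0, 1, 0, 1, 1]
test_x = [0.15, 0.35, 0.55, 0.75]
test_y = [0, 0, 1, 1]

stump = fit_stump(train_x, train_y)
baseline_acc = majority_baseline_accuracy(train_y, test_y)
model_acc = stump_accuracy(stump, test_x, test_y)
print(f"baseline={baseline_acc:.2f} simple_model={model_acc:.2f}")
```

If the simple model cannot beat the trivial baseline on held-out data, that is an early hint that the feature may carry little signal, and it costs far less to learn that here than after building a complex model.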
I believe that machine learning projects are much closer to research projects than software projects. For software, usually you define your constraints and implement, and you know it will work ahead of time. That’s not the case for machine learning.
In machine learning, you don’t know to begin with whether the problem can be solved. By starting simple, and understanding if you have signal, you can get an idea of whether you have what you need to solve it. Otherwise you can spend a lot of time and end up with a model that doesn’t work. Fail fast, and try to figure out early whether you can solve the problem.
Victor Sanh, Hugging Face
My approach is really loose. We have a spreadsheet with a lot of ideas that come along, and we take the ones that excite us the most.
But I agree with Piero and Ambarish – you take the problems, then you want to start fast and iterate fast at the beginning. You want to understand the data and learning processes, so you can decide “is it worth it to pursue this problem for a few weeks?”
The first two weeks are decisive, because that’s when you get a sense of your data and understand whether there’s actually an improvement you can make there.
Want to watch the full panel? It’s available on-demand here.