ICLR Reproducibility Interview #3: Alfredo and Robert


Figure from Variational Sparse Coding Paper. Full paper here

Interested in learning more about the Reproducibility Challenge? Read our kick-off piece here

The third interview in our series is with Robert Alonso and Alfredo De La Fuente. They share an interest in generative models and machine learning competitions.

Alfredo is a dual Master’s student at the Skolkovo Institute of Science and Technology and the National Research University Higher School of Economics in Moscow. Robert is a Research Assistant at the Pontifical Catholic University of Peru.

Alfredo and Robert collaborated remotely on the Reproducibility Challenge from Russia and Peru to reproduce their selected paper. In our discussion, Robert and Alfredo share their experience of writing modular, readable code and refactoring that code to expand on the original paper.

Interview recap (TLDR):

  • The team was able to reproduce all the experiments in the paper and is working both to extend the proposed encoding method to new datasets and to provide a benchmark against different architectures for the task (e.g. disentanglement metrics, task-specific metrics).
  • While the initial paper was missing details about batch size and weight initialization, the team was able to get in touch with the authors and get these details included in the paper.
  • The team made suggestions for presenting reproducible research: setting the random seed (see a good resource on this topic here), reporting library versions and system environment details, refactoring the code to split a single script into multiple modules, and using Jupyter notebooks to generate the plots shown in the paper.

Reproducibility Report Details:

Team members:

The Interview

Cecelia: Hello Alfredo and Robert! Thanks for agreeing to be interviewed as part of the Reproducibility Challenge. So to kick things off, can you just introduce yourselves?

Alfredo: You go first Robert?

Robert: Ok, so my name is Robert Aduviri, and I’m a researcher at the Artificial Intelligence Research Group of the Pontifical Catholic University of Peru. I finished my undergrad last year and I have been working on Natural Language Processing, Recommender Systems, Genomics, and now Generative Models. I also work part-time in research and development at a startup called Karaoke Smart, and as a teaching assistant at my University in courses such as Data Analysis, Machine Learning, and Artificial Intelligence.

Finally, I love participating in Data Science competitions, with the last highlight being qualifying for the Data Science Game 2018 competition finals, in Paris.

Cecelia: So Robert, you’re basically a superhuman from what I’ve gathered. Okay, Alfredo — what about you?

Alfredo: I also did my bachelor’s in Peru, although it was not computer science related. I studied Petroleum Engineering. During my last year, I decided to learn Computer Science and Mathematics on my own.

I worked for some months in Peru as a Research Programmer, and then I moved to Russia, where I am currently getting my Master’s in Statistical Learning Theory. My latest interests are Generative Models, Representation Learning and Reinforcement Learning. Robert and I have also participated in different Kaggle-style competitions.

Cecelia: And how is it that you two know each other? Is it from overlapping time in Peru or somewhere else?

Alfredo: It’s crazy because actually like we haven’t-

Robert: –Met in person. We talked for some time in the Rimac Data Science Competition.

Alfredo: Yeah we met in a competition, but we were on opposite teams.

Cecelia: You’re the first group I’ve interviewed that’s been working remotely in different places, so I’m interested in hearing what your experience was like. The next question is how did you find out about the challenge, and why is it that you were interested in participating?

Robert: I think Alfredo told me about the challenge if I remember correctly.

Alfredo: I remember I found out about this challenge on Twitter. Nowadays, it seems like most important events and research are announced on Twitter.

In addition, I had the chance to meet Professor Joelle Pineau at NeurIPS 2018. She had a talk about reproducibility, and was encouraging people to participate in such competitions. This was the motivational boost I needed. Then I told Robert and we agreed to participate.

Cecelia: What is so important about reproducibility?

Alfredo: During my Master’s program, I had many final course projects that involved reproducing research papers in order to benchmark them.

It was quite challenging at first. I thought that after carefully reviewing the paper, it would be very straightforward to follow through with the implementation. However, I soon found out just how difficult it really is. Sometimes you do exactly what the authors suggest and it simply doesn’t work.

Robert: I was very immersed in participating in research projects and the idea of reproducing a paper submitted to one of the most important machine learning conferences was very appealing to me.

So we started to look at all the papers, and there were a lot of them. We had to filter them based on how much computing power was required to reproduce the results, and also based on topics that interested us.

Alfredo: So basically, we chose the paper to reproduce after we filtered by topic, complexity, and what computational resources were required for the project.

In addition, I personally found it more interesting to work on a paper that did not have a repository available online. Otherwise you have a reference that you can always access and copy parts from.

Cecelia: The challenge did provide some compute resources. I think Google was a sponsor right?

Alfredo: They provided around 300 dollars of computing credits.

Cecelia: But even with that credit you still felt that some papers were computationally too intensive?

Alfredo: Exactly. Especially the ones that involved running reinforcement learning algorithms for many epochs with different seeds, or ones with huge datasets which required a very good computer just to store the dataset.

There were some circumstances where we actually required more computational resources for this project. But luckily, Robert had access to some computers in his university.

Robert: Right. Being a research assistant here at the University, I had access to a couple of GPUs. So that was very, very helpful for us.

Alfredo: In the beginning, we worked exclusively with small datasets for a proof-of-concept until we got it working correctly.

We read the paper exhaustively, and wrote all the formulas down. We were lucky that after a week our code was working for a small scale with CPU. Later we made some adaptations for bigger datasets which involved the use of GPU for computations.

Robert: After the first version of the machine learning model was implemented, we worked on refactoring the model so that we could include more datasets and set up more experiments later on. That was very helpful. When we submitted the report, we also received review comments saying that the code was very well organized and refactored, so the reviewers could understand it very well.

Cecelia: Very cool. Can you briefly describe the paper?

Alfredo: So the paper is called Variational Sparse Coding. Basically it suggests a model that learns latent features from data by imposing a specific sparse prior distribution. By doing that, it’s actually able to represent better and meaningful features in the latent space that we can use later for different downstream tasks.

Robert: Exactly. The most widely used generative models are Generative Adversarial Networks (GANs) and Variational Autoencoders.

Usually these Autoencoders have a normal distribution prior, so the main innovation in this research project was to derive and use another distribution instead, Spike-and-Slab, which would induce sparsity in the latent space.
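To make the idea of a sparsity-inducing prior concrete, here is a minimal sketch of sampling from a spike-and-slab distribution in plain Python. It is an illustration only, not the team's or the authors' implementation: the function name `sample_spike_and_slab`, the parameter `alpha` (the probability that a latent coordinate is "on"), and the choice of a standard normal slab are all assumptions made for this example.

```python
import random


def sample_spike_and_slab(rng, latent_dim, alpha=0.2):
    """Draw one latent vector from a spike-and-slab distribution.

    Each coordinate is drawn from the "slab" (here a standard normal,
    an assumption for this sketch) with probability alpha, and is
    exactly zero (the "spike") otherwise.
    """
    return [
        rng.gauss(0.0, 1.0) if rng.random() < alpha else 0.0
        for _ in range(latent_dim)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
z = sample_spike_and_slab(rng, latent_dim=1000, alpha=0.2)
nonzero_fraction = sum(1 for value in z if value != 0.0) / len(z)
print("fraction of non-zero coordinates:", nonzero_fraction)
```

With a small `alpha`, most coordinates of each sample are exactly zero, which is the sparsity Robert describes. A normal prior, by contrast, almost never produces an exact zero.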

Cecelia: And given that you’re located in different areas, how is it that you collaborated on this implementation? What tools did you use?

Alfredo: I think we were both pretty familiar with GitHub. We are pretty active on that platform. For experiments we tried to use some Jupyter notebooks or some scripts and that was pretty much it. We didn’t even need Skype or anything.

Robert: That was very interesting. Actually I believe this is the first time we’re seeing each other through a video call. We were communicating the whole time through Facebook Messenger.

Cecelia: Nice. And how did you share your hyperparameter configurations, and system environment details?

Robert: Fortunately, the paper was very detailed about the type of parameters that were used so we followed them. There were just a couple of missing details like the Batch Size and also the Weights Initialization.

We just assumed those parameters. And it was very interesting because when we finished and told the authors about the report we were doing, they confirmed that the information was missing. And then they provided the hyperparameters that they actually used and they were glad to know that, even with different hyperparameters, the results were similar.

Cecelia: What were some of the challenges that you encountered when trying to reproduce this paper?

Alfredo: One of the main challenges we faced was hyperparameter tuning, because certain parameters were not clearly specified in the original paper.

We noticed some particular issues with numerical stability that we needed to take into consideration to make the model training stable.

Cecelia: What kind of bugs did you find? Can you provide more details around that?

Alfredo: Since the distribution is spike-and-slab, we had to deal with numerical instabilities, as I mentioned before.

Furthermore, we had to carefully choose the activation layers, because some of them may make the output take zero values and we didn’t want that. We wanted the output to be as close to zero as possible, but not exactly zero. So indeed, it was mostly numerical instabilities, but nothing dramatically important.
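The kind of issue Alfredo describes can be illustrated with a small, generic sketch. The interview does not say which terms caused trouble, so the example below is an assumption: it uses the KL divergence between two Bernoulli distributions, a term of the kind that shows up when spike probabilities are compared, and shows how clamping a probability away from exactly 0 or 1 keeps the logarithm finite. The names `clamp`, `bernoulli_kl`, and the tolerance `EPS` are hypothetical.

```python
import math

EPS = 1e-8  # hypothetical tolerance; a real project would tune this


def clamp(p, eps=EPS):
    """Keep a probability inside [eps, 1 - eps] so log() stays finite."""
    return min(max(p, eps), 1.0 - eps)


def bernoulli_kl(gamma, alpha, eps=EPS):
    """KL(Bernoulli(gamma) || Bernoulli(alpha)) with clamped inputs.

    Without clamping, gamma == 0.0 or gamma == 1.0 would call
    math.log(0.0), which raises ValueError in Python.
    """
    g = clamp(gamma, eps)
    a = clamp(alpha, eps)
    return g * math.log(g / a) + (1.0 - g) * math.log((1.0 - g) / (1.0 - a))


# An activation that outputs exactly zero is handled without an error.
print(bernoulli_kl(0.0, 0.2))
```

The same idea carries over to tensor libraries, where an exact zero would typically produce `inf` or `nan` in the loss instead of raising an exception.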

Robert: And actually we are working on those improvements, which we mentioned to the authors, and we were very glad to know that the authors also agreed with the recommendations we made. The main one, which Alfredo suggested, was to work with a dataset that was specifically made (by DeepMind) to assess the disentanglement properties of these models, because these variational encoder models must learn a latent space and usually they don’t produce results in the way you expect. Maybe one latent variable describes two factors of variation at the same time. Maybe changing the hair color of a person also ends up putting glasses on their face. It is preferable for these factors to be independent of each other.

Alfredo: The main drawback of the paper was not being able to quantitatively compare the proposed approach with state of the art models. They made good conclusions, but they didn’t support it very much with numbers.

Cecelia: Okay. For this model, how did you evaluate it? What was the evaluation metric? What were the numbers that they could have provided to make their case a little bit more clear?

Alfredo: For example in their paper, what they do is they learn these lower dimensional representations of the data, and then use them for different tasks, such as classification…

Cecelia: I see. So really like task-specific.

Alfredo: Yeah, exactly, and that was good. We replicated that, yet it was still lacking something. For example, what we suggested is to use a disentanglement metric. It would perhaps be more interesting to evaluate the learned features in a way that is not tied to a specific task, giving some kind of support for why the features that are learned are meaningful.

Robert: Actually, in order to quantify what meaningful meant we saw that there was recent research on quantifying these disentanglement properties.

Alfredo: Truth be told, I didn’t know a lot about generative models before starting this project. However, we read a lot of papers and we dived deeper into the topic. The thing is that when you implement models from scratch, you also get a real idea of why some things work and others do not. So that was good.

Cecelia: And did you just share the results with the authors? Or your whole approach along with the repository?

Robert: With a report and also the repo. We communicated with them through OpenReview.

Alfredo: -Yes, exactly. It was anonymized, so we couldn’t actually send them an email.

Cecelia: Having done the challenge, how will you approach your own work differently?

Robert: Yeah, I would say that I am now very thoughtful about reproducibility.

Alfredo: Yes, to have some kind of reproducibility part within the work as a criterion. I think that’s the way to go. Otherwise it takes endless hours to validate, debug, and in some cases solve drastic issues in other authors’ code each time you want to publish something new.

Cecelia: What is it that you think would make machine learning research more reproducible? What are some challenges that you think these researchers have in making their work reproducible? Why isn’t it that everyone is submitting the code and all these details already?

Alfredo: The problem is that it involves sharing the know-how of each company. If the project includes some particular software or a private dataset (for example, if you work with medical images), it cannot simply be shared publicly. This is the main constraint for people sharing intellectual property and making reproducible projects.

Perhaps a good idea would be to just provide the code and a synthetic dataset. From the academia side, I think the main reason for not supporting reproducible research is the rush for deadlines and grants.

The publish or perish atmosphere restricts the time devoted to reproducibility documentation.

Robert: And I also think that in some cases there may be some gap of knowledge about best practices and tools that could be used to make projects more reproducible.

In one case, I would think that the current research incentives could be a cause. All I need is to submit the paper, so I have no strong incentive for organizing the code and making it reproducible right?

It actually isn’t that difficult to make the project more reproducible. For instance, setting the random seed, so that everything you run will be the same is just a couple of extra lines of code.
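As a rough illustration of those "couple of extra lines", here is a minimal seeding sketch in Python. The interview does not state which libraries the team used, so the optional NumPy and PyTorch branches are assumptions, and `set_seed` is a hypothetical helper name. Only the standard library is required for the sketch to run.

```python
import random


def set_seed(seed=42):
    """Seed the random number generators available in this environment."""
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np  # optional: only seeded if installed
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional: only seeded if installed
        torch.manual_seed(seed)
    except ImportError:
        pass


# Two runs with the same seed produce the same "random" numbers.
set_seed(0)
first = [random.random() for _ in range(3)]
set_seed(0)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Seeding alone does not guarantee identical results across different hardware or library versions, which is one reason reporting the environment matters as well.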

You can export your current environment to report the versions of all the libraries you used, and capture details about the operating system environment with a Dockerfile. There is also a project named repo2docker, which is a tool for converting a GitHub repo to a Docker image.
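One lightweight way to report versions, sketched here with only the Python standard library, is to collect the interpreter version, the platform string, and the installed versions of the packages a project depends on. The helper name `environment_report` and the package names passed to it are assumptions for illustration. This complements a Dockerfile and does not replace one.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Return a dict describing the Python version, OS, and package versions."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# The package list is hypothetical; use the ones your project imports.
for key, value in environment_report(["numpy", "torch"]).items():
    print(f"{key}: {value}")
```

Saving this output next to the results makes it easier for a reader to rebuild a matching environment later.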

I think it is important to promote more of these best practices, telling researchers that it’s not that difficult, that it doesn’t take that much time, and it helps a lot for the person who will later read and try to run the code.

Alfredo: -And for the whole community. One concrete example would be cherry picking, which is common in computer vision and reinforcement learning. Doing this only compromises the unbiased assessment of a model’s performance.

Robert: That’s very funny because I remember that we found a very long paper about good practices for Variational Autoencoders, which also had very good recommendations about how to make them work nearly as well as GANs. This problem of cherry picking is so well known in the community that in the caption of the figures, they included in parentheses “these were not cherry picked by the way.”

Cecelia: This is all really, really great information! Is there anything that you wanted to mention outside the scope of these questions?

Robert: Well, in general I think that reproducibility is a quality measurement of a research project. As such, it should definitely be considered as an acceptance criterion in conferences and journals. In that sense, the approach taken by NeurIPS is very good and I hope that it extends to all the other conferences.

Alfredo: It is not surprising that most of these conferences, such as ICLR, ICML, and NeurIPS, are organizing workshops on reproducibility.
