Continuing an Experiment

If an experiment is interrupted or crashes, you may want to pick up where you left off and continue as if nothing had happened. Comet.ML supports this, provided that you supply the code that restores your ML system to the same state.

A key part of such a workflow is knowing a run's ID and mapping it to the experiment key. In the following example, we link the process to its experiment through the environment variable COMET_EXPERIMENT_KEY. A process runner could set this variable to any string between 32 and 50 characters. For example, it could be a run ID.
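If your run IDs are shorter or longer than the allowed length, one option is to derive the key from the run ID deterministically. The following is a minimal sketch, not part of Comet.ML: the helper name `experiment_key_from_run_id` and the sample run ID are hypothetical, and it assumes that a hexadecimal string is acceptable as a key (Comet's documentation describes the key as alphanumeric). It hashes the run ID with SHA-1, which always yields 40 hexadecimal characters, and exports the result so that the script below can find it.

```python
import hashlib
import os


def experiment_key_from_run_id(run_id: str) -> str:
    """Map an arbitrary run ID to a stable 40-character key (hypothetical helper)."""
    # A SHA-1 hex digest is always 40 characters from [0-9a-f],
    # which falls inside the 32 to 50 character range.
    return hashlib.sha1(run_id.encode("utf-8")).hexdigest()


run_id = "nightly-train-0042"  # hypothetical run ID from a process runner
key = experiment_key_from_run_id(run_id)

# Child processes started after this point inherit the variable.
os.environ["COMET_EXPERIMENT_KEY"] = key
```

Because the mapping is deterministic, restarting the same run ID produces the same key, which is what lets a restarted process find the earlier experiment.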

We set the environment variable, and the following script does the rest.

Info

If you don't set COMET_EXPERIMENT_KEY in this example, the experiment will get a random key, and you won't be able to map it back to the run.

import comet_ml
import os

# Check to see if there is a key in environment:
EXPERIMENT_KEY = os.environ.get("COMET_EXPERIMENT_KEY", None)

# First, let's see if we continue or start fresh:
CONTINUE_RUN = False
if EXPERIMENT_KEY is not None:
    # There is one, but the experiment might not exist yet:
    api = comet_ml.API()  # Assumes API key is set in config/env
    try:
        api_experiment = api.get_experiment_by_id(EXPERIMENT_KEY)
    except Exception:
        api_experiment = None
    if api_experiment is not None:
        CONTINUE_RUN = True
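        # We can get the last details logged here, if logged:
        try:
            step = int(api_experiment.get_parameters_summary("steps")["valueCurrent"])
            epoch = int(api_experiment.get_parameters_summary("epochs")["valueCurrent"])
        except (KeyError, TypeError, ValueError):
            # "steps"/"epochs" were never logged as parameters; start from zero
            step, epoch = 0, 0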

if CONTINUE_RUN:
    # 1. Recreate the state of the ML system before creating the experiment;
    #    otherwise it could try to log params, graph, etc. again
    # ...
    # 2. Set up the existing experiment to carry on:
    experiment = comet_ml.ExistingExperiment(
        previous_experiment=EXPERIMENT_KEY,
        log_env_details=True, # to continue env logging
        log_env_gpu=True,     # to continue GPU logging
        log_env_cpu=True,     # to continue CPU logging
    )
    # Retrieved from above APIExperiment
    experiment.set_step(step)
    experiment.set_epoch(epoch)

else:
    # 1. Create the experiment first
    #    This will use the COMET_EXPERIMENT_KEY if defined in env.
    #    Otherwise, you could manually set it here. If you don't
    #    set COMET_EXPERIMENT_KEY, the experiment will get a
    #    random key!
    experiment = comet_ml.Experiment()
    # 2. Set up the state of the ML system
    # ...

# Train or continue training
# ...
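The script above can only resume from the right place if the training code has been recording its progress under the parameter names "steps" and "epochs". The following is a minimal sketch of such a loop, under stated assumptions: the function `train`, its arguments, and the placeholder loss are hypothetical, and `DryRunExperiment` is a stand-in that only mimics the `log_metric` and `log_parameter` methods so the sketch runs without a Comet account. In real use you would pass the `experiment` object created above.

```python
class DryRunExperiment:
    """Hypothetical stand-in that records calls instead of sending them to Comet."""

    def __init__(self):
        self.params = {}
        self.metrics = []

    def log_parameter(self, name, value):
        # Keep only the latest value per name.
        self.params[name] = value

    def log_metric(self, name, value, step=None, epoch=None):
        self.metrics.append((name, value, step, epoch))


def train(experiment, start_step=0, start_epoch=0, total_epochs=3, steps_per_epoch=4):
    """Hypothetical training loop that logs progress so a later run can resume."""
    step = start_step
    for epoch in range(start_epoch, total_epochs):
        for _ in range(steps_per_epoch):
            step += 1
            loss = 1.0 / step  # placeholder for a real training step
            experiment.log_metric("loss", loss, step=step, epoch=epoch)
        # Record progress after each finished epoch under the names
        # that the resume script reads back.
        experiment.log_parameter("steps", step)
        experiment.log_parameter("epochs", epoch + 1)
    return step


# Fresh run: three epochs of four steps each.
experiment = DryRunExperiment()
last_step = train(experiment)

# Resumed run: continue from the recorded progress up to five epochs.
resumed_step = train(
    experiment,
    start_step=experiment.params["steps"],
    start_epoch=experiment.params["epochs"],
    total_epochs=5,
)
```

Logging progress once per epoch means a crash in the middle of an epoch resumes from the last completed one, so the checkpoint you restore should be saved at the same point.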

That's it! If you have questions, let us know on our Slack channel.