Continuing an Experiment
When an experiment is interrupted or crashes, you may want to pick up where you left off and continue as if nothing had happened. Comet.ML allows this, provided that you supply the code to get your ML system back to the same state.
A key part of this workflow is knowing a run's ID and mapping it to the experiment key. In the following example, we connect the two through the environment variable COMET_EXPERIMENT_KEY. A process runner could set this variable to any alphanumeric string between 32 and 50 characters long. For example, it could be a run ID.
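If a run ID does not already fit the 32 to 50 character range, one option is to hash it into a fixed-length string. The following is a minimal sketch, not part of Comet.ML itself: the helper name `run_id_to_experiment_key` and the run ID `nightly-train-0042` are hypothetical, and SHA-1 is used only because its hex digest happens to be 40 characters long.

```python
import hashlib
import os


def run_id_to_experiment_key(run_id: str) -> str:
    """Map an arbitrary run ID to a stable 40-character hex string.

    Hypothetical helper: the same run ID always yields the same key,
    so a restarted run can find its earlier experiment again.
    """
    # A SHA-1 hex digest is 40 characters of [0-9a-f], which falls
    # inside the 32 to 50 character range described above.
    return hashlib.sha1(run_id.encode("utf-8")).hexdigest()


# Hypothetical run ID, as a process runner might provide it.
key = run_id_to_experiment_key("nightly-train-0042")

# Make the key visible to code that reads COMET_EXPERIMENT_KEY later
# in this process (or in child processes started from it).
os.environ["COMET_EXPERIMENT_KEY"] = key
```

Because the mapping is deterministic, the runner does not need to store the key anywhere: it can recompute it from the run ID each time the job is launched.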
We set the environment variable, and the following script does the rest. If you don't set COMET_EXPERIMENT_KEY in this example, the experiment gets a random key, and you won't be able to map it back to the run.
    import comet_ml
    import os

    # Check to see if there is a key in environment:
    EXPERIMENT_KEY = os.environ.get("COMET_EXPERIMENT_KEY", None)

    # First, let's see if we continue or start fresh:
    CONTINUE_RUN = False
    if EXPERIMENT_KEY is not None:
        # There is one, but the experiment might not exist yet:
        api = comet_ml.API()  # Assumes API key is set in config/env
        try:
            api_experiment = api.get_experiment_by_id(EXPERIMENT_KEY)
        except Exception:
            api_experiment = None
        if api_experiment is not None:
            CONTINUE_RUN = True
            # We can get the last details logged here, if logged:
            step = int(api_experiment.get_parameters_summary("steps")["valueCurrent"])
            epoch = int(api_experiment.get_parameters_summary("epochs")["valueCurrent"])

    if CONTINUE_RUN:
        # 1. Recreate the state of ML system before creating experiment
        # otherwise it could try to log params, graph, etc. again
        # ...
        # 2. Setup the existing experiment to carry on:
        experiment = comet_ml.ExistingExperiment(
            previous_experiment=EXPERIMENT_KEY,
            log_env_details=True,  # to continue env logging
            log_env_gpu=True,      # to continue GPU logging
            log_env_cpu=True,      # to continue CPU logging
        )
        # Retrieved from above APIExperiment
        experiment.set_step(step)
        experiment.set_epoch(epoch)
    else:
        # 1. Create the experiment first
        # This will use the COMET_EXPERIMENT_KEY if defined in env.
        # Otherwise, you could manually set it here. If you don't
        # set COMET_EXPERIMENT_KEY, the experiment will get a
        # random key!
        experiment = comet_ml.Experiment()
        # 2. Setup the state of the ML system
        # ...

    # Train or continue training
    # ...
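How you recreate the state of your ML system depends on your framework, and the script leaves that step open. As one illustration only, here is a sketch that assumes the state you need can be stored as a small JSON file. The helper names `save_state` and `load_state`, the file name `checkpoint.json`, and the `step`/`epoch` keys are all hypothetical. Real model weights would normally be saved with your framework's own checkpointing tools.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write a JSON-serializable state dict to path.

    The data is written to a temporary file in the same directory and
    then moved into place, so a crash mid-write does not leave a
    half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_state(path, default):
    """Return the saved state, or a copy of default if none exists."""
    if not os.path.exists(path):
        return dict(default)
    with open(path) as f:
        return json.load(f)


# Round trip in a throwaway directory, to show the intended usage.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "checkpoint.json")
    # First launch: no checkpoint yet, so we start from the default.
    fresh = load_state(checkpoint, {"step": 0, "epoch": 0})
    # During training: record progress periodically.
    save_state(checkpoint, {"step": 1200, "epoch": 3})
    # After a restart: the saved progress is picked up again.
    resumed = load_state(checkpoint, {"step": 0, "epoch": 0})
```

In the script above, a restore like this would run in the `CONTINUE_RUN` branch before `ExistingExperiment` is created, and a save would run periodically inside the training loop.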
That's it! If you have questions, let us know on our Slack channel.