Getting Started with Conda

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • Why should I use a package and environment management system?

  • What is Conda?

  • How should I organize my project directory?

Objectives
  • Explain why you should use Conda as part of your research workflow.

  • Verify your Conda installation.

  • Learn to manage (create, update, remove, etc.) Conda environments.

TL;DR

Here is the recipe for using Conda to manage a project-specific software stack.

$ mkdir project-dir
$ cd project-dir
$ nano environment.yml # describe the environment
$ conda env create --prefix ./env --file environment.yml # create the environment
$ conda activate ./env # activate the environment
(/path/to/env) $ nano environment.yml # forgot to add some deps
(/path/to/env) $ conda env update --prefix ./env --file environment.yml --prune # update the environment
(/path/to/env) $ conda deactivate # done working on project (for now!)

Why should I use a package and environment management system?

Installing software is hard. Installing scientific software (including all required dependencies of said software!) is often even more challenging. To minimize the burden of installing and updating software, (data) scientists often install the software packages that they need for their various projects system-wide.

Installing software system-wide has a number of drawbacks:

Put differently, installing software system-wide creates complex dependencies between your research projects that shouldn’t really exist!

Rather than installing software system-wide, wouldn’t it be great if we could install software separately for each research project?

Trade-offs of project specific software installs

What are some of the potential benefits from installing software separately for each project? What are some of the potential costs?

Solution

Notice that many of the potential benefits from installing software separately for each project require the ability to isolate the projects’ software environments from one another. Also, once you have figured out how to isolate project-specific software environments, you will still need to have some way to manage software packages appropriately.

Package management

A good package management system greatly simplifies the process of installing software by

  1. identifying and installing compatible versions of software and all required dependencies.
  2. handling the process of updating software as more recent versions become available.

If you use some flavor of Linux, then you are probably familiar with the package manager for your Linux distribution (e.g., apt on Ubuntu, yum on CentOS); if you are a Mac OS user then you might be familiar with the Homebrew project, which brings a Linux-like package management system to Mac OS; if you are a Windows OS user, then you may not be terribly familiar with package managers as there isn’t really a standard package manager for Windows (although there is the Chocolatey project).

Operating system package management tools are great, but these tools actually solve a more general problem than the one you often face as a (data) scientist. As a (data) scientist you typically use one or two core scripting languages (e.g., Python, R, SQL). Each scripting language has multiple versions that can potentially be installed, and each scripting language will also have a large number of third-party packages (and additional dependencies) that will need to be installed. The exact version of your core scripting language(s) and additional third-party packages will also probably change from project to project.

Environment management

An environment management system solves a number of problems commonly encountered by (data) scientists.

An environment management system enables you to set up a new, project-specific software environment containing specific Python (R) versions as well as the versions of additional packages and required dependencies that are all mutually compatible.

What is Conda?

From the official Conda documentation: Conda is an open source package and environment management system that runs on Windows, Mac OS and Linux.

Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because Conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.

Conda vs. Miniconda vs. Anaconda

Users are often confused about the differences between Conda, Miniconda, and Anaconda. Miniconda (which we will install below!) combines Conda with Python 3 and a small number of core packages; Anaconda includes Miniconda as well as over 150 of the most widely used Python packages.

Installing Conda

As long as you have already installed the Python 3 version of either the Miniconda or Anaconda distribution, Conda has already been installed on your machine. If you have not already installed one of the recommended Python 3 distributions, then please see the setup instructions for details.

To verify that you have installed Conda correctly, run the conda info command in the Terminal (Mac OS or Linux) or Anaconda Powershell Prompt (Windows).

Mac OS

There are two ways to open Terminal on Mac.

  1. In your Applications folder, open Utilities and double-click on Terminal
  2. Press Command + spacebar to launch Spotlight. Type Terminal and then double-click the search result or hit Enter.

Windows

Windows users will need to search for the “Anaconda Prompt” in the search menu. Clicking on the application will launch the prompt.

Regardless of OS, the output of the command should look similar to the following. Note that the conda version number is provided as part of the output of this command.

$ conda info
     active environment : base
    active env location : /Users/$USERNAME/miniconda3
            shell level : 1
       user config file : /Users/$USERNAME/.condarc
 populated config files : 
          conda version : 4.8.3
    conda-build version : not installed
         python version : 3.7.6.final.0
       virtual packages : __osx=10.15.6
       base environment : /Users/$USERNAME/opt/miniconda3  (writable)
           channel URLs : https://repo.anaconda.com/pkgs/main/osx-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /Users/$USERNAME/opt/miniconda3/pkgs
                          /Users/$USERNAME/.conda/pkgs
       envs directories : /Users/$USERNAME/opt/miniconda3/envs
                          /Users/$USERNAME/.conda/envs
               platform : osx-64
             user-agent : conda/4.8.3 requests/2.23.0 CPython/3.7.6 Darwin/19.6.0 OSX/10.15.6
                UID:GID : 502:20
             netrc file : None
           offline mode : False

An alternative solution is to run the conda command and pass the --version flag which will return the version number if Conda has been properly installed.

$ conda --version
conda 4.8.3
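
If you would like to script this check, the following is a minimal sketch, assuming a POSIX-style shell. The function name check_conda is hypothetical and not part of Conda; it only tests whether a conda command can be found and, if so, prints the version it reports.

```shell
# check_conda is a hypothetical helper, not part of Conda itself.
# It reports whether a `conda` command is available in the current shell.
check_conda() {
    if command -v conda >/dev/null 2>&1; then
        echo "conda found: $(conda --version 2>&1)"
    else
        echo "conda not found on PATH"
        return 1
    fi
}

check_conda || echo "See the setup instructions to install Miniconda."
```

The function returns a non-zero status when Conda is missing, so it can also be used as a guard at the top of a project setup script.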

Make sure you have the most recent version

The version of Conda included with the most recent version of Miniconda might not be the latest version of Conda. Run the following command to make sure that you have the most recent version of Conda (including bug fixes!).

$ conda update --name base --channel defaults --yes conda

You can re-run this command at any time to update to the most recent version of Conda.

Managing Conda Environments

New Project, New Directory

Every new project (no matter how small!) should live in its own directory. While the topic of project organization “best practices” could be an episode in itself, a good reference to get started is Good Enough Practices for Scientific Computing. Let’s create a new project directory for today’s workshop.

$ cd ~/Desktop
$ mkdir python-novice-gapminder
$ cd python-novice-gapminder

New Project, New Environment

Now that we have a new project directory we are ready to create a new environment for our project. We will do this in two steps: first we will create an environment file that describes the software dependencies (including specific version numbers!) for the project; second we will use the newly created environment file to build the software environment. Let’s see how to accomplish each of these steps in turn.

Creating an Environment File

To make sure that your environment is truly shareable, we need to make sure that the contents of your environment are described in such a way that the resulting environment file can be used to re-create your environment on Linux, Mac OS, and Windows. Conda uses YAML (“YAML Ain’t Markup Language”) for writing its environment files. YAML is a human-readable data-serialization language that is commonly used for configuration files and that uses Python-style indentation to indicate nesting.

Default environment.yml file

Note that by convention Conda environment files are called environment.yml. As such, if you use the conda env create sub-command without passing the --file option, then conda will expect to find a file called environment.yml in the current working directory and will report an error if a file with that name cannot be found.

Let’s take a look at a canonical environment.yml file to give you an idea of how to write your own environment files. The following is a minimal environment.yml file. Note that we specify a name for the environment (which can be null) followed by our project dependencies.

name: python-novice-gapminder-env

dependencies:
  - python=3.8
  - matplotlib
  - pandas

Always specify version numbers for your dependencies!

To make your results more reproducible and to make it easier for research colleagues to recreate your Conda environments on their machines, it is a “best practice” to always explicitly specify the version number for each package that you install into an environment. If you are not sure exactly which version of a package you want to use, then you can use conda search to see what versions are available using the following command.

$ conda search $PACKAGE_NAME

So, for example, if you wanted to see what versions of Pandas are available you would run the following.

$ conda search pandas

As always you can run conda search --help to learn about available options.
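
To spot dependencies that still lack a version number, a small shell check can help. The following is a minimal sketch, assuming a POSIX shell with awk and an environment file laid out like the example in this lesson (a flat list under dependencies:). The function name list_unpinned is hypothetical, and the sketch does not handle nested sections such as a pip: sub-list.

```shell
# list_unpinned is a hypothetical helper: it prints every entry under
# "dependencies:" in an environment file that has no version specifier.
list_unpinned() {
    awk '
        /^dependencies:/ { in_deps = 1; next }   # start of the dependency list
        /^[^ #-]/        { in_deps = 0 }          # any other top-level key ends it
        in_deps && /^ *- / {
            sub(/^ *- */, "")                     # strip the leading "- "
            if ($0 !~ /[=<>]/) print              # keep entries without a version spec
        }
    ' "$1"
}

# Demo on a throwaway copy of the example file from this lesson.
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
name: python-novice-gapminder-env

dependencies:
  - python=3.8
  - matplotlib
  - pandas
EOF

list_unpinned "$demo_file"   # prints matplotlib and pandas, one per line
rm -f "$demo_file"
```

Each name printed is a candidate for conda search, after which you can add an explicit version to the environment file.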

Creating an environment.yml file

Using the example above, create a file called environment.yml in the python-novice-gapminder directory.

Solution

The following commands can be used to create the environment.yml file.

$ nano environment.yml # copy and paste the above text in the buffer; save and close
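
If you prefer not to use an editor, the same file can be written directly from the shell. This is a minimal sketch, assuming a POSIX shell and that the current working directory is the project directory; note that it overwrites any existing environment.yml in that directory.

```shell
# Write the example environment file without opening an editor.
# The quoted 'EOF' prevents the shell from expanding anything inside the block.
cat > environment.yml <<'EOF'
name: python-novice-gapminder-env

dependencies:
  - python=3.8
  - matplotlib
  - pandas
EOF

# Print the file to confirm its contents.
cat environment.yml
```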

Creating a New Environment From an environment.yml File

Let’s suppose that you want to use the environment.yml file defined above to create a Conda environment in a sub-directory of your project directory. Creating your Conda environment as a sub-directory inside your project directory is another “best practice”.

$ conda env create --prefix ./env --file environment.yml

Always version control your environment.yml files!

While you should never version control the contents of your env/ environment sub-directory, you should always version control your environment.yml files. Version controlling your environment.yml files together with your project’s source code means that you always know which versions of which packages were used to generate your results at any particular point in time.
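
If your project uses Git, one way to keep env/ out of version control is an entry in .gitignore. The following is a minimal sketch, assuming a POSIX shell and that the current working directory is the project directory; the helper name ignore_env_dir is hypothetical.

```shell
# ignore_env_dir is a hypothetical helper: it adds "env/" to .gitignore
# in the current directory, but only if that exact line is not already there.
ignore_env_dir() {
    touch .gitignore
    grep -qxF 'env/' .gitignore || echo 'env/' >> .gitignore
}

ignore_env_dir
# After this, you would typically run: git add environment.yml .gitignore
```

Because the helper checks for the line first, running it more than once does not add duplicate entries.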

Activating the Environment

Activating environments is essential to making the software in environments work well (or sometimes at all!). Activation of an environment does two things.

  1. Adds entries to PATH for the environment.
  2. Runs any activation scripts that the environment may contain.

Step 2 is particularly important as activation scripts are how packages can set arbitrary environment variables that may be necessary for their operation. On a Unix system you activate the environment using the following command.

$ conda activate ./env

You can see that an environment has been activated because the shell prompt will now include the absolute path to the active environment.

(absolute/path/to/project-dir/env) $
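
To build some intuition for step 1, the following sketch imitates the effect of adding a directory to the front of PATH on a Unix system. It is a simplified illustration only and not Conda’s actual activation code; the names demo_env and gapminder-demo-tool are made up for this example.

```shell
# Simplified illustration of step 1 only; this is NOT Conda's real
# activation code, and demo_env and gapminder-demo-tool are made-up names.
demo_env=$(mktemp -d)
mkdir -p "$demo_env/bin"
printf '#!/bin/sh\necho "hello from the demo environment"\n' > "$demo_env/bin/gapminder-demo-tool"
chmod +x "$demo_env/bin/gapminder-demo-tool"

old_path=$PATH
PATH="$demo_env/bin:$PATH"        # "activate": put the environment's bin directory first
command -v gapminder-demo-tool    # now resolves to a file inside $demo_env/bin
gapminder-demo-tool

PATH=$old_path                    # "deactivate": restore the previous PATH
command -v gapminder-demo-tool || echo "gapminder-demo-tool is no longer on PATH"
```

Because the shell searches PATH from left to right, programs in the directory placed first take precedence over system-wide programs with the same name.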

Setting up Conda properly for your shell

Conda 4.4 introduced new scripts that make activation behavior uniform across operating systems. Where you previously used source activate envname on Unix, and just activate envname on Windows, Conda 4.4 allowed conda activate envname. Setting up your shell to use this new feature was tricky. Conda 4.6 added extensive initialization support so that more shells can use the new conda activate command. For more information, read the output from conda init --help.

$ conda init bash

After running conda init you will need to close and restart your shell for changes to take effect. Alternatively, you can reload your .bashrc profile (which was changed by running the conda init command). To reload your .bashrc profile, use the following command.

$ . ~/.bashrc

If you want to reverse or “undo” the changes to your .bashrc, then you can re-run the conda init command and pass the --reverse option.

(base) $ conda init --reverse

Again, in order for the reversal to take effect you will likely need to close and restart your shell session.

Updating the Environment

You are unlikely to know ahead of time which packages (and version numbers!) you will need to use for your research project. For example it may be the case that…

If any of these occurs during the course of your research project, all you need to do is update the contents of your environment.yml file accordingly and then run the following command.

$ conda env update --prefix ./env --file environment.yml --prune

Note that the --prune option causes Conda to remove any dependencies that are no longer required from the environment.

Adding JupyterLab to the environment

Add JupyterLab to your environment.yml file and update the environment.

Solution

The environment.yml file should now look as follows.

name: python-novice-gapminder-env

dependencies:
  - python=3.8
  - jupyterlab
  - matplotlib
  - pandas

The following command will update the environment accordingly.

$ conda env update --prefix ./env --file environment.yml --prune 

Getting the Gapminder Data

The data we will be using is taken from the gapminder dataset. To obtain it, download python-novice-gapminder-data.zip and then unzip the contents of the file within the python-novice-gapminder directory.

On Mac OS or Linux the following commands should move the .zip archive from the downloads folder to the python-novice-gapminder directory we previously created on the desktop. Unzipping the zip archive will create a data directory containing the *.csv files that we will use in the remaining lessons.

$ mv ~/Downloads/python-novice-gapminder-data.zip ~/Desktop/python-novice-gapminder
$ cd ~/Desktop/python-novice-gapminder
$ unzip python-novice-gapminder-data.zip

Deactivating the Environment

When you are done working on your project it is a good idea to deactivate the current environment. To deactivate the currently active environment use the conda deactivate command as follows.

(absolute/path/to/project-dir/env) $ conda deactivate

You can see that an environment has been deactivated because the shell prompt will no longer include the path to the active environment.

$
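
If the prompt alone is not enough to tell, you can inspect CONDA_PREFIX, an environment variable that Conda sets while an environment is active. The following is a minimal sketch, assuming a POSIX shell; the function name show_active_env is hypothetical. If the base environment was active before you activated ./env, the variable may still point to base after deactivating.

```shell
# show_active_env is a hypothetical helper. It relies on CONDA_PREFIX,
# an environment variable that Conda sets while an environment is active.
show_active_env() {
    if [ -n "${CONDA_PREFIX:-}" ]; then
        echo "active environment: $CONDA_PREFIX"
    else
        echo "no active Conda environment"
    fi
}

show_active_env
```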

Returning to the base environment

To simply return to the base Conda environment, it’s better to call conda activate with no environment specified, rather than to use deactivate. If you run conda deactivate from your base environment, you may lose the ability to run conda commands at all. Don’t worry if you encounter this undesirable state! Just start a new shell.

$ conda activate

Key Points

  • Using a package and environment management tool facilitates reproducible research workflows.

  • Conda is a platform-agnostic, open source package and environment management system.

  • Every project should have its own directory and its own environment.

  • A project’s environment should be installed as a sub-directory within the project directory.