This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

Introduction to Conda for (Data) Scientists

Getting Started with Conda

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • What is Conda?

  • Why should I use a package and environment management system as part of my research workflow?

  • Why use Conda?

Objectives
  • Understand why you should use a package and environment management system as part of your (data) science workflow.

  • Explain the benefits of using Conda as part of your (data) science workflow.

Packages and Environments

Packages

When working with a programming language such as Python, which can do almost anything, one has to wonder how this is possible. You download Python, which is only about 25 MB; how can everything be included in such a small download? The answer is: it is not. Python, like many other programming languages, relies on external libraries or packages to be able to do almost anything. You can see this as soon as you start programming. After learning the very basics, you often learn how to import something into your script or session.

Modules, packages, libraries

  • Module: a collection of functions and variables, as in a script
  • Package: a collection of modules with an __init__.py file (can be empty), as in a directory with scripts
  • Library: a collection of packages with related functionality

The terms library and package are often used interchangeably.
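To make the difference between a module and a package concrete, here is a minimal sketch that builds a throwaway package on disk and imports a module from it. The names toolbox, shapes and circle_area are made up for this illustration and do not refer to any real library.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a tiny, throwaway package in a temporary directory.
workdir = Path(tempfile.mkdtemp())
package_dir = workdir / "toolbox"              # the package is a directory ...
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")   # ... containing an (empty) __init__.py
(package_dir / "shapes.py").write_text(        # a module is a single script inside it
    "PI = 3.14159\n"
    "def circle_area(radius):\n"
    "    return PI * radius ** 2\n"
)

# Make the parent directory importable, then import the module from the package.
sys.path.insert(0, str(workdir))
shapes = importlib.import_module("toolbox.shapes")

area = shapes.circle_area(2.0)
print(area)
```

In everyday code you would simply write import toolbox.shapes; importlib is used here only so that the whole example fits in one runnable script.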

Dependencies

A bit further into your programming career you may notice (or have already noticed) that many packages do not do everything on their own. Instead, they depend on other packages for their functionality. For example, the SciPy package is used for numerical routines. To avoid reinventing the wheel, the package makes use of other packages, such as NumPy (numerical Python), Matplotlib (plotting) and many more. So we say that NumPy and Matplotlib are dependencies of SciPy.

Many packages are under continuous development, which produces different versions of each package. During development, a function call may change, and functionality may be added or removed. When one package depends on another, this can create problems. It is therefore important to know not only that, e.g., SciPy depends on NumPy and Matplotlib, but also that it depends on NumPy version >= 1.6 and Matplotlib version >= 1.1. NumPy version 1.5 would not be sufficient in this case.
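A version requirement such as ">= 1.6" can be checked by comparing version numbers component by component. The following is a minimal sketch, assuming plain dotted numeric versions; the helper names parse_version and meets_minimum are made up here, and real package managers use more complete comparison rules.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.6' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("1.5", "1.6"))   # False: 1.5 is too old
print(meets_minimum("1.10", "1.6"))  # True: 10 > 6 when compared as numbers
```

Comparing the versions as numbers rather than as text matters: as plain strings, "1.10" would sort before "1.6".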

Environments

When you start programming you may not use many packages yet, and installation may be straightforward. But for most people there comes a time when a single version of a package, or of the programming language itself, is no longer enough. You may find an older tool that depends on an older version of your programming language (e.g. Python 2.7), while many of your other tools depend on a newer version (e.g. Python 3.6). You could start up another computer or virtual machine to run the other version of the programming language, but this is not very convenient, since you may want to use the tools together in a workflow later on. Here, environments are one solution to the problem. Nowadays there are several environment management systems that follow a similar idea: instead of having to use multiple computers or virtual machines to run different versions of the same package, you can install packages in isolated environments.

Environment management

An environment management system solves a number of problems commonly encountered by (data) scientists.

An environment management system enables you to set up a new, project specific software environment containing specific Python versions as well as the versions of additional packages and required dependencies that are all mutually compatible.

Environment management systems for Python

Conda is not the only option; Python, for example, has several other ways of working with environments.

Package management

A good package management system greatly simplifies the process of installing software by…

  1. identifying and installing compatible versions of software and all required dependencies.
  2. handling the process of updating software as more recent versions become available.

If you use some flavor of Linux, then you are probably familiar with the package manager for your Linux distribution (e.g., apt on Ubuntu, yum on CentOS); if you are a macOS user then you might be familiar with the Homebrew project, which brings a Linux-like package management system to macOS; if you are a Windows user, then you may not be terribly familiar with package managers, as there isn’t really a standard package manager for Windows (although there is the Chocolatey project).

Operating system package management tools are great, but these tools solve a more general problem than the one you often face as a (data) scientist. As a (data) scientist you typically use one or two core scripting languages (e.g., Python, R, SQL). Each scripting language has multiple versions that can potentially be installed, and each scripting language will also have a large number of third-party packages that will need to be installed. The exact version of your core scripting language(s) and additional, third-party packages will also probably change from project to project.

Package management systems for Python

Here too, Conda is not the only option; Python, for example, has several other ways of working with packages.

Why should I use a package and environment management system?

Installing software is hard. Installing scientific software is often even more challenging. In order to minimize the burden of installing and updating software (data) scientists often install software packages that they need for their various projects system-wide.

Installing software system-wide has a number of drawbacks:

Put differently, installing software system-wide creates complex dependencies between your research projects that shouldn’t really exist!

Rather than installing software system-wide, wouldn’t it be great if we could install software separately for each research project?

Discussion

What are some of the potential benefits from installing software separately for each project? What are some of the potential costs?

Solution

You may notice that many of the potential benefits from installing software separately for each project require the ability to isolate the projects’ software environments from one another (i.e., solve the environment management problem). Once you have figured out how to isolate project-specific software environments, you will still need to have some way to manage software packages appropriately (i.e., solve the package management problem).

What I hope you will have taken away from the discussion exercise is an appreciation for the fact that in order to install project-specific software environments you need to solve two complementary challenges: environment management and package management.

Conda

From the official Conda documentation: Conda is an open source package and environment management system that runs on Windows, macOS and Linux.

Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because Conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.

Conda vs. Miniconda vs. Anaconda


Users are often confused about the differences between Conda, Miniconda, and Anaconda. Conda is a tool for managing environments and installing packages. Miniconda combines Conda with Python and a small number of core packages; Anaconda includes Miniconda as well as a large number of the most widely used Python packages.

Why use Conda?

Whilst there are many different package and environment management systems that solve either the package management problem or the environment management problem, Conda solves both of these problems and is explicitly targeted at (data) science use cases.

Additionally, Anaconda provides commonly used data science libraries and tools, such as R, NumPy, SciPy and TensorFlow built using optimised, hardware specific libraries (such as Intel’s MKL or NVIDIA’s CUDA), which provides a speedup without having to change any of your code.

Key Points

  • Conda is a platform agnostic, open source package and environment management system.

  • Using a package and environment management tool facilitates portability and reproducibility of (data) science workflows.

  • Conda solves both the package and environment management problems and targets multiple programming languages. Other open source tools solve either one or the other, or target only a particular programming language.

  • Anaconda is not only for Python


Working with Environments

Overview

Teaching: 60 min
Exercises: 15 min
Questions
  • What is a Conda environment?

  • How do I create (delete) an environment?

  • How do I activate (deactivate) an environment?

  • How do I install packages into existing environments using Conda (+pip)?

  • Where should I create my environments?

  • How do I find out what packages have been installed in an environment?

  • How do I find out what environments exist on my machine?

  • How do I delete an environment that I no longer need?

Objectives
  • Understand how Conda environments can improve your research workflow.

  • Create a new environment.

  • Activate (deactivate) a particular environment.

  • Install packages into existing environments using Conda (+pip).

  • Specify the installation location of an environment.

  • List all of the existing environments on your machine.

  • List all of the installed packages within a particular environment.

  • Delete an entire environment.

Workspace for Conda environments

If you haven’t done it yet, create a new introduction-to-conda-for-data-scientists directory on your Desktop in order to maintain a consistent workspace for all your conda environments.

On macOS and Linux, running the following commands in the Terminal will create the required directory on the Desktop.

$ cd ~/Desktop
$ mkdir introduction-to-conda-for-data-scientists
$ cd introduction-to-conda-for-data-scientists

Windows users may need to reverse the direction of the slash and run the commands from the command prompt. The command prompt does not expand ~, so use %USERPROFILE% to refer to your home directory.

> cd %USERPROFILE%\Desktop
> mkdir introduction-to-conda-for-data-scientists
> cd introduction-to-conda-for-data-scientists

Alternatively, you can always “right-click” and “create new folder” on your Desktop. All the commands that are run during the workshop should be run in a terminal within the introduction-to-conda-for-data-scientists directory.

What is a Conda environment

A Conda environment is a directory that contains a specific collection of Conda packages that you have installed. For example, you may be working on a research project that requires NumPy 1.18 and its dependencies, while another environment associated with a finished project has NumPy 1.12 (perhaps because version 1.12 was the most current version of NumPy at the time the project finished). If you change one environment, your other environments are not affected. You can easily activate or deactivate environments, which is how you switch between them.
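Because an environment is just a directory, you can ask a running Python session which directory it belongs to. The sketch below reads sys.prefix together with the CONDA_PREFIX and CONDA_DEFAULT_ENV environment variables, which are set when a Conda environment is activated; the function name describe_environment is made up for this example.

```python
import os
import sys


def describe_environment():
    """Report where the running Python lives and which Conda environment is active."""
    return {
        # Root directory of the Python installation that is executing this code.
        "python_prefix": sys.prefix,
        # These two are None when no Conda environment has been activated.
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
        "conda_env_name": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

Running this script from two different environments should print two different directories, which is a quick way to confirm which environment a script is actually using.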

Avoid installing packages into your base Conda environment

Conda has a default environment called base that includes a Python installation and some core system libraries and dependencies of Conda. It is a “best practice” to avoid installing additional packages into your base software environment. Additional packages needed for a new project should always be installed into a newly created Conda environment.

Creating environments

To create a new environment for Python development using conda you can use the conda create command.

$ conda create --name python3-env python

For a list of all commands, take a look at Conda general commands.

It is a good idea to give your environment a meaningful name in order to help yourself remember the purpose of the environment. While naming things can be difficult, $PROJECT-NAME-env is a good convention to follow. Sometimes the specific version of the package that made a new environment necessary also makes a good name.

The command above will create a new Conda environment called “python3-env” and install the most recent version of Python. If you wish, you can specify a particular version of packages for conda to install when creating the environment.

$ conda create --name python36-env python=3.6

Always specify a version number for each package you wish to install

In order to make your results more reproducible and to make it easier for research colleagues to recreate your Conda environments on their machines, it is a “best practice” to always explicitly specify the version number for each package that you install into an environment. If you are not sure exactly which version of a package you want to use, then you can see what versions are available using the conda search command.

$ conda search $PACKAGE_NAME

So, for example, if you wanted to see which versions of Scikit-learn, a popular Python library for machine learning, were available, you would run the following.

$ conda search scikit-learn

As always you can run conda search --help to learn about available options.

You can create a Conda environment and install multiple packages by listing the packages that you wish to install.

$ conda create --name basic-scipy-env ipython=7.13 matplotlib=3.1 numpy=1.18 scipy=1.4

When conda installs a package into an environment it also installs any required dependencies. For example, even though Python is not listed as a package to install into the basic-scipy-env environment above, conda will still install Python into the environment because it is a required dependency of at least one of the listed packages.

Creating a new environment

Create a new environment called “machine-learning-env” with Python and the most current versions of IPython, Matplotlib, Pandas, Numba and Scikit-Learn.

Solution

In order to create a new environment you use the conda create command as follows.

$ conda create --name machine-learning-env \
 ipython \
 matplotlib \
 pandas \
 python \
 scikit-learn \
 numba

Since no version numbers are provided for any of the Python packages, Conda will download the most current, mutually compatible versions of the requested packages. However, since it is best practice to always provide explicit version numbers, you should prefer the following solution.

$ conda create --name machine-learning-env \
 ipython=7.19 \
 matplotlib=3.3 \
 pandas=1.2 \
 python=3.8 \
 scikit-learn=0.23 \
 numba=0.51

However, please be aware that the version numbers for each package may not be the latest available and would need to be adjusted.

Renaming a conda environment

As of conda version 4.14.0 you can rename a conda environment with the conda rename command. conda rename supports renaming your current environment, or any of your existing environments.

conda rename -n old_env_name new_env_name

Activating an existing environment

Activating environments is essential to making the software in environments work well (or sometimes at all!). Activation of an environment does two things.

  1. Adds entries to PATH for the environment.
  2. Runs any activation scripts that the environment may contain.

Step 2 is particularly important as activation scripts are how packages can set arbitrary environment variables that may be necessary for their operation. You activate the basic-scipy-env environment by name using the activate command.

$ conda activate basic-scipy-env

You can see that an environment has been activated because the shell prompt will now include the name of the active environment.

(basic-scipy-env) $
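The first effect of activation, adding entries to PATH, can be illustrated with a short sketch. It is not Conda's actual implementation; the function prepend_to_path and the directory names are hypothetical and chosen only for illustration. The point is that the shell searches PATH from left to right, so placing the environment's bin directory first makes its executables take precedence.

```python
import os


def prepend_to_path(path_value, directory):
    """Return a PATH-style string with `directory` placed first."""
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    # Drop any earlier copy so the directory appears exactly once, at the front.
    entries = [entry for entry in entries if entry != directory]
    return os.pathsep.join([directory] + entries)


# Hypothetical locations, used only for this illustration.
system_path = os.pathsep.join(["/usr/local/bin", "/usr/bin"])
env_bin = "/home/user/miniconda3/envs/basic-scipy-env/bin"

activated_path = prepend_to_path(system_path, env_bin)
print(activated_path.split(os.pathsep)[0])
```

After this change, a lookup for python would find the copy inside basic-scipy-env before any system-wide copy.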

Deactivate the current environment

To deactivate the currently active environment use the conda deactivate command as follows.

(basic-scipy-env) $ conda deactivate

You can see that an environment has been deactivated because the shell prompt will no longer include the name of the previously active environment.

Returning to the base environment

To return to the base Conda environment, it’s better to call conda activate with no environment specified, rather than to use deactivate. If you run conda deactivate from your base environment, you may lose the ability to run conda commands at all. Don’t worry if you encounter this undesirable state! Just start a new shell.

Activate an existing environment by name

Activate the machine-learning-env environment created in the previous challenge by name.

Solution

In order to activate an existing environment by name you use the conda activate command as follows.

$ conda activate machine-learning-env

Deactivate the active environment

Deactivate the machine-learning-env environment that you activated in the previous challenge.

Solution

In order to deactivate the active environment you use the conda deactivate command.

(active-environment-name) $ conda deactivate

Installing a package into an existing environment

You can install a package into an existing environment using the conda install command. This command accepts a list of package specifications (e.g., numpy=1.18) and installs a set of packages consistent with those specifications and compatible with the underlying environment. If full compatibility cannot be assured, an error is reported and the environment is not changed.
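A specification such as numpy=1.18 names a package and, optionally, the leading part of the version you want, so that 1.18.0 and 1.18.5 would both satisfy it. The sketch below shows that idea in a simplified form. The helpers parse_spec and satisfies are hypothetical, and Conda's real specification syntax supports many more forms than this.

```python
def parse_spec(spec):
    """Split 'name=version' into (name, version); version is None if absent."""
    name, _, version = spec.partition("=")
    return name, (version or None)


def satisfies(spec, installed_version):
    """Return True if installed_version starts with the components in the spec."""
    _, wanted = parse_spec(spec)
    if wanted is None:
        return True  # no version requested, so any version is acceptable
    wanted_parts = wanted.split(".")
    installed_parts = installed_version.split(".")
    return installed_parts[: len(wanted_parts)] == wanted_parts


print(parse_spec("numpy=1.18"))
print(satisfies("numpy=1.18", "1.18.5"))
print(satisfies("numpy=1.18", "1.19.0"))
```

Comparing whole components rather than raw text keeps a request for 1.18 from accidentally matching a version such as 1.180.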

By default the conda install command will install packages into the current, active environment. The following would activate the basic-scipy-env we created above and install Numba, an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code, into the active environment.

$ conda activate basic-scipy-env
$ conda install numba

As was the case when listing packages to install when using the conda create command, if version numbers are not explicitly provided, Conda will attempt to install the newest versions of any requested packages. To accomplish this, Conda may need to update some packages that are already installed or install additional packages. It is always a good idea to explicitly provide version numbers when installing packages with the conda install command. For example, the following would install a particular version of Scikit-Learn, into the current, active environment.

$ conda install scikit-learn=0.22

Freezing installed packages

To prevent existing packages from being updated when using the conda install command, you can use the --freeze-installed option. This may force Conda to install older versions of the requested packages in order to maintain compatibility with previously installed packages. Using the --freeze-installed option does not prevent additional dependency packages from being installed.

Remove a package from an environment

To remove a package from an environment, run the following command.

$ conda uninstall PKGNAME --name ENVNAME

For example to remove the scikit-learn package from the basic-scipy-env environment run

$ conda uninstall scikit-learn --name basic-scipy-env

Installing a package into a specific environment

Dask provides advanced parallelism for data science workflows, enabling performance at scale for the core Python data science tools such as NumPy, Pandas, and Scikit-Learn. Have a read through the official documentation for the conda install command and see if you can figure out how to install Dask into the machine-learning-env that you created in the previous challenge.

Solution

You can install Dask into machine-learning-env using the conda install command as follows.

$ conda install --name machine-learning-env dask=2020.12

You could also install Dask into machine-learning-env by first activating that environment and then using the conda install command.

$ conda activate machine-learning-env
$ conda install dask=2020.12

Where do Conda environments live?

Environments created with conda live, by default, in the envs/ folder of your miniconda3 (or anaconda3) directory, the absolute path to which will look something like the following: /Users/$USERNAME/miniconda3/envs or C:\Users\$USERNAME\Anaconda3\envs.

You can see the location of your conda environments by running the command.

conda config --show envs_dirs

Running ls (Linux) / dir (Windows) on your anaconda envs/ directory will list the directories containing the existing Conda environments.
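The same listing can be produced from Python, since each environment is simply a sub-directory of the envs/ folder. This is a minimal sketch: the function list_environments is made up for this example, and a temporary folder stands in for the real envs/ directory so that the code runs without Conda installed.

```python
import tempfile
from pathlib import Path


def list_environments(envs_dir):
    """Return the sorted names of the sub-directories of an envs/ folder."""
    return sorted(entry.name for entry in Path(envs_dir).iterdir() if entry.is_dir())


# Stand-in for ~/miniconda3/envs, created only for this illustration.
fake_envs_dir = Path(tempfile.mkdtemp()) / "envs"
for name in ["basic-scipy-env", "machine-learning-env"]:
    (fake_envs_dir / name).mkdir(parents=True)

environment_names = list_environments(fake_envs_dir)
print(environment_names)
```

To try it on a real installation, pass the path reported by conda config --show envs_dirs instead of the temporary folder.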

Location of Conda environments on Binder

If you are working through these lessons using a Binder instance, then the default location of the Conda environments is slightly different.

/srv/conda/envs

Running ls /srv/conda/envs/ from a terminal will list out the directories containing any previously installed Conda environments.

How do I specify a location for a Conda environment?

You can control where a Conda environment lives by providing a path to a target directory when creating the environment. For example the following command will create a new environment in a sub-directory of the current working directory called env.

$ conda create --prefix ./env ipython=7.13 matplotlib=3.1 pandas=1.0 python=3.6

dot-slash, ./

In Unix, the dot-slash, ./, denotes a relative path to a file or directory in the current directory.

You activate an environment created with a prefix using the same command used to activate environments created by name.

$ conda activate ./env

It is often a good idea to specify a path to a sub-directory of your project directory when creating an environment. Why?

  1. Makes it easy to tell if your project utilizes an isolated environment by including the environment as a sub-directory.
  2. Makes your project more self-contained as everything including the required software is contained in a single project directory.

An additional benefit of creating your project’s environment inside a sub-directory is that you can then use the same name for all your environments; if you keep all of your environments in your ~/miniconda3/envs/ folder, you’ll have to give each of them a different name.

Conda environment sub-directory naming convention

In order to be consistent with the convention used by tools such as venv and Pipenv, I recommend using env as the name of the sub-directory of your project directory that contains your Conda environment. A benefit of maintaining the convention is that your environment sub-directory will be automatically ignored by the default Python .gitignore file used on GitHub.

Whatever naming convention you adopt it is important to be consistent! Using the same name for all of your Conda environments allows you to use the same activate command as well.

$ cd my-project/
$ conda activate ./env

Creating a new environment as a sub-directory within a project directory

First create a project directory called project-dir using the following command.

$ mkdir project-dir
$ cd project-dir

Next, create a new environment inside the newly created project-dir in a sub-directory called env and install Python 3.6, version 3.1 of Matplotlib, and version 2.0 of TensorFlow.

Solution

project-dir $ conda create --prefix ./env \
python=3.6 \
matplotlib=3.1 \
tensorflow=2.0

Placing Conda environments outside of the default ~/miniconda3/envs/ folder comes with a couple of minor drawbacks. First, conda can no longer find your environment with the --name flag; you’ll generally need to pass the --prefix flag along with the environment’s full path to find the environment.

Second, an annoying side-effect of specifying an install path when creating your Conda environments is that your command prompt is now prefixed with the active environment’s absolute path rather than the environment’s name. After activating an environment using its prefix your prompt will look similar to the following.

(/absolute/path/to/env) $

As you can imagine, this can quickly get out of hand.

(/Users/USER_NAME/research/data-science/PROJECT_NAME/env) $

If (like me!) you find this long prefix to your shell prompt annoying, then there is a quick fix: modify the env_prompt setting in your .condarc file, which you can do with the following command.

$ conda config --set env_prompt '({name})'

This will either edit your ~/.condarc file if you already have one or create a ~/.condarc file if you do not. Now your command prompt will display the active environment’s generic name.

$ cd project-directory
$ conda activate ./env
(env) project-directory $
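One way to picture what the ({name}) template does is as a string template in which {name} is replaced by the last component of the environment's path. The sketch below mimics that behaviour with Python's own string formatting. The function render_prompt is hypothetical and is not part of Conda, and the path is the placeholder path from the example above.

```python
from pathlib import PurePosixPath


def render_prompt(template, prefix):
    """Fill a prompt template with the final component of the environment path."""
    return template.format(name=PurePosixPath(prefix).name)


long_prefix = "/Users/USER_NAME/research/data-science/PROJECT_NAME/env"
prompt_prefix = render_prompt("({name})", long_prefix)
print(prompt_prefix)  # (env)
```

This also shows the trade-off of the setting: every environment stored in a sub-directory called env will display the same short prompt, whichever project it belongs to.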

For more on modifying your .condarc file, see the official Conda docs.

Activate an existing environment by path

Activate the environment created in a previous challenge using the path to the environment directory.

Solution

You can activate an existing environment by providing the path to the environment directory instead of the environment name when using the conda activate command as follows.

$ conda activate ./env

Note that the provided path can either be absolute or relative. If the path is a relative path then it must start with ./ on Unix systems and .\ when using PowerShell on Windows.

Conda can create environments for R projects too!

First create a project directory called r-project-dir using the following command.

$ cd ~/
$ mkdir r-project-dir
$ cd r-project-dir

Next, take a look through the list of R packages available by default for installation using conda. Create a new environment inside the newly created r-project-dir in a sub-directory called env and install r-base, r-tidyverse.

Solution

$ conda create --prefix ./env \
 r-base \
 r-tidyverse

Listing existing environments

Now that you have created a number of Conda environments on your local machine you have probably forgotten the names of all of the environments and exactly where they live. Fortunately, there is a conda command to list all of your existing environments together with their locations.

$ conda env list
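If you want to use this list from a script, the command can also emit JSON when given the --json flag. The sketch below parses a hand-written sample of such output rather than calling conda, so it runs anywhere. The sample paths are invented, and the exact shape of the JSON (a single "envs" key holding a list of paths) is an assumption you should check against your own conda version.

```python
import json
from pathlib import PurePosixPath

# Hand-written sample standing in for the output of: conda env list --json
sample_output = """
{
  "envs": [
    "/home/user/miniconda3",
    "/home/user/miniconda3/envs/basic-scipy-env",
    "/home/user/project-dir/env"
  ]
}
"""


def environment_paths(json_text):
    """Extract the list of environment directories from the JSON text."""
    return json.loads(json_text)["envs"]


paths = environment_paths(sample_output)
for path in paths:
    # Show the final directory name next to the full location.
    print(PurePosixPath(path).name, "->", path)
```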

Listing the contents of an environment

In addition to forgetting names and locations of Conda environments, at some point you will probably forget exactly what has been installed in a particular Conda environment. Again, there is a conda command for listing the contents of an environment. To list the contents of the basic-scipy-env that you created above, run the following command.

$ conda list --name basic-scipy-env

If you created your Conda environment using the --prefix option to install packages into a particular directory, then you will need to use that prefix in order for conda to locate the environment on your machine.

$ conda list --prefix /path/to/conda-env

Listing the contents of a particular environment.

List the packages installed in the machine-learning-env environment that you created in a previous challenge.

Solution

You can list the packages and their versions installed in machine-learning-env using the conda list command as follows.

$ conda list --name machine-learning-env

To list the packages and their versions installed in the active environment leave off the --name or --prefix option.

$ conda list

Deleting entire environments

Occasionally, you will want to delete an entire environment. Perhaps you were experimenting with conda commands and you created an environment you have no intention of using; perhaps you no longer need an existing environment and just want to get rid of cruft on your machine. Whatever the reason, the command to delete an environment is the following.

$ conda remove --name python36-env --all

If you wish to delete an environment that you created with the --prefix option, then you will need to provide the prefix again when removing the environment.

$ conda remove --prefix /path/to/conda-env/ --all

Delete an entire environment

Delete the entire “machine-learning-env” environment.

Solution

In order to delete an entire environment you use the conda remove command as follows.

$ conda remove --name machine-learning-env --all --yes

This command will remove all packages from the named environment before removing the environment itself. The use of the --yes flag short-circuits the confirmation prompt (and should be used with caution).

Key Points

  • A Conda environment is a directory that contains a specific collection of Conda packages that you have installed.

  • You create (remove) a new environment using the conda create (conda remove) commands.

  • You activate (deactivate) an environment using the conda activate (conda deactivate) commands.

  • You install packages into environments using conda install; you install packages into an active environment using pip install.

  • You should install each environment as a sub-directory inside its corresponding project directory

  • Use the conda env list command to list existing environments and their respective locations.

  • Use the conda list command to list all of the packages installed in an environment.


Using Packages and Channels

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • What are Conda packages?

  • What are Conda channels?

  • Why should I be explicit about which channels my research project uses?

  • What should I do if a Python package isn’t available via a Conda channel?

Objectives
  • Install a package from a specific channel.

  • Understand how conda channels work.

  • Use pip to install a package into your environment.

What are Conda packages?

A conda package is a compressed archive file (.tar.bz2) that contains:

Conda keeps track of the dependencies between packages and platforms; the conda package format is identical across platforms and operating systems.

Package Structure

All conda packages have a specific sub-directory structure inside the tarball file. There is a bin directory that contains any binaries for the package; a lib directory containing the relevant library files (e.g., the .py files); and an info directory containing package metadata. For more details on the conda package specification, including discussions of the various metadata files, see the official conda documentation.

As an example of Conda package structure consider the Conda package for the RNA-Seq transcript quantification package salmon targeting a 64-bit Linux, salmon-1.4.0-h84f40af_1.tar.bz2.

.
├── bin
│   └── salmon
├── info
│   ├── about.json
│   ├── files
│   ├── git
│   ├── hash_input.json
│   ├── has_prefix
│   ├── index.json
│   ├── licenses
│   │   └── LICENSE
│   ├── paths.json
│   ├── recipe
│   │   ├── 0.14.2-1
│   │   │   ├── build.sh
│   │   │   └── meta.yaml
│   │   ├── build.sh
│   │   ├── conda_build_config.yaml
│   │   ├── meta.yaml
│   │   ├── meta.yaml.template
│   │   └── run_test.sh
│   ├── repodata_record.json
│   └── test
│       ├── run_test.sh
│       └── sample_data.tgz
└── lib
    ├── graphdump
    │   ├── graphdump-targets.cmake
    │   └── graphdump-targets-release.cmake
    ├── libgraphdump.a
    ├── libntcard.a
    ├── libsalmon_core.a
    ├── libtwopaco.a
    ├── ntcard
    │   ├── ntcard-targets.cmake
    │   └── ntcard-targets-release.cmake
    └── twopaco
        ├── twopaco-targets.cmake
        └── twopaco-targets-release.cmake
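Since a .tar.bz2 conda package is an ordinary compressed tar archive, you can look inside one with Python's standard tarfile module. The sketch below builds a miniature stand-in archive with the same bin, info and lib layout, then lists its top-level entries. The archive name example-1.0-0.tar.bz2 and its contents are invented for the illustration and are not a real package.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def top_level_entries(archive_path):
    """Return the sorted top-level names stored in a .tar.bz2 archive."""
    with tarfile.open(archive_path, "r:bz2") as archive:
        return sorted({name.split("/")[0] for name in archive.getnames()})


# Build a miniature stand-in archive so the example runs without downloading anything.
archive_path = Path(tempfile.mkdtemp()) / "example-1.0-0.tar.bz2"
with tarfile.open(archive_path, "w:bz2") as archive:
    for member_name in ["bin/example", "info/index.json", "lib/libexample.a"]:
        payload = b"placeholder"
        member = tarfile.TarInfo(name=member_name)
        member.size = len(payload)
        archive.addfile(member, io.BytesIO(payload))

entries = top_level_entries(archive_path)
print(entries)
```

Pointing top_level_entries at a real package file from your package cache should show the same kind of layout as the tree above.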

Conda packages cache directory (pkgs_dirs)

When you first install a package, conda downloads the .tar.bz2 package file into the package cache directory. Before conda installs a package, it first looks in the package cache directory. By default this directory is placed under ~/.conda/pkgs.
For Anaconda 5.0.1 and newer, you can also configure the location of your pkgs directory by using the following command:

conda config --add pkgs_dirs <package directory>

What actually happens when I install packages?

During the installation process, files are extracted into the specified environment (defaulting to the current environment if none is specified). Installing the files of a conda package into an environment can be thought of as changing the directory to an environment, and then downloading and extracting the package and its dependencies.

For example, when you conda install a package that exists in a known repository (channel), and has no dependencies, conda does the following.

  1. looks at your configured channels (in priority order)
  2. reaches out to the repodata associated with your channels/platform
  3. parses repodata to search for the package
  4. once the package is found, conda pulls it down and installs
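
The four steps above can be sketched in Python. This is a toy model, not conda's implementation: TOY_REPODATA and find_channel are hypothetical names, and real repodata is a JSON index downloaded per channel and platform rather than a set of names.

```python
# Toy stand-in for the repodata of each channel; the contents are illustrative.
TOY_REPODATA = {
    "defaults": {"python", "numpy"},
    "bioconda": {"salmon", "fastqc"},
}


def find_channel(package, channels, repodata=TOY_REPODATA):
    """Return the first channel, in priority order, that lists the package."""
    for channel in channels:  # step 1: walk the channels by priority
        index = repodata.get(channel, set())  # step 2: get that channel's repodata
        if package in index:  # step 3: search the index for the package
            return channel  # step 4: conda would download and install from here
    return None  # no configured channel provides the package


print(find_channel("salmon", ["defaults"]))
# → None
print(find_channel("salmon", ["defaults", "bioconda"]))
# → bioconda
```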

The conda documentation has a nice decision tree that describes the package installation process.

Installing with Conda

Channel

Let’s create a new environment basic-rnaseq-env and install a transcript quantification package, salmon, for an RNA-Seq analysis project.

$ conda create --name basic-rnaseq-env salmon

This will return:

Loading channels: done
No match found for: salmon. Search: *salmon*

PackagesNotFoundError: The following packages are not available from current channels:

  - salmon

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

What does “The following packages are not available from current channels” mean?

What are Conda channels?

When you install or search for a package, conda looks for it in remote repositories called channels. These remote channels are URLs to directories containing conda packages. By default, the conda search command searches a set of channels managed by Anaconda and hosted on Anaconda Cloud.

Collectively, the Anaconda managed channels are referred to as the defaults channel because, unless otherwise specified, packages installed using conda will be downloaded from these channels.

My package isn’t available on the defaults channel! What should I do?

As was the case with salmon, packages (or often more recent versions of packages!) that you need to install for your project may not be available on the defaults channel. In this case you should search for alternate channels that may provide the conda package you’re looking for. To do this you should first navigate to

https://anaconda.org

and use the search bar at the top of the page. If we search for salmon we will see that it is available via a channel called bioconda.

bioconda

Bioconda is a channel, maintained by the Bioconda project, specialising in bioinformatics software. Bioconda contains thousands of bioinformatics packages ready to use with conda install.

R and Bioconductor packages

Most R packages on CRAN should be submitted to Conda-Forge. However, if a CRAN package depends on a package from Bioconductor, a repository for bioinformatics R packages, it belongs in Bioconda. Otherwise, it belongs in Conda-Forge.

conda-forge

In addition to the default channels that are managed by Anaconda Inc., there is another channel called Conda-Forge that also has a special status. The Conda-Forge project “is a community led collection of recipes, build infrastructure and distributions for the conda package manager.”

There are a number of reasons that you may wish to use the conda-forge channel instead of the defaults channel maintained by Anaconda:

  1. Packages on conda-forge may be more up-to-date than those on the defaults channel.
  2. There are packages on the conda-forge channel that aren’t available from defaults.
  3. You may wish to use a dependency such as openblas (from conda-forge) instead of mkl (from defaults).

How do I search for a package from a specific channel?

If you know the channel your package is likely to be located on, you can use the conda search command with the --channel option and the name of the channel. For example:

$ conda search --channel bioconda salmon
Loading channels: done
# Name                       Version           Build  Channel
salmon                         0.5.1               0  bioconda
salmon                         0.6.0               0  bioconda
...[truncated]...
salmon                         1.3.0      hf69c8f4_0  bioconda
salmon                         1.4.0      h84f40af_1  bioconda
salmon                         1.4.0      hf69c8f4_0  bioconda
salmon                         1.5.0      h84f40af_0  bioconda

How do I install a package from a specific channel?

If you know the channel your package is available from, you can install it from that channel into the currently active environment by passing the --channel or -c option to the conda install command as follows.

$ conda create --name basic-rnaseq-env
$ conda activate basic-rnaseq-env
$ conda install --channel bioconda salmon

You can also install a package from a specific channel into a named environment (using --name or -n) or into an environment installed at a particular prefix (using --prefix or -p). For example, the following command installs the salmon package from the bioconda channel into the environment called basic-rnaseq-env which we created earlier.

$ conda install salmon --channel bioconda --name basic-rnaseq-env

This command would install the salmon package from the bioconda channel into an environment located in the env/ sub-directory.

$ conda install salmon --channel bioconda --prefix ./env

You may have noticed that we didn’t manage to install the latest version of salmon. Why? The bioconda channel contains bioinformatics packages (salmon, STAR, samtools, DESeq2, etc.), but the conda-forge channel provides most of the dependencies they need (numpy, scipy, zlib, CRAN packages, etc.). Therefore we need to specify multiple channels to install the latest version.

Specify multiple channels

To specify multiple channels when installing packages, pass the --channel argument multiple times.

$ conda install salmon=1.5 --channel conda-forge --channel bioconda --name basic-rnaseq-env

This also works when installing multiple packages.

$ conda install fastqc=0.11 multiqc=1.10 --channel conda-forge --channel bioconda --name basic-rnaseq-env

Specifying channels when installing packages

Create a new directory called rnaseq-project and then create an environment in a sub-directory called env/ with the packages salmon=1.5, fastqc=0.11 and multiqc=1.11.

Solution

In order to create a new environment you use the conda create command as follows.

cd ~/
mkdir rnaseq-project
cd rnaseq-project/
conda create --prefix ./env --channel conda-forge \
 --channel bioconda \
 salmon=1.5 \
 fastqc=0.11 \
 multiqc=1.11

Hint: For the lazy typers, the --channel argument can also be shortened to -c. For more abbreviations, see the Conda command reference.

Alternative syntax for installing packages from specific channels

There exists an alternative syntax for installing conda packages from specific channels that more explicitly links the channel being used to install a particular package.

$ conda install bioconda::multiqc --prefix ./env

Install the latest version of the workflow manager nextflow using this alternative syntax.

Solution

One possibility would be to use the conda install command as follows.

$ conda install --prefix ./env --channel conda-forge bioconda::nextflow

Channel priority

Different channels can have the same package, so conda must decide which channel to install the package from. Conda channels have a priority hierarchy.

By default, conda prefers packages from a higher priority channel over any version from a lower priority channel.

Conda collects all of the packages with the same name across all listed channels and processes them as follows:

  1. Sorts packages from highest to lowest channel priority.

  2. Sorts tied packages—packages with the same channel priority—from highest to lowest version number. For example, if channelA contains NumPy 1.12.0 and 1.13.1, NumPy 1.13.1 will be sorted higher.

  3. Sorts still-tied packages, packages with the same channel priority and same version, from highest to lowest build number. For example, if channelA contains both NumPy 1.12.0 build 1 and build 2, build 2 is sorted first. Any packages in channelB would be sorted below those in channelA.

  4. Installs the first package on the sorted list that satisfies the installation specifications.
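
The sorting rules above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: sort_candidates is a hypothetical helper, versions are modelled as tuples of integers (conda's real version ordering handles more cases), and the candidate data is invented for illustration.

```python
def sort_candidates(candidates, channels):
    """Order (channel, version, build_number) tuples as conda would rank them.

    channels lists channel names from highest to lowest priority.
    """
    rank = {name: position for position, name in enumerate(channels)}
    # Rules 2 and 3: highest version first, then highest build number.
    ordered = sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)
    # Rule 1: channel priority dominates; the stable sort keeps the order
    # established above for candidates from the same channel.
    ordered.sort(key=lambda c: rank[c[0]])
    return ordered


candidates = [
    ("channelB", (1, 14, 0), 0),
    ("channelA", (1, 12, 0), 1),
    ("channelA", (1, 13, 1), 0),
    ("channelA", (1, 12, 0), 2),
]
for candidate in sort_candidates(candidates, ["channelA", "channelB"]):
    print(candidate)
# → ('channelA', (1, 13, 1), 0)
# → ('channelA', (1, 12, 0), 2)
# → ('channelA', (1, 12, 0), 1)
# → ('channelB', (1, 14, 0), 0)
```

Note that the newest version, 1.14.0, is sorted last because it comes from the lower priority channel.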

Note: The priority of channels listed on the command line decreases from left to right. For example, consider installing base R (r-base) using the command below.

$ conda create --name rproject-env --channel defaults --channel conda-forge r-base

The first channel, defaults, would have a higher priority than the second, conda-forge. This is true even if the version number of the package is higher in the second channel.

Note: The bioconda team suggests that the conda-forge channel be a higher priority than the bioconda channel.

A Python package isn’t available on any Conda channel! What should I do?

If a Python package that you need isn’t available on any Conda channel, then you can use the default Python package manager Pip to install this package from PyPI. However, there are a few potential issues that you should be aware of when using Pip to install Python packages when using Conda.

First, Pip is sometimes installed by default on operating systems where it is used to manage any Python packages needed by your OS. You do not want to use this pip to install Python packages when using Conda environments.

(base) $ conda deactivate
$ which python
/usr/bin/python
$ which pip # sometimes installed as pip3
/usr/bin/pip

Second, Pip is also included in the Miniconda installer, where it is used to install and manage OS specific Python packages required to set up your base Conda environment. You do not want to use this pip to install Python packages when using Conda environments.

$ conda activate
(base) $ which python
~/miniconda3/bin/python
(base) $ which pip
~/miniconda3/bin/pip

Another reason to avoid installing packages into your base Conda environment

If your base Conda environment becomes cluttered with a mix of Pip and Conda installed packages it may no longer function. Creating separate conda environments allows you to delete and recreate environments readily, so you don’t have to worry about risking your core Conda functionality when mixing packages installed with Conda and Pip.

If you find yourself needing to install a Python package that is only available via Pip, then you should first install pip into your Conda environment and then use that pip to install the desired package. Using the pip installed in your Conda environment to install Python packages not available via Conda channels will help you avoid difficult to debug issues that frequently arise when using Python packages installed via a pip that was not installed inside your Conda environment.

Conda (+Pip): Conda wherever possible; Pip only when necessary

When using Conda to manage environments for your Python project it is a good idea to install packages available via both Conda and Pip using Conda; however there will always be cases where a package is only available via Pip in which case you will need to use Pip. Many of the common pitfalls of using Conda and Pip together can be avoided by adopting the following practices.

  • Always explicitly install pip in every Python-based Conda environment.
  • Always be sure your desired environment is active before installing anything using pip.
  • Prefer python -m pip install over pip install; never use pip with the --user argument.
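
One way to check the second practice is to ask Python itself where it is running from. The sketch below is an assumption-laden illustration: the helper name python_matches_active_env is hypothetical, and it relies on the CONDA_PREFIX environment variable that conda sets when an environment is activated.

```python
import os
import sys


def python_matches_active_env(prefix=None, conda_prefix=None):
    """Return True if the running Python lives in the active Conda environment."""
    prefix = prefix or sys.prefix
    conda_prefix = conda_prefix or os.environ.get("CONDA_PREFIX")
    if not conda_prefix:
        return False  # no Conda environment appears to be active
    return os.path.realpath(prefix) == os.path.realpath(conda_prefix)


# When this prints True, `python -m pip install ...` run with the same
# interpreter installs into the active environment.
print(python_matches_active_env())
```

This is also why python -m pip is preferred over a bare pip: the packages end up in the environment of whichever python is first on your PATH.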

Installing packages into Conda environments using pip

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.

Activate the conda environment python3-env and use pip to install pandas.

Solution

The following commands will activate python3-env and install the Python data analysis library pandas.

$ conda activate python3-env
(python3-env) $ conda install pip
(python3-env) $ python -m pip install pandas

For more details on using pip see the official documentation.

Key Points

  • A package is a tarball containing system-level libraries, Python or other modules, executable programs and other components, and associated metadata.

  • A Conda channel is a URL to a directory containing one or more Conda packages.

  • You can specify a conda channel using the option --channel or add it to your .condarc.

  • If a python package isn’t available on a conda channel you can install it into your environment using the python package installer pip.


Sharing Environments

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • Why should I share my Conda environment with others?

  • How do I share my Conda environment with others?

  • How do I create an environment file that can be read by Windows, Mac OS, or Linux?

  • How do I specify the package version in a Conda environment file?

Objectives
  • Understand why you would create a Conda environment file.

  • Create a Conda environment file in a text editor, specifying the channel, packages and their version.

  • Use the conda env subcommand to export a given environment to an environment file.

Reproducible research

Conda environments are useful when making bioinformatics projects reproducible. Full reproducibility requires the ability to recreate the system that was originally used to generate the results. This can, to a large extent, be accomplished by using a Conda environment file to make an environment with specific versions of the packages that are needed in the project. This environment file can then be shared with other users so that they can reproduce your analysis environment with the same software versions.

Creating an environment file

Conda uses YAML (“YAML Ain’t Markup Language”) for writing its environment files. YAML is a human-readable language that is commonly used for configuration files and that uses Python-style indentation to indicate nesting.

Creating your project’s Conda environment from a single environment file is a Conda “best practice”. Not only do you have a file to share with collaborators, but you also have a file that can be placed under version control, which further enhances the reproducibility of your research project and workflow.

Default environment.yml file

Note that by convention Conda environment files are called environment.yml. As such if you use the conda env create sub-command without passing the --file option, then conda will expect to find a file called environment.yml in the current working directory and will throw an error if a file with that name can not be found.

Let’s take a look at an example environment.yml file to give you an idea of how to write your own environment files.

name: rnaseq-env
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - salmon
  - fastqc
  - multiqc

The first line specifies a default name, rnaseq-env, for the environment; this can be overridden on the command line. The channels section specifies a list of channels, in priority order, that packages may need to be installed from. Finally, the dependencies section lists the packages to install; because no versions are given, conda will download the most current, mutually compatible versions of the listed packages (including all required dependencies).

The newly created environment would be installed inside the conda environments directory, e.g. ~/miniconda3/envs/, unless we specify a different path using the conda env create command line option --prefix or -p.

Since explicit version numbers for all packages should be preferred, a better environment file would be the following.

name: rnaseq-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - salmon=1.5
  - fastqc=0.11
  - multiqc=1.11

Note that we are only specifying the major and minor version numbers and not the patch or build numbers. Defining the version number by fixing only the major and minor version numbers while allowing the patch version number to vary allows us to use our environment file to update our environment to get any bug fixes whilst still maintaining significant consistency of our Conda environment across updates.
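
A specification such as salmon=1.5 acts as a prefix match on the version: it accepts 1.5 and any patch release of 1.5, but not 1.6. The Python sketch below illustrates the idea only; fuzzy_match is a hypothetical helper and it ignores the richer version syntax that conda supports.

```python
def fuzzy_match(spec_version, candidate):
    """Return True if candidate equals spec_version or extends it with more parts."""
    spec_parts = spec_version.split(".")
    candidate_parts = candidate.split(".")
    # Compare whole dot-separated parts so that "1.5" does not match "1.50.0".
    return candidate_parts[: len(spec_parts)] == spec_parts


for version in ["1.5", "1.5.0", "1.5.2", "1.50.0", "1.6.0"]:
    print(version, fuzzy_match("1.5", version))
# → 1.5 True
# → 1.5.0 True
# → 1.5.2 True
# → 1.50.0 False
# → 1.6.0 False
```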

Always version control your environment.yml files!

While you should never version control the contents of your env/ environment sub-directory, you should always version control your environment.yml files. Version controlling your environment.yml files together with your project’s source code means that you always know which versions of which packages were used to generate your results at any particular point in time.

Let’s suppose that you want to use the environment.yml file defined above to create a Conda environment in a sub-directory of a project directory. Here is how you would accomplish this task.

$ cd ~/
$ mkdir rnaseq-project-2
$ cd rnaseq-project-2

Once your project folder is created, create an environment.yml file using your favourite editor, for instance nano.

name: rnaseq-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - salmon=1.5
  - fastqc=0.11
  - multiqc=1.11

Finally create a new conda environment:

$ conda env create --prefix ./env --file environment.yml
$ conda activate ./env

Note that the above sequence of commands assumes that the environment.yml file is stored within your rnaseq-project-2 directory.

Automatically generate an environment.yml

We can automatically generate the contents of an environment file using the conda env export command. To export the packages installed into the previously created basic-rnaseq-env you can run the following command:

$ conda env export --name basic-rnaseq-env

When you run this command, you will see the resulting YAML formatted representation of your Conda environment streamed to the terminal. Recall that we only listed three packages when we originally created basic-rnaseq-env yet from the output of the conda env export command we see that these packages result in an environment with a large number of dependencies!

To export this list into an environment.yml file, you can use the --file option to save the resulting YAML environment directly into a file.

$ conda env export --name basic-rnaseq-env --file environment.yml

Make sure you do not have any other environment.yml file from before in the same directory when running the above command.

This exported environment file will however not consistently produce environments that are reproducible across Mac OS, Windows, and Linux. The reason is, that it may include operating system specific low-level packages which cannot be used by other operating systems.

If you need an environment file that can produce environments that are reproducible across Mac OS, Windows, and Linux, then you are better off including only those packages that you explicitly installed. You can do this with the --from-history option.

$ conda env export --name basic-rnaseq-env --from-history --file environment.yml

In short: to make sure others can reproduce your environment independent of the operating system they use, make sure to add the --from-history argument to the conda env export command.

Pip and conda env export --from-history

Python packages installed via pip are not exported when using the --from-history argument of conda env export. You can add them to the environment YAML file by listing pip itself as a dependency, followed by the keyword pip: and a nested list of Python packages. For example:

name: rnaseq-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - salmon=1.5
  - fastqc=0.11
  - multiqc=1.11
  - pip
  - pip:
    - pandas

Create a new environment from a YAML file.

Create a new project directory rnaseq-project-3 and then create a new environment.yml file inside your project directory with the following contents.

name: rnaseq-project3-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - salmon=1.5
  - fastqc=0.11
  - multiqc=1.11

Now use this file to create a new Conda environment. Where is this new environment created? Using the same environment.yml file create a Conda environment as a sub-directory called env/ inside a newly created project directory. Compare the contents of the two environments.

Solution

To create a new environment from a YAML file use the conda env create sub-command as follows.

$ cd ~/
$ mkdir rnaseq-project-3
$ cd rnaseq-project-3
$ nano environment.yml
$ conda env create --file environment.yml

The above sequence of commands will create a new Conda environment inside the envs_dirs directory. In order to create the Conda environment inside a sub-directory of the project directory you need to pass the --prefix option to the conda env create command as follows.

$ conda env create --file environment.yml --prefix ./env

You can now run the conda env list command and see that these two environments have been created in different locations but contain the same packages.

Updating an environment

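You are unlikely to know ahead of time which packages (and version numbers!) you will need to use for your research project. For example, one of your dependencies may release a new version that you want to use, you may need an additional package, or you may no longer need a package that you installed earlier.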

If any of these occurs during the course of your research project, all you need to do is update the contents of your environment.yml file accordingly and then run the following command.

$ cd ~/
$ cd rnaseq-project-2
$ conda env update --prefix ./env --file environment.yml  --prune

Note that the --prune option tells conda to remove any installed packages not defined in environment.yml.
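
The effect of --prune can be pictured as a set difference between what is installed and what the environment file declares. The Python sketch below is a toy model: plan_update is a hypothetical helper, and it ignores the transitive dependencies and version constraints that conda's solver takes into account.

```python
def plan_update(installed, declared, prune=False):
    """Return (to_install, to_remove) package name lists for an update."""
    to_install = sorted(set(declared) - set(installed))
    # Without prune, packages missing from the file are left in place.
    to_remove = sorted(set(installed) - set(declared)) if prune else []
    return to_install, to_remove


installed = ["salmon", "fastqc", "multiqc"]
declared = ["fastqc", "multiqc", "kallisto"]

print(plan_update(installed, declared))
# → (['kallisto'], [])
print(plan_update(installed, declared, prune=True))
# → (['kallisto'], ['salmon'])
```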

Rebuilding a Conda environment from scratch

When working with environment.yml files it is often just as easy to rebuild the Conda environment from scratch whenever you need to add or remove dependencies. To rebuild a Conda environment from scratch you can pass the --force option to the conda env create command which will remove any existing environment directory before rebuilding it using the provided environment file.

$ conda env create --prefix ./env --file environment.yml --force

Update environment from environment.yml

Update the environment file from the previous exercise, rnaseq-project-3, by adding the package kallisto=0.46 and removing the salmon package. Then rebuild the environment.

Solution

The environment.yml file should now look as follows.

name: rnaseq-project3-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc=0.11
  - multiqc=1.11
  - kallisto=0.46

You could use the following command, that will rebuild the environment from scratch with the new dependencies:

$ cd ~/rnaseq-project-3
$ conda env create --prefix ./env --file environment.yml --force

Or, if you just want to update the environment in-place with the new kallisto dependencies, you can use:

$ conda env update --prefix ./env --file environment.yml  --prune

Restoring an environment

Conda keeps a history of all the changes made to your environment, so you can easily “roll back” to a previous version. To list the history of each change to the current environment:

$ conda activate basic-rnaseq-env
$ conda list --revisions

To restore the environment to a previous revision, pass the revision number to conda install --revision=REVNUM (or, equivalently, conda install --rev REVNUM):

$ conda install --revision=REVNUM

For example,

$ conda install --revision=1

List revisions.

Activate the environment inside the rnaseq-project-3 directory and list its revisions.

Solution

To activate the environment and list its revisions, use the following commands.

$ cd ~/
$ cd rnaseq-project-3
$ conda activate ./env
$ conda list --revisions

Key Points

  • Sharing Conda environments with other researchers facilitates the reproducibility of your research.

  • A Conda environment file, environment.yml, describes your project’s software environment.


Configuring Conda

Overview

Teaching: 20 min
Exercises: 5 min
Questions
  • How can I configure conda?

  • How can I see conda’s configuration values?

  • How can I modify conda’s configuration settings?

Objectives
  • Use conda config --show to display all configuration values.

  • Modify the .condarc file using the conda config subcommand.

  • Locate and view the contents of the .condarc file.

Configuration

Conda has a number of configuration settings which control how it behaves. To display and change these settings we can use the conda config subcommand.

To display all configuration settings, run conda config --show:

$ conda config --show

As you can see conda supports a large number of configuration options. To show a single setting add the setting name after the conda config --show command. For example, to show the list of channels conda searches run:

$ conda config --show channels
channels:
  - defaults

By default conda only searches the defaults channel. This is why we had to include the conda-forge and bioconda channels via the command line option --channel in the previous episode.

To get more information about an individual conda setting and its possible values, run conda config --describe <option>. For example:

$ conda config --describe channels
# # channels (sequence: primitive)
# #   aliases: channel
# #   env var string delimiter: ','
# #   The list of conda channels to include for relevant operations.
# #
# channels:
#   - defaults

.condarc

A user’s conda settings are stored in the runtime configuration file, .condarc. This file allows users to configure various aspects of conda, such as which channels conda searches for packages and how the command prompt shows the active environment.

Like the environment file, the .condarc configuration file follows a simple YAML syntax.

The .condarc file is not included by default, but it is automatically created in your home directory the first time you run the conda config command.

Creating or modifying .condarc

To create or modify a .condarc file, enter the conda config command and use one of the modifier options --add, --set, --append, --prepend or --remove, followed by the configuration key and a value.

conda config <modifier> <KEY> <VALUE>

Adding a configuration value

To add conda-forge to the list of channels we can use the --add, --append or --prepend modifier option.

For example, if we want to add a channel to the list of channels in our configuration file, rather than specifying it on the command line every time, we can use the conda config --add modifier.

$ conda config --add channels conda-forge

This would add the conda-forge channel to the top of the channel list.

$ conda config --show channels

We can use the conda config modifier --append to add conda-forge to the end of the channel list, giving it the lowest priority.

$ conda config --append channels conda-forge
Warning: 'conda-forge' already in 'channels' list, moving to the bottom
$ conda config --show channels

To move a channel to the highest priority use the conda config --prepend modifier.

$ conda config --prepend channels conda-forge
Warning: 'conda-forge' already in 'channels' list, moving to the top
$ conda config --show channels

Note: It is generally best to have conda-forge as the highest priority channel as this will usually have the most up-to-date packages.
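
The effect of these modifiers on the channel list can be mimicked with a plain Python list. This is only a toy model of the behaviour described above (modify_channels is a hypothetical helper, not part of conda), treating --add like --prepend as in the examples.

```python
def modify_channels(channels, modifier, channel):
    """Return a new channel list after applying a conda-config-style modifier."""
    # An existing entry is moved rather than duplicated.
    remaining = [name for name in channels if name != channel]
    if modifier in ("--add", "--prepend"):
        return [channel] + remaining  # top of the list: highest priority
    if modifier == "--append":
        return remaining + [channel]  # bottom of the list: lowest priority
    if modifier == "--remove":
        return remaining
    raise ValueError(f"unsupported modifier: {modifier}")


channels = ["defaults"]
channels = modify_channels(channels, "--add", "conda-forge")
print(channels)
# → ['conda-forge', 'defaults']
channels = modify_channels(channels, "--append", "conda-forge")
print(channels)
# → ['defaults', 'conda-forge']
```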

Adding the channels bioconda and conda-forge to .condarc.

Add the bioconda and conda-forge channels to your .condarc file. Give conda-forge the highest priority.

Solution

To add the bioconda and conda-forge channels to your .condarc file, use the following commands.

$ conda config --add channels bioconda
$ conda config --add channels conda-forge

The above sequence of commands will add the channels to your .condarc. Use the command below to show the channel priority order.

$ conda config --get channels
--add channels 'defaults'   # lowest priority
--add channels 'bioconda'
--add channels 'conda-forge'   # highest priority

Setting configuration settings

If our configuration setting has a single boolean or string value we can use conda config --set to set it.

For example, in a previous episode we set the command line prompt setting for conda using env_prompt.

$ conda config --describe env_prompt

The env_prompt setting is a template that can use the variables '{prefix}', '{name}' and '{default_env}'.

# # env_prompt (str)
# #   Template for prompt modification based on the active environment.
# #   Currently supported template variables are '{prefix}', '{name}', and
# #   '{default_env}'. '{prefix}' is the absolute path to the active
# #   environment. '{name}' is the basename of the active environment
# #   prefix. '{default_env}' holds the value of '{name}' if the active
# #   environment is a conda named environment ('-n' flag), or otherwise
# #   holds the value of '{prefix}'. Templating uses python's str.format()
# #   method.
# #
# env_prompt: '({default_env}) '

To set the env_prompt to the default value '({default_env})' we can run:

$ conda config --set env_prompt '({default_env})'

To change it to show just the environment name, we can run:

$ conda config --set env_prompt '({name})'

Note: You need to deactivate and then reactivate the environment for changes to env_prompt to take effect.

Set conda channel_priority

Use conda config --describe to investigate the channel_priority setting. Set channel_priority so that packages in lower priority channels are not considered if a package with the same name appears in a higher priority channel. Why would you want to change this setting?

Solution

$ conda config --describe channel_priority
$ conda config --set channel_priority strict
$ conda config --show channel_priority

Using strict channel priority can dramatically speed up conda operations and also reduce package incompatibility problems. This will be the default as of conda 5.0.
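
As a rough illustration of what strict priority means, the Python sketch below keeps only the candidates from the highest priority channel that provides the package at all. The helper name candidates_under_strict and the candidate data are hypothetical; conda's solver is considerably more involved.

```python
def candidates_under_strict(candidates, channels):
    """Keep only candidates from the highest priority channel that has any.

    candidates is a list of (channel, version) tuples and channels lists
    channel names from highest to lowest priority.
    """
    for channel in channels:
        kept = [c for c in candidates if c[0] == channel]
        if kept:
            return kept  # lower priority channels are not considered
    return []


candidates = [("defaults", "1.6.0"), ("conda-forge", "1.5.0"), ("conda-forge", "1.5.2")]
print(candidates_under_strict(candidates, ["conda-forge", "defaults"]))
# → [('conda-forge', '1.5.0'), ('conda-forge', '1.5.2')]
```

With fewer candidates to weigh against each other there is less work for the solver, and less chance of mixing builds from different channels.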

Editing the .condarc file manually

You can also use a text editor such as nano to directly edit the .condarc.

To show the location and contents of your .condarc file you can use the conda config --show-sources command.

$ conda config --show-sources

Note: If a .condarc file is present in the root environment, it will override any .condarc in the home directory.

Locate and view the .condarc

Locate your .condarc file. Using your favourite text editor look at the .condarc.

Solution

$ conda config --show-sources
$ nano ~/.condarc

Getting help

As with all conda commands you can use the --help option to get help.

For example, for a complete list of conda config commands run

$ conda config --help

Or see the command reference.

Key Points

  • The .condarc is an optional configuration file that stores custom conda settings.

  • You can use the conda config subcommand to add, set or remove configuration settings in the .condarc file.

  • You can also edit the contents of the .condarc file directly using a text editor.