Self-installing Python Modules for BlueBEAR¶
Please refer to the following sections for information:
- Python Virtual Environments
- Conda Environments
Warning
Anaconda Licensing¶
Conda is a popular method for installing research software and is often recommended by developers, particularly for Python-based tools. The use of Conda is discussed in further detail in the section below, but please be aware of the following:
The conda command that is provided via the Anaconda and Miniconda packages uses the Anaconda Public Repository Conda channel by default. This channel may not be free for use in academic research and may require a paid Anaconda licence for some types of research.
We recommend the use of Miniforge, which defaults to using the Conda Forge channel.
Further information on Anaconda licensing can be found here: https://www.anaconda.com/pricing
Python Virtual Environments (a.k.a “venv”)¶
Note
The term “module” in this context refers to extensions to Python’s functionality that are used by including e.g. import flake8 in your Python code.
These are the most commonly used methods for installing Python modules:
pip install flake8
python setup.py install
Where a Python module is available at the Python Package Index (PyPI) it can be installed by using pip, the Python installer command. Executing the default pip install command will not work on BlueBEAR, as users don’t have the file permissions to write into the directory where this process normally places the Python modules. It is possible to pass the --user option to the command so that it installs into your home directory, but this is problematic because it won’t distinguish between node types (microarchitectures) and your jobs may subsequently fail.
We therefore recommend that you use a node-specific Python virtual environment. This solution applies to both the pip installation method and the python setup.py install method.
The process for creating and using a node-specific virtual environment is as follows:
Creating a virtual environment and installing a Python module¶
- Load the BEAR Python module on which you want to base your virtual environment.
- Optional: load any additionally required modules, e.g. Matplotlib, SciPy-bundle etc. (See the tips section for further details.)
- Change to the directory in which you want to create the virtual environment. (Alternatively you can specify the full path in the following step.)
- Create a virtual environment, including the environment variable ${BB_CPU} in its name to identify the node type:
python3 -m venv --system-site-packages my-virtual-env-${BB_CPU}
- Activate the virtual environment:
source my-virtual-env-${BB_CPU}/bin/activate
- Run your Python module installations as normal (N.B. don’t include --user):
PIP_CACHE_DIR="/scratch/${USER}/pip" # (1)!
pip install flake8
1. pip caches can be large, which is potentially problematic due to the limited user quotas on BlueBEAR. Setting this environment variable will force pip to store its cache in the larger /scratch directory.
Using your node-specific virtual environment¶
- First load the same BEAR Python module as you used to create the virtual environment in the previous step. This is important: without it, your Python commands will likely fail.
- Activate the virtual environment:
source my-virtual-env-${BB_CPU}/bin/activate
- Execute your Python code.
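As a quick sanity check after activation, you can confirm that the interpreter now resolves inside the virtual environment. A minimal sketch, using the venv naming convention from the steps above:

```shell
# Activate the node-specific virtual environment
source my-virtual-env-${BB_CPU}/bin/activate

# "which python" should now point inside the venv's bin directory
which python

# Return to your original environment when finished
deactivate
```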
Example script¶
All of the above steps can be encapsulated in a script, which can be included as part of the batch script that you submit to BlueBEAR:
#!/bin/bash
set -e
module purge; module load bluebear
module load bear-apps/2021b
module load Python/3.9.6-GCCcore-11.2.0
export VENV_DIR="${HOME}/virtual-environments"
export VENV_PATH="${VENV_DIR}/my-virtual-env-${BB_CPU}"
# Create a master venv directory if necessary
mkdir -p ${VENV_DIR}
# Check if virtual environment exists and create it if not
if [[ ! -d ${VENV_PATH} ]]; then
python3 -m venv --system-site-packages ${VENV_PATH}
fi
# Activate the virtual environment
source ${VENV_PATH}/bin/activate
# Store pip cache in /scratch directory, instead of the default home directory location
PIP_CACHE_DIR="/scratch/${USER}/pip"
# Perform any required pip installations. For reasons of consistency we would recommend
# that you define the version of the Python module – this will also ensure that if the
# module is already installed in the virtual environment it won't be modified.
pip install flake8==6.0.0
# Execute your Python scripts
python my-script.py
Removing user-wide Python modules¶
If you have installed Python modules using pip install but without using a virtual environment (as detailed above) then you may experience a variety of issues. For example, if you performed a pip install against our Python/3.9.6 module then this will have installed content into the following directory:
${HOME}/.local/lib/python3.9/site-packages
Our recommendation is to remove all Python directories located in ~/.local/lib by executing the following command:
rm -r "${HOME}/.local/lib/python"*
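If you would like to see what is there before deleting anything, the following sketch lists and sizes any user-wide directories first, with the removal command from above left commented out:

```shell
# List any user-wide Python site-packages directories created by pip install --user
ls -d "${HOME}/.local/lib/python"* 2>/dev/null || echo "No user-wide Python directories found"

# Check how much space they occupy
du -sh "${HOME}/.local/lib/python"* 2>/dev/null || true

# Once you are happy, remove them (as recommended above):
# rm -r "${HOME}/.local/lib/python"*
```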
User virtual envs and BEAR Portal’s JupyterLab app¶
The process for using Python extensions installed in a virtual environment within a Python kernel running on the BEAR Portal JupyterLab app is summarised below.
Warning
BEAR Portal Interactive Apps cannot be constrained to a specific node type, so you will need to create multiple virtual environments (one for each node type) by passing constraints in your sbatch script. See here for more information. You will then need to pass the ${BB_CPU} environment variable in the following process, where required.
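One way to do this is to submit the venv-creation step once per node type, selecting the node type with an sbatch constraint. The constraint name below is an illustrative placeholder, not a verified value: check the node-types documentation for the correct names on BlueBEAR.

```shell
#!/bin/bash
#SBATCH --time=10:0
#SBATCH --constraint=cascadelake  # illustrative; resubmit with each node-type constraint in turn

set -e
module purge; module load bluebear
module load bear-apps/2021b
module load Python/3.9.6-GCCcore-11.2.0

# ${BB_CPU} reflects the node type this job landed on, so each submission
# creates a separate venv named for that microarchitecture
python3 -m venv --system-site-packages "${HOME}/virtual-environments/my-virtual-env-${BB_CPU}"
```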
Process¶
-
Start a JupyterLab Interactive App session on BEAR Portal, being sure to match the kernel to the Python version against which you created the virtual environment.
Note
A mismatch between the virtual environment’s Python version and the running kernel’s Python version will likely result in errors.
-
Once connected to the JupyterLab server, load any additional modules that were also present when you created the venv.
- Launch a notebook (or shutdown & restart the kernel for an already-running notebook).
- Within your running notebook, copy the following code (modifying paths where necessary) into a cell and execute it to insert your virtual environment’s site-packages path into the running system path.
import os
from pathlib import Path
import sys

node_type = os.getenv('BB_CPU')
venv_dir = f'/path/to/venv-{node_type}'  # edit this line to match the venv directory format
venv_site_pkgs = Path(venv_dir) / 'lib' / f'python{sys.version_info.major}.{sys.version_info.minor}' / 'site-packages'
if venv_site_pkgs.exists():
    sys.path.insert(0, str(venv_site_pkgs))
else:
    print(f"Path '{venv_site_pkgs}' not found. Check that it exists and/or that it exists for node-type '{node_type}'.")
Subsequent Python import statements will now search in your virtual environment’s path before any others.
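To confirm that the kernel will now pick a package up from the virtual environment, you can check where Python resolves it from in a fresh cell. This sketch uses flake8 from the earlier examples; substitute whichever package you installed:

```python
import importlib.util

# Replace 'flake8' with a package installed in your virtual environment
pkg_name = 'flake8'

spec = importlib.util.find_spec(pkg_name)
if spec is None:
    print(f"'{pkg_name}' not found on sys.path -- check the site-packages insertion above")
else:
    # If the insertion worked, this path should sit inside the venv's site-packages
    print(f"'{pkg_name}' resolves to: {spec.origin}")
```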
Tips¶
- Note
Your virtual environment should add to, and not replace, the Python libraries available via the loaded module.
- If you switch the Python modules being loaded then you must create a new virtual environment based on these new Python modules.
- Further to the above tip, you may need to be aware of dependencies’ version constraints. For example, if a Python module needs a newer version of Matplotlib than the one we provide, first check if BEAR Applications has the later version. If not, see whether you can install an earlier version of the module you require that will work with the BEAR Applications version of Matplotlib – this would be our recommendation, as some Python modules are complex to install. Finally, you can use the BEAR Python module instead of the BEAR Matplotlib module and then install everything yourself although, as mentioned, this may be difficult depending on the complexity of the modules’ installation processes.
- We strongly recommend using a module instead of the system Python version. Also, note that we do not recommend the use of Python 2 as it’s no longer supported by the Python developers.
- Python libraries on PyPI can either be binary packages, known as ‘wheels’, which are self-contained with compiled code, or source packages, which rely on external dependencies such as compiled C/C++/Fortran libraries and which compile at installation. For the latter, you may find that installing through pip will fail and that you need to load additional modules from BEAR Applications before retrying the installation, or that you will need to compile the dependencies yourself.
- Some package authors recommend Python package installation via Anaconda or Miniconda. We do not recommend installing Python packages via this method on the BlueBEAR cluster and would encourage you to contact us if the package you want to make use of suggests this method of installation.
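If you want to know up front whether pip can satisfy an installation from pre-built wheels alone (i.e. without compiling anything locally), you can tell it to refuse source distributions. The package name here is just an example:

```shell
# Accept only pre-built wheels; the install fails rather than building from source
python3 -m pip install --only-binary=:all: numpy

# Conversely, force a build from source (load any required compiler/library
# modules from BEAR Applications first):
# python3 -m pip install --no-binary=:all: numpy
```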
Conda Environments¶
Prior to using Conda environments, please check the information about Anaconda Licensing.
Note
Conda environments are challenging to manage on heterogeneous HPC systems and as such we advise that virtual environments are used instead, wherever possible.
Similarly to the process for Python virtual environments, you have the following options when installing a Conda environment:
Node types (microarchitectures)
Information on all of the available node types can be found here: Specific shared resources
- Install the environment on the oldest available node type (microarchitecture), currently Cascade Lake. Compiled code will therefore run on any available node, at the slight expense of less optimisation on the newer nodes. This is our recommendation as it simplifies the process of using Conda environments.
- Install the environment multiple times, once for each node type (microarchitecture).
Conda environment creation¶
The script below can either be run directly in an interactive job or called as a batch job (N.B. the latter would need the addition of some #SBATCH headers).
The following example uses mamba, which is a newer and faster implementation of conda.
#!/bin/bash
set -e
module purge; module load bluebear
module load bear-apps/2022b
module load Miniforge3/24.1.2-0
eval "$(${EBROOTMINIFORGE3}/bin/conda shell.bash hook)" # (1)!
source "${EBROOTMINIFORGE3}/etc/profile.d/mamba.sh"
# Define the path to your environment (modify as appropriate)
# N.B. this path will be created by the subsequent commands if it doesn't already exist
CONDA_ENV_PATH="/rds/projects/_initial_/_projectname_/${USER}_conda_env" # (2)!
export CONDA_PKGS_DIRS="/scratch/${USER}/conda_pkgs" # (3)!
# Create the environment. Only required once.
mamba create --yes --prefix "${CONDA_ENV_PATH}" # (4)!
# Activate the environment
mamba activate "${CONDA_ENV_PATH}"
# Choose your version of Python
mamba install --yes python=3.10
# Continue to install any further items as required.
# For example:
mamba install --yes numpy
1. Running these eval and source statements removes the need to execute conda init or mamba init, both of which insert content into the user’s ~/.bashrc file that can in turn cause issues with other software on BlueBEAR. If your ~/.bashrc file contains the Conda initialisation content (denoted by # >>> conda initialize >>>) then we would advise removing it. Please see the conda init content section for more information.
2. Conda environments can be very large, so we recommend creating them within BEAR project directories (i.e. /rds/projects) and not in home directories, where quota is limited.
3. By default, packages are cached in the user’s home directory. BlueBEAR home directories have limited quota, so we advise caching packages in /scratch/${USER}. See Conda Cache for further details.
4. Passing the --yes option sets any confirmation values to ‘yes’ automatically, meaning that the user will not be asked to confirm anything manually.
Conda environment usage¶
Once the environment has been built, it can be used as follows:
#!/bin/bash
set -e
module purge; module load bluebear
module load bear-apps/2022b
module load Miniforge3/24.1.2-0
eval "$(${EBROOTMINIFORGE3}/bin/conda shell.bash hook)"
source "${EBROOTMINIFORGE3}/etc/profile.d/mamba.sh"
# Define the path to your environment (modify as appropriate)
CONDA_ENV_PATH="/rds/projects/_initial_/_projectname_/${USER}_conda_env"
# Activate the environment
mamba activate "${CONDA_ENV_PATH}"
# Run commands within the activate environment
python -c "print('hello world')"
Conda Cache¶
Despite creating the environment in a BEAR project directory (as covered above), Conda will by default cache its data in your home directory, which can rapidly use up the limited 20 GB quota. We therefore recommend one of the following options:
- Periodically clean the cache by executing the following command in a terminal window:
mamba clean --all
- Prior to executing any Conda installation commands, export the following environment variable to point at a path with more storage space:
export CONDA_PKGS_DIRS="/scratch/${USER}"
(See the Use Local Disk Space documentation for further information on temporary storage.)
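To see whether the default cache is already eating into your home-directory quota, you can check its size first. A sketch, assuming the common default cache location of ~/.conda/pkgs (this can vary with your Conda installation):

```shell
# Report the size of the default conda package cache, if present
du -sh "${HOME}/.conda/pkgs" 2>/dev/null || echo "No package cache found at ~/.conda/pkgs"

# Redirect future package downloads to scratch before installing anything
export CONDA_PKGS_DIRS="/scratch/${USER}"
```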
conda init content¶
The documented process for setting up Conda normally recommends that you first run conda init to inject an initialisation block into your environment’s config file (e.g. ~/.bashrc). This can cause issues when running Conda in an HPC environment, so we advise removing this content from your BlueBEAR ~/.bashrc file and instead following the process documented above, which performs the necessary environment preamble to make the conda and mamba commands function.
The presence of this block is usually apparent because your terminal prompt will begin with (base), e.g.
(base) [user@bear-pg-login06 ~]$
However, we also provide a script to identify whether the conda init block is present and then offer to delete the relevant content. To use this script, please run the following command:
remove_conda_init
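If you prefer to check manually first, the block is easy to spot by its marker comment. A minimal sketch:

```shell
# Look for the conda initialisation block in ~/.bashrc
if grep -q '# >>> conda initialize >>>' "${HOME}/.bashrc" 2>/dev/null; then
    echo "conda init block found -- consider removing it"
else
    echo "No conda init block in ~/.bashrc"
fi
```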
remove_conda_init demo¶
Please see the following demonstration of this command in action: