| 1 | +# Reproducible environments and dependencies |
| 2 | + |
| 3 | +:::{objectives} |
| 4 | +- Few codes have no dependencies. |
| 5 | + How should we **deal with dependencies**? |
| 6 | +- We will focus on installing and managing dependencies in Python when using packages from PyPI and Conda. |
| 7 | +- We will not discuss how to distribute your code as a package. |
| 8 | +::: |
| 9 | + |
| 10 | +[This episode borrows from <https://coderefinery.github.io/reproducible-python/reusable/> |
| 11 | +and <https://aaltoscicomp.github.io/python-for-scicomp/dependencies/>] |
| 12 | + |
| 13 | +Essential XKCD comics: |
| 14 | +- [xkcd - dependency](https://xkcd.com/2347/) |
| 15 | +- [xkcd - superfund](https://xkcd.com/1987/) |
| 16 | + |
| 17 | + |
| 18 | +## How to avoid: "It works on my machine 🤷" |
| 19 | + |
| 20 | +Use a **standard way** to list dependencies in your project: |
| 21 | +- Python: `requirements.txt` or `environment.yml` |
| 22 | +- R: `DESCRIPTION` or `renv.lock` |
| 23 | +- Rust: `Cargo.lock` |
| 24 | +- Julia: `Project.toml` |
| 25 | +- C/C++/Fortran: `CMakeLists.txt` or `Makefile` or `spack.yaml` or the module |
| 26 | + system on clusters or containers |
| 27 | +- Other languages: ... |
| 28 | + |
| 29 | + |
| 30 | +## Two ecosystems: PyPI (The Python Package Index) and Conda |
| 31 | + |
| 32 | +:::{admonition} PyPI |
| 33 | +- **Installation tool:** `pip` or `uv` or similar |
| 34 | +- Traditionally used for Python-only packages or |
| 35 | + for Python interfaces to external libraries. There are also packages |
| 36 | + that have bundled external libraries (such as numpy). |
| 37 | +- **Pros:** |
| 38 | + - Easy to use |
| 39 | + - Package creation is easy |
| 40 | +- **Cons:** |
| 41 | + - Installing packages that need external libraries can be complicated |
| 42 | +::: |
| 43 | + |
| 44 | +:::{admonition} Conda |
| 45 | +- **Installation tool:** `conda` or `mamba` or similar |
| 46 | +- Aims to be a more general package distribution tool |
| 47 | + and it tries to provide not only the Python packages, but also libraries |
| 48 | + and tools needed by the Python packages. |
| 49 | +- **Pros:** |
| 50 | + - Quite easy to use |
| 51 | + - Easier to manage packages that need external libraries |
| 52 | + - Not only for Python |
| 53 | +- **Cons:** |
| 54 | + - Package creation is harder |
| 55 | +::: |
| 56 | + |
| 57 | + |
| 58 | +## Conda ecosystem explained |
| 59 | + |
| 60 | +- [Anaconda](https://www.anaconda.com) is a distribution of conda packages |
| 61 | + made by Anaconda Inc. When using Anaconda, remember to check that your |
| 62 | + situation complies with their licensing terms (see below). |
| 63 | + |
| 64 | +- Anaconda has recently changed its **licensing terms**, which affects its |
| 65 | + use in a professional setting. This caused an uproar in academia, |
| 66 | + and Anaconda modified their position in |
| 67 | + [this article](https://www.anaconda.com/blog/update-on-anacondas-terms-of-service-for-academia-and-research). |
| 68 | + |
| 69 | + The main points of the article are: |
| 70 | + - conda (installation tool) and community channels (e.g. conda-forge) |
| 71 | + are free to use. |
| 72 | + - Anaconda repository and **Anaconda's channels in the community repository** |
| 73 | + are free for universities and companies with fewer than 200 employees. |
| 74 | + Non-university research institutions and national laboratories need |
| 75 | + licenses. |
| 76 | + - Miniconda is free as long as it does not download Anaconda's packages. |
| 77 | + - Miniforge is not related to Anaconda, so it is free. |
| 78 | + |
| 79 | + To make sharing environment files easier, we recommend using |
| 80 | + Miniforge to create the environments and using conda-forge as the main |
| 81 | + channel that provides software. |
| 82 | + |
| 83 | +- Major repositories/channels: |
| 84 | + - [Anaconda Repository](https://repo.anaconda.com) |
| 85 | + houses Anaconda's own proprietary software channels. |
| 86 | + - Anaconda's proprietary channels: `main`, `r`, `msys2` and `anaconda`. |
| 87 | + These are sometimes called `defaults`. |
| 88 | + - [conda-forge](https://conda-forge.org) is the largest open source |
| 89 | + community channel. It has over 28k packages that include open-source |
| 90 | + versions of packages in Anaconda's channels. |
| 91 | + |
| 92 | + |
| 93 | +## Tools and distributions for dependency management in Python |
| 94 | + |
| 95 | +- [Poetry](https://python-poetry.org): Dependency management and packaging. |
| 96 | +- [Pipenv](https://pipenv.pypa.io): Dependency management, alternative to Poetry. |
| 97 | +- [pyenv](https://github.com/pyenv/pyenv): If you need different Python versions for different projects. |
| 98 | +- [venv](https://docs.python.org/3/library/venv.html): Standard-library tool to create isolated Python environments for PyPI packages. |
| 99 | +- [micropipenv](https://github.com/thoth-station/micropipenv): Lightweight tool to "rule them all". |
| 100 | +- [Conda](https://docs.conda.io): Package manager for Python and other languages maintained by Anaconda Inc. |
| 101 | +- [Miniconda](https://docs.anaconda.com/miniconda/): A "miniature" version of conda, maintained by Anaconda Inc. By default uses |
| 102 | + Anaconda's channels. Check licensing terms when using these packages. |
| 103 | +- [Mamba](https://mamba.readthedocs.io): A drop-in replacement for conda. |
| 104 | + It used to be much faster than conda due to a better |
| 105 | + dependency solver, but nowadays conda |
| 106 | + [also uses the same solver](https://conda.org/blog/2023-11-06-conda-23-10-0-release/). |
| 107 | + It still has some UI improvements. |
| 108 | +- [Micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html): Tiny version of the Mamba package manager. |
| 109 | +- [Miniforge](https://github.com/conda-forge/miniforge): Open-source Miniconda alternative with |
| 110 | + conda-forge as the default channel and optionally mamba as the default installer. |
| 111 | +- [Pixi](https://pixi.sh): Modern, super fast tool which can manage conda environments. |
| 112 | +- [uv](https://docs.astral.sh/uv/): Modern, super fast replacement for pip, |
| 113 | + poetry, pyenv, and virtualenv. You can also switch between Python versions. |
| 114 | + |
| 115 | + |
| 116 | +## Best practice: Install dependencies into isolated environments |
| 117 | + |
| 118 | +- For each project, create a **separate environment**. |
| 119 | +- Don't install dependencies globally for all projects. Sooner or later, different projects will have conflicting dependencies. |
| 120 | +- Install them **from a file**, which documents them at the same time. |
| 121 | + First record the dependencies in `requirements.txt` or |
| 122 | + `environment.yml`, then install from these files so that you have a trace |
| 123 | + (we will practice this below). |
| 124 | + |
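To check whether a script is actually running inside an isolated environment, a small helper along these lines can be useful. This is only a sketch: the function name `describe_environment` is hypothetical, and it assumes the usual conventions that `sys.prefix` differs from `sys.base_prefix` inside a `venv`-style environment and that `conda activate` sets the `CONDA_DEFAULT_ENV` variable.

```python
import os
import sys


def describe_environment() -> dict:
    """Report which kind of environment the current interpreter runs in."""
    return {
        # In a venv/virtualenv, sys.prefix points to the environment,
        # while sys.base_prefix points to the Python it was created from.
        "in_venv": sys.prefix != sys.base_prefix,
        # Set by `conda activate` (also for the "base" environment).
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Where packages for this interpreter get installed.
        "prefix": sys.prefix,
    }


print(describe_environment())
```

If `in_venv` is false and `conda_env` is empty or `base`, packages are probably being installed into a shared location rather than a per-project environment.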
| 125 | +:::{keypoints} |
| 126 | +If somebody asks you what dependencies you have in your project, you should be |
| 127 | +able to answer this question **with a file**. |
| 128 | + |
| 129 | +In Python, the two most common ways to do this are: |
| 130 | +- **requirements.txt** (for pip and virtual environments) |
| 131 | +- **environment.yml** (for conda and similar) |
| 132 | + |
| 133 | +You can export ("freeze") the dependencies from your current environment into these files: |
| 134 | +```bash |
| 135 | +# inside a conda environment |
| 136 | +$ conda env export --from-history > environment.yml |
| 137 | + |
| 138 | +# inside a virtual environment |
| 139 | +$ pip freeze > requirements.txt |
| 140 | +``` |
| 141 | +::: |
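To get a feeling for what "freezing" means, the following sketch lists the installed distributions of the current interpreter as `name==version` lines using only the standard library. It is an illustration, not a replacement for `pip freeze`: the helper name `freeze` is our own, and special cases such as editable installs or packages installed from URLs are not handled.

```python
from importlib.metadata import distributions


def freeze() -> list[str]:
    """Return sorted `name==version` lines for all installed distributions."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with missing or broken metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


for line in freeze():
    print(line)
```

Run inside an activated environment, the output should resemble what you would record in `requirements.txt`.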
| 142 | + |
| 143 | + |
| 144 | +## How to communicate the dependencies as part of a report/thesis/publication |
| 145 | + |
| 146 | +Each notebook, script, or project that depends on libraries should come with |
| 147 | +either a `requirements.txt` or an `environment.yml`, unless you are creating |
| 148 | +and distributing this project as a Python package. |
| 149 | + |
| 150 | +- Attach a `requirements.txt` or an `environment.yml` to your thesis. |
| 151 | +- Even better: Put a `requirements.txt` or an `environment.yml` in your Git repository alongside your code. |
| 152 | +- Even better: Also [binderize](https://mybinder.org/) your analysis pipeline. |
| 153 | + |
| 154 | + |
| 155 | +## Containers |
| 156 | + |
| 157 | +- A container is like an **operating system inside a file**. |
| 158 | +- "Building a container": Container definition file (recipe) -> Container image |
| 159 | +- This can be used with [Apptainer](https://apptainer.org/)/ |
| 160 | + [SingularityCE](https://sylabs.io/singularity/). |
| 161 | + |
| 162 | +Containers offer the following advantages: |
| 163 | +- **Reproducibility**: The same software environment can be recreated on |
| 164 | + different computers. They force you to know and **document all your dependencies**. |
| 165 | +- **Portability**: The same software environment can be run on different computers. |
| 166 | +- **Isolation**: The software environment is isolated from the host system. |
| 167 | +- "**Time travel**": |
| 168 | + - You can run old/unmaintained software on new systems. |
| 169 | + - Code that needs new dependencies which are not available on old systems can |
| 170 | + still be run on old systems. |
| 171 | + |
| 172 | + |
| 173 | +## How to install dependencies into environments |
| 174 | + |
| 175 | +Now we understand a bit better why and how we installed dependencies |
| 176 | +for this course in the {doc}`installation`. |
| 177 | + |
| 178 | +We used **Miniforge**, and the long command we ran was: |
| 179 | +```console |
| 180 | +$ mamba env create -n course -f https://raw.githubusercontent.com/coderefinery/python-progression/main/software/environment.yml |
| 181 | +``` |
| 182 | + |
| 183 | +This command did two things: |
| 184 | +- Created a new environment with name "course" (specified by `-n`). |
| 185 | +- Installed all dependencies listed in the `environment.yml` file (specified by |
| 186 | + `-f`), which we fetched directly from the web. |
| 187 | + [Here](https://github.com/coderefinery/python-progression/blob/main/software/environment.yml) |
| 188 | + you can browse it. |
| 189 | + |
| 190 | +For your own projects: |
| 191 | +1. Start by writing an `environment.yml` or `requirements.txt` file. They look like this: |
| 192 | +:::::{tabs} |
| 193 | + ::::{tab} environment.yml |
| 194 | + :::{literalinclude} ../software/environment.yml |
| 195 | + :language: yaml |
| 196 | + ::: |
| 197 | + :::: |
| 198 | + |
| 199 | + ::::{tab} requirements.txt |
| 200 | + :::{literalinclude} ../software/requirements.txt |
| 201 | + ::: |
| 202 | + :::: |
| 203 | +::::: |
| 204 | + |
| 205 | +2. Then set up an isolated environment and install the dependencies from the file into it: |
| 206 | +:::::{tabs} |
| 207 | + ::::{group-tab} Miniforge |
| 208 | + - Create a new environment with name "myenv" from `environment.yml`: |
| 209 | + ```console |
| 210 | + $ conda env create -n myenv -f environment.yml |
| 211 | + ``` |
| 212 | + Or equivalently: |
| 213 | + ```console |
| 214 | + $ mamba env create -n myenv -f environment.yml |
| 215 | + ``` |
| 216 | + - Activate the environment: |
| 217 | + ```console |
| 218 | + $ conda activate myenv |
| 219 | + ``` |
| 220 | + - Run your code inside the activated environment: |
| 221 | + ```console |
| 222 | + $ python example.py |
| 223 | + ``` |
| 224 | + :::: |
| 225 | + |
| 226 | + ::::{group-tab} Pixi |
| 227 | + - Create `pixi.toml` from `environment.yml`: |
| 228 | + ```console |
| 229 | + $ pixi init --import environment.yml |
| 230 | + ``` |
| 231 | + - Run your code inside the environment: |
| 232 | + ```console |
| 233 | + $ pixi run python example.py |
| 234 | + ``` |
| 235 | + :::: |
| 236 | + |
| 237 | + ::::{group-tab} Virtual environment |
| 238 | + - Create a virtual environment by running (the last argument is the name |
| 239 | + of the environment and you can change it): |
| 240 | + ```console |
| 241 | + $ python -m venv venv |
| 242 | + ``` |
| 243 | + - Activate the virtual environment (how precisely depends on your operating |
| 244 | + system and shell). |
| 245 | + - Install the dependencies: |
| 246 | + ```console |
| 247 | + $ python -m pip install -r requirements.txt |
| 248 | + ``` |
| 249 | + - Run your code inside the activated virtual environment. |
| 250 | + ```console |
| 251 | + $ python example.py |
| 252 | + ``` |
| 253 | + :::: |
| 254 | + |
| 255 | + ::::{group-tab} uv |
| 256 | + - Create a virtual environment by running (the last argument is the name |
| 257 | + of the environment and you can change it): |
| 258 | + ```console |
| 259 | + $ uv venv venv |
| 260 | + ``` |
| 261 | + - Activate the virtual environment (how precisely depends on your operating |
| 262 | + system and shell). |
| 263 | + - Install the dependencies: |
| 264 | + ```console |
| 265 | + $ uv pip sync requirements.txt |
| 266 | + ``` |
| 267 | + - Run your code inside the virtual environment. |
| 268 | + ```console |
| 269 | + $ uv run python example.py |
| 270 | + ``` |
| 271 | + :::: |
| 272 | +::::: |
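The "Virtual environment" steps above can also be scripted. The sketch below uses the standard-library `venv` module, which is what `python -m venv` calls under the hood. The helper name `create_environment` and the temporary location are assumptions made for illustration; in a real project you would pick a fixed path and then install from `requirements.txt` as shown in the tabs.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(path: Path) -> Path:
    """Create a virtual environment at `path` and return its config file."""
    # with_pip=False keeps the sketch fast; use with_pip=True
    # if you want `pip` available inside the new environment.
    venv.create(path, with_pip=False)
    return path / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    cfg = create_environment(Path(tmp) / "venv")
    # pyvenv.cfg records which Python the environment was created from.
    print(cfg.read_text())
```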
| 273 | + |
| 274 | + |
| 275 | +## Updating environments |
| 276 | + |
| 277 | +What if you forgot a dependency? Or during the development of your project |
| 278 | +you realize that you need a new dependency? Or you don't need some dependency anymore? |
| 279 | + |
| 280 | +1. Modify the `environment.yml` or `requirements.txt` file. |
| 281 | +2. Either remove your environment and create a new one, or update the existing one: |
| 282 | + |
| 283 | +:::::{tabs} |
| 284 | + ::::{group-tab} Miniforge |
| 285 | + - Update the environment by running: |
| 286 | + ```console |
| 287 | + $ conda env update --file environment.yml |
| 288 | + ``` |
| 289 | + - Or equivalently: |
| 290 | + ```console |
| 291 | + $ mamba env update --file environment.yml |
| 292 | + ``` |
| 293 | + :::: |
| 294 | + |
| 295 | + ::::{group-tab} Pixi |
| 296 | + - Remove `pixi.toml`. |
| 297 | + - Then re-create it from the updated `environment.yml` by running: |
| 298 | + ```console |
| 299 | + $ pixi init --import environment.yml |
| 300 | + ``` |
| 301 | + :::: |
| 302 | + |
| 303 | + ::::{group-tab} Virtual environment |
| 304 | + - Activate the virtual environment. |
| 305 | + - Update the environment by running: |
| 306 | + ```console |
| 307 | + $ pip install -r requirements.txt |
| 308 | + ``` |
| 309 | + :::: |
| 310 | + |
| 311 | + ::::{group-tab} uv |
| 312 | + - Activate the virtual environment. |
| 313 | + - Update the environment by running: |
| 314 | + ```console |
| 315 | + $ uv pip sync requirements.txt |
| 316 | + ``` |
| 317 | + :::: |
| 318 | +::::: |
| 319 | + |
| 320 | + |
| 321 | +## Pinning package versions |
| 322 | + |
| 323 | +Let us look at the |
| 324 | +[environment.yml](https://github.com/coderefinery/python-progression/blob/main/software/environment.yml) |
| 325 | +which we used to set up the environment for this progression course. |
| 326 | +Dependencies are listed without version numbers. Should we **pin the |
| 327 | +versions**? |
| 328 | + |
| 329 | +- Both `pip` and `conda` ecosystems and all the tools that we have |
| 330 | + mentioned support pinning versions. |
| 331 | + |
| 332 | +- It is possible to define a range of versions instead of precise versions. |
| 333 | + |
| 334 | +- While a project is still in progress, we often use the latest versions and do not pin them. |
| 335 | + |
| 336 | +- When publishing the script or notebook, it is a good idea to pin the versions |
| 337 | + to ensure that the code can be run in the future. |
| 338 | + |
| 339 | +- Remember that at some point in time you will face a situation where |
| 340 | + newer versions of the dependencies are no longer compatible with your |
| 341 | + software. At this point you'll have to update your software to use the newer |
| 342 | + versions or lock its dependencies to a fixed point in time. |
| 343 | + |
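The difference between unpinned, range, and exactly pinned dependencies can be illustrated with a small classifier for `requirements.txt`-style lines. This is a simplified sketch: the function `pin_style` is hypothetical, the package names and version numbers are only examples, and environment markers or URL requirements are not handled.

```python
import re

# Version comparison operators used in requirement specifiers.
OPERATORS = re.compile(r"(===|==|~=|!=|<=|>=|<|>)")


def pin_style(requirement: str) -> str:
    """Classify a requirement line as 'unpinned', 'exact', or 'range'."""
    # Drop trailing comments and surrounding whitespace.
    spec = requirement.split("#", 1)[0].strip()
    operators = OPERATORS.findall(spec)
    if not operators:
        return "unpinned"
    # A single `==` without a wildcard fixes one precise version.
    if operators == ["=="] and "*" not in spec:
        return "exact"
    return "range"


for line in ["numpy", "numpy==1.26.4", "pandas>=2.0,<3"]:
    print(f"{line:20} -> {pin_style(line)}")
```

A check like this could flag unpinned lines before you publish a script or notebook.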
| 344 | + |
| 345 | +## Managing dependencies on a supercomputer |
| 346 | + |
| 347 | +- Additional challenges: |
| 348 | + - Storage quotas: **Do not install dependencies in your home directory**. A conda environment can easily contain 100k files. |
| 349 | + - Network file systems struggle with many small files. Conda environments often contain many small files. |
| 350 | +- Possible solutions: |
| 351 | + - Try [Pixi](https://pixi.sh/) (modern take on managing Conda environments) and |
| 352 | + [uv](https://docs.astral.sh/uv/) (modern take on managing virtual |
| 353 | + environments). Blog post: [Using Pixi and uv on a supercomputer](https://research-software.uit.no/blog/2025-pixi-and-uv/) |
| 354 | + - Install your environment on the fly into a scratch directory on local disk (**not** the network file system). |
| 355 | + - Install your environment on the fly into a RAM disk/drive. |
| 356 | + - Containerize your environment into a container image. |
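
To see how many files an environment contributes towards a quota, you can count them. The sketch below is a rough estimate under simple assumptions: the helper name `count_files` is our own, it counts every non-directory entry below a given directory, and it does not descend into symlinked directories.

```python
import os
from pathlib import Path


def count_files(prefix: Path) -> int:
    """Count non-directory entries below `prefix`."""
    # os.walk does not follow symlinked directories by default and
    # silently skips directories it cannot read.
    return sum(len(files) for _, _, files in os.walk(prefix))
```

Calling `count_files(Path(sys.prefix))` (after `import sys`) inside an activated environment gives the number of files in that environment.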
| 357 | + |
| 358 | +--- |
| 359 | + |
| 360 | +:::{keypoints} |
| 361 | +- Being able to communicate your dependencies is not only nice for others, but |
| 362 | + also for your future self or the next PhD student or post-doc. |
| 363 | +- If you ask somebody to help you with your code, they will ask you for the |
| 364 | + dependencies. |
| 365 | +::: |