Improve install
TODO-mpd: Move pip and conda dependencies to notebooks
The smaller you can make your box, the more shareable your code will be. You’ve often wanted to
share notebooks on e.g. Wikipedia or SE and had to link to your own website instead; you once used
Google Colab to share something on Wikipedia. You may want to share notebooks with friends/family in
the future, and have your kids take notes and do homework in Colab. In general, if you pip install
within your notebook then you can get at least some of these dependencies out of your custom docker
image into a latched artifact. As long as you log what you are doing in the pip install process as
part of the notebook, this is no worse than a docker save with an attached log of your docker build.
Notebooks built in this way, if they have only Python dependencies, will be directly uploadable and
runnable on Google Colab, Microsoft Azure Notebooks, or AWS notebooks. See:
You don’t have to move all your dependencies inline into your notebook; only pip and conda
dependencies. If you are opening the document in Jupyter to edit and improve it, then it will latch
your installed packages into the running kernel while you are improving the notebook. You should
move all of these kinds of dependencies into notebooks so they are shareable. It would actually be
quite easy to perform this move: you can tell which pip/conda packages you need to install in which
notebooks from the import statements each notebook uses.
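The import-based lookup described above could be sketched roughly as follows. This is only an illustration, not an existing tool: the function name `imported_modules` is made up here, and it assumes the notebook is in the standard `.ipynb` JSON layout (a top-level `cells` list whose code cells carry a `source` string or list of strings).

```python
import ast
import json


def imported_modules(notebook_json):
    """Return the top-level module names imported by a notebook's code cells."""
    notebook = json.loads(notebook_json)
    modules = set()
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        # Skip IPython magics (%...) and shell escapes (!...): they are not Python.
        lines = [
            line
            for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        try:
            tree = ast.parse("\n".join(lines))
        except SyntaxError:
            # A cell that still does not parse is ignored in this sketch.
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                # level == 0 keeps absolute imports and skips relative ones.
                modules.add(node.module.split(".")[0])
    return modules
```

You would call it with the text of a notebook file, e.g. `imported_modules(open("some_notebook.ipynb").read())` (hypothetical file name). The result still needs a manual pass: it includes standard-library modules, and an import name is not always the name you install (`sklearn` is installed as `scikit-learn`, for instance).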
This applies in reverse as well. You were able to get the annotated transformer notebook working quickly because it had a similar list in a requirements.txt file. You really don’t have to list every package twice in the notebook if you want to put everything in a requirements.txt file, or even better, a Pipfile.
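If you go the requirements.txt route, the notebook only needs one install cell. Below is a minimal sketch of such a cell, assuming a `requirements.txt` sits next to the notebook; the helper names and the `pip-install.log` file name are invented for illustration. Running pip through `sys.executable` is a common way to make sure the packages land in the environment of the running kernel, and writing the output to a file keeps the install log with the notebook.

```python
import subprocess
import sys
from pathlib import Path


def pip_install_command(requirements="requirements.txt"):
    """Build a pip command that targets the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", "-r", str(requirements)]


def install_and_log(requirements="requirements.txt", log_path="pip-install.log"):
    """Run the install and save pip's output to a log file; return pip's exit code."""
    result = subprocess.run(
        pip_install_command(requirements),
        capture_output=True,
        text=True,
    )
    Path(log_path).write_text(result.stdout + result.stderr)
    return result.returncode
```

Calling `install_and_log()` in the first cell would then do the install; a non-zero return value means pip reported a failure and the log is worth reading. In recent IPython versions the `%pip install -r requirements.txt` magic is a shorter alternative, at the cost of not getting a separate log file.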
After this is done, you could try to install the R packages you have on top of the GPU jupyter docker image. If that works, you suddenly have a much more valuable (if monstrous) image you can work from. It’s likely you installed the R packages second as a hack anyway (just to get it working). You should almost always install system (i.e. C, C++) packages before you install interpreted language packages, because the latter are almost always going to depend on the former (not vice versa, unless perhaps a C/C++ package uses an interpreted language in its installation scripts).
This also simplifies the reproducibility problem in general. You may actually need a different version of a package to achieve reproducibility in one notebook than another.
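One way to make per-notebook versions explicit is to record the pins in the notebook and check them against what is installed before running anything else. The sketch below assumes exact version pins; the function name `check_pins` and the example pin are hypothetical. It uses the standard-library `importlib.metadata` module (Python 3.8+).

```python
from importlib import metadata


def check_pins(pins):
    """Return {package: (wanted, installed)} for every pin that is not satisfied.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, wanted in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches
```

A notebook could start with something like `check_pins({"numpy": "1.26.4"})` (an example pin, not a recommendation) and only run the pip install cell when the result is non-empty.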
Generally speaking, this approach is much closer to the one you take in your notes, where you import a reference (a link) only in the paragraph where you use it. You can import at either the paragraph (cell) level in Jupyter notebooks, or at the top. In many cases, it may be better to do so at the paragraph level first and move the import to the top only after you use the dependency more than once (to avoid a duplicate import in the file).