Memoize artifact#

Generalizes:

Test#

An “artifact” is stored in a persistent storage medium to avoid recomputation later.

We use the term “artifact” here partially because it’s so ill-defined and therefore easy to redefine (see the terrible article Artifact (software development)). We also use the term because it’s so common, such as in the “Artifactory” tool. In GitLab, you can think of all the items under “Packages and Registries” (such as the container registry) as places to store artifacts.

Some examples of artifacts are Docker images, libraries, packages, .pth files, .html files (manually or non-manually constructed), C++ files, .py files, and test results (e.g. a pass/fail boolean). In the language of Bazel, the equivalent concept is a Target (see Core concepts - Actual and declared dependencies), which is slightly different from how they define an Artifact. In the definition used here, you could even take English text (i.e. notes) as artifacts that e.g. may depend on concepts from other English text.

We prefer the word Memoization to Caching because the former is a special case of the latter, and we specifically intend to refer to the former, more special case. That is, we won’t consider data locality or spatial locality here, only whether to build a data artifact at all.

Stability#

We use the term Stability although it is somewhat ambiguous. Most people call an interface “stable” if it doesn’t change frequently e.g. no one switches the order of arguments. Most people would call Ubuntu 20.04 more “stable” than Ubuntu 20.10 in 2021, even if they haven’t used 20.10 and don’t know whether it works perfectly fine for everything they want to do. Many people would call an application more “stable” if it pins more dependencies (artifacts) down.

Sometimes, we use the word “stability” to mean flexible, accepting, or reusable. Software often has to be more complicated to be more flexible and accepting, and it has to be well thought out to continue to be reusable into the far future (e.g. it depends on well-established mathematical concepts). See also Robustness principle, discussed in more detail in Be liberal in what you accept… or not? - SE. In the test-based language of Update dependencies, we want to use a library that doesn’t change its API, so that when we rebase our code onto a new version most tests already pass across a greater range of its versions.

Other times we use “stability” to mean unchanging; we pin packages so that they don’t change under us. Pinning more dependencies down may not be the best way to make our application more stable, if we’re pinning to unreliable code. If we really must take a dependency in this case, it may be better to take full ownership of the source. We may also actually want to write our code to work with several different versions of a dependency so that more libraries can depend on our library without making the job of dependency resolvers nearly impossible (see Update dependencies).

We want to depend on the former (reusable) kind of artifact; these are the timeless kinds of artifacts. For the sake of reproducibility or company specialization we may need to rely on the latter (pinned) kind of artifact.

As discussed in Push and pull: when and why to update your dependencies, there’s value in the increased stability brought about by pinning packages. The kind of “stability” we’re achieving is a stable learning environment; we can change “our” code downstream of the pinned dependencies and know that any exceptions are (unless previously cached results weren’t actually reproducible) “our” problem. If a software package we thought was “reusable” and therefore depend on suddenly releases a bad version, we’ll let others who are actively upgrading handle the issue and catch the next truly stable version.

In the language of Attention, we’re limiting the amount of state we must attend to if we must Investigate root cause. If we want to continue to partially attend to lower-reliability (less likely to be reproducible) parts of our build networks, then we can get notifications e.g. overnight, using spare computing resources. See Regularly stress test.

Recursive artifacts#

The idea of artifacts is closely tied to the concept of Bootstrapping. Examples of recursive artifacts are Self-hosting compilers and operating systems (it’s likely the Ubuntu 20.04 developers were using an earlier version of Ubuntu). You can use a docker-in-docker image to build a docker image with docker-py installed, then use that image to build further docker images.
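As a minimal sketch of that last step (the ./child-image build context and tag here are hypothetical), the inner build with docker-py might look like:

import docker

# Connect to the Docker daemon available inside the docker-in-docker container.
client = docker.from_env()

# Build the next image in the chain from a hypothetical build context.
image, build_logs = client.images.build(path="./child-image", tag="child-image:latest")
for line in build_logs:
    print(line)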

Some of these recursive artifacts are recursive environments. At each level of a bootstrap, you create a new environment, then enter it and use it to get something done (such as building an environment). Said another way, recursive environments are a recursive function-building process, where every function is a function to build an environment, then enter it and call a new function in a multi-level stack. Because the function is more than software (e.g. hardware for some compilers), or because you don’t keep track of every part of the function (e.g. other tools used to compile as part of an old OS), or simply because parts of the process are manual, you can easily lose reproducibility in this situation. Like libraries however, recursive environments are often backwards compatible.

We sometimes call “bootstrapping” the process of building up an improved dataset by training a model on some small dataset, running inference on a larger dataset, having annotators clean the larger dataset up, and training the model again on the larger dataset. The recursive artifact in this case is both the model and the datasets.

Analogies#

To analogize to the human experience, an artifact is a result we write down. The same impulse that drives us to save a file we generated by running a script is the one that drives us to record an answer so we don’t have to look it up later, or define an abstraction so we don’t have to review all the details later. These actions reduce our working memory requirements while we optimize.

Imagine if you were sent back in time to an ancient civilization. Even if you could speak their language, would all your future knowledge help them? You were living in a totally different environment, and most of what you know is for your future world. You aren’t going to be able to teach them everything you know about computers, because there won’t be one to demo anything on. You were also never taught the process of how to rebuild society; it was likely a recursive process that it would be hard to define the base case for.

Value#

Saving an artifact creates a more efficient environment for optimization, assuming the artifact represents reality (is reproducible). Consider the common case in machine learning where a model (e.g. a .pth file) is saved to disk and run through a variety of tests. It would be silly to retrain the model before every change to the evaluation code. In particular, an artifact allows for much faster feedback (in this case, on the evaluation code).
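A minimal sketch of this pattern, assuming PyTorch and treating train() as a stand-in for an expensive training run:

import os
import torch

MODEL_PATH = "model.pth"  # the memoized artifact

def train():
    # Stand-in for an expensive training run; returns an (untrained) linear layer.
    return torch.nn.Linear(4, 2)

def get_model():
    # Reuse the saved weights if they exist; otherwise pay the training cost once.
    model = torch.nn.Linear(4, 2)
    if os.path.exists(MODEL_PATH):
        model.load_state_dict(torch.load(MODEL_PATH))
        return model
    model = train()
    torch.save(model.state_dict(), MODEL_PATH)
    return model

# Iterate freely on the evaluation code; nothing here retrains the model.
model = get_model()
print(model(torch.zeros(1, 4)))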

Consider how this last paragraph relates to Bazel’s joke:

{Fast, Correct} - Choose two

Reproducibility#

These concepts are closely tied to those of Reproducibility. Bazel in fact defines “Correct” to mean reproducible, given you are allowed to pin yourself to the past world where the cache was created. That is, if running a clean build does not produce the same result as an incremental build (a build based on a cache) then the code is not correct. See the Bazel vision - Bazel main. We’ll use Correct (with a capital C) to capture this concept of perfect (or pinned) reproducibility. For more on perfect reproducibility, see Record Dependencies.

Reproducibility is not always important; sometimes it is sufficient to save history. Many (valuable) published papers don’t include every step necessary to reproduce figures. Once books are published and the author is gone, it’s likely difficult for anyone to reproduce the content. In the case of recursive artifacts mentioned above, it’s often not worth the effort to save every historical version of e.g. a compiler. Museums don’t record how many of the artifacts they store were produced. If you don’t care that much about being able to execute your code anymore, it’s likely no one else cares either. It’d be nice to be able to run it, sure, but the value in being able to run the code can be much lower than maintaining the dependencies that are required to be able to run it.

Defining the boundaries of how much is enough for something to be considered reproducible is an agreement that groups of developers often need to make when they work together. If you didn’t change any of the code in a particular area of a shared codebase, can you merge your code without waiting for a long-running test that covers the untouched code?

Estimation#

How do you put a numerical estimate on the value of reproducibility? The first step is to estimate how long your custom software product will run, and then how often it will need to be upgraded (when reproducibility is most critical).

For companies running web services in a production environment where they’ve promised 99.999% reliability to their customers, the value in reproducibility is high. These companies also need to be able to upgrade quickly, in particular in response to security issues. Google built Bazel based on these values.

Consider instead a research environment. The data product (e.g. a publication) may only need to be reproduced a few times if the research produces negative results (in this case, a publication indicating negative results).

Fast#

Developers need fast feedback, both on local systems and from remote systems.

A major advantage of specifying how to reproduce an artifact is that you effectively get information on how to reproduce or change a chain of artifacts. Build systems like Make and Bazel can use this information to only rebuild what is necessary (or likely to be necessary, if your rules aren’t perfect). Not only is it hard for humans to remember these chains of dependencies; without such tools they must also manually run through every step in them. It’s important to be Correct in long chains, because if you’re only 90% sure each stage is Correct then you’re only 81% sure you’re Correct after two stages.
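To make the compounding explicit (a toy calculation that assumes the stages fail independently):

per_stage_confidence = 0.9

# Confidence that the artifact at the end of the chain is Correct.
for n_stages in (1, 2, 5, 10):
    print(n_stages, round(per_stage_confidence ** n_stages, 2))
# 1 0.9
# 2 0.81
# 5 0.59
# 10 0.35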

Dependency resolution tools like apt, conda or pip can use this information to help you rework what pre-built artifacts you depend on, essentially “building” a new top-level artifact.

An important side effect is that you can use significantly fewer computing resources, even when your dependencies are imperfect. For example, even if the test you care about will run in 10 minutes, if you haven’t specified dependencies you may be wasting an hour of computing resources for a totally unrelated test that runs when you push to CI. Not only is this wasteful, you may not get automated feedback if you Push for feedback because e.g. GitLab won’t push a pass/fail notification to Slack until the hour-long test has passed.

Long waits for automatic feedback are detrimental to forward progress even when all your tests are passing. If you don’t know whether a long-running test is going to pass, or you do know but haven’t recorded your dependencies properly in Bazel or some other system, you’ll be waiting for the build to finish before starting your next branch on top of a presumably auto-merged merge request. Either that, or you end up building on the same unmerged work indefinitely under new branch names. When multiple developers are submitting code at the same time, this problem gets significantly worse.

A system for easy management of dependencies often encourages developers to create more checkpoints (more artifacts) for faster feedback, besides making your existing system faster.

Estimation#

How do you put a numerical estimate on the value of fast feedback? First, how much feedback is there for the computer to give? If developers have written many tests, it will take a long time to provide feedback if the build system has no record of what tests need to be rerun. Second, how many times will you need to run the test before you are able to discover the root cause of any mistakes you make along the way?

Consider a production and research environment again. In a production environment, fast feedback will be critical to fast releases of software to customers, and the code will be well-tested. In a research environment, customers may only get updates infrequently, and there may be no tests yet.

Cost#

Learn Reusable Tools#

If you manually save a file without explicitly specifying a date, you are still specifying a date indirectly: the file timestamp. The date you started the process that created the file is always sometime before the file’s timestamp. File systems have included this information for longer than many of us have been alive because it’s so important to reproducibility. The excellent tool Make relies on these timestamps to help us get closer to our dream of Correct.
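A toy sketch of the timestamp rule Make applies (the file names are hypothetical; in a Makefile this would be the rule “model.pth: train.py data.csv”):

import os

def needs_rebuild(target, sources):
    # Make's rule, roughly: rebuild if the target is missing
    # or is older than any of its sources.
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)

if needs_rebuild("model.pth", ["train.py", "data.csv"]):
    print("rebuild model.pth")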

Unfortunately, Make has shortcomings. You’ll struggle to get it to work efficiently in a CI/CD environment because it relies on timestamps, and CI builds are often run on new machines. Said another way, running a CI build is like setting up a new machine for a new developer, and with Make a new developer always has to start with a clean build.

Google tried to make Make work for as long as possible before it built Bazel; see FAQ - Bazel. Like many memoization schemes (e.g. functools.lru_cache), Bazel decides what to rebuild by hashing the inputs to the functions that create artifacts, rather than relying on timestamps. To learn Bazel is effectively to learn how to record more of what you relied on to build an artifact.
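A toy sketch of input hashing (the build step and file names are hypothetical; real tools hash file contents plus the full command, environment, and toolchain):

import functools
import hashlib

def hash_inputs(*paths):
    # Content hash of the declared inputs; a stand-in for Bazel's action keys.
    h = hashlib.sha256()
    for path in paths:
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()

@functools.lru_cache(maxsize=None)
def build(input_hash):
    # Hypothetical expensive build step; reruns only when the input hash changes.
    print("building...")
    return f"artifact-{input_hash[:8]}"

# artifact = build(hash_inputs("main.py", "data.csv"))  # hypothetical files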

Some tools automatically record or reconstruct dependency trees. In PyTorch, Theano, TensorFlow, etc. you can see the net activations as artifacts, and the backpropagation graph as a record of how all the artifacts connect. The import and #include statements in Python and C++, respectively, are essentially a record of dependencies between files. Tools typically parse these dependencies from C++ code, but rarely from Python. Some languages are designed to make parsing these dependencies faster; see for example performance - How does Go compile so quickly? - Stack Overflow.
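As an illustration that the information is there to parse, a small sketch that extracts the import-level dependencies of a Python file (the file name is hypothetical):

import ast

def imported_modules(path):
    # Parse the file and collect the modules its import statements depend on.
    with open(path) as f:
        tree = ast.parse(f.read())
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return sorted(modules)

# print(imported_modules("my_script.py"))  # hypothetical file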

Record Dependencies#

Perfect reproducibility is, practically speaking, impossible, which is why we have concepts like probability. It’s easy for theoretically-minded people to forget that reproducible results require hardware, which exists in the real world. Even theoretically you need to specify your hardware; the code you wrote on a machine with 32 GB of RAM isn’t going to run on GitLab’s free runners with 2 GB of RAM. Beyond that, hardware lives in the real world and is affected by cosmic rays and the power turning off. To get more reproducible results often means to collect more information about an experiment; what are you forgetting to record? We rely on probability when certain inputs are too expensive to record.

Our belief that we can ultimately get a perfectly reproducible result is related to the idea that there is no such thing as probability, and to Determinism. See also Approximate computing. In some sense Reproducibility is a promise (or a belief about the future) rather than something we can say we have in our hand; see the Problem of induction.

In fact, pinning artifacts can be a problem. When the artifact is a dataset, we refer to the problem as Concept drift. When the artifact is a package or library (e.g. specified in a Dockerfile), the package or library we get is by default the latest, because using the library’s name right now means the latest version. That is, specifying a pandas dependency without pinning it to a version is like using many other English words; concepts can drift. Should you define correct to mean the same as it was before, even if it was wrong before? Let’s define inCorrect to mean different than it was before.

So what do we need to get closer to pinned reproducibility? We need to pin package versions, specify random seeds, copy and paste (fork) code, store data we retrieve from sensors, etc. All of this takes engineering time. In many cases, we won’t be able to achieve “Correct” without tremendous amounts of work and will instead rely on manual processes.

If you know Bazel and Make, this will be easier for you. Even then you may not be able to hit your goal, though, and you should think about the long-term cost of your shortcuts. If you pin without recording all your dependencies, the answers you build on top of your artifact will be more inCorrect the more those unrecorded dependencies change in a significant way under you.

Update Dependencies#

A perfectly reproducible build, by definition, requires that cached artifacts remain “Correct” for an indefinite amount of time. By disallowing concept drift, we effectively freeze ourselves at some point in the past. How do we catch up?

Estimation#

As part of any decision about how much to memoize and how much to let float (unpinned), you should estimate the cost of upgrades. For example, let’s say in your particular domain you estimate you will want to upgrade to the latest versions of packages only rarely, but it’s critical that you be able to reproduce results in e.g. a production environment. You will probably want to update your dependencies less frequently (e.g. once a month), but invest more time in the upgrade to make sure you will continue to be able to develop in the environment you are creating (e.g. bug fixing, in a production environment).

import pint
ureg = pint.UnitRegistry()

# All of these are estimates, despite the lack of uncertainties
lifetime = 2 * ureg.year
upgrade_freq = 1 * ureg.month
upgrade_cost = 1 * ureg.day

total_cost = upgrade_cost * (lifetime / upgrade_freq).to_reduced_units()
print(total_cost)
24.0 day

Companies like Canonical and Microsoft build versioned software that effectively depends on a huge number of independently evolving software packages. In their business model, they know how long they will need to support a product and may actually do the calculation above (with extremely uncertain numbers).

Companies like Google, Facebook, and Amazon provide web services without explicit versioning. Assuming we want the cost per month:

# Reusing ureg from the estimate above
upgrade_freq = 1 * ureg.month
upgrade_cost = 1 * ureg.day

total_cost = upgrade_cost / upgrade_freq
print(total_cost)
1.0 day / month

How long an individual upgrade will take depends on the quality of your tools, how much of the process you’ve automated, and how often you upgrade, among other things. To estimate upgrade_cost, see Update dependencies.