Investigate root cause#
Restore system to previously achieved performance; a record must exist of previous performance.
Difference between present and historical performance.
Consolidate symptoms (covariates) in a single version-controlled document or DAG. Agonize over all the presently known covariates:
Use open source analysis tools to get the most out of the data you already have. Depending on the cost of experimentation, develop your own model comparison tools to compress big data into a single variable you can e.g. add to tables. To automate data collection, generate reports (e.g. html or md files) that represent one row (i.e. observation / experiment) or that can compare a few observations.
Causal reasoning is intrinsically based in time, so it is usually helpful to collect a timeline. Parse a sequence of events (a log without timestamps) that includes the observed symptoms. For example:
The job started.
A warning was printed.
An exception was thrown.
Collect logs from the system:
Parse logs with context as a table. Helpful columns:
Parse and strip timestamps to diff logs.
To separate threads or processes.
Print other variables in the same way, so you can e.g. serialize it back into a unit test (serialize to and from a string).
Develop tools to parse logs and diff them to determine when and how the behavior went off track, or how models compare. For example, visually compare or subtract loss curves.
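The parsing and diffing steps above can be sketched in a few lines of Python. This is a minimal sketch, not a general log parser: the line layout (`<date> <time> [<thread>] <LEVEL> <message>`) and the function names `parse_log`, `without_timestamps`, and `diff_logs` are assumptions made for illustration, so adapt the regular expression to whatever your system actually prints.

```python
import difflib
import re

# Assumed (hypothetical) layout: "<date> <time> [<thread>] <LEVEL> <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:[.,]\d+)?) "
    r"\[(?P<thread>[^\]]+)\] (?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_log(text):
    """Return one row (dict) per line; lines that do not match keep only a message."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append(match.groupdict())
        else:
            rows.append({"timestamp": None, "thread": "-", "level": "-", "message": line})
    return rows


def without_timestamps(rows, thread=None):
    """Render rows without timestamps, optionally keeping a single thread."""
    return [
        f"[{row['thread']}] {row['level']} {row['message']}"
        for row in rows
        if thread is None or row["thread"] == thread
    ]


def diff_logs(good_text, bad_text, thread=None):
    """Unified diff of two logs after stripping timestamps."""
    good = without_timestamps(parse_log(good_text), thread)
    bad = without_timestamps(parse_log(bad_text), thread)
    return list(difflib.unified_diff(good, bad, "good", "bad", lineterm=""))
```

The first `+` or `-` line in the result marks where the two runs diverge. Filtering by thread before diffing keeps interleaving from showing up as spurious differences.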
For every symptom in the logs, try to find an equivalent manifestation of it in the logs between where you can see the symptoms clearly and where you aren’t sure it has started. That is, backprop to intermediate variables and add these new dimensions to your table.
Is a running program still available? Connect to it in the debugger, dump call stacks (all threads/processes), and disconnect again quickly to keep the state intact. Consider procfs, in particular /proc/PID/stack. Try to find a way to make the program hang and
If the defect is an uncaught exception or segfault, find the core dump.
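If the program is written in Python, the standard library `faulthandler` module is one way to dump the call stacks of all threads without attaching a debugger. The sketch below is an alternative to the debugger and procfs approaches above, not a replacement: it only shows Python-level frames, and the helper names `dump_all_stacks` and `install_stack_dump_signal` are hypothetical.

```python
import faulthandler
import signal
import tempfile


def dump_all_stacks():
    """Return the current Python call stacks of every thread as text."""
    with tempfile.TemporaryFile(mode="w+") as handle:
        # faulthandler writes straight to the file descriptor.
        faulthandler.dump_traceback(file=handle, all_threads=True)
        handle.seek(0)
        return handle.read()


def install_stack_dump_signal():
    """Dump all stacks to stderr whenever the process receives SIGUSR1.

    Returns False where this is unsupported (faulthandler.register is not
    available on Windows).
    """
    if hasattr(signal, "SIGUSR1") and hasattr(faulthandler, "register"):
        faulthandler.register(signal.SIGUSR1, all_threads=True)
        return True
    return False
```

With the signal handler installed at startup, sending SIGUSR1 to the process ID (for example with `kill -USR1 PID`) prints the stacks and lets the program continue, which keeps its state intact.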
Prefer automatic tests to tooling and instrumentation (requiring manual intervention). If you can say that some behavior is better than other behavior rather than simply statistically associated with a particular defect, you should write a test indicating your interest in maintaining the behavior.
The decision to invest in an automatic test comes down to two questions: do you care about the behavior, and is the behavior likely to regress? The defect you are presently fixing is one sample indicating the behavior is likely to regress; the fact that you are interested in fixing it indicates you care.
Using both the data and your priors, develop causal theories. See:
There may be multiple causes of the primary symptom, but most of the time you can expect a Pareto distribution and rely on the Pareto principle.
Drawing a DAG forces reasoning in at least some observable quantities (the bubbles/covariates). In the language of the scientific process, causal reasoning should force the user to make testable predictions. Usually it encourages measuring not for its own sake (e.g. setting up instrumentation hoping new data inspires hypotheses) but for the sake of verifying a hypothesis as cheaply as possible. The better your theory and the more likely hypotheses you are considering, the more cheaply you can measure and experiment (see Design of experiments below).
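A causal DAG kept in a version-controlled file can also be queried. The sketch below assumes a plain dictionary that maps each covariate to its direct causes; the covariate names are hypothetical placeholders, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical covariates: each key maps to the set of its direct causes.
CAUSES = {
    "exception_thrown": {"warning_printed", "input_size"},
    "warning_printed": {"library_version"},
    "input_size": set(),
    "library_version": set(),
}


def candidate_causes(causes, symptom):
    """Return every ancestor of `symptom`: the covariates that could explain it."""
    seen = set()
    stack = [symptom]
    while stack:
        for parent in causes.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def causes_before_effects(causes):
    """List covariates with causes first; raises graphlib.CycleError if not a DAG."""
    return list(TopologicalSorter(causes).static_order())
```

Each ancestor returned by `candidate_causes` is a quantity you can predict something about and then measure. The cycle check is a cheap guard against a drawing that has stopped being a DAG.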
The number of alternative hypotheses you should develop before moving on to new data collection will depend on the network, your prior network understanding, the stage of your investigation (including how much data you’ve collected), the cost of measurement (analysis), and the cost of experimentation. Call this H (number of hypotheses).
Specialize standard hypotheses#
Common starting points for developing testable predictions; many of these come with an example DAG.
It seems basic, but start by asking what is different (including being run later, at a new timestamp). What are you doing in your application that is unique about how you’re using this library? For example, why did the authors never see your training example? Do they have a FAQ for common root causes (not in the warning message)?
When you’re debugging training (Bayesian inference) consider the dataset history, the code history, and weight initialization. For example, to reproduce model performance from random initialization, consider whether you should first reproduce performance (no increase in loss) with known good model weights. An example DAG:
Out of memory#
See Dying, fast and slow: OOM crashes in Python for examples of how out of memory issues display in Python, and the related article Measuring memory usage in Python: it’s tricky! for a discussion of resident memory. Almost every covariate in this DAG is observable:
In the Space-time tradeoff, we’d ideally like to achieve the green curve below. That is, we want to use all the space we have to finish as quickly as possible (so we need less time). In practice, all we achieve is the blue curve, filling an in-memory concurrent pipeline (see Prescribe computing metrics) as soon as possible from disk. If we’re O(n) in memory or in general our memory consumption is a function of the input (that is, we do not have a chunk size adjusting to the environment’s memory), then we’ll crash on larger
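One way to keep memory consumption from being a function of the input is to process it in chunks sized from a memory budget. The sketch below is illustrative only: `memory_budget_bytes` and `bytes_per_item` are hypothetical parameters you would have to estimate or measure for your own workload.

```python
from itertools import islice


def chunk_size_for(memory_budget_bytes, bytes_per_item):
    """Largest number of items that fits in the budget (never less than one)."""
    return max(1, memory_budget_bytes // bytes_per_item)


def chunks(items, size):
    """Yield lists of at most `size` items without materializing the whole input."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def total_length(lines, memory_budget_bytes, bytes_per_item):
    """Example reduction: at most one chunk of lines is held in memory at a time."""
    size = chunk_size_for(memory_budget_bytes, bytes_per_item)
    return sum(sum(len(line) for line in chunk) for chunk in chunks(lines, size))
```

Because `chunks` accepts any iterable, `lines` can be an open file object, so only one chunk is resident at a time regardless of file size.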
What logging statements were not printed? Infer a program’s control flow without adding new logging statements by checking what was NOT printed as well as what was printed.
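Checking for absent log statements is easy to automate once you have a list of the messages you expect on the normal path. A minimal sketch, assuming the expected messages are substrings you copied from the source code; both function names are hypothetical:

```python
def missing_messages(expected_in_order, log_lines):
    """Return the expected messages (substrings) that never appear in the log."""
    return [
        message
        for message in expected_in_order
        if not any(message in line for line in log_lines)
    ]


def last_reached(expected_in_order, log_lines):
    """Return the last expected message that was printed, or None if none were."""
    reached = None
    for message in expected_in_order:
        if any(message in line for line in log_lines):
            reached = message
    return reached
```

The gap between `last_reached` and the first entry of `missing_messages` brackets the region of code where control flow left the expected path.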
Form hypotheses by rejecting statements you expect to be true. Produce a variety of statements you expect to be true, ordered by confidence.
Assume a regression#
Was there any point in the past when the feature worked as expected? Narrow the range of commits in which the regression was introduced with git bisect. Ask developers on commits between the broken and working commits for help developing theories.
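`git bisect run` can automate the search if you give it a script whose exit code classifies each commit: 0 means good, 125 means "skip this commit", and any other code from 1 to 127 means bad. The sketch below maps a build step and a reproduction step to those codes; the function name `classify` and the injectable `run` parameter are assumptions made so the logic can be tested without a repository.

```python
import subprocess

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def classify(build_cmd, test_cmd, run=subprocess.run):
    """Map the current commit to a `git bisect run` exit code."""
    if run(build_cmd).returncode != 0:
        return SKIP  # the commit cannot be built, so it cannot be judged
    return GOOD if run(test_cmd).returncode == 0 else BAD


# Hypothetical usage in a script saved outside the repository's tracked files:
#   import sys
#   sys.exit(classify(["make"], ["python", "-m", "pytest", "tests/test_repro.py"]))
```

After `git bisect start <bad> <good>`, running `git bisect run python /path/to/that_script.py` checks out commits and calls the script until the first bad commit is found. Returning 125 for commits that fail to build stops unrelated build breakage from being blamed for the regression.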
Read Design of experiments. Assign likelihood estimates to hypotheses to help you decide which to run expensive testing for. For example:
Some data will rule out certain hypotheses. Other times, data will only make certain hypotheses less likely. In Bayesian inference, we estimate parameters from the data. In this case, we’re ranking models based on data (and priors).
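The ranking described above is Bayes' rule applied to a discrete set of hypotheses. A minimal sketch, assuming you can write down a prior for each hypothesis and a rough likelihood of the new observation under each one; the hypothesis names in the usage note are placeholders:

```python
def update(priors, likelihoods):
    """Posterior over hypotheses, given P(observation | hypothesis) for each one."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the observation is impossible under every hypothesis")
    return {h: weight / total for h, weight in unnormalized.items()}


def ranked(posterior):
    """Hypotheses from most to least probable."""
    return sorted(posterior, key=posterior.get, reverse=True)
```

For example, with priors `{"dataset": 0.5, "code": 0.3, "init": 0.2}` and likelihoods `{"dataset": 0.0, "code": 0.5, "init": 0.5}`, the observation rules out `"dataset"` entirely (likelihood zero) and leaves the relative ranking of `"code"` and `"init"` unchanged. A likelihood that is small but nonzero makes a hypothesis less likely without eliminating it.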
For the most likely hypotheses, design experiments that will either falsify or continue to confirm them.
New data: Unobservable variables#
Variables can be unobservable because they are unavailable in the raw data, because they have not been extracted from the raw data, or both. Sometimes you need to add “permanent instrumentation”, i.e. both extract the data and compress it in your analysis tools in one step.
New data: Log vs. Debug#
Re-run with more detailed logging levels:
At run time with
In shared libraries (e.g.
As you design experiments, should you prefer logging to debugging to answer your questions? When you debug you inspect the local causal graph; logging is for a bigger picture. Temporarily log call stacks to get some of both.
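In Python, the standard `logging` module can attach the call stack to a single log record through the `stack_info=True` argument, which is one way to temporarily log call stacks. The logger name `repro` and the function `load_batch` below are hypothetical.

```python
import io
import logging


def make_logger(stream, level=logging.DEBUG):
    """Create a logger that writes timestamp, thread, level, and message to `stream`."""
    logger = logging.getLogger("repro")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(threadName)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def load_batch(logger, index):
    # stack_info=True appends the stack of calls that led to this log statement.
    logger.debug("loading batch %d", index, stack_info=True)
    return index


def demo():
    """Return the text logged by one call to load_batch."""
    stream = io.StringIO()
    load_batch(make_logger(stream), 3)
    return stream.getvalue()
```

Raising the level passed to `make_logger` (for example to `logging.INFO`) silences the debug record and its stack again without touching the call site.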
Logging takes more space; debugging takes more time. In most cases time is more valuable than computer resources (space). From Apache log4j 1.2 - Short introduction to log4j:
… debugging statements stay with the program; debugging sessions are transient.
On the other hand, excessive logging pollutes the code base with unhelpful comments. The call stack (context) at an exception is easier to digest than verbose logs.
Testing and experimentation don’t change the real-world model; they only collect more training examples from it.
Reproduce in new environments#
Reproduce the issue in a more flexible or more easily accessible environment (e.g. locally). A cloud debug environment can be as good as a local environment, depending on what you need to test. The goal is to answer “why” questions faster. Sometimes it is easy to reproduce quickly, but you can’t get more data to figure out what is going on. For example:
You can’t build the code to add logging statements.
You can’t debug the code.
If it is not easy to test more than one theory in parallel, reproduce in multiple environments.
Reduce delay in performing experiments.
Human time cost#
Reproduce with less manual human time investment. It is easy to become focused on confirming a single cause. We often have many causes to discover before we reach the root (it is often better to invest long-term).
Finding the root cause will take too much wall clock time if a reproducible problem has a long cycle time. Imagine discovering an issue that takes hours to reproduce to test hypotheses:
When feedback is slow and single-threaded, ask yourself: if we did this and the result was as expected, what would we do next? If it takes 3 hours to reproduce, you must think of several “why” questions to ask in each run. If it takes 5 minutes to reproduce, you can ask one “why” question at a time.
To reproduce faster, can you:
Linearly scale down the amount of input to the algorithm?
Cut out part of the context?
Rule out parallelization being the culprit (thrashing, locking issues); single-threaded code is also easier to debug.
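Scaling down the input can itself be automated when you have a predicate that says whether a candidate input still reproduces the defect. The sketch below is a simplified greedy reduction in the spirit of delta debugging, not the full ddmin algorithm; `shrink` and `still_fails` are hypothetical names, and `still_fails` is assumed to be deterministic.

```python
def shrink(items, still_fails):
    """Greedily drop chunks of `items` for as long as `still_fails` stays true."""
    items = list(items)
    chunk = len(items) // 2
    while chunk >= 1:
        start = 0
        while start < len(items):
            candidate = items[:start] + items[start + chunk:]
            if candidate and still_fails(candidate):
                items = candidate  # the chunk was irrelevant; keep it removed
            else:
                start += chunk  # the chunk is needed; move past it
        chunk //= 2
    return items
```

For instance, if the defect needs both records `7` and `42` from `range(100)`, `shrink(range(100), lambda xs: 7 in xs and 42 in xs)` returns `[7, 42]`. Each call to `still_fails` costs one reproduction, so this pays off mainly when a reduced run is much cheaper than the full one.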
If you can’t reproduce a defect when you run it with reduced context in a new setting, what context made the difference? The state you need to reproduce faster is often invaluable for determining what the root cause of the defect is. You’re effectively narrowing down the problem to where it occurs in both time and code.
To reduce context/state so you can reproduce faster:
Check the logs for printed state.
Call a function lower in the call stack with the same context (arguments).
The final result of reproducing with a smaller amount of context and less wall clock time is a unit test.
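As an illustration of that end state, the sketch below turns arguments printed in a log back into a unit test. Everything in it is hypothetical: `normalize` stands in for the function lower in the call stack, and `LOGGED_ARGS` stands in for state that was serialized to the log as JSON.

```python
import json
import unittest


def normalize(values):
    """Hypothetical function under test: scale values so that they sum to 1."""
    total = sum(values)
    return [value / total for value in values]


# Arguments copied verbatim from a (hypothetical) log line that printed them as JSON.
LOGGED_ARGS = '{"values": [1, 1, 2]}'


class NormalizeRegressionTest(unittest.TestCase):
    def test_logged_arguments(self):
        # Deserialize the logged state and call the function directly.
        kwargs = json.loads(LOGGED_ARGS)
        self.assertEqual(normalize(**kwargs), [0.25, 0.25, 0.5])
```

Printing state in a format that round-trips (here JSON) is what makes it cheap to move from a log line to a test like this.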