Improve design experiments#

TODO-rib: When should you reduce inductive bias?

See:

TODO-dr: Should you reduce the dimensionality of your inputs?

Add a reduce-dimensionality.md article.

The model is not the same as the real thing; you almost always simplify the real world to put it into a model. So it’s almost never a question of whether or not to reduce dimensionality, but whether you can afford to reduce it more (for your particular task).

A greater simplification lets you attend to more at once, at the cost of effectively letting you see less detail in every item because it’s more simplified. For example, a planning git graph has many tasks that are all drastic simplifications of the full task. One of its worst aspects is that you can’t really rearrange the items in the graph without understanding them in enough detail to understand their dependencies. Still, the graph is useful because it lets you see (attend to) more at once. Said another way, it provides a useful abstraction.

Similarly, to “load” code into your head by reading it is actually to decompress it into natural language (increasing dimensionality). You often think of compressing code into natural language, because documentation is often shorter than the code. The reason this isn’t “compression” is that documentation is also much less complete (detailed) than code. Documentation attends to less, but can be more detailed because of that. Watch out for code that throws away details, however, in the sense of throwing away long variable names in exchange for single letter names (or even more so, when compiling to a binary). Ideally the documentation is in the code so nothing gets lost (avoid lossy compression), or at the least you retain links to documentation (as e.g. .md files).

Should code have minimal comments so you can avoid conflicts only due to documentation changes? You should deal with documentation conflicts separately from code conflicts, even if you are dealing with them in the same commit. If you see natural language as compressing to code, and plain text code as compressing to binaries, then this is similar to the question of whether to include symbols in your binaries. Most of the time, you don’t need to. In fact, if you allow different people to have a different understanding of the code (e.g. Bayesian vs. Frequentist, or simply based on different documentation) then you don’t want to include all possible interpretations. What is “tricky” (needing comments) may be different to different people. See also Linus’ thoughts in a/lt-debugger; you found this link in Forcing people to read and understand code instead of using comments, function summaries and debuggers? - SESE.

Said another way, don’t use a fork of the code to make comments. Comment on the code through documentation; your fork should only remove and simplify code. You also like that this lets you fit more code onto your screen at once (that you’ve understood). If you do need to make comments inline to avoid the split-attention effect, then you can move them out of the code once you understand it and you have your first conflict.

Is a picture worth a thousand words? It depends on the picture, and what kind of task you need to do with it. Why aren’t you asking about the resolution of the image?

Could you use PCA to come up with initial weights to help ease the training of a net? That is, manually strip out as many variables as you want, use that as the dimensionality of a matrix, and then let the net learn to tweak the weights of the projection matrix as well.
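A minimal sketch of that idea, assuming NumPy is available. The data, the helper name `pca_projection`, and the choice of `k` are all hypothetical; the sketch only builds the initial projection matrix and checks that it is a sensible starting point, and it leaves the actual training loop out.

```python
import numpy as np

def pca_projection(X, k):
    """Return (mean, W): W has shape (n_features, k), top-k principal directions as columns."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mean, Vt[:k].T

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 5 features, driven by 2 latent factors plus small noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

mean, W0 = pca_projection(X, k=2)

# W0 would be the starting value of a trainable first layer:
# hidden = (x - mean) @ W, with W initialised to W0 and then updated by the optimiser.
hidden = (X - mean) @ W0
reconstruction = hidden @ W0.T + mean
relative_error = np.linalg.norm(X - reconstruction) / np.linalg.norm(X)
print(W0.shape, relative_error)
```

Because the columns of `W0` are orthonormal, the layer starts as a pure projection; whether training from this point is actually easier than from a random initialisation is the open question above, not something this sketch settles.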

See:

Tag search on SE: Highest scored ‘dimensionality-reduction’ questions - CV.

Is multicollinearity really much of a problem as long as you are doing some dimensionality reduction anyways? See:

This is related to whether you should ever add a new variable to a model. Does the variable matter? See also:

Once you define this better, replace the link to Wikipedia with your own article in your public notes (run a git grep to find the link you have now).

What if your input data had redundancies? The neural network can’t tell you it is ignoring some of your inputs. Should you always use PCA or an autoencoder to help remove the unnecessary information? If it just learns to ignore the data in the first few layers it’s not a big deal, though.

[1]: https://stats.stackexchange.com/questions/70899/what-correlation-makes-a-matrix-singular-and-what-are-implications-of-singularit
[2]: https://en.wikipedia.org/wiki/Dimensionality_reduction#Feature_selection

TODO-rbsap

Commenting on SVD for PCA.

Why is flip_signs necessary? See also linear algebra - Calculating SVD by hand: resolving sign ambiguities in the range vectors. - Math SE.
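A small sketch of the underlying ambiguity, in plain Python. This is not the `flip_signs` from the linked answer; `align_sign` is a hypothetical helper that only illustrates why some sign convention is needed before comparing the outputs of two decomposition routines: if `v` is a unit eigenvector, so is `-v`.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def align_sign(v, reference):
    """Flip v if it points away from reference (negative dot product)."""
    dot = sum(a * b for a, b in zip(v, reference))
    return v if dot >= 0 else [-a for a in v]

A = [[2.0, 1.0], [1.0, 2.0]]   # symmetric, eigenvalues 3 and 1
s = 2 ** -0.5
v = [s, s]                     # unit eigenvector for eigenvalue 3
minus_v = [-s, -s]             # equally valid unit eigenvector

# Both candidates satisfy A v = 3 v.
for candidate in (v, minus_v):
    Av = matvec(A, candidate)
    assert all(abs(x - 3.0 * c) < 1e-12 for x, c in zip(Av, candidate))

# Two routines may return opposite signs; align one to the other before comparing.
aligned = align_sign(minus_v, v)
print(aligned == v)
```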

See point 4. What are loadings?

See point 7. Why would anyone want to perform PCA on a correlation matrix rather than a covariance matrix? Related to Principal component analysis - Further considerations.
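One common reason is units. The sketch below uses made-up heights (metres) and masses (grams) and a closed-form eigenvector for a 2×2 symmetric matrix; all names and numbers are illustrative assumptions. With the covariance matrix the first component is almost entirely the large-unit variable, while with the correlation matrix both variables contribute equally.

```python
import math

def cov_matrix(xs, ys):
    """Sample variances and covariance (sxx, sxy, syy) of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, sxy, syy

def top_direction(a, b, c):
    """Unit eigenvector for the largest eigenvalue of [[a, b], [b, c]] (assumes b != 0)."""
    lam = (a + c) / 2 + math.hypot((a - c) / 2, b)
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]              # hypothetical data
masses_g = [52000, 61000, 67000, 80000, 85000]     # hypothetical data

sxx, sxy, syy = cov_matrix(heights_m, masses_g)
cov_dir = top_direction(sxx, sxy, syy)

r = sxy / math.sqrt(sxx * syy)                     # correlation coefficient
corr_dir = top_direction(1.0, r, 1.0)

print("covariance PCA:", cov_dir)
print("correlation PCA:", corr_dir)
```

Changing grams to kilograms changes `cov_dir` but leaves `corr_dir` alone, which is the practical argument for standardising first when the units are arbitrary.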

See also Highest scored ‘linear-algebra’ questions - Math SE.

TODO-mspca

See the first answer to Making sense of principal component analysis, eigenvectors & eigenvalues - CV. Missing the Spectral theorem dependency.

TODO-ce: Is “cross-entropy” a useful abstraction?

You should write out or add to your own notes in SR2 on this topic. That is, publish an article in your own words. You like how this article uses Wikipedia images, just like you intend to:

Closer to understanding the Kullback-Leibler Divergence:

Colah’s take on information theory:

This loss is not strictly required for classification, and actually may be suboptimal:

However, this loss is still useful for e.g. generative models:

This loss can be interpreted in many different ways:
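One of those interpretations can be checked numerically: cross-entropy splits into the entropy of the true distribution plus the KL divergence from the model to it, H(p, q) = H(p) + D_KL(p ‖ q). The distributions `p` and `q` below are arbitrary illustrative values, and the sketch uses natural logarithms.

```python
import math

def entropy(p):
    """H(p) = -sum p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # hypothetical "true" distribution
q = [0.5, 0.3, 0.2]   # hypothetical model prediction

gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))
print(cross_entropy(p, q), entropy(p), kl_divergence(p, q), gap)
```

Since H(p) does not depend on the model, minimising cross-entropy over `q` is the same as minimising the KL divergence.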

See also:

See also:

See also:

TODO-unsup: Do you prefer unsupervised to supervised learning?

Estimate value

You’re already quite familiar with the latter, at least relatively speaking. Seek novelty.

You drive improvements to signs based on what you discover in unsupervised learning. Why wouldn’t you expect a deep learning model to do the same? If you can detect patterns, you can assume some kind of structure. If you can assume some kind of structure, you can make a net that is much more efficient than a completely general learner (or reuse a net you have based on the same common mathematical structure). In fact, you often see solutions that fit the tools you already know. That is, you prefer to use dicts and map a lot when you code only because you know them. Similarly, people have filter bubbles and try to come up with economic solutions that fit the simple models they already know (e.g. laissez-faire is always better).

It’s like detecting patterns is the first step in the scientific process; it’s how you come up with educated guesses or working hypotheses.

You’re not the only one thinking like this. The large transformer models are being pretrained on unsupervised data before supervised learning. These models scale with more data:

In general, you have a lot more unsupervised training data. It’s much cheaper to manage, understand, etc. You don’t have to rely on another team to get you what you need.

Hinton is deeply suspicious of supervised learning for a reason. See the end of this article for a brief summary of his preference for unsupervised learning:

Is your role as a developer to do the pattern recognition for a network that can’t yet do so for itself? If you see an image, you know to model it with a convolutional network. In general, you make the decision about how to model a system and make architectural choices based on the mental library of model pieces you know may help build a useful model for your new problem. Is your library expanding?

Is unsupervised learning similar to Bayesian statistics? It seems like you can only argue against it in terms of efficiency:

TODO-visc: Do visual proofs imply the importance of causality?

Do we learn so much faster with a visual of something because logic is ultimately based on causality, and a visual makes it “clear” (faster) how one thing causes another? When you have to read, you have to figure out for yourself what the causes are, perhaps by reading a whole paragraph or more.

In the same way, a computer program is a much more “compact” (compressed) version of a causal diagram (something visual). If you have a visual representation of code, you can “read” it much faster.

It seems like you want to prefer natural logic systems to Hilbert style:

The former (natural) likely fits closer to the causal statements that are common in natural language. For example, see the counterfactual conditional that starts this topic:

In the words of Curry, don’t run away from paradoxes.

If you see the human brain as emulating the natural world, then the electrical impulses across neurons should roughly correspond to (be a compressed version of) the more complicated activity that happens in the world. This fits causality as being based in local interactions in the physical world. We’ve created computers that emulate the human brain, to some extent, also using electrical impulses at the microscopic level.

TODO-sscb: Should you fork the content of SSC?

It would make coming back to the material later easier (fix errors). That’d make your own answers on the topic much cleaner.

Would the authors want to put the source of the book up on GitLab or GitHub so that others can suggest edits to the source rather than in a Google Doc? You’re also not sure if you’re forking an old version of the content.

You could get the same effect by simply copying and pasting any updates they make into your own build of the pdf. You really only need to copy and paste seven or so files.

TODO-cuhc: What is a short summary of the Curry-Howard Correspondence?

See:

TODO-cycl: How is cyclomatic complexity measured?

Should you set a limit on this in pylint? Right now you ignore all those errors, mostly because you don’t understand the metric.
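To make the metric less opaque, here is a sketch of McCabe's definition, M = E − N + 2P, where E is the number of edges, N the number of nodes, and P the number of connected components of the control-flow graph. The graph below is written out by hand for a hypothetical function with one `if`/`else` followed by one loop; real tools derive the graph from the code instead.

```python
def cyclomatic_complexity(edges):
    """M = E - N + 2P for a control-flow graph given as (source, target) pairs."""
    nodes = {n for edge in edges for n in edge}
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find with path halving, used to count connected components.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    components = len({find(n) for n in nodes})
    return len(edges) - len(nodes) + 2 * components

# Hand-built control-flow graph: one if/else, then one loop.
cfg = [
    ("entry", "if_cond"),
    ("if_cond", "then"), ("if_cond", "else"),
    ("then", "loop_cond"), ("else", "loop_cond"),
    ("loop_cond", "loop_body"), ("loop_body", "loop_cond"),
    ("loop_cond", "exit"),
]
print(cyclomatic_complexity(cfg))                 # 8 edges - 7 nodes + 2 = 3
print(cyclomatic_complexity([("entry", "exit")])) # straight-line code: 1
```

For structured code like this the result equals the number of decision points plus one (here the `if` and the loop condition), which is usually the easier way to estimate it by eye.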

Cyclomatic complexity is related to Betti numbers:

TODO-prit: How do you prove it?

You are bad at proofs (maybe because they are hard, though). You typically start using a result (like code) and then only “prove” it (work out the bugs) once you’ve been using it some time. The downside to this approach is that it’s not necessary when you’re working with definitions rather than data. Whenever you’re trying to prove something, you should use this list:

You’ve avoided proofs (and math) in the past because you feel like all you’re doing is symbol manipulation, looking for just the right symbols to come together (solving the word problem by simple exploration). A computer should be able to quickly present to you all the deductions you can come up with from certain facts, it seems. Is the issue that humans typically don’t provide it with enough facts? We have a lot of knowledge we could forget to share. See also:

Does SymPy provide anything to automate this? No, it looks like it’s unrelated.

Notice the Curry-Howard correspondence has huge implications for how you take notes; you design your .md files to be functions. Generally speaking you should refine them so that they all “prove” something even if only probabilistically (see !w Bayesian logic).
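A short Lean 4 sketch of what "proofs are functions" means in practice. Each `example` states a proposition as a type and proves it by writing a term of that type; the propositions chosen here are illustrative, not taken from any of the linked material.

```lean
-- A proof of q from p and p → q is function application.
example (p q : Prop) (hp : p) (hpq : p → q) : q :=
  hpq hp

-- Chaining implications is function composition.
example (p q r : Prop) (hpq : p → q) (hqr : q → r) : p → r :=
  fun hp => hqr (hpq hp)

-- A conjunction is a pair; swapping its components proves q ∧ p.
example (p q : Prop) (h : p ∧ q) : q ∧ p :=
  ⟨h.2, h.1⟩
```

Read this way, a note that "proves" something takes its premises as arguments and returns its conclusion, which is the sense in which an .md file can be designed as a function.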

Consider this list of which tools have formalized which of many famous theorems to decide which to work with. You can also use this list as a way to learn how to do a proof with relevant examples (e.g. whatever proofs or topics you are currently learning about):

You should probably start by reviewing propositional calculus, the basis for other logical calculi:
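As a warm-up, propositional formulas can be checked by brute force over truth tables. The sketch below is a minimal checker with illustrative names (`is_tautology`, `implies`); it verifies modus ponens and shows that affirming the consequent is not valid.

```python
from itertools import product

def is_tautology(formula, n_vars):
    """True if formula(*values) holds for every assignment of n_vars booleans."""
    return all(formula(*values) for values in product([False, True], repeat=n_vars))

def implies(a, b):
    """Material implication: a -> b."""
    return (not a) or b

def modus_ponens(p, q):
    # (p and (p -> q)) -> q
    return implies(p and implies(p, q), q)

def affirming_consequent(p, q):
    # (q and (p -> q)) -> p, a classic fallacy
    return implies(q and implies(p, q), p)

print(is_tautology(modus_ponens, 2))          # True
print(is_tautology(affirming_consequent, 2))  # False
```

This only scales to a handful of variables (the table has 2^n rows), which is one reason proof systems work with inference rules instead of exhaustive checking.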

A proof is the process of building a function; we always start from what we know and start this search process in the direction of where we think we may find value. You can take your premises and randomly combine them to try to discover something new, or take specific premises and combine them through educated guesses to try to go in a specific direction of a desired result. A proof is not just the process of creating many more true statements (propositions) from old ones; it’s about generating valuable propositions (reusable, achieve human needs). In fact, you should prefer the term “operator” to “function” for this purpose; see Operator (mathematics) - Wikipedia. Or should you not? At some point you need to be recursive and say either a function that takes a function or an operator that takes an operator.

This search was particularly interesting with respect to Exercise 2.45, part 2. You had to guess (as hypotheses) different kinds of functions: addition, polynomial, exponentiation, constant, piecewise, etc. In some sense there’s creativity here; in another sense you’re simply going through a list of potential functions (perhaps only those you know). You could see the requirements (of a monoidal monotone, in this case) as your constraints or in general a list of relations that must hold (like a list of images when training a net). If a neural network can approximate any function, how well does it approximate an exponential function? It was the answer here. See also:

TODO-idb: How does one quickly identify the bottleneck in a computer program?

The idea here is to provide a context for studying Turing machines and computability, complexity, etc. There are also a lot of notes to move on this subject. So in some sense, it’s exploring the domain to come up with better questions.
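The practical first answer to the question in the heading is to measure before theorising. A minimal sketch using the standard library profiler follows; `slow_part`, `fast_part`, and `program` are made-up stand-ins for real code.

```python
import cProfile
import io
import pstats

def slow_part(n):
    # Deliberately quadratic: the hypothetical bottleneck.
    return sum(i * j for i in range(n) for j in range(n))

def fast_part(n):
    # Linear work, for contrast.
    return sum(range(n))

def program():
    fast_part(300)
    slow_part(300)

profiler = cProfile.Profile()
profiler.runcall(program)

# Collect the report into a string, sorted by cumulative time.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
report_text = report.getvalue()
print(report_text)
```

The profiler tells you where the time goes; complexity theory is what explains why `slow_part` gets worse so much faster than `fast_part` as `n` grows.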

TODO-catt: What’s a simple high-level summary of category theory?

It would have been helpful to have category theory when you were reviewing linear algebra while trying to understand the projections into a space that the KQ matrices do in KQV attention. Almost every concept you were trying to understand had an alternative explanation in terms of category theory. You should definitely organize your notes on the topic as part of this effort; perhaps that’s the first step. Once your own notes are organized, then you “explore domain” by simply reading the notes of others (reading e.g. Wikipedia). You don’t have to be writing notes to be exploring a domain; reading is exploring as long as you are understanding and have a goal.

Math is critical not only because it has already created a large body of language (unique words) to describe concepts, but also because it is old and therefore already holds many places in the English namespace. You want to understand linear algebra better by understanding it from another perspective (as well as group theory). I’d say linear algebra is the basis for pretty much all machine learning (tensors). You should have a solid understanding of linear algebra before trying to generalize it, however. It’s also critical because it defines the data structures that we use; focus on data structures first. Should you start with an article on the importance of mathematics? When exploring, for example, you need some general guidelines about how to explore (prefer math). You also see mathematical models apply to an infinite number of training examples, rather than a natural number (no matter how big). You can’t just follow curiosity (the curiosity gradient, novelty), unless curiosity is based on the problems you’ve experienced in the past.

In the past you’ve experienced being able to answer a question you had on one page of Wikipedia by almost randomly following links and then seeing the answer show up elsewhere. Because of the connectedness of mathematics, many concepts are discussed in multiple places. That is, you don’t need to keep track of your mental train of questions as much as you would have to with another resource and another topic. It’s like following links in a consistent set of notes that you hope your own will be someday. In fact, if you don’t find the answer elsewhere it may not be an important answer.

It seems better to study category theory before topology. You already have two examples (group theory and linear algebra) you can generalize from, and you have a lot of background in general. You can also use category theory in other places besides math, such as functional programming. It’s a way to make your brain remember more things:

Understanding category theory is like importing a library dependency, rather than taking dependencies on individual functions. In general seeing a “theory” after a name is a good indicator that you need to make a concerted effort to learn something, similar to a library.

Wikipedia is an excellent source for learning mathematics. You’ve read their whole page of caveats on the topic, and you agree with it. However, any study of mathematics requires some reference material as you work (to go along with your primary material), and Wikipedia is excellent in this area. For example:

https://en.wikipedia.org/wiki/Function_(mathematics)#Other_terms

You care about applications to linear algebra:

See all the examples here:

Document how Boolean algebra is a Magma
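A starting point for that documentation: a magma is just a set with a binary operation that is closed on the set. The sketch below checks closure on {False, True} for three Boolean operations, and then associativity as a first step beyond a bare magma. The helper names are illustrative.

```python
from itertools import product

B = (False, True)
operations = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "xor": lambda a, b: a != b,
}

def is_closed(op, carrier):
    """Magma condition: op(a, b) stays inside the carrier set."""
    return all(op(a, b) in carrier for a, b in product(carrier, repeat=2))

def is_associative(op, carrier):
    """Extra structure: a magma with an associative operation is a semigroup."""
    return all(op(op(a, b), c) == op(a, op(b, c))
               for a, b, c in product(carrier, repeat=3))

for name, op in operations.items():
    print(name, is_closed(op, B), is_associative(op, B))
```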

You understand homomorphisms for groups now, which is part of category theory:

In VGT a question went over commutators. Can you understand how they are functors?

https://math.stackexchange.com/questions/312605/what-is-category-theory-useful-for

https://en.wikipedia.org/wiki/Functional_programming

https://en.wikipedia.org/wiki/Mathematical_logic

A “Basic category theory” pdf:

https://en.wikipedia.org/wiki/Dynamic_dispatch

  • Notice the types in this example - dividend and divisor can be matrices, floats, etc. Work through an example like this (with mathematical sets/types) and relate it to category theory.

  • Also, Nick talks about dispatch a lot in his code.

  • See dispatch methods here: https://docs.python.org/3/library/functools.html#module-functools
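A sketch of the divide example with `functools.singledispatch`. The implementations registered here (exact fractions for `int`, elementwise division for `list` as a stand-in for a matrix) are illustrative choices, not taken from the Wikipedia example.

```python
from fractions import Fraction
from functools import singledispatch

@singledispatch
def divide(dividend, divisor):
    # Fallback when no rule is registered for the dividend's type.
    raise TypeError(f"no division rule for {type(dividend).__name__}")

@divide.register
def _(dividend: float, divisor):
    return dividend / divisor

@divide.register
def _(dividend: int, divisor):
    # Keep integer division exact by returning a Fraction.
    return Fraction(dividend, divisor)

@divide.register
def _(dividend: list, divisor):
    # Treat a list as a vector and divide elementwise, dispatching again per element.
    return [divide(x, divisor) for x in dividend]

print(divide(6, 4))            # 3/2
print(divide(1.0, 4))          # 0.25
print(divide([2.0, 4.0], 2))   # [1.0, 2.0]
```

Note that `singledispatch` chooses the implementation from the type of the first argument only. Each registered type can be read as an object and each implementation as an arrow out of it, which is one place to start relating the example to category theory.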

https://en.wikipedia.org/wiki/Higher-order_function

TODO-ctca: How do category theory and order theory relate to causality?

See Causality. One could see the lack of preservation of joins and meets as a way of losing history in causal DAGs. For example, if you have the number 12 you can’t say for sure if it was generated by 6 * 2 or 3 * 4.
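The example with 12 can be made concrete: multiplication maps several input pairs to the same output, so the output alone does not determine its history. `factor_pairs` is an illustrative helper.

```python
import math

def factor_pairs(n):
    """All (a, b) with 2 <= a <= b and a * b == n."""
    return [(a, n // a) for a in range(2, math.isqrt(n) + 1) if n % a == 0]

print(factor_pairs(12))   # [(2, 6), (3, 4)]
print(factor_pairs(7))    # [] (a prime has no non-trivial history)
```

Any map with more than one preimage for some output loses information in this way; the pairs returned here are exactly the candidate causes that cannot be told apart from the result.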

TODO-cbow: Is composition a binary operation?

See Binary operation, where it clearly seems to be one. That is, you should be able to see \(\circ\) as a binary operation just like \(\ast\) and \(+\).
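A sketch that treats \(\circ\) as an ordinary two-argument function on functions. It assumes every function maps integers to integers, so any two of them can be composed; for functions with mismatched domains and codomains, composition is only partially defined, which may be why the two articles treat it differently.

```python
from functools import reduce

def compose(f, g):
    """(f ∘ g)(x) = f(g(x)): takes two functions, returns one function."""
    return lambda x: f(g(x))

identity = lambda x: x
double = lambda x: 2 * x
increment = lambda x: x + 1
square = lambda x: x * x

# Associative: the grouping does not matter.
left = compose(compose(double, increment), square)
right = compose(double, compose(increment, square))
print(left(3), right(3))          # 20 20

# Not commutative: the order does matter.
print(compose(double, increment)(3), compose(increment, double)(3))   # 8 7

# Being binary with an identity element, it folds over a list like + or *.
pipeline = reduce(compose, [double, increment, square], identity)
print(pipeline(3))                # 20
```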

However, the page Function composition - Wikipedia makes no link to this other article, except at the very bottom where the article is listed under the “Binary operations” category:

TODO-cmmd: How do you simply describe commutative diagrams?

Notice the picture of ab = ba on this page:

This is the same way that commutativity is described visually in VGT. It’s the same pattern you see in commutative diagrams as well:
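One way to describe it simply: a diagram commutes when every path between the same two corners gives the same result. The sketch below checks a square built from a group homomorphism (reduction mod N), where the two paths are "add, then reduce" and "reduce, then add mod N". The modulus and the sample range are arbitrary choices.

```python
from itertools import product

N = 5

def top_then_right(a, b):
    """Add in the integers, then reduce mod N."""
    return (a + b) % N

def left_then_bottom(a, b):
    """Reduce each input mod N, then add mod N."""
    return ((a % N) + (b % N)) % N

def square_commutes(samples):
    """True if both paths around the square agree on every sample."""
    return all(top_then_right(a, b) == left_then_bottom(a, b) for a, b in samples)

samples = list(product(range(-10, 11), repeat=2))
print(square_commutes(samples))
```

The picture of ab = ba is the same shape: two routes from one corner to the opposite corner that end in the same place.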

A similar simple example of a commutative diagram exists on this page:

See also simple examples of inversion here:

Perhaps some of these simpler examples should be added under “Examples” on the Wikipedia page.