`Muse! When we learned to count, little did we know all the things we could do some day by shuffling those numbers: Pythagoras said "All is number" long before he saw computers and their effects, or what they could do by computation, naive and mechanical fast arithmetic. It changed the world, it changed our consciousness and lives to have such fast math available to us and anyone who cared to learn programming. Now help me, Muse, for I wish to tell a piece of controversial math, for which the lawyers of DVD CCA don't forbear to sue: that they alone should know or have the right to teach these skills and these rules. (Do they understand the content, or is it just the effects they see?) And all mathematics is full of stories (just read Eric Temple Bell); and CSS is no exception to this rule. Sing, Muse, decryption once secret, as all knowledge, once unknown: how to decrypt DVDs.`

Seth Schoen, DeCSS haiku

My take on the Nature paper *Learning can be undecidable*:

Fantastic article, very clearly written.

So it reduces a kind of learninability called estimating the maximum (EMX) to the cardinality of real numbers which is undecidable.

When it comes to the relation between EMX and the rest of machine learning framework, the article mentions that EMX belongs to "extensions of PAC learnability include Vapnik’s statistical learning setting and the equivalent general learning setting by Shalev-Shwartz and colleagues" (I have no idea what these two things are), but it does not say whether EMX is representative of or reduces to common learning tasks. So it is not clear whether its undecidability applies to ML at large.

Another condition to the main theorem is the union bounded closure assumption. It seems a reasonable property of a family of sets, but then again I wonder how that translates to learning.

The article says "By now, we know of quite a few independence [from mathematical axioms] results, mostly for set theoretic questions like the continuum hypothesis, but also for results in algebra, analysis, infinite combinatorics and more. Machine learning, so far, has escaped this fate." but the description of the EMX learnability makes it more like a classical mathematical / theoretical computer science problem rather than machine learning.

An insightful conclusion: "How come learnability can neither be proved nor refuted? A closer look reveals that the source of the problem is in defining learnability as the existence of a learning function rather than the existence of a learning algorithm. In contrast with the existence of algorithms, the existence of functions over infinite domains is a (logically) subtle issue."

In relation to practical problems, it uses an example of ad targeting. However, A lot is lost in translation from the main theorem to this ad example.

The EMX problem states: given a domain X, a distribution P over X which is unknown, some samples from P, and a family of subsets of X called F, find A in F that approximately maximises P(A).

The undecidability rests on X being the continuous [0, 1] interval, and from the insight, we know the problem comes from the cardinality of subsets of the [0, 1] interval, which is "logically subtle".

In the ad problem, the domain X is all potential visitors, which is finite because there are finite number of people in the world. In this case P is a categorical distribution over the 1..n where n is the population of the world. One can have a good estimate of the parameters of a categorical distribution by asking for sufficiently large number of samples and computing the empirical distribution. Let's call the estimated distribution Q. One can choose the from F (also finite) the set that maximises Q(A) which will be a solution to EMX.

In other words, the theorem states: EMX is undecidable because not all EMX instances are decidable, because there are some nasty ones due to infinities. That does not mean no EMX instance is decidable. And I think the ad instance is decidable. Is there a learning task that actually corresponds to an undecidable EMX instance? I don't know, but I will not believe the result of this paper is useful until I see one.

h/t Reynaldo Boulogne

I don’t know about you people, but I don’t want to live in a world where someone else makes the world a better place better than we do.

Gavin Belson, Silicon Valley S2E1.

I came across this quote in a Slate post about Facebook

Just some non-rigorous guess / thought: Feedforward networks are like combinatorial logic, and recurrent networks are like sequential logic (e.g. data flip-flop is like the feedback connection in RNN). Since NAND + combinatorial logic + sequential logic = von Neumann machine which is an approximation of the Turing machine, it is not surprising that RNN (with feedforward networks) is Turing complete (assuming that neural networks can learn the NAND gate).

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has over 800 summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Darknet Diaries is a cool podcast. According to its about page it covers "true stories from the dark side of the Internet. Stories about hackers, defenders, threats, malware, botnets, breaches, and privacy."

But as more nontechnical people bought computers, the things that impressed hackers were not as essential. While the programs themselves had to maintain a certain standard of quality, it was quite possible that the most exacting standards—those applied by a hacker who wanted to add one more feature, or wouldn’t let go of a project until it was demonstrably faster than anything else around—were probably counterproductive. What seemed more important was marketing. There were plenty of brilliant programs which no one knew about. Sometimes hackers would write programs and put them in the public domain, give them away as easily as John Harris had lent his early copy of Jawbreaker to the guys at the Fresno computer store. But rarely would people ask for public domain programs by name: they wanted the ones they saw advertised and discussed in magazines, demonstrated in computer stores. It was not so important to have amazingly clever algorithms. Users would put up with more commonplace ones.

The Hacker Ethic, of course, held that every program should be as good as you could make it (or better), infinitely flexible, admired for its brilliance of concept and execution, and designed to extend the user’s powers. Selling computer programs like toothpaste was heresy. But it was happening. Consider the prescription for success offered by one of a panel of high-tech venture capitalists, gathered at a 1982 software show: “I can summarize what it takes in three words: marketing, marketing, marketing.” When computers are sold like toasters, programs will be sold like toothpaste. The Hacker Ethic notwithstanding.

Hackers: Heroes of Computer Revolution, by Steven Levy.

To compute Catalan numbers without unnecessary overflow, use the recurrence formula \(C_n = {4 n - 2 \over n + 1} C_{n - 1}\).

The Boyer-Moore algorithm for finding the majority of a sequence of elements falls in the category of "very clever algorithms".

```
int majorityElement(vector<int>& xs) {
int count = 0;
int maj = xs[0];
for (auto x : xs) {
if (x == maj) count++;
else if (count == 0) maj = x;
else count--;
}
return maj;
}
```

Roger Grosse's post How to learn on your own (2015) is an excellent modern guide on how to learn and research technical stuff (especially machine learning and maths) on one's own.

ATS (Applied Type System) is a programming language designed to unify programming with formal specification. ATS has support for combining theorem proving with practical programming through the use of advanced type systems. A past version of The Computer Language Benchmarks Game has demonstrated that the performance of ATS is comparable to that of the C and C++ programming languages. By using theorem proving and strict type checking, the compiler can detect and prove that its implemented functions are not susceptible to bugs such as division by zero, memory leaks, buffer overflow, and other forms of memory corruption by verifying pointer arithmetic and reference counting before the program compiles. Additionally, by using the integrated theorem-proving system of ATS (ATS/LF), the programmer may make use of static constructs that are intertwined with the operative code to prove that a function attains its specification.

(5-second fame) I sent a picture of my kitchen sink to BBC and got mentioned in the latest Boston Calling episode (listen at 25:54).

colah's blog has a cool feature that allows you to comment on any paragraph of a blog post. Here's an example. If it is doable on a static site hosted on Github pages, I suppose it shouldn't be too hard to implement. This also seems to work more seamlessly than Fermat's Library, because the latter has to embed pdfs in webpages. Now fantasy time: imagine that one day arXiv shows html versions of papers (through author uploading or conversion from TeX) with this feature.

### Notes on random froests

Stanford Lagunita's statistical learning course has some excellent lectures on random forests. It starts with explanations of decision trees, followed by bagged trees and random forests, and ends with boosting. From these lectures it seems that:

- The term "predictors" in statistical learning = "features" in machine learning.
- The main idea of random forests of dropping predictors for individual trees and aggregate by majority or average is the same as the idea of dropout in neural networks, where a proportion of neurons in the hidden layers are dropped temporarily during different minibatches of training, effectively averaging over an emsemble of subnetworks. Both tricks are used as regularisations, i.e. to reduce the variance. The only difference is: in random forests, all but a square root number of the total number of features are dropped, whereas the dropout ratio in neural networks is usually a half.

By the way, here's a comparison between statistical learning and machine learning from the slides of the Statistcal Learning course:

### Open peer review

Open peer review means peer review process where communications e.g. comments and responses are public.

Like SciPost mentioned in my post, OpenReview.net is an example of open peer review in research. It looks like their focus is machine learning. Their about page states their mission, and here's an example where you can click on each entry to see what it is like. We definitely need this in the maths research community.

### Some notes on RNN, FSM / FA, TM and UTM

Related to a previous micropost.

These slides from Toronto are a nice introduction to RNN (recurrent neural network) from a computational point of view. It states that RNN can simulate any FSM (finite state machine, a.k.a. finite automata abbr. FA) with a toy example computing the parity of a binary string.

Goodfellow et. al.'s book (see page 372 and 374) goes one step further, stating that RNN with a hidden-to-hidden layer can simulate Turing machines, and not only that, but also the *universal* Turing machine abbr. UTM (the book referenced Siegelmann-Sontag), a property not shared by the weaker network where the hidden-to-hidden layer is replaced by an output-to-hidden layer (page 376).

By the way, the RNN with a hidden-to-hidden layer has the same architecture as the so-called linear dynamical system mentioned in Hinton's video.

From what I have learned, the universality of RNN and feedforward networks are therefore due to different arguments, the former coming from Turing machines and the latter from an analytical view of approximation by step functions.

### Writing readable mathematics like writing an operating system

One way to write readable mathematics is to decouple concepts. One idea is the following template. First write a toy example with all the important components present in this example, then analyse each component individually and elaborate how (perhaps more complex) variations of the component can extend the toy example and induce more complex or powerful versions of the toy example. Through such incremental development, one should be able to arrive at any result in cutting edge research after a pleasant journey.

It's a bit like the UNIX philosophy, where you have a basic system of modules like IO, memory management, graphics etc, and modify / improve each module individually (H/t NAND2Tetris).

The book Neutral networks and deep learning by Michael Nielsen is an example of such approach. It begins the journey with a very simple neutral net with one hidden layer, no regularisation, and sigmoid activations. It then analyses each component including cost functions, the back propagation algorithm, the activation functions, regularisation and the overall architecture (from fully connected to CNN) individually and improve the toy example incrementally. Over the course the accuracy of the example of mnist grows incrementally from 95.42% to 99.67%.

One way RNNs are currently being used is to connect neural networks more closely to traditional ways of thinking about algorithms, ways of thinking based on concepts such as Turing machines and (conventional) programming languages. A 2014 paper developed an RNN which could take as input a character-by-character description of a (very, very simple!) Python program, and use that description to predict the output. Informally, the network is learning to "understand" certain Python programs. A second paper, also from 2014, used RNNs as a starting point to develop what they called a neural Turing machine (NTM). This is a universal computer whose entire structure can be trained using gradient descent. They trained their NTM to infer algorithms for several simple problems, such as sorting and copying.

As it stands, these are extremely simple toy models. Learning to execute the Python program

`print(398345+42598)`

doesn't make a network into a full-fledged Python interpreter! It's not clear how much further it will be possible to push the ideas. Still, the results are intriguing. Historically, neural networks have done well at pattern recognition problems where conventional algorithmic approaches have trouble. Vice versa, conventional algorithmic approaches are good at solving problems that neural nets aren't so good at. No-one today implements a web server or a database program using a neural network! It'd be great to develop unified models that integrate the strengths of both neural networks and more traditional approaches to algorithms. RNNs and ideas inspired by RNNs may help us do that.

Michael Nielsen, Neural networks and deep learning

What makes the rectified linear activation function better than the sigmoid or tanh functions? At present, we have a poor understanding of the answer to this question. Indeed, rectified linear units have only begun to be widely used in the past few years. The reason for that recent adoption is empirical: a few people tried rectified linear units, often on the basis of hunches or heuristic arguments. They got good results classifying benchmark data sets, and the practice has spread. In an ideal world we'd have a theory telling us which activation function to pick for which application. But at present we're a long way from such a world. I should not be at all surprised if further major improvements can be obtained by an even better choice of activation function. And I also expect that in coming decades a powerful theory of activation functions will be developed. Today, we still have to rely on poorly understood rules of thumb and experience.

Michael Nielsen, Neutral networks and deep learning

Primer Science is a tool by a startup called Primer that uses NLP to summarize contents (but not single papers, yet) on arxiv. A developer of this tool predicts in an interview that progress on AI's ability to extract meanings from AI research papers will be the biggest accelerant on AI research.

no-one has yet developed an entirely convincing theoretical explanation for why regularization helps networks generalize. Indeed, researchers continue to write papers where they try different approaches to regularization, compare them to see which works better, and attempt to understand why different approaches work better or worse. And so you can view regularization as something of a kludge. While it often helps, we don't have an entirely satisfactory systematic understanding of what's going on, merely incomplete heuristics and rules of thumb.

There's a deeper set of issues here, issues which go to the heart of science. It's the question of how we generalize. Regularization may give us a computational magic wand that helps our networks generalize better, but it doesn't give us a principled understanding of how generalization works, nor of what the best approach is.

Michael Nielsen, Neural networks and deep learning

Computerphile has some brilliant educational videos on computer science, like a demo of SQL injection, a toy example of the lambda calculus, and explaining the Y combinator.

### Learning via knowledge graph and reddit journal clubs

It is a natural idea to look for ways to learn things like going through a skill tree in a computer RPG.

For example I made a DAG for juggling.

Websites like Knowen and Metacademy explore this idea with added flavour of open collaboration.

The design of Metacademy looks quite promising. It also has a nice tagline: "your package manager for knowledge".

There are so so many tools to assist learning / research / knowledge sharing today, and we should keep experimenting, in the hope that eventually one of them will scale.

On another note, I often complain about the lack of a place to discuss math research online, but today I found on Reddit some journal clubs on machine learning: 1, 2. If only we had this for maths. On the other hand r/math does have some interesting recurring threads as well: Everything about X and What Are You Working On?. Hopefully these threads can last for years to come.

### Pastebin for the win

The lack of maths rendering in major online communication platforms like instant messaging, email or Github has been a minor obsession of mine for quite a while, as I saw it as a big factor preventing people from talking more maths online. But today I realised this is totally a non-issue. Just do what people on IRC have been doing since the inception of the universe: use a (latex) pastebin.

Neural networks are one of the most beautiful programming paradigms ever invented. In the conventional approach to programming, we tell the computer what to do, breaking big problems up into many small, precisely defined tasks that the computer can easily perform. By contrast, in a neural network we don't tell the computer how to solve our problem. Instead, it learns from observational data, figuring out its own solution to the problem at hand.

Michael Nielsen - What this book (Neural Networks and Deep Learning) is about

Unrelated to the quote, note that Nielsen's book is licensed under CC BY-NC, so one can build on it and redistribute non-commercially.

But, users have learned to accommodate to Google not the other way around. We know what kinds of things we can type into Google and what we can’t and we keep our searches to things that Google is likely to help with. We know we are looking for texts and not answers to start a conversation with an entity that knows what we really need to talk about. People learn from conversation and Google can’t have one. It can pretend to have one using Siri but really those conversations tend to get tiresome when you are past asking about where to eat.

Roger Schank - Fraudulent claims made by IBM about Watson and AI

- Access to computers—and anything that might teach you something about the way the world works—should be unlimited and total. Always yield to the Hands-On Imperative!
- All information should be free.
- Mistrust Authority—Promote Decentralization.
- Hackers should be judged by their hacking, not bogus criteria such as degrees, age, race, or position.
- You can create art and beauty on a computer.
- Computers can change your life for the better.

The Hacker Ethic, Hackers: Heroes of Computer Revolution, by Steven Levy

"Static site generators seem like music databases, in that everyone eventually writes their own crappy one that just barely scratches the itch they had (and I'm no exception)."

So did I.