A Network of Cells Coordinated by Energy Functions

30 Oct 2023


This is a description of a so-far-hypothetical architecture for AI. It builds on the idea of energy-based models (see the introduction to this blog post for what that is).

The two basic building blocks of this architecture are “cells” and “agreement functions”. A cell is simply a thing with a value, its “state”, which lives in some high-dimensional vector space. The state of a cell can be dictated by the output of some external modality, say a ConvNet embedding of a video input, or be “abstract”, i.e. the state only changes due to the internal mechanics of the network. Two (or in principle any number of) cells are “coordinated” by an agreement function.
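A cell, as described above, is little more than a named holder for a state vector. A minimal sketch, assuming only what the paragraph says (the class and field names here are hypothetical, not from the post):

```python
import numpy as np

class Cell:
    """A thing with a value: a state vector in a high-dimensional space."""

    def __init__(self, dim, external=False):
        self.state = np.zeros(dim)  # the cell's value, a point in R^dim
        # external=True: the state is dictated by an input modality,
        # e.g. a ConvNet embedding of a video frame.
        # external=False: the cell is "abstract", its state changes only
        # via the internal dynamics of the network.
        self.external = external

v = Cell(dim=512, external=True)  # e.g. fed by a video embedding
h = Cell(dim=512)                 # an abstract cell
```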

An agreement function is simply an energy function, i.e. a function that takes as input the states of the cells it coordinates and outputs a single scalar value. A high output value of the agreement function means that the states of the cells it coordinates are in disagreement, and so they should change by descending the gradient of the agreement function.
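As a concrete sketch of such an energy function, here is one assumed form for two cells: a quadratic energy E(a, b) = ||W a − b||², where W is the function's learnable weight matrix. The choice of this particular form, and all names, are mine for illustration; the post only requires states in, one scalar out, with low meaning agreement:

```python
import numpy as np

def agreement(a, b, W):
    """Energy of two cell states under weights W: zero means perfect agreement,
    a high value means the states disagree."""
    residual = W @ a - b
    return float(residual @ residual)

rng = np.random.default_rng(0)
dim = 4
W = rng.standard_normal((dim, dim)) * 0.1
a = rng.standard_normal(dim)
b = rng.standard_normal(dim)
print(agreement(a, b, W))  # a single scalar; high means disagreement
```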

The operation of the network is continuous and dynamic. Both execution and learning happen at the same time and by the same principle: gradient descent on the agreement function. Say we have two cells coordinated by an agreement function. We backpropagate from the output of the agreement function through the weights and to the states of the two cells. Gradients are computed with respect to both the weights and the states. Based on these gradients, both the weights and the states are updated (but possibly at different rates, more on that later).
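A sketch of one such simultaneous update, continuing with the assumed quadratic energy E(a, b; W) = ||W a − b||² so the gradients can be written out by hand. The learning rates and names are hypothetical; the post only specifies that states and weights may descend the same energy at different rates:

```python
import numpy as np

def step(a, b, W, lr_state=0.1, lr_weight=0.01):
    """One joint gradient-descent step on E(a, b; W) = ||W a - b||^2."""
    r = W @ a - b                  # residual; E = r . r
    grad_a = 2.0 * W.T @ r         # dE/da
    grad_b = -2.0 * r              # dE/db
    grad_W = 2.0 * np.outer(r, a)  # dE/dW
    # Execution (state updates) and learning (weight updates) happen
    # together, both by descending the agreement function's gradient,
    # but at different rates.
    return (a - lr_state * grad_a,
            b - lr_state * grad_b,
            W - lr_weight * grad_W)

rng = np.random.default_rng(0)
a, b = rng.standard_normal(3), rng.standard_normal(3)
W = rng.standard_normal((3, 3)) * 0.1
e0 = float((W @ a - b) @ (W @ a - b))
for _ in range(20):
    a, b, W = step(a, b, W)
e1 = float((W @ a - b) @ (W @ a - b))  # energy after 20 joint steps
```

With small enough rates the energy shrinks each step, since every update moves against its own partial gradient of the same scalar.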

The idea here is that the energy landscape of the agreement function over the state spaces of the cells it coordinates is shaped over time such that the states of cells are already pushed in the right direction, i.e. the network is predictive. Say we have a cell whose state is the output of some embedding function of a continuous signal, like a video input. If the agreement function that governs it is poorly trained, then it does not predict the external dynamics, and the external signal pushes the cell’s state into possibly high-energy regions of the agreement function’s landscape, i.e. the network is surprised. However, if the agreement function is well trained, then its gradient will already push the state of the cell in the direction of the next frame’s embedding, minimizing surprisal.
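This surprisal picture can be sketched numerically. Assume (purely for illustration) that the next frame's embedding is a linear function T of the current state; then a well-trained agreement function is one whose weights match T, and surprisal is the energy at the incoming embedding:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
T = rng.standard_normal((dim, dim)) * 0.5  # hypothetical true frame dynamics

def surprisal(W, s):
    """Energy between what the agreement function expects (W s)
    and the embedding the external signal actually delivers (T s)."""
    nxt = T @ s          # next frame's embedding, dictated from outside
    r = W @ s - nxt
    return float(r @ r)

s = rng.standard_normal(dim)
print(surprisal(T, s))                      # well trained: W matches T
print(surprisal(np.zeros((dim, dim)), s))   # untrained: high energy
```

When the weights match the external dynamics, the incoming embedding lands where the landscape was already pointing and the energy stays at its minimum; when they do not, the external signal drags the state into a high-energy region.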

To give a perhaps motivating example for this overall idea of agreement functions: say we have two modalities, vision and audition. Visually, we see a black box. Auditorily, we hear a frog croaking. We open the box and see a frog. Are we surprised? No. Why not? Have we ever seen a frog in a black box before? I definitely haven’t. But our visual expectation was in agreement with the auditory information we were given. So our visual expectation is dictated by what we hear (and of course also the other way around). The thing is, though, that there are many situations in which the information we get visually and the information we get auditorily are simply not related, say when the source of what I hear is behind me. What matters is not that the model of the world we derive from what we hear is always in agreement with the model of the world we derive from what we see; rather, it matters that they do not disagree. The same goes for abstract information.