Conceived and designed the experiments: KJF. Performed the experiments: KJF. Analyzed the data: KJF. Contributed reagents/materials/analysis tools: KJF. Wrote the paper: KJF.
The author has declared that no competing interests exist.
This paper describes a general model that subsumes many parametric models for continuous data. The model comprises hidden layers of state-space or dynamic causal models, arranged so that the output of one provides input to another. The ensuing hierarchy furnishes a model for many types of data, of arbitrary complexity. Special cases range from the general linear model for static data to generalised convolution models, with system noise, for nonlinear time-series analysis. Crucially, all of these models can be inverted using exactly the same scheme, namely, dynamic expectation maximisation. This means that a single model and optimisation scheme can be used to invert a wide range of models. We present the model and a brief review of its inversion to disclose the relationships among, apparently, diverse generative models of empirical data. We then show that this inversion can be formulated as a simple neural network and may provide a useful metaphor for inference and learning in the brain.
Models are essential to make sense of scientific data, but they may also play a central role in how we assimilate sensory information. In this paper, we introduce a general model that generates or predicts diverse sorts of data. As such, it subsumes many common models used in data analysis and statistical testing. We show that this model can be fitted to data using a single and generic procedure, which means we can place a large array of data analysis procedures within the same unifying framework. Critically, we then show that the brain has, in principle, the machinery to implement this scheme. This suggests that the brain has the capacity to analyse sensory input using the most sophisticated algorithms currently employed by scientists and possibly models that are even more elaborate. The implications of this work are that we can understand the structure and function of the brain as an inference machine. Furthermore, we can ascribe various aspects of brain anatomy and physiology to specific computational quantities, which may help understand both normal brain function and how aberrant inferences result from pathological processes associated with psychiatric disorders.
This paper describes hierarchical dynamic models (HDMs) and reviews a generic variational scheme for their inversion. We then show that the brain has evolved the necessary anatomical and physiological equipment to implement this inversion, given sensory data. These models are general in the sense that they subsume simpler variants, such as those used in independent component analysis, through to generalised nonlinear convolution models. The generality of HDMs renders the inversion scheme a useful framework that covers procedures ranging from variance component estimation, in classical linear observation models, to blind deconvolution, using exactly the same formalism and operational equations. Critically, the nature of the inversion lends itself to a relatively simple neural network implementation that shares many formal similarities with real cortical hierarchies in the brain.
Recently, we introduced a variational scheme for model inversion (i.e., inference on
models and their parameters given data) that considers hidden states in generalised
coordinates of motion. This enabled us to derive estimation procedures that go
beyond conventional approaches to time-series analysis, like Kalman or particle
filtering. We have described two versions: variational filtering
A key aspect of
This paper comprises five sections. In the first, we introduce hierarchical dynamic
models. These cover many observation or generative models encountered in the
estimation and inference literature. An important aspect of these models is their
formulation in generalised coordinates of motion; this lends them a hierarchical form
in both structure and dynamics. These hierarchies induce empirical priors that
provide structural and dynamic constraints, which can be exploited during inversion.
In the second and third sections, we consider model inversion in general terms and
then specifically, using dynamic expectation maximisation (DEM).
To simplify notation we will use
In this section, we cover hierarchical models for dynamic systems. We start with the basic model and how generalised motion furnishes empirical priors on the dynamics of the model's hidden states. We then consider hierarchical forms and see how these induce empirical priors in a structural sense. We will try to relate these perspectives to established treatments of empirical priors in static and state-space models.
Dynamic causal models are probabilistic generative models
A dynamic input-state-output model can be written as
At this point, readers familiar with standard state-space models may be
wondering where all the extra equations in Equation 2 come from and, in
particular, what the generalised motions;
Having said this, it is possible to convert the generalised state-space model
in Equation 2 into a standard form by expressing the components of
generalised motion in terms of a standard [uncorrelated]
Markovian process,
If there is a formal equivalence between standard and generalised state-space
models, why not use the standard formulation, with a suitably high-order
approximation? The answer is that we do not need to; by retaining an
explicit formulation in generalised coordinates we can devise a simple
inversion scheme (Equation 23) that outperforms standard Markovian
techniques like Kalman filtering. This simplicity is important because we
want to understand how the brain inverts dynamic models. This requires a
relatively simple neuronal implementation that could have emerged through
natural selection. From now on, we will reserve ‘state-space
models’ (SSM) for standard
Given the form of generalised state-space models we now consider what they
entail as probabilistic models of observed signals. We can write Equation 2
compactly as
Gaussian assumptions about the fluctuations
The nodes of these graphs correspond to quantities in the model and the responses they generate. The arrows or edges indicate conditional dependencies between these quantities. The form of the models is provided, both in terms of their state-space equations (above) and in terms of the prior and conditional probabilities (below). The hierarchical structure of these models induces empirical priors; dynamical priors are mediated by the equations of generalised motion and structural priors by the hierarchical form, under which states in higher levels provide constraints on the level below.
HDMs have the following form, which generalises the
The conditional independence of the fluctuations at different hierarchical
levels means that the HDM has a Markov property over levels, which
simplifies the attendant inference schemes. See
In generalised coordinates, the precision,
The precision in generalised coordinates (left) and over discrete
samples in time (right) are shown for a roughness of
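The covariance induced on generalised motion by an analytic autocorrelation function has a simple closed form, because the covariance between the i-th and j-th temporal derivatives of a stationary process depends only on derivatives of the autocorrelation at zero lag. The following sketch (a Python illustration, not the authors' Matlab implementation; the Gaussian parameterisation ρ(h) = exp(−γh²) of the autocorrelation, with roughness γ, is an assumption) builds this covariance for the first n orders of motion; the precision is its inverse.

```python
import numpy as np
from math import factorial

def gen_coord_cov(n, gamma):
    """Covariance among the first n generalised coordinates (z, z', z'', ...)
    of a stationary process with Gaussian autocorrelation rho(h) = exp(-gamma*h^2).

    Uses rho^(2k)(0) = (2k)! (-gamma)^k / k!  (from the Taylor series of rho)
    and Cov(z^(i), z^(j)) = (-1)^i * rho^(i+j)(0), which vanishes for odd i+j."""
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if (i + j) % 2 == 0:
                k = (i + j) // 2
                S[i, j] = (-1) ** i * factorial(2 * k) * (-gamma) ** k / factorial(k)
    return S

S = gen_coord_cov(3, 0.5)
# S[0,0] = 1 (variance of z); S[1,1] = 2*gamma (variance of z');
# S[0,2] = -2*gamma (position and acceleration are anticorrelated);
# all odd-order cross-terms vanish.
```

Increasing the roughness γ inflates the variance of higher orders of motion, so rough (fast-decorrelating) fluctuations carry more informative generalised motion.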
When dealing with discrete time-series it is necessary to map the trajectory
implicit in the generalised motion of the response onto discrete samples,
We can now write down the exact form of the generative model. For dynamic
models, under Gaussian assumptions about the random terms, we have a simple
quadratic form (ignoring constants)
These constraints on the structural and dynamic form of the system are
specified by the functions
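Under Gaussian assumptions, this quadratic form is simply a precision-weighted sum of squared prediction errors, plus a log-determinant term from the normalisation of the Gaussian. A minimal numerical sketch (the two-component error vector and unit precisions below are arbitrary illustrative values, not quantities from the paper):

```python
import numpy as np

def internal_energy(eps, Pi):
    """Gaussian internal energy (log-joint, ignoring constants):
    U = -1/2 * eps' Pi eps + 1/2 * ln|Pi|,
    where eps stacks the prediction errors (on the response, the motion of
    the hidden states and the prior on the causes) and Pi is their precision."""
    sign, logdet = np.linalg.slogdet(Pi)
    return -0.5 * eps @ Pi @ eps + 0.5 * logdet

# hypothetical two-component prediction error with unit precisions
eps = np.array([0.3, -0.1])
Pi = np.eye(2)
U = internal_energy(eps, Pi)   # -0.5 * (0.09 + 0.01) = -0.05
```

Maximising U with respect to the unknown states is therefore equivalent to minimising precision-weighted prediction error, which is the form the inversion schemes below exploit.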
In this section, we have introduced hierarchical dynamic models in generalised coordinates of motion. These models are about as complicated as one could imagine; they comprise causes and hidden states, whose dynamics can be coupled with arbitrary (analytic) nonlinear functions. Furthermore, these states can have random fluctuations with unknown amplitude and arbitrary (analytic) autocorrelation functions. A key aspect of the model is its hierarchical form, which induces empirical priors on the causes. These recapitulate the constraints on hidden states, furnished by the hierarchy implicit in generalised motion. We now consider how these models are inverted.
This section considers variational inversion of models under mean-field and
Laplace approximations, with a special focus on HDMs. This treatment provides a
heuristic summary of the material in
The objective is to optimise
Invoking an arbitrary density,
In this dynamic setting
These equations provide closed-form expressions for the conditional or variational density in terms of the internal energy defined by our model (Equation 10). They are intuitively sensible, because the conditional density of the states should reflect the instantaneous energy (Equation 17), whereas the conditional density of the parameters can only be determined after all the data have been observed (Equation 18). In other words, the variational energy involves the prior energy and an integral of time-dependent energy. In the absence of data, when the integrals are zero, the conditional density reduces to the prior density.
If the analytic forms of Equations 17 and 18 were tractable (e.g., through the
use of conjugate priors),
Under the Laplace approximation, the marginals of the conditional density
assume a Gaussian form
The advantage of the Laplace assumption is that the conditional covariance is
a simple function of the modes. Under the Laplace assumption, the internal
and variational actions are (ignoring constants)
By differentiating Equation 19 with respect to the covariances and solving
for zero, it is easy to show that the conditional precisions are the
negative curvatures of the internal action
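This identity between conditional precision and negative curvature can be checked numerically in the simplest case, a static, one-dimensional Gaussian log-joint, for which the Laplace approximation is exact. A sketch, with an assumed quadratic internal energy (the precision and mode below are arbitrary):

```python
# For a quadratic (Gaussian) log-joint U(theta) = -0.5 * p * (theta - m)**2,
# the Laplace approximation recovers the precision p as the negative
# curvature -d^2U/dtheta^2 evaluated at the mode m.
p_true, m = 4.0, 1.0
U = lambda th: -0.5 * p_true * (th - m) ** 2

h = 1e-4
curvature = (U(m + h) - 2 * U(m) + U(m - h)) / h ** 2   # numerical 2nd derivative
p_laplace = -curvature
```

For non-quadratic internal energies the same construction gives the approximate conditional precision at the mode, which is all the fixed-form scheme requires.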
The Laplace approximation gives a compact and simple form for the conditional
precisions and reduces the problem of inversion to finding the conditional
modes. This generally proceeds in a series of iterated steps, in which the
mode of each parameter set is updated. These updates optimise the
variational actions in Equation 19 with respect to
As with conventional variational schemes, we can update the modes of our three
parameter sets in three distinct steps. However, the step dealing with the state
In static systems, the mode of the conditional density maximises variational
energy, such that ∂V/∂μ = 0.
Another way of looking at this is to consider the problem of finding the path
of the conditional mode. However, the mode is in generalised coordinates and
already encodes its path. This means we have to optimise the path of the
mode subject to the constraint that
Equation 23 prescribes the trajectory of the conditional mode, which can be
realised with a local linearization
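The local linearisation can be sketched in the scalar case, where the update Δμ = (exp(JΔt) − I)J⁻¹ μ̇ reduces to ordinary exponentials (this is an illustrative Python sketch, not the paper's implementation; the multivariate case replaces exp(JΔt) with a matrix exponential, and the finite-difference Jacobian is a convenience rather than part of the scheme):

```python
import numpy as np

def local_lin_step(f, x, dt, h=1e-6):
    """One local-linearisation update for a scalar ODE dx/dt = f(x):
    Delta x = (exp(J*dt) - 1)/J * f(x), with Jacobian J = df/dx at x
    (assumed nonzero here)."""
    J = (f(x + h) - f(x - h)) / (2 * h)   # finite-difference Jacobian
    return x + (np.exp(J * dt) - 1.0) / J * f(x)

# for a linear flow f(x) = -x the update is exact: x -> x * exp(-dt)
x1 = local_lin_step(lambda x: -x, 2.0, 0.5)
```

Because the exponential of the Jacobian is used, the step is exact for locally linear flows, which makes the integration stable for larger step sizes than a simple Euler update.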
Exactly the same update procedure can be used for the
These steps represent a full variational scheme. A simplified version, which
discounts uncertainty about the parameters and states in the
These updates furnish a variational scheme under the Laplace approximation.
To further simplify things, we will assume
In this section, we have seen how the inversion of dynamic models can be formulated as an optimization of action. This action is the anti-derivative or path-integral of free-energy associated with changing states and a constant (of integration) corresponding to the prior energy of time-invariant parameters. By assuming a fixed-form (Laplace) approximation to the conditional density, one can reduce optimisation to finding the conditional modes of unknown quantities, because their conditional covariance is simply the curvature of the internal action (evaluated at the mode). The conditional modes of (mean-field) marginals optimise variational action, which can be framed in terms of gradient ascent. For the states, this entails finding a path or trajectory with stationary variational action. This can be formulated as a gradient ascent in a frame of reference that moves along the path encoded in generalised coordinates.
In this section, we review the model and inversion scheme of the previous section in
light of established procedures for supervised and self-supervised learning. This
section considers HDMs from the pragmatic point of view of statistics and machine
learning, where the data are empirical and arrive as discrete data sequences. In the
next section, we revisit these models and their inversion from the point of view of
the brain, where the data are sensory and continuous. This section aims to establish
the generality of HDMs by showing that many well-known approaches to data can be
cast as inverting an HDM under simplifying assumptions. It recapitulates the
unifying perspective of Roweis and Ghahramani
All the schemes described in this paper are available in Matlab code as academic
freeware (
In these models the causes are known and enter as priors
Usually, supervised learning entails learning the parameters of static
nonlinear generative models with known causes. This corresponds to a HDM
with infinitely precise priors at the last level, any number of subordinate
levels (with no hidden states)
In nonlinear optimisation, we want to identify the parameters of a static,
nonlinear function that maps known causes to responses. This is a trivial
case of the static model above that obtains when the hierarchical order
reduces to
Consider the linear model, with a response that has been elicited using known
causes,
If we have flat priors on the parameters, Π
It is interesting to note that transposing the general linear model is
equivalent to switching the roles of the causes and parameters;
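For illustration, with flat priors the conditional mode of the parameters of a linear model reduces to the familiar ordinary least-squares estimator. A minimal numerical sketch (the design matrix and parameter values are simulated for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))      # known causes (design matrix)
beta = np.array([1.0, -2.0, 0.5])      # parameters to recover
y = X @ beta + 0.01 * rng.standard_normal(100)

# With flat priors on the parameters, the conditional mode is the
# ordinary least-squares estimator: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

Transposing the model simply exchanges the roles of the design matrix and the parameters, as noted above, without changing the form of the estimator.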
In the identification of nonlinear dynamic systems, one tries to characterise
the architecture that transforms known inputs into measured outputs. This
transformation is generally modelled as a generalised convolution
In these models, the parameters are known and enter as priors
In static systems, the problem reduces to estimating the causes of inputs
after they are passed through some linear or nonlinear mapping to generate
observed responses. For simple nonlinear estimation, in the absence of prior
expectations about the causes, we have the nonlinear hierarchical model
When the model above is linear, we have the ubiquitous hierarchical linear
observation model used in Parametric Empirical Bayes (
The inversion was cross-validated with expectation maximization (EM), where the M-step corresponds to restricted maximum likelihood (ReML). This example used a simple two-level model that embodies empirical shrinkage priors on the first-level parameters. These models are also known as parametric empirical Bayes (PEB) models (left). Causes were sampled from the unit normal density to generate a response, which was used to recover the causes, given the parameters. Slight differences in the hyperparameter estimates (upper right), due to a different hyperparameterisation, have little effect on the conditional means of the unknown causes (lower right), which are almost indistinguishable.
When there are many more causes than observations, a common device is to
eliminate the causes in Equation 40 by recursive substitution to give a
model that generates sample covariances and is formulated in terms of
covariance components (i.e., hyperparameters).
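This recursive substitution can be illustrated numerically: eliminating the causes leaves a model of the sample covariance, expressed as a mixture of covariance components whose weights play the role of hyperparameters (the mixing matrix, dimensions and noise level below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d = 500, 4, 8
Lam = rng.standard_normal((d, m))   # mixing matrix (parameters)
v = rng.standard_normal((m, n))     # causes, unit variance
y = Lam @ v + 0.1 * rng.standard_normal((d, n))

# after eliminating the causes, the model generates the sample covariance:
# cov(y) = Lam Lam' + sigma^2 I, a mixture of covariance components
S = (y @ y.T) / n
Sigma = Lam @ Lam.T + 0.01 * np.eye(d)
err = np.linalg.norm(S - Sigma) / np.linalg.norm(Sigma)
```

Estimating the weights on such components from the sample covariance is exactly the variance-component (ReML) problem referred to above.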
The model in Equation 41 is also referred to as a Gaussian process model
In deconvolution problems, the objective is to estimate the inputs to a
dynamic system given its response and parameters.
State-space models have the following form in discrete time and rest on a
vector autoregressive (
Deconvolution under HDMs is related to Bayesian approaches to inference on
states using Bayesian belief update procedures (i.e., incremental or
recursive Bayesian filters). The conventional approach to online Bayesian
tracking of nonlinear or non-Gaussian systems employs extended Kalman
filtering
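For comparison, the conventional predict/update recursion can be written down in a few lines. This is a standard linear Kalman filter, shown here only as the discrete-time Markovian baseline against which the generalised-coordinate scheme is contrasted; it is not part of the HDM inversion itself.

```python
import numpy as np

def kalman_step(mu, P, y, A, C, Q, R):
    """One predict/update step of a standard Kalman filter for the
    discrete-time model x_t = A x_{t-1} + w,  y_t = C x_t + e."""
    # predict
    mu_p = A @ mu
    P_p = A @ P @ A.T + Q
    # update with the new observation y
    S = C @ P_p @ C.T + R                 # innovation covariance
    K = P_p @ C.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_p + K @ (y - C @ mu_p)
    P_new = (np.eye(len(mu)) - K @ C) @ P_p
    return mu_new, P_new

# one-dimensional random walk observed in noise
mu, P = np.zeros(1), np.eye(1)
mu, P = kalman_step(mu, P, np.array([1.0]),
                    np.eye(1), np.eye(1), 0.01 * np.eye(1), np.eye(1))
```

The update shrinks the predicted state towards the observation in proportion to the gain, and reduces the conditional uncertainty accordingly.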
In terms of establishing the generality of the HDM, it is sufficient to note
that Bayesian filters simply estimate the conditional density on the hidden
states of a HDM. As intimated in the
In all the examples below, both the parameters and states are unknown. This
entails a dual or triple estimation problem, depending on whether the
hyperparameters are known. We will start with simple static models and work
towards more complicated dynamic variants. See
The Principal Components Analysis (
The model for factor analysis is exactly the same as for
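The linear generative model common to PCA and factor analysis can be sketched as follows. This is a standard SVD-based PCA, not the paper's DEM inversion, and all dimensions and noise levels are illustrative; it shows that the leading eigenvectors of the sample covariance recover the subspace spanned by the unknown mixing matrix, up to rotation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 1000, 5, 2
Lam = rng.standard_normal((d, k))            # unknown mixing (parameters)
v = rng.standard_normal((k, n))              # unknown causes
y = Lam @ v + 0.05 * rng.standard_normal((d, n))

# PCA: the leading eigenvectors of the sample covariance span (nearly)
# the same subspace as the columns of Lam, as the noise variance -> 0
U, s, _ = np.linalg.svd(y @ y.T / n)
subspace = U[:, :k]
resid = np.linalg.norm(subspace @ subspace.T @ Lam - Lam) / np.linalg.norm(Lam)
```

The rotational ambiguity noted in the figure legend below is visible here: PCA identifies the subspace but not the individual causes, which is why independence constraints (as in ICA) or priors are needed to resolve it.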
Parameters and causes were sampled from the unit normal density to
generate a response, which was then used for their estimation. The
aim was to recover the causes without knowing the parameters, which
is effected with reasonable accuracy (upper). The conditional
estimates of the causes and parameters are shown in lower panels,
along with the increase in free-energy or log-evidence, with the
number of DEM iterations (lower left). Note that there is an
arbitrary affine mapping between the conditional means of the causes
and their true values, which we estimated,
Independent component analysis (
In the same way that factor analysis is a generalisation of
Blind deconvolution tries to estimate the causes of an observed response
without knowing the parameters of the dynamical system producing it. This
represents the least constrained problem we consider and calls upon the same
HDM used for system identification. An empirical example of triple
estimation of states, parameters and hyperparameters can be found in
(Table: the generative model used in this example, specified level by level in terms of its equations and precisions Π, the latter hyperparameterised as exp(·) of unknown log-precisions.)
In this model, causes or inputs perturb the hidden states, which decay
exponentially to produce an output that is a linear mixture of hidden
states. Our example used a single input, two hidden states and four outputs.
To generate data, we used a deterministic Gaussian bump function input
In this model, a simple Gaussian ‘bump’ function
acts as a cause to perturb two coupled hidden states. Their dynamics
are then projected to four response variables, whose time-courses
are cartooned on the left. This figure also summarises the
architecture of the implicit inversion scheme (right), in which
precision-weighted prediction errors drive the conditional modes to
optimise variational action. Critically, the prediction errors
propagate their effects up the hierarchy (c.f., Bayesian belief
propagation or message passing), whereas the predictions are passed
down the hierarchy. This sort of scheme can be implemented easily in
neural networks (see last section and
Each row corresponds to a level, with causes on the left and hidden states on the right. In this case, the model has just two levels. The first (upper left) panel shows the predicted response and the error on this response (their sum corresponds to the observed data). For the hidden states (upper right) and causes (lower left) the conditional mode is depicted by a coloured line and the 90% conditional confidence intervals by the grey area. These are sometimes referred to as “tubes”. Finally, the grey lines depict the true values used to generate the response. Here, we estimated the hyperparameters, parameters and the states. This is an example of triple estimation, where we are trying to infer the states of the system as well as the parameters governing its causal architecture. The hyperparameters correspond to the precision of random fluctuations in the response and the hidden states. The free parameters correspond to a single parameter from the state equation and one from the observer equation that govern the dynamics of the hidden states and response, respectively. It can be seen that the true value of the causal state lies within the 90% confidence interval and that we could infer with substantial confidence that the cause was non-zero, when it occurs. Similarly, the true parameter values lie within fairly tight confidence intervals (red bars in the lower right).
This section has tried to show that the HDM encompasses many standard static
and dynamic observation models. It is further evident that many of these
models could be extended easily within the hierarchical framework.
This ontology is one of many that could be constructed and is based on the fact that hierarchical dynamic models have several attributes that can be combined to create an infinite number of models, some of which are shown in the figure. These attributes include: (i) the number of levels or depth; (ii) for each level, linear or nonlinear output functions; (iii) with or without random fluctuations; (iv) static or dynamic; (v) for dynamic levels, linear or nonlinear equations of motion; (vi) with or without state noise; and, finally, (vii) with or without generalised coordinates.
In summary, we have seen that endowing dynamical models with a hierarchical
architecture provides a general framework that covers many models used for
estimation, identification and unsupervised learning. A hierarchical
structure, in conjunction with nonlinearities, can emulate non-Gaussian
behaviours, even when random effects are Gaussian. In a dynamic context, the
level at which the random effects enter controls whether the system is
deterministic or stochastic and nonlinearities determine whether their
effects are additive or multiplicative.
In this final section, we revisit
A key architectural principle of the brain is its hierarchical organisation
The hierarchical structure of the brain speaks to hierarchical models of sensory input. We now consider how this functional architecture can be understood under the inversion of HDMs by the brain. We first consider inference on states or perception.
If we assume that the activity of neurons encode the conditional mode of
states, then the
If we unpack these equations we can see the hierarchical nature of this
message passing (see
This schematic shows the speculative cells of origin of forward driving connections that convey prediction error from a lower area to a higher area and the backward connections that are used to construct predictions. These predictions try to explain away input from lower areas by suppressing prediction error. In this scheme, the sources of forward connections are the superficial pyramidal cell population and the sources of backward connections are the deep pyramidal cell population. The differential equations relate to the optimisation scheme detailed in the main text and their constituent terms are placed alongside the corresponding connections. The state-units and their efferents are in black and the error-units in red, with causes on the left and hidden states on the right. For simplicity, we have assumed the output of each level is a function of, and only of, the hidden states. This induces a hierarchy over levels and, within each level, a hierarchical relationship between states, where hidden states predict causes.
The connections from error to state-units have a simple form that depends on
the gradients of the model's functions; from Equation 12
We can identify error-units with superficial pyramidal cells, because the
only messages that pass up the hierarchy are prediction errors and
superficial pyramidal cells originate forward connections in the brain. This
is useful because it is these cells that are primarily responsible for
electroencephalographic (EEG) signals that can be measured non-invasively.
Similarly, the only messages that are passed down the hierarchy are the
predictions from state-units that are necessary to form prediction errors in
lower levels. The sources of extrinsic backward connections are largely the
deep pyramidal cells and one might deduce that these encode the expected
causes of sensory states (see
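The message passing described here can be caricatured with a two-level linear model, in which a single state unit and two error units implement the recognition dynamics. This is a deliberately minimal sketch: the mapping, precisions, prior and learning rate below are arbitrary illustrative values, and the generative mapping is assumed known.

```python
theta = 2.0              # generative mapping (assumed known for this sketch)
pi_y, pi_v = 1.0, 1.0    # precisions of sensory noise and of the prior
eta = 0.0                # top-down prediction (prior expectation of the cause)
y = 4.0                  # sensory datum

mu = 0.0                 # state unit: conditional mode of the cause
for _ in range(200):
    eps_y = y - theta * mu        # sensory error unit (forward message)
    eps_v = mu - eta              # prior error unit
    # the state unit ascends the gradient of variational energy, driven by
    # the precision-weighted error from below minus the error on its prior
    mu += 0.05 * (theta * pi_y * eps_y - pi_v * eps_v)

# the fixed point is the MAP estimate:
# mu* = (theta*pi_y*y + pi_v*eta) / (theta**2*pi_y + pi_v) = 1.6
```

At convergence both prediction errors are as small as the precisions allow, which is the sense in which recognition suppresses prediction error throughout the hierarchy.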
This schematic shows how the neuronal populations of the previous figure may be deployed hierarchically within three cortical areas (or macro-columns). Within each area the cells are shown in relation to the laminar structure of the cortex, which includes supra-granular (SG), granular (L4) and infra-granular (IG) layers.
Equation 51 is cast in terms of generalised states. This suggests that the
brain has an explicit representation of generalised motion. In other words,
there are separable neuronal codes for different orders of motion. This is
perfectly consistent with empirical evidence for distinct populations of
neurons encoding elemental visual features and their motion (e.g.,
motion-sensitive area V5;
When dealing with empirical data-sequences one has to contend with sparse and
discrete sampling. Analogue systems, like the brain, can sample generalised
motion directly. When sampling sensory data, one can imagine easily how
receptors generate
The conditional expectations of the parameters,
If synaptic efficacy encodes the parameter estimates, we can cast parameter
optimisation as changing synaptic connections. These changes have a
relatively simple form that is recognisable as associative plasticity. To
show this, we will make the simplifying but plausible assumption that the
brain's generative model is based on nonlinear functions
Equation 51 shows that the influence of prediction error is scaled by its
precision
Equation 51 formulates this bias or gain-control in terms of lateral
connections,
As above, changes in
The mean-field approximation
We have seen that the brain has, in principle, the infrastructure needed to
invert hierarchical dynamic models of the sort considered in previous
sections. It is perhaps remarkable that such a comprehensive treatment of
generative models can be reduced to recognition dynamics that are as simple
as Equation 51. Having said this, the notion that the brain inverts
hierarchical models, using a
The hierarchical organisation of cortical areas (c.f.,
Each area comprises distinct neuronal subpopulations, encoding
expected states of the world and prediction error (c.f.,
Extrinsic forward connections convey prediction error (from
superficial pyramidal cells) and backward connections mediate
predictions, based on hidden and causal states (from deep pyramidal
cells)
Recurrent dynamics are intrinsically stable because they are trying
to suppress prediction error
Functional asymmetries in forwards (linear) and backwards (nonlinear)
connections may reflect their distinct roles in recognition (c.f.,
Principal cells elaborating predictions (e.g., deep pyramidal cells) may show distinct (low-pass) dynamics, relative to those encoding error (e.g., superficial pyramidal cells)
Lateral interactions may encode the relative precision of prediction
errors and change in a way that is consistent with classical
neuromodulation (c.f.,
The rescaling of prediction errors by recurrent connections, in
proportion to their precision, affords a form of cortical bias or
gain control
The dynamics of plasticity and modulation of lateral interactions
encoding precision or uncertainty (which optimise a path-integral of
variational energy) must be slower than the dynamics of neuronal
activity (which optimise variational energy
Neuronal activity, synaptic efficacy and neuromodulation must all affect each other; activity-dependent plasticity and neuromodulation shape neuronal responses and:
Neuromodulatory factors play a dual role in modulating postsynaptic
responsiveness (e.g., through modulating in after-hyperpolarising
currents) and synaptic plasticity
These observations pertain to the anatomy and physiology of neuronal
architectures; see Friston et al.
We have tried to establish the generality of HDMs as a model that may be used
by the brain. However, there are many alternative formulations that could be
considered. Perhaps the work of Archambeau et al.
Clearly, the theoretical treatment of this section calls for an enormous
amount of empirical verification and hypothesis testing, not least to
disambiguate among alternative theories and architectures. We have laid out
the neurobiological and psychophysical motivation for the neuronal
implementation of
We are now in a position to revisit some of the basic choices behind the
The second choice was to use a fixed-form for
The third choice was a mean-field approximation;
The fourth assumption was that the fixed-form of
The final choice was to include generalised motion under
It is interesting to consider
In summary, any generic inversion scheme needs to induce a lower-bound on the
log-evidence by invoking an approximating conditional density
In conclusion, we have seen how the inversion of a fairly generic hierarchical and dynamical model of sensory inputs can be transcribed onto neuronal quantities that optimise a variational bound on the evidence for that model. This optimisation corresponds, under some simplifying assumptions, to the suppression of prediction error at all levels of a cortical hierarchy. This suppression rests upon a balance between bottom-up (prediction error) and top-down (empirical prior) influences, weighted by representations of their precision (uncertainty). These representations may be mediated by classical neuromodulatory effects and slow postsynaptic cellular processes that are driven by overall levels of prediction error.
The ideas presented in this paper have a long history, starting with the notion
of neuronal energy
I would like to thank my colleagues for invaluable help in formulating these ideas, Pedro Valdes-Sosa for guidance on the relationship between standard and generalised state-space models, and the three reviewers for helpful advice and challenging comments.