MINI REVIEW article

Solving the Credit Assignment Problem With the Prefrontal Cortex

Alexandra Stolyarova*

  • Department of Psychology, University of California, Los Angeles, Los Angeles, CA, United States

In naturalistic multi-cue and multi-step learning tasks, where outcomes of behavior are delayed in time, discovering which choices are responsible for rewards can present a challenge, known as the credit assignment problem. In this review, I summarize recent work that highlighted a critical role for the prefrontal cortex (PFC) in assigning credit where it is due in tasks where only a few of the multitude of cues or choices are relevant to the final outcome of behavior. Collectively, these investigations have provided compelling support for specialized roles of the orbitofrontal (OFC), anterior cingulate (ACC), and dorsolateral prefrontal (dlPFC) cortices in contingent learning. However, recent work has similarly revealed shared contributions and emphasized rich and heterogeneous response properties of neurons in these brain regions. Such functional overlap is not surprising given the complexity of reciprocal projections spanning the PFC. In the concluding section, I overview the evidence suggesting that the OFC, ACC and dlPFC communicate extensively, sharing the information about presented options, executed decisions and received rewards, which enables them to assign credit for outcomes to choices on which they are contingent. This account suggests that lesion or inactivation/inhibition experiments targeting a localized PFC subregion will be insufficient to gain a fine-grained understanding of credit assignment during learning and instead poses refined questions for future research, shifting the focus from focal manipulations to experimental techniques targeting cortico-cortical projections.

Introduction

When an animal is introduced to an unfamiliar environment, it will explore the surroundings randomly until an unexpected reward is encountered. Reinforced by this experience, the animal will gradually learn to repeat those actions that produced the desired outcome. The work conducted in the past several decades has contributed a detailed understanding of the psychological and neural mechanisms that support such reinforcement-driven learning ( Schultz and Dickinson, 2000 ; Schultz, 2004 ; Niv, 2009 ). It is now broadly accepted that dopamine (DA) signaling conveys prediction errors, or the degree of surprise brought about by unexpected rewards, and interacts with cortical and basal ganglia circuits to selectively reinforce the advantageous choices ( Schultz, 1998a , b ; Schultz and Dickinson, 2000 ; Niv, 2009 ). Yet, in naturalistic settings, where rewards are delayed in time, and where multiple cues are encountered, or where several decisions are made before the outcomes of behavior are revealed, discovering which choices are responsible for rewards can present a challenge, known as the credit assignment problem ( Mackintosh, 1975 ; Rothkopf and Ballard, 2010 ).

In most everyday situations, the rewards are not immediate consequences of behavior, but instead appear after substantial delays. To influence future choices, the teaching signal conveyed by DA release needs to reinforce synaptic events occurring on a millisecond timescale, frequently seconds before the outcomes of decisions are revealed ( Izhikevich, 2007 ; Fisher et al., 2017 ). This apparent difficulty in linking preceding behaviors caused by transient neuronal activity to delayed feedback has been termed the distal reward or temporal credit assignment problem ( Hull, 1943 ; Barto et al., 1983 ; Sutton and Barto, 1998 ; Dayan and Abbott, 2001 ; Wörgötter and Porr, 2005 ). Credit for a reward delayed by several seconds can frequently be assigned by establishing an eligibility trace, a molecular memory of the recent neuronal activity, allowing modification of synaptic connections that participated in the behavior ( Pan et al., 2005 ; Fisher et al., 2017 ). On longer timescales, or when multiple actions need to be performed sequentially to reach a final goal, intermediate steps themselves can acquire motivational significance and subsequently reinforce preceding decisions, such as in temporal-difference (TD) learning models ( Sutton and Barto, 1998 ).

Several excellent reviews have summarized the accumulated knowledge on mechanisms that link choices and their outcomes through time, highlighting the advantages of eligibility traces and TD models ( Wörgötter and Porr, 2005 ; Barto, 2007 ; Niv, 2009 ; Walsh and Anderson, 2014 ). Yet these solutions to the distal reward problem can impede learning in multi-choice tasks, or when an animal is presented with many irrelevant stimuli prior to or during the delay. Here, I only briefly overview the work on the distal reward problem to highlight potential complications that can arise in credit assignment based on eligibility traces when learning in multi-cue environments. Instead, I focus on the structural (or spatial ) credit assignment problem, requiring animals to select and learn about the most meaningful features in the environment and ignore irrelevant distractors. Collectively, the reviewed evidence highlights a critical role for the prefrontal cortex (PFC) in such contingent learning.

Recent studies have provided compelling support for specialized functions of the orbitofrontal (OFC) and dorsolateral prefrontal (dlPFC) cortices in credit assignment in multi-cue tasks, with fewer experiments targeting the anterior cingulate cortex (ACC). For example, it has been suggested that the dlPFC aids reinforcement-driven learning by directing attention to task-relevant cues ( Niv et al., 2015 ), the OFC assigns credit for rewards based on the causal relationship between trial outcomes and choices ( Jocham et al., 2016 ; Noonan et al., 2017 ), whereas the ACC contributes to unlearning of action-outcome associations when the rewards are available for free ( Jackson et al., 2016 ). However, this work has similarly revealed shared contributions and emphasized rich and heterogeneous response properties of neurons in the PFC, with different subregions monitoring and integrating the information about the task (i.e., current context, available options, anticipated rewards, as well as delay and effort costs) at variable times within a trial (upon stimulus presentation, action selection, outcome anticipation, and feedback monitoring; ex., Hunt et al., 2015 ; Khamassi et al., 2015 ). In the concluding section, I overview the evidence suggesting that contingent learning in multi-cue environments relies on dynamic cortico-cortical interactions during decision making and outcome valuation.

Solving the Temporal Credit Assignment Problem

When outcomes follow choices after short delays (Figure 1A ), the credit for distal rewards can frequently be assigned by establishing an eligibility trace, a sustained memory of the recent activity that renders synaptic connections malleable to modification over several seconds. Eligibility traces can persist as elevated levels of calcium in dendritic spines of post-synaptic neurons ( Kötter and Wickens, 1995 ) or as sustained neuronal activity throughout the delay period ( Curtis and Lee, 2010 ) to allow for synaptic changes in response to reward signals. Furthermore, spike-timing dependent plasticity can be influenced by neuromodulator input ( Izhikevich, 2007 ; Abraham, 2008 ; Fisher et al., 2017 ). For example, the magnitude of short-term plasticity can be modulated by DA, acetylcholine and noradrenaline, which may even reverse the sign of the synaptic change ( Matsuda et al., 2006 ; Izhikevich, 2007 ; Seol et al., 2007 ; Abraham, 2008 ; Zhang et al., 2009 ). Sustained neural activity has been observed in the PFC and striatum ( Jog et al., 1999 ; Pasupathy and Miller, 2005 ; Histed et al., 2009 ; Kim et al., 2009 , 2013 ; Seo et al., 2012 ; Her et al., 2016 ), as well as in sensory cortices after experience with consistent pairings between the stimuli and outcomes separated by predictable delays ( Shuler and Bear, 2006 ).
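
As a concrete (and deliberately simplified) illustration of this mechanism, the sketch below implements a reward-modulated plasticity rule with a decaying eligibility trace: coincident pre- and post-synaptic activity tags a synapse, the tag decays over seconds, and a later non-specific reward signal converts surviving tags into lasting weight changes. All parameter values are hypothetical and chosen only for readability.

```python
import numpy as np

# Minimal sketch of dopamine-gated learning with eligibility traces.
# Parameter values are illustrative, not fitted to data.
rng = np.random.default_rng(0)

n_synapses = 10
w = np.zeros(n_synapses)        # synaptic weights
trace = np.zeros(n_synapses)    # eligibility traces ("tags")

dt = 0.01                       # simulation step, seconds
tau_trace = 1.0                 # trace decay time constant, seconds
lr = 0.5                        # learning rate applied when reward arrives
reward_time = 2.0               # reward delivered ~2 s after the choice

for step in range(int(3.0 / dt)):
    t = step * dt

    # Pre/post coincidences occur only briefly around the time of choice (t < 0.1 s).
    coincident = (t < 0.1) * (rng.random(n_synapses) < 0.3)

    # Tag active synapses, then let the tags decay exponentially.
    trace += coincident.astype(float)
    trace -= (dt / tau_trace) * trace

    # A delayed, non-specific reward converts surviving traces into weight
    # changes, crediting synapses that were active seconds earlier.
    if abs(t - reward_time) < dt / 2:
        w += lr * trace

print(np.round(w, 3))
```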


Figure 1. Example tasks highlighting the challenge of credit assignment and learning strategies enabling animals to solve this problem. (A) An example of a distal reward task that can be successfully learned with eligibility traces and TD rules, where intermediate choices can acquire motivational significance and subsequently reinforce preceding decisions (ex., Pasupathy and Miller, 2005 ; Histed et al., 2009 ). (B) In this version of the task, multiple cues are present at the time of choice, only one of which is meaningful for obtaining rewards. After a brief presentation, the stimuli disappear, requiring an animal to solve a complex structural and temporal credit assignment problem (ex., Noonan et al., 2010 , 2017 ; Niv et al., 2015 ; Asaad et al., 2017 ; while the schematic of the task captures the challenge of credit assignment, note that in some experimental variants of the behavioral paradigm stimuli disappeared before an animal revealed its choice, whereas in others the cues remained on the screen until the trial outcome was revealed). Under such conditions, learning based on eligibility traces is suboptimal, as non-specific reward signals can reinforce visual cues that did not meaningfully contribute to, but occurred close in time to, beneficial outcomes of behavior. (C) On reward tasks similar to the one shown in (B), the impact of previous decisions and associated rewards on current behavior can be assessed by performing regression analyses ( Jocham et al., 2016 ; Noonan et al., 2017 ). Here, the color of each cell in the matrix represents the magnitude of the effect of short-term choice and outcome histories, up to 4 trials into the past (red: strong influence; blue: weak influence on the current decision). Top: an animal learning based on the causal relationship between outcomes and choices (i.e., contingent learning). Middle: each choice is reinforced by a combined history of rewards (i.e., decisions are repeated if beneficial outcomes occur frequently). Bottom: the influence of recent rewards spreads to unrelated choices.
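
A simplified, one-dimensional version of the history-regression analysis summarized in panel (C) can be sketched as follows; the simulated task, the "contingent" agent, and all parameters are hypothetical and serve only to show the logic of the analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Regress the current choice on choice-by-outcome history from the past 4 trials.
rng = np.random.default_rng(1)
n_trials, n_lags = 2000, 4
p_reward = np.array([0.7, 0.3])                  # option 0 is better

choices = np.zeros(n_trials, dtype=int)
rewards = np.zeros(n_trials, dtype=int)
for t in range(n_trials):
    if t > 0 and rewards[t - 1] == 1:
        choices[t] = choices[t - 1]              # contingent learner: repeat a rewarded choice
    else:
        choices[t] = rng.integers(2)             # otherwise explore
    rewards[t] = int(rng.random() < p_reward[choices[t]])

# Signed regressors: (choice coded as -1/+1) x (rewarded or not) at each lag.
X = [[(2 * choices[t - k] - 1) * rewards[t - k] for k in range(1, n_lags + 1)]
     for t in range(n_lags, n_trials)]
y = choices[n_lags:]

coef = LogisticRegression().fit(np.array(X), y).coef_[0]
print(np.round(coef, 2))    # a contingent learner loads credit onto the most recent lag
```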

On extended timescales, when multiple actions need to be performed sequentially to reach a final goal, the distal reward problem can be solved by assigning motivational significance to intermediate choices that can subsequently reinforce preceding decisions, such as in TD learning models ( Montague et al., 1996 ; Sutton and Barto, 1998 ; Barto, 2007 ). Assigning values to these intervening steps according to expected future rewards allows complex temporal credit assignment problems to be broken into smaller and easier tasks. There is ample evidence for TD learning in humans and other animals that on the neural level is supported by transfer of DA responses from the time of reward delivery to preceding cues and actions ( Montague et al., 1996 ; Schultz, 1998a , b ; Walsh and Anderson, 2014 ).
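
A minimal temporal-difference sketch (with purely illustrative parameters) shows how value propagates backward from a rewarded terminal state to the intermediate steps of a sequence, so that earlier choices come to be reinforced by states that have themselves acquired value.

```python
import numpy as np

# TD(0) value learning on a simple 5-step chain that ends in reward.
n_states, alpha, gamma = 5, 0.1, 0.95
V = np.zeros(n_states + 1)              # V[n_states] is the terminal state (value 0)

for episode in range(200):
    for s in range(n_states):
        reward = 1.0 if s == n_states - 1 else 0.0
        # TD error: reward plus discounted value of the next state, minus current value.
        delta = reward + gamma * V[s + 1] - V[s]
        V[s] += alpha * delta

print(np.round(V[:n_states], 2))        # intermediate states acquire value, earliest states last
```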

Both TD learning and eligibility traces offer elegant solutions to the distal reward problem, and models based on cooperation between these two mechanisms can predict animal behavior as well as neuronal responses to rewards and predictive stimuli ( Pan et al., 2005 ; Bogacz et al., 2007 ). Yet assigning credit based on eligibility traces can be suboptimal when an animal interacts with many irrelevant stimuli prior to or during the delay (Figure 1B ). Under such conditions, sensory areas remain responsive to distracting stimuli, and the arrival of non-specific reward signals can reinforce intervening cues that did not meaningfully contribute to, but occurred close in time to, the outcome of behavior ( FitzGerald et al., 2013 ; Xu, 2017 ).

The Role of the PFC in Structural Credit Assignment

Several recent studies have investigated the neural mechanisms of appropriate credit assignment in challenging tasks where only a few of the multitude of cues predict rewards reliably. Collectively, this work has provided compelling support for causal contributions of the PFC to structural credit assignment. For example, Asaad et al. (2017) examined the activity of neurons in monkey dlPFC while subjects were performing a delayed learning task. The arrangement of the stimuli varied randomly between trials and within each block either the spatial location or stimulus identity was relevant for solving the task. The monkeys' goal was to learn by trial-and-error to select one of the four options that led to rewards according to current rules. When stimulus identity was relevant for solving the task, neural activity in the dlPFC at the time of feedback reflected both the relevant cue (regardless of its spatial location) and the trial outcome, thus integrating the information necessary for credit assignment. Such responses were strategy-selective: these neurons did not encode cue identity at the time of feedback when it was not necessary for learning in the spatial location task, in which making a saccade to the same position on the screen was reinforced within a block of trials. Previous research has similarly indicated that neurons in the dlPFC respond selectively to behaviorally-relevant and attended stimuli ( Lebedev et al., 2004 ; Markowitz et al., 2015 ) and integrate information about prediction errors, choice values as well as outcome uncertainty prior to trial feedback ( Khamassi et al., 2015 ).

The activity within the dlPFC has been linked to structural credit assignment through selective attention and representational learning ( Niv et al., 2015 ). Under conditions of reward uncertainty and unknown relevant task features, human participants opt for computational efficiency and engage in a serial-hypothesis-testing strategy ( Wilson and Niv, 2011 ), selecting one cue and its anticipated outcome as the main focus of their behavior, and updating the expectations associated exclusively with that choice upon feedback receipt ( Akaishi et al., 2016 ). Niv and colleagues tested participants on a three-armed bandit task, where relevant stimulus dimensions (i.e., shape, color or texture) predicting outcome probabilities changed between blocks of trials ( Niv et al., 2015 ). In such a multidimensional environment, reinforcement-driven learning was aided by attentional control mechanisms that engaged the dlPFC, intraparietal cortex, and precuneus.
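
The following sketch is loosely in the spirit of the feature-based reinforcement learning models used in that study: the value of an option is an attention-weighted sum of its feature values, and the same attention weights gate the learning update, so credit flows mainly to the currently attended dimension. The task structure, update rules, and parameters below are simplified assumptions, not the published model.

```python
import numpy as np

# Feature-based RL with attention over stimulus dimensions (shape, color, texture).
# Here only the "color" dimension (index 1) is relevant; feature 0 of that
# dimension yields reward with high probability.
rng = np.random.default_rng(2)
n_dims, n_feats, n_options = 3, 3, 3
V = np.zeros((n_dims, n_feats))            # learned value of each feature
attn = np.ones(n_dims) / n_dims            # attention weights over dimensions
alpha, eta = 0.3, 5.0

for trial in range(500):
    # Each option is a conjunction of one feature per dimension (columns below).
    options = np.stack([rng.permutation(n_feats) for _ in range(n_dims)], axis=0)
    q = np.array([attn @ V[np.arange(n_dims), options[:, o]] for o in range(n_options)])
    choice = int(np.argmax(q + rng.gumbel(size=n_options)))     # softmax-like choice
    reward = float(rng.random() < (0.75 if options[1, choice] == 0 else 0.25))

    # Attention-gated update: the prediction error is credited mostly to the
    # features of the chosen option on the attended dimension.
    pe = reward - q[choice]
    V[np.arange(n_dims), options[:, choice]] += alpha * attn * pe

    # Re-weight attention toward dimensions whose feature values discriminate best.
    spread = V.max(axis=1) - V.min(axis=1)
    attn = np.exp(eta * spread) / np.exp(eta * spread).sum()

print(np.round(attn, 2))   # attention typically concentrates on the relevant dimension
```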

In many tasks, the credit for outcomes can be assigned according to different rules: based on the causal relationship between rewards and choices (i.e., contingent learning), their temporal proximity (i.e., when the reward is received shortly after a response), or their statistical relationship (when an action has been executed frequently before beneficial outcomes; Jocham et al., 2016 ; Figure 1C ). The analyses presented in the papers discussed above did not allow these alternative credit assignment strategies to be dissociated. By testing human participants on a task with continuous stimulus presentation, instead of a typical trial-by-trial structure, Jocham et al. (2016) demonstrated that causal learning operates in parallel with a tendency to repeat choices that were immediately followed by rewards. In this experiment, activity within another subregion of the PFC, the OFC, was associated with contingent learning. Complementary work in monkeys revealed that the OFC contributes causally to credit assignment ( Noonan et al., 2010 ): animals with OFC lesions were unable to associate a reward with the choice on which it was contingent and instead relied on temporal and statistical learning rules. In another recent paper, Noonan and colleagues (2017) extended these observations to humans, demonstrating causal contributions of the OFC to credit assignment across species. The participants were tested on a three-choice probabilistic learning task. The three options were presented simultaneously and maintained on the screen until the outcome of a decision was revealed, thus requiring participants to ignore irrelevant distractors. Notably, only patients with lateral OFC lesions displayed any difficulty in learning the task, whereas damage to the medial OFC or dorsomedial PFC preserved contingent learning mechanisms. However, it is presently unknown whether lesions to the dlPFC or ACC affect such causal learning.

In another test of credit assignment in learning, contingency degradation, the subjects are required to track causal relationships between the stimuli or actions and rewards. During contingency degradation sessions, the animals are still reinforced for responses, but rewards are also available for free. After experiencing non-contingent rewards, control subjects reliably decrease their choices of the stimuli. However, lesions to either the ACC or the OFC impair this adjustment to contingency degradation ( Jackson et al., 2016 ). Taken together, these observations demonstrate causal contributions of the PFC to appropriate credit assignment in multi-cue environments.

Cooperation Between PFC Subregions Supports Contingent Learning in Multi-Cue Tasks

Despite the segregation of temporal and structural aspects of credit assignment in earlier sections of this review, in naturalistic settings the brain frequently needs to tackle both problems simultaneously. Here, I overview the evidence favoring a network perspective, suggesting that dynamic cortico-cortical interactions during decision making and outcome valuation enable adaptive solutions to complex spatio-temporal credit assignment problems. It has been previously suggested that feedback projections from cortical areas occupying higher levels of the processing hierarchy, including the PFC, can aid in attribution of outcomes to individual decisions by implementing attention-gated reinforcement learning ( Roelfsema and van Ooyen, 2005 ). Similarly, recent theoretical work has shown that even complex multi-cue and multi-step problems can be solved by an extended cascade model of synaptic memory traces, in which the plasticity is modulated not only by the activity within a population of neurons, but also by feedback about executed decisions and resulting rewards ( Urbanczik and Senn, 2009 ; Friedrich et al., 2010 , 2011 ). Contingent learning, according to these models, can be supported by communication between neurons encoding available options, committed choices and outcomes of behavior during decision making and feedback monitoring. For example, at the time of outcome valuation, information about recent choices can be maintained as a memory trace in the neuronal population involved in action selection or conveyed by an efference copy from an interconnected brain region ( Curtis and Lee, 2010 ; Khamassi et al., 2011 , 2015 ). Similarly, reinforcement feedback is likely communicated as a global reward signal (ex., DA release) as well as by projections from neural populations engaged in performance monitoring, such as those within the ACC ( Friedrich et al., 2010 ; Khamassi et al., 2011 ). The complexity of reciprocal and recurrent projections spanning the PFC ( Barbas and Pandya, 1989 ; Felleman and Van Essen, 1991 ; Elston, 2000 ) may enable this network to implement such learning rules, integrating the information about the task, executed decisions and performance feedback.
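
A toy three-factor sketch can make the logic of these population-feedback models concrete: a synapse is updated only when its inputs were active, when feedback (an efference copy) signals that the unit's action was the one actually executed, and when a global reward signal arrives. The network, task, and parameters below are hypothetical and drastically simplified relative to the cited models.

```python
import numpy as np

# Decision-feedback-gated (three-factor) learning rule, illustrative only.
rng = np.random.default_rng(3)
n_inputs, n_actions = 8, 3
W = 0.01 * rng.standard_normal((n_actions, n_inputs))
lr = 0.1
good_action = 1                                      # hypothetical rewarded action

for trial in range(300):
    x = (rng.random(n_inputs) < 0.5).astype(float)   # presynaptic population activity
    scores = W @ x
    action = int(np.argmax(scores + rng.gumbel(size=n_actions)))

    reward = 1.0 if action == good_action else 0.0
    baseline = 1.0 / n_actions                       # expected reward under random choice

    feedback = np.zeros(n_actions)
    feedback[action] = 1.0                           # efference copy of the executed decision

    # Only synapses onto the unit whose action was executed are credited (or
    # blamed) by the baseline-corrected global reward signal.
    W += lr * (reward - baseline) * np.outer(feedback, x)

print(int(np.argmax(W.sum(axis=1))))                 # weights grow for the rewarded action
```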

In many everyday decisions, the options are compared across multiple features simultaneously (ex., by considering current context, needs, available reward types, as well as delay and effort costs). Neurons in different subregions of the PFC exhibit rich response properties, signaling these features of the task at various time epochs within a trial. For example, reward selectivity in response to predictive stimuli emerges earlier in the OFC and may then be passed to the dlPFC, which encodes both the expected outcome and the upcoming choice ( Wallis and Miller, 2003 ). Similarly, on trials where options are compared based on delays to rewards, choices are dependent on interactions between the OFC and dlPFC ( Hunt et al., 2015 ). Conversely, when effort costs are more meaningful for decisions, it is the ACC that influences choice-related activity in the dlPFC ( Hunt et al., 2015 ). The OFC is required not only for the evaluation of stimuli, but also of more complex abstract rules, based on the rewards they predict ( Buckley et al., 2009 ). While both the OFC and dlPFC encode abstract strategies (ex., persisting with recent choices or shifting to a new response), such signals appear earlier in the OFC and may be subsequently conveyed to the dlPFC, where they are combined with upcoming response (i.e., left vs. right saccade) encoding ( Tsujimoto et al., 2011 ). Therefore, the OFC may be the first PFC subregion to encode task rules and/or potential rewards predicted by sensory cues; via cortico-cortical projections, this information may be subsequently communicated to the dlPFC or ACC ( Kennerley et al., 2009 ; Hayden and Platt, 2010 ) to drive strategy-sensitive response planning.

The behavioral strategy that the animal follows is influenced by recent reward history ( Cohen et al., 2007 ; Pearson et al., 2009 ). If its choices are reinforced frequently, the animal will make similar decisions in the future (i.e., exploit its current knowledge). Conversely, the omission of expected rewards can signal a need for novel behaviors (i.e., exploration). Neurons in the dlPFC carry representations of planned as well as previous choices, anticipate outcomes, and jointly encode the current decisions and their consequences following feedback ( Seo and Lee, 2007 ; Seo et al., 2007 ; Tsujimoto et al., 2009 ; Asaad et al., 2017 ). Similarly, the ACC tracks trial-by-trial outcomes of decisions ( Procyk et al., 2000 ; Shidara and Richmond, 2002 ; Amiez et al., 2006 ; Quilodran et al., 2008 ) as well as reward and choice history ( Seo and Lee, 2007 ; Kennerley et al., 2009 , 2011 ; Sul et al., 2010 ; Kawai et al., 2015 ) and signals errors in outcome prediction ( Kennerley et al., 2009 , 2011 ; Hayden et al., 2011 ; Monosov, 2017 ). At the time of feedback, neurons in the OFC encode committed choices, their values and contingent rewards ( Tsujimoto et al., 2009 ; Sul et al., 2010 ). Notably, while the OFC encodes the identity of expected outcomes and the value of the chosen option after the alternatives are presented to an animal, it does not appear to encode upcoming decisions ( Tremblay and Schultz, 1999 ; Wallis and Miller, 2003 ; Padoa-Schioppa and Assad, 2006 ; Sul et al., 2010 ; McDannald et al., 2014 ); feedback projections from the dlPFC or ACC may therefore be required for such choice-related activity to emerge in the OFC at the time of reward feedback.

To capture the interactions between PFC subregions in reinforcement-driven learning, Khamassi and colleagues have formulated a computational model in which action values are stored and updated in the ACC and then communicated to the dlPFC, which decides which action to trigger ( Khamassi et al., 2011 , 2013 ). This model relies on meta-learning principles ( Doya, 2002 ), flexibly adjusting the exploration-exploitation parameter based on performance history and variability in the environment that are monitored by the ACC. The explore-exploit parameter then influences action-selection mechanisms in the dlPFC, prioritizing choice repetition once the rewarded actions are discovered and encouraging switching between different options when environmental conditions change. In addition to highlighting the dynamic interactions between the dlPFC and ACC in learning, the model similarly offers an elegant solution to the credit assignment problem by restricting value updating only to those actions that were selected on a given trial. This is implemented by requiring the prediction error signals in the ACC to coincide with a motor efference copy sent by the premotor cortex. The model is also endowed with the ability to learn meta-values of novel objects in the environment based on the changes in the average reward that follow the presentation of such stimuli. While the authors proposed that such meta-value learning is implemented by the ACC, it is plausible that the OFC also plays a role in this process based on its contributions to stimulus-outcome and state learning ( Wilson et al., 2014 ; Zsuga et al., 2016 ). Intriguingly, this model could reproduce monkey behavior and neural responses on two tasks: four-choice deterministic and two-choice probabilistic paradigms, entailing a complex spatio-temporal credit assignment problem as the stimuli disappeared from the screen prior to action execution and outcome presentation ( Khamassi et al., 2011 , 2013 , 2015 ). Model-based analyses of neuronal responses further revealed that information about prediction errors, action values and outcome uncertainty is integrated in both the dlPFC and ACC, but at different timepoints: before trial feedback in the dlPFC and after feedback in the ACC ( Khamassi et al., 2015 ).
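
A condensed sketch of two of these ingredients, the efference-copy gate on value updates and the meta-learned explore-exploit parameter, is given below. This is not the published model; the update rules, gating, and parameter values are simplified assumptions for illustration.

```python
import numpy as np

# Illustrative meta-learning of the explore-exploit trade-off, loosely inspired
# by Khamassi et al. (2011, 2013).
rng = np.random.default_rng(4)
n_actions = 4
Q = np.zeros(n_actions)        # action values (updated in "ACC")
alpha, beta = 0.2, 1.0         # learning rate and inverse temperature
avg_reward = 0.0               # running estimate of recent performance
rewarded = 2                   # hypothetical currently-correct action

for trial in range(400):
    if trial == 200:
        rewarded = 0           # unsignalled rule change

    # Softmax action selection ("dlPFC"); sharper when beta is high (exploitation).
    p = np.exp(beta * Q) / np.exp(beta * Q).sum()
    action = int(rng.choice(n_actions, p=p))
    reward = 1.0 if action == rewarded else 0.0

    # Credit is assigned only to the executed action (efference-copy gate).
    Q[action] += alpha * (reward - Q[action])

    # Meta-learning: good recent performance -> exploit (raise beta);
    # a drop in performance after the rule change -> explore (lower beta).
    avg_reward += 0.1 * (reward - avg_reward)
    beta = 0.5 + 5.0 * avg_reward

print(int(np.argmax(Q)), round(beta, 2))
```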

Collectively, these findings highlight the heterogeneity of responses in each PFC subregion, which differ in their temporal dynamics within a single trial, and suggest that the cooperation between the OFC, ACC and dlPFC may support flexible, strategy- and context-dependent choices. This network perspective further suggests that individual PFC subregions may be less specialized in their functions than previously thought. For example, in primates both the ACC and dlPFC participate in decisions based on action values ( Hunt et al., 2015 ; Khamassi et al., 2015 ). More recently, it has been demonstrated that the OFC is also involved in updating action-outcome values ( Fiuzat et al., 2017 ). Analogously, while it has been proposed that the OFC is specialized for stimulus-outcome and ACC for action-outcome learning ( Rudebeck et al., 2008 ), lesions to the ACC have been similarly reported to impair stimulus-based reversal learning ( Chudasama et al., 2013 ), supporting shared contributions of the PFC subregions to adaptive behavior. Indeed, these brain regions communicate extensively, sharing the information about presented options, executed decisions and received rewards (Figure 2 ), which can enable them to assign credit for outcomes to choices on which they are contingent ( Urbanczik and Senn, 2009 ; Friedrich et al., 2010 , 2011 ). Attention-gated learning likely relies on the cooperation between PFC subregions as well: for example, coordinated and synchronized activity between the ACC and dlPFC aids in goal-directed attentional shifting and prioritization of task-relevant information ( Womelsdorf et al., 2014 ; Oemisch et al., 2015 ; Voloh et al., 2015 ).


Figure 2 . Cooperation between PFC subregions in multi-cue tasks. In many everyday decisions, the options are compared across multiple features simultaneously (ex., by considering current context, needs, available reward types, as well as delay and effort costs). Neurons in different subregions of the PFC exhibit rich response properties, integrating many aspects of the task at hand. The OFC, ACC and dlPFC communicate extensively, sharing the information about presented options, executed decisions and received rewards, which can enable them to assign credit for outcomes to choices on which they are contingent.

Functional connectivity within the PFC can support contingent learning on shorter timescales (ex., across trials within the same task), when complex rules or stimulus-action-outcome mappings switch frequently ( Duff et al., 2011 ; Johnson et al., 2016 ). Under such conditions, the same stimuli can carry different meanings depending on task context or due to changes in the environment (ex., serial discrimination-reversal problems), and PFC neurons with heterogeneous response properties may be better targets for modification, allowing the brain to exert flexible, rapid and context-sensitive control over behavior ( Asaad et al., 1998 ; Mansouri et al., 2006 ). Indeed, it has been shown that rule and reversal learning induce plasticity in OFC synapses onto the dorsomedial PFC (encompassing the ACC) in rats ( Johnson et al., 2016 ). When the motivational significance of reward-predicting cues fluctuates frequently, neuronal responses and synaptic connections within the PFC tend to update more rapidly (i.e., across blocks of trials) compared to subcortical structures and other cortical regions ( Padoa-Schioppa and Assad, 2008 ; Morrison et al., 2011 ; Xie and Padoa-Schioppa, 2016 ; Fernández-Lamo et al., 2017 ; Saez et al., 2017 ). Similarly, neurons in the PFC promptly adapt their responses to incoming information based on the recent history of inputs ( Freedman et al., 2001 ; Meyers et al., 2012 ; Stokes et al., 2013 ). Critically, changes in the PFC activity closely track behavioral performance ( Mulder et al., 2003 ; Durstewitz et al., 2010 ), and interfering with neural plasticity within this brain area prevents normal responses to contingency degradation ( Swanson et al., 2015 ).

When the circumstances are stable overall and the same cues or actions remain reliable predictors of rewards, long-range connections between the PFC, association and sensory areas can support contingent learning on prolonged timescales. Neurons in the lateral intraparietal area demonstrate larger post-decisional responses and enhanced learning following choices that predict final outcomes of sequential behavior in a multi-step, multi-cue task ( Gersch et al., 2014 ). Such changes in neuronal activity likely rely on information about task rules conveyed by the PFC directly or via interactions with neuromodulatory systems. These hypotheses could be tested in future work.

In summary, dynamic interactions between subregions of the PFC can support contingent learning in multi-cue environments. Furthermore, via feedback projections, the PFC can guide plasticity in other cortical areas associated with sensory and motor processing ( Cohen et al., 2011 ). This account suggests that lesion experiments targeting a localized PFC subregion will be insufficient to gain a fine-grained understanding of credit assignment during learning and instead poses refined questions for future research, shifting the focus from focal manipulations to experimental techniques targeting cortico-cortical projections. To gain novel insights into functional connectivity between PFC subregions, it will be critical to assess neural correlates of contingent learning in the OFC, ACC, and dlPFC simultaneously in the context of the same task. In humans, functional connectivity can be assessed by utilizing coherence, phase synchronization, Granger causality and Bayesian network approaches ( Bastos and Schoffelen, 2016 ; Mill et al., 2017 ). Indeed, previous studies have linked individual differences in cortico-striatal functional connectivity to reinforcement-driven learning ( Horga et al., 2015 ; Kaiser et al., 2017 ) and future work could focus on examining cortico-cortical interactions in similar paradigms. To probe causal contributions of projections spanning the PFC, future research may benefit from designing multi-cue tasks for rodents and taking advantage of recently developed techniques (i.e., chemo- and opto-genetic targeting of projection neurons followed by silencing of axonal terminals to achieve pathway-specific inhibition; Deisseroth, 2010 ; Sternson and Roth, 2014 ) that afford increasingly precise manipulations of cortico-cortical connectivity. It should be noted, however, that most experiments to date have probed the contributions of the PFC to credit assignment in primates, and functional specialization across different subregions may be even less pronounced in mice and rats. Finally, as highlighted throughout this review, the recent progress in understanding the neural mechanisms of credit assignment has relied on the introduction of more complex tasks, including multi-cue and probabilistic choice paradigms. While such tasks better mimic the naturalistic problems that brains have evolved to solve, they also produce behavioral patterns that are more difficult to analyze and interpret ( Scholl and Klein-Flügge, 2017 ). As such, computational modeling of the behavior and neuronal activity may prove especially useful in future work on credit assignment.
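
As a small illustration of one of these measures, the sketch below estimates spectral coherence between two simulated local field potential-like signals that share a common 10 Hz rhythm; the signals and their coupling are synthetic, and the regions named in the comments are placeholders.

```python
import numpy as np
from scipy.signal import coherence

# Coherence between two synthetic signals sharing a 10 Hz component.
rng = np.random.default_rng(5)
fs = 250.0                                        # sampling rate, Hz
t = np.arange(0, 60, 1 / fs)                      # 60 s of data

shared = np.sin(2 * np.pi * 10 * t)               # common 10 Hz rhythm
x = shared + rng.standard_normal(t.size)          # e.g., an "OFC" channel
y = 0.8 * shared + rng.standard_normal(t.size)    # e.g., a "dlPFC" channel

f, Cxy = coherence(x, y, fs=fs, nperseg=1024)
print(round(float(f[np.argmax(Cxy)])), round(float(Cxy.max()), 2))   # peak near 10 Hz
```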

Author Contributions

The author confirms being the sole contributor of this work and approved it for publication.

Funding

This work was supported by UCLA's Division of Life Sciences Recruitment and Retention fund (Izquierdo), as well as the UCLA Distinguished University Fellowship (Stolyarova).

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The author thanks her mentor Dr. Alicia Izquierdo for helpful feedback and critiques on the manuscript, and Evan E. Hart, as well as the members of the Center for Brains, Minds and Machines and Lau lab for stimulating conversations on the topic.

Abraham, W. C. (2008). Metaplasticity: tuning synapses and networks for plasticity. Nat. Rev. Neurosci. 9:387 doi: 10.1038/nrn2356


Akaishi, R., Kolling, N., Brown, J. W., and Rushworth, M. (2016). Neural mechanisms of credit assignment in a multicue environment. J. Neurosci. 36, 1096–1112. doi: 10.1523/JNEUROSCI.3159-15.2016

Amiez, C., Joseph, J. P., and Procyk, E. (2006). Reward encoding in the monkey anterior cingulate cortex. Cereb. Cortex 16, 1040–1055. doi: 10.1093/cercor/bhj046

Asaad, W. F., Lauro, P. M., Perge, J. A., and Eskandar, E. N. (2017). Prefrontal neurons encode a solution to the credit assignment problem. J. Neurosci. 37, 6995–7007. doi: 10.1523/JNEUROSCI.3311-16.2017.

Asaad, W. F., Rainer, G., and Miller, E. K. (1998). Neural activity in the primate prefrontal cortex during associative learning. Neuron 21, 1399–1407. doi: 10.1016/S0896-6273(00)80658-3

Barbas, H., and Pandya, D. N. (1989). Architecture and intrinsic connections of the prefrontal cortex in the rhesus monkey. J. Comp. Neurol. 286, 353–375 doi: 10.1002/cne.902860306

Barto, A. G. (2007). Temporal difference learning. Scholarpedia J. 2:1604. doi: 10.4249/scholarpedia.1604


Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. SMC-13, 834–846.


Bastos, A. M., and Schoffelen, J. M. (2016). A tutorial review of functional connectivity analysis methods and their interpretational pitfalls. Front. Syst. Neurosci . 9:175. doi: 10.3389/fnsys.2015.00175

Bogacz, R., McClure, S. M., Li, J., Cohen, J. D., and Montague, P. R. (2007). Short-term memory traces for action bias in human reinforcement learning. Brain Res. 1153, 111–121. doi: 10.1016/j.brainres.2007.03.057

Buckley, M. J., Mansouri, F. A., Hoda, H., Mahboubi, M., Browning, P. G. F., Kwok, S. C., et al. (2009). Dissociable components of rule-guided behavior depend on distinct medial and prefrontal regions. Science 325, 52–58. doi: 10.1126/science.1172377

Chudasama, Y., Daniels, T. E., Gorrin, D. P., Rhodes, S. E., Rudebeck, P. H., and Murray, E. A. (2013). The role of the anterior cingulate cortex in choices based on reward value and reward contingency. Cereb Cortex 23, 2884–2898. doi: 10.1093/cercor/bhs266

Cohen, J. D., McClure, S. M., and Yu, A. J. (2007). Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci. 362, 933–942. doi: 10.1098/rstb.2007.2098

Cohen, M. X., Wilmes, K., and Vijver, I. v. (2011). Cortical electrophysiological network dynamics of feedback learning. Trends Cogn. Sci. 15, 558–566. doi: 10.1016/j.tics.2011.10.004

Curtis, C. E., and Lee, D. (2010). Beyond working memory: the role of persistent activity in decision making. Trends Cogn. Sci. 14, 216–222. doi: 10.1016/j.tics.2010.03.006

Dayan, P., and Abbott, L. F. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Cambridge, MA: MIT Press.

Deisseroth, K. (2010). Optogenetics. Nat. Methods 8, 26–29. doi: 10.1038/nmeth.f.324

Doya, K. (2002). Metalearning and neuromodulation. Neural. Netw. 15, 495–506. doi: 10.1016/S0893-6080(02)00044-8

Duff, A., Sanchez Fibla, M., and Verschure, P. F. M. J. (2011). A biologically based model for the integration of sensory–motor contingencies in rules and plans: a prefrontal cortex based extension of the distributed adaptive control architecture. Brain Res. Bull. 85, 289–304. doi: 10.1016/j.brainresbull.2010.11.008

Durstewitz, D., Vittoz, N. M., Floresco, S. B., and Seamans, J. K. (2010). Abrupt transitions between prefrontal neural ensemble states accompany behavioral transitions during rule learning. Neuron 66, 438–448. doi: 10.1016/j.neuron.2010.03.029

Elston, G. N. (2000). Pyramidal cells of the frontal lobe: all the more spinous to think with. J. Neurosci. 20:RC95. Available online at: http://www.jneurosci.org/content/20/18/RC95.long


Felleman, D. J., and Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1, 1–47. doi: 10.1093/cercor/1.1.1

Fernández-Lamo, I., Delgado-García, J. M., and Gruart, A. (2017). When and where learning is taking place: multisynaptic changes in strength during different behaviors related to the acquisition of an operant conditioning task by behaving rats. Cereb. Cortex 14, 1–13. doi: 10.1093/cercor/bhx011


Fisher, S. D., Robertson, P. B., Black, M. J., Redgrave, P., Sagar, M. A., Abraham, W. C., et al. (2017). Reinforcement determines the timing dependence of corticostriatal synaptic plasticity in vivo . Nat. Commun. 8:334. doi: 10.1038/s41467-017-00394-x

FitzGerald, T. H. B., Friston, K. J., and Dolan, R. J. (2013). Characterising reward outcome signals in sensory cortex. NeuroImage 83, 329–334. doi: 10.1016/j.neuroimage.2013.06.061

Fiuzat, E. C., Rhodes, S. E., and Murray, E. A. (2017). The role of orbitofrontal-amygdala interactions in updating action-outcome valuations in macaques. J. Neurosci. 37, 2463–2470. doi: 10.1523/JNEUROSCI.1839-16.2017

Freedman, D. J., Riesenhuber, M., Poggio, T., and Miller, E. K. (2001). Categorical representation of visual stimuli in the primate prefrontal cortex. Science 291, 312–316. doi: 10.1126/science.291.5502.312

Friedrich, J., Urbanczik, R., and Senn, W. (2010). Learning spike-based population codes by reward and population feedback. Neural. Comput. 22, 1698–1717. doi: 10.1162/neco.2010.05-09-1010

Friedrich, J., Urbanczik, R., and Senn, W. (2011). Spatio-temporal credit assignment in neuronal population learning. PLoS Comput. Biol. 7:e1002092. doi: 10.1371/journal.pcbi.1002092

Gersch, T. M., Foley, N. C., Eisenberg, I., and Gottlieb, J. (2014). Neural correlates of temporal credit assignment in the parietal lobe. PloS One , 9:e88725. doi: 10.1371/journal.pone.0088725

Hayden, B. Y., Heilbronner, S. R., Pearson, J. M., and Platt, M. L. (2011). Surprise signals in anterior cingulate cortex: neuronal encoding of unsigned reward prediction errors driving adjustment in behavior. J. Neurosci. 31, 4178–4187. doi: 10.1523/JNEUROSCI.4652-10.2011

Hayden, B. Y., and Platt, M. L. (2010). Neurons in anterior cingulate cortex multiplex information about reward and action. J. Neurosci. 30, 3339–3346. doi: 10.1523/JNEUROSCI.4874-09.2010

Her, E. S., Huh, N., Kim, J., and Jung, M. W. (2016). Neuronal activity in dorsomedial and dorsolateral striatum under the requirement for temporal credit assignment. Sci. Rep. 6:27056. doi: 10.1038/srep27056

Histed, M. H., Pasupathy, A., and Miller, E. K. (2009). Learning substrates in the primate prefrontal cortex and striatum: sustained activity related to successful actions. Neuron 63, 244–253. doi: 10.1016/j.neuron.2009.06.019

Horga, G., Maia, T. V., Marsh, R., Hao, X., Xu, D., Duan, Y., et al. (2015). Changes in corticostriatal connectivity during reinforcement learning in humans. Hum. Brain Mapp. 36, 793–803. doi: 10.1002/hbm.22665

Hull, C. (1943). Principles of Behavior . New York, NY: Appleton-Century-Crofts.

Hunt, L. T., Behrens, T. E. J., Hosokawa, T., Wallis, J. D., and Kennerley, S. W. (2015). Capturing the temporal evolution of choice across prefrontal cortex. eLife 4:e11945. doi: 10.7554/eLife.11945

Izhikevich, E. M. (2007). Solving the distal reward problem through linkage of STDP and dopamine signaling. Cereb. Cortex 17, 2443–2452. doi: 10.1093/cercor/bhl152

Jackson, S. A. W., Horst, N. K., Pears, A., Robbins, T. W., and Roberts, A. C. (2016). Role of the perigenual anterior cingulate and orbitofrontal cortex in contingency learning in the marmoset. Cereb. Cortex 26, 3273–3284. doi: 10.1093/cercor/bhw067

Jocham, G., Brodersen, K. H., Constantinescu, A. O., Kahn, M. C., Ianni, A. M., Walton, M. E., et al. (2016). Reward-guided learning with and without causal attribution. Neuron 90, 177–190. doi: 10.1016/j.neuron.2016.02.018

Jog, M. S., Kubota, Y., Connolly, C. I., Hillegaart, V., and Graybiel, A. M. (1999). Building neural representations of habits. Science 286, 1745–1749. doi: 10.1126/science.286.5445.1745

Johnson, C. M., Peckler, H., Tai, L. H., and Wilbrecht, L. (2016). Rule learning enhances structural plasticity of long-range axons in frontal cortex. Nat. Commun. 7:10785. doi: 10.1038/ncomms10785

Kaiser, R. H., Treadway, M. T., Wooten, D. W., Kumar, P., Goer, F., Murray, L., et al. (2017). Frontostriatal and dopamine markers of individual differences in reinforcement learning: a multi-modal investigation. Cereb. Cortex . doi: 10.1093/cercor/bhx281. [Epub ahead of print].

Kawai, T., Yamada, H., Sato, N., Takada, M., and Matsumoto, M. (2015). Roles of the lateral habenula and anterior cingulate cortex in negative outcome monitoring and behavioral adjustment in nonhuman primates. Neuron 88, 792–804. doi: 10.1016/j.neuron.2015.09.030

Kennerley, S. W., Behrens, T. E. J., and Wallis, J. D. (2011). Double dissociation of value computations in orbitofrontal and anterior cingulate neurons. Nat. Neurosci. 14, 1581–1589. doi: 10.1038/nn.2961

Kennerley, S. W., Dahmubed, A. F., Lara, A. H., and Wallis, J. D. (2009). Neurons in the frontal lobe encode the value of multiple decision variables. J. Cogn. Neurosci. 21, 1162–1178. doi: 10.1162/jocn.2009.21100

Khamassi, M., Enel, P., Dominey, P. F., and Procyk, E. (2013). Medial prefrontal cortex and the adaptive regulation of reinforcement learning parameters. Prog. Brain Res. 202, 441–464. doi: 10.1016/B978-0-444-62604-2.00022-8

Khamassi, M., Lallée, S., Enel, P., Procyk, E., and Dominey, P. F. (2011). Robot cognitive control with a neurophysiologically inspired reinforcement learning model. Front Neurorobot 5:1. doi: 10.3389/fnbot.2011.00001

Khamassi, M., Quilodran, R., Enel, P., Dominey, P. F., and Procyk, E. (2015). Behavioral regulation and the modulation of information coding in the lateral prefrontal and cingulate cortex. Cereb. Cortex 25, 3197–3218. doi: 10.1093/cercor/bhu114

Kim, H., Lee, D., and Jung, M. W. (2013). Signals for previous goal choice persist in the dorsomedial, but not dorsolateral striatum of rats. J. Neurosci. 33, 52–63. doi: 10.1523/JNEUROSCI.2422-12.2013

Kim, H., Sul, J. H., Huh, N., Lee, D., and Jung, M. W. (2009). Role of striatum in updating values of chosen actions. J. Neurosci. 29, 14701–14712. doi: 10.1523/JNEUROSCI.2728-09.2009

Kötter, R., and Wickens, J. (1995). Interactions of glutamate and dopamine in a computational model of the striatum. J. Comput. Neurosci. 2, 195–214. doi: 10.1007/BF00961434

Lebedev, M. A., Messinger, A., Kralik, J. D., and Wise, S. P. (2004). Representation of attended versus remembered locations in prefrontal cortex. PLoS Biol. 2:e365. doi: 10.1371/journal.pbio.0020365

Mackintosh, N. J. (1975). Blocking of conditioned suppression: role of the first compound trial. J. Exp. Psychol. 1, 335–345. doi: 10.1037/0097-7403.1.4.335

Mansouri, F. A., Matsumoto, K., and Tanaka, K. (2006). Prefrontal cell activities related to monkeys' success and failure in adapting to rule changes in a Wisconsin Card Sorting Test analog. J. Neurosci. 26, 2745–2756. doi: 10.1523/JNEUROSCI.5238-05.2006

Markowitz, D. A., Curtis, C. E., and Pesaran, B. (2015). Multiple component networks support working memory in prefrontal cortex. Proc. Natl. Acad. Sci. U.S.A. 112, 11084–11089. doi: 10.1073/pnas.1504172112

Matsuda, Y., Marzo, A., and Otani, S. (2006). The presence of background dopamine signal converts long-term synaptic depression to potentiation in rat prefrontal cortex. J. Neurosci. 26, 4803–4810. doi: 10.1523/JNEUROSCI.5312-05.2006

McDannald, M. A., Esber, G. R., Wegener, M. A., Wied, H. M., Liu, T.-L., Stalnaker, T. A., et al. (2014). Orbitofrontal neurons acquire responses to “valueless” Pavlovian cues during unblocking. eLife 3:e02653. doi: 10.7554/eLife.02653

Meyers, E. M., Qi, X. L., and Constantinidis, C. (2012). Incorporation of new information into prefrontal cortical activity after learning working memory tasks. Proc. Natl. Acad. Sci. U.S.A. 109, 4651–4656. doi: 10.1073/pnas.1201022109

Mill, R. D., Bagic, A., Bostan, A., Schneider, W., and Cole, M. W. (2017). Empirical validation of directed functional connectivity. Neuroimage 146, 275–287. doi: 10.1016/j.neuroimage.2016.11.037

Monosov, I. E. (2017). Anterior cingulate is a source of valence-specific information about value and uncertainty. Nat. Commun. 8:134. doi: 10.1038/s41467-017-00072-y

Montague, P. R., Dayan, P., and Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947.

Morrison, S. E., Saez, A., Lau, B., and Salzman, C. D. (2011). Different time courses for learning-related changes in amygdala and orbitofrontal cortex. Neuron 71, 1127–1140. doi: 10.1016/j.neuron.2011.07.016

Mulder, A. B., Nordquist, R. E., Orgüt, O., and Pennartz, C. M. A. (2003). Learning-related changes in response patterns of prefrontal neurons during instrumental conditioning. Behav. Brain Res. 146, 77–88. doi: 10.1016/j.bbr.2003.09.016

Niv, Y. (2009). Reinforcement learning in the brain. J. Math. Psychol. 53, 139–154. doi: 10.1016/j.jmp.2008.12.005

Niv, Y., Daniel, R., Geana, A., Gershman, S. J., Leong, Y. C., Radulescu, A., et al. (2015). Reinforcement learning in multidimensional environments relies on attention mechanisms. J. Neurosci. 35, 8145–8157. doi: 10.1523/JNEUROSCI.2978-14.2015

Noonan, M. P., Chau, B. K. H., Rushworth, M. F. S., and Fellows, L. K. (2017). Contrasting effects of medial and lateral orbitofrontal cortex lesions on credit assignment and decision-making in humans. J. Neurosci . 37, 7023–7035. doi: 10.1523/JNEUROSCI.0692-17.2017

Noonan, M. P., Walton, M. E., Behrens, T. E., Sallet, J., Buckley, M. J., and Rushworth, M. F. (2010). Separate value comparison and learning mechanisms in macaque medial and lateral orbitofrontal cortex. Proc. Natl. Acad. Sci. U.S.A. 107, 20547–20252. doi: 10.1073/pnas.1012246107

Oemisch, M., Westendorff, S., Everling, S., and Womelsdorf, T. (2015). Interareal spike-train correlations of anterior cingulate and dorsal prefrontal cortex during attention shifts. J. Neurosci. 35, 13076–13089. doi: 10.1523/JNEUROSCI.1262-15.2015

Padoa-Schioppa, C., and Assad, J. A. (2006). Neurons in the orbitofrontal cortex encode economic value. Nature 441, 223–226 doi: 10.1038/nature04676

Padoa-Schioppa, C., and Assad, J. A. (2008). The representation of economic value in the orbitofrontal cortex is invariant for changes of menu. Nat. Neurosci. 11, 95–102. doi: 10.1038/nn2020

Pan, W. X., Schmidt, R., Wickens, J. R., and Hyland, B. I. (2005). Dopamine cells respond to predicted events during classical conditioning: evidence for eligibility traces in the reward-learning network. J. Neurosci. 25, 6235–6242. doi: 10.1523/JNEUROSCI.1478-05.2005

Pasupathy, A., and Miller, E. K. (2005). Different time courses of learning-related activity in the prefrontal cortex and striatum. Nature 433, 873–876. doi: 10.1038/nature03287

Pearson, J. M., Hayden, B. Y., Raghavachari, S., and Platt, M. L. (2009). Neurons in posterior cingulate cortex signal exploratory decisions in a dynamic multioption choice task. Curr. Biol. 19, 1532–1537. doi: 10.1016/j.cub.2009.07.048

Procyk, E., Tanaka, Y. L., and Joseph, J. P. (2000). Anterior cingulate activity during routine and non-routine sequential behaviors in macaques. Nat. Neurosci. 3, 502–508. doi: 10.1038/74880

Quilodran, R., Rothe, M., and Procyk, E. (2008). Behavioral shifts and action valuation in the anterior cingulate cortex. Neuron 57, 314–325. doi: 10.1016/j.neuron.2007.11.031

Roelfsema, P. R., and van Ooyen, A. (2005). Attention-gated reinforcement learning of internal representations for classification. Neural. Comput. 17, 2176–2214. doi: 10.1162/0899766054615699

Rothkopf, C. A., and Ballard, D. H. (2010). Credit assignment in multiple goal embodied visuomotor behavior. Front. Psychol. 1:173. doi: 10.3389/fpsyg.2010.00173

Rudebeck, P. H., Behrens, T. E., Kennerley, S. W., Baxter, M. G., Buckley, M. J., Walton, M. E., et al. (2008). Frontal cortex subregions play distinct roles in choices between actions and stimuli. J. Neurosci. 28, 13775–13785. doi: 10.1523/JNEUROSCI.3541-08.2008

Saez, R. A., Saez, A., Paton, J. J., Lau, B., and Salzman, C. D. (2017). Distinct roles for the amygdala and orbitofrontal cortex in representing the relative amount of expected reward. Neuron 95, 70.e3–77.e3. doi: 10.1016/j.neuron.2017.06.012

Scholl, J., and Klein-Flügge, M. (2017). Understanding psychiatric disorder by capturing ecologically relevant features of learning and decision-making. Behav Brain Res . doi: 10.1016/j.bbr.2017.09.050. [Epub ahead of print].

Schultz, W. (1998a). Predictive reward signal of dopamine neurons. J. Neurophysiol. 80, 1–27. doi: 10.1152/jn.1998.80.1.1

Schultz, W. (1998b). The phasic reward signal of primate dopamine neurons. Adv. Pharmacol. 42, 686–690. doi: 10.1016/S1054-3589(08)60841-8

Schultz, W. (2004). Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioural ecology. Curr. Opin. Neurobiol. 14, 139–147. doi: 10.1016/j.conb.2004.03.017

Schultz, W., and Dickinson, A. (2000). Neuronal coding of prediction errors. Ann. Rev. Neurosci. 23, 473–500. doi: 10.1146/annurev.neuro.23.1.473

Seo, H., Barraclough, D. J., and Lee, D. (2007). Dynamic signals related to choices and outcomes in the dorsolateral prefrontal cortex. Cerebral Cortex 17(Suppl. 1), i110–i117. doi: 10.1093/cercor/bhm064

Seo, H., and Lee, D. (2007). Temporal filtering of reward signals in the dorsal anterior cingulate cortex during a mixed-strategy game. J. Neurosci. 27, 8366–8377. doi: 10.1523/JNEUROSCI.2369-07.2007

Seo, M., Lee, E., and Averbeck, B. B. (2012). Action selection and action value in frontal-striatal circuits. Neuron 74, 947–960. doi: 10.1016/j.neuron.2012.03.037

Seol, G. H., Ziburkus, J., Huang, S., Song, L., Kim, I. T., Takamiya, K., et al. (2007). Neuromodulators control the polarity of spike-timing-dependent synaptic plasticity. Neuron 55, 919–929. doi: 10.1016/j.neuron.2007.08.013

Shidara, M., and Richmond, B. J. (2002). Anterior cingulate: single neuronal signals related to degree of reward expectancy. Science 296, 1709–1711. doi: 10.1126/science.1069504

Shuler, M. G., and Bear, M. F. (2006). Reward timing in the primary visual cortex. Science 311, 1606–1609. doi: 10.1126/science.1123513

Sternson, S. M., and Roth, B. L. (2014). Chemogenetic tools to interrogate brain functions. Ann. Rev. Neurosci. 37, 387–407. doi: 10.1146/annurev-neuro-071013-014048

Stokes, M. G., Kusunoki, M., Sigala, N., Nili, H., Gaffan, D., and Duncan, J. (2013). Dynamic coding for cognitive control in prefrontal cortex. Neuron 78, 364–375. doi: 10.1016/j.neuron.2013.01.039

Sul, J. H., Kim, H., Huh, N., Lee, D., and Jung, M. W. (2010). Distinct roles of rodent orbitofrontal and medial prefrontal cortex in decision making. Neuron 66, 449–460. doi: 10.1016/j.neuron.2010.03.033

Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction Vol. 1 Cambridge: MIT press

Swanson, A. M., Allen, A. G., Shapiro, L. P., and Gourley, S. L. (2015). GABAAα1-mediated plasticity in the orbitofrontal cortex regulates context-dependent action selection. Neuropsychopharmacology 40, 1027–1036. doi: 10.1038/npp.2014.292

Tremblay, L., and Schultz, W. (1999). Relative reward preference in primate orbitofrontal cortex. Nature 398, 704–708. doi: 10.1038/19525

Tsujimoto, S., Genovesio, A., and Wise, S. P. (2009). Monkey orbitofrontal cortex encodes response choices near feedback time. J. Neurosci. 29, 2569–2574. doi: 10.1523/JNEUROSCI.5777-08.2009

Tsujimoto, S., Genovesio, A., and Wise, S. P. (2011). Comparison of strategy signals in the dorsolateral and orbital prefrontal cortex. J. Neurosci. 31, 4583–4592. doi: 10.1523/JNEUROSCI.5816-10.2011

Urbanczik, R., and Senn, W. (2009). Reinforcement learning in populations of spiking neurons. Nat. Neurosci. 12, 250–252. doi: 10.1038/nn.2264

Voloh, B., Valiante, T. A., Everling, S., and Womelsdorf, T. (2015). Theta-gamma coordination between anterior cingulate and prefrontal cortex indexes correct attention shifts. Proc. Natl. Acad. Sci. U.S.A. 112, 8457–8462. doi: 10.1073/pnas.1500438112

Wallis, J. D., and Miller, E. K. (2003). Neuronal activity in primate dorsolateral and orbital prefrontal cortex during performance of a reward preference task. Eur. J. Neurosci. 18, 2069–2081. doi: 10.1046/j.1460-9568.2003.02922.x

Walsh, M. M., and Anderson, J. R. (2014). Navigating complex decision spaces: problems and paradigms in sequential choice. Psychol. Bull. 140, 466–486. doi: 10.1037/a0033455

Wilson, R. C., and Niv, Y. (2011). Inferring relevance in a changing world. Front Hum. Neurosci. 5:189. doi: 10.3389/fnhum.2011.00189

Wilson, R. C., Takahashi, Y. K., Schoenbaum, G., and Niv, Y. (2014). Orbitofrontal cortex as a cognitive map of task space. Neuron 81, 267–279. doi: 10.1016/j.neuron.2013.11.005

Womelsdorf, T., Ardid, S., Everling, S., and Valiante, T. A. (2014). Burst firing synchronizes prefrontal and anterior cingulate cortex during attentional control. Curr. Biol. 24, 2613–2621. doi: 10.1016/j.cub.2014.09.046

Wörgötter, F., and Porr, B. (2005). Temporal sequence learning, prediction, and control: a review of different models and their relation to biological mechanisms. Neural. Comput. 17, 245–319. doi: 10.1162/0899766053011555

Xie, J., and Padoa-Schioppa, C. (2016). Neuronal remapping and circuit persistence in economic decisions. Nat. Neurosci. 19, 855–861. doi: 10.1038/nn.4300

Xu, Y. (2017). Reevaluating the sensory account of visual working memory storage. Trends Cogn. Sci. 21, 794–815 doi: 10.1016/j.tics.2017.06.013

Zhang, J. C., Lau, P.-M., and Bi, G.-Q. (2009). Gain in sensitivity and loss in temporal contrast of STDP by dopaminergic modulation at hippocampal synapses. Proc. Natl. Acad. Sci. U.S.A. 106, 13028–13033 doi: 10.1073/pnas.0900546106

Zsuga, J., Biro, K., Tajti, G., Szilasi, M. E., Papp, C., Juhasz, B., et al. (2016). ‘Proactive’ use of cue-context congruence for building reinforcement learning's reward function. BMC Neurosci. 17:70. doi: 10.1186/s12868-016-0302-7

Keywords: orbitofrontal, dorsolateral prefrontal, anterior cingulate, learning, reward, reinforcement, plasticity, behavioral flexibility

Citation: Stolyarova A (2018) Solving the Credit Assignment Problem With the Prefrontal Cortex. Front. Neurosci . 12:182. doi: 10.3389/fnins.2018.00182

Received: 27 September 2017; Accepted: 06 March 2018; Published: 27 March 2018.


Copyright © 2018 Stolyarova. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Alexandra Stolyarova, [email protected]


Credit Assignment in Neural Networks through Deep Feedback Control

The success of deep learning sparked interest in whether the brain learns by using similar techniques for assigning credit to each synaptic weight for its contribution to the network output. However, the majority of current attempts at biologically-plausible learning methods are either non-local in time, require highly specific connectivity motifs, or have no clear link to any known mathematical optimization method. Here, we introduce Deep Feedback Control (DFC), a new learning method that uses a feedback controller to drive a deep neural network to match a desired output target and whose control signal can be used for credit assignment. The resulting learning rule is fully local in space and time and approximates Gauss-Newton optimization for a wide range of feedback connectivity patterns. To further underline its biological plausibility, we relate DFC to a multi-compartment model of cortical pyramidal neurons with a local voltage-dependent synaptic plasticity rule, consistent with recent theories of dendritic processing. By combining dynamical system theory with mathematical optimization theory, we provide a strong theoretical foundation for DFC that we corroborate with detailed results on toy experiments and standard computer-vision benchmarks.

1 Introduction

The error backpropagation (BP) algorithm [ 1 , 2 , 3 ] is currently the gold standard to perform credit assignment (CA) in deep neural networks. Although deep learning was inspired by biological neural networks, an exact mapping of BP onto biology to explain learning in the brain leads to several inconsistencies with experimental results that are not yet fully addressed [ 4 , 5 , 6 ] . First, BP requires an exact symmetry between the weights of the forward and feedback pathways [ 5 , 6 ] , also called the weight transport problem. Another issue of relevance is that, in biological networks, feedback also changes each neuron’s activation and thus its immediate output [ 7 , 8 ] , which does not occur in BP.

Lillicrap et al. [ 9 ] convincingly showed that the weight transport problem can be sidestepped in modest supervised learning problems by using random feedback connections. However, follow-up studies indicated that random feedback paths cannot provide precise CA in more complex problems [ 10 , 11 , 12 , 13 ] , which can be mitigated by learning feedback weights that align with the forward pathway [ 14 , 15 , 16 , 17 , 18 ] or approximate its inverse [ 19 , 20 , 21 , 22 ] . However, this precise alignment imposes strict constraints on the feedback weights, whereas more flexible constraints could provide the freedom to use feedback also for other purposes besides learning, such as attention and prediction [ 8 ] .

A complementary line of research proposes models of cortical microcircuits which propagate CA signals through the network using dynamic feedback [ 23 , 24 , 25 ] or multiplexed neural codes [ 26 ] , thereby directly influencing neural activations with feedback. However, these models introduce highly specific connectivity motifs and tightly coordinated plasticity mechanisms. Whether these constraints can be fulfilled by cortical networks is an interesting experimental question. Another line of work uses adaptive control theory [ 27 ] to derive learning rules for non-hierarchical recurrent neural networks (RNNs) based on error feedback, which drives neural activity to track a reference output [ 28 , 29 , 30 , 31 ] . These methods have so far only been used to train single-layer RNNs with fixed output and feedback weights, making it unclear whether they can be extended to deep neural networks. Finally, two recent studies [ 32 , 33 ] use error feedback in a dynamical setting to invert the forward pathway, thereby enabling errors to flow backward. These approaches rely on a learning rule that is non-local in time and it remains unclear whether they approximate any known optimization method. Addressing the latter, two recent studies take a first step by relating learned (non-dynamical) inverses of the forward pathway [ 21 ] and iterative inverses restricted to invertible networks [ 22 ] to approximate Gauss-Newton optimization.

Inspired by the Dynamic Inversion method [ 32 ] , we introduce Deep Feedback Control (DFC), a new biologically-plausible CA method that addresses the above-mentioned limitations and extends the control theory approach to learning [ 28 , 29 , 30 , 31 ] to deep neural networks. DFC uses a feedback controller that drives a deep neural network to match a desired output target. For learning, DFC then simply uses the dynamic change in the neuron activations to update their synaptic weights, resulting in a learning rule fully local in space and time. We show that DFC approximates Gauss-Newton (GN) optimization and therefore provides a fundamentally different approach to CA compared to BP. Furthermore, DFC does not require precise alignment between forward and feedback weights, nor does it rely on highly specific connectivity motifs. Interestingly, the neuron model used by DFC can be closely connected to recent multi-compartment models of cortical pyramidal neurons. Finally, we provide detailed experimental results, corroborating our theoretical contributions and showing that DFC does principled CA on standard computer-vision benchmarks in a way that fundamentally differs from standard BP.

2 The Deep Feedback Control method

Here, we introduce the core parts of DFC. In contrast to conventional feedforward neural network models, DFC makes use of a dynamical neuron model (Section 2.1). We use a feedback controller to drive the neurons of the network to match a desired output target (Section 2.2), while simultaneously updating the synaptic weights using the change in neuronal activities (Section 2.3). This combination of dynamical neurons and a controller leads to a simple but powerful learning method that is linked to GN optimization and offers a flexible range of feedback connectivity (see Section 3).

2.1 Neuron and network dynamics

The first main component of DFC is a dynamical multilayer network, in which every neuron integrates its forward and feedback inputs according to the following dynamics:

$$\tau_v\,\frac{\mathrm{d}}{\mathrm{d}t}\mathbf{v}_i(t) = -\mathbf{v}_i(t) + W_i\,\mathbf{r}_{i-1}(t) + Q_i\,\mathbf{u}(t), \qquad i = 1, \ldots, L, \qquad (1)$$

with $\mathbf{v}_i$ a vector containing the pre-nonlinearity activations of the neurons in layer $i$, $W_i$ the forward weight matrix, $\phi$ a smooth nonlinearity, $\mathbf{u}$ a feedback input, $Q_i$ the feedback weight matrix, and $\tau_v$ a time constant. See Fig. 1B for a schematic representation of the network. To simplify notation, we define $\mathbf{r}_i = \phi(\mathbf{v}_i)$ as the post-nonlinearity activations of layer $i$. The input $\mathbf{r}_0$ remains fixed throughout the dynamics (1). Note that in the absence of feedback, i.e., $\mathbf{u} = 0$, the equilibrium state of the network dynamics (1) corresponds to a conventional multilayer feedforward network,

$$\mathbf{r}_i^- = \phi(W_i\,\mathbf{r}_{i-1}^-), \qquad i = 1, \ldots, L, \qquad (2)$$

which we denote with superscript '−'.
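As a concrete illustration, the following minimal NumPy sketch integrates these neuron dynamics with explicit Euler steps. The layer sizes, the tanh nonlinearity, the step size dt, the time constant, and the helper name network_step are illustrative assumptions for a toy setup, not part of the model specification above.

import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 2]                                                   # assumed toy architecture
W = [rng.normal(0, 0.5, (sizes[i + 1], sizes[i])) for i in range(3)]   # forward weights W_i
Q = [rng.normal(0, 0.5, (sizes[i + 1], sizes[-1])) for i in range(3)]  # direct feedback weights Q_i
phi = np.tanh                                                          # assumed smooth nonlinearity
tau_v, dt = 1.0, 0.1                                                   # assumed time constant and Euler step

def network_step(v, r0, u):
    # one Euler step of: tau_v dv_i/dt = -v_i + W_i r_{i-1} + Q_i u
    r_prev, v_new = r0, []
    for i in range(len(v)):
        v_new.append(v[i] + dt * (-v[i] + W[i] @ r_prev + Q[i] @ u) / tau_v)
        r_prev = phi(v[i])                                             # r_i = phi(v_i) feeds the next layer
    return v_new

r0 = rng.normal(size=sizes[0])                                         # fixed network input r_0
v = [np.zeros(n) for n in sizes[1:]]
u = np.zeros(sizes[-1])                                                # u = 0: relax to the feedforward equilibrium
for _ in range(200):
    v = network_step(v, r0, u)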

2.2 Feedback controller

The second core component of DFC is a feedback controller, which is only active during learning. Instead of a single backward pass for providing feedback, DFC uses a feedback controller to continuously drive the network to an output target 𝐫 L ∗ subscript superscript 𝐫 𝐿 \mathbf{r}^{*}_{L} (see Fig. 1 D). Following the Target Propagation framework [ 20 , 21 , 22 ] , we define 𝐫 L ∗ subscript superscript 𝐫 𝐿 \mathbf{r}^{*}_{L} as the feedforward output nudged towards lower loss:

$$\mathbf{r}_L^* = (1 - 2\lambda)\,\mathbf{r}_L^- + 2\lambda\,\mathbf{y}, \qquad (3)$$

with $\mathbf{y}$ the supervised label and $\lambda$ a small target stepsize.

The feedback controller produces a feedback signal $\mathbf{u}(t)$ to drive the network output $\mathbf{r}_L(t)$ towards its target $\mathbf{r}_L^*$, using the control error $\mathbf{e}(t) \triangleq \mathbf{r}_L^* - \mathbf{r}_L(t)$. A standard approach in designing a feedback controller is the Proportional-Integral-Derivative (PID) framework [34]. While DFC is compatible with various controller types, such as a full PID controller or a pure proportional controller (see Appendix A.8), we use a PI controller for a combination of simplicity and good performance, resulting in the following controller dynamics (see also Fig. 1A):

$$\mathbf{u}(t) = \mathbf{u}^{\text{int}}(t) + k_p\,\mathbf{e}(t), \qquad \tau_u\,\frac{\mathrm{d}}{\mathrm{d}t}\mathbf{u}^{\text{int}}(t) = \mathbf{e}(t) - \alpha\,\mathbf{u}^{\text{int}}(t), \qquad (4)$$

where a leakage term is added to constrain the magnitude of $\mathbf{u}^{\text{int}}$. For mathematical simplicity, we take the control matrices equal to $K_I = I$ and $K_P = k_p I$, with $k_p \geq 0$ the proportional control constant. This PI controller adds a leaky integration of the error, $\mathbf{u}^{\text{int}}$, to a scaled version of the error, $k_p\mathbf{e}$, which could be implemented by a dedicated neural microcircuit (for a discussion see App. I). Drawing inspiration from the Target Propagation framework [19, 20, 21, 22] and the Dynamic Inversion framework [32], one can think of the controller and network dynamics as performing a dynamic inversion of the output target $\mathbf{r}_L^*$ towards the hidden layers, as the controller dynamically changes the activation of the hidden layers until the output target is reached.
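Continuing the toy setup from the previous sketch, the loop below computes an output target and runs the assumed PI controller alongside the network dynamics; lambda, k_p, alpha and tau_u are illustrative values, not the settings used in any experiment.

def output_target(r_L_ff, y, lam=0.01):
    # feedforward output nudged towards the label y, assuming an L2 loss (cf. eq. 3)
    return (1 - 2 * lam) * r_L_ff + 2 * lam * y

def controller_step(u_int, e, k_p=0.5, alpha=0.01, tau_u=1.0):
    # one Euler step of the assumed PI controller: u = u_int + k_p e, tau_u du_int/dt = e - alpha u_int
    u_int = u_int + dt * (e - alpha * u_int) / tau_u
    return u_int, u_int + k_p * e

y = rng.normal(size=sizes[-1])                       # toy label
r_L_star = output_target(phi(v[-1]), y)              # v holds (approximately) the feedforward equilibrium
u_int = np.zeros(sizes[-1])
for _ in range(500):
    e = r_L_star - phi(v[-1])                        # control error e(t) = r_L^* - r_L(t)
    u_int, u = controller_step(u_int, e)
    v = network_step(v, r0, u)                       # the controller drives the output towards its target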

2.3 Forward weight updates

The update rule for the feedforward weights has the form:

$$\tau_W\,\frac{\mathrm{d}}{\mathrm{d}t}W_i(t) = \big(\phi(\mathbf{v}_i(t)) - \phi(W_i\,\mathbf{r}_{i-1}(t))\big)\,\mathbf{r}_{i-1}(t)^T. \qquad (5)$$

This learning rule simply compares the neuron's controlled activation to its current feedforward input and is thus local in space and time. Furthermore, it can be interpreted most naturally by compartmentalizing the neuron into the central compartment $\mathbf{v}_i$ from (1) and a feedforward compartment $\mathbf{v}_i^{\text{ff}} \triangleq W_i\mathbf{r}_{i-1}$ that integrates the feedforward input. The forward weight dynamics (5) then represent a delta rule using the difference between the actual firing rate of the neuron, $\phi(\mathbf{v}_i)$, and its estimated firing rate, $\phi(\mathbf{v}_i^{\text{ff}})$, based on the feedforward inputs. Note that we assume $\tau_W$ to be a large time constant, such that the network (1) and controller dynamics (4) are not influenced by the weight dynamics, i.e., the weights are considered fixed on the timescale of the controller and network dynamics.
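Continuing the same toy setup, a minimal sketch of this delta rule is given below; the learning-rate constant eta and the in-place update are illustrative choices.

def forward_weight_update(v, r0, eta=1e-3):
    # local delta rule: postsynaptic error phi(v_i) - phi(W_i r_{i-1}) times presynaptic rate r_{i-1}
    r_prev = r0
    for i in range(len(W)):
        v_ff = W[i] @ r_prev                          # feedforward compartment v_i^ff = W_i r_{i-1}
        W[i] += eta * np.outer(phi(v[i]) - phi(v_ff), r_prev)
        r_prev = phi(v[i])

forward_weight_update(v, r0)                          # one update using the controlled activations from above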

In Section 5, we show how the feedback weights $Q_i$ can also be learned locally in time and space, supporting the stability of the network dynamics and the learning of $W_i$. This feedback learning rule needs a feedback compartment $\mathbf{v}_i^{\text{fb}} \triangleq Q_i\mathbf{u}$, leading to the three-compartment neuron schematized in Fig. 1C, inspired by recent multi-compartment models of the pyramidal neuron (see Discussion). Now that we have introduced the DFC model, we will show that (i) the weight updates (5) can properly optimize a loss function (Section 3), (ii) the resulting dynamical system is stable under certain conditions (Section 4), and (iii) learning the feedback weights facilitates (i) and (ii) (Section 5).

3 Learning theory

To understand how DFC optimizes the feedforward mapping (2) on a given loss function, we link the weight updates (5) to mathematical optimization theory. We start by showing that DFC dynamically inverts the output error to the hidden layers (Section 3.1), which we link to GN optimization under flexible constraints on the feedback weights $Q_i$ and on the layer activations (Section 3.2). In Section 3.3, we relax some of these constraints and show that DFC still performs principled optimization by using minimum norm (MN) updates for $W_i$. Throughout this learning theory section, we assume stable dynamics, which we investigate in more detail in Section 4. All theoretical results of this section are tailored towards a PI controller, but they can easily be extended to pure proportional or integral control (see App. A.8).

3.1 DFC dynamically inverts the output error

Lemma 1. Assuming stable dynamics, a small target stepsize $\lambda$, and $W_i$ and $Q_i$ fixed, the steady-state solutions of the dynamical systems (1) and (4) can be approximated by

$$\Delta\mathbf{v}_{\mathrm{ss}} \approx Q\,(JQ + \tilde{\alpha} I)^{-1}\,\boldsymbol{\delta}_L, \qquad \tilde{\alpha} = \alpha/(1 + \alpha k_p), \qquad (6)$$

with $\boldsymbol{\delta}_L \triangleq \mathbf{r}_L^* - \mathbf{r}_L^-$ the output error, $J$ the Jacobian of the network output w.r.t. the pre-nonlinearity activations $\mathbf{v}$, and $Q$ the concatenated feedback weight matrices $Q_i$.

To gain intuition, write the controlled activations as $\mathbf{v}_{\mathrm{ss}} = \mathbf{v}^{\text{ff}}_{\mathrm{ss}} + \Delta\mathbf{v}$, such that the steady-state network output equals its target $\mathbf{r}_L^*$. With linearized network dynamics, this amounts to solving the linear system $J\Delta\mathbf{v} = \boldsymbol{\delta}_L$. As $\Delta\mathbf{v}$ is of much higher dimension than $\boldsymbol{\delta}_L$, this is an underdetermined system with infinitely many solutions. Constraining the solution to the column space of $Q$ leads to the unique solution $\Delta\mathbf{v} = Q(JQ)^{-1}\boldsymbol{\delta}_L$, corresponding to the steady-state solution in Lemma 1 up to the small damping constant $\tilde{\alpha}$. Hence, similar to Podlaski and Machens [32], through an interplay between the network and controller dynamics, the controller dynamically inverts the output error $\boldsymbol{\delta}_L$ to produce feedback that exactly drives the network output to its desired target.
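The column-space-constrained solution of this underdetermined system can be verified directly; the dimensions in this standalone NumPy check are arbitrary toy values.

import numpy as np
rng = np.random.default_rng(1)
n_out, n_hidden = 3, 12
J = rng.normal(size=(n_out, n_hidden))          # linearized network Jacobian
Q = rng.normal(size=(n_hidden, n_out))          # concatenated feedback weights
delta_L = rng.normal(size=n_out)                # output error r_L^* - r_L^-

dv = Q @ np.linalg.inv(J @ Q) @ delta_L         # the unique solution of J dv = delta_L within Col(Q)
assert np.allclose(J @ dv, delta_L)             # the feedback exactly accounts for the output error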

3.2 DFC approximates Gauss-Newton optimization

To understand the optimization characteristics of DFC, we show that, under flexible conditions on $Q_i$ and the layer activations, DFC approximates GN optimization. We first briefly review GN optimization and introduce two conditions needed for the main theorem.

Gauss-Newton optimization [35] is an approximate second-order optimization method used in nonlinear least-squares regression. The GN update for the model parameters $\boldsymbol{\theta}$ is computed as

$$\Delta\boldsymbol{\theta} = J_{\theta}^{\dagger}\,\mathbf{e}_L, \qquad (7)$$

with $J_{\theta}$ the Jacobian of the model output w.r.t. $\boldsymbol{\theta}$, concatenated over all minibatch samples, $J_{\theta}^{\dagger}$ its Moore-Penrose pseudoinverse, and $\mathbf{e}_L$ the output errors.
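For a minibatch of a single sample, the GN step reduces to applying a pseudoinverse to the output error, as in this standalone sketch with placeholder dimensions.

import numpy as np
rng = np.random.default_rng(2)
n_out, n_params = 3, 40
J_theta = rng.normal(size=(n_out, n_params))    # Jacobian of the model output w.r.t. the parameters
e_L = rng.normal(size=n_out)                    # output error for this sample

delta_theta = np.linalg.pinv(J_theta) @ e_L     # GN step: minimum-norm solution of J_theta d = e_L
assert np.allclose(J_theta @ delta_theta, e_L)  # the linearized output reaches the target after the step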

Condition 1.

Each layer of the network, except for the output layer, has the same activation norm:

$$\|\mathbf{r}_i\|_2 = \|\mathbf{r}_j\|_2, \qquad \forall\, i, j \in \{1, \ldots, L-1\}.$$

Note that the latter condition considers a statistic $\|\mathbf{r}_i\|_2$ of a whole layer and does not impose specific constraints on single neural firing rates. This condition can be interpreted as each layer, except the output layer, having the same 'energy budget' for firing.

Condition 2.

The column space of $Q$ is equal to the row space of $J$.

This more abstract condition imposes a flexible constraint on the feedback weights $Q_i$ that generalizes common learning rules with direct feedback connections [16, 21]. For instance, besides $Q = J^T$ (BP; [16]) and $Q = J^{\dagger}$ [21], many other instances of $Q$ which have not yet been explored in the literature fulfill Condition 2 (see Fig. 2), hence leading to principled optimization (see Theorem 2). With these conditions in place, we are ready to state the main theorem of this section (full proof in App. A).
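Because Condition 2 only fixes the column space of $Q$, many feedback configurations beyond $J^T$ or $J^{\dagger}$ qualify. The standalone check below builds one such $Q$ from an arbitrary full-rank mixing matrix M (a hypothetical construction used purely for illustration) and confirms that the resulting dynamical inversion equals the pseudoinverse of $J$.

import numpy as np
rng = np.random.default_rng(3)
n_out, n_hidden = 3, 12
J = rng.normal(size=(n_out, n_hidden))
M = rng.normal(size=(n_out, n_out))             # any full-rank mixing matrix
Q = J.T @ M                                     # Col(Q) = Row(J), yet Q differs from J^T and from pinv(J)

S = Q @ np.linalg.inv(J @ Q)                    # dynamical inversion Q (J Q)^{-1}
assert np.allclose(S, np.linalg.pinv(J))        # equals the Moore-Penrose pseudoinverse of J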

Theorem 2.

Assume an $L^2$ output loss, Conditions 1 and 2 hold, and the limit of $\lambda, \alpha \rightarrow 0$. Then the steady-state weight updates

$$\Delta W_i = \eta\,\Delta\mathbf{v}_{i,\mathrm{ss}}\,\mathbf{r}_{i-1,\mathrm{ss}}^T, \qquad (9)$$

with $\eta$ a stepsize parameter, align with the weight updates for $W_i$ for the feedforward network (2) prescribed by the GN optimization method with a minibatch size of 1.

In this theorem, we need Condition 2 such that the dynamical inversion $Q(JQ)^{-1}$ (6) equals the pseudoinverse of $J$, and we need Condition 1 to extend this pseudoinverse to the Jacobian of the output w.r.t. the network weights, as in eq. (7). Theorem 2 links the DFC method to GN optimization, thereby showing that it does principled optimization while being fundamentally different from BP. In contrast to recent work that connects target propagation to GN [21, 22], we do not need to approximate the GN curvature matrix by a block-diagonal matrix but use the full curvature instead. Hence, one can use Theorem 2 in Cai et al. [36] to obtain convergence results for this setting of GN with a minibatch size of 1 in highly overparameterized networks. Strikingly, the feedback path of DFC does not need to align with the forward path or its inverse to provide weight updates that are optimally aligned with GN, as long as it satisfies the flexible Condition 2 (see Fig. 2).

The steady-state updates (9) used in Theorem 2 differ from the actual updates (5) in two nuanced ways. First, the plasticity rule (5) applies the nonlinearity $\phi$ to the compartment activations, whereas in Theorem 2 this nonlinearity is not included. There are two reasons for this: (i) the use of $\phi$ in (5) can be linked to specific biophysical mechanisms in the pyramidal cell [37] (see Discussion), and (ii) using $\phi$ makes sure that saturated neurons do not update their forward weights, which leads to better performance (see App. A.6). Second, in Theorem 2 the weights are only updated at steady state, whereas in (5) they are continuously updated during the dynamics of the network and controller. The dynamics oscillate around the steady-state value before settling rapidly (see Fig. 1D); hence, the accumulated continuous updates (5) are approximately equal to their steady-state equivalent, since the oscillations approximately cancel each other out and the steady state is reached quickly (see Section 6.1 and App. A.7). Theorem 2 needs an $L^2$ loss function and Conditions 1 and 2 to hold for linking DFC with GN. In the following subsection, we relax these assumptions and show that DFC still does principled optimization.

3.3 DFC uses weighted minimum norm updates

GN optimization with a minibatch size of 1 is equivalent to MN updates [21], i.e., it computes the smallest possible weight update such that the network exactly reaches the current output target after the update. These MN updates can be generalized to weighted MN updates for targets using arbitrary loss functions. The following theorem shows the connection between DFC and these weighted MN updates, while removing the need for Condition 1 and an $L^2$ loss (full proof in App. A).

Theorem 3.

Assume Condition 2 holds and consider the limit of $\lambda, \alpha \rightarrow 0$. Then the steady-state weight updates (9) are the weighted minimum norm updates for which the network output without feedback reaches the output target after the update, i.e., $\mathbf{r}_L^{-(m+1)} = \mathbf{r}_L^*$, with $\mathbf{r}_L^{-(m+1)}$ the network output without feedback after the weight update.

Theorem 3 shows that Condition 2 enables the controller to drive the network towards its target $\mathbf{r}_L^*$ with MN activation changes, $\Delta\mathbf{v} = \mathbf{v} - \mathbf{v}^{\text{ff}}$, which, combined with the steady-state weight update (9), result in weighted MN updates $\Delta W_i$ (see also App. A.4). When the feedback weights do not have the correct column space, the weight updates will not be MN. Nevertheless, the following proposition shows that the weight updates still follow a descent direction given arbitrary feedback weights.

Proposition 4.

4 Stability of DFC

Until now, we assumed that the network dynamics are stable, which is necessary for DFC, as an unstable network will diverge, making learning impossible. In this section, we investigate the conditions on the feedback weights $Q_i$ necessary for stability. To gain intuition, we linearize the network around its feedforward values, assume a separation of timescales between the controller and the network ($\tau_u \gg \tau_v$), and only consider integrative control ($k_p = 0$). This results in the following dynamics (see App. B for the derivation):

$$\tau_u\,\frac{\mathrm{d}}{\mathrm{d}t}\mathbf{u}(t) = \boldsymbol{\delta}_L - (JQ + \alpha I)\,\mathbf{u}(t).$$

Hence, in this simplified case, the local stability of the network around the equilibrium point depends on the eigenvalues of $JQ$, which is formalized in the following condition and proposition.

Condition 3.

Given the network Jacobian evaluated at the steady state, $J_{\mathrm{ss}} \triangleq \left[\frac{\partial\mathbf{r}_L^-}{\partial\mathbf{v}_1}, \ldots, \frac{\partial\mathbf{r}_L^-}{\partial\mathbf{v}_L}\right]\big\rvert_{\mathbf{v}=\mathbf{v}_{\mathrm{ss}}}$, the real parts of the eigenvalues of $J_{\mathrm{ss}}Q$ are all greater than $-\alpha$.

Proposition 5.

Assuming $\tau_u \gg \tau_v$ and $k_p = 0$, the network and controller dynamics are locally asymptotically stable around their equilibrium iff Condition 3 holds.

This proposition follows directly from Lyapunov's Indirect Method [38]. When we consider the more general case where $\tau_v$ is not negligible and $k_p > 0$, the stability criteria quickly become less interpretable (see App. B). Experimentally, however, we see that Condition 3 is a good proxy condition for guaranteeing stability in this general case (see Section 6 and App. B).
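Condition 3 is straightforward to monitor numerically during training; the standalone sketch below checks it for a toy Jacobian and a symmetric feedback choice, $Q = J^T$, for which $JQ = JJ^T$ is positive semidefinite and the condition holds for any $\alpha > 0$.

import numpy as np
rng = np.random.default_rng(4)
n_out, n_hidden, alpha = 3, 12, 0.1
J = rng.normal(size=(n_out, n_hidden))          # Jacobian evaluated at the steady state (toy stand-in)
Q = J.T                                         # symmetric feedback: J Q = J J^T is positive semidefinite

eigs = np.linalg.eigvals(J @ Q)
print("Condition 3 satisfied:", np.all(eigs.real > -alpha))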

5 Learning the feedback weights

Conditions 2 and 3 emphasize the importance of the feedback weights for enabling efficient learning and for ensuring stability of the network dynamics, respectively. As the forward weights, and hence the network Jacobian $J$, change during training, the set of feedback configurations that satisfy Conditions 2 and 3 also changes. This creates the need to adapt the feedback weights accordingly to ensure efficient learning and network stability. We address this challenge by learning the feedback weights, such that they can adapt to the changing network during training. We separate forward and feedback weight training into alternating wake-sleep phases [39]. Note that in practice, a fast alternation between the two phases is not required (see Section 6).

Inspired by the Weight Mirror method [14], we learn the feedback weights by inserting independent zero-mean noise $\boldsymbol{\epsilon}$ into the system dynamics:

$$\tau_v\,\frac{\mathrm{d}}{\mathrm{d}t}\mathbf{v}_i(t) = -\mathbf{v}_i(t) + W_i\,\mathbf{r}_{i-1}(t) + Q_i\,\mathbf{u}(t) + \sigma\,\boldsymbol{\epsilon}_i(t).$$

The noise fluctuations propagated to the output carry information about the network Jacobian $J$. To let $\mathbf{e}$, and hence $\mathbf{u}$, incorporate this noise information, we set the output target $\mathbf{r}_L^*$ to the average network output $\mathbf{r}_L^-$. As the network is continuously perturbed by noise, the controller will try to counteract the noise and regulate the network towards the output target $\mathbf{r}_L^-$. The feedback weights can then be trained with a simple anti-Hebbian plasticity rule with weight decay, which is local in space and time:

$$\tau_Q\,\frac{\mathrm{d}}{\mathrm{d}t}Q_i(t) = -\mathbf{v}_i^{\text{fb}}(t)\,\mathbf{u}(t)^T - \beta\,Q_i(t), \qquad (13)$$

where the feedback compartment now also receives noise, $\mathbf{v}_i^{\text{fb}} = Q_i\mathbf{u} + \sigma_{\text{fb}}\,\boldsymbol{\epsilon}_i^{\text{fb}}$. The correlation between the noise in $\mathbf{v}_i^{\text{fb}}$ and the noise fluctuations in $\mathbf{u}$ provides the teaching signal for $Q_i$. Theorem 6 shows, under simplifying assumptions, that the feedback learning rule (13) drives $Q_i$ to satisfy Conditions 2 and 3 (see App. C for the full theorem and its proof).
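A minimal sketch of a single step of this noise-driven phase is given below. The specific anti-Hebbian form with weight decay, and the constants sigma_fb, beta and lr_q, are assumptions consistent with the description above rather than the method's verbatim equations.

import numpy as np

def feedback_weight_update(Q_i, u, eps_fb, lr_q=1e-3, beta=1e-3, sigma_fb=0.1):
    # assumed local anti-Hebbian step with weight decay:
    # noisy feedback compartment v_fb = Q_i u + sigma_fb eps_fb, dQ_i ~ -v_fb u^T - beta Q_i
    v_fb = Q_i @ u + sigma_fb * eps_fb
    return Q_i + lr_q * (-np.outer(v_fb, u) - beta * Q_i)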

Theorem 6 (Short version).

Assume a separation of timescales $\tau_v \ll \tau_u \ll \tau_Q$, a large leak constant $\alpha$, $k_p = 0$, $\mathbf{r}_L^* = \mathbf{r}_L^-$, and that Condition 3 holds. Then, for a fixed input sample and $\sigma \rightarrow 0$, the first moment of $Q$ converges approximately to

$$\mathbb{E}[Q_{\mathrm{ss}}] \propto J^T\,(JJ^T + \gamma I)^{-1}$$

for some $\gamma > 0$. Furthermore, $\mathbb{E}[Q_{\mathrm{ss}}]$ satisfies Conditions 2 and 3, even if $\alpha = 0$ in the latter.

Theorem 6 shows that, under simplifying assumptions, $Q$ converges towards a damped pseudoinverse of $J$, which satisfies Conditions 2 and 3. Empirically, we see that this also approximately holds in more general settings where $\tau_v$ is not negligible, $k_p > 0$, and $\alpha$ is small (see Section 6 and App. C), with $Q$ converging approximately to the damped pseudoinverse $J^T(JJ^T + \gamma I)^{-1}$ averaged over many samples.
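The fixed point predicted by Theorem 6 can be computed in closed form; gamma in this standalone sketch is an arbitrary damping value.

import numpy as np
rng = np.random.default_rng(5)
n_out, n_hidden, gamma = 3, 12, 0.1
J = rng.normal(size=(n_out, n_hidden))

Q_damped = J.T @ np.linalg.inv(J @ J.T + gamma * np.eye(n_out))    # damped pseudoinverse of J
# as gamma -> 0, this approaches the Moore-Penrose pseudoinverse, whose column space equals Row(J)
assert np.allclose(J.T @ np.linalg.inv(J @ J.T), np.linalg.pinv(J))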

6 Experiments

6.1 Empirical verification of the theory

Figure 3 visualizes the theoretical results of Theorems 2 and 3 and Conditions 1, 2 and 3 in an empirical setting of nonlinear student-teacher regression, where a randomly initialized teacher network generates synthetic training data for a student network. We see that Condition 2 is approximately satisfied for all DFC variants that learn their feedback weights (Fig. 3A), leading to close alignment with the ideal weighted MN updates of Theorem 3 (Fig. 3B). For nonlinear networks and linear direct feedback, it is in general not possible to perfectly satisfy Condition 2, as the network Jacobian $J$ varies for each datasample while $Q_i$ remains the same. However, the results indicate that feedback learning finds a configuration for $Q_i$ that approximately satisfies Condition 2 for all datasamples. When the feedback weights are fixed, Condition 2 is approximately satisfied at the beginning of training due to a good initialization. However, as the network changes during training, Condition 2 degrades modestly, which results in worse alignment compared to DFC with trained feedback weights (Fig. 3B).

For having GN updates, both Conditions 1 and 2 need to be satisfied. Although we do not enforce Condition 1 during training, we see in Fig. 3C that it is crudely satisfied, which can be explained by the saturating properties of the $\tanh$ nonlinearity. This is reflected in the alignment with the ideal GN updates in Fig. 3D, which follows the same trend as the alignment with the MN updates. Fig. 3E shows that all DFC variants remain stable throughout training, even when the feedback weights are fixed. In App. B, we indicate that Condition 3 is a good proxy for the stability shown in Fig. 3E. Finally, we see in Fig. 3F that the weight updates of DFC and DFC-SS align well with the analytical steady-state solution of Lemma 1, confirming that our learning theory of Section 3 applies to the continuous weight updates (5) of DFC.
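The alignment reported in Fig. 3 compares update directions; a standard way to quantify it, and the one assumed in this small standalone helper, is the angle between the flattened update vectors.

import numpy as np

def alignment_degrees(update_a, update_b):
    # angle in degrees between two weight updates, flattened into vectors
    a, b = update_a.ravel(), update_b.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print(alignment_degrees(np.array([[1.0, 0.0]]), np.array([[1.0, 0.1]])))   # small angle: well aligned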

In Fig. 4, we show that the alignment with MN updates remains robust for $\lambda \in [10^{-3}, 10^{-1}]$ and $\alpha \in [10^{-4}, 10^{-1}]$, highlighting that our theory explains the behavior of DFC robustly when the limit of $\lambda$ and $\alpha$ to zero does not hold. When we clamp the output target to the label ($\lambda = 0.5$), the alignment with the MN updates decreases as expected (see Fig. 4), because the linearization of Lemma 1 becomes less accurate and the strong feedback changes the neural activations more significantly, thereby changing the presynaptic factor of the update rules (cf. eq. 9). However, performance results on MNIST, provided in Table 2, show that the performance of DFC remains robust for a wide range of $\lambda$ and $\alpha$ values, including $\lambda = 0.5$, suggesting that DFC can also provide principled CA in this setting of strong feedback. This motivates future work to design a complementary theory for DFC focused on this extreme case.

Figure 4: Comparison of the alignment between the DFC weight updates and the MN updates for variable values of $\lambda$ (A) and $\alpha$ (B), when performing the nonlinear student-teacher regression task described in Fig. 3. Stars indicate overlapping plots.

6.2 Performance of DFC on computer vision benchmarks

The classification results on MNIST and Fashion-MNIST (Table 1) show that the performances of DFC and its variants, but also of its controls, lie close to the performance of BP, indicating that they perform proper CA in these tasks. To see significant differences between the methods, we consider the more challenging task of training an autoencoder on MNIST, where it is known that DFA fails to provide precise CA [9, 16, 32]. The results in Table 1 show that the DFC variants with trained feedback weights clearly outperform DFA and perform close to BP. The low performance of the DFC variants with fixed feedback weights shows the importance of learning the feedback weights continuously during training to satisfy Condition 2. Finally, to disentangle optimization performance from implicit regularization mechanisms, which both influence the test performance, we investigate the performance of all methods in minimizing the training loss of MNIST (using separate hyperparameter configurations, selected for minimizing the training loss). The results in Table 1 show improved performance of the DFC method with trained feedback weights compared to BP and controls, suggesting that the approximate MN updates of DFC can descend the loss landscape faster for this simple dataset.

7 Discussion

We introduced DFC as an alternative biologically-plausible learning method for deep neural networks. DFC uses error feedback to drive the network activations to a desired output target. This process generates a neuron-specific learning signal which can be used to learn both forward and feedback weights locally in time and space. In contrast to other recent methods that learn the feedback weights and aim to approximate BP [ 14 , 15 , 16 , 17 , 26 ] , we show that DFC approximates GN optimization, making it fundamentally different from BP approximations.

DFC is optimal – i.e., Conditions 2 and 3 are satisfied – for a wide range of feedback connectivity strengths. Thus, we prove that principled learning can be achieved with local rules and without symmetric feedforward and feedback connectivity by leveraging the network dynamics. This finding has interesting implications for experimental neuroscientific research looking for precise patterns of symmetric connectivity in the brain. Moreover, from a computational standpoint, the flexibility that stems from Conditions 2 and 3 might be relevant for other mechanisms besides learning, such as attention and prediction [ 8 ] .

To present DFC in its simplest form, we used direct feedback mappings from the output controller to all hidden layers. Although numerous anatomical studies of the mammalian neocortex reported the occurrence of such direct feedback connections [ 45 , 46 ] , it is unlikely that all feedback pathways are direct. We note that DFC is also compatible with other feedback mappings, such as layerwise connections or separate feedback pathways with multiple layers of neurons (see App. H ).

Interestingly, the three-compartment neuron is closely linked to recent multi-compartment models of the cortical pyramidal neuron [23, 25, 26, 47]. In the terminology of these models, our central, feedforward, and feedback compartments correspond to the somatic, basal dendritic, and apical dendritic compartments of pyramidal neurons, respectively (see Fig. 1C). In line with DFC, experimental observations [48, 49] suggest that feedforward connections converge onto the basal compartment and feedback connections onto the apical compartment. Moreover, our plasticity rule for the forward weights (5) belongs to a class of dendritic predictive plasticity rules for which a biological implementation based on backpropagating action potentials has been put forward [37].

Limitations and future work. In practice, the forward weight updates are not exactly equal to GN or MN updates (Theorems 2 and 3), due to (i) the nonlinearity $\phi$ in the weight update rule (5), (ii) non-infinitesimal values for $\alpha$ and $\lambda$, (iii) limited training iterations for the feedback weights, and (iv) the limited capacity of linear feedback mappings to satisfy Condition 2 for each datasample. Figs. 3 and 4 and Table 2 show that DFC approximates the theory well in practice and has robust performance; nevertheless, future work can improve the results further by investigating new feedback architectures (see App. H). We note that, even though GN optimization has desirable approximate second-order optimization properties, it is presently unclear whether these second-order characteristics translate to our setting with a minibatch size of 1. Currently, our proposed feedback learning rule (13) aims to approximate one specific configuration and hence does not capitalize on the increased flexibility of DFC and Condition 2. Therefore, an interesting future direction is to design more flexible feedback learning rules that aim to satisfy Conditions 2 and 3 without targeting one specific configuration. Furthermore, DFC needs two separate phases for training the forward weights and feedback weights. Interestingly, if the feedback plasticity rule (13) uses a high-pass filtered version of the presynaptic input $\mathbf{u}$, both phases can be merged into one, with plasticity always on for both forward and feedback weights (see App. C.3). Finally, as DFC is dynamical in nature, it is costly to simulate on commonly used hardware for deep learning, prohibiting us from testing DFC on large-scale problems such as those considered by Bartunov et al. [10]. A promising alternative is to implement DFC on analog hardware, where the dynamics of DFC can correspond to real physical processes on a chip. This would not only make DFC resource-efficient, but would also position DFC as an interesting training method for analog implementations of deep neural networks, commonly used in edge AI and other applications where low energy consumption is key [50, 51].

To conclude, we show that DFC can provide principled CA in deep neural networks by actively using error feedback to drive neural activations. The flexible requirements for feedback mappings combined with the strong link between DFC and GN, underline that it is possible to do principled CA in neural networks without adhering to the symmetric layer-wise feedback structure imposed by BP.

Acknowledgments and Disclosure of Funding

This work was supported by the Swiss National Science Foundation (B.F.G. CRSII5-173721 and 315230_189251), ETH project funding (B.F.G. ETH-20 19-01), the Human Frontiers Science Program (RGY0072/2019) and funding from the Swiss Data Science Center (B.F.G, C17-18, J. v. O. P18-03). João Sacramento was supported by an Ambizione grant (PZ00P3_186027) from the Swiss National Science Foundation. Pau Vilimelis Aceituno was supported by an ETH Zürich Postdoc fellowship. Javier García Ordóñez received support from La Caixa Foundation through the Postgraduate Studies in Europe scholarship. We would like to thank Anh Duong Vo and Nicolas Zucchet for feedback, William Podlaski, Jean-Pascal Pfister and Aditya Gilra for insightful discussions, and Simone Surace for his detailed feedback on Appendix C .1.

  • Rumelhart et al. [1986] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature , 323(6088):533, 1986.
  • Werbos [1982] Paul J Werbos. Applications of advances in nonlinear sensitivity analysis. In System modeling and optimization , pages 762–770. Springer, 1982.
  • Linnainmaa [1970] Seppo Linnainmaa. The representation of the cumulative rounding error of an algorithm as a taylor expansion of the local rounding errors. Master’s Thesis (in Finnish), Univ. Helsinki , pages 6–7, 1970.
  • Crick [1989] Francis Crick. The recent excitement about neural networks. Nature , 337(6203):129–132, 1989.
  • Grossberg [1987] Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science , 11(1):23–63, 1987.
  • Lillicrap et al. [2020] Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton. Backpropagation and the brain. Nature Reviews Neuroscience , pages 1–12, 2020.
  • Larkum et al. [2009] Matthew E Larkum, Thomas Nevian, Maya Sandler, Alon Polsky, and Jackie Schiller. Synaptic integration in tuft dendrites of layer 5 pyramidal neurons: a new unifying principle. Science , 325(5941):756–760, 2009.
  • Gilbert and Li [2013] Charles D Gilbert and Wu Li. Top-down influences on visual processing. Nature Reviews Neuroscience , 14(5):350–363, 2013.
  • Lillicrap et al. [2016] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications , 7:13276, 2016.
  • Bartunov et al. [2018] Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E Hinton, and Timothy Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems 31 , pages 9368–9378, 2018.
  • Launay et al. [2019] Julien Launay, Iacopo Poli, and Florent Krzakala. Principled training of neural networks with direct feedback alignment. arXiv preprint arXiv:1906.04554 , 2019.
  • Moskovitz et al. [2018] Theodore H Moskovitz, Ashok Litwin-Kumar, and LF Abbott. Feedback alignment in deep convolutional networks. arXiv preprint arXiv:1812.06488 , 2018.
  • Crafton et al. [2019] Brian Alexander Crafton, Abhinav Parihar, Evan Gebhardt, and Arijit Raychowdhury. Direct feedback alignment with sparse connections for local learning. Frontiers in Neuroscience , 13:525, 2019.
  • Akrout et al. [2019] Mohamed Akrout, Collin Wilson, Peter Humphreys, Timothy Lillicrap, and Douglas B Tweed. Deep learning without weight transport. In Advances in Neural Information Processing Systems 32 , pages 974–982, 2019.
  • Kunin et al. [2020] Daniel Kunin, Aran Nayebi, Javier Sagastuy-Brena, Surya Ganguli, Jonathan Bloom, and Daniel Yamins. Two routes to scalable credit assignment without weight symmetry. In International Conference on Machine Learning , pages 5511–5521. PMLR, 2020.
  • Lansdell et al. [2020] Benjamin James Lansdell, Prashanth Prakash, and Konrad Paul Kording. Learning to solve the credit assignment problem. In International Conference on Learning Representations , 2020.
  • Guerguiev et al. [2020] Jordan Guerguiev, Konrad Kording, and Blake Richards. Spike-based causal inference for weight alignment. In International Conference on Learning Representations , 2020.
  • Golkar et al. [2020] Siavash Golkar, David Lipshutz, Yanis Bahroun, Anirvan M. Sengupta, and Dmitri B. Chklovskii. A biologically plausible neural network for local supervision in cortical microcircuits, 2020.
  • Bengio [2014] Yoshua Bengio. How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906 , 2014.
  • Lee et al. [2015] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propagation. In Joint european conference on machine learning and knowledge discovery in databases , pages 498–515. Springer, 2015.
  • Meulemans et al. [2020] Alexander Meulemans, Francesco Carzaniga, Johan Suykens, João Sacramento, and Benjamin F. Grewe. A theoretical framework for target propagation. Advances in Neural Information Processing Systems , 33:20024–20036, 2020.
  • Bengio [2020] Yoshua Bengio. Deriving differential target propagation from iterating approximate inverses. arXiv preprint arXiv:2007.15139 , 2020.
  • Sacramento et al. [2018] João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. In Advances in Neural Information Processing Systems 31 , pages 8721–8732, 2018.
  • Whittington and Bogacz [2017] James CR Whittington and Rafal Bogacz. An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity. Neural computation , 29(5):1229–1262, 2017.
  • Guerguiev et al. [2017] Jordan Guerguiev, Timothy P Lillicrap, and Blake A Richards. Towards deep learning with segregated dendrites. ELife , 6:e22901, 2017.
  • Payeur et al. [2021] Alexandre Payeur, Jordan Guerguiev, Friedemann Zenke, Blake Richards, and Richard Naud. Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nature neuroscience , 24(5):1546, 2021.
  • Slotine et al. [1991] Jean-Jacques E Slotine, Weiping Li, et al. Applied nonlinear control , volume 199. Prentice hall Englewood Cliffs, NJ, 1991.
  • Gilra and Gerstner [2017] Aditya Gilra and Wulfram Gerstner. Predicting non-linear dynamics by stable local learning in a recurrent spiking neural network. Elife , 6:e28295, 2017.
  • Denève et al. [2017] Sophie Denève, Alireza Alemi, and Ralph Bourdoukan. The brain as an efficient and robust adaptive learner. Neuron , 94(5):969–977, 2017.
  • Alemi et al. [2018] Alireza Alemi, Christian Machens, Sophie Denève, and Jean-Jacques Slotine. Learning arbitrary dynamics in efficient, balanced spiking networks using local plasticity rules. AAAI Conference on Artificial Intelligence (AAAI) , 2018.
  • Bourdoukan and Deneve [2015] Ralph Bourdoukan and Sophie Deneve. Enforcing balance allows local supervised learning in spiking recurrent networks. Advances in Neural Information Processing Systems , 28:982–990, 2015.
  • Podlaski and Machens [2020] William F Podlaski and Christian K Machens. Biological credit assignment through dynamic inversion of feedforward networks. Advances in Neural Information Processing Systems 33 , 2020.
  • Kohan et al. [2018] Adam A Kohan, Edward A Rietman, and Hava T Siegelmann. Error forward-propagation: Reusing feedforward connections to propagate errors in deep learning. arXiv preprint arXiv:1808.03357 , 2018.
  • Franklin et al. [2015] Gene F Franklin, J David Powell, and Abbas Emami-Naeini. Feedback control of dynamic systems . Pearson London, 2015.
  • Gauss [1809] Carl Friedrich Gauss. Theoria motus corporum coelestium in sectionibus conicis solem ambientium , volume 7. Perthes et Besser, 1809.
  • Cai et al. [2019] Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. A gram-gauss-newton method learning overparameterized deep neural networks for regression problems. arXiv preprint arXiv:1905.11675 , 2019.
  • Urbanczik and Senn [2014] Robert Urbanczik and Walter Senn. Learning by the dendritic prediction of somatic spiking. Neuron , 81(3):521–528, 2014.
  • Lyapunov [1992] A. M. Lyapunov. The general problem of the stability of motion. International Journal of Control , 55(3):531–534, 1992. doi: 10.1080/00207179208934253 .
  • Hinton et al. [1995] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science , 268(5214):1158–1161, 1995.
  • LeCun [1998] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ , 1998.
  • Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 , 2017.
  • Nøkland [2016] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in neural information processing systems , pages 1037–1045, 2016.
  • Särkkä and Solin [2019] Simo Särkkä and Arno Solin. Applied stochastic differential equations , volume 10. Cambridge University Press, 2019.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings , 2014.
  • Ungerleider et al. [2008] Leslie G Ungerleider, Thelma W Galkin, Robert Desimone, and Ricardo Gattass. Cortical connections of area v4 in the macaque. Cerebral Cortex , 18(3):477–499, 2008.
  • Rockland and Van Hoesen [1994] Kathleen S Rockland and Gary W Van Hoesen. Direct temporal-occipital feedback connections to striate cortex (v1) in the macaque monkey. Cerebral cortex , 4(3):300–313, 1994.
  • Richards and Lillicrap [2019] Blake A Richards and Timothy P Lillicrap. Dendritic solutions to the credit assignment problem. Current opinion in neurobiology , 54:28–36, 2019.
  • Larkum [2013] Matthew Larkum. A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex. Trends in neurosciences , 36(3):141–151, 2013.
  • Spruston [2008] Nelson Spruston. Pyramidal neurons: dendritic structure and synaptic integration. Nature Reviews Neuroscience , 9(3):206–221, 2008.
  • Xiao et al. [2020] T Patrick Xiao, Christopher H Bennett, Ben Feinberg, Sapan Agarwal, and Matthew J Marinella. Analog architectures for neural network acceleration based on non-volatile memory. Applied Physics Reviews , 7(3):031301, 2020.
  • Misra and Saha [2010] Janardan Misra and Indranil Saha. Artificial neural networks in hardware: A survey of two decades of progress. Neurocomputing , 74(1-3):239–255, 2010.
  • Moore [1920] Eliakim H Moore. On the reciprocal of the general algebraic matrix. Bull. Am. Math. Soc. , 26:394–395, 1920.
  • Penrose [1955] Roger Penrose. A generalized inverse for matrices. In Mathematical proceedings of the Cambridge philosophical society , volume 51, pages 406–413. Cambridge University Press, 1955.
  • Levenberg [1944] Kenneth Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of applied mathematics , 2(2):164–168, 1944.
  • Campbell and Meyer [2009] Stephen L Campbell and Carl D Meyer. Generalized inverses of linear transformations . SIAM, 2009.
  • Schraudolph [2002] Nicol N Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural computation , 14(7):1723–1738, 2002.
  • Zhang et al. [2019] Guodong Zhang, James Martens, and Roger B Grosse. Fast convergence of natural gradient descent for over-parameterized neural networks. In Advances in Neural Information Processing Systems 32 , pages 8080–8091, 2019.
  • Seung [1996] H Sebastian Seung. How the brain keeps the eyes still. Proceedings of the National Academy of Sciences , 93(23):13339–13344, 1996.
  • Koulakov et al. [2002] Alexei A Koulakov, Sridhar Raghavachari, Adam Kepecs, and John E Lisman. Model for a robust neural integrator. Nature neuroscience , 5(8):775–782, 2002.
  • Goldman et al. [2003] Mark S Goldman, Joseph H Levine, Guy Major, David W Tank, and HS Seung. Robust persistent neural activity in a model integrator with multiple hysteretic dendrites per neuron. Cerebral cortex , 13(11):1185–1195, 2003.
  • Goldman et al. [2010] Mark S Goldman, A Compte, and Xiao-Jing Wang. Neural integrator models. Encyclopedia of neuroscience , pages 165–178, 2010.
  • Lim and Goldman [2013] Sukbin Lim and Mark S Goldman. Balanced cortical microcircuitry for maintaining information in working memory. Nature neuroscience , 16(9):1306–1314, 2013.
  • Bejarano et al. [2018] D Bejarano, Eduardo Ibargüen-Mondragón, and Enith Amanda Gómez-Hernández. A stability test for non linear systems of ordinary differential equations based on the gershgorin circles. Contemporary Engineering Sciences , 11(91):4541–4548, 2018.
  • Martens and Grosse [2015] James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning , pages 2408–2417, 2015.
  • Botev et al. [2017] Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical gauss-newton optimisation for deep learning. In Proceedings of the 34th International Conference on Machine Learning , pages 557–565. JMLR. org, 2017.
  • Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
  • Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • Bergstra et al. [2011] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems , pages 2546–2554, 2011.
  • Bergstra et al. [2013] James Bergstra, Dan Yamins, and David D Cox. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in science conference , pages 13–20. Citeseer, 2013.
  • Liaw et al. [2018] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E Gonzalez, and Ion Stoica. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118 , 2018.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 , pages 8024–8035. Curran Associates, Inc., 2019.
  • Silver [2010] R Angus Silver. Neuronal arithmetic. Nature Reviews Neuroscience , 11(7):474–489, 2010.
  • Ferguson and Cardin [2020] Katie A Ferguson and Jessica A Cardin. Mechanisms underlying gain modulation in the cortex. Nature Reviews Neuroscience , 21(2):80–92, 2020.
  • Larkum et al. [2004] Matthew E Larkum, Walter Senn, and Hans-R Lüscher. Top-down dendritic input increases the gain of layer 5 pyramidal neurons. Cerebral cortex , 14(10):1059–1070, 2004.
  • Naud and Sprekeler [2017] Richard Naud and Henning Sprekeler. Burst ensemble multiplexing: A neural code connecting dendritic spikes with microcircuits. bioRxiv , page 143636, 2017.
  • Bengio et al. [2015] Yoshua Bengio, Dong-Hyun Lee, Jorg Bornschein, Thomas Mesnard, and Zhouhan Lin. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156 , 2015.

Supplementary Material

Alexander Meulemans*, Matilde Tristany Farinha*, Javier García Ordóñez, Pau Vilimelis Aceituno, João Sacramento, Benjamin F. Grewe. Institute of Neuroinformatics, University of Zürich and ETH Zürich. [email protected]

Appendix A Proofs and extra information for Section 3: Learning theory

A.1 Linearized dynamics and fixed points

In this section, we linearize the network dynamics around the feedforward voltage levels $\mathbf{v}_i^-$ (i.e., the equilibrium of the network when no feedback is present) and study the equilibrium points resulting from the feedback input from the controller.

First, we introduce some shorthand notations:

To investigate the steady state of the network and controller dynamics, we start by proving Lemma 1 , which we restate here for convenience.

Assuming stable dynamics, a small target stepsize $\lambda$, and $W_i$ and $Q_i$ fixed, the steady-state solutions of the dynamical systems (1) and (4) can be approximated by $\Delta\mathbf{v}_{\mathrm{ss}} \approx Q\,(JQ + \tilde{\alpha} I)^{-1}\,\boldsymbol{\delta}_L$, with $\tilde{\alpha} = \alpha/(1 + \alpha k_p)$ and $\boldsymbol{\delta}_L \triangleq \mathbf{r}_L^* - \mathbf{r}_L^-$.

The proof is ordered as follows: first, we linearize the network dynamics around the feedforward equilibrium of ( 2 ). Then, we solve the algebraic set of linear equilibrium equations.

By linearizing the dynamics, we can derive the control error $\mathbf{e}(t) \triangleq \mathbf{r}_L^* - \mathbf{r}_L(t)$ as an affine transformation of $\Delta\mathbf{v}$. First, note that

By recursion, we have that

with $\Delta\mathbf{v}_1 = \Delta^-\mathbf{v}_1 = \mathbf{v}_1 - \mathbf{v}_1^-$, because the input to the network is not influenced by the controller, i.e., $\mathbf{v}_0 = \mathbf{v}_0^-$.

The control error is given by

The controller dynamics are given by

By differentiating (39) and using $\mathbf{u}^{\text{int}} = \mathbf{u} - k_p\mathbf{e}$, we get the following controller dynamics for $\mathbf{u}$:

The system of equations (29) and (41) can be solved at steady state as follows. From (29) at steady state, we have

Substituting $\Delta\mathbf{v}_{\mathrm{ss}}$ into the steady state of (41), while using the linearized control error (34), gives

Using $\mathbf{v} = \mathbf{v}^{\text{ff}} + \Delta\mathbf{v}$ then concludes the proof. ∎

In the next section, we will investigate how this steady-state solution can result in useful weight updates (plasticity) for the forward weights W i subscript 𝑊 𝑖 W_{i} .

A.2 DFC approximates Gauss-Newton optimization

Assuming $J$ has full rank,

$$\lim_{\alpha\rightarrow 0} Q\,(JQ + \alpha I)^{-1} = J^{\dagger} \qquad (45)$$

iff Condition 2 holds, i.e., $\text{Col}(Q) = \text{Row}(J)$.

We begin by stating the Moore-Penrose conditions [ 53 ] :

Condition S1.

$B = A^{\dagger}$ iff

1. $ABA = A$
2. $BAB = B$
3. $AB = (AB)^T$
4. $BA = (BA)^T$

In this proof, we need to consider two general cases: (i) $J$ has full rank and $Q$ does not, and (ii) $Q$ and $J$ both have full rank. As $J^T$ and $Q$ have many more rows than columns, they will almost always be of full rank; however, we consider both cases for completeness.

In case (i), $\lim_{\alpha\rightarrow 0} Q(JQ + \alpha I)^{-1}$ can never be the pseudoinverse of $J$, thereby proving that a necessary condition for (45) is that $\text{rank}(Q) \geq \text{rank}(J)$ (note that this condition is satisfied by Condition 2). Now that we have shown that it is necessary for $Q$ to be of full rank (as $J$ is of full rank by assumption of the lemma) for eq. (45) to hold, we proceed with the second case.

In case (ii), define $S \triangleq \lim_{\alpha\rightarrow 0} Q(JQ + \alpha I)^{-1}$; we show that $S$ is equal to $J^{\dagger}$. As $Q$ and $J^T$ both have full rank, $JQ$ is of full rank and we have $S = Q(JQ)^{-1}$.

Hence, conditions S1.1, S1.2 and S1.3 are trivially satisfied:

$JSJ = IJ = J$

$SJS = SI = S$

$JS = I = I^T = (JS)^T$

Condition S1.4 will only be satisfied under certain constraints on $Q$. We first assume Condition 2 holds to show its sufficiency, after which we continue to show its necessity.

Consider $U_J$ an orthogonal basis of the column space of $J^T$. Then, we can write

$$J^T = U_J M_J \qquad (47)$$

for some full-rank square matrix $M_J$. As we assume Condition 2 holds, we can similarly write $Q$ as

$$Q = U_J M_Q$$

for some full-rank square matrix $M_Q$. Condition S1.4 can now be written as

$$SJ = U_J M_Q\,(M_J^T U_J^T U_J M_Q)^{-1} M_J^T U_J^T = U_J U_J^T = (SJ)^T,$$

showing that $S$ is indeed the pseudoinverse of $J$ if Condition 2 holds, proving its sufficiency.

To show the necessity of Condition 2, we use a proof by contradiction. We now assume that Condition 2 does not hold, and hence the column space of $Q$ is not equal to the column space of $J^T$. As before, consider $U_Q$ an orthogonal basis of the column space of $Q$. Furthermore, consider the square orthogonal matrix $\bar{U}_J \triangleq [U_J\;\tilde{U}_J]$, with $U_J$ as defined in (47) and $\tilde{U}_J$ orthogonal to $U_J$. We can now decompose $Q$ into a part inside the column space of $J^T$ and a part outside of that column space:

$$Q = U_J P_Q + \tilde{U}_J \tilde{P}_Q.$$

We assume that $P_Q$ is of full rank, which is true in all but degenerate cases (if $P_Q$ were rank-deficient, $\lim_{\alpha\rightarrow 0} Q(JQ + \alpha I)^{-1}$ would project $Q$ onto something of lower rank, making it impossible for $S$ to approximate $J^{\dagger}$; hence full rank of $P_Q$ is necessary). Note that $\tilde{P}_Q$ is different from zero, as we assume Condition 2 does not hold in this proof by contradiction. Using this decomposition of $Q$, we can write $SJ$, used in Condition S1.4, as

The first part of the last equation is always symmetric; hence, Condition S1.4 boils down to the second part being symmetric:

As $U_J$ has a zero-dimensional null space and $P_Q$ is of full rank, S1.4 can only hold when $\tilde{P}_Q = 0$. This contradicts our initial assumption in this proof by contradiction, namely that Condition 2 does not hold and consequently $Q$ has components outside of the column space of $J^T$, thereby proving that Condition 2 is necessary.

Theorem 2 states that the updates for $W_i$ in DFC at steady state align with the updates for $W_i$ prescribed by the GN optimization method for a feedforward neural network. We first formalize a feedforward fully connected neural network.

Definition S1 .

A feedforward fully connected neural network with $L$ layers, input dimension $n_{0}$, output dimension $n_{L}$, and hidden layer dimensions $n_{i}$, $0<i<L$, is defined by the following sequence of mappings:

with $\phi$ and $\phi_{L}$ activation functions, $\mathbf{r}_{0}$ the input of the network, and $\mathbf{r}_{L}$ the output of the network.
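For concreteness, a minimal sketch of such a network in numpy is given below; the tanh hidden units, identity output nonlinearity, and layer sizes are illustrative choices, and biases are omitted, as assumed in the proofs that follow.

```python
import numpy as np

def feedforward(r0, weights, phi=np.tanh, phi_L=lambda v: v):
    """Feedforward fully connected network: r_i = phi(W_i r_{i-1}), r_L = phi_L(W_L r_{L-1})."""
    r = r0
    for W in weights[:-1]:
        r = phi(W @ r)                    # hidden layers
    return phi_L(weights[-1] @ r)         # output layer

rng = np.random.default_rng(1)
weights = [rng.standard_normal((6, 4)), rng.standard_normal((2, 6))]   # n_0=4, n_1=6, n_2=2
r_L = feedforward(rng.standard_normal(4), weights)
```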

The lemma below shows that, in the absence of feedback, the network dynamics (1) at steady state are equivalent to a feedforward neural network corresponding to Definition S1.

In the absence of feedback ($\mathbf{u}(t)=0$), the system dynamics (1) at steady state are equivalent to a feedforward neural network as defined by Definition S1.

The proof is trivial upon noting that $Q\mathbf{u}=0$ without feedback and computing the steady state of (1) using $\mathbf{r}_{i}\triangleq\phi(\mathbf{v}_{i})$. ∎

Following the notation of eq. (2), we denote by $\mathbf{r}_{i}^{-}$ the firing rates of the network at steady state when feedback is absent, hence corresponding to the activations of a conventional feedforward neural network. The following lemma investigates what the GN parameter updates are for a feedforward neural network. Later, we show that the updates of DFC at equilibrium approximate these GN updates. For clarity, we assume that the network has only weights and no biases in all the following theorems and proofs; however, all proofs can easily be extended to comprise both weights and biases. First, we need to introduce some new notation for vectorized matrices.

where $\text{vec}(W_{i})$ denotes the concatenation of the columns of $W_{i}$ into a column vector.

Assuming an $L^{2}$ task loss and that Condition 1 holds, the Gauss-Newton parameter update for the weights of a feedforward network defined by Definition S1 with a minibatch size of 1 is given by

with $R$ defined in eq. (72).

Consider the Jacobian of the output w.r.t. the network weights $W$ (in vectorized form, as defined above), evaluated at the feedforward activations:

For a minibatch size of 1, the GN update for the parameters $\bar{W}$, assuming an $L^{2}$ output loss, is given by [35, 54]

with $\mathbf{r}_{L}^{\text{true}}$ the true supervised output (e.g., the class label). The remainder of this proof manipulates expression (69) in order to reach (67). Using $J_{\vec{W}_{i}}\triangleq\frac{\partial\mathbf{r}_{L}}{\partial\vec{W}_{i}}\big\rvert_{\mathbf{r}_{L}=\mathbf{r}_{L}^{-}}$, $J_{\bar{W}}$ can be restructured as:

Moreover, $J_{\vec{W}_{i}}=J_{i}\frac{\partial\mathbf{v}_{i}}{\partial\vec{W}_{i}}\big\rvert_{\mathbf{v}_{i}=\mathbf{v}_{i}^{-}}$. Using Kronecker products (the Kronecker product gives the equality $\text{vec}(ABC)=(C^{T}\otimes A)\,\text{vec}(B)$; applied to our situation, this yields $\mathbf{v}_{i}=W_{i}\mathbf{r}_{i-1}=(\mathbf{r}_{i-1}^{T}\otimes I)\vec{W}_{i}$), this becomes

Using the structure of $J_{\bar{W}}$, this leads to

with the dimensions of $I$ such that the equality $J_{\bar{W}}=JR^{T}$ holds. What remains to be proven is that $J_{\bar{W}}^{\dagger}=\frac{1}{\|\mathbf{r}\|_{2}^{2}}RJ^{\dagger}$, assuming that Condition 1 holds and knowing that $J_{\bar{W}}=JR^{T}$. To prove this, we need to know under which conditions $(JR^{T})^{\dagger}=(R^{T})^{\dagger}J^{\dagger}$. The following condition specifies when the pseudoinverse of a matrix product can be factorized [55].

Condition S2 .

The Moore-Penrose pseudoinverse of a matrix product, $(AB)^{\dagger}$, can be factorized as $(AB)^{\dagger}=B^{\dagger}A^{\dagger}$ if one of the following conditions holds:

S2.1. $A$ has orthonormal columns.

S2.2. $B$ has orthonormal rows.

S2.3. $B=A^{T}$.

S2.4. $A$ has all columns linearly independent and $B$ has all rows linearly independent.

In our case, $J$ has more columns than rows; hence conditions S2.1 and S2.4 can never be satisfied. Furthermore, condition S2.3 does not hold, which leaves us with condition S2.2. To investigate whether $R^{T}$ has orthonormal rows, we compute $R^{T}R$:

If Condition 1 holds, we have $\|\mathbf{r}^{-}_{0}\|_{2}^{2}=\ldots=\|\mathbf{r}^{-}_{L-1}\|_{2}^{2}\triangleq\|\mathbf{r}\|_{2}^{2}$, such that:

Hence, $\frac{1}{\|\mathbf{r}\|_{2}}R^{T}$ has orthonormal rows iff Condition 1 holds. From now on, we assume that Condition 1 holds. Next, we compute $(R^{T})^{\dagger}$. Consider $R^{T}=U\Sigma V^{T}$, the singular value decomposition (SVD) of $R^{T}$. Its pseudoinverse is given by $(R^{T})^{\dagger}=V\Sigma^{\dagger}U^{T}$. As the SVD is unique and $\frac{1}{\|\mathbf{r}\|_{2}}R^{T}$ has orthonormal rows, we can construct the SVD manually:

with $\tilde{V}^{T}$ a basis orthonormal to $\frac{1}{\|\mathbf{r}\|_{2}}R^{T}$. Hence, we have that

Putting everything together and assuming that Condition 1 holds, we have that

thereby concluding the proof. ∎
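The key structural facts used in this proof can be checked numerically. The sketch below builds $R^{T}$ as a block-diagonal matrix of Kronecker blocks $\mathbf{r}_{i-1}^{-T}\otimes I$ (consistent with the factorization $J_{\bar{W}}=JR^{T}$ above) for a toy two-layer case with equal presynaptic norms (Condition 1), and verifies both the orthonormal-rows property and $J_{\bar{W}}^{\dagger}=\frac{1}{\|\mathbf{r}\|_{2}^{2}}RJ^{\dagger}$; the dimensions and the random stand-in for $J$ are illustrative.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
dims = [4, 5, 3]                  # n_0, n_1, n_2 for a two-layer toy network
norm = 2.0                        # Condition 1: all presynaptic activations share this norm

# presynaptic activations r_0^-, r_1^- rescaled to have identical L2 norms
r_pre = [norm * v / np.linalg.norm(v)
         for v in (rng.standard_normal(d) for d in dims[:-1])]

# R^T = blockdiag(r_0^T (x) I_{n_1}, r_1^T (x) I_{n_2})
RT = block_diag(*[np.kron(r[None, :], np.eye(n_post))
                  for r, n_post in zip(r_pre, dims[1:])])

# (1/||r||) R^T has orthonormal rows under Condition 1
print(np.allclose((RT / norm) @ (RT / norm).T, np.eye(RT.shape[0])))

# With a random stand-in for J = [J_1 J_2], check J_Wbar^+ = (1/||r||^2) R J^+
J = rng.standard_normal((dims[-1], sum(dims[1:])))
J_Wbar = J @ RT
print(np.allclose(np.linalg.pinv(J_Wbar), RT.T @ np.linalg.pinv(J) / norm**2))
```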

Now, we are ready to prove Theorem 2 .

Theorem S5 (Theorem 2 in main manuscript) .

iff Condition 2 holds. Taking $\eta=\frac{1}{2\lambda\|\mathbf{r}\|_{2}^{2}}$ and assuming an $L^{2}$ task loss, we have (using Lemma S4):

This theorem shows that, for tasks with an $L^{2}$ loss and when Conditions 1 and 2 hold, DFC approximates Gauss-Newton updates with a minibatch size of 1, which becomes an exact equivalence in the limit of $\alpha$ and $\lambda$ to zero.

A.3 DFC uses minimum norm updates

To remove the need for Condition 1 and an $L^{2}$ task loss (the Gauss-Newton method can be generalized to other loss functions by using the Generalized Gauss-Newton method [56]), we show that the learning behavior of our network is mathematically sound under more relaxed conditions. Theorem 3 (restated below for convenience) shows that, for arbitrary loss functions and without the need for Condition 1, our synaptic plasticity rule can be interpreted as a weighted minimum norm (MN) parameter update for reaching the output target, assuming linearized dynamics (which becomes exact in the limit of $\lambda\rightarrow 0$).

Theorem S6 .

Rewriting the optimization problem using

and the concatenated vectorized weights $\bar{W}$, we get:

Linearizing the feedforward dynamics around the current parameter values $\bar{W}^{(m)}$ and using Lemma S3, we get:

We now assume that $\mathcal{O}(\|\Delta\bar{W}\|_{2}^{2})$ vanishes in the limit of $\lambda\rightarrow 0$, relative to the other terms in this Taylor expansion, and check this assumption at the end of the proof. Using (90) to rewrite the constraints (89), we get:

To solve the optimization problem, we construct its Lagrangian:

with $\boldsymbol{\mu}$ the Lagrange multipliers. As this is a convex optimization problem, the optimal solution can be found by solving the following set of equations:

assuming $J_{\bar{W}}M^{-2}J_{\bar{W}}^{T}$ is invertible, which is highly likely, as $J_{\bar{W}}$ is a wide matrix (many more columns than rows) and $M$ is full rank. As $\mathcal{O}(\|\Delta\bar{W}\|_{2})=\mathcal{O}(\lambda)$ and $\mathcal{O}(\|\Delta\bar{W}\|^{2}_{2})=\mathcal{O}(\lambda^{2})$, the Taylor expansion error $\mathcal{O}(\|\Delta\bar{W}\|^{2}_{2})$ vanishes in the limit of $\lambda\rightarrow 0$, relative to the zeroth- and first-order terms, thereby confirming our assumption.

Now, we proceed by factorizing $\big(J_{\bar{W}}M^{-1}\big)^{\dagger}$ into $J^{\dagger}$ and some other term, similarly to Lemma S4. First, we note that $J_{\bar{W}}M^{-1}=JR^{T}M^{-1}$, with $R^{T}$ defined in eq. (72). Furthermore, we have that $\big(R^{T}M^{-1}\big)\big(R^{T}M^{-1}\big)^{T}=I$; hence $R^{T}M^{-1}$ has orthonormal rows. Following Condition S2, we can factorize $\big(J_{\bar{W}}M^{-1}\big)^{\dagger}$ as follows:

with $\big[J^{\dagger}\boldsymbol{\delta}_{L}\big]_{i}$ the entries of the vector $J^{\dagger}\boldsymbol{\delta}_{L}$ corresponding to $\mathbf{v}_{i}$. We used $\big(R^{T}M^{-1}\big)^{\dagger}=M^{-1}R$, which has a derivation similar to the one used for $\big(R^{T}\big)^{\dagger}$ in Lemma S4.

We continue by showing that the weight update of DFC at equilibrium aligns with the MN solutions $\Delta W^{*}_{i}$. Adapting (85) from Theorem 2 to arbitrary loss functions, assuming Condition 2 holds, and taking a layer-specific learning rate $\eta_{i}=\frac{1}{\|\mathbf{r}_{i-1}\|_{2}^{2}}$, we get that

for which we used the same notation as in eq. (98) to divide the vector $J^{\dagger}\boldsymbol{\delta}_{L}$ into layerwise components. As the DFC update (101) is equal to the MN solution (98), we can conclude the proof. Note that, because we used layer-specific learning rates $\eta_{i}=\frac{1}{\|\mathbf{r}_{i-1}\|_{2}^{2}}$, only the layerwise updates $\Delta W_{i}$ and $\Delta W_{i}^{*}$ align, not their concatenated versions $\Delta\bar{W}$ and $\Delta\bar{W}^{*}$. ∎

Finally, we remove Condition 2 and show in Proposition 4 (repeated here as Proposition S8 for convenience) that the weight updates still follow a descent direction for arbitrary feedback weights. Before proving Proposition 4, we need to introduce and prove the following lemma.

Assuming $\tilde{J}_{1}$ is full rank,

with $U_{Q}$, $V_{Q}$ the left and right singular vectors of $Q$, and $\tilde{J}_{1}$ defined as follows: consider $\tilde{J}=V_{Q}^{T}JU_{Q}$, the linear transformation of $J$ by the singular vectors of $Q$, which can be written in block-matrix form $\tilde{J}=[\tilde{J}_{1}\ \tilde{J}_{2}]$ with $\tilde{J}_{1}$ a square matrix.

We start from the singular value decomposition of $Q$ to analyze $Q(JQ+\tilde{\alpha}I)^{-1}$. The SVD is given by $Q=U_{Q}\Sigma_{Q}V_{Q}^{T}$, with $V_{Q}$ and $U_{Q}$ square orthogonal matrices and $\Sigma_{Q}$ a rectangular diagonal matrix:

with $\Sigma_{Q}^{D}$ a square diagonal matrix containing the singular values of $Q$. Now, let us define $\tilde{J}$ as

which allows us to rewrite $Q(JQ+\alpha I)^{-1}$ as

Assuming $\tilde{J}_{1}$ and $\Sigma_{Q}^{D}$ to be invertible (i.e., no zero singular values), this leads to:

Hence, $\lim_{\alpha\rightarrow 0}Q(JQ+\alpha I)^{-1}$ is a generalized inverse of the forward Jacobian $J$, constrained by the column space of $Q$, which is represented by $U_{Q}$.

Proposition S8 .

First, we show that the steady-state weight update lies within 90 degrees of the loss gradient, after which we continue to prove convergence for linear networks. We define $\Delta\mathbf{v}_{\mathrm{ss}}\triangleq\mathbf{v}_{\mathrm{ss}}-\mathbf{v}^{\text{ff}}_{\mathrm{ss}}$, which allows us to rewrite the steady-state update (9) as

where we use the vectorized notation, $R_{\mathrm{ss}}$ defined in eq. (72) with steady-state activations, and $M$ defined in eq. (87) to represent the layer-specific learning rate $\eta_{i}=\eta/\|\mathbf{r}_{i-1}\|_{2}^{2}$. Using Lemmas 1 and S7, we have that

Using the same vectorized notation, the negative gradient of the loss with respect to the network weights (i.e., the BP updates) can be written as:

A.4 An intuitive interpretation of Condition 2

In the previous sections, we showed that Condition 2 is needed to enable precise CA through GN or MN optimization. Here, we discuss a more intuitive interpretation of why Condition 2 is needed.

DFC has three main components that influence the feedback signals given to each neuron. First, we have the network dynamics ( 1 ) (here repeated for convenience).

The first two terms, $-\mathbf{v}_{i}(t)+W_{i}\phi\big(\mathbf{v}_{i-1}(t)\big)$, pull the neural activation $\mathbf{v}_{i}$ close to its feedforward compartment $\mathbf{v}^{\mathrm{ff}}_{i}$, while the third term, $Q_{i}\mathbf{u}(t)$, provides an extra push such that the network output is driven to its target. This interplay between pulling and pushing is important, as it ensures that $\mathbf{v}_{i}$ and $\mathbf{v}^{\mathrm{ff}}_{i}$ remain as close together as possible, while driving the output towards its target.

Second, we have the feedback weights $Q$. As $Q$ is of dimension $\sum_{i=1}^{L}n_{i}\times n_{L}$, with $n_{i}$ the layer size, it always has many more rows than columns. Hence, the few but long columns of $Q$ can be seen as the 'modes' that the controller $\mathbf{u}$ can use to change the network activations $\mathbf{v}$. Due to the low dimensionality of $\mathbf{u}$ compared to $\mathbf{v}$, $Q\mathbf{u}$ cannot change the activations $\mathbf{v}$ in arbitrary directions, but is constrained to the column space of $Q$, i.e., the 'modes' of $Q$.

Third, we have the feedback controller that, through its own dynamics combined with the network dynamics (1) and $Q$, selects an 'optimal' configuration for $\mathbf{u}$, i.e., $\mathbf{u}_{\mathrm{ss}}=(JQ)^{-1}\boldsymbol{\delta}_{L}$, which selects and weights the different modes (columns) of $Q$ to push the output to its target in the 'most efficient manner'.

To make 'most efficient manner' more concrete, we need to define the nullspace of the network. As the dimension of $\mathbf{v}$ is much bigger than the output dimension, there exist changes in activation $\Delta\mathbf{v}$ that do not result in a change of output $\Delta\mathbf{r}_{L}$, because they lie in the nullspace of the network. In a linearized network, this is reflected by the network Jacobian $J$, as we have that $\Delta\mathbf{r}_{L}=J\Delta\mathbf{v}$. As $J$ is of dimension $n_{L}\times\sum_{i=1}^{L}n_{i}$, it has many more columns than rows and thus a non-trivial nullspace. When $\Delta\mathbf{v}$ lies inside the nullspace of $J$, it results in $\Delta\mathbf{r}_{L}=0$. Now, if the column space of $Q$ overlaps partially with the nullspace of $J$, one could make $\mathbf{u}$, and hence $\Delta\mathbf{v}=Q\mathbf{u}$, arbitrarily big while still pushing the output exactly to its target, as long as the 'arbitrarily big' parts of $\Delta\mathbf{v}$ lie inside the nullspace of $J$ and hence do not influence $\mathbf{r}_{L}$. Importantly, the feedback controller combined with the network dynamics ensures that this does not happen, as $\mathbf{u}_{\mathrm{ss}}=(JQ)^{-1}\boldsymbol{\delta}_{L}$ selects the smallest possible $\mathbf{u}_{\mathrm{ss}}$ that pushes the output to its target.

However, when the column space of $Q$ partially overlaps with the nullspace of $J$, there will inevitably be parts of $\Delta\mathbf{v}$ that lie inside the nullspace of $J$, even though the controller selects the smallest possible $\mathbf{u}_{\mathrm{ss}}$. This can easily be seen as follows: in general, each column of $Q$ overlaps partially with the nullspace of $J$, so $\Delta\mathbf{v}=Q\mathbf{u}$, which is a linear combination of the columns of $Q$, will also overlap partially with the nullspace of $J$. This is where Condition 2 comes into play.

Condition 2 states that the column space of $Q$ is equal to the row space of $J$. When this condition is fulfilled, the column space of $Q$ does not overlap with the nullspace of $J$. Hence, all of the feedback $Q\mathbf{u}$ produces a change in the network output, and no unnecessary changes in activations $\Delta\mathbf{v}$ take place. With Condition 2 satisfied, the resulting changes in activations $\Delta\mathbf{v}$ are MN, as they lie fully in the row space of $J$ and push the output exactly to its target. This interpretation lies at the basis of Theorem 3 and is also an important part of Theorem 2.
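The nullspace picture above can be made concrete with a few lines of numpy. In the sketch below (dimensions and the random linearized Jacobian are illustrative), a feedback matrix satisfying Condition 2 ($Q=J^{T}$) produces a $\Delta\mathbf{v}=Q\mathbf{u}_{\mathrm{ss}}$ with no component in the nullspace of $J$, whereas a random $Q$ still reaches the output target but wastes part of $\Delta\mathbf{v}$ in the nullspace:

```python
import numpy as np

rng = np.random.default_rng(3)
n_out, n_v = 2, 12
J = rng.standard_normal((n_out, n_v))         # linearized network: delta_r_L = J delta_v
delta_L = rng.standard_normal(n_out)          # output error to be corrected

def nullspace_part(dv, J):
    """Norm of the component of dv lying in the nullspace of J (wasted activity)."""
    row_space_proj = J.T @ np.linalg.pinv(J.T) @ dv
    return np.linalg.norm(dv - row_space_proj)

for name, Q in [("Q = J^T (Condition 2)", J.T),
                ("random Q", rng.standard_normal((n_v, n_out)))]:
    u_ss = np.linalg.solve(J @ Q, delta_L)    # u_ss = (JQ)^{-1} delta_L
    dv = Q @ u_ss
    print(name,
          "| target reached:", np.allclose(J @ dv, delta_L),
          "| nullspace norm:", round(nullspace_part(dv, J), 4))
```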

A.5 Gauss-Newton optimization with a mini-batch size of 1

In this section, we review the GN optimization method and discuss the unique properties that arise when a mini-batch size of 1 is taken.

Review of GN optimization.

Gauss-Newton (GN) optimization is an iterative optimization method used for non-linear regression problems with an $L^{2}$ output loss, defined as follows:

with $B$ the minibatch size, $\boldsymbol{\delta}$ the regression error, $\mathbf{r}$ the model output, and $\mathbf{y}$ the corresponding regression target. There exist two main derivations of the GN optimization method: (i) through an approximation of the Newton-Raphson method, and (ii) through linearizing the parametric model that is being optimized. We focus on the latter, as this derivation is closely connected to DFC.

GN is an iterative optimization method and hence aims to find a parameter update $\Delta\boldsymbol{\theta}$ that leads to a lower regression loss:

with $m$ indicating the iteration number. The end goal of the optimization scheme is to find a local minimum of $\mathcal{L}$, hence finding $\boldsymbol{\theta}^{*}$ for which it holds that

with $\boldsymbol{\delta}$ and $\mathbf{r}$ the concatenation of all $\boldsymbol{\delta}^{(b)}$ and $\mathbf{r}^{(b)}$, respectively. To obtain a closed-form expression for $\boldsymbol{\theta}^{*}$ that fulfills eq. (119) approximately, one can make a first-order Taylor approximation of the parameterized model around the current parameter setting $\boldsymbol{\theta}^{(m)}$:

Substituting this approximation into eq. (119), we get:

In an under-parameterized setting, i.e., when the dimension of $\boldsymbol{\delta}$ is bigger than the dimension of $\boldsymbol{\theta}$, $J_{\boldsymbol{\theta}}^{T}J_{\boldsymbol{\theta}}$ can be interpreted as an approximation of the loss Hessian matrix used in the Newton-Raphson method and is known as the Gauss-Newton curvature matrix. In this setting, $J_{\boldsymbol{\theta}}^{T}J_{\boldsymbol{\theta}}$ is invertible, leading to the update

with $J_{\boldsymbol{\theta}}^{\dagger}$ the Moore-Penrose pseudoinverse of $J_{\boldsymbol{\theta}}$. In the under-parameterized setting, eq. (124) can be interpreted as a linear least-squares regression for finding a parameter update $\Delta\boldsymbol{\theta}$ that results in a least-squares solution of the linearized parametric model (121). Until now, we have considered the under-parameterized case. However, DFC is related to GN optimization with a mini-batch size of 1, which concerns the over-parameterized case.

GN optimization with a mini-batch size of 1.

When the minibatch size $B=1$, the dimension of $\boldsymbol{\delta}$ is smaller than the dimension of $\boldsymbol{\theta}$ in neural networks; hence, we need to consider the over-parameterized case of GN [36, 57]. Now, the matrix $J_{\boldsymbol{\theta}}^{T}J_{\boldsymbol{\theta}}$ is not of full rank, and hence an infinite number of solutions exist for eq. (124). To enforce a unique solution for the parameter update $\Delta\boldsymbol{\theta}$, a common approach is to take the MN solution, i.e., the smallest possible solution $\Delta\boldsymbol{\theta}$ that satisfies (124). Using the MN properties of the Moore-Penrose pseudoinverse, this results in:

In this over-parameterized case, the linearized model can drive the next regression error $\boldsymbol{\delta}^{(m+1)}$ exactly to zero, and GN picks the MN solution (127).
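A minimal numpy sketch of such an over-parameterized GN step (with an illustrative random Jacobian and the sign convention $\boldsymbol{\delta}=\mathbf{r}-\mathbf{y}$, so the update reads $\Delta\boldsymbol{\theta}=-J_{\boldsymbol{\theta}}^{\dagger}\boldsymbol{\delta}$) shows both properties: the linearized error is driven exactly to zero, and the update has no nullspace component, i.e., it is minimum norm:

```python
import numpy as np

rng = np.random.default_rng(4)
n_theta, n_out = 20, 3                             # over-parameterized: dim(theta) > dim(delta)
J_theta = rng.standard_normal((n_out, n_theta))    # Jacobian of the model output w.r.t. theta
delta = rng.standard_normal(n_out)                 # current regression error r - y

dtheta = -np.linalg.pinv(J_theta) @ delta          # minimum-norm GN step for minibatch size 1

# The linearized model drives the error exactly to zero ...
print(np.allclose(J_theta @ dtheta, -delta))
# ... and dtheta lies fully in the row space of J_theta (no wasted nullspace component).
print(np.allclose(dtheta, J_theta.T @ np.linalg.pinv(J_theta.T) @ dtheta))
```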

DFC updates with larger batch sizes.

For computational efficiency, we average the DFC updates over a minibatch of size bigger than 1. However, this averaging over a minibatch is distinct from doing Gauss-Newton optimization on a minibatch. The GN iteration with minibatch size $B$ is given by

with $J_{\bar{W}}^{(b)}$ the Jacobian of the output w.r.t. the concatenated weights $\bar{W}$ for batch sample $b$, and $\gamma$ a damping parameter. Note that we accumulate the GN curvature $J_{\bar{W}}^{(b)T}J_{\bar{W}}^{(b)}$ over all minibatch samples before taking the inverse.

When the assumptions of Theorem 2 hold, the DFC updates with minibatch size $B$ can be written as

For $B=1$, the DFC update (129) coincides with the GN update (128). However, for $B>1$ the two are no longer equal, because the order of summation and inversion is reversed.
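The sketch below illustrates this difference with random per-sample Jacobians and errors standing in for $J_{\bar{W}}^{(b)}$ and the per-sample output errors: averaging per-sample damped pseudoinverse updates (a DFC-style update) generally differs from the minibatch GN update that accumulates the curvature before inverting. Dimensions, damping, and the damped per-sample pseudoinverse are illustrative choices, not the paper's exact equations:

```python
import numpy as np

rng = np.random.default_rng(5)
n_w, n_out, B, gamma = 15, 2, 4, 1e-3

Js = [rng.standard_normal((n_out, n_w)) for _ in range(B)]    # per-sample Jacobians
deltas = [rng.standard_normal(n_out) for _ in range(B)]       # per-sample output errors

# DFC-style update: average of per-sample (damped) pseudoinverse updates
dfc = -np.mean([J.T @ np.linalg.solve(J @ J.T + gamma * np.eye(n_out), d)
                for J, d in zip(Js, deltas)], axis=0)

# Minibatch GN update: accumulate curvature and gradient first, then invert once
curvature = sum(J.T @ J for J in Js) + gamma * np.eye(n_w)
gradient = sum(J.T @ d for J, d in zip(Js, deltas))
gn = -np.linalg.solve(curvature, gradient)

print("relative difference for B=%d: %.3f"
      % (B, np.linalg.norm(dfc - gn) / np.linalg.norm(gn)))
```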

A.6 Effects of the nonlinearity $\phi$ in the weight update

In this section, we study in detail the experimental consequences of using the nonlinear learning rule ( 2.3 ) instead of the linear learning rule ( 9 ). First, we investigate the case where the assumptions in Theorem 3 are perfectly satisfied and then we investigate the more realistic case where the assumptions are not perfectly satisfied.

Considering the ideal case where Condition 2 is perfectly satisfied and in the limit of $\lambda$ and $\alpha$ to zero, the MN updates (216) are obtained if the linear learning rule is used, and the following updates are obtained when the nonlinear learning rule is used:

with a diagonal matrix containing the derivatives $\partial\phi(v_{j})/\partial v_{j}$ of each neuron in the network on its diagonal, and $R$ as defined in eq. (216). For this ideal case, we performed experiments on MNIST comparing the linear to the nonlinear learning rule, and obtained test errors of $2.18^{\pm 0.14}\%$ and $2.11^{\pm 0.10}\%$, respectively. These experiments demonstrate that, for this ideal case, the nonlinear learning rule (2.3) has no significant benefit over the linear learning rule (9).

On the other hand, to investigate the influence of the nonlinear learning rule in the practical case where Condition 2 is not perfectly satisfied, we performed a new hyperparameter search on MNIST for DFC-SSA with the linear learning rule (9). This resulted in a test error of $5.28^{\pm 0.14}\%$. Comparing this result with the corresponding test performance in Table 1 ($2.29^{\pm 0.097}\%$ test error), we conclude that DFC benefits from the introduction of the chosen nonlinearities in the learning rule (2.3), as the results improve significantly. Hence, we can infer that this increase in performance is due to the nonlinearity in the learning rule compensating for feedback weights that do not perfectly satisfy Condition 2.

Lastly, to investigate where this performance gap originates, we performed another toy experiment, similar to Fig. 3 (see Fig. S1), for the linear versus the nonlinear learning rule in DFC. The new results show that the updates resulting from the nonlinear learning rule are much better aligned with the MN and GN updates than those of the linear learning rule, explaining its better performance. Overall, we conclude that introducing the nonlinearity in the learning rule, which prevents saturated neurons from updating their weights, is a useful heuristic to improve the alignment of DFC with the MN and GN updates, and consequently its performance, when Condition 2 is not perfectly satisfied.


A.7 Relation between continuous DFC weight updates and steady-state DFC weight updates

All learning theory developed in Section 3 considers an update $\Delta W_{i}$ at the steady state of the network (1) and controller (4) dynamics, instead of a continuous update as defined in (5). Fig. 3F shows that the accumulated continuous updates (5) of DFC align well with the analytical steady-state updates. Here, we indicate why this steady-state update is a good approximation of the accumulated continuous updates (5). We consider two main reasons: (i) the network and controller dynamics settle quickly to their steady state, and (ii) when the dynamics have not yet settled, they oscillate around the steady state, causing the oscillations to approximately cancel each other out.

Addressing the first reason, consider an input that is presented to the network from time $T_{1}$ until $T_{2}$, and assume that the network and controller dynamics converge at $T_{ss}<T_{2}$. The change in weights prescribed by (5) is then equal to

A.8 DFC is compatible with various controller types

Throughout the main manuscript, we focused on a proportional-integral (PI) controller. However, the DFC framework is compatible with various other controller types. In the following, we show that the learning theory results (Section 3) can be generalized to pure integral control, pure proportional control, or any combination thereof with derivative control added. Note that for each new controller type a new stability analysis is needed, and it also needs to be checked whether the feedback learning rule remains compatible with the controller; we leave both to future work.

A.8.1 Pure integral control

For pure integral control, the steady-state solutions of Lemma 1 still apply, with $\tilde{\alpha}=\alpha$. Hence, all learning theory results of Section 3 directly apply to this case. Furthermore, Proposition 5 and Theorem 6 are already designed for pure integral control.

A.8.2 Pure proportional control

By making a first-order Taylor approximation of the network dynamics with only proportional control (setting $K_{I}=0$ in eq. (4)), we obtain the following steady-state solution:

One can show that $\lim_{k_{p}\rightarrow\infty}(QJ+\frac{1}{k_{p}}I)^{-1}Q=J^{\dagger}$ iff Condition 2 holds (we leave the proof as an exercise for the interested reader; it follows the same approach as Lemma S2 and uses l'Hôpital's rule for taking the correct limit of $k_{p}\rightarrow\infty$). Consequently, Theorems 2 and 3 and Proposition 4 also hold for proportional control, if the limit of $\alpha$ to zero is replaced by the limit of $k_{p}$ to infinity. Furthermore, the main intuitions of Theorem 6 for training the feedback weights can be applied to proportional control, provided that one finds a way to keep the network stable during the initial feedback weight training phase.

Despite these theoretical similarities between proportional and PI control in DFC, there are some significant practical differences. First, for finite $k_{p}$ in proportional control, a residual error always remains, and hence the output target will never be reached exactly. Second, if noise is present in the network, it gets amplified by the same factor $k_{p}$. Hence, using a high $k_{p}$ in proportional control makes the controlled network sensitive to noise. Adding an integral control component can alleviate these issues by replacing the need for a large gain $k_{p}$ with the need for a good integrator circuit (i.e., low $\alpha$) [34], for which a rich neuroscience literature exists [58, 59, 60, 61, 62]. This way, we can use a smaller gain $k_{p}$ without increasing the residual error, and consequently make the network less sensitive to noise. This is also interesting from a biological point of view, since biological networks are considered to be substantially noisy.
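The residual-error difference between P and PI control can be illustrated with a toy simulation. Below, a scalar first-order plant stands in for the linearized network, and the gains, time constant, and Euler step are illustrative values chosen only to make the effect visible, not parameters from the paper:

```python
# Toy linearized 'plant': tau * dy/dt = -y + J*u, with target value delta for the output y.
J, delta = 1.5, 1.0
dt, steps, tau = 0.01, 2000, 1.0

def residual(k_p, k_i):
    y, e_int = 0.0, 0.0
    for _ in range(steps):
        e = delta - y                   # control error
        e_int += dt * e                 # integrated error
        u = k_p * e + k_i * e_int       # P (k_i = 0) or PI control law
        y += dt / tau * (-y + J * u)    # Euler step of the plant dynamics
    return delta - y                    # residual error after settling

print("P  control (k_p=10):         residual ~ %.4f" % residual(k_p=10.0, k_i=0.0))
print("PI control (k_p=10, k_i=5):  residual ~ %.4f" % residual(k_p=10.0, k_i=5.0))
```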

A.8.3 Adding derivative control

Proportional, integral or proportional-integral control can be combined with derivative control. As the derivative term disappears at the steady state, the steady-state solutions of Lemma 1 remain unaltered and the learning theory results can be directly applied. However, note that the derivative control term can significantly impact the stability and feedback learning of the network.

Appendix B Proofs and extra information for Section 4: Stability of DFC

B.1 Stability analysis with instantaneous system dynamics

In this section, we first derive eq. (11), which corresponds to the controller dynamics obtained when assuming a separation of timescales between the controller and the network ($\tau_{u}\gg\tau_{v}$) and only integral control ($k_{p}=0$).

Let us recall that $\mathbf{v}_{\mathrm{ss}}$ and $\mathbf{v}^{-}$ are the steady-state solutions of the dynamical system (1) with and without control, respectively. Now, by linearizing the network dynamics (1) around the feedforward steady state $\mathbf{v}^{-}$, we can write

with $J\triangleq\left.\left[\frac{\partial\mathbf{r}^{-}_{L}}{\partial\mathbf{v}_{1}},\ldots,\frac{\partial\mathbf{r}^{-}_{L}}{\partial\mathbf{v}_{L}}\right]\right\rvert_{\mathbf{v}=\mathbf{v}^{-}}$ the network Jacobian evaluated at the steady state, and where we dropped the time dependence $(t)$ for conciseness.

Taking into account the results of equations ( 3 ) and ( 134 ), the control error can then be rewritten as

Consequently, eq. ( 11 ) follows:

where we changed the notation $\frac{\text{d}}{\text{d}t}\mathbf{u}$ to $\dot{\mathbf{u}}$ for conciseness. Now, we continue by proving Proposition 5, restated below for convenience.

Proposition S9 (Proposition 5 in main manuscript) .

Assuming instantaneous system dynamics ($\tau_{u}\gg\tau_{v}$), the stability of the system is determined entirely by the controller dynamics. To prove that the system's equilibrium is locally asymptotically stable, we need to guarantee that the Jacobian associated with the controller dynamics, evaluated at its steady-state solution $\mathbf{v}_{\mathrm{ss}}$, has only eigenvalues with strictly negative real parts [38]. This Jacobian can be obtained in a similar fashion to that of eq. (11), and is given by

Hence, for local asymptotic stability, $J_{\mathrm{ss}}Q+\alpha I$ can only have eigenvalues with strictly positive real parts. As adding $\alpha I$ to $J_{\mathrm{ss}}Q$ results in adding $\alpha$ to the eigenvalues of $J_{\mathrm{ss}}Q$, the local asymptotic stability condition requires that the real parts of the eigenvalues of $J_{\mathrm{ss}}Q$ are all greater than $-\alpha$, corresponding to Condition 3. ∎
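Checking this condition numerically is straightforward. The sketch below (with an illustrative random steady-state Jacobian) tests whether all eigenvalues of $J_{\mathrm{ss}}Q$ have real parts greater than $-\alpha$; note that a feedback configuration aligned with $J_{\mathrm{ss}}^{T}$ satisfies the condition trivially, since $J_{\mathrm{ss}}J_{\mathrm{ss}}^{T}$ is positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(6)
n_out, n_v, alpha = 3, 12, 0.1
J_ss = rng.standard_normal((n_out, n_v))

def satisfies_condition_3(Q, J_ss, alpha):
    """All eigenvalues of J_ss Q must have real part greater than -alpha."""
    return bool(np.all(np.linalg.eigvals(J_ss @ Q).real > -alpha))

print("Q = J_ss^T :", satisfies_condition_3(J_ss.T, J_ss, alpha))
print("random Q   :", satisfies_condition_3(rng.standard_normal((n_v, n_out)), J_ss, alpha))
```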

B.2 Stability of the full system

In this section, we derive a concise representation of the full dynamics of the network (1) and controller (4) in the general case where the timescale of the neuronal dynamics, $\tau_{v}$, is not negligible and we have proportional control ($k_{p}>0$). Proposition S10 provides the abstract conditions that guarantee local asymptotic stability of the steady states of the full dynamical system.

Proposition S10 .

The network and controller dynamics are locally asymptotically stable around their equilibrium iff the following matrix has only eigenvalues with strictly negative real parts:

with $\tilde{\tau}_{u}=\frac{\alpha}{1+k_{p}\alpha}$, $J_{\mathrm{ss}}=\frac{\partial\mathbf{r}_{L}}{\partial\mathbf{v}}\big\rvert_{\mathbf{v}=\mathbf{v}_{\mathrm{ss}}}$, and $\hat{J}_{\mathrm{ss}}$ defined in equations (145) and (150).

Recall that the controller is given by ( 4 )

where $\tau_{u}\dot{\mathbf{u}}^{\text{int}}=\mathbf{e}-\alpha\mathbf{u}^{\text{int}}$. Then, the controller dynamics can be written as

Recall that the network dynamics are given by ( 1 )

with $\Delta\mathbf{v}_{i}=\mathbf{v}_{i}-W_{i}\phi(\mathbf{v}_{i-1})$, which allows us to write

We can now obtain the network dynamics in terms of $\Delta\dot{\mathbf{v}}$ as

which for the entire system is

Let us now proceed to linearize the network and controller dynamical systems by defining

The controller dynamics ( 140 ) can now be rewritten as

When the network and the controller are at equilibrium, eq. ( 140 ) yields

and we can rewrite eq. ( 147 ) as

Once again, when the network and the controller are at equilibrium, incorporating the definitions in ( 146 ) into eq. ( 144 ), it follows that

At steady-state, eq. ( 144 ) yields

which allows us to rewrite eq. ( 150 ) as

Using the results from eq. ( 152 ), we can write eq. ( 149 ) as

Finally, as $\tilde{\Delta}\dot{\mathbf{v}}=\Delta\dot{\mathbf{v}}=\dot{\mathbf{v}}$ and $\tilde{\Delta}\dot{\mathbf{u}}=\dot{\mathbf{u}}$ (146), we can infer local stability results for the full system dynamics by looking into the dynamics of $\tilde{\Delta}\dot{\mathbf{v}}$ and $\tilde{\Delta}\dot{\mathbf{u}}$ around the steady state:

Now, to guarantee local asymptotic stability of the system's equilibrium, the eigenvalues of $A_{PI}$ must have strictly negative real parts [38]. ∎

The current form of the system matrix $A_{PI}$ provides no straightforward intuition for finding interpretable conditions on the feedback weights $Q$ such that local stability is reached. One can apply Gershgorin's circle theorem to infer sufficient restrictions on $J$ and $Q$ that ensure local asymptotic stability [63]. However, the resulting conditions are too conservative and do not provide intuition into which types of feedback learning rules are needed to ensure stability.

B.3 Toy experiments on the relation between Condition 3 and the full system dynamics

In Fig. S2, we track the eigenvalues of $JQ+\alpha I$ (Condition 3, see Fig. S2.a) and of $A_{PI}$ (the actual dynamics, see eq. (138) and Fig. S2.b) during training. We used the same student-teacher regression setting and configuration as in the toy experiments of Fig. 3.

The eigenvalue trajectories of $A_{PI}$ follow a similar trend to those corresponding to $JQ+\alpha I$. Although they differ in exact value, both eigenvalue trajectories are slowly decreasing during training and are strictly negative, thereby indicating that Condition 3 is a good proxy for the local stability of the actual dynamics.

When we only consider leaky integral control ($k_{p}=0$, see Fig. S2.c), the dynamics become unstable late in training, highlighting that adding proportional control is crucial for the stability of the dynamics. Interestingly, training the feedback weights (blue curve) does not help to stabilize the system in this case; on the contrary, it pushes the network to become unstable more quickly. These leaky integral control dynamics are equal to the simplified dynamics used in Condition 3 in the limit of $\tau_{v}/\tau_{u}\rightarrow 0$, which are stable (see Fig. S2.a). Hence, slower network dynamics (finite time constant $\tau_{v}$) cause the leaky integral control to become unstable, due to a communication delay between controller and network that causes unstable oscillations. For this toy experiment, we used $\tau_{v}/\tau_{u}=0.2$.


Appendix C Proofs and extra information for Section 5: Learning the feedback weights

C.1 Learning the feedback weights in a sleep phase

In this section, we show that the plasticity rule for the apical synapses (13) drives the feedback weights to fulfill Conditions 2 and 3. We first sketch an intuitive argument for why the feedback learning rule works. Next, we state the full theorem and give its proof.

C.1.1 Intuition behind the feedback learning rule

Inspired by the Weight Mirroring method [14], we use white noise in the network to carry information about the network Jacobian $J$ into the output $\mathbf{r}_{L}$. To gain intuition, we first consider a standard feedforward neural network

Now, we perturb each layer's pre-nonlinearity activation with white noise $\boldsymbol{\xi}_{i}$ and propagate the perturbations forward:

with $\tilde{\mathbf{r}}_{0}^{-}=\mathbf{r}_{0}^{-}$. For small $\sigma$, a first-order Taylor approximation of the perturbed output gives

with $\boldsymbol{\xi}$ the concatenated vector of all $\boldsymbol{\xi}_{i}$. If we now take as output target $\mathbf{r}_{L}^{*}=\mathbf{r}_{L}^{-}$, the output error is equal to

We now define a simple learning rule, $\Delta Q=-\sigma\boldsymbol{\xi}\mathbf{e}^{T}-\beta Q$, which is an anti-Hebbian rule with the output error $\mathbf{e}$ as presynaptic signal and the noise inside the neuron, $\sigma\boldsymbol{\xi}$, as postsynaptic signal, combined with weight decay. If $\boldsymbol{\xi}$ is uncorrelated white noise with a correlation matrix equal to the identity matrix, the expectation of this learning rule is

We see that this learning rule lets the feedback weights $Q$ align with the transpose of the network's Jacobian $J$, and has a weight decay term to prevent $Q$ from diverging.
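This simplified rule is easy to simulate. The sketch below (with an illustrative random Jacobian, and assuming the error convention $\mathbf{e}=\mathbf{r}_{L}^{*}-\tilde{\mathbf{r}}_{L}\approx-\sigma J\boldsymbol{\xi}$ to first order) applies $\Delta Q=-\sigma\boldsymbol{\xi}\mathbf{e}^{T}-\beta Q$ repeatedly and checks that $Q$ indeed aligns with $J^{T}$:

```python
import numpy as np

rng = np.random.default_rng(7)
n_out, n_v = 3, 10
J = rng.standard_normal((n_out, n_v))      # linearized network Jacobian
Q = rng.standard_normal((n_v, n_out))      # feedback weights, random initialization
sigma, beta, lr = 0.1, 0.1, 0.05

for _ in range(5000):
    xi = rng.standard_normal(n_v)                       # white noise in the layers
    e = -sigma * (J @ xi)                               # output error for target r_L* = r_L^-
    Q += lr * (-sigma * np.outer(xi, e) - beta * Q)     # anti-Hebbian rule with weight decay

cosine = np.sum(Q * J.T) / (np.linalg.norm(Q) * np.linalg.norm(J))
print("cosine similarity between Q and J^T:", round(cosine, 3))   # close to 1
```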

There are three important differences between this simplified intuitive argument for the feedback learning rule and the actual feedback learning rule (13) used by DFC, which we address in the next section.

DFC considers continuous dynamics; hence, the incorporation of noise leads to stochastic differential equations (SDEs) instead of a discrete perturbation of the network layers. The handling of SDEs needs special care, leading to the use of exponentially filtered white noise instead of pure white noise (see next section).

The presynaptic part of the feedback learning rule (13) for DFC is the control signal $\mathbf{u}$ instead of the output error $\mathbf{e}$. The control signal integrates the output error over time, causing correlations over time to arise in the feedback learning rule.

The feedback weights align with a damped pseudoinverse of the Jacobian, $J^{T}(JJ^{T}+\gamma I)^{-1}$ with $\gamma>0$, instead of with $J^{T}$.

C.1.2 Theorem and proof

Noise dynamics.

The network dynamics ( 1 ) are now given by

If we now assume that $\tau_{v^{\mathrm{fb}}}\ll\tau_{u}$, and hence that the dynamics of the feedback compartment are much faster than those of $\mathbf{u}$, then $\mathbf{v}^{\mathrm{fb}}$ can be approximated by

In the remainder of the section, we assume this approximation to be exact. The network dynamics ( 161 ) can then be written as

Now, we are ready to state and prove the main theorem of this section, which shows that the feedback weight plasticity rule (13) pushes the feedback weights to align with a damped pseudoinverse of the forward Jacobian $J$ of the network.

Theorem S11 .

and the first moment converges to:

with $\gamma=\alpha\beta\tau_{u}$. Furthermore, $Q_{M}^{\mathrm{ss}}$ satisfies Conditions 2 and 3, even if $\alpha=0$ in the latter.

Linearizing the system dynamics (which becomes exact in the limit of $\sigma\rightarrow 0$, assuming stable dynamics) results in the following dynamical equation for the controller, recalling that $\mathbf{r}_{L}^{*}=\mathbf{r}_{L}^{-}$ (cf. App. A.1):

with $\Delta\mathbf{v}_{i}\triangleq\mathbf{v}_{i}-W_{i}\phi(\mathbf{v}_{i-1})$ and $\Delta\mathbf{v}$ the concatenation of all $\Delta\mathbf{v}_{i}$. When we have a separation of timescales between the network and the controller, i.e., $\tau_{v}\ll\tau_{u}$, which corresponds to instantaneous network dynamics in (166), we get

where the latter is the concatenated version of the former. Combining this with eq. ( 169 ) gives the following stochastic differential equation for the controller dynamics:

When we have a separation of timescales between the synaptic plasticity and the controller dynamics, i.e., $\tau_{u}\ll\tau_{Q}$, we can treat $Q$ as constant, and therefore eq. (172) represents a linear time-invariant stochastic differential equation, which has as its solution [43]

Using the approximate solution of the feedback compartment (163) (which we consider exact due to the separation of timescales $\tau_{v^{\mathrm{fb}}}\ll\tau_{u}$), we can write the expectation of the first part of the feedback learning rule (13) as

Focusing on (a) and using the covariance of $\boldsymbol{\epsilon}$ (165), we get:

where, in the last step, we used that $\tau_{v^{\mathrm{fb}}}\ll\tau_{u}$, hence $\frac{1}{\tau_{v^{\mathrm{fb}}}}I-\frac{1}{\tau_{u}}A^{T}\approx\frac{1}{\tau_{v^{\mathrm{fb}}}}I$ and $\frac{1}{\tau_{v^{\mathrm{fb}}}}\int_{-t_{1}}^{0}e^{-\frac{1}{\tau_{v^{\mathrm{fb}}}}\tau}\text{d}\tau\approx 1$ when $\tau_{v^{\mathrm{fb}}}\ll t_{1}$ for $t_{1}>0$. If we further assume that $\alpha\gg\max\big(\{|\lambda_{i}(JQ)|\}\big)$, with $\lambda_{i}(JQ)$ the eigenvalues of $JQ$, we have that

Focusing on part (b), we get

Putting everything together, we get the following approximate dynamics for the first moment of $Q$:

Assuming the approximation to be exact and solving for the steady state, we get:

The only thing remaining to show is that the dynamics of $Q_{M}$ are convergent. By vectorizing eq. (189), we get

Lemma S12 below shows that $Q_{M}^{\mathrm{ss}}=\frac{\alpha}{2}J^{T}(JJ^{T}+\alpha\beta\tau_{u}I)^{-1}$ satisfies Conditions 2 and 3, even if $\alpha=0$ in the latter.

Lemma S12 .

The matrix $JJ^{T}(JJ^{T}+\gamma I)^{-1}$ with $\gamma\geq 0$ has strictly positive eigenvalues if $J$ is of full rank.

As its column space equals that of $J^{T}$, $Q=J^{T}(JJ^{T}+\gamma I)^{-1}$ with $\gamma\geq 0$ satisfies Condition 2.

Next, consider the singular value decomposition of $J$:

Using this decomposition, $JJ^{T}(JJ^{T}+\gamma I)^{-1}$ can be written as

which has strictly positive eigenvalues $\frac{\sigma_{i}^{2}}{\sigma_{i}^{2}+\gamma}>0$, thereby concluding the proof. ∎

C.2 Toy experiments corroborating the theory

To test whether Theorem S11 can also provide insight into more realistic settings, we conducted a series of student-teacher toy regression experiments with a one-hidden-layer network of size 20-10-5, for more realistic values of $\tau_{v^{\mathrm{fb}}}$, $\tau_{v}$, $\alpha$, and $k_{p}>0$. For details about the simulation implementation, see App. E. We investigate the learning of $Q$ during pre-training, i.e., when the forward weights $W_{i}$ are fixed. In contrast to Theorem S11, we use multiple batch samples for training the feedback weights. When the network is linear, $J$ remains the same for each batch sample, hence mimicking the situation of Theorem S11 where $Q$ is trained to convergence on only one sample. When the network is nonlinear, however, $J$ will be different for each sample, causing $Q$ to align with an average configuration over the batch samples.

Fig. S3.a shows the alignment of $Q$ with the damped pseudoinverse $J^{T}(JJ^{T}+\gamma I)^{-1}$ for different damping values $\gamma$ in a linear network. Interestingly, the damping value that best describes the alignment of $Q$ is $\gamma=5$, which is much larger than would be predicted by Theorem S11, which uses simplified conditions. Hence, the more realistic settings used in the simulation of these toy experiments result in a larger damping value $\gamma$. For nonlinear networks, similar conclusions can be drawn (see Fig. S3.b), albeit with slightly worse alignment due to $J$ changing for each batch sample. Note that almost perfect compliance with Condition 2 is reached in both the linear and nonlinear case (not shown here).


Next, we investigate how big $\alpha$ needs to be for good alignment. Surprisingly, Fig. S4 shows that $Q$ reaches almost perfect alignment for all values of $\alpha\in[0,1]$, both for linear and nonlinear networks. We hypothesize that this is due to the short simulation window (300 steps of $\Delta t=0.001$) that we used to reduce computational costs, which prevents the dynamics from diverging even when they are unstable. Interestingly, this hypothesis points to another case, besides big $\alpha$, in which the feedback learning rule (13) can be used: when the network activations can be 'reset' once they start diverging, e.g., by inhibition from other brain areas, the feedback weights can be learned properly even with unstable dynamics.

[Figure: alignment of $Q$ with $J^{T}(JJ^{T}+\gamma I)^{-1}$]

C.3 Learning the forward and feedback weights simultaneously

In this section, we show that the forward and feedback weights can be learned simultaneously when noise is added to the feedback compartment, resulting in the noisy dynamics of eq. (166), and when the feedback plasticity rule (13) uses a high-pass filtered version of $\mathbf{u}$ as the presynaptic plasticity signal.

We make the same assumptions as in Theorem S11, except that now the output target $\mathbf{r}_{L}^{*}$ is the one for learning the forward weights, hence given by eq. (3). Linearizing the network dynamics gives us the following expression for the control error

and for the controller dynamics (with $k_{p}=0$)

We use that $\Delta\mathbf{v}(t)=Q\mathbf{u}(t)+\sigma\boldsymbol{\epsilon}(t)$, giving us:

We now continue by investigating the dynamics of the newly defined signal $\Delta\mathbf{u}(t)$, which subtracts a baseline from the control signal $\mathbf{u}(t)$:

with $\mathbf{u}_{\mathrm{ss}}$ the steady state of $\mathbf{u}$ in the dynamics without noise (see Lemma 1). Rewriting the dynamics (197) for $\Delta\mathbf{u}$ gives us

We have now recovered exactly the same dynamics for $\Delta\mathbf{u}$ as we had for $\mathbf{u}$ (172) during the sleep phase of Theorem S11, where $\mathbf{r}_{L}^{*}=\mathbf{r}_{L}^{-}$. Now, we introduce a new plasticity rule for $Q$ that uses $\Delta\mathbf{u}$ instead of $\mathbf{u}$ as the presynaptic plasticity signal:

Upon noting that $\Delta\mathbf{u}$ (representing the noise fluctuations in $\mathbf{u}$) is independent of $\mathbf{u}_{\mathrm{ss}}$ (representing the control input needed to drive the network to $\mathbf{r}_{L}^{*}$), the approximate first-moment dynamics described in Theorem S11 also hold for the new plasticity rule (200). Furthermore, when the controller dynamics (197) have settled, $\mathbf{u}_{\mathrm{ss}}$ is the average of $\mathbf{u}(t)$ (which has zero-mean noise fluctuations on top of $\mathbf{u}_{\mathrm{ss}}$); hence, $\Delta\mathbf{u}$ can be seen as a high-pass filtered version of $\mathbf{u}(t)$.

To conclude, we have shown that the sleep phase for training the feedback weights $Q$ can be merged with the phase for training the forward weights with $\mathbf{r}_{L}^{*}$ as defined in eq. (3), if the plasticity rule for $Q$ (200) uses a high-pass filtered version $\Delta\mathbf{u}$ of $\mathbf{u}$ as the presynaptic plasticity signal and when the network and controller are fluctuating around their equilibrium, as we did not take initial conditions into account. We hypothesize that, even with initial dynamics that have not yet converged to the steady state, the plasticity rule for $Q$ (200) with $\Delta\mathbf{u}$ a high-pass filtered version of $\mathbf{u}$ will result in proper feedback learning, as high-pass filtering $\mathbf{u}(t)$ extracts high-frequency noise fluctuations (not all noise fluctuations are high-frequency; the important part of the hypothesis is that the high-pass filtering selects noise components that are zero-mean and correlate with $\mathbf{v}^{\mathrm{fb}}$) which are correlated with $\mathbf{v}^{\mathrm{fb}}$ and can hence be used for learning $Q$. We leave it to future work to verify this hypothesis experimentally. Merging the two phases into one has the consequence that noise is also present during the learning of the forward weights (5), which we investigate in the next subsection.
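As a rough illustration of the baseline subtraction meant here, the sketch below high-pass filters a noisy control signal by subtracting an exponential moving average; the time constants and noise level are illustrative and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(8)
dt, steps, tau_filter = 0.01, 1000, 0.5
u_ss = 2.0                                    # baseline control signal (noise-free steady state)

u_bar, delta_u = 0.0, np.zeros(steps)
for t in range(steps):
    u = u_ss + 0.1 * rng.standard_normal()    # control signal = baseline + noise fluctuation
    u_bar += dt / tau_filter * (u - u_bar)    # low-pass filter (exponential moving average)
    delta_u[t] = u - u_bar                    # high-pass filtered signal ~ noise fluctuations

# After settling, delta_u fluctuates around zero while the baseline has been removed.
print("mean of delta_u over the second half:", round(float(delta_u[steps // 2:].mean()), 3))
```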

C.4 Influence of noisy dynamics on learning the forward weights

When noise is present in the dynamics while the forward weights are being learned, it will influence the updates of $W_{i}$. It turns out that the same noise correlations that we used in the previous sections to learn the feedback weights will cause bias terms to appear in the updates of the forward weights $W_{i}$ (5). This issue is not unique to our DFC setting with a feedback controller but appears in general in methods that use error feedback and have realistic noise dynamics in their hidden layers. In this section, we lay out the issues caused by noisy dynamics when learning forward weights with general error-feedback methods. At the end of the section, we comment on the implications of these issues for DFC.

For simplicity, we consider a normal feedforward neural network

To incorporate the notion of noisy dynamics, we perturb each layer's pre-nonlinearity activation with zero-mean noise $\boldsymbol{\epsilon}_{i}$ and propagate the perturbations forward:

with $\boldsymbol{\epsilon}$ the concatenated vector of all $\boldsymbol{\epsilon}_{i}$. If the task loss is an $L^{2}$ loss and we have the training label $\mathbf{r}_{L}^{*}$, the output error is equal to

with $\boldsymbol{\delta}_{L}=\mathbf{r}_{L}^{*}-\mathbf{r}_{L}^{-}$ the output error without noise perturbations. To remain general, we define the feedback path $\mathbf{e}_{i}=g_{i}(\mathbf{e}_{L})$ that transports the output error $\mathbf{e}_{L}$ to hidden layer $i$, at the level of the pre-nonlinearity activations. For example, for BP, $\mathbf{e}_{i}=g_{i}(\mathbf{e}_{L})=J_{i}^{T}\mathbf{e}_{L}$, and for direct linear feedback mappings such as DFA, $\mathbf{e}_{i}=g_{i}(\mathbf{e}_{L})=Q_{i}\mathbf{e}_{L}$. Now, the commonly used update rule of a postsynaptic error signal multiplied with the presynaptic input gives (after a first-order Taylor expansion of all terms)

with $\boldsymbol{\delta}_{i}=g_{i}(\boldsymbol{\delta}_{L})$, $J_{g_{i}}=\frac{\partial g_{i}(\mathbf{e}_{L})}{\partial\mathbf{e}_{L}}\big\rvert_{\mathbf{e}_{L}=\boldsymbol{\delta}_{L}}$, and $D_{i}=\frac{\partial\mathbf{r}^{-}_{i}}{\partial\mathbf{v}_{i}}\big\rvert_{\mathbf{v}_{i}=\mathbf{v}_{i}^{-}}$. Taking the expectation of $\Delta W_{i}$, we get

with $\Sigma_{i-1}$ the covariance matrix of $\boldsymbol{\epsilon}_{i-1}$. We see that, besides the desired update $\eta\boldsymbol{\delta}_{i}\mathbf{r}_{i-1}^{-T}$, a bias term appears due to the noise, which scales with $\sigma^{2}$ and cannot be removed by averaging over weight updates. The noise bias arises from the correlation between the noise in the presynaptic input $\tilde{\mathbf{r}}_{i-1}$ and the noise in the postsynaptic error $\mathbf{e}_{i}$. Note that it is not a valid strategy to assume that the noise in $\mathbf{e}_{i}$ is uncorrelated with the noise in $\tilde{\mathbf{r}}_{i-1}$ because of a time delay between the two signals: in more realistic cases, $\boldsymbol{\epsilon}$ originates from stochastic dynamics that integrate noise over time (e.g., one can think of $\boldsymbol{\epsilon}$ as an Ornstein-Uhlenbeck process [43]) and is hence always correlated over time.

In DFC, similar noise biases arise in the average updates of $W_i$. To reduce the relative impact of the noise bias on the weight update, the ratio $\|\boldsymbol{\delta}_i\|_2/\sigma^2$ must be large enough; hence, strong error feedback is needed. In DFC, $\|\boldsymbol{\delta}_L\|_2$, and hence also the postsynaptic error term in the weight updates for $W_i$, scales with the target stepsize $\lambda$. Interestingly, this causes a trade-off to appear in DFC: on the one hand, $\lambda$ needs to be small such that the weight updates (5) approximate GN and MN optimization (the theorems used Taylor approximations which become exact for $\lambda\rightarrow 0$), and on the other hand, $\lambda$ needs to be large to prevent the forward weight updates from being buried in the noise bias.

A possible solution for removing the noise bias from the average forward weight updates is to buffer the postsynaptic error term, the presynaptic input $\mathbf{r}_{i-1}$, or both (e.g., by accumulating or low-pass filtering them) before they are multiplied with each other to produce the weight update. This procedure averages the noise out of the signals before it has the chance to correlate across them in the weight update. Whether this procedure could correspond to biophysical mechanisms in a neuron is an interesting question for future work.
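To make this concrete, here is a minimal toy sketch in Python/NumPy (with purely hypothetical dimensions, time constants, and noise parameters that are not taken from the DFC implementation) comparing the naive Hebbian product of noisy pre- and postsynaptic signals with a version in which both factors are first low-pass filtered before being multiplied:

    import numpy as np

    rng = np.random.default_rng(0)
    n_pre, n_post, T = 20, 10, 20000
    dt, tau_noise, tau_filter, sigma = 1.0, 10.0, 200.0, 0.3   # hypothetical constants

    delta_i = rng.normal(size=n_post)   # noise-free postsynaptic error
    r_pre = rng.normal(size=n_pre)      # noise-free presynaptic input

    dw_raw = np.zeros((n_post, n_pre))
    dw_filt = np.zeros((n_post, n_pre))
    e_filt, r_filt = np.zeros(n_post), np.zeros(n_pre)
    noise = np.zeros(n_pre)

    for t in range(T):
        # Ornstein-Uhlenbeck-like noise shared by the pre- and postsynaptic signals
        noise += dt / tau_noise * (-noise) + sigma * np.sqrt(dt) * rng.normal(size=n_pre)
        r_noisy = r_pre + noise
        e_noisy = delta_i + noise[:n_post]

        # naive Hebbian product: accumulates a bias from the noise correlations
        dw_raw += np.outer(e_noisy, r_noisy) / T

        # buffer (low-pass filter) both factors before multiplying them
        e_filt += dt / tau_filter * (e_noisy - e_filt)
        r_filt += dt / tau_filter * (r_noisy - r_filt)
        dw_filt += np.outer(e_filt, r_filt) / T

    ideal = np.outer(delta_i, r_pre)
    print("bias norm (raw):     ", np.linalg.norm(dw_raw - ideal))
    print("bias norm (filtered):", np.linalg.norm(dw_filt - ideal))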

Appendix D Related work

Our learning theory analysis that connects DFC to Gauss-Newton (GN) optimization was inspired by three independent recent studies that, on the one hand, connect Target Propagation (TP) to GN optimization [21, 22] and, on the other hand, point to a possible connection between Dynamic Inversion (DI) and GN optimization [32]. There are, however, important distinctions between how DFC approximates GN and how TP and DI approximate GN. In the following subsections, we discuss these related lines of work in detail.

D.1 Comparison of DFC to TP and variants

Recent work [ 21 , 22 ] discovered that learning through inverses of the forward pathway can in certain cases lead to an approximation of GN optimization. Although this finding inspired our theoretical results on the CA capabilities of DFC, there are fundamental differences between DFC and TP. The main conceptual difference between DFC and the variants of TP [ 19 , 20 , 21 , 22 ] is that DFC uses the combination of network dynamics and a controller to dynamically invert the forward pathway for CA, whereas TP and its variants learn parametric inverses of the forward pathway, encoded in the feedback weights. Although dynamic and parametric inversion seem closely related, they lead to major methodological and theoretical differences.

Methodological differences between DFC and TP.

First, for TP and its variants, the task of approximating the inverse of the forward pathway is completely put onto the feedback weights, resulting in the need for a strict relation between the feedforward and feedback pathway at all times during training. DFC, in contrast, reuses the forward pathway to dynamically compute its inverse, resulting in a more flexible relation between the feedforward and feedback pathway, described by Condition 2 . To the best of our knowledge, DFC is the first method that approximates a principled optimization method for feedforward neural networks of arbitrary dimensions, compatible with a wide range of feedback connectivity. The recent work of Bengio [ 22 ] iteratively improves the inverse and, hence, can compensate for imperfect parametric inverses. However, this method is developed only for invertible networks, which require all layers to have equal dimensions.

Second, DFC drives the hidden neural activations to target values simultaneously, hence letting 'target activations' from upstream layers influence 'target activations' from downstream layers. TP, in contrast, computes each target as a (pseudo)inverse of the output target independently. This is a subtle yet important difference between DFC and TP, which leads to significant theoretical differences, on which we will expand later. To gain intuition, consider the case where we update the weights of both DFC and TP to reach exactly the local layer targets. In TP, if we update the weights of a hidden layer to reach its target, all downstream layers will also reach their targets without updating their weights. Hence, if we update all weights simultaneously, the output will overshoot its target. DFC, in contrast, already takes the effect of the updated target values of upstream layers into account; hence, when all weight updates are applied simultaneously, the output target is reached exactly (in the linearized dynamics, cf. Theorem 3).

Third, DFC needs significantly less external coordination compared to the recent TP variants. The new variants of TP with a link to GN [21] need highly coordinated noise phases for computing the Difference Reconstruction Loss (one separate noise phase for each layer). For DTP [20], similar coordination is needed if noisy activations are used for computing the reconstruction loss, as proposed by the authors. The iterative variant of TP [22] needs coordination in propagating the target values, as the target iterations for a layer can only start when the iterations of the downstream layer have converged. As DFC uses dynamic inversion instead of parametric inversion, possible learning rules for the feedback weights do not need to use the Difference Reconstruction Loss [21] or variants thereof, opening the route to alternative, more biologically realistic learning rules. We propose a first feedback learning rule compatible with DFC that makes use of noise and Hebbian learning, without the need for extensive external coordination (see also App. C.3, which merges feedforward and feedback weight training into a single phase).

Finally, DFC uses a multi-compartment neuron model closely corresponding to recent models of cortical pyramidal neurons, to obtain plasticity rules fully local in space and time. Presently, it is unclear whether there exist similar neuron and network models for TP that result in plasticity rules local in time.

Theoretical differences between DFC and TP.

First, computing layerwise inverses, as is done in TP [19], DTP [20], and iterative TP [22], can only be linked to GN for invertible networks but breaks down for non-invertible networks, as shown by Meulemans et al. [21]. Both DFC and the DRL variants of TP [21] establish a link to GN for both invertible and non-invertible feedforward networks of arbitrary dimensions. However, the DRL variants of TP are linked to a hybrid version of GN and gradient descent, whereas DFC, under appropriate conditions, is linked to pure GN optimization on the parameters. Our Theorems 2 and 3 differ from the theoretical results on the DRL variants of TP [21] because: (i) the DRL variants compute targets for the post-nonlinearity activations, whereas the DFC target activations, $\mathbf{v}_i$, are pre-nonlinearity activations; and (ii) the DRL variants compute the targets for each layer independently, whereas DFC dynamically computes the targets while taking into account the changed target activations of other layers. We continue by expanding on this second point.

As explained intuitively before, TP and its variants compute each layer target independently from the other layer targets. Consequently, to link their variants of TP to GN optimization, Meulemans et al. [21] and Bengio [22] need to make a block-diagonal approximation of the GN curvature matrix, with each block corresponding to a single layer. As the off-diagonal blocks are set to zero, influences of upstream target values on the downstream targets are ignored. The block-diagonal approximation of the GN curvature matrix was proposed in studies that used GN optimization to train deep neural networks with large minibatch sizes [64, 65]. However, similar to DFC, TP is connected to GN with a minibatch size of 1. In this case, the GN curvature matrix is of low rank, and a block-diagonal approximation of this matrix changes its rank and hence its properties. In the analysis of DFC, in contrast, we do not need to make this block-diagonal approximation, as the target activations, $\mathbf{v}_i$, influence each other. Consequently, DFC has a closer connection to GN optimization than the TP variants [21, 22].

Finally, DFC does not use a reconstruction loss to train the feedback weights but instead uses noise and Hebbian learning.

Empirical comparison of DFC to TP and variants

Table S1 shows the results for DTP [20] and DDTP-linear [21] (the best performing variant of TP in [21]) on MNIST, Fashion-MNIST, MNIST-autoencoder, and MNIST (train), for the same architectures as used for Table 1.

Comparing these results to the ones in Table 1, we see that DFC outperforms DTP on all datasets and DDTP-linear on MNIST-autoencoder, while having similar performance on the other datasets. These encouraging results suggest that the closer connection of DFC to GN, compared to that of DDTP-linear, leads to practical improvements in performance on some of the more challenging datasets.

D.2 Comparison of DFC to Dynamic Inversion

$J_i = \frac{\partial\mathbf{r}_L}{\partial\mathbf{r}_i}$, since: (i) the pseudoinverse cannot be factorized over the layers [21]; and (ii) in nonlinear networks, the Jacobians are evaluated at a wrong value because DI transmits errors instead of controlled layer activations through the forward path of the network during the dynamical inversion phase.

D.3 The core contributions of DFC

In summary, we see that DFC merges various insights from different fields resulting in a novel biologically plausible CA technique with unique and interesting properties that transcend the sheer sum of its parts. To clarify the novelty of our work, we summarize here again the core contributions of DFC:

DFC extends the idea of using a feedback controller to adjust network activations to also provide CA to DNNs by using it to track the desired output target, opening a new route for designing principled CA methods for DNNs.

To the best of our knowledge, DFC is the first method that approximates a principled optimization method for feedforward neural networks of arbitrary dimensions, while allowing for a wide and flexible range of feedback connectivity, in contrast to a single allowed feedback configuration.

The learning rules of DFC for the forward and feedback weights are fully local both in time and space, in contrast to many other biologically plausible learning rules. Furthermore, DFC does not need highly specific connectivity motifs nor tightly coordinated plasticity mechanisms and can have all weights plastic simultaneously, if the adaptations explained in appendix C.3 are used.

The multi-compartment neuron model needed for DFC naturally corresponds to recent multi-compartment models of pyramidal neurons.

Appendix E Simulations and algorithms of DFC

In this section, we provide details on the simulation and algorithms used for DFC, DFC-SS, DFC-SSA and for training the feedback weights.

E.1 Simulating DFC and DFC-SS for training the forward weights

For simulating the network dynamics ( 1 ) and controller dynamics ( 4 ) without noise, we used the forward Euler method with some slight modifications. First, we implemented the controller dynamics ( 4 ) as follows:

with $\tilde{\alpha} = \frac{\alpha}{1+k_p\alpha}$. Hence, this is just an implementation strategy to gain direct control over $\tilde{\alpha}$ as a hyperparameter independent of $k_p$.

Second, the feedback compartment is updated as $\mathbf{v}^{\mathrm{fb}}_i[k+1] = Q_i\mathbf{u}[k]$, such that the control error $\mathbf{e}[k]$ of the previous timestep is used to provide feedback, instead of the control error $\mathbf{e}[k-1]$ of two timesteps ago. (In the code repository, this modification to Euler's method is indicated with the command line argument proactive_controller.) Again, this modification has almost no effect for small stepsizes $\Delta t$, but it better reflects the underlying continuous dynamics for larger stepsizes. In our simulations, the stepsize $\Delta t$ that worked best for the experiments was small; hence, the discussed modifications had only minor effects on the simulation.

For DFC-SS, the same simulation strategy is used, with the only difference that the weight updates $\Delta W_i$ use only the network activations of the last simulation step (see Algorithm 2). Finally, for DFC-SSA, we directly compute the steady-state solutions according to Lemma 1 (see Algorithm 3).
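As an illustration of this simulation strategy, the following sketch runs the forward Euler loop for a small two-layer tanh toy network with direct linear feedback and a leaky-integral plus proportional controller. The toy layer sizes, the random target, and the choice $Q_1 = W_2^T$ are our own illustrative assumptions; only the simulation hyperparameters ($k_p=2$, $\tau_u=1$, $\tau_v=0.2$, $\Delta t = 0.02$, 1000 steps) follow the values reported in Section F.4:

    import numpy as np

    rng = np.random.default_rng(1)
    phi = np.tanh

    n_in, n_hid, n_out = 4, 5, 2                    # toy layer sizes (ours)
    W1 = rng.normal(scale=0.5, size=(n_hid, n_in))
    W2 = rng.normal(scale=0.5, size=(n_out, n_hid))
    Q1, Q2 = W2.T.copy(), np.eye(n_out)             # feedback weights, roughly Q = J^T

    k_p, tau_u, tau_v, dt, alpha, lam = 2.0, 1.0, 0.2, 0.02, 0.1, 0.01

    r0 = rng.normal(size=n_in)
    v1 = W1 @ r0
    v2 = W2 @ phi(v1)
    r_target = phi(v2) + lam * rng.normal(size=n_out)   # toy output target

    u_int = np.zeros(n_out)
    for k in range(1000):
        e = r_target - phi(v2)                  # control error
        u = u_int + k_p * e                     # integral plus proportional control signal
        # forward Euler steps for the network layers and the leaky integrator
        v1 += dt / tau_v * (-v1 + W1 @ r0 + Q1 @ u)
        v2 += dt / tau_v * (-v2 + W2 @ phi(v1) + Q2 @ u)
        u_int += dt / tau_u * (e - alpha * u_int)

    print("remaining output error:", np.linalg.norm(r_target - phi(v2)))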

E.2 Simulating DFC with noisy dynamics for training the feedback weights

For simulating the noisy dynamics during the training of the feedback weights, we use the Euler-Maruyama method [43], which is the stochastic version of the forward Euler method. As discussed in App. C, we let white noise $\boldsymbol{\xi}$ enter the dynamics of the feedback compartment and we now take a finite time constant $\tau_{v^{\mathrm{fb}}}$ for the feedback compartment, as the instantaneous form with $\tau_{v^{\mathrm{fb}}}\rightarrow 0$ (which we used for simulating the network dynamics without noise) is not well defined when noise enters the dynamics:

The dynamics of the network then become

and, as before, eq. (208) is taken for the controller dynamics. Using the Euler-Maruyama method [43], the feedback compartment dynamics (209) can be simulated as

As all other dynamical equations do not contain noise, their simulation remains equivalent to the simulation with the forward Euler method. Algorithm 4 provides the pseudocode of the simulation of DFC during the feedback weight training phase.
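As a minimal illustration of an Euler-Maruyama step for the feedback compartment, the sketch below assumes a feedback-compartment SDE of the form $\tau_{v^{\mathrm{fb}}}\,\mathrm{d}\mathbf{v}^{\mathrm{fb}} = (-\mathbf{v}^{\mathrm{fb}} + Q_i\mathbf{u})\,\mathrm{d}t + \sigma\,\mathrm{d}\boldsymbol{\xi}$ and holds the control signal fixed; the dimensions, noise level, and time constants are illustrative rather than the values used in the experiments:

    import numpy as np

    rng = np.random.default_rng(2)

    n_layer, n_out = 5, 2
    Q_i = rng.normal(scale=0.1, size=(n_layer, n_out))
    u = rng.normal(size=n_out)              # control signal, held fixed here for brevity

    tau_vfb, sigma, dt_fb = 0.1, 0.05, 0.001
    v_fb = np.zeros(n_layer)

    for k in range(5000):
        drift = (-v_fb + Q_i @ u) / tau_vfb
        diffusion = (sigma / tau_vfb) * np.sqrt(dt_fb) * rng.normal(size=n_layer)
        v_fb = v_fb + dt_fb * drift + diffusion     # one Euler-Maruyama step

    print("feedback compartment fluctuates around Q_i @ u:", v_fb, Q_i @ u)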

Appendix F Experiments

F.1 Description of the alignment measures

In this section, we describe the alignment measures used in Fig. 3 in detail.

Condition 2.

Fig. 3 A describes how well the network satisfies Condition 2. For this, we project $Q$ onto the column space of $J^T$, using a projection matrix $P_{J^T}$:

Then, we compare the Frobenius norm of the projection of $Q$ with the norm of $Q$ itself, via their ratio:

Notice that $\mathrm{ratio}_{\mathrm{Con2}}=1$ indicates that the column space of $Q$ lies fully inside the column space of $J^T$, hence indicating that Condition 2 is satisfied. (In degenerate cases, $Q$ could be of lower rank and still have $\mathrm{ratio}_{\mathrm{Con2}}=1$ if its reduced column space lies inside the column space of $J^T$; as $Q$ is a skinny matrix, we assume it is always of full rank and do not consider this degenerate scenario.) At the opposite extreme, $\mathrm{ratio}_{\mathrm{Con2}}=0$ indicates that the column space of $Q$ is orthogonal to the column space of $J^T$.
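The measure can be computed directly from the Jacobian and the feedback weights; a minimal sketch (with illustrative shapes, using the pseudoinverse to build the projection matrix onto the column space of $J^T$) is:

    import numpy as np

    def ratio_con2(J, Q):
        # projection matrix onto the column space of J^T: P = J^T (J J^T)^+ J
        P = J.T @ np.linalg.pinv(J @ J.T) @ J
        # fraction of Q's Frobenius norm that lies inside that column space
        return np.linalg.norm(P @ Q) / np.linalg.norm(Q)

    rng = np.random.default_rng(0)
    J = rng.normal(size=(2, 12))        # wide Jacobian: 2 outputs, 12 concatenated hidden units
    print(ratio_con2(J, J.T))           # Q = J^T lies fully in col(J^T): ratio = 1
    Q = rng.normal(size=(12, 2))
    Q -= J.T @ np.linalg.pinv(J @ J.T) @ (J @ Q)   # remove the component inside col(J^T)
    print(ratio_con2(J, Q))             # orthogonal Q: ratio close to 0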

Condition 1.

Fig. 3 C describes how well the network satisfies Condition 1. This condition states that all layers (except the output layer) have an equal $L^2$ norm. To measure how well Condition 1 is satisfied, we compute the standard deviation of the layer norms over the layers and normalize it by the average layer norm:

We take $\mathbf{r}_i = \mathbf{r}_i^-$ to compute this measure, but other values of $\mathbf{r}_i$ during the dynamics would also work, as they remain close together for a small target stepsize $\lambda$. Notice that $\mathrm{ratio}_{\mathrm{Con1}}=0$ indicates perfect compliance with Condition 1, as then all layers have the same norm, whereas $\mathrm{ratio}_{\mathrm{Con1}}=1$ indicates that the layer norms vary by $\mathrm{mean}(\|\mathbf{r}\|_2)$ on average, hence indicating that Condition 1 is not satisfied at all.
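A small sketch of this measure (whether the population or sample standard deviation is used is not specified, so NumPy's default is taken here; the example activations are illustrative):

    import numpy as np

    def ratio_con1(layer_activations):
        # normalized spread of the layer norms: 0 means all layers have equal L2 norm
        norms = np.array([np.linalg.norm(r) for r in layer_activations])
        return norms.std() / norms.mean()

    rng = np.random.default_rng(0)
    equal = [np.ones(10), np.ones(20) / np.sqrt(2.0)]    # both layers have L2 norm sqrt(10)
    print(ratio_con1(equal))                             # -> 0.0
    uneven = [rng.normal(size=10), 5.0 * rng.normal(size=20)]
    print(ratio_con1(uneven))                            # clearly larger than 0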

Stability measure.

Fig. 3 E describes the stability of DFC during training. For this, we plot the maximum real part of the eigenvalues of the total system matrix $A_{PI}$ around the steady state (see eq. (138)), which describes the dynamics of DFC around the steady state (incorporating $k_p$ and the actual time constants, in contrast to Condition 3).

Alignment with MN updates.

Fig. 3 B describes the alignment of the DFC updates with the ideal weighted MN updates. The MN updates are computed as follows:

with $R$ defined in eq. (72) and $\bar{W}$ the concatenated vectorized form of all weights $W_i$. For the alignment measurements in the computer vision experiments (see Section F.5.3), we use a damped variant of the MN updates:

with $\gamma$ some positive damping constant. The damping constant is needed to incorporate the damping effect of the leakage constant, $\alpha$, into the dynamical inversion, but also to reflect an implicit damping effect. Meulemans et al. [21] showed that introducing a higher damping constant, $\gamma$, in the pseudoinverse (216) better reflected the updates made by TP, which uses learned inverses. We found empirically that a higher damping constant, $\gamma$, also better reflects the updates made by DFC. Using a similar argumentation, we hypothesize that this implicit damping in DFC originates from the fact that, in nonlinear networks, $J$ changes for each batch sample and hence $Q$ cannot satisfy Condition 2 for each batch sample. Consequently, $Q$ tries to satisfy Condition 2 as well as possible across all batch samples, but does not satisfy it perfectly, resulting in a phenomenon that can be partially described by implicit damping.
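For intuition, the sketch below shows the generic damped minimum-norm solution $\Delta\bar{W} = J_{\bar{W}}^T (J_{\bar{W}} J_{\bar{W}}^T + \gamma I)^{-1}\boldsymbol{\delta}_L$; this is the unweighted special case and omits the weighting matrix $R$ from eq. (72), so it illustrates the role of the damping rather than reproducing the exact MN updates used in the paper:

    import numpy as np

    def damped_min_norm_update(J_bar, delta_L, gamma=1e-2):
        # damped minimum-norm solution of J_bar @ dW = delta_L (unweighted special case)
        n_out = J_bar.shape[0]
        return J_bar.T @ np.linalg.solve(J_bar @ J_bar.T + gamma * np.eye(n_out), delta_L)

    rng = np.random.default_rng(0)
    J_bar = rng.normal(size=(5, 200))       # 5 outputs, 200 vectorized parameters
    delta_L = rng.normal(size=5)
    dW = damped_min_norm_update(J_bar, delta_L, gamma=0.0)
    print(np.allclose(J_bar @ dW, delta_L))                  # gamma = 0: error solved exactly
    dW_damped = damped_min_norm_update(J_bar, delta_L, gamma=1.0)
    print(np.linalg.norm(dW_damped) < np.linalg.norm(dW))    # damping shrinks the update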

Alignment with GN updates.

Fig. 3 D describes the alignment of the DFC updates with the ideal GN updates. The GN updates are computed as follows:

with $J_{\bar{W}} = \frac{\partial\mathbf{r}_L^-}{\partial\bar{W}}$, evaluated at the feedforward activations $\mathbf{r}_i^-$. Similarly to the MN updates, we also introduce a damped variant of the GN updates, which is used in the computer vision alignment experiments (Section F.5.3):

where the damping constants, $\alpha$ and $\gamma$, reflect the leakage constant and the implicit damping effects, respectively.

Alignment with DFC-SSA updates.

Finally, Fig. 3 F describes the alignment of the DFC updates with the DFC-SSA updates which use the linearized analytical steady-state solution of the dynamics. The DFC-SSA updates are computed as follows (see also Algorithm 3 ):

F.2 Description of training

Training phases.

Student-teacher toy regression.

For the toy experiments of Fig. 3 , we use the student-teacher regression paradigm. Here, a randomly initialized teacher generates a synthetic regression dataset using random inputs. A separate randomly initialized student is then trained on this synthetic dataset. We used more hidden layers and neurons for the teacher network compared to the student network, such that the student network cannot get ‘lucky’ by being initialized close to the teacher network.
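A minimal sketch of this data-generating setup in PyTorch is shown below. The student architecture follows Section F.3 (two hidden layers of 10 tanh units, a linear output of 5 neurons, input dimension 15); the teacher size and the number of samples are our own illustrative assumptions, chosen only to be larger than the student:

    import torch

    torch.manual_seed(0)

    def make_mlp(sizes):
        layers = []
        for i in range(len(sizes) - 1):
            layers.append(torch.nn.Linear(sizes[i], sizes[i + 1]))
            if i < len(sizes) - 2:
                layers.append(torch.nn.Tanh())
        return torch.nn.Sequential(*layers)

    teacher = make_mlp([15, 30, 30, 30, 5])   # wider and deeper than the student (sizes assumed)
    student = make_mlp([15, 10, 10, 5])       # student architecture from Section F.3

    with torch.no_grad():
        x_train = torch.randn(2000, 15)       # random inputs
        y_train = teacher(x_train)            # synthetic regression targets

    # the student is subsequently trained on (x_train, y_train) with DFC or a baseline method
    print(x_train.shape, y_train.shape)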

In student-teacher toy regression experiments, we use vanilla SGD without momentum as an optimizer. In the computer vision experiments, we use a separate Adam optimizer [ 44 ] for the forward and feedback weights, as this improves training results compared to vanilla SGD. As Adam was designed for BP updates, it will likely not be an optimal optimizer for DFC, which uses MN updates. An interesting future research direction is to design new optimizers that are tailored towards the MN updates of DFC, to further improve its performance. We used gradient clipping for all DFC experiments to prevent too large updates when the inverse of J 𝐽 J is poorly conditioned.

Training length and reported test results.

For the classification experiments, we used 100 epochs of training for the forward weights (and a corresponding number of feedback training epochs, depending on $X$). As the autoencoder experiment was more resource-intensive, we trained the models for only 25 epochs there, as this was sufficient for near-perfect autoencoding performance when visually inspected (see Fig. S14). For all experiments, we split the 60000 training samples into a validation set of 5000 samples and a training set of 55000 samples. The hyperparameter searches are done based on the validation accuracy (validation loss for MNIST-autoencoder and train loss for MNIST-train) and we report the test results corresponding to the epoch with the best validation results in Table 1.

Weight initializations.

All network weights are initialized with the Glorot-Bengio normal initialization [ 66 ] , except when stated otherwise.

Initialization of the fixed feedback weights.

For the variants of DFC with fixed feedback weights, we use the following initialization:

For $\tanh$ networks, this initialization approximately satisfies Conditions 2 and 3 at the beginning of training. This is because $Q$ will approximate $J^T$, as the forward weights are initialized by the Glorot-Bengio normal initialization [66] and the network will consequently be in the approximately linear regime of the $\tanh$ nonlinearities.
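A natural choice consistent with this property (so that $Q$ approximates $J^T$ in the linear regime of $\tanh$) is to set each $Q_i$ to the product of the transposed downstream forward weight matrices. The sketch below illustrates this reading; it is our interpretation, not a verbatim reproduction of the paper's initialization formula:

    import torch

    torch.manual_seed(0)

    sizes = [15, 10, 10, 5]        # illustrative layer sizes
    W = [torch.empty(sizes[i + 1], sizes[i]) for i in range(len(sizes) - 1)]
    for Wi in W:
        torch.nn.init.xavier_normal_(Wi)    # Glorot-Bengio normal initialization

    # hypothetical fixed feedback initialization: Q_i = W_{i+1}^T ... W_L^T,
    # so that Q approximates J^T around the origin (linear regime of tanh)
    Q = []
    for i in range(len(W)):
        Qi = torch.eye(sizes[-1])
        for Wj in reversed(W[i + 1:]):
            Qi = Wj.T @ Qi
        Q.append(Qi)

    for i, Qi in enumerate(Q):
        print(f"Q_{i + 1}: {tuple(Qi.shape)}")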

Freeze $Q_L$.

For the MNIST-autoencoder experiments, we fixed the output feedback weights to $Q_L = I$, i.e., one-to-one connections between $\mathbf{r}_L$ and $\mathbf{u}$. As we did not train $Q_L$, we also did not introduce noise in the output layer during the training of the feedback weights. Freezing $Q_L$ prevents the noise in the high-dimensional output layer from burying the noise information originating from the small bottleneck layer, and hence enables better feedback weight training. This measure modestly improved the performance of DFC on MNIST-autoencoder (without fixing $Q_L$, the performance of all DFC variants was around 0.13 test loss, cf. Table 1, which is not a big decrease in performance). Freezing $Q_L$ does not give us any advantage over BP or DFA, as these methods implicitly assume direct access to the output error, i.e., they also have fixed feedback connections between the error neurons and output neurons equal to the identity matrix. We provided the option to freeze $Q_L$ in the hyperparameter searches of all experiments, but this is not necessary for optimal performance of DFC in general, as this option was not always selected by the hyperparameter searches.

Double precision.

We noticed that the standard data type float32 of PyTorch [67] caused numerical errors to appear during the last epochs of training, when the output error $\boldsymbol{\delta}_L$ is very small. For small $\boldsymbol{\delta}_L$, the difference $\phi(\mathbf{v}_i) - \phi(\mathbf{v}^{\mathrm{ff}}_i)$ in the forward weight updates (5) is very small and can result in numerical underflow. We solved this numerical problem by using float64 (double precision) as the data type.

F.3 Architecture details

We use fully connected (FC) architectures for all experiments.

Classification experiments (MNIST, Fashion-MNIST, MNIST-train): 3 FC hidden layers of 256 neurons with $\tanh$ nonlinearity and 1 softmax output layer of 10 neurons.

MNIST-autoencoder: 256-32-256 FC hidden layers with tanh-linear-tanh nonlinearities and a linear output layer of 784 neurons.

Student-teacher regression (Fig. 3 ): 2 FC hidden layers of 10 neurons and tanh nonlinearities, a linear output layer of 5 neurons, and input dimension 15.

Absorbing softmax into the cross-entropy loss.

For the classification experiments (MNIST, Fashion-MNIST, and MNIST-train), we used a softmax output nonlinearity in combination with the cross-entropy loss. As the softmax nonlinearity and the cross-entropy loss cancel out each other's curvatures, originating from the exponential and log terms respectively, it is best to combine them into one output loss:

with $\mathbf{y}^{(b)}$ the one-hot vector representing the class label of sample $b$, and $\log$ the element-wise logarithm. Now, as the softmax is absorbed into the loss function, the network output $\mathbf{r}_L$ can be taken to be linear and the output target is computed with eq. (3) using $\mathcal{L}^{\text{combined}}$.
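In PyTorch this combined loss is what torch.nn.functional.cross_entropy computes from the linear outputs (logits); the small check below also spells it out with an explicit element-wise log of the softmax (averaging over the batch is assumed here):

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    logits = torch.randn(3, 10)              # linear network outputs r_L for a batch of 3
    labels = torch.tensor([2, 7, 1])         # class indices, equivalent to one-hot y^(b)

    loss_combined = F.cross_entropy(logits, labels)          # softmax absorbed into the loss

    log_probs = torch.log_softmax(logits, dim=1)             # element-wise log of the softmax
    loss_manual = -log_probs[torch.arange(3), labels].mean()

    print(torch.allclose(loss_combined, loss_manual))        # True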

F.4 Hyperparameter searches

All hyperparameter searches were based on the best validation accuracy (best validation loss for MNIST-autoencoder and last train loss for MNIST-train) over all training epochs, using 5000 validation datasamples extracted from the training set. We use the Tree of Parzen Estimators hyperparameter optimization algorithm [ 68 ] based on the Hyperopt [ 69 ] and Ray Tune [ 70 ] Python libraries.

Due to the heavy computational cost of simulating DFC, we performed hyperparameter searches only for DFC-SSA, DFC-SSA (fixed), BP and DFA (200 hyperparameter samples for all methods). We used the hyperparameters found for DFC-SSA and DFC-SSA (fixed) for DFC and DFC-SS, and for DFC (fixed) and DFC-SS (fixed), respectively, together with standard simulation hyperparameters for the forward weight training that proved to work well ($k_p=2$, $\tau_u=1$, $\tau_v=0.2$, forward Euler stepsize $\Delta t = 0.02$, and 1000 simulation steps).

Tables S2 and S3 provide the hyperparameters and search intervals that we used for DFC-SSA in all experiments. We included the simulation hyperparameters for the feedback training phase in the search to prevent us from fine-tuning the simulations by hand. Note that we use different simulation hyperparameters for the forward training phase (see the paragraph above) and the feedback training phase (see Table S3). This is because the simulation of the feedback training phase needs a small stepsize, $\Delta t_{\mathrm{fb}}$, and a small network time constant, $\tau_v$, to properly simulate the stochastic dynamics. For the forward phase, however, we need to simulate over a much longer time interval, so taking small $\Delta t$ and $\tau_v$ (the simulation stepsize, $\Delta t$, needs to be smaller than the time constants) would be too resource-intensive. When using $k_p=2$, $\tau_u=1$, and $\tau_v=0.2$ during the simulation of the forward training phase, much bigger timesteps such as $\Delta t = 0.02$ can be used. Note that these simulation parameters do not change the steady state of the controller and network, as $\tilde{\alpha}$ is independent of $k_p$ in our implementation. We also differentiated $\tilde{\alpha}$ in the forward training phase from $\tilde{\alpha}_{\mathrm{fb}}$ in the feedback training phase, as the theory predicted that a bigger leakage constant is needed during the feedback training phase in the first epochs. However, toy simulations in Section C suggest that the feedback learning also works for smaller $\tilde{\alpha}$, which we did not explore in the computer vision experiments. Finally, we used $\mathrm{lr}\cdot\lambda$ and $\lambda$ as hyperparameters in the search instead of $\mathrm{lr}$ and $\lambda$ separately, as $\mathrm{lr}$ and $\lambda$ have a similar influence on the magnitude of the forward parameter updates. The specific hyperparameter configurations for all experiments can be found in our codebase (a PyTorch implementation of all methods is available at https://github.com/meulemansalex/deep_feedback_control ).

F.5 Extended experimental results

In this section, we provide extra experimental results accompanying the results of Section 6 .

F.5.1 Training losses of the computer vision experiments

Table S5 provides the best training loss over all epochs for all the considered computer vision experiments. Comparing the train losses with the test performances in Table 1 shows that good test performance is not only caused by good optimization properties (i.e., a low train loss) but also by other mechanisms, such as implicit regularization. The distinction is most pronounced in the results for MNIST. These results highlight the need to disentangle optimization from implicit regularization mechanisms to study the learning properties of DFC, which we do in the MNIST-train experiments provided in Table 1.

F.5.2 Alignment plots for the toy experiment

Here, we show the alignment of the methods used in the toy experiments of Fig. 3 with the MN updates and compare it with their alignment with the BP updates. We plot the alignment angles per layer to investigate whether the alignment differs between layers. Fig. S6 shows the alignment of all methods with the damped MN updates and Fig. S7 with the BP updates. We see clearly that the alignment with the MN updates is much better for the DFC variants with trained feedback weights than the alignment with the BP updates, hence indicating that DFC uses a fundamentally different approach to learning compared to BP, and thereby confirming the theory.

F.5.3 Alignment plots for computer vision experiments

Figures S8 and S9 show the alignment of all methods with the MN and BP updates, respectively. In contrast to the toy experiments in the previous section, the alignment with BP is now much closer to the alignment with the MN updates. There are two main reasons for this. First, the classification networks we used have big hidden layers and a small output layer. In this case, the network Jacobian $J$ has many rows and only very few columns, which causes $J^{\dagger}$ to approximately align with $J^T$ (see, among others, Theorem S12 in Meulemans et al. [21]). Hence, the BP updates will also approximately align with the MN updates, explaining the better alignment with BP updates on MNIST compared to the toy experiments. Second, due to the nonlinearity of the network, $J$ changes for each datasample and $Q$ cannot satisfy Condition 2 exactly for all datasamples. We try to model this effect by introducing a higher damping constant, $\gamma=1$, for computing the ideal damped MN updates (see Section F.1). However, this higher damping constant is not a perfect model for the phenomena occurring. Consequently, the alignment of DFC with the damped MN updates is suboptimal, and a better alignment could be obtained by introducing other variants of MN updates that more accurately describe the behavior of DFC on nonlinear networks. (Here, we perform a small grid search to find a $\gamma\in\{0, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$ that best aligns with the DFC and DFA updates after 3 epochs of training. As this is a very coarse-grained approach, better alignment angles with damped MN updates could be obtained by a more fine-tuned approach for finding an optimal $\gamma$. Note that, nonetheless, the alignment with MN updates is better than the alignment with BP updates.)

Surprisingly, for Fashion-MNIST and MNIST-autoencoder, the DFC updates in the last and penultimate layers align better with BP than with MN updates (see Figures S11-S12). One notable difference between the configuration used for MNIST on the one hand and those for Fashion-MNIST and MNIST-autoencoder on the other hand is that, for the latter two, the hyperparameter search selected to fix the output feedback weights $Q_L$ to the identity matrix (see Section F.2 for a description and discussion). This freezing of the output feedback weights slightly improved the performance of the DFC methods. Freezing $Q_L$ to the identity matrix explains why the output weight updates align closely with BP, as the postsynaptic plasticity signal is now an integrated plus proportional version of the output error. However, it is surprising that the alignment in the penultimate layer is also changed significantly. We hypothesize that this is because the feedback learning rule (13) was designed for learning all feedback weights (leading to Theorem 6), and freezing $Q_L$ breaks this assumption. However, extra investigation is needed to fully understand the occurring phenomena.


F.5.4 Autoencoder images

Fig. S14 shows the autoencoder output for randomly selected samples of BP, DFC-SSA, DFC-SSA (fixed), and DFA, compared with the autoencoder input. As DFC, DFC-SS, and DFC-SSA have very similar test losses and hence autoencoder performance, we only show the plots for DFC-SSA and DFC-SSA (fixed). Fig. S14 shows that BP and the DFC variants with trained weights have almost perfect autoencoding performance when visually inspected, while DFA and the DFC (fixed) variants do not succeed in autoencoding their inputs, which is also reflected in the performance results (see Table 1).


F.6 Resources and compute

For the computer vision experiments, we used GeForce RTX 2080 and GeForce RTX 3090 GPUs. Table S6 provides runtime estimates for 1 epoch of feedforward training and 3 epochs of feedback training (if applicable) for the DFC methods, using a GeForce RTX 2080 GPU. For MNIST and Fashion-MNIST we do 100 training epochs and for MNIST-autoencoder 25 training epochs. We did hyperparameter searches of 200 samples on all datasets for DFC-SSA and DFC-SSA (fixed) and reused the hyperparameter configuration for the other DFC variants. For BP and DFA we also performed hyperparameter searches of 200 samples for all experiments, with computational costs negligible compared to DFC.

F.7 Dataset and Code licenses

For the computer vision experiments, we used the MNIST dataset [ 40 ] and the Fashion-MNIST dataset [ 41 ] , which have the following licenses:

MNIST: https://creativecommons.org/licenses/by-sa/3.0/

Fashion-MNIST: https://opensource.org/licenses/MIT

For the implementation of the methods, we used PyTorch [ 71 ] and built upon the codebase of Meulemans et al. [ 21 ] , which have the following licenses:

Pytorch: https://github.com/pytorch/pytorch/blob/master/LICENSE

Meulemans et al. [ 21 ] : https://www.apache.org/licenses/LICENSE-2.0

Appendix G DFC and multi-compartment models of cortical pyramidal neurons

As mentioned in the Discussion, the multi-compartment neuron of DFC (see Fig. 1 C) is closely related to recent dendritic compartment models of the cortical pyramidal neuron [ 23 , 25 , 26 , 47 ] . In the terminology of these models, our central, feedforward, and feedback compartments, correspond to the somatic, basal dendritic, and apical dendritic compartments of pyramidal neurons. Here, we relate our network dynamics ( 1 ) in more detail to the proposed pyramidal neuron dynamics of Sacramento et al. [ 23 ] . Rephrasing their dynamics for the somatic membrane potentials of pyramidal neurons (equation (1) of Sacramento et al. [ 23 ] ) with our own notation, we get

Like DFC, the network is structured in multiple layers, $0\leq i\leq L$, where each layer has its own dynamical equation as defined above. Basal and apical dendritic compartments ($\mathbf{v}^{\mathrm{ff}}_i$ and $\mathbf{v}^{\mathrm{fb}}_i$, respectively) of pyramidal cells are coupled to the somatic compartment ($\mathbf{v}_i$) with fixed conductances $g_{\mathrm{B}}$ and $g_{\mathrm{A}}$, and leakage $g_{\mathrm{lk}}$. Background activity of all compartments is modeled by an independent white noise input $\boldsymbol{\xi}_i\sim\mathcal{N}(0, I)$. The dendritic compartment potentials are given in their instantaneous forms (cf. equations (3) and (4) in Sacramento et al. [23])

with $W_i$ the synaptic weights of the basal dendrites, $Q_i$ the synaptic weights of the apical dendrites, $\phi$ a nonlinear activation function transforming voltage levels to firing rates, and $\mathbf{u}$ a feedback input.

Filling the instantaneous forms of $\mathbf{v}^{\mathrm{ff}}$ and $\mathbf{v}^{\mathrm{fb}}$ into the dynamics of the somatic compartment (224), and reworking the equation, we get:

with $\tilde{\tau}_v = \frac{\tau_v}{g_{\mathrm{lk}}+g_{\mathrm{B}}+g_{\mathrm{A}}}$. When we absorb $\tilde{g}_{\mathrm{B}}$ and $\tilde{g}_{\mathrm{A}}$ into $W_i$ and $Q_i$, respectively, we recover the DFC network dynamics (1) with noise added. Hence, not only is the multi-compartment neuron model of DFC closely related to dendritic compartment models of pyramidal neurons, but the neuron dynamics used in DFC are also intimately connected to models of cortical pyramidal neurons. What sets DFC apart from the cortical model of Sacramento et al. [23] is its unique feedback dynamics, which make use of a feedback controller and lead to approximate GN optimization.

Appendix H Feedback pathway designs compatible with DFC

To present DFC in its simplest form, we used direct linear feedback mappings from the output controller to all hidden layers. However, DFC is also compatible with more general feedback pathways.

Consider $\mathbf{v}^{\mathrm{fb}}_i = g_i(\mathbf{u})$, with $g_i$ a smooth mapping from the control signal $\mathbf{u}$ to the feedback compartment of layer $i$, leading to the following network dynamics:

The feedback path $g_i$ could, for example, be a multilayer neural network (see Fig. S15 A), and different $g_i$ could share layers (see Fig. S15 B). As the output stepsize $\lambda$ is taken small in DFC, the control signal $\mathbf{u}$ will also remain small. Hence, we can take a first-order Taylor approximation of $g_i$ around $\mathbf{u}=0$:
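Under the natural assumption that the feedback path is silent at zero control input, $g_i(\mathbf{0}) = \mathbf{0}$, this generic first-order expansion reads

$\mathbf{v}^{\mathrm{fb}}_i = g_i(\mathbf{u}) \approx g_i(\mathbf{0}) + J_{g_i}\,\mathbf{u} = J_{g_i}\,\mathbf{u}, \qquad J_{g_i} = \frac{\partial g_i(\mathbf{u})}{\partial\mathbf{u}}\Big\rvert_{\mathbf{u}=\mathbf{0}},$

so that, to first order, any smooth feedback path acts like the direct linear feedback mappings used in the main text.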


Until now, we considered general feedback paths $g_i$ and linearized them around $\mathbf{u}=0$, thereby reducing their expressive power to linear mappings. As the forward Jacobian $J$ changes for each datasample in nonlinear networks, it can be helpful to have a feedback path for which $J_{g_i}$ also changes for each datasample. Then, each $J_{g_i}$ can specialize its mapping for a particular cluster of datasamples, thereby enabling better compliance with Conditions 2 and 3 for each datasample. To let $J_{g_i}$ change depending on the considered datasample, and hence on the activations $\mathbf{v}_i$ of the network, the feedback path $g_i$ needs to be 'influenced' by the network activations $\mathbf{v}_i$.

One interesting direction for future work is to have connections from the network layers $\mathbf{v}_i$ onto the layers of the feedback path $g_i$ that can modulate the nonlinear activation function $\phi_g$ of those layers. By modulating $\phi_g$, the feedback Jacobian $J_{g_i}$ will depend on the network activations $\mathbf{v}_i$ and hence will change for each datasample. Interestingly, there are many candidate mechanisms to implement such modulation in biological cortical neurons [72, 73, 74].

Another possible direction is to integrate the feedback path $g_i$ into the forward network (1) and separate forward signals from feedback signals by using neural multiplexed codes [26, 75]. As the feedback path $g_i$ is then integrated into the forward pathway, its Jacobian $J_{g_i}$ can be made dependent on the forward activations $\mathbf{v}_i$. While this is a promising direction, merging the forward pathway with the feedback path is not trivial, and significant future work would be needed to accomplish it.

Appendix I Discussion on the biological plausibility of the controller

The feedback controller used by DFC (see Fig. 1 A and eq. (4)) has three main components. First, it needs a way of computing the control error $\mathbf{e}(t)$. Second, it needs to perform a leaky integration ($\mathbf{u}^{\mathrm{int}}$) of the control error. Third, the controller needs to multiply the control error by $k_p$.

Following the majority of biologically plausible learning methods [9, 14, 15, 16, 20, 21, 22, 26, 42], we assume to have access to an output error that the feedback controller can use. As the error is a simple difference between the network output and an output target $\mathbf{r}_L^*$, it should be relatively easy to compute. Another interesting aspect of computing the output error is the question of where the output target $\mathbf{r}_L^*$ could originate from in the brain. This is currently an open question in the field [76], which we do not aim to address in this work.

Integrating neural signals over long time horizons is a well-studied subject with many application areas, ranging from oculomotor control to maintaining information in working memory [58, 59, 60, 61, 62]. To provide intuition, a straightforward approach to leaky integration is to use recurrent self-connections with strength $(1-\alpha)$. Then, the same neural dynamics used in (1) give rise to

When we take the input weights $W_{\mathrm{in}}$ equal to the identity matrix, we recover the dynamics for $\mathbf{u}^{\mathrm{int}}(t)$ described in (4).
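A small sketch of this equivalence (with illustrative constants; the error signal is held constant for simplicity) compares the recurrent-self-connection implementation with the leaky-integrator form of eq. (4):

    import numpy as np

    rng = np.random.default_rng(0)
    n_out, alpha, tau_u, dt = 2, 0.1, 1.0, 0.02
    W_self = (1.0 - alpha) * np.eye(n_out)   # recurrent self-connections of strength (1 - alpha)
    W_in = np.eye(n_out)                     # identity input weights

    e = rng.normal(size=n_out)               # constant control error, for illustration
    u_rec = np.zeros(n_out)                  # integrator built from recurrent self-connections
    u_leaky = np.zeros(n_out)                # leaky-integrator form of eq. (4)

    for _ in range(2000):
        # neural dynamics with a recurrent self-connection: tau * du/dt = -u + W_self u + W_in e
        u_rec += dt / tau_u * (-u_rec + W_self @ u_rec + W_in @ e)
        # leaky integration: tau * du/dt = e - alpha * u
        u_leaky += dt / tau_u * (e - alpha * u_leaky)

    print(np.allclose(u_rec, u_leaky))       # True: both converge towards e / alpha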

Finally, a multiplication of the control error by $k_p$ can simply be achieved by synaptic weights of strength $k_p$.


Published: 13 May 2021

Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits

Alexandre Payeur, Jordan Guerguiev, Friedemann Zenke, Blake A. Richards & Richard Naud

Nature Neuroscience volume 24, pages 1010–1019 (2021)


  • Learning algorithms
  • Sensory processing
  • Spike-timing-dependent plasticity

An Author Correction to this article was published on 02 November 2021

This article has been updated

Synaptic plasticity is believed to be a key physiological mechanism for learning. It is well established that it depends on pre- and postsynaptic activity. However, models that rely solely on pre- and postsynaptic activity for synaptic changes have, so far, not been able to account for learning complex tasks that demand credit assignment in hierarchical networks. Here we show that if synaptic plasticity is regulated by high-frequency bursts of spikes, then pyramidal neurons higher in a hierarchical circuit can coordinate the plasticity of lower-level connections. Using simulations and mathematical analyses, we demonstrate that, when paired with short-term synaptic dynamics, regenerative activity in the apical dendrites and synaptic plasticity in feedback pathways, a burst-dependent learning rule can solve challenging tasks that require deep network architectures. Our results demonstrate that well-known properties of dendrites, synapses and synaptic plasticity are sufficient to enable sophisticated learning in hierarchical circuits.



Data availability

The MNIST, CIFAR-10 (ref. 76) and ImageNet (ref. 77) datasets are publicly available from http://yann.lecun.com/exdb/mnist/ , https://www.cs.toronto.edu/~kriz/cifar.html and http://www.image-net.org , respectively.

Code availability

The code used in this article is available at https://github.com/apayeur/spikingburstprop and https://github.com/jordan-g/Burstprop .

Change history

02 November 2021

A Correction to this paper has been published: https://doi.org/10.1038/s41593-021-00970-x

Hebb, D. O. The Organization of Behavior (Wiley, New York, 1949).


Artola, A., Bröcher, S. & Singer, W. Different voltage dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature 347 , 69–72 (1990).


Markram, H., Lübke, J., Frotscher, M. & Sakmann, B. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science 275 , 213–215 (1997).

Paulsen, O. & Sejnowski, T. J. Natural patterns of activity and long-term synaptic plasticity. Curr. Opin. Neurobiol. 10 , 172–180 (2000).


Sjöström, P. J., Turrigiano, G. G. & Nelson, S. B. Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron 32 , 1149–1164 (2001).


Letzkus, J. J., Kampa, B. M. & Stuart, G. J. Learning rules for spike timing-dependent plasticity depend on dendritic synapse location. J. Neurosci. 26 , 10420–10429 (2006).

Kampa, B., Letzkus, J. & Stuart, G. Requirement of dendritic calcium spikes for induction of spike-timing-dependent synaptic plasticity. J. Physiol. 574 , 283–290 (2006).

Sjöström, P. J. & Häusser, M. A cooperative switch determines the sign of synaptic plasticity in distal dendrites of neocortical pyramidal neurons. Neuron 51 , 227–238 (2006).

Gambino, F. et al. Sensory-evoked LTP driven by dendritic plateau potentials in vivo. Nature 515 , 116–119 (2014).

Geun Hee, S. et al. Neuromodulators control the polarity of spike-timing-dependent synaptic plasticity. Neuron 55 , 919–929 (2007).


Gerstner, W., Lehmann, M., Liakoni, V., Corneil, D. & Brea, J. Eligibility traces and plasticity on behavioral time scales: experimental support of neoHebbian three-factor learning rules. Front. Neural Circuits 12 , 53 (2018).


Roelfsema, P. R. & Holtmaat, A. Control of synaptic plasticity in deep cortical networks. Nat. Rev. Neurosci. 19 , 166–180 (2018).

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 , 229–256 (1992).

Werfel, J., Xie, X. & Seung, H. S. Learning curves for stochastic gradient descent in linear feedforward networks. Neural Comput. 17 , 2699–2718 (2005).

Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J. & Hinton, G. Backpropagation and the brain. Nat. Rev. Neurosci. 21 , 335–346 (2020).

Richards, B. A. et al. A deep learning framework for systems neuroscience. Nat. Neurosci. 22 , 1761–1770 (2019).

Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323 , 533–536 (1986).

Larkum, M. E., Zhu, J. & Sakmann, B. A new cellular mechanism for coupling inputs arriving at different cortical layers. Nature 398 , 338–341 (1999).

Markram, H., Wang, Y. & Tsodyks, M. Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Natl. Acad. Sci. USA 95 , 5323–5328 (1998).

Nevian, T. & Sakmann, B. Spine Ca 2+ signaling in spike-timing-dependent plasticity. J. Neurosci. 26 , 11001–11013 (2006).

Froemke, R. C., Tsay, I. A., Raad, M., Long, J. D. & Dan, Y. Contribution of individual spikes in burst-induced long-term synaptic modification. J. Neurophys. 95 , 1620–1629 (2006).

Bell, C. C., Caputi, A., Grant, K. & Serrier, J. Storage of a sensory pattern by anti-Hebbian synaptic plasticity in an electric fish. Proc. Natl Acad. Sci. USA 90 , 4650–4654 (1993).

Bol, K., Marsat, G., Harvey-Girard, E., Longtin, André & Maler, L. Frequency-tuned cerebellar channels and burst-induced LTD lead to the cancellation of redundant sensory inputs. J. Neurosci. 31 , 11028–11038 (2011).

Richards, B. A. & Lillicrap, T. P. Dendritic solutions to the credit assignment problem. Curr. Opin. Neurobiol. 54 , 28–36 (2019).

Brandalise, F. & Gerber, U. Mossy fiber-evoked subthreshold responses induce timing-dependent plasticity at hippocampal Ca3 recurrent synapses. Proc. Natl Acad. Sci. USA 111 , 4303–4308 (2014).

Kayser, C., Montemurro, M. A., Logothetis, N. K. & Panzeri, S. Spike-phase coding boosts and stabilizes information carried by spatial and temporal spike patterns. Neuron 61 , 597–608 (2009).

Herzfeld, D. J., Kojima, Y., Soetedjo, R. & Shadmehr, R. Encoding of action by the purkinje cells of the cerebellum. Nature 526 , 439–442 (2015).

Naud, R. & Sprekeler, H. Sparse bursts optimize information transmission in a multiplexed neural code. Proc. Nat. Acad. Sci. USA 115 , 6329–6338 (2018).

Burbank, K. S. Mirrored STDP implements autoencoder learning in a network of spiking neurons. PLoS Comp. Biol. 11 , e1004566 (2015).

Akrout, M., Wilson, C., Humphreys, P. C., Lillicrap, T. & Tweed, D. Using weight mirrors to improve feedback alignment. Preprint at arXiv https://arxiv.org/abs/1904.05391 (2019).


Acknowledgements

We thank A. Santoro and L. Maler for comments on this manuscript. We also thank M. Hilscher and M.J. Nigro for sharing data about SOM+ neurons. In addition, we thank T. Mesnard for helping with the development of the rate-based model. This work was supported by two NSERC Discovery grants (to R.N., no. 06872 and to B.A.R., no. 04947), a CIHR Project grant (no. RN383647-418955), a Fellowship from the CIFAR Learning in Machines and Brains Program (to B.A.R.), an Ontario Early Researcher Award (to B.A.R., no. ER 17-13-242), a Healthy Brains, Healthy Lives New Investigator Start-up (to B.A.R., no. 2b-NISU-8), and the Novartis Research Foundation (to F.Z.).

Author information

Alexandre Payeur

Present address: University of Montréal and Mila, Montréal, QC, Canada

These authors contributed equally: Alexandre Payeur, Jordan Guerguiev.

These authors jointly supervised this work: Blake A. Richards, Richard Naud.

Authors and Affiliations

Department of Cellular and Molecular Medicine, University of Ottawa, Ottawa, ON, Canada

Alexandre Payeur & Richard Naud

Ottawa Brain and Mind Institute, University of Ottawa, Ottawa, ON, Canada

Centre for Neural Dynamics, University of Ottawa, Ottawa, ON, Canada

Department of Biological Sciences, University of Toronto Scarborough, Toronto, ON, Canada

Jordan Guerguiev

Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada

Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland

Friedemann Zenke

Mila, Montréal, QC, Canada

Blake A. Richards

Department of Neurology and Neurosurgery, McGill University, Montréal, QC, Canada

School of Computer Science, McGill University, Montréal, QC, Canada

Learning in Machines and Brains Program, Canadian Institute for Advanced Research, Toronto, ON, Canada

Department of Physics, University of Ottawa, Ottawa, ON, Canada

Richard Naud


Contributions

All authors contributed to the burst-dependent learning rule. A.P., F.Z. and R.N. designed the spiking simulations. A.P. performed the spiking simulations. J.G. designed the recurrent plasticity rule and performed the numerical experiments on CIFAR-10 and ImageNet. B.A.R. and R.N. wrote the manuscript, with contributions from J.G. and A.P. B.A.R. and R.N. cosupervised the project.

Corresponding authors

Correspondence to Blake A. Richards or Richard Naud .

Ethics declarations

Competing interests.

R.N., B.A.R. and A.P. have a provisional patent application for a neuromorphic implementation of the algorithm described in this article. The other authors declare no competing interests.

Additional information

Peer review information   Nature Neuroscience  thanks Gabriel Kreiman, Panayiota Poirazi and Nelson Spruston for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Effects of population size, randomized examples and absence of hidden-layer plasticity on the XOR task.

a, Comparison of costs for the XOR task. In blue is the cost for the network in Fig. 4 of the main text, but with 2,000 neurons per population and slightly different parameter values. The dot-dashed pink line is for when the examples are randomly selected within an epoch. The dotted red line has no plasticity in the hidden layer. The dashed green line is for 400 neurons per population. b–e, Output event rate (ER) after learning. The dashed grey line separates ‘true (1)’ and ‘false (0)’ for the XOR. Only in c is the XOR not solved.

Extended Data Fig. 2 Impact of different time scales on the XOR task.

a, Comparison of costs when the duration of examples T (in s; dashed green line) and the moving-average time constant τ_avg (in s; dotted orange line) are changed with respect to the values used in Fig. 4 (solid blue). b, Output event rate (ER) after learning for the three cases in panel a. The dashed grey line separates ‘true (1)’ and ‘false (0)’ for the XOR.

Extended Data Fig. 3 Learning XOR with symmetric feedback pathways.

a, Schematic diagram illustrating the symmetric feedback pathway. b, Output-layer activity for the XOR task. Note that the XOR task is still solved. Only a single realization is displayed here. c(i–ii), The symmetric feedback yields very similar representations at the hidden layer.

Extended Data Fig. 4 Dynamics of the time-dependent rate model while learning MNIST.

a , Schematic of the network. The enlarged hidden layer population stresses the fact that the burst rate is equal to the event rate times the burst probability, with the event and burst probability nonlinearly integrating the feedforward and feedback signals, respectively. b , Example event rates (i, iii, v) and weights (ii, iv) for two consecutive examples during the first epoch. In (i), the teacher is illustrated as a dashed line. Learning intervals are indicated by light green vertical bars. c , Burst probabilities (i, iii) and differences of burst probabilities (ii, iv) for the same examples as in b.
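The multiplexing described in this caption, where the event rate carries the feedforward signal, the burst probability carries the feedback signal and the burst rate equals their product, can be caricatured in a few lines of NumPy. The sketch below is my own simplification rather than the model used in the paper: the layer sizes, the baseline burst probability of 0.5 and the exact plasticity expression are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

n_in, n_hid, n_out = 4, 8, 2                  # toy sizes (not the paper's)
W1 = rng.normal(0.0, 0.5, (n_hid, n_in))      # feedforward: input -> hidden
W2 = rng.normal(0.0, 0.5, (n_out, n_hid))     # feedforward: hidden -> output
Y = rng.normal(0.0, 0.5, (n_hid, n_out))      # feedback: output bursts -> hidden dendrites

x = rng.normal(size=n_in)
target = np.array([1.0, 0.0])
p0 = 0.5                                      # assumed baseline burst probability

# Feedforward sweep: event rates carry the bottom-up signal.
e_hid = sigmoid(W1 @ x)
e_out = sigmoid(W2 @ e_hid)

# Feedback sweep: burst probabilities carry the top-down (error) signal.
p_out = np.clip(p0 + 0.5 * (target - e_out), 0.0, 1.0)
b_out = e_out * p_out                         # burst rate = event rate * burst probability
p_hid = sigmoid(Y @ (b_out - p0 * e_out))     # hidden dendrites read deviations from baseline

# Burst-dependent plasticity: potentiate or depress according to how far the
# postsynaptic burst probability deviates from its baseline.
eta = 0.1
W2 += eta * np.outer((p_out - p0) * e_out, e_hid)
W1 += eta * np.outer((p_hid - p0) * e_hid, x)
```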

Extended Data Fig. 5 Network mechanisms regulating the bursting nonlinearity.

All panels display the burst probability of a large population of two-compartment pyramidal neurons as a function of the intensity of the injected dendritic current. The insets illustrate the microcircuit, including the PV-like neurons (disks) and the SOM-like neurons (inverted triangles); the parameter being modified is indicated by a colored circuit element. Increasing color intensities correspond to increasing values of the parameter. a, Increasing the strength of inhibitory synapses from SOM neurons onto the pyramidal neurons’ dendrites produces divisive control of the burst probability. b, Disinhibiting the pyramidal neurons’ dendrites by applying a hyperpolarizing current to the SOM neurons, mimicking inhibition from VIP neurons, increases the slope. c, Increasing the probability of release onto SOM neurons produces a small divisive gain modulation. d, Increasing the dendritic excitability by increasing the strength of the regenerative dendritic activity produces an additive gain control.

Extended Data Fig. 6 The bursting nonlinearity controls the learning rate.

a, Schematic of the network. Each hidden layer had 500 units. The recurrent weights (Z(1) and Z(2)) and the feedback alignment weights (Y(1) and Y(2)) are explicitly represented. b, Angle between the weight updates ΔW(1) computed with the standard backpropagation algorithm and with burstprop for the MNIST digit recognition task. The angle is displayed for different values of the slope of the dendritic nonlinearity (β). Results are displayed as the mean ± standard deviation over 10 realizations with randomly initialized weights.
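The angle reported in panel b reduces to a cosine-similarity computation between two flattened weight-update matrices. A generic helper for such a comparison (my code, not the authors') might look like this:

```python
import numpy as np

def update_angle_deg(dW_a, dW_b):
    """Angle in degrees between two weight updates, flattened to vectors."""
    a, b = np.ravel(dW_a), np.ravel(dW_b)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: an update that is a noisy, scaled copy of a reference update
# stays well below 90 degrees (i.e., it points in a useful direction).
rng = np.random.default_rng(1)
dW_ref = rng.standard_normal((500, 784))
dW_other = 0.3 * dW_ref + 0.1 * rng.standard_normal((500, 784))
print(update_angle_deg(dW_ref, dW_other))
```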

Extended Data Fig. 7 Linearity of feedback signals degrades with depth in deep convolutional network trained on ImageNet.

Each plot shows the change in burst probability of a unit in hidden layer l, Δp_l, as the burst probability at the output layer, p_8, is changed by Δp_8 (n = 1,000), along with the Pearson correlation coefficient and two-tailed p-value (blue, top), as well as a random sample of 2,000 burst probabilities after presentation of an input image (red, bottom).

Extended Data Fig. 8 Learning MNIST with the simplified rate model.

A convolutional network whose architecture is described in Supplementary Table 3 was trained using backprop, feedback alignment, and burstprop. As in Fig. 6a,c , recurrent input was introduced at hidden layers to keep burst probabilities linear with respect to feedback signals.

Extended Data Fig. 9 The variance of the burst probability decreases during learning.

a , Variance of the burst probability as a function of the epoch for the MNIST task, for each layer in a network with 3 hidden layers with 500 units each. b , Variance of the burst probability as a function of the test error, showing that the magnitude of the variance is correlated with the test error.

Supplementary information

Supplementary information.

Supplementary Text, Tables 1–4 and Supplementary Figs. 1 and 2.

Reporting Summary


About this article

Cite this article.

Payeur, A., Guerguiev, J., Zenke, F. et al. Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nat Neurosci 24 , 1010–1019 (2021). https://doi.org/10.1038/s41593-021-00857-x

Received : 30 March 2020

Accepted : 15 April 2021

Published : 13 May 2021

Issue Date : July 2021

DOI : https://doi.org/10.1038/s41593-021-00857-x


This article is cited by

Co-dependent excitatory and inhibitory plasticity accounts for quick, stable and long-lasting memories in biological networks.

  • Everton J. Agnes
  • Tim P. Vogels

Nature Neuroscience (2024)

The combination of Hebbian and predictive plasticity learns invariant object representations in deep sensory networks

  • Manu Srinath Halvagal

Nature Neuroscience (2023)

Introducing the Dendrify framework for incorporating dendrites to spiking neural networks

  • Michalis Pagkalos
  • Spyridon Chavlis
  • Panayiota Poirazi

Nature Communications (2023)

Learning on tree architectures outperforms a convolutional feedforward network

  • Itamar Ben-Noam

Scientific Reports (2023)

The plasticitome of cortical interneurons

  • Amanda R. McFarlan
  • Christina Y. C. Chou
  • P. Jesper Sjöström

Nature Reviews Neuroscience (2023)

Credit Assignment in Neural Networks through Deep Feedback Control

Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)

Alexander Meulemans, Matilde Tristany Farinha, Javier Garcia Ordonez, Pau Vilimelis Aceituno, João Sacramento, Benjamin F. Grewe

The success of deep learning sparked interest in whether the brain learns by using similar techniques for assigning credit to each synaptic weight for its contribution to the network output. However, the majority of current attempts at biologically-plausible learning methods are either non-local in time, require highly specific connectivity motifs, or have no clear link to any known mathematical optimization method. Here, we introduce Deep Feedback Control (DFC), a new learning method that uses a feedback controller to drive a deep neural network to match a desired output target and whose control signal can be used for credit assignment. The resulting learning rule is fully local in space and time and approximates Gauss-Newton optimization for a wide range of feedback connectivity patterns. To further underline its biological plausibility, we relate DFC to a multi-compartment model of cortical pyramidal neurons with a local voltage-dependent synaptic plasticity rule, consistent with recent theories of dendritic processing. By combining dynamical system theory with mathematical optimization theory, we provide a strong theoretical foundation for DFC that we corroborate with detailed results on toy experiments and standard computer-vision benchmarks.
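To make the controller idea concrete, here is a rough, self-contained sketch of the kind of loop the abstract describes: a leaky integral controller nudges the network's activity toward the target, and each layer then updates its weights locally from the difference between its controlled activity and its purely feedforward prediction. The layer sizes, controller gains and feedback matrices below are arbitrary choices of mine; the paper's actual dynamics, conditions and analysis are considerably richer.

```python
import numpy as np

def phi(x):
    return np.tanh(x)

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2
W1 = rng.normal(0.0, 0.5, (n_hid, n_in))   # feedforward weights
W2 = rng.normal(0.0, 0.5, (n_out, n_hid))
Q1 = rng.normal(0.0, 0.5, (n_hid, n_out))  # feedback weights carrying the control signal
Q2 = np.eye(n_out)

x = rng.normal(size=n_in)
target = np.array([0.5, -0.5])

u = np.zeros(n_out)          # controller state
r1 = phi(W1 @ x)             # initial feedforward sweep
r2 = phi(W2 @ r1)

dt, k_i, leak = 0.1, 1.0, 0.95
for _ in range(200):                          # let the network and controller settle
    v1 = W1 @ x + Q1 @ u                      # controlled potentials
    v2 = W2 @ r1 + Q2 @ u
    r1, r2 = phi(v1), phi(v2)
    u = leak * u + dt * k_i * (target - r2)   # leaky integral control of the output error

# Local plasticity: move each layer's feedforward prediction toward the
# activity it actually produced under control.
eta = 0.05
W1 += eta * np.outer(r1 - phi(W1 @ x), x)
W2 += eta * np.outer(r2 - phi(W2 @ r1), r1)
```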


The neuronal credit assignment problem as causal inference

Two complementary tasks to understand intelligence.

Learning is central to both human and artificial intelligence

$\Rightarrow$ Advances in each domain can inspire the other

Machine learning, neuroscience, and causality

Messerli, N Engl J Med 2012

  • Causal models are more robust to changes in environment/distribution: better transfer, generalization
  • Fairness: strong associations are not causal, and may be unfair/biased/prejudiced
  • Safety: observational data may not say what happens when we act/intervene/change distributions

In neuroscience:

  • Efficient learning, transfer, generalization
  • Causal learning

Learning in the brain

A useful account of learning in the brain should:

  • Be consistent with known neurophysiology
  • Be good enough at learning complicated tasks

The neuronal credit assignment problem

To learn, a neuron must know its effect on the reward function.

In spiking neural networks, this means estimating how a change in the neuron's spiking changes the expected reward.

The problem: noise correlations and confounding

$\Rightarrow$ Viewing learning as a causal inference problem may provide insight

Credit assignment as causal inference

What is a neuron's causal effect on reward, and so how should it change to improve performance? $$ \beta_i = \mathbb{E}(R| H_i \leftarrow 1) - \mathbb{E}(R| H_i \leftarrow 0) $$

$\Rightarrow$ How can a neuron perform causal inference?

One solution: Randomization

If independent (unconfounded) noise is added to a neuron's activity, that noise can be correlated with the reward to estimate the neuron's reward gradient.

In fact, the REINFORCE algorithm correlates reward with independent perturbations in activity, $\xi^i$: $$ \mathbb{E}( R\,\xi^i ) \approx \sigma^2 \frac{\partial R}{\partial h^i} $$

  • Only well characterized in specific circuits e.g. birdsong learning (Fiete and Seung 2007)
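As a sanity check of the REINFORCE-style estimator above, the following toy simulation (my own construction, with an arbitrary quadratic reward) perturbs three "neurons" with independent Gaussian noise and recovers the reward gradient by correlating the reward with the injected perturbation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1
h0 = np.array([0.3, -0.2, 0.8])        # baseline activity of three "neurons"
goal = np.array([1.0, 0.0, 0.5])

def reward(h):
    return -np.sum((h - goal) ** 2, axis=-1)   # toy quadratic reward

true_grad = -2.0 * (h0 - goal)                 # analytic gradient at h0

n_trials = 200_000
xi = sigma * rng.standard_normal((n_trials, 3))   # independent perturbations
R = reward(h0 + xi)

# Correlate reward with the injected noise; subtracting the mean reward
# (a baseline) reduces variance without introducing bias.
grad_est = ((R - R.mean())[:, None] * xi).mean(axis=0) / sigma**2

print(true_grad)   # [ 1.4  0.4 -0.6]
print(grad_est)    # approximately the same
```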

Causal learning without randomization

Adapted from Moscoe et al, J Clin Epid 2015

RDD (regression discontinuity design) for solving credit assignment

Lansdell and Kording, bioRxiv 2019

A small demonstration

The two-neuron network with noise correlations

Can use RDD to estimate the causal effect

Works in cases where a correlational estimator fails

Under some assumptions
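A minimal illustration of the idea (my own toy construction, using a difference-of-means version of RDD rather than the local linear fits used in the actual work): a shared confounder drives both the neuron and the reward, so the naive spike-triggered estimate of the neuron's effect is badly biased, whereas comparing reward on trials in which the drive barely exceeded versus barely missed the spiking threshold recovers the causal effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000
threshold = 1.0
beta_true = 0.5                     # true causal effect of a spike on reward

# Confounded setting: a shared signal drives both the neuron and the reward.
shared = rng.standard_normal(n_trials)
drive = shared + 0.5 * rng.standard_normal(n_trials)   # neuron's summed input
spiked = drive > threshold
reward = beta_true * spiked + 2.0 * shared + 0.3 * rng.standard_normal(n_trials)

# Naive correlational estimate: heavily biased by the shared signal.
naive = reward[spiked].mean() - reward[~spiked].mean()

# RDD estimate: compare reward only on marginal trials, just above vs just
# below threshold (the window width is an arbitrary choice here).
w = 0.05
above = (drive > threshold) & (drive < threshold + w)
below = (drive <= threshold) & (drive > threshold - w)
rdd = reward[above].mean() - reward[below].mean()

print(f"naive: {naive:.2f}")   # far from 0.5
print(f"rdd:   {rdd:.2f}")     # close to 0.5
```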

A larger example

Reward/cost function trains one neuron to have different firing rate from rest of population

Application to brain-computer interface learning

  • In single-unit BCIs, individual neurons are trained through biofeedback
  • Here, the causal effect of a neuron is known by construction

Lansdell et al IEEE Trans NSRE 2020

Is this plausible?

Ngezahayo et al 2000, Seol et al 2007

Calcium imaging in Hydra. Dupre and Yuste 2017

Part 1 summary

How to scale to large problems.

Richards et al Nature Neuroscience 2019

  • Backpropagation is the standard for challenging ML problems
  • How much feedback is needed?

Biologically implausible backpropagation

Learning without weight transport.

$\Rightarrow$ Can we improve on feedback alignment by learning weights $B_i$?
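For reference, feedback alignment replaces the transpose of the forward weights with a fixed random matrix in the backward pass. A minimal two-layer sketch (the layer sizes and variable names are mine) is:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 784, 256, 10
W1 = rng.normal(0.0, 0.05, (n_hid, n_in))
W2 = rng.normal(0.0, 0.05, (n_out, n_hid))
B2 = rng.normal(0.0, 0.05, (n_hid, n_out))   # fixed random feedback weights

def relu(x):
    return np.maximum(x, 0.0)

def train_step(x, y_onehot, eta=0.01):
    """One update; the backward pass uses B2 in place of W2.T (no weight transport)."""
    global W1, W2
    h = relu(W1 @ x)
    y = W2 @ h
    e = y - y_onehot                      # output error
    delta_h = (B2 @ e) * (h > 0.0)        # error routed through the fixed random feedback
    W2 -= eta * np.outer(e, h)
    W1 -= eta * np.outer(delta_h, x)

train_step(rng.standard_normal(n_in), np.eye(n_out)[3])
```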

Learning feedback weights with perturbations

If we only update $B$, then the feedback weights in the final layer converge to the corresponding forward weights $W$, in the following sense:

Theorem 1: The least squares estimator \begin{equation*} (\hat{B}^{N+1})^T = \hat{\lambda}^N (\mathbf{e}^{N+1})^T\left(\mathbf{e}^{N+1}(\mathbf{e}^{N+1})^T\right)^{-1}, \end{equation*} converges to the true feedback matrix, in the sense that: $$ \lim_{c_h\to 0}\text{plim}_{T\to\infty} \hat{B}^{N+1} = W^{N+1}, $$ where $\text{plim}$ indicates convergence in probability.

If we only update $B$, then the feedback weights in all layers converge to the forward weights $W$ for a linear network:

Theorem 2: For $\sigma(x) = x$, the least squares estimator \begin{equation*} (\hat{B}^{n})^T = \hat{\lambda}^{n-1} (\mathbf{\tilde{e}}^{n})^T\left(\mathbf{\tilde{e}}^{n}(\mathbf{\tilde{e}}^{n})^T\right)^{-1}, \qquad 1 \le n \le N+1, \end{equation*} converges to the true feedback matrix, in the sense that: $$ \lim_{c_h\to 0}\text{plim}_{T\to\infty} \hat{B}^{n} = W^{n}, \qquad 1 \le n \le N+1. $$
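Stripped of the notation, the estimator in both theorems is an ordinary least-squares regression of the perturbation-based gradient estimates $\hat{\lambda}$ onto the feedback error signals $\mathbf{e}$. A toy version is sketched below; the ridge term is my addition for numerical stability and is not part of the theorems.

```python
import numpy as np

def fit_feedback_weights(lam, e, ridge=1e-6):
    """Least-squares fit of B such that lam ≈ B @ e over a batch of samples.

    lam : perturbation-based gradient estimates at layer n, shape (d_n, T)
    e   : error signals at layer n+1, shape (d_{n+1}, T)
    """
    d = e.shape[0]
    return lam @ e.T @ np.linalg.inv(e @ e.T + ridge * np.eye(d))

# Toy check: if lam really is a linear function of e, the fit recovers the matrix.
rng = np.random.default_rng(0)
W_true = rng.standard_normal((20, 10))
e = rng.standard_normal((10, 5000))
lam = W_true @ e + 0.01 * rng.standard_normal((20, 5000))
B_hat = fit_feedback_weights(lam, e)
print(np.allclose(B_hat, W_true, atol=0.05))   # True, up to the added noise
```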

A small example

Lansdell, Prakash and Kording, ICLR 2020

A (slightly) larger example

$\Rightarrow$ Shows challenging computer vision problems can be solved without weight transport

  • Neuromorphic hardware – learning with spiking networks
  • Application specific integrated circuits (ASICs) – learning without weight transport

Acknowledgments

  • Kording lab
  • Ari Benjamin
  • David Rolnick
  • Roozbeh Farhoodi
  • Prashanth Prakash
  • Adrienne Fairhall (UW)
  • Fairhall lab
  • Alison Duffy
  • Chet Moritz (UW)
  • Ivana Milovanovic (UW)
  • Cooper Mellema (UT Austin)
  • Eberhard Fetz (UW)

RDD as a way for a neuron to solve credit assignment

How to test.

Credit assignment for trained neural networks based on Koopman operator theory

  • Published: 04 September 2023
  • Volume 18, article number 181324 (2024)

Cite this article

  • Zhen Liang 1,
  • Changyuan Zhao 2,
  • Wanwei Liu 3,
  • Bai Xue 2,
  • Wenjing Yang 1 &
  • Zhengbin Pang 3


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61872371, 61836005 and 62032024) and the CAS Pioneer Hundred Talents Program.

Author information

Authors and Affiliations

Institute for Quantum Information & State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, 410000, China

Zhen Liang & Wenjing Yang

Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China

Changyuan Zhao & Bai Xue

College of Computer Science and Technology, National University of Defense Technology, Changsha, 410000, China

Wanwei Liu & Zhengbin Pang

Corresponding author

Correspondence to Wanwei Liu .

Ethics declarations

Competing interests. The authors declare that they have no competing interests or financial conflicts to disclose.

Electronic Supplementary Material

Credit assignment for trained neural networks based on Koopman operator theory

About this article

Liang, Z., Zhao, C., Liu, W. et al. Credit assignment for trained neural networks based on Koopman operator theory. Front. Comput. Sci. 18 , 181324 (2024). https://doi.org/10.1007/s11704-023-2629-4

Received : 16 October 2022

Accepted : 09 May 2023

Published : 04 September 2023

DOI : https://doi.org/10.1007/s11704-023-2629-4



Meta predictive learning model of languages in neural circuits

Chan Li, Junbin Qiu, and Haiping Huang, Phys. Rev. E 109, 044309 – published 12 April 2024


Large language models based on self-attention mechanisms have achieved astonishing performance, not only on natural language itself but also on a variety of tasks of a different nature. However, when it comes to processing language, the human brain may not operate on the same principles, and a debate has arisen about the connection between brain computation and the artificial self-supervision used in large language models. One of the most influential hypotheses in brain computation is the predictive coding framework, which proposes to minimize prediction error through local learning. However, the role of predictive coding, and of the associated credit assignment, in language processing remains unknown. Here, we propose a mean-field learning model within the predictive coding framework, assuming that the synaptic weight of each connection follows a spike-and-slab distribution and that only the distribution, rather than specific weights, is trained. This meta predictive learning is successfully validated on classifying handwritten digits, where pixels are input to the network in sequence, and on both toy and real language corpora. Our model reveals that most connections become deterministic after learning, while the output connections retain a higher level of variability. The performance of the resulting network ensemble changes continuously with data load, improving further with more training data, in analogy with the emergent behavior of large language models. Our model therefore provides a starting point for investigating the connection among brain computation, next-token prediction, and general intelligence.
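To unpack the spike-and-slab parameterization: each weight is exactly zero with probability 1 − π (the spike) and drawn from a Gaussian slab with mean m and variance Ξ with probability π, and under a mean-field (central-limit) approximation a unit's preactivation is Gaussian with moments determined solely by (π, m, Ξ). The sketch below is my own illustration of that parameterization, not the authors' training code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical spike-and-slab parameters for one layer of weights.
n_in, n_out = 100, 10
pi = np.full((n_out, n_in), 0.5)              # probability that a connection is "on"
m = rng.normal(0.0, 0.1, (n_out, n_in))       # slab mean
Xi = np.full((n_out, n_in), 0.01)             # slab variance

def sample_weights():
    """Draw one concrete network from the weight distribution."""
    slab = m + np.sqrt(Xi) * rng.standard_normal(m.shape)
    on = rng.random(m.shape) < pi
    return np.where(on, slab, 0.0)

def meanfield_preactivation(x):
    """Gaussian moments of the preactivation implied by (pi, m, Xi) alone."""
    mean = (pi * m) @ x
    var = (pi * (Xi + m**2) - (pi * m) ** 2) @ (x ** 2)
    return mean, var

x = rng.standard_normal(n_in)
mean, var = meanfield_preactivation(x)
z = mean + np.sqrt(var) * rng.standard_normal(n_out)   # reparameterized sample
# Training in this scheme adjusts (pi, m, Xi), not the sampled weights themselves.
```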

  • Received 6 November 2023
  • Accepted 18 March 2024

DOI: https://doi.org/10.1103/PhysRevE.109.044309

©2024 American Physical Society

Authors & Affiliations

  • 1 PMI Laboratory, School of Physics, Sun Yat-sen University, Guangzhou 510275, People's Republic of China
  • 2 Department of Physics, University of California, San Diego, 9500 Gilman Drive, La Jolla, California 92093, USA
  • 3 Guangdong Provincial Key Laboratory of Magnetoelectric Physics and Devices, Sun Yat-sen University, Guangzhou 510275, People's Republic of China
  • * These authors contributed equally to this work.

The performance of meta predictive learning on the 28 × 28 MNIST classification task. (a) Test accuracy as a function of epoch. The network, with N = 100 recurrent neurons, N_in = 28 input units, and N_out = 10 output nodes, is trained on the full MNIST dataset of 60k training images (handwritten digits) and validated on another unseen 10k test digits. "Predictive coding" indicates learning directly in weight space rather than in distribution space. If the epoch is less than 40, the number of inference steps is set to n = 100, and to n = 200 otherwise. The inset shows how ln F changes over the first 60 training epochs (this log-energy becomes stable in the late training stage and is thus not shown). Five independent runs are used to estimate the fluctuation of the result. (b) The logarithmic average value of [Ξ^ℓ, π^ℓ, m^ℓ] versus epoch in all layers; log denotes the natural logarithm. Only the first 20 epochs are shown (the result remains stable later in training), and the fluctuation is computed from five independent runs.

The properties of meta predictive learning on the simplified language prediction task. The grammatical rule is designed as follows: starting from a random letter ('a' here), only the candidates located two or four letters after 'a' can follow the starting letter, with equal probability, and each letter repeats only once in this next-word generation. All letters of the alphabet form a cyclic structure. T = 11 is considered, and the full size of the dataset is 26,624. An RNN with N = 100, N_in = 26, N_out = 26 is trained, and two instances of networks are randomly sampled from the (trained or untrained) network ensemble. (a) Starting from the letter 'a', the network generates the next letter, which serves as the input at the next time step, until a sequence of the desired length is generated. (b) The correct-letter ratio as a function of the data load α = M/N, with five independent runs; M example sequences are used for training. The chance level of 1/13 is marked. The inset shows the correct-letter ratio in the range α ∈ [0.02, 0.1]. (c) The log-energy ln F changes with training epochs and decreases to near zero. The inset shows how the correct-letter ratio changes with the length of the generated sequence after the full dataset is used for training. The error bar is computed with five independent networks.

Softmax values of the output units for different data loads α. Panels (a,b), (c,d), (e,f), and (g,h) show two typical patterns for each of the data loads α = 0, α = 0.01, α = 0.03, and α = 0.05, respectively. Only predictions following the designed language rule are displayed; the label 'a → c' in a panel means that the letter 'a' is input and the network predicts 'c' as the immediately following letter (corresponding to the largest softmax output). The training conditions are the same as in Fig. 2.

Illustration of the hyperparameters [π, m, Ξ] in meta predictive learning on the simplified language task. The training conditions are the same as in Fig. 2. In (c–d) we show statistical properties of bidirectional connections, where i < j is considered.

Training performance of networks with different architectures on the Penn Treebank dataset. In the upper part of the figure, we compare the vanilla RNN [42], the SaS RNN (ensemble learning) [18], an RNN with standard predictive coding [12], and an RNN with meta predictive learning, showing how test perplexity decreases with the training epoch. The first two algorithms belong to the backpropagation-through-time category [42]. The inset shows the performance of a transformer model with a single encoder block (see details in Appendix pp2) for comparison; the mean test accuracy of the transformer model at the beginning and at the end of training is also marked. In the bottom part of the figure, untrained, trained-for-five-epochs, and fully trained RNNs with meta predictive learning are used to show performance at different training stages in generating one of the sentences from the test dataset. Correctly predicted tokens from the test sentence are highlighted, while wrongly predicted tokens are colored gray. The indicated accuracy is the ratio of correctly predicted tokens to the total number of tokens in the sentence. The mean accuracy evaluated over 100 sentences is about 0%, 21.3% ± 10.5%, and 23.5% ± 11.3% at the three shown stages, respectively. Note that all models share the same training hyperparameters, such as batch size, learning rate, and training optimizer (see Appendix pp2 for details).

Probability distribution of the hyperparameters m, π, Ξ in the RNN trained with meta predictive learning. Distributions of the hyperparameters m, π, Ξ at the input layer are shown at the top of the figure (blue histograms), and those at the hidden layer and output layer are shown in the middle (salmon histograms) and at the bottom (green histograms), respectively.

