Introduction
Machine learning models are used today across a broad range of applications and achieve high performance on specific tasks. Yet their overall ability is still far from that of humans: they are highly specialized and unable to learn tasks incrementally. Whenever they learn a new task, their performance on previous tasks deteriorates drastically, a phenomenon known as catastrophic forgetting. To achieve true lifelong learning systems that resemble human brains, machine learning models must be able to retain learned skills and knowledge without expending an unbounded amount of resources to increase their capacity.
Currently, the three main methodologies for overcoming catastrophic forgetting are architectural strategies, rehearsal strategies, and regularization strategies.
Architectural Strategies
This class of methods alters the architecture and structure of the model, introducing new resources so that new tasks can be learned while performance on previous tasks is preserved. Generally, the model is expanded in an efficient manner: new components are tailored to learn the new tasks, while changes to the old components are minimized to prevent the model from “forgetting.”
Dynamically Expandable Networks (DEN) compute a group-sparsity regularization term and use it to guide network expansion, confining the number of added parameters to what each task actually requires [1]. This encourages parameter efficiency and also prevents the model from overfitting. Nevertheless, simply adding new weights to the model does not guarantee good performance, since the expanded architecture is not intelligently designed to fit the incoming tasks.
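To make the idea concrete, the following is a minimal sketch of group-sparsity regularization guiding the pruning of candidate expansion units. It is not DEN's full selective-retraining and splitting procedure; the layer sizes, penalty weight lam, and pruning threshold are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Group-sparsity (L2,1) penalty over candidate expansion units: each
    # unit's incoming weight vector forms one group, so the penalty drives
    # whole units toward zero and only units the new task actually needs
    # keep a non-negligible norm.
    def group_sparsity_penalty(weight, lam=1e-3):
        # weight: (out_units, in_features); one row per candidate unit
        return lam * weight.norm(dim=1).sum()

    candidate = nn.Linear(128, 32)                   # candidate new units
    x, y = torch.randn(8, 128), torch.randn(8, 32)
    task_loss = nn.functional.mse_loss(candidate(x), y)
    loss = task_loss + group_sparsity_penalty(candidate.weight)
    loss.backward()                                  # one illustrative step

    # After training (loop omitted), units with near-zero group norm
    # would be pruned so the expansion stays as small as possible.
    keep = candidate.weight.norm(dim=1) > 1e-2
    print(int(keep.sum()), "of", keep.numel(), "units retained")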
Learn to Grow [2] achieves better performance by adopting neural architecture search methodology from Differentiable Architecture Search (DARTS) [3]. For each new task, a controller scans the model's layers and, guided by the DARTS search algorithm, categorizes each as “reuse,” “adaptation,” or “new.” “Reuse” keeps the old parameters unchanged; “adaptation” adds a small adaptor on top of the old parameters; “new” creates an exact copy of the old parameters to be retrained [2]. This structure-optimization component automatically produces a sensible expansion of the current model when a new task arrives. However, as the number of tasks grows, the search space for the DARTS algorithm may also grow exponentially.
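A rough sketch of the DARTS-style relaxation of this per-layer choice follows. This is our illustration rather than the authors' implementation: the SearchableLayer class, layer sizes, and mixing details are assumptions made for brevity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # DARTS-style relaxation of the per-layer choice: a softmax over
    # architecture weights alpha mixes the "reuse", "adaptation", and
    # "new" candidates during the search; the argmax choice is kept
    # once the search converges.
    class SearchableLayer(nn.Module):
        def __init__(self, old_layer):
            super().__init__()
            self.reuse = old_layer                    # old parameters, frozen
            for p in self.reuse.parameters():
                p.requires_grad_(False)
            dim = old_layer.out_features
            self.adapter = nn.Linear(dim, dim)        # small adaptor on top
            self.new = nn.Linear(old_layer.in_features, dim)
            self.new.load_state_dict(old_layer.state_dict())  # exact copy
            self.alpha = nn.Parameter(torch.zeros(3))  # architecture weights

        def forward(self, x):
            w = F.softmax(self.alpha, dim=0)
            return (w[0] * self.reuse(x)
                    + w[1] * self.adapter(self.reuse(x))
                    + w[2] * self.new(x))

    layer = SearchableLayer(nn.Linear(64, 64))
    out = layer(torch.randn(8, 64))
    print(out.shape, F.softmax(layer.alpha, dim=0))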
Architectural strategies thus offer a way to increase the model's capacity, but they require additional resources and time-consuming searches to learn new tasks.
Rehearsal Strategies
Inspired by how humans reconsolidate information, rehearsal strategies replay old data to the model in order to strengthen the learned connections. As with biological synaptic reinforcement in the brain, repeated activation of connections allows memories to be retained longer in the model.
Gradient Episodic Memory (GEM) maintains an episodic memory that stores a subset of the observed examples from each learned task. When learning new tasks, inequality constraints allow the parameters to change only in ways that do not increase the loss on the stored samples [4]. This has the advantage of positive backward transfer: the introduction of a new task can even improve performance on the old tasks (as represented by the saved examples). However, the model must spend a considerable amount of resources, both computational and memory, to store and rehearse the previous examples. When the number of tasks is large, the overhead of computing the GEM gradients of the loss on previous tasks becomes very heavy.
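As an illustration, the sketch below shows the constraint for a single episodic memory: when the proposed gradient conflicts with the memory gradient, it is projected onto the nearest non-conflicting direction. GEM proper enforces one such constraint per previous task by solving a small quadratic program [4].

    import torch

    # Single-memory version of GEM's constraint: if the proposed gradient
    # g would increase the loss on the stored samples (<g, g_mem> < 0),
    # project g onto the closest direction that leaves that loss
    # non-increasing.
    def project_gradient(g, g_mem):
        dot = torch.dot(g, g_mem)
        if dot < 0:                                   # constraint violated
            g = g - (dot / torch.dot(g_mem, g_mem)) * g_mem
        return g

    g = torch.tensor([1.0, -1.0])       # gradient on the new task
    g_mem = torch.tensor([0.0, 1.0])    # gradient on stored old-task data
    print(project_gradient(g, g_mem))   # tensor([1., 0.]): conflict removed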
To alleviate this issue, pseudo-rehearsal has become the state-of-the-art methodology within this strategy: a generative model learns the data distribution of the previous tasks so that it can synthesize data from that same distribution. FearNet imitates the brain and uses such a generative model for pseudo-rehearsal. It has a module resembling the hippocampal complex for recent memories, a deep neural network resembling the medial prefrontal cortex for long-term memories, and a controller with the functionality of the basolateral amygdala that determines which memory center to use for the current task. During a sleep phase immediately after learning a new task, recent memories from the hippocampal module are consolidated into the medial prefrontal cortex through pseudo-rehearsal [5]. Although this significantly reduces the resources required to represent previous tasks for rehearsal, the tradeoff is that FearNet's recall of recent memories suffers, since the class statistics it must store grow large and learning from them becomes ineffective as the number of tasks increases.
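The following is a generic generative-replay sketch of pseudo-rehearsal, not FearNet's specific dual-memory architecture; the generator, latent size, and class count are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Generic generative replay: a generator trained on past tasks (its
    # training is omitted here) synthesizes pseudo-samples, the pre-update
    # model labels them, and the consolidated model trains on a mix of
    # real new data and pseudo data, rehearsing old tasks without any
    # stored examples.
    def consolidation_loss(new_model, old_model, generator,
                           new_x, new_y, n_pseudo=64, latent_dim=16):
        pseudo_x = generator(torch.randn(n_pseudo, latent_dim)).detach()
        with torch.no_grad():
            pseudo_y = old_model(pseudo_x).argmax(dim=1)  # old model labels
        x = torch.cat([new_x, pseudo_x])
        y = torch.cat([new_y, pseudo_y])
        return F.cross_entropy(new_model(x), y)

    gen, old, new = nn.Linear(16, 16), nn.Linear(16, 10), nn.Linear(16, 10)
    loss = consolidation_loss(new, old, gen,
                              torch.randn(32, 16), torch.randint(0, 10, (32,)))
    loss.backward()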
Overall, rehearsal strategies act as a means of strengthening past memories, but they scale poorly because of their resource requirements.
Regularization Strategies
Unlike the previous two strategies, regularization strategies do not require any additional resources to learn new tasks. Instead, they introduce penalties that prevent performance on previous tasks from deteriorating.
By constraining the parameters to stay within the region of low error for both the previously learned tasks and the new task, structural regularization promotes parameter efficiency, ensuring that each parameter contributes as much as possible to all the tasks. Elastic Weight Consolidation (EWC) achieves this by computing a quadratic penalty using Fisher information estimated from only first-order derivatives [6]. Even though the required computation is thus relatively light, a hyperparameter must be manually tuned to balance overall performance between old tasks and new tasks.
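A minimal sketch of this penalty is given below, assuming a diagonal Fisher estimated from squared first-order gradients; diag_fisher and ewc_penalty are hypothetical helper names.

    import torch

    # EWC's penalty anchors each parameter to its old-task optimum
    # theta_star, weighted by a diagonal Fisher estimate built from
    # squared first-order gradients; lam is the hyperparameter that
    # trades off old-task retention against new-task learning.
    def ewc_penalty(params, theta_star, fisher, lam):
        return sum((lam / 2) * (f * (p - t) ** 2).sum()
                   for p, t, f in zip(params, theta_star, fisher))

    def diag_fisher(model, loss_fn, data):
        fisher = [torch.zeros_like(p) for p in model.parameters()]
        for x, y in data:
            model.zero_grad()
            loss_fn(model(x), y).backward()
            for f, p in zip(fisher, model.parameters()):
                f += p.grad ** 2       # squared first-order derivatives
        return [f / len(data) for f in fisher]

    # Usage sketch after finishing task A:
    #   fisher = diag_fisher(model, F.cross_entropy, task_a_batches)
    #   theta_star = [p.detach().clone() for p in model.parameters()]
    #   loss = task_b_loss + ewc_penalty(model.parameters(),
    #                                    theta_star, fisher, lam=100.0)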
Functional regularization, on the other hand, makes use of feedback from the old parameters when learning new tasks, a process known as knowledge distillation. In Learning without Forgetting (LwF), the outputs that the old model (the model before training on the new task) produces for the new data are saved and used in the loss function to evaluate the new model's performance on the old tasks [7]. Preserving these outputs provides meaningful feedback about the shared structures of the model that are crucial to the previous tasks. Nevertheless, when the tasks come from different domains, the model's performance on previous tasks degrades significantly, because the old tasks' loss is evaluated on the new data, which belongs to a completely different domain with a drastically different distribution.
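A minimal sketch of such a distillation objective follows; lwf_loss is a hypothetical helper, and the temperature T and weight lam are the usual knobs that must be chosen by hand.

    import torch.nn.functional as F

    # LwF-style objective: fit the new-task labels while keeping the
    # old-task head close to the outputs recorded from the old model on
    # the same new data.
    def lwf_loss(new_task_logits, new_labels,
                 old_head_logits, recorded_logits, T=2.0, lam=1.0):
        ce = F.cross_entropy(new_task_logits, new_labels)
        distill = F.kl_div(F.log_softmax(old_head_logits / T, dim=1),
                           F.softmax(recorded_logits / T, dim=1),
                           reduction="batchmean") * (T * T)
        return ce + lam * distill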
Even though regularization strategies make the most efficient use of the available resources, they often require manual hyperparameter tuning or impose specific requirements that do not generalize to all types of tasks.
In this paper, we propose a new regularization algorithm to overcome catastrophic forgetting: incremental moment matching (IMM) with a Kronecker-factored approximation. In previous work, IMM demonstrated the ability to approximate the combination of all previous tasks with a single Gaussian distribution [8]. Rather than the Laplace approximation proposed there, we use a Kronecker-factored approximation to reduce the computational complexity and speed up the optimization process. This approach yields better performance thanks to the faster convergence of the second-order optimization to the combined distribution of all tasks. Not only is this method free of additional resource requirements, it also has no hyperparameters that need manual tuning, while being able to model the combined posterior regardless of the incoming data distribution.
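To illustrate the moment-matching step, the sketch below merges two Gaussian task posteriors by precision-weighting their means, in the spirit of mode-IMM [8]. It uses diagonal precisions for brevity; our method replaces these with Kronecker-factored precisions, but the merging step is the same idea.

    import torch

    # Each task posterior is a Gaussian with mean mu_k and precision f_k
    # (diagonal here); the merged mode is the precision-weighted average
    # of the task means.
    def merge_posteriors(mus, precisions):
        total = sum(precisions)
        merged_mu = sum(f * m for f, m in zip(precisions, mus)) / total
        return merged_mu, total

    mus = [torch.tensor([0.0, 2.0]), torch.tensor([2.0, 0.0])]
    precisions = [torch.tensor([3.0, 1.0]), torch.tensor([1.0, 1.0])]
    mu, f = merge_posteriors(mus, precisions)
    print(mu)   # tensor([0.5000, 1.0000]): precision-weighted means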
References
[1] J. Yoon, E. Yang, J. Lee, and S. J. Hwang, “Lifelong Learning with Dynamically Expandable Networks,” pp. 1–11, 2017.
[2] X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong, “Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting,” 2019.
[3] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable Architecture Search,” pp. 1–13, 2018.
[4] D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual learning,” in Advances in Neural Information Processing Systems, 2017, pp. 6468–6477.
[5] R. Kemker and C. Kanan, “FearNet: Brain-Inspired Model for Incremental Learning,” pp. 1–16, 2017.
[6] J. Kirkpatrick et al., “Overcoming catastrophic forgetting in neural networks,” Proc. Natl. Acad. Sci. U. S. A., vol. 114, no. 13, pp. 3521–3526, 2017.
[7] Z. Li and D. Hoiem, “Learning without Forgetting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 12, pp. 2935–2947, 2018.
[8] S. W. Lee, J. H. Kim, J. Jun, J. W. Ha, and B. T. Zhang, “Overcoming catastrophic forgetting by incremental moment matching,” in Advances in Neural Information Processing Systems, 2017, pp. 4653–4663.