
Conference Papers Evaluation

  • Writer: Lawrence Chan
  • Sep 25, 2019
  • 28 min read


1. Overcoming Catastrophic Forgetting in Neural Networks


Summary: Elastic Weight Consolidation (EWC) tries to overcome catastrophic forgetting by imitating the brain's strategy of reducing the plasticity of synapses that are vital to previously learned tasks. A quadratic penalty is used to constrain the parameters to stay within the region of low error for both the previously learned tasks and the new task. This effectively slows down the learning of the heavily constrained weights (those that are crucial to previous tasks), allowing the less used weights to learn faster to adapt to the new task. To find out which weights are more important for a task, the diagonal of the Fisher information (the expected negative second-order derivatives of the log-likelihood) is used to approximate the posterior as a Gaussian distribution.
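
A minimal PyTorch sketch of the idea (the model, data loader, and penalty weight below are illustrative placeholders, not the paper's exact setup):

```python
import torch

def diagonal_fisher(model, data_loader, loss_fn):
    """Approximate the diagonal Fisher as the mean squared gradient."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2 / len(data_loader)
    return fisher

def ewc_penalty(model, fisher, old_params, lam=1000.0):
    """Quadratic pull toward the old task's parameters, weighted by importance."""
    penalty = sum((fisher[n] * (p - old_params[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return 0.5 * lam * penalty

# Training on the new task: total_loss = new_task_loss + ewc_penalty(...)
```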


Pros:

1. EWC optimizes a network to account for as many tasks as possible with its fixed number of weights.

2. When a new task is related to a previously learned task, the weights that are important to the old task are shared, acting as positive forward transfer.

3. The Fisher information can be computed from only first-order derivatives, so the method scales to large models without expensive second-order computations.


Cons:

1. Since the model capacity is fixed and only the weights change for each new task, the initial structure must be large enough to account for all the tasks, as there is no way to add new weights.

2. The regularization only pulls the important weights back toward their old values for the previous tasks, so there is no mechanism for backward knowledge transfer even if the tasks are closely related.

3. The importance of the parameters to the tasks may not be fully captured by the Fisher information alone. The method underestimates parameter uncertainty, so training on the new task may still degrade the performance of previously learned tasks.

4. As more tasks are learned, the overlapping region of low error for all tasks may become very small, so finding it may require manually changing hyperparameters such as the learning rate.



2. Training Recurrent Neural Networks for Lifelong Learning


Summary: The model unifies Gradient Episodic Memory (GEM) and Net2Net to develop a lifelong learning model. GEM is used to overcome catastrophic forgetting and to provide positive backward transfer. A buffer stores a subset of examples from each task for rehearsal. If the gradient for the new task increases the loss on any of the previous tasks, it is projected to the closest gradient that keeps those losses from increasing, thereby enabling positive backward transfer. Whenever the model fails to learn the new task, Net2WiderNet serves as a model expansion technique, increasing the width of the original network with function-preserving transformations to achieve zero-shot knowledge transfer from the small, trained network to a larger, untrained network. When the transformation is applied to an RNN, the condition number of the hidden layer matrices becomes very large, causing the network to become ill-conditioned. By adding random noise in which the elements of each column sum to 0, the outputs are preserved and the condition number is reduced, since the noise eliminates the correlation between rows and columns of the expanded weight matrix.


Pros:

1. Since the constraints in GEM only prevent the loss on previous tasks from increasing, positive backward transfer is possible, allowing even better performance on the old tasks after the new task is introduced.

2. With Net2Net, the larger, untrained network can immediately have the same performance as the original small, trained network because of the use of function-preserving transformations.

3. Both GEM and Net2Net are parameter efficient: parameters are shared across related tasks to maximize the model's capacity.


Cons:

1. In GEM, the projection of the gradient regularizes the model, thus reducing its effective capacity.

2. In GEM, storing and rehearsing the previous examples is very costly in both computational and memory aspects, and the computational overhead is very heavy when computing the GEM gradient for a large number of tasks.

3. The Net2Net's function preserving transformations need to be worked out manually. As a result, the process cannot be automated if different architectures are used throughout the learning process.



3. Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting


Summary: The learn-to-grow framework is composed of a neural structure optimization component and a parameter learning and fine-tuning component. Structure optimization is done through neural architecture search. With each new task, each layer of the model is categorized by the controller as "reuse", "adaptation", or "new". "Reuse" makes use of the old parameters, "adaptation" adds a small parameter overhead on top of the original layer's output, and "new" randomly initializes a new set of parameters of the same size as the original set. After the structure search, all of the parameters are learned and fine-tuned with the data from the new task, even for the "reuse" layers.
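
A toy PyTorch sketch of the three per-layer options (the additive-adapter form and layer shapes are assumptions for illustration; in the paper a controller searches over these choices):

```python
import torch.nn as nn

class LayerChoice(nn.Module):
    """One layer of the learn-to-grow search space: reuse / adaptation / new."""
    def __init__(self, old_layer: nn.Linear, choice: str):
        super().__init__()
        self.choice, self.old = choice, old_layer
        if choice == "adaptation":
            # small additive adapter on top of the old layer's output
            self.adapter = nn.Linear(old_layer.out_features,
                                     old_layer.out_features, bias=False)
            nn.init.zeros_(self.adapter.weight)
        elif choice == "new":
            # fresh parameters of the same size, randomly initialized
            self.new = nn.Linear(old_layer.in_features, old_layer.out_features)

    def forward(self, x):
        if self.choice == "reuse":
            return self.old(x)
        if self.choice == "adaptation":
            h = self.old(x)
            return h + self.adapter(h)
        return self.new(x)
```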


Pros:

1. The structure optimization component allows for an automatic way to create a sensible design of an expansion to the current model when encountering new tasks.

2. Fine-tuning "reuse" layers allows for positive forward and backward transfer.

3. Structure optimization makes more effective use of the parameters that are useful across multiple related tasks, thus increasing the model's parameter efficiency.


Cons:

1. The search space for the structure optimization may grow exponentially with respect to the number of tasks.

2. The options for the structure optimization are quite limiting as to how the model can be expanded to adapt to new tasks. As a result, the model may not be able to deal with more complex tasks.

3. When the model gets larger, the cost of evaluating each layer grows exponentially for each task, creating a lot of computational overhead and inefficiency.



4. Hierarchically Structured Meta-learning


Summary: An aggregator is used to aggregate representations of all examples in the whole training set. With the learned representation of each task, soft assignment over the hierarchical clustering is performed. If the new task does not fit any of the learned task clusters, the number of clusters is increased, with parameters randomly initialized, to incrementally add model capacity. With the assigned clusters, a cluster-specific parameter gate is used to adapt the transferable knowledge into a cluster-specific initialization. This allows the training process to take only a few gradient descent steps to reach the optimal parameters.
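
A toy NumPy sketch of the soft assignment and the capacity-growth step (the distance-based gating, temperature, and threshold are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def soft_assign(task_repr, centers, temp=1.0):
    """Soft assignment of a task representation over existing clusters."""
    d = np.linalg.norm(centers - task_repr, axis=1)
    p = np.exp(-d / temp)
    return p / p.sum()

def maybe_grow(task_repr, centers, threshold=0.5):
    """Add a new cluster when no existing cluster fits the task well."""
    if soft_assign(task_repr, centers).max() < threshold:
        centers = np.vstack([centers, task_repr[None, :]])  # new cluster
    return centers
```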


Pros:

1. A hierarchical task clustering is able to model more complex task relationships. It allows for an interpretable representation of how the tasks are related and shows the characteristics of the clusters as a whole.

2. The knowledge adaptation process allows for the customization of the task knowledge, and at the same time preserves the knowledge generalization in the hierarchical clustering structure.

3. The meta-learning makes use of all the previous knowledge from the same clusters, allowing fully relevant knowledge transfer to achieve few-shot learning.


Cons:

1. Since knowledge is only preserved through generalization, knowledge retention may be lossy; the model may still need retraining on old tasks to recover optimal performance.

2. Hierarchical clustering does not scale well: as more tasks are introduced, it becomes very inefficient due to its high time complexity.

3. Since transfer happens only through initialization and the parameters are subject to change under SGD, the parameters may not be shared efficiently across tasks when the number of tasks is large.



5. Continual and Multi-Task Architecture Search


Summary: Continual Architecture Search (CAS) is used to achieve lifelong learning, with the aim of maintaining performance on previously learned tasks when trained sequentially on new tasks. To achieve this, the model parameters are constrained to be block sparse, ensuring closeness between old and new parameters, and the new parameters are constrained to be orthogonal to the old parameters so that previously learned knowledge is not affected. Efficient Neural Architecture Search (ENAS) is used to find the best Directed Acyclic Graph (DAG) for the RNN cell whenever there is a new task. The weight parameters are shared across all tasks and are updated for every new task. At test time, the task's DAG is used with the new weights to get task-specific outputs. Multi-Task Architecture Search (MAS) is used when multiple tasks are given at once, with the aim of creating a generalizable architecture that performs well across multiple tasks. This is accomplished by providing the performance of each sampled cell structure on a given task as the reward to the controller.
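
One simple way to encourage such an orthogonality constraint during training (a hedged sketch, not necessarily the paper's exact formulation) is to penalize the inner products between the new and the frozen old parameter blocks:

```python
import torch

def orthogonality_penalty(w_new: torch.Tensor, w_old: torch.Tensor, lam=1.0):
    """Penalize overlap between new weights and frozen old weights."""
    overlap = w_new @ w_old.t()       # inner products between weight rows
    return lam * (overlap ** 2).sum()

# total_loss = task_loss + orthogonality_penalty(w_new, w_old.detach())
```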


Pros:

1. In CAS, since the new DAG is initialized with the old model's parameters, there is positive forward transfer.

2. In CAS, the search is efficient since it only searches through the permutations of the RNN cells, so the search space does not grow exponentially with new tasks.

3. In MAS, a cell that learns on multiple tasks can become more generalizable to other tasks, even including those that are unseen by the model.


Cons:

1. In CAS, even though the new parameters are orthogonal to the old ones, the old parameters are not frozen and may still change when training on the new task. There are no restrictions preventing the previous tasks' performance from worsening, so the method may still be subject to some degree of catastrophic forgetting.

2. In CAS, since the number of RNN cells is pre-defined, there is no model expansion mechanism to account for insufficient capacity. Performance depends heavily on how representative the initial cells are of the upcoming tasks.

3. In MAS, when more tasks are given at once, generalizing across all of them may leave too little capacity for each individual task, causing performance to deteriorate compared to individually trained models.



6. Net2Net: Accelerating Learning via Knowledge Transfer


Summary: Net2WiderNet replaces a layer with a wider layer. This is done by introducing a new unit to the layer that is an exact copy of a random unit from the current layer, with the same connections and weights. To account for the enlarged outputs caused by the new unit, the outgoing weights of the new unit and the copied unit are divided by a replication factor, so that all units produce exactly the same values as in the original network. Net2DeeperNet replaces a layer with two layers. The new layer is initialized to an identity matrix but remains free to take on any value later, which preserves an equivalent representation. Together with Net2WiderNet, it is possible to add any hidden layer that is at least as wide as the layer below it.
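
A minimal NumPy sketch of Net2WiderNet for a two-layer network, replicating one unit with a replication factor of 2 (shapes and naming are illustrative assumptions):

```python
import numpy as np

def net2wider(W1, b1, W2, rng=None):
    """Add one hidden unit without changing the network's function.

    W1: (hidden, in) incoming weights, b1: (hidden,) biases,
    W2: (out, hidden) outgoing weights.
    """
    rng = rng or np.random.default_rng(0)
    j = rng.integers(W1.shape[0])         # randomly chosen unit to copy
    W1 = np.vstack([W1, W1[j:j + 1]])     # copy incoming weights
    b1 = np.append(b1, b1[j])
    W2 = np.hstack([W2, W2[:, j:j + 1]])  # copy outgoing weights, then
    W2[:, j] /= 2.0                       # split the contribution across
    W2[:, -1] /= 2.0                      # the two copies (factor = 2)
    return W1, b1, W2

# For any x, relu(x @ W1.T + b1) @ W2.T is unchanged after widening.
```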


Pros:

1. The larger, untrained network can immediately have the same performance as the original small, trained network because of the use of function-preserving transformations.

2. The initial change to the larger model size does not worsen performance, guaranteeing that every subsequent local optimization step improves on the original network.

3. The weights do not need to be frozen to preserve performance; the newly added units are able to maintain and improve performance through optimization.


Cons:

1. The function preserving transformations need to be worked out manually. As a result, the process cannot be automated if different architectures are used throughout the learning process.

2. Net2Net does not provide any mechanism for knowledge retention to overcome catastrophic forgetting. When the training data distribution changes, the network simply trains for the new task, forgetting the previously learned tasks.

3. Even though it preserves the original network's performance, it does not provide significant forward transfer, since the new parameters are either randomly initialized or identity functions. The network must therefore re-adapt to the training problem.

4. There is no clear guidance or guidelines as to when and why Net2WiderNet or Net2DeeperNet should be used on the network.



7. Gradient Episodic Memory for Continual Learning


Summary: The main feature of GEM is an episodic memory that stores a subset of the observed examples from each task. The episodic memories are indexed with task descriptors, with the goal of allowing positive backward transfer. This is done by treating the losses on the episodic memories as inequality constraints, avoiding their increase but allowing their decrease. When a violation occurs, the proposed gradient is projected to the closest gradient that satisfies all the constraints.
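
For a single previous task the projection has a closed form; a NumPy sketch of it (the general case with many tasks solves a small quadratic program instead):

```python
import numpy as np

def gem_project(g: np.ndarray, g_mem: np.ndarray) -> np.ndarray:
    """Project g to the closest gradient with <g, g_mem> >= 0."""
    dot = g @ g_mem
    if dot >= 0:
        return g                                # no violation, keep g
    return g - (dot / (g_mem @ g_mem)) * g_mem  # drop the conflicting part

g_new = np.array([1.0, -2.0])    # gradient proposed by the new task
g_old = np.array([0.5, 1.0])     # gradient on the episodic memory
g_tilde = gem_project(g_new, g_old)
assert g_tilde @ g_old >= -1e-12 # old task's loss no longer increases
```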


Pros:

1. Since the constraints in GEM only prevent the loss on previous tasks from increasing, positive backward transfer is possible, allowing even better performance on the old tasks after a new task is introduced.

2. The detection of violations and the projection of the gradient are done automatically, so the process requires no manual effort.

3. The parameters are not frozen and remain fully optimized thanks to the gradient projections, enabling parameter efficiency.


Cons:

1. Since integer task descriptors are used, there is nearly no positive forward transfer learning.

2. The projection of the gradient regularizes the model, thus reducing its effective capacity.

3. Storing and rehearsing the previous examples is very costly in both computational and memory aspects, and the computational overhead is very heavy when computing the GEM gradient for a large number of tasks.

4. When storing only a small number of examples from each task, it is sometimes not possible for the small subset to be representative of the whole population.

5. Each GEM iteration requires one backward pass per task, so the cost may grow rapidly with a larger number of tasks.



8. Overcoming Catastrophic Forgetting by Incremental Moment Matching


Summary: The moments of posterior distributions are matched in an incremental way. A Gaussian distribution is used to approximate the posterior distribution of the parameters. Given a sequence of tasks, the parameters of the Gaussian approximation are found from the posterior of each task. Mean-based and mode-based incremental moment matching are applied together with weight-transfer, L2-transfer, and drop-transfer.
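
A minimal NumPy sketch of the two merging rules with diagonal covariances (the mixing ratios and the diagonal precisions standing in for the Fisher are illustrative assumptions):

```python
import numpy as np

def mean_imm(thetas, alphas):
    """Mean-IMM: weighted average of the task-specific parameters."""
    return sum(a * t for a, t in zip(alphas, thetas))

def mode_imm(thetas, precisions, alphas):
    """Mode-IMM: precision-weighted average (mode of the merged Gaussian)."""
    num = sum(a * p * t for a, p, t in zip(alphas, precisions, thetas))
    den = sum(a * p for a, p in zip(alphas, precisions))
    return num / den

theta1, theta2 = np.array([1.0, 0.0]), np.array([0.0, 2.0])
prec1, prec2 = np.array([10.0, 1.0]), np.array([1.0, 10.0])
print(mean_imm([theta1, theta2], [0.5, 0.5]))                  # [0.5 1.0]
print(mode_imm([theta1, theta2], [prec1, prec2], [0.5, 0.5]))
# mode-IMM leans toward whichever task was more certain per coordinate
```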


Pros:

1. The search space of regularizers can be nearly convex, making the search space smooth so that points in it have good accuracy.

2. Catastrophic forgetting is minimized for tasks of the same domain with the use of the approximation of the posterior parameters.

3. The incremental moment matching algorithms are able to balance out the information between an old and a new network, utilizing the parameters of the model efficiently.


Cons:

1. Weight transfer is an inadequate initialization technique between different problem classes.

2. The method may not be able to avoid catastrophic forgetting for tasks that do not have much relations with one another.

3. There is no positive backward transfer, since there can be no further learning for most of the model.



9. Fast Context Adaptation via Meta-Learning


Summary: Fast context adaptation via meta-learning (CAVIA) partitions the model parameters into two parts: context parameters, which are adapted in the inner loop for each task, and network parameters, which are meta-learned in the outer loop and shared across tasks. The context parameters are initialized to 0 so that they do not affect the output of the layer before adaptation.
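
A compact PyTorch sketch of the inner loop (how the network consumes the context vector is an assumption about the architecture; only phi is adapted per task):

```python
import torch

def cavia_inner_loop(model, x, y, loss_fn, num_ctx=4, steps=2, lr=0.1):
    """Adapt only the task-specific context parameters; theta stays fixed."""
    phi = torch.zeros(num_ctx, requires_grad=True)   # zero-init context
    for _ in range(steps):
        loss = loss_fn(model(x, phi), y)             # model conditions on phi
        (grad,) = torch.autograd.grad(loss, phi, create_graph=True)
        phi = phi - lr * grad                        # inner-loop update
    return phi

# Outer loop: evaluate loss_fn(model(x_query, phi), y_query) and backprop
# into the shared network parameters theta.
```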


Pros:

1. The sharing of network parameters across tasks provides parameter efficiency and allows for forward transfer.

2. Task embeddings are created through the use of context parameters.

3. The initialization of the context parameters does not need to be learned, making CAVIA more robust to the initial learning rate.


Cons:

1. Updating the shared network parameters may be difficult and may fail to account for all tasks when the number of tasks is large and the tasks span different domains.

2. The updates to the shared network parameters are not guaranteed to lower the losses of previously learned tasks, so they may negatively impact their performance.

3. The model must be given an initial capacity and has no mechanism to expand further to account for more tasks.



10. Parameter-Efficient Transfer Learning for NLP


Summary: The paper proposes a new bottleneck adapter module for Transformers. The adapter tuning strategy involves injecting new layers into the original network. During training, the original parameters are frozen while the new adapter layers are trained from random initialization. The adapters are initialized to a near-identity function so as not to affect the original performance; this is done by initializing the parameters of the projection layers to near-zero and using skip connections. The number of parameters is limited by projecting the original features down to a much smaller dimension, applying a nonlinearity, and then projecting back to the original dimension.
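
A minimal PyTorch sketch of such an adapter block (the bottleneck size, init scale, and nonlinearity are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, skip."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # near-zero init => the module starts as a near-identity function
        for layer in (self.down, self.up):
            nn.init.normal_(layer.weight, std=1e-3)
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # skip connection
```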


Pros:

1. The model is parameter efficient and is able to scale well since only a small number of additional parameters need to be added per task.

2. The model is able to adapt to many downstream tasks since the lower layers are able to extract lower-level features that can be shared across tasks.

3. Freezing the original parameters guarantees the performance of previously learned tasks, overcoming catastrophic forgetting better than fine-tuning.


Cons:

1. Since the original parameters are frozen, there is no backward transfer learning.

2. The model is designed to deal with tasks of the same domain. Thus, if tasks differ drastically, the model will not be able to make use of the bottleneck tuning method to obtain high performance.

3. Even though the newly added adapters are clearly useful for task performance, it is not clear how many adapters to add for each incoming task. The expansion process cannot be automated.



11. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks


Summary: The model-agnostic meta-learning algorithm (MAML) aims to find model parameters that are sensitive to changes in the task, such that small changes in the parameters, when made in the direction of the gradient of a task's loss, produce large improvements on the loss function of any task drawn from the same distribution. The meta-optimization is performed over the model parameters, while the objective is computed using the updated model parameters. This optimizes the model parameters such that one or a small number of gradient steps on a new task produces maximally effective behavior on that task.
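
A condensed PyTorch sketch of one meta-update (uses torch.func.functional_call from PyTorch 2.x; the task batch format is an assumption):

```python
import torch

def maml_step(model, tasks, loss_fn, meta_opt, inner_lr=0.01):
    """One meta-update over a batch of tasks, one inner gradient step each."""
    meta_loss = 0.0
    for (x_s, y_s), (x_q, y_q) in tasks:  # support / query splits per task
        params = dict(model.named_parameters())
        loss = loss_fn(torch.func.functional_call(model, params, (x_s,)), y_s)
        grads = torch.autograd.grad(loss, params.values(), create_graph=True)
        adapted = {n: p - inner_lr * g
                   for (n, p), g in zip(params.items(), grads)}
        meta_loss = meta_loss + loss_fn(
            torch.func.functional_call(model, adapted, (x_q,)), y_q)
    meta_opt.zero_grad()
    meta_loss.backward()  # differentiates through the inner update
    meta_opt.step()
```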


Pros:

1. The algorithm is model-agnostic, so it can be applied to any kind of model that is trained with gradient descent.

2. The model is parameter efficient, since parameters are shared to create internal representations that are transferable across tasks.

3. The model does not require the introduction of any learned parameters for meta-learning.


Cons:

1. This algorithm only aims to tackle problems of the same domain, thus cannot be applied when the tasks are drastically different.

2. The capacity of the model needs to be set initially, so there is no model expansion mechanism.

3. Since the model parameters are not frozen and there are no restrictions on the losses of previously learned tasks, catastrophic forgetting is not prevented.

4. As the number of tasks grows, the computational overhead for the gradients becomes large.

5. Transfer relies solely on weight initialization, so the model architecture must be suitable from the start in order to achieve good performance.



12. Online Structured Laplace Approximations for Overcoming Catastrophic Forgetting


Summary: Bayesian Online Learning, or Assumed Density Filtering, is used as a framework for updating an approximate posterior when data arrive sequentially. In order to make Bayesian online learning tractable for neural networks, a Laplace approximation is used. Kronecker factored approximations to the curvature of the neural networks are used to obtain a better fit to the posterior. A hyperparameter is introduced to act as a regularizer on the approximation to the posterior, providing a way of trading off retaining performance on previous tasks against having sufficient flexibility for learning a new one.
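
A small NumPy sketch of why the Kronecker factorization is cheap to use: the quadratic penalty for one layer can be evaluated with a trace identity, without ever forming the full curvature matrix (A and G below are hypothetical input-side and output-side factors):

```python
import numpy as np

def kfac_penalty(W, W_old, A, G, lam=1.0):
    """lam/2 * vec(dW)^T (A kron G) vec(dW), via tr(dW^T G dW A)."""
    dW = W - W_old
    return 0.5 * lam * np.trace(dW.T @ G @ dW @ A)

rng = np.random.default_rng(0)
W_old = rng.normal(size=(3, 4))
W = W_old + 0.1 * rng.normal(size=(3, 4))
A, G = np.eye(4), np.eye(3)          # identity factors as a sanity check
print(kfac_penalty(W, W_old, A, G))  # equals 0.5 * ||W - W_old||^2 here
```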


Pros:

1. This algorithm enables parameter efficiency in a way similar to EWC, and it goes beyond diagonal approximation methods that only measure the sensitivity of individual parameters.

2. When a new task is related to a previously learned task, the approximated posterior can act as forward transfer learning to the new task.

3. The newly introduced hyperparameter allows more flexibility in how the model's parameters are used, trading off the retention of previously learned tasks against performance on the new task.


Cons:

1. The model capacity is fixed, with no expansion mechanism, so the initial structure must be large enough to account for all the tasks, as there is no way to add new weights.

2. Similar to EWC, there is no positive backward transfer to previously learned tasks.

3. The approximate online Bayesian updating is not sufficient to guarantee good performance; it requires manual tuning of the regularization hyperparameter.



13. Exploring Continual Learning Using Incremental Architecture Search


Summary: For each new class of data that arrives, a neuron is added to the output layer. The model is then trained with all available data, using early stopping to ensure that any increase in the performance of a sampled architecture is caused only by the improvement in the neural architecture. Six new architectures are then sampled based on selected guidelines that consist of several Net2Net transformations likely to improve performance. The architecture with the highest average validation accuracy is selected and trained with all available data, again with early stopping.


Pros:

1. The neural architecture search combined with Net2Net transformations allows for sensible model expansion that can accommodate the newly arrived class of data.

2. Since all the weights are allowed to train with each additional class, the method provides both forward and backward transfer.

3. The training of newly arrived data is efficient since the Net2Net transformations preserve the original performance, allowing few training steps with the student network.


Cons:

1. This technique is only suitable for continual learning that involves the addition of classes, rather than the addition of new tasks (such as reinforcement learning settings).

2. Since only six architectures are sampled, the performance of the architectures may not be very optimized.

3. This method requires that all the training data are available instead of arriving sequentially in a stream.



14. Continuous Learning in Single-Incremental-Task Scenarios


Summary: AR1 is a continuous learning method that combines architectural and regularization strategies. This is done by improving Copy Weight with Reinit (CWR) with mean-shift and zero initialization, and then extending it with Synaptic Intelligence, which serves as a regularization constraint for tuning the optimal shared weights across batches.


Pros:

1. Mean-shift allows the weights to be normalized without tuning any parameters, removing the process of rescaling by the weights.

2. By initializing the output layer to the same value, the model can avoid the errors of the softmax normalization producing strong predictions for wrong classes, allowing for better learning through backpropagation.

3. AR1 is suitable for online implementations due to the low computational overhead and the small number of epochs for SGD.


Cons:

1. This technique is only suitable for continual learning that involves the addition of classes, rather than the addition of new tasks (such as reinforcement learning settings).

2. Similar to EWC, the regularization only pulls the important weights back toward their old values for the previous tasks, so there is no mechanism for backward knowledge transfer even if the tasks are very related. Moreover, the weights are not frozen, so some degree of catastrophic forgetting can still occur.

3. There is no model expansion technique to increase the model's capacity.



15. Encoder Based Lifelong Learning


Summary: Autoencoders are used to learn the submanifold of informative features for a given task. The features that are the most informative for the first task are preserved, providing more flexibility for the other features in order to improve the performance on the following tasks. The aim of the use of autoencoders is not only to capture the information that is important to reconstruct the features, but also important for the task operator.


Pros:

1. This improves on Learning without Forgetting, as the authors make the approximation used less sensitive to the data distributions.

2. This method does not require the storage of a large number of parameters since it reduces forgetting of earlier tasks by controlling the distance between the representations of the different tasks.

3. The submanifold learned by the autoencoders allows for forward transfer, since the task operator allows parameter sharing for closely related tasks.


Cons:

1. An autoencoder must be trained for each task, making the training process computationally expensive.

2. The memory required for the autoencoders grows linearly with the number of tasks.

3. Since the important features determined by the autoencoders cannot be trained effectively during training for new tasks, there is no backward transfer.

4. With separate fully connected layers per task, parameter efficiency is lower, as many more parameters are used for each new incoming task.



16. Expert Gate: Lifelong Learning with a Network of Experts


Summary: A specialized model (expert) is trained for each task by transferring knowledge from the most related previous task. A gating function is also learned to capture the characteristics of each task. This gate forwards the test data to the corresponding expert according to the reconstruction errors, in order to maintain high performance over all the learned tasks. For each task, a low-dimensional subspace is learned with an undercomplete autoencoder, so that the best-fitting expert can be selected for an incoming test sample. Two different transfer methods are used based on the reconstruction errors: when two tasks are sufficiently related, Learning without Forgetting is used as the transfer method; otherwise, fine-tuning is used.
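
A tiny sketch of the gating rule (the autoencoder and expert interfaces are hypothetical): route each test sample to the expert whose task autoencoder reconstructs it with the smallest error.

```python
import numpy as np

def route(x, autoencoders, experts):
    """Forward x to the expert of the best-reconstructing task autoencoder."""
    errors = [np.mean((ae.reconstruct(x) - x) ** 2) for ae in autoencoders]
    best_task = int(np.argmin(errors))
    return experts[best_task](x)
```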


Pros:

1. Positive forward transfer is supported with the help of transfer learning methods applied from the most closely related task.

2. Since a representation is learned from the undercomplete autoencoders, it does not require storing data from previous tasks.

3. The gating function determines the relatedness of previous tasks to the new task and performs similarly to a discriminative classifier, while requiring far less computation and training time.


Cons:

1. An additional expert model needs to be trained for each of the new incoming tasks, causing the model's capacity to increase significantly when the number of tasks is large.

2. The positive forward transfer is based only on the single most related task, so knowledge is lost when several previous tasks are closely related to the new task.

3. There are no mechanisms for positive backward transfer learning.



17. End-to-End Incremental Learning


Summary: An end-to-end approach is used, with a deep network trained with a cross-distilled loss function. To overcome catastrophic forgetting, a representative memory is used to store and manage the most representative samples from the old classes. The selection of new samples is based on herding selection, which produces a list of the samples of one class sorted by their distance to the mean sample of that class, and selects the first samples from it. When new classes are trained, a new classification layer is added and connected to the feature extractor and to the component computing the cross-distilled loss. The cross-distilled loss function combines a distillation loss, which retains the knowledge from old classes, with a multi-class cross-entropy loss, which learns to classify the new classes.
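
A simple NumPy sketch of the selection rule as described above (the paper's herding procedure is iterative; sorting by distance to the class mean is a close approximation):

```python
import numpy as np

def select_exemplars(features: np.ndarray, m: int) -> np.ndarray:
    """Indices of the m samples whose features are closest to the class mean."""
    mean = features.mean(axis=0)
    dists = np.linalg.norm(features - mean, axis=1)
    return np.argsort(dists)[:m]  # most representative samples first

feats = np.random.default_rng(0).normal(size=(100, 16))
memory_indices = select_exemplars(feats, m=10)
```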


Pros:

1. The feature extractor is allowed to continue to train throughout the incremental training process, allowing more and more features to be extracted as new classes come in.

2. There is positive backward transfer learning since the samples from the new classes are also used for distillation to reinforce the old knowledge.

3. Because of the distillation loss, the parameter weights are used efficiently to adapt to different classes, and the network learns a more discriminative representation of the classes.


Cons:

1. The performance of the model depends heavily on whether the training sets are balanced, so it will not work well in few-shot learning scenarios.

2. This method requires memory to store samples from previous tasks, which will either grow as the number of tasks grow or reduce the number of samples stored per task to a non-representative amount for the tasks.

3. This method is only suitable for classification and cannot deal with more complex and different tasks, as in reinforcement learning scenarios.



18. Memory Aware Synapses: Learning what (not) to forget


Summary: An importance weight is estimated for each parameter in the network, approximating the sensitivity of the learned function to a change in that parameter. The goal is to preserve the prediction of the network (the learned function) at each observed data point and to prevent changes to parameters that are important for this prediction. The method learns the importance of the network parameters from the input data the system is active on, in an unsupervised manner. Since the loss function is not used, the importance weights can be computed on any available data considered most representative of the test conditions. Because the method resembles an implicit memory attached to each parameter of the network, it is referred to as Memory Aware Synapses.
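
A minimal PyTorch sketch of the importance estimate; it needs only unlabeled inputs, with the squared L2 norm of the network output serving as the function to preserve:

```python
import torch

def mas_importance(model, inputs):
    """Importance = mean |gradient| of the squared output norm, per parameter."""
    omega = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x in inputs:
        model.zero_grad()
        out = model(x.unsqueeze(0))
        (out ** 2).sum().backward()  # sensitivity of the learned function
        for n, p in model.named_parameters():
            omega[n] += p.grad.abs() / len(inputs)
    return omega

# When training task t+1:
# loss += lam * sum((omega[n] * (p - p_old[n]) ** 2).sum() for n, p in ...)
```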


Pros:

1. It can compute the importance of the parameters on any given data point without needing labels and without the computation having to happen while training the model.

2. This method can be applied on top of any pretrained network and is able to adapt and specialize to a given subset of data points.

3. The model is parameter efficient, since the parameters are optimized to account for multiple tasks.


Cons:

1. Since the model capacity is fixed and only the weights change for each new task, the initial structure must be large enough to account for all the tasks, as there is no way to add new weights.

2. This model is more tailored towards optimizing for learning important information, rather than real lifelong learning since the knowledge retention is very specialized.

3. There is no positive backward transfer, since the important parameters are discouraged from changing when learning new tasks.



19. FearNet: Brain-Inspired Model for Incremental Learning


Summary: FearNet has two complementary memory centers. The first resembles the hippocampal complex (HC) for recent memories: a short-term memory system that immediately learns new information for recent recall. The second is a deep neural network resembling the medial prefrontal cortex (mPFC) for long-term storage. A controller network resembling the basolateral amygdala (BLA) determines which memory center contains the associated memory required for prediction. During sleep phases, FearNet uses a generative model to consolidate data from the HC to the mPFC through pseudorehearsal. The HC computes class-conditional probabilities using stored training examples. The mPFC is trained both to reconstruct its input using a symmetric encoder-decoder and to compute class-conditional probabilities.


Pros:

1. The training examples do not have to be stored after the consolidation phase.

2. Since the mPFC is a pseudoexample generator learned through an unsupervised reconstruction task, both discriminating examples and generating new ones, FearNet is slow to forget old information.

3. FearNet is capable of incrementally learning multi-modal information if the model has a good starting point (high base-knowledge).


Cons:

1. The stored covariance matrix has the largest impact on the model's size, which is a trade-off for not storing training examples.

2. If classes are seen in more than one study session, the storage and updating of class statistics are not addressed, and performance on older tasks may suffer, since recent study sessions are favored in the autoencoder's learning.

3. If the number of classes is high, FearNet may suffer in recent recall due to ineffective learning as the stored statistics grow larger.

4. BLA needs to be trained independently, instead of with the other two parts of the model.

5. The output of the mPFC encoder is assumed to be normally distributed for each class, which may not be the case.

6. There is no model expansion technique to adapt to more incoming classes and examples.

7. If the model starts with lower base-knowledge performance, the model struggles to learn new information incrementally.



20. Learning without Forgetting


Summary: The goal is to add task-specific parameters for a new task and to learn parameters that work well on both old and new tasks, using images and labels from only the new task. The number of new parameters equals the number of new classes times the number of nodes in the last shared layer. When training, only the new parameters are trained first in a warm-up step, and then all weights are jointly trained until convergence in a joint-optimize step. The outputs of the old tasks on the new data are recorded so they can be used in the loss term that preserves performance on the old tasks. The paper also introduces knowledge distillation for lifelong learning to preserve performance on old tasks.
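
A sketch of the combined objective (the temperature and weighting are hypothetical settings; the paper uses a modified cross-entropy for the distillation term, approximated here with a temperature-scaled KL divergence):

```python
import torch.nn.functional as F

def lwf_loss(new_logits, y_new, old_logits, recorded_old, T=2.0, lam=1.0):
    """Cross-entropy on the new task + distillation toward recorded outputs."""
    ce = F.cross_entropy(new_logits, y_new)
    distill = F.kl_div(F.log_softmax(old_logits / T, dim=1),
                       F.softmax(recorded_old / T, dim=1),
                       reduction="batchmean") * T * T
    return ce + lam * distill
```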


Pros:

1. This method does not require the storage of any data from the old dataset. This brings the benefit of joint optimization of the shared parameters, as well as saving the computation since the images only have to pass through the shared layers once for both the new task and the old task.

2. Preserving outputs on the old task is a more direct and interpretable way to retain the important shared structures learned for the previous tasks.

3. Instead of using Net2Net techniques to expand the network, the additional parameters to account for the new classes are much smaller in comparison while maintaining similar performances.


Cons:

1. The performance on old tasks degrades substantially when the model is exposed to a long sequence of tasks from different domains, since the loss for old tasks is computed on the incoming data, which is likely drawn from a distribution significantly different from that of the previous data.

2. In joint training, different images are used for different tasks, with each task requiring separate back-propagation through the shared parameters, causing the training process to be slow.

3. This method is limited to incremental learning of more classes in classification within the same domain, and cannot be applied when the tasks are drastically different.

4. The constraint of mimicking the output of Original CNN as much as possible is likely to hinder the adaptation to the new task.



21. Meta Continual Learning


Summary: The model is trained to limit the updates to parameters of the mapping function that are important for performance on the previous tasks, while allowing large updates for parameters used to learn the current task. When a new task is given to the mapping function, the function's parameters are updated using the outputs of an update-step prediction model. The update-step predictor computes the importance of each mapping-function parameter for the previous task relative to the other parameters. This uses meta-learning to alleviate catastrophic forgetting.


Pros:

1. The method allows for parameter efficiency, since the weights are optimized to account for multiple tasks.

2. There is positive forward transfer learning since the weights are shared when learning a new task that is similar to some older tasks.

3. The meta-learning algorithm optimizes the weight updates themselves, so compared to regularization methods the weights remain more flexible throughout the updates instead of being restricted by the regularizers.


Cons:

1. This meta-training approach requires access to samples from all previous tasks, or storing them, in order to overcome catastrophic forgetting.

2. In order to account for multiple tasks, it is required to chain multiple update predictors (each with their own previous task to consider), which increases the size and computation significantly when the number of tasks is large.

3. Since the model capacity is fixed and only the weights change for each new task, the initial structure must be large enough to account for all the tasks, as there is no way to add new weights.

4. There is no mechanism for positive backward transfer.



22. Lifelong Learning via Progressive Distillation and Retrospection


Summary: The model adapts to a new task through knowledge distillation instead of directly training on the new data. An expert CNN is trained with the new training data, and the learning of the new task is based on the knowledge distillation from the expert CNN. Retrospection is used by storing a small fraction of data for old tasks in order to preserve the performance on these tasks.


Pros:

1. The one-hot labels of the new data are replaced by the soft labels output by the expert CNN, which can encode the relationships among classes and thus facilitate learning on the new task.

2. Distillation is beneficial for the performance preservation on old tasks since it is easier for the original CNN to match the output on new data to a soft distribution instead of a very peaked one.

3. There is positive transfer learning because of the use of distillation.


Cons:

1. It requires retrospection, i.e., storing a small fraction of data for old tasks. This requires additional memory, which becomes large when the number of tasks is large.

2. It requires the training of an expert CNN for each of the new tasks in order to enable the use of distillation.

3. Since the model capacity is fixed and only the weights change for each new task, the initial structure must be large enough to account for all the tasks, as there is no way to add new weights.

4. There is no mechanism for positive backward transfer learning since distillation simply preserves the performance of old tasks instead of improving it.



23. Continual Learning Through Synaptic Intelligence


Summary: A class of algorithms is developed to keep track of an importance measure for each individual synapse (the parameters: weights and biases), reflecting its past credit for improvements of the task objective on each task. A quadratic surrogate loss is used that has the same minimum as the cost function of the previous tasks and yields the same importance measures over the parameter distance. Individual synapses are thus able to estimate their importance for solving past tasks. Using this information, penalties are applied to changes of the most important synapses, so that novel tasks are learned with minimal interference to previously learned tasks.
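
A minimal sketch of the online bookkeeping (NumPy, one flat parameter vector; the damping term and constants are illustrative):

```python
import numpy as np

class SynapticImportance:
    """Path-integral credit assignment for one parameter vector."""
    def __init__(self, theta_start, xi=0.1):
        self.omega = np.zeros_like(theta_start)  # running per-parameter credit
        self.theta_start = theta_start.copy()
        self.xi = xi                             # damping term

    def accumulate(self, grad, delta_theta):
        """Called every step: credit for loss decrease along the trajectory."""
        self.omega += -grad * delta_theta

    def consolidate(self, theta_end):
        """Importance for the finished task, normalized by total movement."""
        total = theta_end - self.theta_start
        return self.omega / (total ** 2 + self.xi)

# Penalty on the next task: c * sum(Omega * (theta - theta_end) ** 2)
```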


Pros:

1. This method is able to compute an importance measure online and along the entire learning trajectory.

2. This method resembles biological synapses by adding more intelligence to synapses through potentially complex dynamical properties.

3. When a new task is related to a previously learned task, the weights that are important to the old task are shared, acting as positive forward transfer.

4. The method is parameter efficient, since the weights are optimized to account for as many tasks as possible.


Cons:

1. The regularization only pulls the important weights back toward their old values for the previous tasks, so there is no mechanism for backward knowledge transfer even if the tasks are very related.

2. As more tasks are learned, the overlapping region of low error for all tasks may become very small, so finding it may require manually changing hyperparameters such as the learning rate.

3. Since the model capacity is fixed and only the weights change for each new task, the initial structure must be large enough to account for all the tasks, as there is no way to add new weights.


24. Lifelong Learning with Dynamically Expandable Networks



Summary: A Dynamically Expandable Network first identifies neurons that are relevant to the new task and selectively retrains the network parameters associated with them. When a new task arrives, a sparse linear model is fit to predict the task using the topmost hidden units of the neural network. Breadth-first search is performed on the network, starting from the nodes affected by the training, to identify all units that have paths to the output for the task. If the selective retraining fails to bring the loss below a set threshold, the network capacity is expanded in a top-down manner, and any unnecessary neurons are eliminated using group-sparsity regularization. The network then calculates the drift of each unit to identify units that have drifted too far from their original values during training, and duplicates them. The neurons are timestamped to ensure that old tasks only use neurons that existed at their time of training, rather than the newly added neurons.


Pros:

1. The use of dynamic expansion with group-sparsity regularization reduces the number of parameters. This encourages parameter efficiency and prevents the model from overfitting.

2. By splitting the neurons that drift far away from their original values, the performance of old tasks is preserved and the new tasks are given more capacity and flexibility to enhance the network's performance on them.

3. There is positive backward and forward transfer, since the parameters are shared with future tasks and the old parameters are allowed to train further when they are related to the new tasks.


Cons:

1. The steps of dynamic network expansion and network split/duplication require heavy computation and may take a long time.

2. It requires storing both the previous parameters and timestamps for all tasks, which may become large when the network is big and the number of tasks is large.

3. There is no intelligent design in the model expansion, since no architecture search is used. The model may not have a suitable architecture for the corresponding tasks, and the neurons may not be organized in a structurally optimal way.



25. Continual Learning with Deep Generative Replay


Summary: The model is called a scholar, as it is capable of learning a new task and teaching its knowledge to other networks. It is composed of a generative model that produces real-like samples and a solver, which is a task-solving model. When a new task arrives, real and replayed samples are mixed together as inputs. The generator learns to reconstruct the cumulative input space, and the new solver is trained to couple the inputs and targets drawn from the same mix of real and replayed data. The replayed target is the past solver's response to the replayed input.
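
A minimal PyTorch-style sketch of one training step on a mixed batch (generator.sample and the hard replay labels are simplifying assumptions; the paper couples replayed inputs with the past solver's responses):

```python
import torch

def replay_batch(generator, old_solver, n):
    """Generate pseudo-data for past tasks and label it with the old solver."""
    x_rep = generator.sample(n)              # hypothetical interface
    y_rep = old_solver(x_rep).argmax(dim=1)  # past solver's responses
    return x_rep, y_rep

def train_step(solver, opt, loss_fn, x_real, y_real, generator, old_solver):
    x_rep, y_rep = replay_batch(generator, old_solver, len(x_real))
    x = torch.cat([x_real, x_rep])           # mix real and replayed data
    y = torch.cat([y_real, y_rep])
    opt.zero_grad()
    loss_fn(solver(x), y).backward()
    opt.step()
```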


Pros:

1. The generative replay allows ease of balancing the former and new task performances and flexible knowledge transfer.

2. The network is jointly optimized towards task objectives, thus guaranteed to achieve the full performance when the former input spaces are recovered by the generator.

3. The past data does not need to be stored since they are generated by the generator. This allows the storage and memory requirements of the model to be small.


Cons:

1. The training process requires both the training of the generator and the solver, which is time-consuming and computation heavy.

2. When the number of tasks is large, it will require the generation of a lot of data, which will add a lot of computation overhead in order to generate a reasonable data distribution for all past tasks.

3. Whenever there is a new task, a new scholar has to be trained. This requires not only the resources to create a new model, but also additional training time and computational power.

4. Since the model capacity is fixed and only the weights change for each new task, the initial structure must be large enough to account for all the tasks, as there is no way to add new weights.



26. Episodic Memory in Lifelong Language Learning


Summary: This methodology does not assume that each example comes with a dataset descriptor, so the model does not know which dataset an example comes from. The model consists of three main components: an example encoder, a task decoder, and an episodic memory module. The example encoder is a BERT text encoder. The task decoder produces the outputs as probabilities, for both text classification and question answering. The episodic memory module is used for sparse experience replay and local adaptation, to prevent catastrophic forgetting and encourage positive transfer. The module is a key-value memory block. The key representation of the input is given by a key network, a pretrained, frozen BERT model, to prevent the key representations from drifting as the data distribution changes. Training examples are written to memory at random. Random sampling is used to perform sparse experience replay, and K-nearest neighbors is used for local adaptation: the key network produces a query vector for the test example, and the memory is queried to retrieve the K nearest neighbors. Gradient-based local adaptation then updates the parameters of the encoder-decoder model to obtain local parameters used for the current prediction. This locally adapts the encoder-decoder network to better predict the retrieved examples, while keeping it close to the base parameters.
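
A condensed PyTorch sketch of the local adaptation step (key_net, the memory fetch, and the regularization weight are hypothetical interfaces and settings):

```python
import copy
import torch
import torch.nn.functional as F

def locally_adapt(model, key_net, mem_keys, fetch, x_test,
                  k=32, steps=5, lr=1e-3, reg=1.0):
    """Adapt a copy of the model on the K nearest stored examples, then predict."""
    q = key_net(x_test)                                    # query vector
    idx = torch.cdist(q, mem_keys).topk(k, largest=False).indices.squeeze(0)
    local = copy.deepcopy(model)                           # base model intact
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(steps):
        x_m, y_m = fetch(idx)                              # retrieved examples
        loss = F.cross_entropy(local(x_m), y_m)
        loss += reg * sum((p - b).pow(2).sum()             # stay near base
                          for p, b in zip(local.parameters(),
                                          model.parameters()))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return local(x_test)
```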


Pros:

1. The local adaptation phase is able to shape the output distribution of a test example to peak around the relevant classes, based on the examples retrieved from memory. Because the key network reliably computes similarities between the test example and the stored examples, the model adapts well and gives better predictions.

2. The combination of sparse experience replay and local adaptation allows for positive transfer learning both forwards and backwards.

3. The methodology is able to adapt to distribution changes and is not reliant on any dataset descriptors to learn effectively.


Cons:

1. Storing and rehearsing the previous examples is very costly in both computational and memory aspects.

2. When storing only a small number of examples from each task, it is sometimes not possible for the small subset to be representative of the whole population.

3. The local adaptation phase slows down inference time.

4. The frequency of the sparse experience replay may not be sufficient to overcome catastrophic forgetting.

5. This method heavily depends on and requires a good key network to provide useful key representations.






