Efficient Lifelong Learning with A-GEM
08 Jan 2019
Contributions
- A new (and more realistic) evaluation protocol for lifelong learning, where each data point is observed just once and disjoint sets of tasks are used for training and validation.
- A new metric that focuses on the efficiency of the models, in terms of sample complexity and computational (and memory) costs.
- A modification of Gradient Episodic Memory (GEM), called Averaged GEM (A-GEM), which reduces the computational overhead of GEM without compromising on the results.
- Empirical validation that using task descriptors helps lifelong learning models and improves their few-shot learning capabilities.
Learning Protocol
- Two groups of datasets: one for training and evaluation (DEV) and the other for cross-validation (DCV).
- Data can be sampled multiple times from the cross-validation datasets but only once from the training datasets.
- Each group of datasets (DEV or DCV) is a list of task-specific datasets Dk (where k is the task index).
- Each sample in Dk is of the form (x, t, y), where x is the data, t is the task descriptor and y is the output.
- Dk contains Bk minibatches of data.
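As a rough illustration (not code from the paper), the single-pass protocol over DEV can be sketched as follows; the `learner`, `observe` and `evaluate` names are hypothetical:

```python
# Hypothetical sketch of the single-pass protocol over the training stream DEV.
# Each task dataset D_k yields B_k minibatches of (x, t, y) triplets, and every
# training minibatch is seen exactly once, in task order.
def train_single_pass(learner, task_streams, evaluate):
    for k, stream in enumerate(task_streams):       # tasks arrive sequentially
        for x, t, y in stream:                      # B_k minibatches, one pass only
            learner.observe(x, t, y)                # single update per minibatch
        # After finishing task k, evaluate on all tasks seen so far; this yields
        # the accuracies a_{k, B_k, j} for j <= k used in the metrics below.
        evaluate(learner, tasks_seen=range(k + 1))
```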
Metrics
Accuracy
- a_{k, i, j} = accuracy on the test set of task j after training on the i-th minibatch of task k.
- A_k = mean over j = 1 to k of a_{k, B_k, j}, i.e. train the model on all of task k's data and then average its test accuracy over all tasks seen so far.
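A minimal numpy sketch of A_k, assuming the relevant accuracies have been collected into an array `acc_after_task[k, j] = a_{k, B_k, j}` with 0-indexed tasks (the array is an assumption about bookkeeping, not something defined in the paper):

```python
import numpy as np

def average_accuracy(acc_after_task: np.ndarray, k: int) -> float:
    """A_k: mean test accuracy over tasks 0..k after training on task k.

    acc_after_task[k, j] is assumed to hold a_{k, B_k, j}, i.e. the accuracy on
    task j once all B_k minibatches of task k have been processed.
    """
    return float(acc_after_task[k, : k + 1].mean())
```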
Forgetting Measure
- f_j^k = forgetting on task j after the model has been trained on all minibatches up to task k.
- f_j^k = max over l = 1 to k-1 of (a_{l, B_l, j} - a_{k, B_k, j})
- Forgetting F_k = mean over j = 1 to k-1 of f_j^k
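Under the same assumed bookkeeping array, a sketch of F_k:

```python
import numpy as np

def forgetting(acc_after_task: np.ndarray, k: int) -> float:
    """F_k: average forgetting over tasks 0..k-1 after training on task k.

    For each earlier task j, forgetting f_j^k is the gap between the best
    accuracy the model ever achieved on j (over l < k) and its accuracy on j
    after training on task k.
    """
    best_so_far = acc_after_task[:k, :k].max(axis=0)   # max over l < k, per task j < k
    f_jk = best_so_far - acc_after_task[k, :k]         # f_j^k for j = 0..k-1
    return float(f_jk.mean())
```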
LCA - Learning Curve Area
- Z_b = average b-shot performance, where b is the minibatch number.
- Z_b = mean over k = 1 to T of a_{k, b, k}
- LCA_β = mean over b = 0 to β of Z_b
- One special case is LCA_0 = Z_0, which is the forward transfer performance, i.e. the zero-shot performance on an unseen task.
- In the experiments, β is kept small, since the model is expected to learn from few examples.
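And a sketch of LCA_β, assuming a second array `acc_b_shot[k, b] = a_{k, b, k}` records the accuracy on task k right after its b-th minibatch (b = 0 being the zero-shot evaluation):

```python
import numpy as np

def lca(acc_b_shot: np.ndarray, beta: int) -> float:
    """LCA_beta: average of the b-shot performance Z_b for b = 0..beta.

    acc_b_shot[k, b] is assumed to hold a_{k, b, k}; Z_b averages it over the
    T tasks, and LCA_beta averages Z_b over the first beta + 1 minibatches.
    """
    Z = acc_b_shot[:, : beta + 1].mean(axis=0)   # Z_b for b = 0..beta
    return float(Z.mean())                       # (1 / (beta + 1)) * sum_b Z_b
```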
Model
- GEM has been shown to be very effective in the single-epoch setting, but it introduces a very high computational overhead, since every update needs a gradient for each previous task and a quadratic program over all of them.
- Averaged GEM (A-GEM) reduces this overhead by constraining the loss on a single batch sampled from the episodic memory, instead of enforcing one constraint per previous task over all the stored examples. This turns the update into a simple gradient projection (sketched below).
- While GEM provides better guarantees in terms of worst-case forgetting, A-GEM provides better guarantees in terms of average accuracy.
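The resulting A-GEM update is a single gradient projection. Here is a minimal numpy sketch, assuming the gradients have already been flattened into 1-D vectors:

```python
import numpy as np

def agem_gradient(g: np.ndarray, g_ref: np.ndarray) -> np.ndarray:
    """g is the gradient on the current task's minibatch, g_ref the gradient on
    a batch sampled from the episodic memory (both flattened into vectors).

    If g does not conflict with g_ref (non-negative dot product), it is used
    as-is; otherwise it is projected so the average memory loss is not increased.
    """
    dot = float(np.dot(g, g_ref))
    if dot >= 0.0:
        return g
    return g - (dot / float(np.dot(g_ref, g_ref))) * g_ref
```

GEM, in contrast, computes one gradient per previous task and solves a quadratic program over all of them at every step, which is where its extra cost comes from.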
Joint Embedding Model Using Compositional Task Descriptors
- Compositional task descriptors are used to speed up training on subsequent tasks.
- A matrix specifying the attribute values of the objects to be recognized in the task is used as the descriptor.
- A joint embedding space between image features and attribute embeddings is learned (a rough sketch is given below).
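A rough PyTorch-style sketch of such a joint embedding model (the linear projections, dimensions and names are assumptions for illustration; the paper's exact architecture may differ):

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Score each class by the dot product between the embedded image features
    and the embedding of that class's attribute vector (the task descriptor)."""

    def __init__(self, feat_dim: int, attr_dim: int, embed_dim: int):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)   # image-feature branch
        self.attr_proj = nn.Linear(attr_dim, embed_dim)  # attribute branch

    def forward(self, img_feats, class_attrs):
        # img_feats:   (batch, feat_dim) features from an image encoder
        # class_attrs: (num_classes, attr_dim) attribute matrix of the current task
        z_img = self.img_proj(img_feats)       # (batch, embed_dim)
        z_attr = self.attr_proj(class_attrs)   # (num_classes, embed_dim)
        return z_img @ z_attr.t()              # (batch, num_classes) class scores
```

Since the attribute branch is shared across tasks, classes in a new task that share attributes with earlier classes can be recognized from fewer examples, which is the intuition behind the improved few-shot performance.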
Experiments
Datasets
- Permuted MNIST, Split CIFAR, Split CUB and Split AWA.
Setup
- Integer task descriptors are used for MNIST and CIFAR, and class attributes are used as descriptors for CUB and AWA.
- Baselines include GEM, iCaRL, Elastic Weight Consolidation (EWC), Progressive Neural Networks, etc.
Results
- A-GEM outperforms the other models on all the datasets except MNIST, where Progressive Neural Networks lead. One reason could be that MNIST has a large number of training examples per task. However, Progressive Neural Networks make poor use of capacity, since the network grows with the number of tasks.
- While A-GEM and GEM have similar performance, GEM has a much higher computational and memory overhead.
- Using task descriptors improves accuracy for all the models.
- Overall, A-GEM offers a good trade-off between average accuracy and efficiency, in terms of sample complexity, memory requirements and computational cost.