Efficient Lifelong Learning with A-GEM
08 Jan 2019
Contributions
- A new (and more realistic) evaluation protocol for lifelong learning, where each data point is observed just once and disjoint sets of tasks are used for training and validation.
- A new metric that focuses on the efficiency of the models, in terms of sample complexity and computational (and memory) costs.
- A modification of Gradient Episodic Memory (GEM), called Averaged GEM (A-GEM), which reduces the computational overhead of GEM without compromising on the results.
- Empirical validation that using task descriptors helps lifelong learning models and improves their few-shot learning capabilities.
Learning Protocol
- Two groups of datasets: one for training and evaluation (DEV) and the other for cross-validation (DCV).
- Data can be sampled multiple times from the cross-validation datasets but only once from the training datasets.
- Each group of datasets (DEV or DCV) is a list of task-specific datasets Dk (where k is the task index).
- Each sample in Dk is of the form (x, t, y), where x is the data, t is the task descriptor and y is the output.
- Dk contains Bk minibatches of data.
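As a rough illustration (not code from the paper), the single-pass protocol over DEV can be sketched as follows; the `learner`, `observe` and `evaluate` names are hypothetical:

```python
# Hypothetical sketch of the single-pass protocol over the training stream DEV.
# Each task dataset D_k yields B_k minibatches of (x, t, y) triplets, and every
# training minibatch is seen exactly once, in task order.
def train_single_pass(learner, task_streams, evaluate):
    for k, stream in enumerate(task_streams):       # tasks arrive sequentially
        for x, t, y in stream:                      # B_k minibatches, one pass only
            learner.observe(x, t, y)                # single update per minibatch
        # After finishing task k, evaluate on all tasks seen so far; this yields
        # the accuracies a_{k, B_k, j} for j <= k used in the metrics below.
        evaluate(learner, tasks_seen=range(k + 1))
```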
Metrics
Accuracy
- a_{k, i, j} = accuracy on the test set of task j after training on the i-th minibatch of task k.
- A_k = mean over j = 1 to k of a_{k, B_k, j}, i.e. train the model on all of task k's data and then average its test accuracy over all tasks seen so far.
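A minimal numpy sketch of A_k, assuming the relevant accuracies have been collected into an array `acc_after_task[k, j] = a_{k, B_k, j}` with 0-indexed tasks (the array is an assumption about bookkeeping, not something defined in the paper):

```python
import numpy as np

def average_accuracy(acc_after_task: np.ndarray, k: int) -> float:
    """A_k: mean test accuracy over tasks 0..k after training on task k.

    acc_after_task[k, j] is assumed to hold a_{k, B_k, j}, i.e. the accuracy on
    task j once all B_k minibatches of task k have been processed.
    """
    return float(acc_after_task[k, : k + 1].mean())
```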
Forgetting Measure
- f_j^k = forgetting on task j after the model has been trained on all minibatches up to task k.
- f_j^k = max over l = 1 to k-1 of (a_{l, B_l, j} - a_{k, B_k, j})
- Forgetting F_k = mean over j = 1 to k-1 of f_j^k
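Under the same assumed bookkeeping array, a sketch of F_k:

```python
import numpy as np

def forgetting(acc_after_task: np.ndarray, k: int) -> float:
    """F_k: average forgetting over tasks 0..k-1 after training on task k.

    For each earlier task j, forgetting f_j^k is the gap between the best
    accuracy the model ever achieved on j (over l < k) and its accuracy on j
    after training on task k.
    """
    best_so_far = acc_after_task[:k, :k].max(axis=0)   # max over l < k, per task j < k
    f_jk = best_so_far - acc_after_task[k, :k]         # f_j^k for j = 0..k-1
    return float(f_jk.mean())
```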
LCA - Learning Curve Area
- Z_b = average b-shot performance, where b is the minibatch number.
- Z_b = mean over k = 1 to T of a_{k, b, k}
- LCA_β = mean over b = 0 to β of Z_b
- One special case is LCA_0 = Z_0, which is the forward transfer performance, i.e. the zero-shot performance on an unseen task.
- In the experiments, β is kept small, since the model is expected to learn from few examples.
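And a sketch of LCA_β, assuming a second array `acc_b_shot[k, b] = a_{k, b, k}` records the accuracy on task k right after its b-th minibatch (b = 0 being the zero-shot evaluation):

```python
import numpy as np

def lca(acc_b_shot: np.ndarray, beta: int) -> float:
    """LCA_beta: average of the b-shot performance Z_b for b = 0..beta.

    acc_b_shot[k, b] is assumed to hold a_{k, b, k}; Z_b averages it over the
    T tasks, and LCA_beta averages Z_b over the first beta + 1 minibatches.
    """
    Z = acc_b_shot[:, : beta + 1].mean(axis=0)   # Z_b for b = 0..beta
    return float(Z.mean())                       # (1 / (beta + 1)) * sum_b Z_b
```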
Model
- GEM has been shown to be very effective in the single-epoch setting, but it introduces a very high computational overhead, since every update needs a gradient for each previous task and a quadratic program over all of them.
- Averaged GEM (A-GEM) reduces this overhead by constraining the loss on a single batch sampled from the episodic memory, instead of enforcing one constraint per previous task over all the stored examples. This turns the update into a simple gradient projection (sketched below).
- While GEM provides better guarantees in terms of worst-case forgetting, A-GEM provides better guarantees in terms of average accuracy.
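The resulting A-GEM update is a single gradient projection. Here is a minimal numpy sketch, assuming the gradients have already been flattened into 1-D vectors:

```python
import numpy as np

def agem_gradient(g: np.ndarray, g_ref: np.ndarray) -> np.ndarray:
    """g is the gradient on the current task's minibatch, g_ref the gradient on
    a batch sampled from the episodic memory (both flattened into vectors).

    If g does not conflict with g_ref (non-negative dot product), it is used
    as-is; otherwise it is projected so the average memory loss is not increased.
    """
    dot = float(np.dot(g, g_ref))
    if dot >= 0.0:
        return g
    return g - (dot / float(np.dot(g_ref, g_ref))) * g_ref
```

GEM, in contrast, computes one gradient per previous task and solves a quadratic program over all of them at every step, which is where its extra cost comes from.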
Joint Embedding Model Using Compositional Task Descriptors
- Compositional task descriptors are used to speed up training on subsequent tasks.
- A matrix specifying the attribute values of the objects to be recognized in the task is used as the descriptor.
- A joint embedding space between image features and attribute embeddings is learned (a rough sketch is given below).
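A rough PyTorch-style sketch of such a joint embedding model (the linear projections, dimensions and names are assumptions for illustration; the paper's exact architecture may differ):

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Score each class by the dot product between the embedded image features
    and the embedding of that class's attribute vector (the task descriptor)."""

    def __init__(self, feat_dim: int, attr_dim: int, embed_dim: int):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)   # image-feature branch
        self.attr_proj = nn.Linear(attr_dim, embed_dim)  # attribute branch

    def forward(self, img_feats, class_attrs):
        # img_feats:   (batch, feat_dim) features from an image encoder
        # class_attrs: (num_classes, attr_dim) attribute matrix of the current task
        z_img = self.img_proj(img_feats)       # (batch, embed_dim)
        z_attr = self.attr_proj(class_attrs)   # (num_classes, embed_dim)
        return z_img @ z_attr.t()              # (batch, num_classes) class scores
```

Since the attribute branch is shared across tasks, classes in a new task that share attributes with earlier classes can be recognized from fewer examples, which is the intuition behind the improved few-shot performance.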
Experiments
Datasets
- Permuted MNIST, Split CIFAR, Split CUB and Split AWA.
Setup
- Integer task descriptors are used for MNIST and CIFAR, and class attributes are used as descriptors for CUB and AWA.
- Baselines include GEM, iCaRL, Elastic Weight Consolidation (EWC), Progressive Neural Networks, etc.
Results
- A-GEM outperforms the other models on all the datasets except MNIST, where Progressive Neural Networks lead. One reason could be that MNIST has a large number of training examples per task. However, Progressive Neural Networks make poor use of capacity, since the network grows with the number of tasks.
- While A-GEM and GEM have similar performance, GEM has a much higher computational and memory overhead.
- Using task descriptors improves accuracy for all the models.
- Overall, A-GEM offers a good trade-off between average accuracy and efficiency, in terms of sample complexity, memory requirements and computational cost.