Modular meta-learning
22 Jan 2019

Introduction
- The paper proposes an approach for learning neural networks (modules) that can be combined in different ways to solve different tasks (combinatorial generalization).
- The proposed model is called BOUNCEGRAD.
Setup
- Focuses on supervised learning.
- Tasks are drawn from a task distribution p(T).
- Each task is a joint distribution p_T(x, y) over (x, y) data pairs.
- Given data from m meta-training tasks and a meta-test task, find a hypothesis h that performs well on unseen data drawn from the meta-test task.
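As a concrete (hypothetical) instance of this setup, the snippet below samples sine-regression tasks of the kind used in the experiments later in these notes; the amplitude and phase ranges are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def sample_task(rng):
    """Sample one task T ~ p(T): here, a random sine function (illustrative parameters)."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)
    def sample_pairs(n):
        """Draw n (x, y) pairs from the task's joint distribution p_T(x, y)."""
        x = rng.uniform(-5.0, 5.0, size=(n, 1))
        return x, amplitude * np.sin(x + phase)
    return sample_pairs

rng = np.random.default_rng(0)
meta_train_tasks = [sample_task(rng) for _ in range(32)]  # m = 32 meta-training tasks
meta_test_task = sample_task(rng)                         # one unseen meta-test task
x_online, y_online = meta_test_task(10)                   # small online training set for the meta-test task
```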
Structured Hypothesis
- Given a compositional scheme C, a set of modules F1, …, Fk (collectively F), and their respective parameters θ1, …, θk (collectively θ), the triple (C, F, θ) defines the set of possible functional input-output mappings. These mappings form the hypothesis space.
- A structured hypothesis model is specified by which modules to use and their parametric forms (but not the parameter values).
Examples of compositional schemes
- Choosing a single module for the task at hand.
- A fixed compositional structure, but with different modules selected every time.
- A weighted ensemble of modules (possibly using an attention mechanism).
- A general function-composition tree.
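A rough sketch of what two of these schemes could look like with small PyTorch modules; the module architecture, the number of modules, and the composition depth are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def make_module():
    # One module F_i with its own parameters θ_i (the architecture is an assumption).
    return nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

modules = nn.ModuleList([make_module() for _ in range(4)])  # F_1, ..., F_k with k = 4

# Scheme: choose a single module for the task at hand.
def single_module(structure, x):
    return modules[structure](x)

# Scheme: a fixed composition chain whose slots are filled by chosen modules,
# e.g. structure = (2, 0) denotes the hypothesis x -> F_0(F_2(x)).
def composed(structure, x):
    h = x
    for idx in structure:
        h = modules[idx](h)
    return h

x = torch.randn(5, 1)
print(single_module(3, x).shape)   # torch.Size([5, 1])
print(composed((2, 0), x).shape)   # torch.Size([5, 1])
```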
Phases
- Offline meta-learning phase:
  - Take the training and validation data for the meta-training tasks and generate a parameterization for each module, θ1, …, θk.
  - The hypothesis (or composition) to use comes from the online meta-test learning phase.
  - In this phase, find the best θ given a structure.
- Online meta-test learning phase:
  - Given a hypothesis space and θ, the output is a compositional form (or hypothesis) that specifies how to compose the modules.
  - In this phase, find the best structure given a hypothesis space and θ (a sketch of both phases follows this list).
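The sketch below shows one way the two phases could be wired together, reusing the hypothetical modules and compose function from the earlier snippet; the exhaustive scoring of candidate structures in the online step is a stand-in for the annealed search described in the next section.

```python
import torch

def offline_step(modules, task_batches, structures, compose, optimizer):
    """Offline meta-learning phase: one gradient step on θ (all module parameters),
    given the current structure chosen for each meta-training task."""
    optimizer.zero_grad()
    loss = 0.0
    for (x, y), structure in zip(task_batches, structures):
        loss = loss + torch.mean((compose(structure, x) - y) ** 2)
    loss.backward()
    optimizer.step()

def online_phase(candidate_structures, compose, x, y):
    """Online meta-test learning phase: with θ fixed, return the structure
    (compositional form) that best fits the meta-test task's training data."""
    with torch.no_grad():
        losses = [torch.mean((compose(s, x) - y) ** 2) for s in candidate_structures]
    return candidate_structures[int(torch.argmin(torch.stack(losses)))]
```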
Learning Algorithm
- During the meta-test learning phase, simulated annealing is used to find the optimal structure, with the temperature T decreased over time (see the sketch after this list).
- During the meta-learning phase, the actual objective function is replaced by a smooth surrogate objective (during the search step) to avoid local minima.
- Once a structure has been picked, any gradient-descent-based approach can be used to optimize the modules.
- The state of the optimization process comprises the parameters and the temperature. Together, they induce a distribution over structures. Given a structure, θ is optimized and T is annealed over time.
- The learning procedure can be improved by performing parameter tuning during the online (meta-test learning) phase as well; the resulting approach is referred to as MOMA (MOdular MAml).
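A minimal sketch of the annealed structure search, under the assumption that a structure is a tuple of module indices and a proposal swaps one index; the geometric temperature decay and step count are illustrative, and the smooth surrogate objective from the paper is not modelled here.

```python
import math
import random
import torch

def anneal_structure(structure, compose, x, y, num_modules, steps=200, T=1.0, decay=0.99):
    """Simulated-annealing search over structures with θ held fixed."""
    def loss(s):
        with torch.no_grad():
            return float(torch.mean((compose(s, x) - y) ** 2))

    current, current_loss = list(structure), loss(structure)
    for _ in range(steps):
        proposal = list(current)
        proposal[random.randrange(len(proposal))] = random.randrange(num_modules)
        proposal_loss = loss(proposal)
        # Metropolis-style acceptance: always keep improvements, occasionally accept worse moves.
        if proposal_loss < current_loss or random.random() < math.exp((current_loss - proposal_loss) / T):
            current, current_loss = proposal, proposal_loss
        T *= decay  # anneal the temperature over time
    return current
```

During training this search is interleaved with the gradient steps on θ, so the structures and the module parameters improve together.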
Experiments
Approaches
- Pooled: a single network trained on the combined data from all tasks.
- MAML: a single network trained with MAML.
- BOUNCEGRAD: modular networks without MAML adaptation during online learning.
- MOMA: BOUNCEGRAD with MAML adaptation during online learning (a sketch of this adaptation step follows the list).
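A sketch of the extra adaptation step that distinguishes MOMA from BOUNCEGRAD: after the structure search, a few gradient steps are taken on the module parameters using the meta-test task's online training data (the step count and learning rate below are assumptions).

```python
import torch

def adapt_parameters(modules, compose, structure, x, y, steps=5, lr=1e-2):
    """MAML-style fine-tuning of θ for the chosen structure at meta-test time."""
    optimizer = torch.optim.SGD(modules.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = torch.mean((compose(structure, x) - y) ** 2)
        loss.backward()
        optimizer.step()
```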
Domains
Simple Functional Relationships
- Sine-function prediction problem.
- In general, MOMA outperforms the other models.
- With a small amount of online training data, BOUNCEGRAD outperforms the other models, as it has a better structural prior.
Predicting the result of pushing an object
- 11 different objects (with different shapes) on 4 surfaces with different friction properties.
- Two meta-learning scenarios are considered: in the first, the object-surface combination from the test task was present in some meta-training task; in the second, it was not.
- For previously seen combinations, MOMA performs best, followed by BOUNCEGRAD and MAML.
- For unseen combinations, all three are equally good.
- The compositional scheme is the attention-based weighted ensemble.
- An interesting result is that the modules seem to specialize (and activate more often) based on the shape of the object.
Predicting the next frame of a kinematic skeleton (using motion capture data)
- Composition structure: kinematic subtrees are generated for each body part (2 legs, 2 arms, 2 torsi).
- Again, two setups are used: one where all activities are shared between the training tasks and the meta-test task, and another where the activities are not shared.
- For known activities, MOMA and BOUNCEGRAD perform best, while for unknown activities, MOMA performs best.
Notes
- While the approach is interesting, a set of tasks better suited to composition would make the evaluation more convincing.
- It would be useful to see the computational trade-off between MAML, BOUNCEGRAD, and MOMA.