NetTailor: Tuning the architecture, not just the weights


Real-world applications of object recognition often require the solution of multiple tasks in a single platform. Under the standard paradigm of network fine-tuning, an entirely new CNN is learned per task, and the final network size is independent of task complexity. This is wasteful, since simple tasks require smaller networks than more complex tasks, and limits the number of tasks that can be solved simultaneously. To address these problems, we propose a transfer learning procedure, denoted NetTailor, in which layers of a pre-trained CNN are used as universal blocks that can be combined with small task-specific layers to generate new networks. Besides minimizing classification error, the new network is trained to mimic the internal activations of a strong unconstrained CNN, and minimize its complexity by the combination of 1) a soft-attention mechanism over blocks and 2) complexity regularization constraints. In this way, NetTailor can adapt the network architecture, not just its weights, to the target task. Experiments show that networks adapted to simple tasks, such as character or traffic sign recognition, become significantly smaller than those adapted to hard tasks, such as fine-grained recognition. More importantly, due to the modular nature of the procedure, this reduction in network complexity is achieved without compromise of either parameter sharing across tasks, or classification accuracy.
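The soft-attention mixing over blocks described in the abstract can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: blocks are modeled as plain functions on vectors, and the names (`nettailor_layer`, `complexity_penalty`) and the exact penalty form are assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def nettailor_layer(x, universal_block, proxies, logits):
    """Mix one universal block with candidate low-complexity proxy
    layers using soft-attention weights derived from learnable
    logits (one logit per candidate path)."""
    alphas = softmax(logits)
    paths = [universal_block] + list(proxies)
    return sum(a * f(x) for a, f in zip(alphas, paths))

def complexity_penalty(logits, costs, weight=0.1):
    """Expected path cost under the attention distribution; this
    regularizer pushes attention toward cheap proxy layers."""
    return weight * float(softmax(logits) @ np.asarray(costs, dtype=float))
```

With equal logits the layer averages the universal and proxy paths; as training lowers the penalty, attention mass shifts toward whichever path preserves accuracy at the lowest cost.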


Presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, 2019.




Schematic representation of the NetTailor procedure.


  1. Pre-train universal network on source task (ImageNet).
  2. Train teacher network by fine-tuning universal network on target task.
  3. Define student network by augmenting the universal network (gray blocks) with task-specific low-complexity proxy layers (blue blocks).
  4. Train proxy layers on target task to:
    1. Optimize classification performance.
    2. Mimic internal activations of the teacher.
    3. Satisfy complexity constraints that encourage the use of low-complexity layers.
  5. Prune layers with low impact on network performance.
  6. Fine-tune remaining task-specific parameters.
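The three objectives of step 4 can be sketched as a single weighted loss. A hedged illustration only: the function name, the L2 form of the activation-mimicking term, and the weights `w_mimic`/`w_complex` are assumptions, not the paper's exact formulation.

```python
import numpy as np

def student_loss(class_loss, student_acts, teacher_acts,
                 alphas, costs, w_mimic=1.0, w_complex=0.1):
    """Step 4's objective as one scalar:
    (4.1) task classification loss, supplied by the caller;
    (4.2) L2 distance between student and teacher internal
          activations at matched layers;
    (4.3) expected complexity of the paths selected by the
          soft-attention weights."""
    mimic = sum(float(np.mean((s - t) ** 2))
                for s, t in zip(student_acts, teacher_acts))
    complexity = float(np.dot(alphas, costs))
    return class_loss + w_mimic * mimic + w_complex * complexity
```

Minimizing this loss jointly trains the proxy layers and the attention logits; pruning (step 5) then amounts to dropping paths whose attention weights have collapsed toward zero.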

Source code and trained models are available on GitHub.


Final architectures produced by NetTailor on several datasets from the Visual Decathlon challenge. Gray boxes represent universal blocks. Blue boxes represent task-specific blocks. The attention weight associated with each block is encoded in its opacity. Hover over each image to see the evolution of attention weights during training.




DTD (Textures)

GTSR (Traffic Signs)

Omniglot (Characters)

SVHN (Digits)




Pedro Morgado

UC San Diego