# Session 1/2

In this reading group session: we review the LassoNet method, from its motivation to its theoretical justification.

In the next session: we review experimental results with the LassoNet and discuss potential applications and extensions of this algorithm.

The LassoNet paper by Ismael Lemhadri, Feng Ruan, Louis Abraham and Rob Tibshirani is to be published in JMLR. The paper website is available at https://lassonet.ml. Most images throughout this presentation are taken from the paper.

## Setup

• Supervised learning problem

• Dataset $D = (x_i,~y_i)_{1 \le i \le n}$

• $x_i \in \mathbb{R}^d$ input features
• $y_i \in \mathbb{R}$ responses to model
• Loss function $l(y',~y)$ to evaluate models (differentiable)

Task: Find a minimal set of features $k \subset [d]$ to model $y$

Problem: The relationship between the response $y$ and the input variables $x \in \mathbb{R}^d$ is nonlinear, but well modelled by a neural network function approximator.

Goal: Can we efficiently select variables $x_i$ when the mapping from $x_i$ to $y_i$ is a neural network?

## Approach

The LassoNet procedure that the authors propose can be broken down into 3 steps:

1. Augment the neural network model space with a skip connection

• Let $\{g_{W} ~|~ W \in \mathbb{R}^p \}$ be the selected space of neural network function approximators, with parameters $W \in \mathbb{R}^p$.

We assume that there exists some $W \in \mathbb{R}^p$ such that

$$\forall 1 \le i \le n, \quad y_i \approx g_W(x_i)$$

• Consider the larger space of models

$$\{f_{\theta,W} : x \mapsto g_W(x) + x^T\theta ~|~ \theta \in \mathbb{R}^d,~W \in \mathbb{R}^p \}$$

• As a picture:

Adapted from the LassoNet paper Figure 3.
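The augmented model can be sketched in a few lines of plain Python. This is only an illustration of $f_{\theta,W}(x) = g_W(x) + x^T\theta$: the one-hidden-layer architecture, the `tanh` activation and the names `g`, `f`, `W1`, `b1`, `w2`, `b2` are assumptions made for the example, not choices taken from the paper.

```python
import math

def g(W1, b1, w2, b2, x):
    """Hypothetical one-hidden-layer network g_W(x).

    W1[k][j] is the weight from input feature j to hidden unit k.
    The tanh activation is an illustrative assumption.
    """
    hidden = [
        math.tanh(sum(W1[k][j] * x[j] for j in range(len(x))) + b1[k])
        for k in range(len(W1))
    ]
    return sum(w2[k] * hidden[k] for k in range(len(hidden))) + b2

def f(theta, W1, b1, w2, b2, x):
    """Augmented model: network output plus the linear skip connection x^T theta."""
    skip = sum(theta[j] * x[j] for j in range(len(x)))
    return g(W1, b1, w2, b2, x) + skip
```

When every network weight is zero, `f` reduces to the linear model $x^T\theta$, which is one way to see that the augmented space contains both the linear models and the original networks (taking $\theta = 0$).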

2. Define a sparsity-inducing objective function to select variables in those models

$$\min_{\theta,W} \quad L(\theta,~W) + \lambda ||\theta||_1\\ \text{ }\\ \text{s.t.}~\quad \forall 1 \le j \le d,\quad ||W_j^{(1)}||_{\infty} \le M |\theta_j|$$

where

$$L(\theta,~W) = \frac{1}{n} \sum_{1 \le i \le n} l(f_{\theta,W}(x_i), y_i)$$

• $M > 0$ is a hyperparameter, the hierarchy coefficient
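To make the objective and its constraint concrete, here is a minimal sketch that evaluates both for given parameters. It assumes a squared-error loss for $l$ and reads $W_j^{(1)}$ as the first-layer weights attached to input feature $j$ (column $j$ of a matrix stored as `W1[k][j]`). The function names are hypothetical and this is not the paper's training code.

```python
def l1_norm(theta):
    """||theta||_1, the Lasso penalty on the skip-connection weights."""
    return sum(abs(t) for t in theta)

def penalized_objective(preds, ys, theta, lam):
    """L(theta, W) + lam * ||theta||_1, with squared-error loss (an assumption).

    preds[i] stands for f_{theta,W}(x_i), computed elsewhere.
    """
    n = len(ys)
    data_loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
    return data_loss + lam * l1_norm(theta)

def hierarchy_satisfied(W1, theta, M):
    """Check ||W_j^(1)||_inf <= M * |theta_j| for every input feature j.

    W1[k][j] is the first-layer weight from feature j to hidden unit k.
    """
    for j in range(len(theta)):
        largest = max(abs(row[j]) for row in W1)
        if largest > M * abs(theta[j]):
            return False
    return True
```

In particular, `hierarchy_satisfied` can only return `True` if every feature with $\theta_j = 0$ also has all of its first-layer weights equal to zero.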

Discussion

1. Penalize skip connection weights $\theta$ with a Lasso-like L1 penalty

→ Only a subset $k \subset [d]$ of input variables will have a non-zero weight $\theta_j$ in the skip connection, for high enough $\lambda$.
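One standard way to see why an L1 penalty produces exact zeros is the soft-thresholding operator, which is the proximal operator of $\lambda |\cdot|$. The sketch below applies it coordinate-wise to a made-up vector `theta` and lists the surviving indices. It illustrates the mechanism only and is not presented as the optimization procedure used in the paper. The names `soft_threshold` and `selected_features` are hypothetical.

```python
def soft_threshold(v, lam):
    """Proximal operator of lam * |.|: shrink v toward 0 by lam, clipping at 0."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def selected_features(theta, lam):
    """Indices j whose weight theta_j remains non-zero after soft-thresholding."""
    shrunk = [soft_threshold(t, lam) for t in theta]
    return [j for j, t in enumerate(shrunk) if t != 0.0]

theta = [2.0, -0.3, 0.8, 0.05]  # made-up skip-connection weights
for lam in (0.1, 0.5, 1.0):
    print(lam, selected_features(theta, lam))
```

As `lam` grows, fewer coordinates survive the threshold, which mirrors the statement that only a subset of input variables keeps a non-zero $\theta_j$ for high enough $\lambda$.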