THREE SIMPLE HEURISTIC MATHEMATICAL PROOFS IN LASSO THEORY

Three relevant facts about the least absolute shrinkage and selection operator (Lasso) are studied: the estimates follow piecewise linear curves as functions of the tuning parameter, the number of nonzero selected covariates is an unbiased estimator of the Lasso's degrees of freedom, and when the number of covariates p is greater than the number of observations n at most n covariates are selected. These results are well known and described in the literature, but without simple demonstrations. Based on a geometrical approach, we present simple and intuitive heuristic proofs of these results.


1 Introduction
Suppose the usual regression situation: data $(x_i, y_i)$, $i = 1, \ldots, n$, where $x_i = (x_{i1}, \ldots, x_{ip})$ is a vector of predictor variables and $y_i$ is the corresponding response. Assume, as usual, that the observations are independent and that the covariates are standardized: $\frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0$ and $\frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1$. Tibshirani (1996) defines the Lasso estimate as the solution of the quadratic convex optimization problem

$$\hat\beta_{Lasso}(t) = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le t.$$

The parameter restriction defines, for each t, a convex diamond-shaped region K in $\mathbb{R}^p$. We can think of the n×p design matrix $X = (x_{ij})$ as a linear transformation from the Euclidean space $\mathbb{R}^p$ to $\mathbb{R}^n$. To avoid generalized inverses, we will suppose that X is injective. In this case we have the following geometrical set-up: the image of $\mathbb{R}^p$ under the linear transformation X is a p-dimensional subspace of $\mathbb{R}^n$, and the image of the convex subset K is the convex subset $K_p = X(K)$.
To obtain the Lasso estimate we have to find the point in $K_p$ closest to the data vector y. To do this we project y orthogonally onto the image subspace of X ($y^*_P = P_{Im(X)}\, y$) and then find in $K_p$ the point $y_p$ closest to $y^*_P$. As X is injective, the pre-image of this point defines the estimate $\hat\beta_{Lasso}(t)$. In the parameter space, this is equivalent to finding in K the point closest, in the Mahalanobis distance induced by the inner product $\langle \beta_1, \beta_2 \rangle_m = \beta_1^\top X^\top X\, \beta_2$, to the ordinary least squares estimate $\hat\beta_{ols}$. This can be done by constructing a family of growing ellipsoids centered at $\hat\beta_{ols}$ until one of them reaches a tangent point on K (see Figure 1). The Lasso estimate shrinks the coefficients towards zero as t goes to zero. Another main characteristic is that, with high probability, some coefficients are set exactly equal to zero, since $\hat\beta_{Lasso}(t)$ may occur on a singular face. Therefore the estimation process is also a model selection process.
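This equivalence between minimizing the residual sum of squares and minimizing the Mahalanobis distance to $\hat\beta_{ols}$ can be checked numerically. A minimal NumPy sketch (the design matrix, response, and test point are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 3
X = rng.standard_normal((n, p))        # injective design matrix (full column rank)
y = rng.standard_normal(n)

# ordinary least squares estimate
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# For any beta:
# ||X beta - y||^2 = ||y - X beta_ols||^2 + (beta - beta_ols)' X'X (beta - beta_ols)
beta = rng.standard_normal(p)
lhs = np.sum((X @ beta - y) ** 2)
d = beta - beta_ols
rhs = np.sum((y - X @ beta_ols) ** 2) + d @ (X.T @ X) @ d
print(np.allclose(lhs, rhs))           # the two criteria differ only by a constant in beta
```

The identity holds exactly because the residual $y - X\hat\beta_{ols}$ is orthogonal to Im(X), so minimizing over the constraint set K in either metric yields the same point.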
The organization of this paper is as follows. In Section 2 we point out that, as functions of the tuning parameter t, the coordinate curves of $\hat\beta_{Lasso}(t)$ are piecewise linear, and we present a simple and intuitive proof of this fact. In Section 3 we review Stein's unbiased risk estimation and, with a very simple mathematical approach, obtain the known unbiased estimator of the degrees of freedom of the Lasso. In Section 4 we prove that if the number n of observations is less than the number p of covariates, then the Lasso selects at most n covariates.
2 Lasso trace curves are piecewise linear

Tibshirani (1996) showed that in the orthogonal case the Lasso trace curves are piecewise linear. Efron et al. (2004) presented a new model selection algorithm, named Least Angle Regression (LARS), which is piecewise linear by construction. The authors observed that the same geometry underlying the algorithm applies to the Lasso, despite the fact that the two methods seem to be quite different.
Since the authors could not find a simple approach to this fundamental property in the literature, a very elementary proof, although incomplete, using only undergraduate analytic geometry, will be presented.
There are two families of hypersurfaces in $\mathbb{R}^p$: the family of parallel ellipsoids centered at $(a_1, \ldots, a_p)$ and the family of diamond-shaped convex sets of the form

$$|\beta_1| + \cdots + |\beta_p| = t,$$

where t is a parameter. Each ellipsoid has exactly two tangent points with the hyperplanes that contain the faces of the diamond sets: both hypersurfaces are boundaries of convex subsets of $\mathbb{R}^p$, and it is intuitive, by the convexity of these two subsets, that they are tangent at two points. Of course, we consider only regular points, that is, points in hyperfaces of the diamond set, to avoid the measure theory that would be necessary if we considered singular faces. We will be concerned only with the tangent point closest to the origin. It is a typical problem in mathematical analysis to show that the tangent points between these two families define a smooth curve, which is, by definition, $\hat\beta_{Lasso}(t)$. We have to show that this curve is a straight line. First we will suppose that the family of ellipsoids has principal axes parallel to the coordinate axes.
Therefore the family of ellipsoids is of the form

$$\frac{(\beta_1 - a_1)^2}{r_1^2} + \cdots + \frac{(\beta_p - a_p)^2}{r_p^2} = r,$$

where $r_1, \ldots, r_p$ are fixed numbers and $r > 0$ is the family parameter. Clearly, for each r there are only two values of t such that the ellipsoid and the hyperplane are tangent. We will consider only the tangent point closest to the origin and will suppose also that this tangent point is in a $(p-1)$-dimensional face of the diamond. At the tangency point the hyperplane and the ellipsoid have a common normal vector. If this tangent point has positive coordinates, a normal vector of the hyperplane is $(1, \ldots, 1)$, and this vector is also normal to the ellipsoid. Let $(\beta_1(s), \ldots, \beta_p(s))$ be a curve in the ellipsoid such that $(\beta_1(0), \ldots, \beta_p(0)) = \hat\beta_{Lasso}(t)$ is the tangent point. By implicit differentiation of the ellipsoid equation it follows that

$$\sum_{j=1}^{p} \frac{\beta_j(0) - a_j}{r_j^2}\, \beta_j'(0) = 0.$$

Hence the vector $\Big(\frac{\beta_1(0) - a_1}{r_1^2}, \ldots, \frac{\beta_p(0) - a_p}{r_p^2}\Big)$ is perpendicular to the tangent vector $(\beta_1'(0), \ldots, \beta_p'(0))$. As this tangent vector is a generic vector of the tangent space of the ellipsoid, the normal vector is necessarily parallel to $(1, \ldots, 1)$. That is,

$$\frac{\beta_j(0) - a_j}{r_j^2} = \alpha, \qquad j = 1, \ldots, p.$$

As α depends on the tangent point $\hat\beta_{Lasso}(t)$, it is also a function of t, and the tangent point satisfies the equation

$$\hat\beta_{Lasso}(t) = (a_1, \ldots, a_p) + \alpha(t)\,(r_1^2, \ldots, r_p^2).$$

This shows that the tangent points $\hat\beta_{Lasso}(t)$ lie on a straight line. If the ellipsoids do not have their principal axes parallel to the coordinate axes, a new coordinate system can be built in such a way that, in the new coordinates, the ellipsoids have parallel axes; the faces of the diamond are mapped to new hyperplanes, the same argument applies, and again we obtain a straight line in this general situation. That is, we have the situation described in Figure 3. As the tuning parameter t varies, the tangent point may move from a $(p-1)$-dimensional face of the diamond to a lower-dimensional face. In this case the normal vector changes; for example, for a $(p-3)$-dimensional face the normal vector becomes $(1, 1, 0, 0, 1, \ldots, 1)$, and the same argument shows that the tangent point moves along a straight line with direction $(r_1^2, r_2^2, 0, 0, r_5^2, \ldots, r_p^2)$.

Again we have a straight line, but with a new direction. This fully describes the behavior of the Lasso trace curves.
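In the orthogonal case the piecewise linearity is visible directly in the soft-thresholding form of the estimator. A small numerical sketch, with illustrative OLS coefficients, parametrizing the path by the soft-threshold level γ (which is in one-to-one correspondence with t):

```python
import numpy as np

def soft(b, gamma):
    """Soft-thresholding: the orthogonal-case Lasso coordinate map."""
    return np.sign(b) * np.maximum(np.abs(b) - gamma, 0.0)

beta_ols = np.array([3.0, 1.0, -2.0])   # illustrative OLS coefficients
gammas = np.linspace(0.0, 3.5, 71)
path = np.array([soft(beta_ols, g) for g in gammas])  # one row per gamma

# Each coordinate curve is piecewise linear with a single breakpoint at |beta_j|:
# second differences along the path vanish away from the kink.
second_diff = np.diff(path, n=2, axis=0)
print(np.sum(np.abs(second_diff) > 1e-6, axis=0))  # at most one kink per coordinate
```

Each coordinate decays linearly until it hits zero and then stays there, exactly the piecewise linear trace described above.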

3 Degrees of freedom
If a model, for example an ordinary linear regression, fits some data y, producing an estimate $\hat\mu = m(y)$, $m : \mathbb{R}^n \to \mathbb{R}^n$, the question of how well $m(y)$ will predict a future dataset, independently generated from the same random mechanism that produced y, is probably the main question to be answered. This is the prediction error, and it equals the expectation of the fitting error plus a penalty related to the covariance between the data y and the model $m(y)$ (EFRON, 2004). This leads to the concept of degrees of freedom (df) as a covariance penalty.
Definition: The degrees of freedom of a model $\hat\mu = m(y)$ is defined as

$$df(m) = \frac{1}{\sigma^2} \sum_{i=1}^{n} \mathrm{Cov}\big(m_i(y),\, y_i\big),$$

where $\sigma^2$ is the error variance.
In the linear case, $\hat\mu = My$, where M is an n × n matrix, the degrees of freedom is the trace of M. In the usual regression or analysis of variance (Anova) setting, M is a projection matrix and therefore $\mathrm{trace}(M) = p$, the dimension of the projection space, that is, the rank of M, agreeing with the usual definition of degrees of freedom.
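The claim $\mathrm{trace}(M) = p$ for a projection smoother is easy to verify numerically. A minimal sketch with an arbitrary full-rank design:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 4
X = rng.standard_normal((n, p))              # full column rank with probability 1

# Hat matrix of ordinary least squares: M = X (X'X)^{-1} X'
M = X @ np.linalg.solve(X.T @ X, X.T)

# M is a projection (M^2 = M) and its trace equals its rank p
print(np.allclose(M @ M, M), np.trace(M))
```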
The degrees of freedom is a population parameter and has to be estimated. For this, we use the multidimensional version of the classical Stein's lemma. Under very reasonable mathematical conditions on $m(y)$ we have

$$df(m) = E\big[\mathrm{div}\, m(y)\big], \qquad \mathrm{div}\, m(y) = \sum_{i=1}^{n} \frac{\partial m_i(y)}{\partial y_i},$$

and then $\mathrm{div}\, m(y)$ is an unbiased estimator of the degrees of freedom.
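For a linear smoother $m(y) = My$ the lemma can be illustrated by Monte Carlo: the covariance definition of df should match tr(M), which is also the (constant) divergence. A sketch assuming a ridge-type smoother matrix; the dimensions, true mean, and penalty are all illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 8, 3, 1.0
X = rng.standard_normal((n, p))
mu = X @ np.array([1.0, -2.0, 0.5])               # true mean (illustrative)

lam = 2.0                                         # ridge penalty (arbitrary)
M = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
df_trace = np.trace(M)                            # divergence of y -> My

# Monte Carlo estimate of (1/sigma^2) * sum_i Cov(m_i(y), y_i)
reps = 200_000
Y = mu + sigma * rng.standard_normal((reps, n))   # each row is one dataset
Mhat = Y @ M.T                                    # m(y) for every replicate
cov_sum = np.sum(np.mean((Mhat - Mhat.mean(0)) * (Y - mu), axis=0))
df_mc = cov_sum / sigma**2
print(df_trace, df_mc)                            # the two values should be close
```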

Degrees of freedom for the Lasso
It is well known that the number of nonzero covariates selected by the Lasso is an unbiased estimator of the degrees of freedom of the Lasso. Here the model $\hat\mu = m(y)$ is given by $m(y) = P_{K_p}(y) = y_p$, where $P_{K_p}$ is the minimum-distance projection of the data y onto the convex set $K_p$. To calculate $\mathrm{div}(P_{K_p}(y))$ we will follow Kato (2009).
Let $P_K : \mathbb{R}^p \to K$ be the minimum Mahalanobis distance projection onto the convex set K. Therefore

$$m(y) = X\, P_K\big(\hat\beta_{ols}(y)\big), \qquad \hat\beta_{ols}(y) = (X^\top X)^{-1} X^\top y.$$

To compute the divergence of $m(y)$ we use the chain rule. If $f : \mathbb{R}^n \to \mathbb{R}^n$, then $df(x)$ is a linear transformation $df(x) : \mathbb{R}^n \to \mathbb{R}^n$. The divergence does not depend on coordinates and is given by $\mathrm{div} f(x) = \mathrm{tr}(df(x))$. If f is the linear transformation $f(y) = My$, then $df(y) = M$ and $\mathrm{div} f(y) = \mathrm{tr}(M)$. For a composition $g \circ f : \mathbb{R}^n \xrightarrow{f} \mathbb{R}^p \xrightarrow{g} \mathbb{R}^n$, the derivative is the composition of linear transformations, $d(g \circ f)(x) = dg(f(x))\, df(x)$. Thus

$$dm(y) = X \cdot dP_K\big(\hat\beta_{ols}(y)\big) \cdot (X^\top X)^{-1} X^\top.$$

It follows that

$$\mathrm{div}\, m(y) = \mathrm{tr}\Big( X\, dP_K\big(\hat\beta_{ols}(y)\big)\, (X^\top X)^{-1} X^\top \Big) = \mathrm{tr}\Big( dP_K\big(\hat\beta_{ols}(y)\big)\, (X^\top X)^{-1} X^\top X \Big) = \mathrm{tr}\Big( dP_K\big(\hat\beta_{ols}(y)\big) \Big).$$
Therefore, the divergence of $m(y)$ is equal to the divergence of the projection $P_K$, with respect to the variable β, evaluated at the point $\hat\beta_{ols}(y)$.
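The trace step used here, $\mathrm{tr}(X A (X^\top X)^{-1} X^\top) = \mathrm{tr}(A)$, is just cyclicity of the trace; a quick numerical check (the matrix A stands in for the derivative $dP_K$ and is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 9, 4
X = rng.standard_normal((n, p))      # injective: full column rank
A = rng.standard_normal((p, p))      # stands in for dP_K at beta_ols(y)

XtX_inv = np.linalg.inv(X.T @ X)
lhs = np.trace(X @ A @ XtX_inv @ X.T)
print(np.isclose(lhs, np.trace(A)))  # cyclic trace: tr(X A (X'X)^{-1} X') = tr(A)
```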
In the orthogonal case, $X^\top X = I$, the Mahalanobis metric is the Euclidean metric and an explicit formula for the projection $P_K$ is possible, given by the soft-thresholding map

$$\big(P_K(\beta)\big)_j = \mathrm{sign}(\beta_j)\,\big(|\beta_j| - \gamma\big)_+,$$

where γ is a constant.
In this case the Lasso estimator is

$$\hat\beta_{Lasso,j}(t) = \mathrm{sign}\big(\hat\beta_{ols,j}\big)\,\big(|\hat\beta_{ols,j}| - \gamma\big)_+,$$

where γ depends on the value of t.
With this formula it is possible to calculate $\mathrm{div}\, P_K(\hat\beta_{ols}(y))$: each coordinate of $P_K$ has partial derivative 1 where $|\hat\beta_{ols,j}| > \gamma$ and 0 where $|\hat\beta_{ols,j}| < \gamma$. Then $\mathrm{tr}\big(dP_K(\hat\beta_{ols}(y))\big)$ is the number of nonzero coordinates of the Lasso estimate, that is, the number of selected covariates.
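This counting can be checked by finite differences: taking X = I, so that $\hat\beta_{ols}(y) = y$, the divergence of the soft-thresholding map equals the number of nonzero coordinates. A sketch with illustrative values chosen away from the kinks at ±γ:

```python
import numpy as np

def soft(b, gamma):
    """Soft-thresholding: orthogonal-case projection P_K."""
    return np.sign(b) * np.maximum(np.abs(b) - gamma, 0.0)

gamma = 1.0
y = np.array([2.0, 0.3, -1.5, 0.1])     # illustrative data, away from the kinks at +-gamma

# divergence by central finite differences: sum_i d m_i / d y_i
h = 1e-5
div_fd = 0.0
for i in range(len(y)):
    e = np.zeros_like(y); e[i] = h
    div_fd += (soft(y + e, gamma)[i] - soft(y - e, gamma)[i]) / (2 * h)

n_nonzero = np.count_nonzero(soft(y, gamma))
print(div_fd, n_nonzero)                 # both equal the number of selected covariates
```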
For the general case, the proofs of the degrees of freedom of the Lasso (KATO, 2009; TIBSHIRANI and TAYLOR, 2012) are quite complex, since the diamond-shaped set K has faces of dimensions 0, 1, . . . , p − 1 and it is necessary to find the intersections of ellipsoids with these low-dimensional faces. Such a situation requires measure theory. Here we present a semi-complete mathematical proof that is much more intuitive and accessible to a broad statistical audience.
The only thing we have to accept intuitively is that each face has a domain of attraction. That is, for almost every β that projects onto a face L there is an open subset around β that also projects onto the same face. Let us give an example: in the orthogonal case on $\mathbb{R}^2$, the singular face consisting of the single point (0, t) has the domain of attraction shown in Figure 4. What changes in the non-orthogonal case? The Mahalanobis metric preserves straight lines; the only thing that changes is the angles. Therefore the orthogonal projection in the Mahalanobis distance is the same as an oblique projection in the Euclidean metric, as seen in the previous section (see Figure 5). Consider now an l-dimensional face F such that $\hat\beta_{Lasso}(t) = P_K(\hat\beta_{ols}(y)) \in F$.
Any small enough open ball centered at $\hat\beta_{ols}(y)$ is necessarily mapped onto the face F. Inside this ball we can take a small l-dimensional affine subspace, parallel, in the Mahalanobis metric, to the face F. As the projection $P_K$ preserves distances along this subspace (see Figure 6), the derivative of the projection $P_K$ at the point $\hat\beta_{ols}(y)$ can be written, in suitable coordinates, as the block matrix

$$dP_K\big(\hat\beta_{ols}(y)\big) = \begin{pmatrix} I_l & 0 \\ 0 & 0 \end{pmatrix}.$$

Then $\mathrm{tr}\big(dP_K(\hat\beta_{ols}(y))\big) = l$. But the dimension l of the face is exactly the number of selected covariates, and the result follows.
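The domain-of-attraction picture can be illustrated explicitly in the Euclidean case using the standard sort-and-threshold algorithm for projecting onto the ℓ1-ball (the routine and the test points below are illustrative, not part of the paper's argument): several nearby points all project onto the same singular face, here a vertex.

```python
import numpy as np

def project_l1(v, t=1.0):
    """Euclidean projection of v onto the l1-ball of radius t (sort-and-threshold)."""
    if np.sum(np.abs(v)) <= t:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # sorted absolute values, descending
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - t) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - t) / (rho + 1)           # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# A whole neighborhood of points projects onto the same 0-dimensional face (vertex):
for v in ([2.0, 0.1], [2.0, -0.1], [1.8, 0.05]):
    print(project_l1(np.array(v)))               # all map to the vertex (1, 0)
```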
4 The case p > n

Zou and Hastie (2005) proposed the elastic net estimator as an alternative to the Lasso. They pointed out that if the number of covariates p is greater than the number of observations n, the Lasso selects at most n covariates, which implies that the Lasso is not a very satisfactory variable selection method when p > n. They claim that this deficiency comes from the nature of the convex optimization problem that defines the estimator, but they do not present any further explanation or proof of this fact. There is a great source of confusion here. The Lasso is defined from the least squares estimator, and if the tuning parameter t is such that, for example, $t = \|\hat\beta_{ols}\|_1$, then $\hat\beta_{Lasso} = \hat\beta_{ols}$ and certainly the number of selected covariates is p. Such confusion is recurrent and quite frequent in statistics forums, and the answers posted are somewhat incomplete. We develop a more in-depth discussion of this issue.
When p > n the linear transformation X cannot be injective: if the dimension of the image of X is k ≤ n, then ker X has dimension p − k > 0. The geometric construction developed in the previous sections remains fully valid: once the value of the tuning parameter t is fixed, we want to find the vector in $K_p$ as close as possible to the data vector. Using the orthogonal projection of y onto Im(X) and any generalized inverse of X, an estimate $\hat\beta_{ols}$ is obtained. Again it is possible to build the family of ellipsoids

$$(\beta - \hat\beta_{ols})^\top X^\top X\, (\beta - \hat\beta_{ols}) = c.$$

The difference here is that these ellipsoids are singular, in the sense that the quadratic form is degenerate along ker X. If the value of c is increased until the ellipsoid reaches a tangent point with the face $\|\beta\|_1 = t$, we obtain a Lasso estimate, which we will momentarily call $\hat\beta^p_{Lasso}$. This estimate solves the variational problem $\min \|X\beta - y\|$ subject to $\|\beta\|_1 \le t$. Since $X(\hat\beta^p_{Lasso} + z) = X\hat\beta^p_{Lasso}$ for every $z \in \ker X$, the vector $\hat\beta^p_{Lasso} + z$ is also a solution of the minimization problem for all z belonging to ker X.
Consistent with the Lasso philosophy of shrinkage and covariate selection, it is reasonable to choose among the solutions $\hat\beta^p_{Lasso} + z$ one of minimum ℓ1-norm, that is, we have to solve a new minimization problem:

$$\min \|\beta\|_1 \quad \text{subject to} \quad \beta = \hat\beta^p_{Lasso} + z, \; z \in \ker X. \qquad (2)$$
The solution to this problem is simple and follows from the proposition below.
Proposition (Boyd and Vandenberghe, 2004, page 141): The convex optimization problem $\min f(x)$ subject to $Ax = b$ has a solution $x^*$ with $\nabla f(x^*)$ orthogonal to ker A.

Proof:
Assuming that all coordinates of $\beta = \hat\beta^p_{Lasso} + z$ are positive, we have $\nabla \|\beta\|_1 = (1, 1, \ldots, 1)$; if some coordinate is negative, simply place −1 in the corresponding position. Since ker X has dimension p − k, for dimensionality reasons this subspace cannot have empty intersection with every coordinate subspace of dimension n, because $\dim \ker X + n = p - k + n > p$. Therefore there is a coordinate subspace of dimension less than or equal to n that intersects the affine subspace $\{\hat\beta^p_{Lasso} + z \; ; \; z \in \ker X\}$. We can suppose, without loss of generality, that ker X is not parallel to a face $\|\beta\|_1 = t$. In this case a vector normal to the affine subspace $\{\hat\beta^p_{Lasso} + z \; ; \; z \in \ker X\}$ cannot be parallel to the vector $(1, 1, \ldots, 1)$. Therefore, by the proposition, the solution of the minimization problem (2) can only occur in the intersection of $\{\hat\beta^p_{Lasso} + z \; ; \; z \in \ker X\}$ with a coordinate subspace. The vectors of this intersection are then the candidates to be the Lasso estimate, and they have at most n nonzero covariates (Figure 7).
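The dimension count in this proof can be made concrete: starting from a solution with p nonzero coordinates, one can move along ker X to zero out p − k of them without changing the fit. A NumPy sketch (the design and coefficients are illustrative; we zero the last p − k coordinates, assuming the corresponding square subsystem of the kernel basis is invertible, which holds generically):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 3, 6
X = rng.standard_normal((n, p))                 # p > n, so ker X is nontrivial
beta = rng.standard_normal(p)                   # a solution with (generically) p nonzeros

# Orthonormal basis of ker X from the SVD: rows of Vt beyond the rank k
U, s, Vt = np.linalg.svd(X)
k = int(np.sum(s > 1e-10))                      # rank of X (generically k = n)
B = Vt[k:].T                                    # p x (p - k) basis of ker X

# Choose z = B c so that the last p - k coordinates of beta + z vanish
S = np.arange(k, p)
c = np.linalg.solve(B[S], -beta[S])             # B[S] is (p-k)x(p-k), generically invertible
beta2 = beta + B @ c

print(np.allclose(X @ beta2, X @ beta))         # the fit X beta is unchanged
print(np.count_nonzero(np.abs(beta2) > 1e-10))  # at most k <= n nonzero coordinates
```

This only exhibits *some* equivalent solution with at most n nonzeros; the actual Lasso estimate is the minimum-ℓ1 point among them, as problem (2) states.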

Conclusion
The theory of Lasso estimators is strongly based on geometric constructions, although it is usually presented as a convex optimization problem. In this paper we have shown that basic linear algebra and geometric arguments give a greater intuitive understanding of the basic facts of the theory.