MT-SGL encourages individual feature selection based on the utility of features across all tasks

2024-03-12

MT-SGL encourages (a) individual feature selection based on the utility of the features across all tasks via the ℓ2,1-norm, and (b) task-specific group selection based on the utility of the group via the G2,1-norm, i.e., the regions of interest (ROI) for that task. Unlike basic SGL for regression (Chatterjee et al., 2012; Liu and Ye, 2010; Friedman et al., 2010), MT-SGL has a parameter coupling across tasks because of the row-wise penalty ∥Θj∥2, the ℓ2-norm of each feature's coefficients across all tasks, which encourages simultaneous sparsity across tasks for individual feature selection. Further, in the proposed MT-SGL, the group sparsity, as determined by the task-specific group-norm term, is task specific, so that different tasks can use different groups if needed.

The proposed MT-SGL framework is related to the recently proposed G-SMuRFS (Yan et al., 2015; Wang et al., 2012), with three key differences: (i) unlike G-SMuRFS, MT-SGL regularization decouples the group sparse regularization across tasks, allowing for more flexibility; (ii) MT-SGL allows the loss function to be based on generalized linear models (GLMs), rather than just the square loss, which corresponds to a Gaussian model; and (iii) the optimization in MT-SGL is done using FISTA (Beck and Teboulle, 2009), which leads to a fast and correct algorithm. The motivation behind considering GLMs is that the responses in the context of AD are often non-Gaussian, e.g., the number of words an individual can remember after half an hour, which may be better modeled by a Poisson distribution or another distribution over discrete counts. We study the effectiveness of using a GLM-based MT-SGL in Section 4. The formulation also makes MT-SGL applicable to more general problems and data types. Moreover, while G-SMuRFS (Yan et al., 2015; Wang et al., 2012) considers a related model, its optimization was based on an approximate gradient (not sub-gradient) descent method to handle sparse coefficient blocks. In contrast, we directly use an accelerated method based on FISTA (Beck and Teboulle, 2009), which is provably correct and faster. The difference between the formulations of MT-SGL and G-SMuRFS is illustrated in Fig. 4.
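As a concrete illustration (not the authors' code), the following is a minimal sketch of how the two-part MT-SGL regularizer described above could be evaluated, assuming the coefficient matrix Θ is stored as a features × tasks NumPy array and each task's ROI grouping is given as a list of row-index arrays. The names mtsgl_penalty, lambda1, and lambda2, and the omission of any group-size weights, are assumptions for this sketch.

```python
import numpy as np

def mtsgl_penalty(Theta, groups_per_task, lambda1, lambda2):
    """Sketch of the MT-SGL-style regularizer described above (hypothetical helper).

    Theta           : (n_features, n_tasks) coefficient matrix.
    groups_per_task : list over tasks; each entry is a list of index arrays
                      (e.g., ROIs) grouping the features for that task.
    lambda1         : weight of the l2,1 term (row norms, shared across tasks).
    lambda2         : weight of the task-specific group term (G2,1-style).
    """
    # (a) l2,1-norm: l2-norm of each feature's coefficients across all tasks,
    #     summed over features; this couples tasks and encourages joint
    #     selection of individual features.
    l21 = np.sum(np.linalg.norm(Theta, axis=1))

    # (b) task-specific group term: for each task, sum the l2-norms of the
    #     coefficients restricted to that task's groups. The groups are
    #     decoupled across tasks, unlike in G-SMuRFS.
    g21 = 0.0
    for t, groups in enumerate(groups_per_task):
        for idx in groups:
            g21 += np.linalg.norm(Theta[idx, t])

    return lambda1 * l21 + lambda2 * g21
```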
Efficient optimization for MT-SGL. The optimization problem for MT-SGL in (6) is a convex problem with a composite objective: a smooth term corresponding to the square loss and a non-smooth term corresponding to the regularizer. Composite minimization problems, in which the objective consists of a smooth loss function and a sum of non-smooth functions, have received increasing attention with the rise of structured sparsity (Bach et al., 2012), e.g., the graph-guided fused lasso (Kim and Xing, 2009) and the fused sparse group lasso (Zhou et al., 2013). Although these structured regularizers greatly enhance modeling capability, they also introduce significant new computational challenges (Yu, 2013a). In this section, we present a FISTA-style (Beck and Teboulle, 2009) algorithm for efficiently solving the MT-SGL problem.

Consider a general convex optimization problem with a composite objective given by

min_x F(x) = f(x) + g(x),     (7)

where f is a smooth convex function of type C1,1, i.e., continuously differentiable with Lipschitz continuous gradient so that ∥∇f(x) − ∇f(w)∥ ≤ κ∥x − w∥, where κ denotes the Lipschitz constant, and g is a continuous convex function which is possibly non-smooth. A well-studied idea for efficiently optimizing such composite objectives is to start with a quadratic approximation of the form

Q_κ(x, w) = f(w) + ⟨∇f(w), x − w⟩ + (κ/2)∥x − w∥² + g(x).

Ignoring terms that are constant in x, the unique minimizer of this expression can be written as

p_κ(w) = argmin_x { g(x) + (κ/2)∥x − (w − (1/κ)∇f(w))∥² },

which can be viewed as a proximal operator corresponding to the non-smooth function g(x). A popular approach to solving problems such as (7) is the simple iterative update

x_t = p_κ(x_{t−1}),

which can be shown to have an O(1/t) rate of convergence (Nesterov, 2005; Parikh and Boyd, 2013). For our purposes, we consider a refined version of this iterative algorithm inspired by Nesterov's accelerated gradient descent (Nesterov, 2005; Parikh and Boyd, 2013). The main idea, studied in the literature as FISTA-style algorithms (Beck and Teboulle, 2009), is to apply the proximal operator at a specific linear combination of the previous two iterates x_{t−1}, x_{t−2}, namely at

w_t = x_{t−1} + α_t (x_{t−1} − x_{t−2}),

instead of at just the previous iterate x_{t−1}. The choice of α_t follows Nesterov's accelerated gradient descent (Nesterov, 2005; Parikh and Boyd, 2013) and is detailed in Algorithm 1. The iterative algorithm then simply updates

x_t = p_κ(w_t).

As shown in Beck and Teboulle (2009), the algorithm has a rate of convergence of O(1/t²).
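To make the FISTA-style updates above concrete, here is a minimal, generic sketch of the accelerated proximal-gradient loop, not the paper's Algorithm 1. It assumes a callable grad_f for the gradient of the smooth loss, a callable prox_g implementing the proximal operator p_κ of the non-smooth regularizer, and a known Lipschitz constant kappa; all names are illustrative.

```python
import numpy as np

def fista(grad_f, prox_g, x0, kappa, n_iter=500):
    """Generic FISTA loop (Beck and Teboulle, 2009) for min_x f(x) + g(x).

    grad_f : callable returning the gradient of the smooth part f at a point.
    prox_g : callable prox_g(v, step) solving
             argmin_x g(x) + (1 / (2 * step)) * ||x - v||^2.
    x0     : starting point (NumPy array).
    kappa  : Lipschitz constant of grad_f; the step size is 1 / kappa.
    """
    x_prev = x0.copy()
    w = x0.copy()          # extrapolated point w_t
    alpha_prev = 1.0       # Nesterov momentum parameter
    step = 1.0 / kappa

    for _ in range(n_iter):
        # Proximal-gradient step taken at the extrapolated point w.
        x = prox_g(w - step * grad_f(w), step)

        # Nesterov momentum update for the next extrapolated point.
        alpha = (1.0 + np.sqrt(1.0 + 4.0 * alpha_prev ** 2)) / 2.0
        w = x + ((alpha_prev - 1.0) / alpha) * (x - x_prev)

        x_prev, alpha_prev = x, alpha

    return x_prev
```

In the MT-SGL setting, prox_g would correspond to the combined row-wise and task-specific group penalty from (6); its exact form is problem specific and is not reproduced in this sketch.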