Published: March 4, 2016
Friday, March 04, 2016 3:00 PM - 4:00 PM
Main Campus - Engineering Classroom Wing - 265

Column Subset Selection on Terabyte-sized Scientific Data

One of the most straightforward formulations of a feature selection problem boils down to the linear algebraic problem of selecting good columns from a data matrix. This formulation has the advantage of yielding features that are interpretable to scientists in the domain from which the data are drawn, an important consideration when machine learning methods are applied to realistic scientific data. While simple, this problem is central to many other seemingly nonlinear learning methods. Moreover, while unsupervised, this problem also has strong connections with related supervised learning methods such as Linear Discriminant Analysis and Canonical Correlation Analysis. We will describe recent work implementing Randomized Linear Algebra algorithms for this feature selection problem in parallel and distributed environments on inputs ranging in size from single terabytes to tens of terabytes, as well as the application of these implementations to specific scientific problems in areas such as mass spectrometry imaging and climate modeling.
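
As a rough illustration of the column selection formulation above, here is a minimal single-machine sketch in Python. The abstract does not say which algorithm the talk uses, so this sketch assumes one common approach: sampling columns in proportion to their rank-k leverage scores. For simplicity the scores come from an exact SVD, whereas randomized methods would approximate them. The function names and the parameters k (target rank) and c (number of columns to keep) are hypothetical, chosen only for this example.

```python
import numpy as np


def leverage_scores(A, k):
    """Rank-k leverage score of each column of A (the scores sum to k)."""
    # Rows of Vt are the right singular vectors, ordered by singular value.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sum(Vt[:k, :] ** 2, axis=0)


def select_columns(A, k, c, seed=0):
    """Sample c distinct column indices with probability proportional to leverage."""
    rng = np.random.default_rng(seed)
    scores = leverage_scores(A, k)
    probs = scores / scores.sum()
    idx = rng.choice(A.shape[1], size=c, replace=False, p=probs)
    return np.sort(idx)


def cx_relative_error(A, idx):
    """Relative Frobenius error of approximating A from the selected columns."""
    C = A[:, idx]                                # actual columns of the data
    X = np.linalg.lstsq(C, A, rcond=None)[0]     # least-squares fit of A by C
    return np.linalg.norm(A - C @ X) / np.linalg.norm(A)
```

Because the selected columns are actual columns of the data matrix, each retained feature keeps its original meaning (for example, a particular measurement channel), which is the interpretability advantage the abstract points to.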