Tuesday, April 23, 2019

CSHardware: Development of sparse coding and reconstruction subsystems for astronomical imaging, João Rino-Silvestre

João just let me know the following:
Salut Igor,  
I've noticed that you have a blog dedicated to compressed sensing, so I thought that my MSc dissertation (in which I detail the development of subsystems and of a stand-alone instrument prototype) might be of interest to you and your community. I leave you the link below:

Best regards,
João Rino-Silvestre
Compressed sensing (CS) is a revolutionary signal processing technique that allows us, under a specific set of conditions, to fully reconstruct an under-sampled signal. Very early, from its inception, in 2006, to subsequent development and propagation in the following years compressed sensing enabled advancements in photography, holography and medical instrumentation among others. Its application in astronomy however, even though some calls to action have recently been made, has failed to leave the test bed. Continuing from the work developed by Bandarra and Pires in their respective Masters’ dissertations, advancements will here be described on the development of a physical, out of the table, instrument; a compressed sensing astronomy camera (COSAC). Such an instrument was projected to be constituted by five subsystems: optomechanics, signal coding, acquisition electronics, signal reconstruction and the mechanical structure (casing and inner supports for the aforementioned subsystems). The present work focused on the development/implementation of signal coding and reconstruction subsystems, while a simple prototype for the mechanical structure is also proposed to enable testing the instrument in a real world setting; this required the redesign of the optomechanical supports. Additionally, some changes were made to the acquisition electronics in order to not only improve its behavior, but to also facilitate its integration with the signal coding subsystem; as a result two working circuit are proposed, one using an ADC of 10 bits resolution, the other an ADC of 24 bits . A central component to this instrument, which bridges the optomechanics, signal coding and acquisition subsystems, is a digital micromirror device (DMD), an array of independently controlled micromirrors which can be tilted in two, opposed, directions. Such a device can, and is, thus used to manipulate light. For this project a DLP LightCrafter, a projector development kit by Texas Instruments which includes a DMD, was used to encode light signals. The signal coding subsystem is constituted by the LightCrafter and two programs: one written in C/C++ to run either on a PC (DMD-CS.cpp), which main purpose is to control the DMD and communicate with the LightCrafter’s processor, and which also communicates with an Arduino micro-controller that manages the acquisition electronics; the second (which is also part of the acquisition subsystem) in Arduino programming language, to run on the micro-controller (pIDDO.ino), which will manage the processes required to perform measurements with the electronics and communicate with the C/C++ program; the interactions between both programs are crucial to ensure synchronism between the signal coding and acquisition subsystems. The chosen encoding basis are squared Hadamard matrices that can be attained by following simple algorithms; rows of such matrices were then manipulated into tilt configurations for the micromirror grid; the set of rows used will constitute a sampling matrix. Each program outputs a file, one holding information about the sampling matrix used, the other holding the measurements. The signal reconstruction subsystem is another program that takes the files generated by DMD-CS.cpp and pIDDO.ino to reconstruct the original signal by implementing a Matlab script written by Romberg. The program then outputs a BMP image file of that reconstruction. The components of the prototype structural subsystem and optomechanical supports were designed using computer assisted design (CAD) software, with which finite element simulations were also performed to ensure those same components would be able to endure real world conditions. Some of these components were bought most of them were fabricated in the laboratory. All subsystems were individually tested, as well as in couples (when relevant). After passing those tests, these subsystems were assembled to form COSAC. The instrument was calibrated, analysed and validated, using both versions of the acquisition circuit, in a laboratory setting with controlled lighting conditions. Comparative results of COSAC’s performance for three modes of acquisition (raster, Hadamard transform optics and CS with Hadamard base) are also presented. COSAC was shown to be able to produce images of CS measurements, performed in the visible spectrum, with at least 64_64 pixels.

Follow @NuitBlog or join the CompressiveSensing Reddit, the Facebook page, the Compressive Sensing group on LinkedIn  or the Advanced Matrix Factorization group on LinkedIn

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

Other links:
Paris Machine LearningMeetup.com||@Archives||LinkedIn||Facebook|| @ParisMLGroup About LightOnNewsletter ||@LightOnIO|| on LinkedIn || on CrunchBase || our Blog
About myselfLightOn || Google Scholar || LinkedIn ||@IgorCarron ||Homepage||ArXiv

Book: High-Dimensional Probability An Introduction with Applications in Data Science by Roman Vershynin

Found in the comment section of Terry's blogRoman Vershynin is writing a book titled: High-Dimensional Probability An Introduction with Applications in Data Science (University of California, Irvine March 25, 2019) 

Here is the table of content:
Preface vi Appetizer: using probability to cover a geometric set 
1 Preliminaries on random variables 6
1 1 Preliminaries on random variables 6
1.1 Basic quantities associated with random variables 6
1.2 Some classical inequalities 7
1.3 Limit theorems 9
1.4 Notes 12
2 Concentration of sums of independent random variables 13 
2.1 Why concentration inequalities? 13
2.2 Hoeffding’s inequality 16
2.3 Chernoff’s inequality 19
2.4 Application: degrees of random graphs 21
2.5 Sub-gaussian distributions 24
2.6 General Hoeffding’s and Khintchine’s inequalities 29
2.7 Sub-exponential distributions 32
2.8 Bernstein’s inequality 37
2.9 Notes 40  
3 Random vectors in high dimensions 42
3.1 Concentration of the norm 43
3.2 Covariance matrices and principal component analysis 45
3.3 Examples of high-dimensional distributions 50
3.4 Sub-gaussian distributions in higher dimensions 56
3.5 Application: Grothendieck’s inequality and semidefinite programming 60
3.6 Application: Maximum cut for graphs 66
3.7 Kernel trick, and tightening of Grothendieck’s inequality 70
3.8 Notes 74  
4 Random matrices 76 
4.1 Preliminaries on matrices 76
4.2 Nets, covering numbers and packing numbers 81
4.3 Application: error correcting codes 86
4.4 Upper bounds on random sub-gaussian matrices 89
4.5 Application: community detection in networks 93
4.6 Two-sided bounds on sub-gaussian matrices 97 iii iv Contents
4.7 Application: covariance estimation and clustering 99
4.8 Notes 103 
5 Concentration without independence 105
5.1 Concentration of Lipschitz functions on the sphere 105
5.2 Concentration on other metric measure spaces 112
5.3 Application: Johnson-Lindenstrauss Lemma 118
5.4 Matrix Bernstein’s inequality 121
5.5 Application: community detection in sparse networks 129
5.6 Application: covariance estimation for general distributions 129
5.7 Notes 133
6 Quadratic forms, symmetrization and contraction 135 
6.1 Decoupling 135
6.2 Hanson-Wright Inequality 139
6.3 Concentration of anisotropic random vectors 142
6.4 Symmetrization 145
6.5 Random matrices with non-i.i.d. entries 147
6.6 Application: matrix completion 148
6.7 Contraction Principle 151 6.8 Notes 154 
7 Random processes 156 
7.1 Basic concepts and examples 156
7.2 Slepian’s inequality 160
7.3 Sharp bounds on Gaussian matrices 167
7.4 Sudakov’s minoration inequality 170
7.5 Gaussian width 172
7.6 Stable dimension, stable rank, and Gaussian complexity 178
7.7 Random projections of sets 181
7.8 Notes 185 
8 Chaining 187 
8.1 Dudley’s inequality 187
8.2 Application: empirical processes 195
8.3 VC dimension 200
8.4 Application: statistical learning theory 212
8.5 Generic chaining 219
8.6 Talagrand’s majorizing measure and comparison theorems 223
8.7 Chevet’s inequality 225
8.8 Notes 227 
9 Deviations of random matrices and geometric consequences 229 
9.1 Matrix deviation inequality 229
9.2 Random matrices, random projections and covariance estimation 235
9.3 Johnson-Lindenstrauss Lemma for infinite sets 238
9.4 Random sections: M∗ bound and Escape Theorem 240
9.5 Notes 
10 Sparse Recovery 246 
10.1 High-dimensional signal recovery problems 246
10.2 Signal recovery based on M∗ bound 248
10.3 Recovery of sparse signals 250
10.4 Low-rank matrix recovery 254
10.5 Exact recovery and the restricted isometry property 256
10.6 Lasso algorithm for sparse regression 262
10.7 Notes 267
11 Dvoretzky-Milman’s Theorem 269
11.1 Deviations of random matrices with respect to general norms 269
11.2 Johnson-Lindenstrauss embeddings and sharper Chevet inequality 272
11.3 Dvoretzky-Milman’s Theorem 274
11.4 Notes 279
Bibliography 280

Follow @NuitBlog or join the CompressiveSensing Reddit, the Facebook page, the Compressive Sensing group on LinkedIn  or the Advanced Matrix Factorization group on LinkedIn

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

Other links:
Paris Machine LearningMeetup.com||@Archives||LinkedIn||Facebook|| @ParisMLGroup About LightOnNewsletter ||@LightOnIO|| on LinkedIn || on CrunchBase || our Blog
About myselfLightOn || Google Scholar || LinkedIn ||@IgorCarron ||Homepage||ArXiv

Monday, April 22, 2019

Videos: New Deep Learning Techniques, February 5 - 9, 2018, IPAM Workshop



In recent years, artificial neural networks a.k.a. deep learning have significantly improved the fields of computer vision, speech recognition, and natural language processing. The success relies on the availability of large-scale datasets, the developments of affordable high computational power, and basic deep learning operations that are sound and fast as they assume that data lie on Euclidean grids. However, not all data live on regular lattices. 3D shapes in computer graphics represent Riemannian manifolds. In neuroscience, brain activity (fMRI) is encoded on the structural connectivity network (sMRI). In genomics, the human body functionality is expressed through DNA, RNA, and proteins that form the gene regulatory network (GRN). In social sciences, people interact through networks. Eventually, data in communication networks are structured by graphs like the Internet or road traffic networks.

Deep learning that has originally been developed for computer vision cannot be directly applied to these highly irregular domains, and new classes of deep learning techniques must be designed. This is highly challenging as most standard data analysis tools cannot be used on heterogonous data domains. The workshop will bring together experts in mathematics (statistics, harmonic analysis, optimization, graph theory, sparsity, topology), machine learning (deep learning, supervised & unsupervised learning, metric learning) and specific applicative domains (neuroscience, genetics, social science, computer vision) to establish the current state of these emerging techniques and discuss the next directions.
This workshop will include a poster session; a request for posters will be sent to registered participants in advance of the workshop.
Here are the videos (and some slides):
Samuel Bowman (New York University)
Toward natural language semantics in learned representations

Emily Fox (University of Washington)
Interpretable and Sparse Neural Network Time Series Models for Granger Causality Discovery

Ellie Pavlick (University of Pennsylvania), Should we care about linguistics?

Leonidas Guibas (Stanford University), Knowledge Transport Over Visual Data

Yann LeCun (New York University), Public Lecture: Deep Learning and the Future of Artificial Intelligence

Alán Aspuru-Guzik (Harvard University), Generative models for the inverse design of molecules and materials

Daniel Rueckert (Imperial College), Deep learning in medical imaging: Techniques for image reconstruction, super-resolution and segmentation

Kyle Cranmer (New York University), Deep Learning in the Physical Sciences

Stéphane Mallat (École Normale Supérieure), Deep Generative Networks as Inverse Problems

Michael Elad (Technion - Israel Institute of Technology), Sparse Modeling in Image Processing and Deep Learning

Yann LeCun (New York University), Public Lecture: AI Breakthroughs & Obstacles to Progress, Mathematical and Otherwise

Xavier Bresson (Nanyang Technological University, Singapore), Convolutional Neural Networks on Graphs

Federico Monti (Universita della Svizzera Italiana), Deep Geometric Matrix Completion: a Geometric Deep Learning approach to Recommender Systems

Joan Bruna (New York University), On Computational Hardness with Graph Neural Networks

Jure Leskovec (Stanford University), Large-scale Graph Representation Learning

Arthur Szlam (Facebook), Composable planning with attributes

Yann LeCun (New York University), A Few (More) Approaches to Unsupervised Learning

Sanja Fidler (University of Toronto), Teaching Machines with Humans in the Loop
Raquel Urtasun (University of Toronto), Deep Learning for Self-Driving Cars

Pratik Chaudhari (University of California, Los Angeles (UCLA)), Unraveling the mysteries of stochastic gradient descent on deep networks

Stefano Soatto (University of California, Los Angeles (UCLA)), Emergence Theory of Deep Learning

Tom Goldstein (University of Maryland), What do neural net loss functions look like?

Stanley Osher (University of California, Los Angeles (UCLA)), New Techniques in Optimization and Their Applications to Deep Learning and Related Inverse Problems

Michael Bronstein (USI Lugano, Switzerland), Deep functional maps: intrinsic structured prediction for dense shape correspondence

Sainbayar Sukhbaatar (New York University), Deep Architecture for Sets and Its Application to Multi-agent Communication

Zuowei Shen (National University of Singapore), Deep Learning: Approximation of functions by composition

Wei Zhu (Duke University), LDMnet: low dimensional manifold regularized neural networks

Join the CompressiveSensing subreddit or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Saturday, April 20, 2019

Videos: Sublinear Algorithms and Nearest-Neighbor Search workshop, Nov. 27 – Nov. 30, 2018 (Simon Institute and Kavli Foundation)

The Sublinear Algorithms and Nearest-Neighbor Search workshop is part of the Program on Foundations of Data Science sponsored by the Simons Institute for the Theory of Computing at Berkeley and the Kavli Foundation. It took place Nov. 27 – Nov. 30, 2018 in Berkeley. Thank you to the organizer to make this workshop a reality: Robert Krauthgamer , Artur Czumaj , Aarti Singh , Rachel Ward . The introduction for the workshop goes as follows:

Many applications require extreme computational efficiency, where the usage of resources, like runtime, storage, or data samples, is sublinear in the size of the input, and the workshop will cover different areas where this topic is studied, and explore connections between them. Specifically, this topic received a lot of attention within Theoretical Computer Science, often motivated by advances in various application areas, and the workshop will review recent developments, such as algorithms for data streams, sketching, dimensionality reduction, graph sparsification, and property testing. In addition, the workshop will examine connections with linear and sublinear methods in Machine Learning, Statistics, Signal Processing and related areas, such as the well-known connections with compressed sensing, sparse recovery, and nearest-neighbor search. Some more recent connections are to online bandit problems and to distributed optimization, where constraints on algorithmic resources are just starting to be considered; such problems may be amenable to techniques from data-stream algorithms.

Here are the videos:

Learning-Augmented Sketches for Frequency EstimationPiotr Indyk (Massachusetts Institute of Technology)
Classical streaming algorithms typically do not leverage data properties or patterns in their input. We propose to augment such algorithms with a learning model that enables them to exploit data properties without being specific to a particular pattern or property. We focus on the problem of estimating the frequency of elements in a data stream, a fundamental problem with applications in network measurements, natural language processing, and security. We propose a new framework for designing frequency estimation streaming algorithms that automatically learn to leverage the properties of the input data. We present a theoretical analysis of the proposed algorithms and prove that, under natural assumptions, they have lower space complexity than prior algorithms. We also evaluate our algorithms on two problems ? monitoring Internet traffic and tracking the popularity of search queries ? and demonstrate their performance gains. Joint work with Chen-Yu Hsu, Dina Katabi and Ali Vakilian.

In sparse recovery/compressed sensing, one can estimate a k-sparse vector in n dimensions with only Theta(k log n) nonadaptive linear measurements. With adaptivity -- if each measurement can be based on the previous ones -- this reduces to O(k log log n). But what happens if the measurement matrices can only be chosen in a few rounds, as seen (for example) in constant-pass streaming algorithms? This talk will give upper and lower bounds, showing (up to a log^* k factor) that R rounds of adaptivity require Theta(k log^{1/R} n) measurements.

Universal Sketches,Vladimir Braverman (Johns Hopkins University)
Streaming and sketching algorithms have found many applications in computer science and other areas. A typical sketching algorithm approximates one function. Given a class of functions F, it is natural to ask if it is possible to compute a single sketch S that will approximate every function f from F. We call S a “universal sketch” for F. In this talk we will discuss results on universal sketches for several classes of functions. For example, we will describe a sketch that approximates a sub-class of symmetric norms (a norm is symmetric if it is invariant under sign-flips and coordinate-permutations) and outline a connection between universal sketches and concentration of measure and Milman’s theorem. Also, we will describe a recent result for subset (i.e. 0-1 weighted) l0 and l1 norms. For these problems we obtain a nearly optimal upper and lower bounds for streaming space complexity. We will discuss the applicability of universal sketches for Software Defined Networks (SDN). For SDN, we will present the UnivMon (short for Universal Monitoring) framework that can simultaneously achieve both generality and high fidelity across a broad spectrum of monitoring tasks.
This talk is based on joint works with Jaroslaw Blasiok, Stephen R. Chestnut, Robert Krauthgamer and Lin F. Yang (STOC 2017), with Robert Krauthgamer and Lin F. Yang (submitted) and with Zaoxing Liu, Antonis Manousis, Gregory Vorsanger, and Vyas Sekar (HotNets 2015, SIGCOMM 2016).

Approximating the Cost of a Metric K-Nearest Neighbor Graph in Sublinear TimeChristian Sohler (Technische Universität Dortmund and Google Switzerland)
Let (X,d) be an n-point metric space. We assume that (X,d) is given in the distance oracle model, i.e., X={1,...,n} and for every pair of points x,y from X we can query their distance d(x,y) in constant time. A k-nearest neighbor (k-NN) graph} for (X,d) is a directed graph G=(V,E) that has an edge to each of v's k nearest neighbors. We use cost(G) to denote the sum of edge weights of G.
In this paper, we study the problem of approximating cost(G) in sublinear time, when we are given oracle access to the metric space (X,d) that defines G. Our goal is to develop an algorithm that solves this problem faster than the time required to compute G. To this end, we develop an algorithm that in time ~O(min (n k^{3/2} / eps^6, n^2 / (eps^2 k))) computes an estimate K for the cost of the minimum spanning tree that satisfies with probability at least 2/3
|cost(G) - K | less or equal to eps (cost(G) + mst(X))
where mst(X) denotes the cost of the minimum spanning tree of (X,d).
Joint work with Artur Czumaj. Work was done as part of the speaker's affiliation with Google Switzerland.

Independent samples from an unknown probability distribution p on a domain of size k are distributed across n players, with each player holding one sample. Each player can send a message to a central referee in a simultaneous message passing (SMP) model of communication, whose goal is to solve a pre-specified inference problem. The catch, however, is that each player cannot simply send their own sample to the referee; instead, the message they send must obey some (local) information constraint. For instance, each player may be limited to communicating only L bits, where L << log k; or they may seek to reveal as little information as possible, and preserve local differentially privacy. We propose a general formulation for inference problems in this distributed setting, and instantiate it to two fundamental inference questions, learning and uniformity testing. We study the role of randomness for those questions, and obtain striking separations between public- and private-coin protocols for the latter, while showing the two settings are equally powerful for the former. (Put differently, sharing with your neighbors does help a lot for the test, but not really for the learning.) Based on joint works with Jayadev Acharya (Cornell University), Cody Freitag (Cornell University), and Himanshu Tyagi (IISc Bangalore).

We initiate the study of the role of erasures in local decoding and use our understanding to prove a separation between erasure-resilient and tolerant property testing. Local decoding in the presence of errors has been extensively studied, but has not been considered explicitly in the presence of erasures. Motivated by applications in property testing, we begin our investigation with local {\em list} decoding in the presence of erasures. We prove an analog of a famous result of Goldreich and Levin on local list decodability of the Hadamard code. Specifically, we show that the Hadamard code is locally list decodable in the presence of a constant fraction of erasures, arbitrary close to 1, with list sizes and query complexity better than in the Goldreich-Levin theorem. We use this result to exhibit a property which is testable with a number of queries independent of the length of the input in the presence of erasures, but requires a number of queries that depends on the input length, $n$, for tolerant testing. We further study {\em approximate} locally list decodable codes that work against erasures and use them to strengthen our separation by constructing a property which is testable with a constant number of queries in the presence of erasures, but requires $n^{\Omega(1)}$ queries for tolerant testing. Next, we study the general relationship between local decoding in the presence of errors and in the presence of erasures. We observe that every locally (uniquely or list) decodable code that works in the presence of errors also works in the presence of twice as many erasures (with the same parameters up to constant factors). We show that there is also an implication in the other direction for locally decodable codes (with unique decoding): specifically, that the existence of a locally decodable code that works in the presence of erasures implies the existence of a locally decodable code that works in the presence of errors and has related parameters. However, it remains open whether there is an implication in the other direction for locally {\em list} decodable codes. (Our Hadamard result shows that there has to be some difference in parameters for some settings.) We relate this question to other open questions in local decoding. Based on joint work with Noga Ron-Zewi and Nithin Varma.

Sublinear Time Local-Access Random GeneratorsRonitt Rubinfeld (Massachusetts Institute of Technology)

No abstract available.

Locality sensitive hashing (LSH) is a popular technique for nearest neighbor search in high dimensional data sets. Recently, a new view at LSH as a biased sampling technique has been fruitful for density estimation problems in high dimensions. Given a set of points and a query point, the goal (roughly) is to estimate the density of the data set around the query. One way to formalize this is by kernel density estimation: Given a function that decays with distance and represents the "influence" of a data point at the query, sum up this influence function over the data set. Yet another way to formalize this problem is by counting the number of data points within a certain radius of the query. While these problems can easily be solved by making a linear pass over the data, this can be prohibitive for large data sets and multiple queries. Can we preprocess the data so as to answer queries efficiently? This talk will survey several recent papers that use locality sensitive hashing to design unbiased estimators for such density estimation problems and their extensions. This talk will survey joint works with Arturs Backurs, Piotr Indyk, Vishnu Natchu, Paris Syminelakis and Xian (Carrie) Wu.

We consider algorithms that take an unlabeled data set and label it in its entirety, given the ability to interact with a human expert. The goal is to minimize the amount of interaction while producing a labeling that satisfies an (epsilon, delta) guarantee: with probability at least 1-delta over the randomness in the algorithm, at most an epsilon fraction of the labels are incorrect. Scenario 1: The algorithm asks the expert for labels of specific points. This is the standard problem of active learning, except that the final product is a labeled data set rather than a classifier. Scenario 2: The expert also provides "weak rules" or helpful features. We will summarize the state of the art on these problems, in terms of promising algorithms and statistical guarantees, and identify key challenges and open problems.

An Optimal Space Lower Bound for Approximating MAX-CUTMichael Kapralov (Ecole Polytechnique Federale de Lausanne)
We consider the problem of estimating the value of MAX-CUT in a graph in the streaming model of computation. At one extreme, there is a trivial $2$-approximation for this problem that uses only $O(\log n)$ space, namely, count the number of edges and output half of this value as the estimate for the size of the MAX-CUT. On the other extreme, for any fixed $\eps > 0$, if one allows $\tilde{O}(n)$ space, a $(1+\eps)$-approximate solution to the MAX-CUT value can be obtained by storing an $\tilde{O}(n)$-size sparsifier that essentially preserves MAX-CUT value.
Our main result is that any (randomized) single pass streaming algorithm that breaks the $2$-approximation barrier requires $\Omega(n)$-space, thus resolving the space complexity of any non-trivial approximations of the MAX-CUT value to within polylogarithmic factors in the single pass streaming model. We achieve the result by presenting a tight analysis of the Implicit Hidden Partition Problem introduced by Kapralov et al.[SODA'17] for an arbitrarily large number $k$ of players. In this problem a number of players receive random matchings of $\Omega(n)$ size together with random bits on the edges, and their task is to determine whether the bits correspond to parities of some hidden bipartition, or are just uniformly random.
Unlike all previous Fourier analytic communication lower bounds, our analysis does not directly use bounds on the $\ell_2$ norm of Fourier coefficients of a typical message at any given weight level that follow from hypercontractivity. Instead, we use the fact that graphs received by players are sparse (matchings) to obtain strong upper bounds on the $\ell_1$ norm of the Fourier coefficients of the messages of individual players using their special structure, and then argue, using the convolution theorem, that similar strong bounds on the $\ell_1$ norm are essentially preserved (up to an exponential loss in the number of players) once messages of different players are combined. We feel that our main technique is likely of independent interest.

Any graph with maximum degree Delta admits a proper vertex coloring with Delta+1 colors that can be found via a simple sequential greedy algorithm in linear time and space. But can one find such a coloring via a sublinear algorithm? In this talk, I present new algorithms that answer this question in the affirmative for several canonical classes of sublinear algorithms including graph streaming, sublinear time, and massively parallel computation (MPC) algorithms. At the core of these algorithms is a remarkably simple meta-algorithm for the (Delta+1) coloring problem: Sample O(log n) colors for each vertex uniformly at random from the Delta+1 colors and then find a proper coloring of the graph using the sampled colors; our main structural result states that the sampled set of colors with high probability contains a proper coloring of the input graph.

Efficient algorithms for k-means clustering frequently converge to suboptimal partitions, and given a partition, it is difficult to detect k-means optimality. We discuss an a posteriori certifier of approximate optimality for k-means clustering based on Peng and Wei's semidefinite relaxation of k-means.

Efficient algorithms for k-means clustering frequently converge to suboptimal partitions, and given a partition, it is difficult to detect k-means optimality. In this paper, we develop an a posteriori certifier of approximate optimality for k-means clustering. The certifier is a sub-linear Monte Carlo algorithm based on Peng and Wei's semidefinite relaxation of k-means. In particular, solving the relaxation for small random samples of the dataset produces a high-confidence lower bound on the k-means objective, and being sub-linear, our algorithm is faster than k-means++ when the number of data points is large. If the data points are drawn independently from any mixture of two Gaussians over R^m with identity covariance, then with probability 1?O(1/m), our poly(m)-time algorithm produces a 3-approximation certificate with 99% confidence (no separation required). We also introduce a linear-time Monte Carlo algorithm that produces an O(k) additive approximation lower bound.

We provide a novel ? and to the best of our knowledge, the first ? algorithm for high dimensional sparse regression with corruptions in explanatory and/or response variables. Our algorithm recovers the true sparse parameters in the presence of a constant fraction of arbitrary corruptions. Our main contribution is a robust variant of Iterative Hard Thresholding. Using this, we provide accurate estimators with sub-linear sample complexity. Our algorithm consists of a novel randomized outlier removal technique for robust sparse mean estimation that may be of interest in its own right: it is orderwise more efficient computationally than existing algorithms, and succeeds with high probability, thus making it suitable for general use in iterative algorithms.

We study the PCA and column subset selection problems in matrices in an online setting, where the columns arrive one after the other. In the context of column subset selection, the goal is to decide whether to include or discard a column, as it arrives. We design a simple algorithm that includes at most O(k \polylog n) columns overall and achieves a multiplicative (1+\epsilon) error compared to the best rank-k approximation of the full matrix. This result may be viewed as an analog of the classic result of Myerson on online clustering.

The need for fast computation typically requires tradeoffs with statistical accuracy; here we are interested in whether computation can be significantly improved without trading-off accuracy.
In particular, for best possible accuracy in NN prediction, the number of neighbors generally needs to grow as a root of n (sample size), consequently limiting NN-search (any technique) to order of root of n complexity; in other words, expensive prediction seems unavoidable, even while using fast search methods, if accuracy is to be optimal. Unfortunately, the usual alternative is to tradeoff accuracy.
Interestingly, we show that it is possible to maintain accuracy, while reducing computation (at prediction time) to just O(log n), through simple bias and or variance correction tricks applied after data quantization or subsampling, together with (black box) fast search techniques. Furthermore, our analysis yields clear insights into how much quantization or subsampling is tolerable if optimal accuracy is to be achieved.
Our theoretical insights are validated through extensive experiments with large datasets from various domains.
The talk is based on a series of works with N. Verma, and with L. Xue.

We establish a generic reduction from nonlinear spectral gaps of metric spaces to space partitions, in the form of data-dependent Locality-Sensitive Hashing. This yields a new approach to the high-dimensional Approximate Near Neighbor Search problem (ANN). Using this reduction, we obtain a new ANN data structure under an arbitrary d-dimensional norm, where the query algorithm makes only a sublinear number of probes into the data structure. Most importantly, the new data structure achieves a O(log d) approximation for an arbitrary norm. The only other such generic approach, via John's ellipsoid, would achieve square-root-d approximation only. Joint work with Assaf Naor, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten.

I will show the first approximate nearest neighbor search data structure for a general d-dimensional normed space with sub-polynomial in d approximation.
The main tool is a finite-dimensional quantitative version of a theorem of Daher, which yields a Holder homeomorphism between small perturbations of a normed space of interest and a Euclidean space. To make Daher's theorem algorithmic, we employ convex programming to compute the norm of a vector in a space, which is the result of complex interpolation between two given normed spaces.
Based on a joint work (FOCS 2018) with Alex Andoni, Assaf Naor, Sasho Nikolov and Erik Waingarten.

Theoretical work on high-dimensional nearest neighbor search has focused on the setting where a single point is sought within a known search radius, and an acceptable approximation ratio c is given. Locality Sensitive Hashing is a powerful framework for addressing this problem. In practice one usually seeks the (exact) k nearest points, the search radius is unknown, and the parameter c must be chosen in a way that depends on the data distribution. Though reductions of the latter problem to the former exist, they incur polylogarithmic overhead in time and/or space, which in turn make them unattractive in many practical settings. We address this discrepancy between theory and practice by suggesting new, simple, more efficient reductions for solving the k-Nearest Neighbor search problem using Locality Sensitive Hashing. Joint work with Tobias Christiani and Mikkel Thorup.
The main story of this talk is how theoretical ideas on randomized sampling algorithms became part of production code at Twitter. The specific context is finding pairs of similar items: a classic algorithmic problem that is an integral part of recommendation systems. In most incarnations, it boils down to finding high inner products among a large collection of vectors, or alternately high entries in a matrix product. Despite a rich literature on this topic (and despite Twitter's significant compute resources), none of the existing methods scaled to "industrial sized" inputs, which exceed hundreds of billions of non-zeros. I will talk about a distributed algorithm for this problem, that combines low-dimension projections (hashes) with path-sampling techniques (wedges). There is some cute math behind the algorithm, and we were able to run it in production on Twitter's recommendation system. Joint work with Aneesh Sharma (Twitter) and Ashish Goel (Stanford).

In this talk I will discuss how to recover spectral approximations to broad classes of structured matrices using only a polylogarithmic number of adaptive linear measurements to either the matrix or its inverse. Leveraging this result I will discuss how to achieve faster algorithms for solving a variety of linear algebraic problems including solving linear systems in the inverse of symmetric M-matrices (a generalization of Laplacian systems), solving linear systems that are constant spectral approximations of Laplacians (or more generally, SDD matrices), and recovering a spectral sparsifier of a graph using only a polylogarithmc number of matrix vector multiplies. More broadly this talk will show how to leverage a number of recent approaches to spectral sparsification towards expanding the robustness and scope of recent nearly linear time linear system solving research, and providing general matrix recovery machinery that may serve as a stepping stone for faster algorithms. This talk reflects joint work with Arun Jambulapati and Kiran Shiragur.

Spatial Scan Statistics measure and detect anomalous spatial behavior, specifically they identify geometric regions where significantly more of a measured characteristic is found than would be expected from the background distribution. These techniques have been used widely in geographic information science, such as to pinpoint disease outbreaks. However, until recently, available algorithms and software only scaled to at most a thousand or so spatial records. In this work I will describe how using coresets, efficient constructions, and scanning algorithms, we have developed new algorithms and software that easily scales to millions or more data points. Along the way we provide new efficient algorithms and constructions for eps-samples and eps-nets for various geometric range spaces. This is a case where subtle theoretical improvements of old structures from discrete geometry actually result in substantial empirical improvements.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !

Friday, April 19, 2019

Shedding Light on the “Grand Débat”

This blog post was initially featured on Medium. At LightOn, we decided to start a blog and explain what can be done with the random projections performed by our Optical Processing Unit (OPU). Since most people think an optics based system can do only images, we decided to illustrate our technology on an NLP task. Here is the article:

This article is LightOn’s AI Research (LAIR) first contribution focused on applying LightOn’s Optical Processing Unit (OPU) hardware to generic large-scale Machine Learning tasks. Today we tackle the recently released public dataset of a nation-wide public debate and our first trial in using LightOn’s OPU for a Natural Language Processing task. 

As a response to the “Gilets Jaunes” social unrest in France, President Emmanuel Macron declared the opening of a nation-wide citizen debate (“Grand Débat National”), as a way to collect the opinion of the French people on a number of subjects. Aside from public meetings taking place throughout the country, the “Grand Débat” took the form of an online platform where people could answer questions around four major themes: the environment, taxes, state organization, and democracy / citizenship. For every theme, the survey was divided in two parts: the first part featured multiple-choice surveys, while the second part included open-ended questions. A subset of the answers was made openly available by Etalab, the public organization responsible for the Open Data policy of the French government. The data can be found here.
The deadline for survey submissions to the “Grand Débat” was March 18 2019. The French government provided a first analysis on Monday, four days ago. At LightOn, we have been looking for applications of our Optical Processing Unit (OPU) to various Machine Learning problems including Natural Language Processing (NLP). We therefore took this opportunity to feature some exploratory tests on this new dataset. It should be emphasized that no one at LightOn is currently an expert in NLP: the goal of this article is only to demonstrate the kind of data processing that can be done using an OPU. We want to show how this can be done, in only a few days of work, and with no specific knowledge of NLP but a generic Machine Learning background.
The first NLP application of the OPU technology was to form sentence embeddings from word embeddings. Word embeddings are vector representation of words introduced in their modern form by Mikolov et al. (2013)¹. They have been extremely successful as a first step of virtually all NLP models: we replace vanilla one-hot-encoded vectors of dimension the size of the vocabulary with these typically 300-dimensional dense real-valued vectors. The space of word embeddings even has a sensible arithmetic, the classical example being KING — MAN + WOMAN = QUEEN. To obtain these embeddings, a linear neural network is trained to predict the surrounding words given the current word in a sliding window over a huge text corpus (this model is called skip-gram, predicting the current word from the context is also possible). At the end of the training, the projection columns of the weight matrix are the word embeddings. You can see it as a kind of shallow transfer learning, where only the first layer is pre-trained.
Sentence embeddings target similar representations, but for whole sentences. It is a much harder task than word embeddings because, while words are fixed entities and in finite and reasonable number (at least for a single language), sentences are compositional and sequential in nature. They have several degrees of freedom stemming from meaning, word choice, syntax, tone, length, etc. This diversity not only means that there is more information to encode, but also that it’s virtually impossible to save an embedding for every possible sentence. Sentence embeddings instead have to be formed on-the-fly from their constituent words.
At the time of this writing, some of the leading models for sentence embeddings are SkipThought² and InferSent³. In particular, the paper that sparked our interest in sentence embeddings was the recent work of Wieting and Kiela (2019)⁴. This work shows that it is possible to obtain results on SentEval⁵ comparable to these advanced models with no training at all. One of the models used in this paper, Bag Of Random Embedding Projections(BOREP), uses random projections, which happens to be the operation an OPU excels at. As a result, we decided to give it a try. We followed BOREP but replaced the linear random projection with an OPU transform — complex-valued random projection followed by an element-wise non-linearity. The model is very simple: we get the embedding of each word in a sentence, we project it to a — much — higher dimension and then we apply a pooling function (the mean in practice) to obtain the embedding of the sentence.

The input to the current OPU prototypes -with remote access available to selected users - is a binary vector, meaning in this case that we need to use binary word embeddings. Since there is no such native embedding, we need to binarize real-valued word embeddings. For the latter, we chose fastText⁶, as it is among the best-performing word embedding method to date. To binarize the embeddings, we could use any standard binary encoding scheme. However it is hard to know a priori how destructive it would be for this kind of data. A possibly smarter thing to do is to train an autoencoder to do so. Luckily, some people did just that: binarize word embeddings using an autoencoder⁷. The associated github repo is in C, does not make use of a GPU and cannot handle large embeddings due to memory limitations, so we re-implemented it in PyTorch. It is hard to tell anything from the loss, so we evaluated our binary word embeddings directly on SentEval and obtained satisfying results for most tasks, except Semantic Textual Similarity (STS 12–16), for which there seems to be an unavoidable information loss. On a number of tasks, they performed even better than the original ones, probably because of their higher dimension. Indeed, the paper focuses on memory and computation savings, so the authors use rather low dimensional binary embeddings (up to 512 bits for an original dimension of 300 real numbers). Our goal is instead to obtain binary embeddings with the smallest possible information loss, so we went to much higher dimensions. Keep in mind that producing random projections on the OPU is essentially independent of the size — currently up to dimensions of 1 million ! In the end we used 3000 bits for each vector, so 10 bits per real number instead of 32, which seems reasonable.
Once this is done, we can actually use the OPU to form sentence embeddings using the aforementioned BOREP method. Evaluating it on SentEval gives satisfying results, in many tasks superior to that of the paper (even the Echo State Network). It should be noted that we didn’t find it useful to go above 15,000 dimensions, meaning that a high-end GPU could also have done the job in this case.
There are several reasons why increasing the dimensionality could increase performance on downstream tasks. The first is Cover’s theorem: projecting data non-linearly to a higher dimension increases the chance of them being linearly separable in the new high-dimensional space. This, of course, plays a role but the fact that a linear projection also improves performance in the Wieting and Kiela paper proves that simply adding parameters to the model already helps a lot. More interestingly, we can draw a parallel with work on hyperdimensional computing⁸. In very high-dimensional spaces (several thousand dimensions), the sheer volume available makes it possible to represent an entire dataset with no two points close to each other. The distances between two random points follow a very narrow Gaussian distribution: all points are more or less equidistant. In this setting, the mean of a few vectors is more similar to each of these vectors than any other point. The mean vector is then not merely a lossy summary of the original vectors but actually represents the set that contains them. We can thus expect the mean of the high-dimensional projections of the word embeddings to be truly more representative of the sentence than a mean computed in a lower-dimensional space, irrespective of the fact that the model used in downstream applications will have more parameters to accommodate this higher dimensionality.
For our task of dealing with answers to open-ended questions, which is notoriously difficult, we decided that an interactive visualisation would be a nice way to explore the answers. The goal was to see clusters of similar answers as a summary of what people had to say on the matter. So we chose a specific question and formed an embedding for every answer as if it were a single sentence. From visual observations, if an answer contains several sentences, they are most of the time related, so grouping them seems a reasonable first-attempt assumption. We end up with almost a hundred thousand points in dimension 15,000 (which is also why going above 15,000 would have been problematic) and want to visualise them. A PCA to 2 or 3 dimensions doesn’t give a satisfying result, so we first keep the first 50 principal components and then use t-SNE⁹ do obtain a 2D representation of the data. The result exhibits some interesting patterns but upon inspection, we realize that the global structure doesn’t make much sense. Indeed, t-SNE only preserves local neighborhoods and loses any sense of global structure. Clusters of loosely similar answers are thus not close to each other and can even be on opposite sides of the graph. Increasing the perplexity of t-SNE is supposed to favor global structure over local one, but this had little effect, so we used another technique. t-SNE is an iterative, stochastic method requiring an initial projection. This initialization is usually random but it doesn’t have to. The first two principal components are the best 2D representation of the global structure of the data, so we use them (in practice taking the first two columns of the 50-component PCA we already computed) as the initial state of t-SNE. This allows us to preserve global structure while benefiting from the superior visualizing power of t-SNE. Follow this link to see an interactive version (in French) of this visualization. It only contains 10,000 answers, sampled uniformly without replacement, so that the interactivity remains smooth. Some annotated screenshots are shown just below. What you see are the answers to the first question of the “ecology” (environment) questionnaire. The question was:
What is to you the most important concrete problem regarding ecology ?
  1. Air pollution
  2. Biodiversity and species extinction
  3. Climate change
  4. Shore erosion
  5. Other: answer in plain text

It was a semi-open question with 4 suggested answers but also the possibility to write another answer in plain text if one was not satisfied with the suggestions. Very few people answered 4, so it does not stand out on the graph, but you can clearly see answers 1, 2 and 3 circled in red on the first graph. In green are answers that mentioned two of the suggested answers and in yellow answers that gave variations of one of the suggested answers, e.g. global warming instead of climate change or water pollution instead of air pollution. A significant portion of people refused to choose and said that all suggestions were equally important.

Our visualization technique mostly works with respect to our objective for short sentences, less for longer ones. These are instead gathered in the middle, where more groups can be found as you can see here:

Long answers are hard to classify as belonging to one group or another, and are likely to be far from any other answer, as mentioned before. We believe the big cloud in the middle to be an artifact of the projection, that gathers such isolated answers, more than of the embeddings.
Overall, the result is rather satisfying, given the time spent. There is, of course, much room for improvement, though, with at least three directions to explore:
  • text preprocessing: we did tokenization, stop words removal and lemmatization, which are standard practices, but more thought could be put into it. For instance, a spelling corrector should help;
  • multi-sentence answers: can we separate semantically independent parts of long answers to give them tags ? can we better combine the sentence embeddings of a multi-sentence answer than a simple average ?
  • better sentence embeddings.

On the last point, one could, of course, use a more complex model than ours, e.g. SkipThought and InferSent. More recent models such as QuickThought¹⁰ or multi-task learning approaches have also been proposed¹¹ ¹². However, all these models are much harder to use than ours. Trained encoders are available in English but not in French, so one would have to retrain them. For supervised approaches, the data doesn’t even exist in French. SkipThought is unsupervised but takes several weeks to train. The best chance would be QuickThought but still it requires a book corpus in French and a day of training on a powerful GPU (a Titan X in the paper).
It is instead very easy to obtain satisfying results quickly with our model and OPU hardware. No training, any language works as long as word embeddings are available (fastText supports 157 languages), only the projection time (less than a minute in our case) and enough RAM to store the results are needed. The disadvantage is that the pooling function in our model loses word ordering information. The next step for us is therefore to try a time-aware model such as an Echo State Network, which was used with success in Wieting and Kiela (2019) and which can also be implemented very efficiently at large scale on an OPU¹³.
More info on LightOn’s OPU technology can be found at www.lighton.io. A remote access to an OPU through the LightOn Cloud can be granted, by invitation.

[1] Tomas Mikolov et al. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781 (2013).
[2] Ryan Kiros et al. “Skip-thought vectors.” Advances in Neural Information Processing Systems. 2015.
[3] Alexis Conneau et al. “Supervised learning of universal sentence representations from natural language inference data.” arXiv preprint arXiv:1705.02364 (2017).
[4] John Wieting and Douwe Kiela. “No training required: Exploring random encoders for sentence classification.” arXiv preprint arXiv:1901.10444 (2019).
[5] Alexis Conneau and Douwe Kiela. “Senteval: An evaluation toolkit for universal sentence representations.” arXiv preprint arXiv:1803.05449 (2018).
[6] Piotr Bojanowski et al. “Enriching word vectors with subword information. CoRR abs/1607.04606.” (2016).
[7] Julien Tissier, Guillaume Gravier and Amaury Habrard. “Near-lossless Binarization of Word Embeddings.” arXiv preprint arXiv:1803.09065 (2018).
[8] Pentti Kanerva “Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors.” Cognitive computation 1.2 (2009): 139–159.
[9] Laurens van der Maaten and Geoffrey Hinton. “Visualizing data using t-SNE.” Journal of Machine Learning Research 9.Nov (2008): 2579–2605.
[10] Lajanugen Logeswaran and Honglak Lee. “An efficient framework for learning sentence representations.” arXiv preprint arXiv:1803.02893 (2018).
[11] Sandeep Subramanian et al. “Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning.” CoRRabs/1804.00079 (2018): n. pag.
[12] Daniel Cer et al. “Universal sentence encoder.” arXiv preprint arXiv:1803.11175 (2018).
[13] Jonathan Dong et al. “Scaling up Echo-State Networks with multiple light scattering.” 2018 IEEE Statistical Signal Processing Workshop (2018), arXiv:1609.05204

The author François Boniface, Machine Learning R&D engineer at LightOn AI Research

Join the CompressiveSensing subreddit or the Facebook page and post there !