## Saturday, April 20, 2019

### Videos: Sublinear Algorithms and Nearest-Neighbor Search workshop, Nov. 27 – Nov. 30, 2018 (Simons Institute and Kavli Foundation)

The Sublinear Algorithms and Nearest-Neighbor Search workshop is part of the Program on Foundations of Data Science sponsored by the Simons Institute for the Theory of Computing at Berkeley and the Kavli Foundation. It took place Nov. 27 – Nov. 30, 2018 in Berkeley. Thank you to the organizers for making this workshop a reality: Robert Krauthgamer, Artur Czumaj, Aarti Singh, and Rachel Ward. The introduction for the workshop goes as follows:

Many applications require extreme computational efficiency, where the usage of resources, like runtime, storage, or data samples, is sublinear in the size of the input, and the workshop will cover different areas where this topic is studied, and explore connections between them. Specifically, this topic received a lot of attention within Theoretical Computer Science, often motivated by advances in various application areas, and the workshop will review recent developments, such as algorithms for data streams, sketching, dimensionality reduction, graph sparsification, and property testing. In addition, the workshop will examine connections with linear and sublinear methods in Machine Learning, Statistics, Signal Processing and related areas, such as the well-known connections with compressed sensing, sparse recovery, and nearest-neighbor search. Some more recent connections are to online bandit problems and to distributed optimization, where constraints on algorithmic resources are just starting to be considered; such problems may be amenable to techniques from data-stream algorithms.

Here are the videos:

Learning-Augmented Sketches for Frequency Estimation, Piotr Indyk (Massachusetts Institute of Technology)
Classical streaming algorithms typically do not leverage data properties or patterns in their input. We propose to augment such algorithms with a learning model that enables them to exploit data properties without being specific to a particular pattern or property. We focus on the problem of estimating the frequency of elements in a data stream, a fundamental problem with applications in network measurements, natural language processing, and security. We propose a new framework for designing frequency estimation streaming algorithms that automatically learn to leverage the properties of the input data. We present a theoretical analysis of the proposed algorithms and prove that, under natural assumptions, they have lower space complexity than prior algorithms. We also evaluate our algorithms on two problems (monitoring Internet traffic and tracking the popularity of search queries) and demonstrate their performance gains. Joint work with Chen-Yu Hsu, Dina Katabi and Ali Vakilian.
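
To make the idea of a learning-augmented sketch concrete, here is a minimal Python sketch, loosely in the spirit of the abstract above and not the authors' implementation. It assumes a hypothetical oracle `is_heavy` (in practice a trained model; here any callable) that flags items predicted to be frequent. Flagged items get exact counters, and everything else goes into an ordinary Count-Min sketch. The class names, the salted-hash trick, and the parameters `width` and `depth` are illustrative assumptions.

```python
import random
from collections import Counter


class CountMin:
    """Plain Count-Min sketch: `depth` rows of `width` counters each."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _bucket(self, salt, item):
        return hash((salt, item)) % self.width

    def add(self, item, count=1):
        for salt, row in zip(self.salts, self.rows):
            row[self._bucket(salt, item)] += count

    def estimate(self, item):
        # Collisions only add mass, so the minimum over rows never underestimates.
        return min(row[self._bucket(salt, item)]
                   for salt, row in zip(self.salts, self.rows))


class LearnedCountMin:
    """Count-Min plus exact counters for items a (hypothetical) oracle calls heavy."""

    def __init__(self, is_heavy, width, depth):
        self.is_heavy = is_heavy      # assumed learned predictor, any callable here
        self.exact = Counter()        # dedicated counters for predicted heavy hitters
        self.sketch = CountMin(width, depth)

    def add(self, item):
        if self.is_heavy(item):
            self.exact[item] += 1
        else:
            self.sketch.add(item)

    def estimate(self, item):
        if self.is_heavy(item):
            return self.exact[item]
        return self.sketch.estimate(item)
```

Routing the predicted heavy items away from the shared counters removes their contribution to collisions, which is where most of the error of a plain Count-Min sketch comes from on skewed streams.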

In sparse recovery/compressed sensing, one can estimate a k-sparse vector in n dimensions with only Theta(k log n) nonadaptive linear measurements. With adaptivity -- if each measurement can be based on the previous ones -- this reduces to O(k log log n). But what happens if the measurement matrices can only be chosen in a few rounds, as seen (for example) in constant-pass streaming algorithms? This talk will give upper and lower bounds, showing (up to a log^* k factor) that R rounds of adaptivity require Theta(k log^{1/R} n) measurements.

Universal Sketches, Vladimir Braverman (Johns Hopkins University)
Streaming and sketching algorithms have found many applications in computer science and other areas. A typical sketching algorithm approximates one function. Given a class of functions F, it is natural to ask if it is possible to compute a single sketch S that will approximate every function f from F. We call S a “universal sketch” for F. In this talk we will discuss results on universal sketches for several classes of functions. For example, we will describe a sketch that approximates a sub-class of symmetric norms (a norm is symmetric if it is invariant under sign-flips and coordinate-permutations) and outline a connection between universal sketches and concentration of measure and Milman’s theorem. Also, we will describe a recent result for subset (i.e. 0-1 weighted) l0 and l1 norms. For these problems we obtain nearly optimal upper and lower bounds for streaming space complexity. We will discuss the applicability of universal sketches for Software Defined Networks (SDN). For SDN, we will present the UnivMon (short for Universal Monitoring) framework that can simultaneously achieve both generality and high fidelity across a broad spectrum of monitoring tasks.
This talk is based on joint works with Jaroslaw Blasiok, Stephen R. Chestnut, Robert Krauthgamer and Lin F. Yang (STOC 2017), with Robert Krauthgamer and Lin F. Yang (submitted) and with Zaoxing Liu, Antonis Manousis, Gregory Vorsanger, and Vyas Sekar (HotNets 2015, SIGCOMM 2016).

Approximating the Cost of a Metric K-Nearest Neighbor Graph in Sublinear Time, Christian Sohler (Technische Universität Dortmund and Google Switzerland)
Let (X,d) be an n-point metric space. We assume that (X,d) is given in the distance oracle model, i.e., X={1,...,n} and for every pair of points x,y from X we can query their distance d(x,y) in constant time. A k-nearest neighbor (k-NN) graph for (X,d) is a directed graph G=(V,E) in which every vertex v has an edge to each of its k nearest neighbors. We use cost(G) to denote the sum of edge weights of G.
In this paper, we study the problem of approximating cost(G) in sublinear time, when we are given oracle access to the metric space (X,d) that defines G. Our goal is to develop an algorithm that solves this problem faster than the time required to compute G. To this end, we develop an algorithm that in time ~O(min (n k^{3/2} / eps^6, n^2 / (eps^2 k))) computes an estimate K for cost(G) that satisfies, with probability at least 2/3,
|cost(G) - K| ≤ eps · (cost(G) + mst(X)),
where mst(X) denotes the cost of the minimum spanning tree of (X,d).
Joint work with Artur Czumaj. Work was done as part of the speaker's affiliation with Google Switzerland.
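
For reference, the quantity cost(G) defined in the abstract above can be computed exactly by brute force with n(n-1) distance-oracle queries. The short Python sketch below shows only that quadratic baseline, which the sublinear algorithm is designed to beat; it is not the algorithm from the talk. The function name and the use of Euclidean distance as the default oracle are assumptions made for illustration.

```python
import math


def knn_graph_cost(points, k, dist=math.dist):
    """Exact cost(G): for every vertex, add the distances to its k nearest neighbors.

    `dist` plays the role of the distance oracle d(x, y); this baseline
    queries it for every ordered pair of distinct points.
    """
    total = 0.0
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        total += sum(distances[:k])
    return total
```

For three collinear points at positions 0, 1 and 3, the 1-NN graph has edges of length 1, 1 and 2, so `knn_graph_cost([(0, 0), (1, 0), (3, 0)], k=1)` returns 4.0.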

Independent samples from an unknown probability distribution p on a domain of size k are distributed across n players, with each player holding one sample. Each player can send a message to a central referee in a simultaneous message passing (SMP) model of communication, whose goal is to solve a pre-specified inference problem. The catch, however, is that each player cannot simply send their own sample to the referee; instead, the message they send must obey some (local) information constraint. For instance, each player may be limited to communicating only L bits, where L << log k; or they may seek to reveal as little information as possible, and preserve local differential privacy. We propose a general formulation for inference problems in this distributed setting, and instantiate it to two fundamental inference questions, learning and uniformity testing. We study the role of randomness for those questions, and obtain striking separations between public- and private-coin protocols for the latter, while showing the two settings are equally powerful for the former. (Put differently, sharing with your neighbors does help a lot for the test, but not really for the learning.) Based on joint works with Jayadev Acharya (Cornell University), Cody Freitag (Cornell University), and Himanshu Tyagi (IISc Bangalore).

We initiate the study of the role of erasures in local decoding and use our understanding to prove a separation between erasure-resilient and tolerant property testing. Local decoding in the presence of errors has been extensively studied, but has not been considered explicitly in the presence of erasures. Motivated by applications in property testing, we begin our investigation with local *list* decoding in the presence of erasures. We prove an analog of a famous result of Goldreich and Levin on local list decodability of the Hadamard code. Specifically, we show that the Hadamard code is locally list decodable in the presence of a constant fraction of erasures, arbitrarily close to 1, with list sizes and query complexity better than in the Goldreich-Levin theorem. We use this result to exhibit a property which is testable with a number of queries independent of the length of the input in the presence of erasures, but requires a number of queries that depends on the input length, $n$, for tolerant testing. We further study *approximate* locally list decodable codes that work against erasures and use them to strengthen our separation by constructing a property which is testable with a constant number of queries in the presence of erasures, but requires $n^{\Omega(1)}$ queries for tolerant testing. Next, we study the general relationship between local decoding in the presence of errors and in the presence of erasures. We observe that every locally (uniquely or list) decodable code that works in the presence of errors also works in the presence of twice as many erasures (with the same parameters up to constant factors). We show that there is also an implication in the other direction for locally decodable codes (with unique decoding): specifically, that the existence of a locally decodable code that works in the presence of erasures implies the existence of a locally decodable code that works in the presence of errors and has related parameters.
However, it remains open whether there is an implication in the other direction for locally *list* decodable codes. (Our Hadamard result shows that there has to be some difference in parameters for some settings.) We relate this question to other open questions in local decoding. Based on joint work with Noga Ron-Zewi and Nithin Varma.

Sublinear Time Local-Access Random Generators, Ronitt Rubinfeld (Massachusetts Institute of Technology)

No abstract available.

Locality sensitive hashing (LSH) is a popular technique for nearest neighbor search in high dimensional data sets. Recently, a new view of LSH as a biased sampling technique has been fruitful for density estimation problems in high dimensions. Given a set of points and a query point, the goal (roughly) is to estimate the density of the data set around the query. One way to formalize this is by kernel density estimation: Given a function that decays with distance and represents the "influence" of a data point at the query, sum up this influence function over the data set. Yet another way to formalize this problem is by counting the number of data points within a certain radius of the query. While these problems can easily be solved by making a linear pass over the data, this can be prohibitive for large data sets and multiple queries. Can we preprocess the data so as to answer queries efficiently? This talk will survey several recent papers that use locality sensitive hashing to design unbiased estimators for such density estimation problems and their extensions. This talk will survey joint works with Arturs Backurs, Piotr Indyk, Vishnu Natchu, Paris Syminelakis and Xian (Carrie) Wu.

We consider algorithms that take an unlabeled data set and label it in its entirety, given the ability to interact with a human expert. The goal is to minimize the amount of interaction while producing a labeling that satisfies an (epsilon, delta) guarantee: with probability at least 1-delta over the randomness in the algorithm, at most an epsilon fraction of the labels are incorrect. Scenario 1: The algorithm asks the expert for labels of specific points. This is the standard problem of active learning, except that the final product is a labeled data set rather than a classifier. Scenario 2: The expert also provides "weak rules" or helpful features. We will summarize the state of the art on these problems, in terms of promising algorithms and statistical guarantees, and identify key challenges and open problems.

An Optimal Space Lower Bound for Approximating MAX-CUT, Michael Kapralov (Ecole Polytechnique Federale de Lausanne)
We consider the problem of estimating the value of MAX-CUT in a graph in the streaming model of computation. At one extreme, there is a trivial $2$-approximation for this problem that uses only $O(\log n)$ space, namely, count the number of edges and output half of this value as the estimate for the size of the MAX-CUT. On the other extreme, for any fixed $\eps > 0$, if one allows $\tilde{O}(n)$ space, a $(1+\eps)$-approximate solution to the MAX-CUT value can be obtained by storing an $\tilde{O}(n)$-size sparsifier that essentially preserves MAX-CUT value.
Our main result is that any (randomized) single pass streaming algorithm that breaks the $2$-approximation barrier requires $\Omega(n)$-space, thus resolving the space complexity of any non-trivial approximations of the MAX-CUT value to within polylogarithmic factors in the single pass streaming model. We achieve the result by presenting a tight analysis of the Implicit Hidden Partition Problem introduced by Kapralov et al.[SODA'17] for an arbitrarily large number $k$ of players. In this problem a number of players receive random matchings of $\Omega(n)$ size together with random bits on the edges, and their task is to determine whether the bits correspond to parities of some hidden bipartition, or are just uniformly random.
Unlike all previous Fourier analytic communication lower bounds, our analysis does not directly use bounds on the $\ell_2$ norm of Fourier coefficients of a typical message at any given weight level that follow from hypercontractivity. Instead, we use the fact that graphs received by players are sparse (matchings) to obtain strong upper bounds on the $\ell_1$ norm of the Fourier coefficients of the messages of individual players using their special structure, and then argue, using the convolution theorem, that similar strong bounds on the $\ell_1$ norm are essentially preserved (up to an exponential loss in the number of players) once messages of different players are combined. We feel that our main technique is likely of independent interest.
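
The trivial 2-approximation mentioned at the start of this abstract is easy to write down. The Python sketch below counts the edges in a single pass (one counter, hence $O(\log n)$ bits) and outputs half of that count. The brute-force routine is included only to check the guarantee on tiny graphs. Both function names are illustrative, and the sketch assumes a simple graph with no self-loops.

```python
from itertools import product


def streaming_maxcut_estimate(edge_stream):
    """Single pass over the edges with one counter: return m / 2.

    Any graph with m edges has a cut of size at least m / 2 and at most m,
    so m / 2 is within a factor 2 of the MAX-CUT value.
    """
    m = 0
    for _ in edge_stream:
        m += 1
    return m / 2


def brute_force_maxcut(n, edges):
    """Exact MAX-CUT value by trying all 2^n bipartitions (tiny graphs only)."""
    best = 0
    for side in product((0, 1), repeat=n):
        best = max(best, sum(side[u] != side[v] for u, v in edges))
    return best
```

On a triangle the estimate is 1.5 while the true MAX-CUT value is 2; the lower bound in the talk says that doing better than this factor of 2 in one pass already requires linear space.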

Any graph with maximum degree Delta admits a proper vertex coloring with Delta+1 colors that can be found via a simple sequential greedy algorithm in linear time and space. But can one find such a coloring via a sublinear algorithm? In this talk, I present new algorithms that answer this question in the affirmative for several canonical classes of sublinear algorithms including graph streaming, sublinear time, and massively parallel computation (MPC) algorithms. At the core of these algorithms is a remarkably simple meta-algorithm for the (Delta+1) coloring problem: Sample O(log n) colors for each vertex uniformly at random from the Delta+1 colors and then find a proper coloring of the graph using the sampled colors; our main structural result states that the sampled set of colors with high probability contains a proper coloring of the input graph.
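
The meta-algorithm described above can be illustrated with a small Python sketch: sample a short list of colors per vertex, then look for a proper coloring that only uses the sampled lists. This is a toy illustration under stated assumptions, not the algorithms from the talk. The second step here is a plain backtracking search (exponential in the worst case, fine for tiny graphs), whereas the talk's contribution is carrying it out in streaming, sublinear-time, and MPC models. The function names and the `list_size` parameter (standing in for the O(log n) sample size, with an unspecified constant) are hypothetical.

```python
import random


def sample_palettes(n, max_degree, list_size, rng):
    """Give each vertex `list_size` colors drawn uniformly without replacement
    from the palette {0, ..., max_degree} (capped at the full palette)."""
    colors = range(max_degree + 1)
    size = min(list_size, max_degree + 1)
    return [rng.sample(colors, size) for _ in range(n)]


def list_color(adj, palettes):
    """Backtracking search for a proper coloring using only the sampled colors.

    Returns a list of colors, or None if the sampled lists admit no proper coloring.
    """
    n = len(adj)
    coloring = [None] * n

    def solve(v):
        if v == n:
            return True
        used = {coloring[u] for u in adj[v] if coloring[u] is not None}
        for c in palettes[v]:
            if c not in used:
                coloring[v] = c
                if solve(v + 1):
                    return True
        coloring[v] = None  # undo before backtracking
        return False

    return coloring if solve(0) else None


def is_proper(adj, coloring):
    """Check that no edge has both endpoints in the same color."""
    return all(coloring[u] != coloring[v] for v in range(len(adj)) for u in adj[v])
```

Only edges whose endpoints share a sampled color can ever be violated, so after sampling an algorithm can discard all other edges; this is what makes the sampled instance small enough for sublinear models.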

Efficient algorithms for k-means clustering frequently converge to suboptimal partitions, and given a partition, it is difficult to detect k-means optimality. We discuss an a posteriori certifier of approximate optimality for k-means clustering based on Peng and Wei's semidefinite relaxation of k-means.

Efficient algorithms for k-means clustering frequently converge to suboptimal partitions, and given a partition, it is difficult to detect k-means optimality. In this paper, we develop an a posteriori certifier of approximate optimality for k-means clustering. The certifier is a sub-linear Monte Carlo algorithm based on Peng and Wei's semidefinite relaxation of k-means. In particular, solving the relaxation for small random samples of the dataset produces a high-confidence lower bound on the k-means objective, and being sub-linear, our algorithm is faster than k-means++ when the number of data points is large. If the data points are drawn independently from any mixture of two Gaussians over R^m with identity covariance, then with probability 1 - O(1/m), our poly(m)-time algorithm produces a 3-approximation certificate with 99% confidence (no separation required). We also introduce a linear-time Monte Carlo algorithm that produces an O(k) additive approximation lower bound.

We provide a novel (and, to the best of our knowledge, the first) algorithm for high dimensional sparse regression with corruptions in explanatory and/or response variables. Our algorithm recovers the true sparse parameters in the presence of a constant fraction of arbitrary corruptions. Our main contribution is a robust variant of Iterative Hard Thresholding. Using this, we provide accurate estimators with sub-linear sample complexity. Our algorithm consists of a novel randomized outlier removal technique for robust sparse mean estimation that may be of interest in its own right: it is orderwise more efficient computationally than existing algorithms, and succeeds with high probability, thus making it suitable for general use in iterative algorithms.

We study the PCA and column subset selection problems in matrices in an online setting, where the columns arrive one after the other. In the context of column subset selection, the goal is to decide whether to include or discard a column, as it arrives. We design a simple algorithm that includes at most $O(k \operatorname{polylog} n)$ columns overall and achieves a multiplicative $(1+\epsilon)$ error compared to the best rank-k approximation of the full matrix. This result may be viewed as an analog of the classic result of Myerson on online clustering.

The need for fast computation typically requires tradeoffs with statistical accuracy; here we are interested in whether computation can be significantly improved without trading off accuracy.
In particular, for best possible accuracy in NN prediction, the number of neighbors generally needs to grow as a root of n (sample size), consequently limiting NN-search (any technique) to order of root of n complexity; in other words, expensive prediction seems unavoidable, even while using fast search methods, if accuracy is to be optimal. Unfortunately, the usual alternative is to trade off accuracy.
Interestingly, we show that it is possible to maintain accuracy, while reducing computation (at prediction time) to just O(log n), through simple bias and/or variance correction tricks applied after data quantization or subsampling, together with (black box) fast search techniques. Furthermore, our analysis yields clear insights into how much quantization or subsampling is tolerable if optimal accuracy is to be achieved.
Our theoretical insights are validated through extensive experiments with large datasets from various domains.
The talk is based on a series of works with N. Verma, and with L. Xue.

We establish a generic reduction from nonlinear spectral gaps of metric spaces to space partitions, in the form of data-dependent Locality-Sensitive Hashing. This yields a new approach to the high-dimensional Approximate Near Neighbor Search problem (ANN). Using this reduction, we obtain a new ANN data structure under an arbitrary d-dimensional norm, where the query algorithm makes only a sublinear number of probes into the data structure. Most importantly, the new data structure achieves an O(log d) approximation for an arbitrary norm. The only other such generic approach, via John's ellipsoid, would achieve only a square-root-d approximation. Joint work with Assaf Naor, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten.

I will show the first approximate nearest neighbor search data structure for a general d-dimensional normed space with sub-polynomial in d approximation.
The main tool is a finite-dimensional quantitative version of a theorem of Daher, which yields a Hölder homeomorphism between small perturbations of a normed space of interest and a Euclidean space. To make Daher's theorem algorithmic, we employ convex programming to compute the norm of a vector in a space, which is the result of complex interpolation between two given normed spaces.
Based on a joint work (FOCS 2018) with Alex Andoni, Assaf Naor, Sasho Nikolov and Erik Waingarten.

Theoretical work on high-dimensional nearest neighbor search has focused on the setting where a single point is sought within a known search radius, and an acceptable approximation ratio c is given. Locality Sensitive Hashing is a powerful framework for addressing this problem. In practice one usually seeks the (exact) k nearest points, the search radius is unknown, and the parameter c must be chosen in a way that depends on the data distribution. Though reductions of the latter problem to the former exist, they incur polylogarithmic overhead in time and/or space, which in turn make them unattractive in many practical settings. We address this discrepancy between theory and practice by suggesting new, simple, more efficient reductions for solving the k-Nearest Neighbor search problem using Locality Sensitive Hashing. Joint work with Tobias Christiani and Mikkel Thorup.

The main story of this talk is how theoretical ideas on randomized sampling algorithms became part of production code at Twitter. The specific context is finding pairs of similar items: a classic algorithmic problem that is an integral part of recommendation systems. In most incarnations, it boils down to finding high inner products among a large collection of vectors, or alternately high entries in a matrix product. Despite a rich literature on this topic (and despite Twitter's significant compute resources), none of the existing methods scaled to "industrial sized" inputs, which exceed hundreds of billions of non-zeros. I will talk about a distributed algorithm for this problem, that combines low-dimension projections (hashes) with path-sampling techniques (wedges). There is some cute math behind the algorithm, and we were able to run it in production on Twitter's recommendation system. Joint work with Aneesh Sharma (Twitter) and Ashish Goel (Stanford).

In this talk I will discuss how to recover spectral approximations to broad classes of structured matrices using only a polylogarithmic number of adaptive linear measurements to either the matrix or its inverse. Leveraging this result I will discuss how to achieve faster algorithms for solving a variety of linear algebraic problems including solving linear systems in the inverse of symmetric M-matrices (a generalization of Laplacian systems), solving linear systems that are constant spectral approximations of Laplacians (or more generally, SDD matrices), and recovering a spectral sparsifier of a graph using only a polylogarithmic number of matrix-vector multiplies. More broadly this talk will show how to leverage a number of recent approaches to spectral sparsification towards expanding the robustness and scope of recent nearly linear time linear system solving research, and providing general matrix recovery machinery that may serve as a stepping stone for faster algorithms. This talk reflects joint work with Arun Jambulapati and Kiran Shiragur.

Spatial Scan Statistics measure and detect anomalous spatial behavior; specifically, they identify geometric regions where significantly more of a measured characteristic is found than would be expected from the background distribution. These techniques have been used widely in geographic information science, such as to pinpoint disease outbreaks. However, until recently, available algorithms and software only scaled to at most a thousand or so spatial records. In this talk I will describe how, using coresets, efficient constructions, and scanning algorithms, we have developed new algorithms and software that easily scale to millions or more data points. Along the way we provide new efficient algorithms and constructions for eps-samples and eps-nets for various geometric range spaces. This is a case where subtle theoretical improvements of old structures from discrete geometry actually result in substantial empirical improvements.


## Friday, April 19, 2019

### Shedding Light on the “Grand Débat”

This blog post was initially featured on Medium. At LightOn, we decided to start a blog and explain what can be done with the random projections performed by our Optical Processing Unit (OPU). Since most people think an optics based system can do only images, we decided to illustrate our technology on an NLP task. Here is the article:

This article is the first contribution from LightOn’s AI Research (LAIR) focused on applying LightOn’s Optical Processing Unit (OPU) hardware to generic large-scale Machine Learning tasks. Today we tackle the recently released public dataset of a nation-wide public debate, in our first trial of LightOn’s OPU on a Natural Language Processing task.

As a response to the “Gilets Jaunes” social unrest in France, President Emmanuel Macron declared the opening of a nation-wide citizen debate (“Grand Débat National”), as a way to collect the opinion of the French people on a number of subjects. Aside from public meetings taking place throughout the country, the “Grand Débat” took the form of an online platform where people could answer questions around four major themes: the environment, taxes, state organization, and democracy / citizenship. For every theme, the survey was divided into two parts: the first part featured multiple-choice surveys, while the second part included open-ended questions. A subset of the answers was made openly available by Etalab, the public organization responsible for the Open Data policy of the French government. The data can be found here.
The deadline for survey submissions to the “Grand Débat” was March 18, 2019. The French government provided a first analysis on Monday, four days ago. At LightOn, we have been looking for applications of our Optical Processing Unit (OPU) to various Machine Learning problems including Natural Language Processing (NLP). We therefore took this opportunity to feature some exploratory tests on this new dataset. It should be emphasized that no one at LightOn is currently an expert in NLP: the goal of this article is only to demonstrate the kind of data processing that can be done using an OPU. We want to show how this can be done, in only a few days of work, and with no specific knowledge of NLP but a generic Machine Learning background.
The first NLP application of the OPU technology was to form sentence embeddings from word embeddings. Word embeddings are vector representations of words introduced in their modern form by Mikolov et al. (2013)¹. They have been extremely successful as a first step of virtually all NLP models: we replace vanilla one-hot-encoded vectors, whose dimension is the size of the vocabulary, with these typically 300-dimensional dense real-valued vectors. The space of word embeddings even has a sensible arithmetic, the classical example being KING - MAN + WOMAN = QUEEN. To obtain these embeddings, a linear neural network is trained to predict the surrounding words given the current word in a sliding window over a huge text corpus (this model is called skip-gram; predicting the current word from its context is also possible). At the end of the training, the projection columns of the weight matrix are the word embeddings. You can see it as a kind of shallow transfer learning, where only the first layer is pre-trained.
Sentence embeddings target similar representations, but for whole sentences. It is a much harder task than word embeddings because, while words are fixed entities and in finite and reasonable number (at least for a single language), sentences are compositional and sequential in nature. They have several degrees of freedom stemming from meaning, word choice, syntax, tone, length, etc. This diversity not only means that there is more information to encode, but also that it’s virtually impossible to save an embedding for every possible sentence. Sentence embeddings instead have to be formed on-the-fly from their constituent words.
At the time of this writing, some of the leading models for sentence embeddings are SkipThought² and InferSent³. In particular, the paper that sparked our interest in sentence embeddings was the recent work of Wieting and Kiela (2019)⁴. This work shows that it is possible to obtain results on SentEval⁵ comparable to these advanced models with no training at all. One of the models used in this paper, Bag Of Random Embedding Projections (BOREP), uses random projections, which happens to be the operation an OPU excels at. As a result, we decided to give it a try. We followed BOREP but replaced the linear random projection with an OPU transform, that is, a complex-valued random projection followed by an element-wise non-linearity. The model is very simple: we get the embedding of each word in a sentence, we project it to a much higher dimension, and then we apply a pooling function (the mean in practice) to obtain the embedding of the sentence.
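
The pipeline just described (embed each word, project to a much higher dimension, mean-pool) can be sketched in a few lines of NumPy. This is a software simulation under stated assumptions, not LightOn's code or API: the random matrix is drawn as i.i.d. complex Gaussian, the element-wise non-linearity is assumed to be the squared modulus, and all function names and dimensions are illustrative.

```python
import numpy as np


def make_projection(in_dim, out_dim, rng):
    """Fixed complex Gaussian random matrix (an assumed stand-in for the OPU)."""
    real = rng.standard_normal((in_dim, out_dim))
    imag = rng.standard_normal((in_dim, out_dim))
    return (real + 1j * imag) / np.sqrt(2)


def opu_like_transform(x, R):
    """Complex random projection followed by an element-wise non-linearity.

    The squared modulus is an assumption made for this sketch.
    """
    return np.abs(x.astype(float) @ R) ** 2


def sentence_embedding(word_vectors, R):
    """Project every (binary) word vector, then mean-pool over the words."""
    projected = opu_like_transform(word_vectors, R)  # shape: (n_words, out_dim)
    return projected.mean(axis=0)                    # shape: (out_dim,)
```

With binary word vectors stacked as rows of `word_vectors`, `sentence_embedding(word_vectors, R)` returns one vector per sentence. No parameter is trained; only the downstream classifier on top of these embeddings is.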

The input to the current OPU prototypes (with remote access available to selected users) is a binary vector, meaning in this case that we need to use binary word embeddings. Since there is no such native embedding, we need to binarize real-valued word embeddings. For the latter, we chose fastText⁶, as it is among the best-performing word embedding methods to date. To binarize the embeddings, we could use any standard binary encoding scheme. However, it is hard to know a priori how destructive it would be for this kind of data. A possibly smarter thing to do is to train an autoencoder to do so. Luckily, some people did just that: binarize word embeddings using an autoencoder⁷. The associated GitHub repo is in C, does not make use of a GPU and cannot handle large embeddings due to memory limitations, so we re-implemented it in PyTorch. It is hard to tell anything from the loss, so we evaluated our binary word embeddings directly on SentEval and obtained satisfying results for most tasks, except Semantic Textual Similarity (STS 12–16), for which there seems to be an unavoidable information loss. On a number of tasks, they performed even better than the original ones, probably because of their higher dimension. Indeed, the paper focuses on memory and computation savings, so the authors use rather low dimensional binary embeddings (up to 512 bits for an original dimension of 300 real numbers). Our goal is instead to obtain binary embeddings with the smallest possible information loss, so we went to much higher dimensions. Keep in mind that producing random projections on the OPU is essentially independent of the size (currently up to dimensions of 1 million). In the end we used 3000 bits for each vector, so 10 bits per real number instead of 32, which seems reasonable.
Once this is done, we can actually use the OPU to form sentence embeddings using the aforementioned BOREP method. Evaluating it on SentEval gives satisfying results, in many tasks superior to those of the paper (even the Echo State Network). It should be noted that we didn’t find it useful to go above 15,000 dimensions, meaning that a high-end GPU could also have done the job in this case.
There are several reasons why increasing the dimensionality could increase performance on downstream tasks. The first is Cover’s theorem: projecting data non-linearly to a higher dimension increases the chance of them being linearly separable in the new high-dimensional space. This, of course, plays a role but the fact that a linear projection also improves performance in the Wieting and Kiela paper proves that simply adding parameters to the model already helps a lot. More interestingly, we can draw a parallel with work on hyperdimensional computing⁸. In very high-dimensional spaces (several thousand dimensions), the sheer volume available makes it possible to represent an entire dataset with no two points close to each other. The distances between two random points follow a very narrow Gaussian distribution: all points are more or less equidistant. In this setting, the mean of a few vectors is more similar to each of these vectors than any other point. The mean vector is then not merely a lossy summary of the original vectors but actually represents the set that contains them. We can thus expect the mean of the high-dimensional projections of the word embeddings to be truly more representative of the sentence than a mean computed in a lower-dimensional space, irrespective of the fact that the model used in downstream applications will have more parameters to accommodate this higher dimensionality.
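
The two geometric claims in this paragraph are easy to check numerically. The NumPy sketch below is only an illustration on synthetic Gaussian vectors, not an experiment on the actual embeddings; the function name, the dimension, and the numbers of vectors are arbitrary choices. It measures how narrow the distribution of distances between random points is, and compares the cosine similarity between a mean vector and its members with the similarity between that mean and unrelated random points.

```python
import numpy as np


def mean_similarity_demo(dim=10_000, n_members=5, n_outsiders=100, seed=0):
    """Return (relative spread of pairwise distances,
               cosine similarities mean-vs-members,
               cosine similarities mean-vs-outsiders)."""
    rng = np.random.default_rng(seed)
    members = rng.standard_normal((n_members, dim))
    outsiders = rng.standard_normal((n_outsiders, dim))

    # Distances between consecutive random points: nearly all the same length.
    dists = np.linalg.norm(outsiders[1:] - outsiders[:-1], axis=1)
    relative_spread = dists.std() / dists.mean()

    mean = members.mean(axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    sim_members = [cosine(mean, m) for m in members]
    sim_outsiders = [cosine(mean, o) for o in outsiders]
    return relative_spread, sim_members, sim_outsiders
```

For independent Gaussian vectors, the similarity between the mean of five vectors and each of them is expected to be close to 1/√5, while the similarity with an unrelated vector concentrates around 0 with a spread of about 1/√dim. That gap is what lets the mean act as a representative of the set.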
For our task of dealing with answers to open-ended questions, which is notoriously difficult, we decided that an interactive visualisation would be a nice way to explore the answers. The goal was to see clusters of similar answers as a summary of what people had to say on the matter. So we chose a specific question and formed an embedding for every answer as if it were a single sentence. From visual observations, if an answer contains several sentences, they are most of the time related, so grouping them seems a reasonable first-attempt assumption. We end up with almost a hundred thousand points in dimension 15,000 (which is also why going above 15,000 would have been problematic) and want to visualise them. A PCA to 2 or 3 dimensions doesn’t give a satisfying result, so we first keep the first 50 principal components and then use t-SNE⁹ to obtain a 2D representation of the data. The result exhibits some interesting patterns but upon inspection, we realize that the global structure doesn’t make much sense. Indeed, t-SNE only preserves local neighborhoods and loses any sense of global structure. Clusters of loosely similar answers are thus not close to each other and can even be on opposite sides of the graph. Increasing the perplexity of t-SNE is supposed to favor global structure over local structure, but this had little effect, so we used another technique. t-SNE is an iterative, stochastic method requiring an initial projection. This initialization is usually random but it doesn’t have to be. The first two principal components are the best 2D representation of the global structure of the data, so we use them (in practice taking the first two columns of the 50-component PCA we already computed) as the initial state of t-SNE. This allows us to preserve global structure while benefiting from the superior visualizing power of t-SNE. Follow this link to see an interactive version (in French) of this visualization. 
It only contains 10,000 answers, sampled uniformly without replacement, so that the interactivity remains smooth. Some annotated screenshots are shown just below. What you see are the answers to the first question of the “ecology” (environment) questionnaire. The question was:
What is, in your opinion, the most important concrete problem regarding ecology?
1. Air pollution
2. Biodiversity and species extinction
3. Climate change
4. Shore erosion
5. Other: answer in plain text

It was a semi-open question with 4 suggested answers but also the possibility to write another answer in plain text if one was not satisfied with the suggestions. Very few people answered 4, so it does not stand out on the graph, but you can clearly see answers 1, 2 and 3 circled in red on the first graph. In green are answers that mentioned two of the suggested answers and in yellow answers that gave variations of one of the suggested answers, e.g. global warming instead of climate change or water pollution instead of air pollution. A significant portion of people refused to choose and said that all suggestions were equally important.

Our visualization technique mostly achieves our objective for short answers, less so for longer ones. The latter are instead gathered in the middle, where more groups can be found, as you can see here:

Long answers are hard to classify as belonging to one group or another, and are likely to be far from any other answer, as mentioned before. We believe the big cloud in the middle to be an artifact of the projection, which gathers such isolated answers, rather than of the embeddings.
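For readers who want to reproduce the projection used for these plots (PCA to 50 components, then t-SNE started from the first two principal components), here is a minimal scikit-learn sketch. It is not the code used for the post: the function name, the perplexity value, the rescaling of the initial layout to a small standard deviation and the random toy data are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def pca_initialised_tsne(embeddings, n_pca=50, perplexity=30.0, seed=0):
    """Reduce with PCA, then run t-SNE starting from the first two components."""
    reduced = PCA(n_components=n_pca, random_state=seed).fit_transform(embeddings)

    # First two principal components as the initial 2D layout.
    # Assumption: shrink them to a small scale, as is commonly done for PCA
    # initialisation, so that early iterations are not dominated by the init.
    init = reduced[:, :2].copy()
    init = init / init[:, 0].std() * 1e-4

    tsne = TSNE(n_components=2, init=init, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(reduced)


# Toy data standing in for the sentence embeddings (300 points, 200 dimensions).
rng = np.random.default_rng(0)
toy_embeddings = rng.standard_normal((300, 200))
layout = pca_initialised_tsne(toy_embeddings)
print(layout.shape)
```

Passing an array as `init` replaces the random starting layout, so the relative placement of clusters starts from the PCA picture instead of from noise.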
Overall, the result is rather satisfying, given the time spent. There is, of course, much room for improvement, though, with at least three directions to explore:
• text preprocessing: we did tokenization, stop words removal and lemmatization, which are standard practices, but more thought could be put into it. For instance, a spelling corrector should help;
• multi-sentence answers: can we separate semantically independent parts of long answers to give them tags? Can we combine the sentence embeddings of a multi-sentence answer in a better way than a simple average?
• better sentence embeddings.

On the last point, one could, of course, use a more complex model than ours, e.g. SkipThought and InferSent. More recent models such as QuickThought¹⁰ or multi-task learning approaches have also been proposed¹¹ ¹². However, all these models are much harder to use than ours. Trained encoders are available in English but not in French, so one would have to retrain them. For supervised approaches, the data doesn’t even exist in French. SkipThought is unsupervised but takes several weeks to train. The best chance would be QuickThought but still it requires a book corpus in French and a day of training on a powerful GPU (a Titan X in the paper).
It is instead very easy to obtain satisfying results quickly with our model and OPU hardware. No training is needed, and any language works as long as word embeddings are available (fastText supports 157 languages); all that is required is the projection time (less than a minute in our case) and enough RAM to store the results. The disadvantage is that the pooling function in our model loses word ordering information. The next step for us is therefore to try a time-aware model such as an Echo State Network, which was used with success in Wieting and Kiela (2019) and which can also be implemented very efficiently at large scale on an OPU¹³.
More info on LightOn’s OPU technology can be found at www.lighton.io. Remote access to an OPU through the LightOn Cloud can be granted by invitation.

References
[1] Tomas Mikolov et al. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781 (2013).
[2] Ryan Kiros et al. “Skip-thought vectors.” Advances in Neural Information Processing Systems. 2015.
[3] Alexis Conneau et al. “Supervised learning of universal sentence representations from natural language inference data.” arXiv preprint arXiv:1705.02364 (2017).
[4] John Wieting and Douwe Kiela. “No training required: Exploring random encoders for sentence classification.” arXiv preprint arXiv:1901.10444 (2019).
[5] Alexis Conneau and Douwe Kiela. “Senteval: An evaluation toolkit for universal sentence representations.” arXiv preprint arXiv:1803.05449 (2018).
[6] Piotr Bojanowski et al. “Enriching word vectors with subword information.” CoRR abs/1607.04606 (2016).
[7] Julien Tissier, Guillaume Gravier and Amaury Habrard. “Near-lossless Binarization of Word Embeddings.” arXiv preprint arXiv:1803.09065 (2018).
[8] Pentti Kanerva. “Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors.” Cognitive Computation 1.2 (2009): 139–159.
[9] Laurens van der Maaten and Geoffrey Hinton. “Visualizing data using t-SNE.” Journal of Machine Learning Research 9.Nov (2008): 2579–2605.
[10] Lajanugen Logeswaran and Honglak Lee. “An efficient framework for learning sentence representations.” arXiv preprint arXiv:1803.02893 (2018).
[11] Sandeep Subramanian et al. “Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning.” CoRR abs/1804.00079 (2018).
[12] Daniel Cer et al. “Universal sentence encoder.” arXiv preprint arXiv:1803.11175 (2018).
[13] Jonathan Dong et al. “Scaling up Echo-State Networks with multiple light scattering.” 2018 IEEE Statistical Signal Processing Workshop (2018), arXiv:1609.05204

Author: François Boniface, Machine Learning R&D engineer at LightOn AI Research

Join the CompressiveSensing subreddit or the Facebook page and post there !

## Thursday, April 18, 2019

### Video: Robust and High-Dimensional Statistics workshop - Oct. 29 – Nov. 2, 2018 (Simons Institute and Kavli Foundation)

The Robust and High-Dimensional Statistics workshop is part of the Program on Foundations of Data Science sponsored by the Simons Institute for the Theory of Computing at Berkeley and the Kavli Foundation. It took place Oct. 29 – Nov. 2, 2018 in Berkeley. Thank you to the organizers for making this workshop a reality: Andrea Montanari, Emmanuel Candès, Ilias Diakonikolas, Santosh Vempala. Here are the videos:

This workshop will focus on recent developments in high-dimensional statistics, with an emphasis on different notions of robustness, the extent to which recent developments in theoretical computer science can lead to improvements with respect to traditional statistical metrics, challenges arising when the number of data points and the number of features diverge at similar rates, etc. Other potential topics are inference and causality, as well as inference after selection, i.e., data snooping and problems of multiple inference.

We discuss new techniques for approximating the mean of a Gaussian in the presence of a large fraction of adversarial errors. We show that by taking advantage of higher moments of these distributions, we can obtain errors close to the information-theoretic optimum, and present an application of this to learning mixtures of spherical Gaussians.

Mixture Models, Robustness, and Sum of Squares Proofs - Jerry Li (Massachusetts Institute of Technology)
We use the Sum of Squares (SoS) method to develop a new efficient algorithm for clustering and mean estimation in well-separated high-dimensional mixture models, substantially improving upon the statistical guarantees achieved by previous efficient algorithms. In particular, we study mixtures of k distributions, where every pair of distributions has means separated by at least k^epsilon for any epsilon > 0. In the special case of spherical Gaussian mixtures, we give a k^O(1/epsilon^2)-time algorithm that learns the means of the components in the mixture and accurately clusters samples from the mixture. This is the first algorithm to improve on greedy ("single-linkage") and spectral clustering, breaking a long-standing barrier for efficient algorithms at separation k^(1/4). Our techniques are based on adapting algorithmic ideas from robust statistics, and are potentially of independent interest. Our main algorithm for learning mixture models provides an entirely SoS interpretation of the convex programming framework of [Diakonikolas et al, FOCS 16]. We show that many of the proofs from that paper can be replaced with much simpler proofs using only basic concentration and Hölder's inequality, which allows us to approach this problem via SoS. As a corollary of this, we also obtain improved rates for robust mean estimation in certain regimes. Joint work with Sam Hopkins (Berkeley).

What is the effect of robustness on the computational complexity of high-dimensional estimation? In this talk, I will describe a technique that establishes computational-statistical tradeoffs for a range of robust estimation problems using the lens of the Statistical Query (SQ) complexity. The prototypical applications of our technique will be for the problems of robust mean and covariance estimation. The talk will be based on joint works with Daniel Kane (UCSD) and Alistair Stewart (USC).

We give the first polynomial-time algorithm for performing linear or polynomial regression resilient to adversarial corruptions in both examples and labels. Given a sufficiently large (polynomial-size) training set drawn i.i.d. from distribution D and subsequently corrupted on some fraction of points, our algorithm outputs a linear function whose squared error is close to the squared error of the best-fitting linear function with respect to D, assuming that the marginal distribution of D over the input space is certifiably hypercontractive. This natural property is satisfied by many well-studied distributions such as Gaussian distributions, strongly log-concave distributions, and the uniform distribution on the hypercube, among others. We also give a simple statistical lower bound showing that some distributional assumption is necessary to succeed in this setting. These results are the first of their kind and were not known to be even information-theoretically possible prior to our work.

Classical hypothesis testing seeks to decide whether given data is signal or noise. Likelihood ratio (LR) tests are known to minimize the probability of false positive (FP) for any given probability of false negative (FN).
We consider data which is either all noise, e.g. drawn from a standard Gaussian $N(0,1)$, or mostly noise with a weak signal, e.g. drawn from a Gaussian mixture $\epsilon N(\mu,1) + (1-\epsilon) N(0,1)$. We seek tests for which both FP and FN go to zero as the number of i.i.d. samples goes to infinity. We show essentially that ideal tests exist if the chi-squared distance between signal and noise is higher than a certain threshold.
Interestingly, it turns out that the best tests do not use LR, but a related, yet different quantity.
The proofs are simple and the result is work in progress. The talk will describe things from first principles.
Joint Work with Richard Karp.

Graph distances have proven quite useful in machine learning/statistics, particularly in the estimation of Euclidean or geodesic distances. The talk will include a partial review of the literature, and then present more recent developments on the estimation of curvature-constrained distances on a surface, as well as on the estimation of Euclidean distances based on an unweighted and noisy neighborhood graph.

We develop model-based methods for solving stochastic convex optimization problems, introducing the approximate-proximal point, or aProx, family, which includes stochastic subgradient, proximal point, and bundle methods. When the modeling approaches we propose are appropriately accurate, the methods enjoy stronger convergence and robustness guarantees than classical approaches, even though the model-based methods typically add little to no computational overhead over stochastic subgradient methods. For example, we show that improved models converge with probability 1 and enjoy optimal asymptotic normality results under weak assumptions; these methods are also adaptive to a natural class of what we term easy optimization problems, achieving linear convergence under appropriate strong growth conditions on the objective. Our substantial experimental investigation shows the advantages of more accurate modeling over standard subgradient methods across many smooth and non-smooth optimization problems.

Heterogeneity across different sub-populations or "homogeneous blocks" can be beneficially exploited for causal inference and novel robustness, with wide-ranging prospects for various applications. The key idea relies on a notion of probabilistic invariance or stability: it opens up new insights for formulating causality as a certain risk minimization problem with a corresponding notion of robustness. The novel methodology has connections to instrumental variable regression and robust optimization.

Uniform stability of a learning algorithm is a classical notion of algorithmic stability introduced to derive high-probability bounds on the generalization error (Bousquet and Elisseeff, 2002). Specifically, for a loss function with range bounded in $[0,1]$, the generalization error of a $\gamma$-uniformly stable learning algorithm on $n$ samples is known to be at most $O((\gamma +1/n) \sqrt{n \log(1/\delta)})$ with probability at least $1-\delta$. Unfortunately, this bound does not lead to meaningful generalization bounds in many common settings where $\gamma \geq 1/\sqrt{n}$. Here we prove substantially stronger generalization bounds for uniformly stable algorithms without any additional assumptions. First, we show that the generalization error in this setting is at most $O(\sqrt{(\gamma + 1/n) \log(1/\delta)})$ with probability at least $1-\delta$. In addition, we prove a tight bound of $O(\gamma^2 + 1/n)$ on the second moment of the generalization error. The best previous bound on the second moment of the generalization error is $O(\gamma + 1/n)$. Our proofs are based on new analysis techniques and our results imply substantially stronger generalization guarantees for several well-studied algorithms. Joint work with Jan Vondrak (Stanford).
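To see why the classical bound stops being informative in that regime, it may help to substitute the boundary case $\gamma = 1/\sqrt{n}$ into both bounds quoted in the abstract. This is a worked substitution added for illustration, not part of the abstract itself.

$$(\gamma + 1/n)\sqrt{n \log(1/\delta)} = \left(1 + n^{-1/2}\right)\sqrt{\log(1/\delta)} = \Theta\!\left(\sqrt{\log(1/\delta)}\right)$$

$$\sqrt{(\gamma + 1/n)\log(1/\delta)} = \sqrt{\left(n^{-1/2} + n^{-1}\right)\log(1/\delta)} = \Theta\!\left(n^{-1/4}\sqrt{\log(1/\delta)}\right)$$

The first expression does not shrink as $n$ grows, so it says nothing useful about a loss bounded in $[0,1]$, whereas the second still vanishes at rate $n^{-1/4}$.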

Double machine learning provides $\sqrt{n}$-consistent estimates of parameters of interest even when high-dimensional or nonparametric nuisance parameters are estimated at an $n^{-1/4}$ rate. The key is to employ Neyman-orthogonal moment equations which are first-order insensitive to perturbations in the nuisance parameters. We show that the $n^{-1/4}$ requirement can be improved to $n^{-1/(2k+2)}$ by employing a k-th order notion of orthogonality that grants robustness to more complex or higher-dimensional nuisance parameters. In the partially linear regression setting popular in causal inference, we show that we can construct second-order orthogonal moments if and only if the treatment residual is not normally distributed. Our proof relies on Stein's lemma and may be of independent interest. We conclude by demonstrating the robustness benefits of an explicit doubly-orthogonal estimation procedure for treatment effect.

The talk will first review the problem of robust subspace recovery, which seeks an underlying low-dimensional subspace in a data set that is possibly corrupted with outliers. The emphasis will be on surveying existing theoretical guarantees and tradeoffs. New results for adversarial outliers will also be mentioned. Following this, other related problems will be discussed, along with new results for one of these problems.

We consider the Sherrington-Kirkpatrick model of spin glasses with ferromagnetically biased couplings. For a specific choice of the coupling mean, the resulting Gibbs measure is equivalent to the Bayesian posterior for a high-dimensional estimation problem known as "Z2 synchronization". Statistical physics suggests computing the expectation with respect to this Gibbs measure (the posterior mean in the synchronization problem) by minimizing the so-called Thouless-Anderson-Palmer (TAP) free energy, instead of the mean field (MF) free energy. We prove that this identification is correct, provided the ferromagnetic bias is larger than a constant (i.e. the noise level is small enough in synchronization). Namely, we prove that the scaled $\ell_2$ distance between any low energy local minimizers of the TAP free energy and the mean of the Gibbs measure vanishes in the large size limit. Our proof technique is based on upper bounding the expected number of critical points of the TAP free energy using the Kac-Rice formula.

Logistic regression is arguably the most widely used and studied non-linear model in statistics. Classical maximum likelihood theory provides asymptotic distributions for the maximum likelihood estimate (MLE) and the likelihood ratio test (LRT), which are universally used for inference. Our findings reveal, however, that when the number of features p and the sample size n both diverge, with the ratio p/n converging to a positive constant, classical results are far from accurate. For a certain class of logistic models, we observe that (1) the MLE is biased, (2) the variability of the MLE is much higher than classical results predict and (3) the LRT is not distributed as a chi-squared. We develop a new theory that quantifies the asymptotic bias and variance of the MLE, and characterizes the asymptotic distribution of the LRT under certain assumptions on the distribution of the covariates. Empirical results demonstrate that our asymptotic theory provides extremely accurate inference in finite samples. These novel results depend on the underlying regression coefficients through a single scalar, the overall signal strength, which can be estimated efficiently. This is based on joint work with Emmanuel Candes and Yuxin Chen.

This talk studies hypothesis testing and confidence interval construction in high-dimensional linear models where robustness is viewed in the context of whether the data is truly sparse. We will show new concepts of uniform and essentially uniform non-testability that allow the study of the limitations of tests across a broad set of alternatives. Uniform non-testability identifies an extensive collection of alternatives such that the power of any test, against any alternative in this group, is asymptotically at most equal to the nominal size, whereas minimaxity shows the existence of one particularly "bad" alternative. Implications of the new constructions include new minimax testability results that, in sharp contrast to existing results, do not depend on the sparsity of the model parameters and are therefore robust. We identify new tradeoffs between testability and feature correlation. In particular, we show that in models with weak feature correlations the minimax lower bound can be attained by a confidence interval whose width has the parametric rate regardless of the size of the model sparsity.

In modern science, data collection often precedes the careful specification of hypotheses. Large datasets are mined, testing a large number of possible hypotheses, with the goal of identifying those that hold promise for follow-up. In this framework, controlling the False Discovery Rate is an appropriate criterion to avoid investing time and resources into non-viable leads. In many contexts, the initial large collection of explored hypotheses is somewhat redundant: in an effort to maximize power, the same scientific statement can be probed with a number of related hypotheses. For example, the association between a phenotype and one genetic locus can be investigated by exploring the association between the phenotype and many genetic variants in the locus. After the first pass through the data is completed, however, and it is time to take stock of the identified scientific leads, this redundancy is corrected: in the example above, rather than reporting all variants associated with a phenotype, scientists routinely report only a "lead" variant, selected to represent the entire locus. Because the false discovery proportion is crucially defined with reference to the total set of discoveries, however, these subsets of discoveries identified post-hoc are not equipped with guarantees of FDR control. To overcome this problem, we note that if the criterion with which discoveries will be filtered can be specified in advance, it is possible to modify the Benjamini-Hochberg procedure to result in a focused set of discoveries with FDR guarantees. Our framework allows not only subsetting of discoveries, but also their prioritization with weights reflecting the extent to which they provide insight into distinct scientific questions. We illustrate our methodology with examples from gene set enrichment on the Gene Ontology, a collection of hypotheses organized in a directed acyclic graph.
The talk will be based on joint work with Eugene Katsevich (Stanford) and Marina Bogomolov (Technion).

The Benjamini-Hochberg (BH) procedure and many of its generalizations are empirically observed to control the false discovery rate (FDR) much beyond the regimes that are known to enjoy provable FDR control. To address this gap, this talk introduces some new results that imply the robustness of the BH procedure and certain related procedures for FDR control. First, we show that FDR control is maintained up to a small multiplicative factor under arbitrary dependence between false null test statistics and independent true null test statistics. The proof technique is based on a new backward submartingale argument. Next, we further extend the FDR control to the case where the null test statistics exhibit certain positive dependence, implying that the null distribution plays an essential role in FDR control. We conclude the talk by introducing a weak version of FDR control for which the BH procedure is robust against any adversarial false null test statistics. Part of this talk is based on joint work with Cynthia Dwork (Harvard) and Li Zhang (Google).
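For reference, the standard BH step-up procedure that this abstract analyses can be written in a few lines. The sketch below is the textbook version only; it implements none of the modified or robust variants discussed in the talks, and the function name and example p-values are illustrative.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BH step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices of sorted p-values
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds               # p_(i) <= alpha * i / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up: find the largest rank i that passes, reject ranks 1..i.
        last = np.nonzero(below)[0].max()
        rejected[order[: last + 1]] = True
    return rejected


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
mask = benjamini_hochberg(p_values, alpha=0.05)
print(mask.sum(), "rejections")
```

Note the step-up behaviour: a p-value that exceeds its own threshold is still rejected whenever a larger p-value passes its threshold further down the sorted list.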

We consider linear regression in the high-dimensional regime where the number of parameters exceeds the number of samples (p greater than n) and assume that the high-dimensional parameter vector is sparse. We develop a framework for testing general hypotheses regarding the model parameters. Our framework encompasses testing whether the parameter lies in a convex cone, testing the signal strength, and testing arbitrary functionals of the parameter. We show that the proposed procedure controls the false positive rate and also analyze the power of the procedure. Our numerical experiments confirm our theoretical findings and demonstrate that we control the false positive rate near the nominal level, and have high power. By duality between hypothesis testing and confidence intervals, the proposed framework can be used to obtain valid confidence intervals for various functionals of the model parameters. For linear functionals, the length of confidence intervals is shown to be minimax rate optimal. This talk is based on joint work with Jason Lee.

Estimating a high-dimensional sparse covariance matrix from a limited number of samples is a fundamental problem in contemporary data analysis. Most proposals to date, however, are not robust to outliers or heavy tails. Towards bridging this gap, we consider estimating a sparse shape matrix from n samples following a possibly heavy tailed elliptical distribution. We propose estimators based on thresholding either Tyler's M-estimator or its regularized variant. We prove that under suitable conditions the regularized variant can be computed efficiently in practical polynomial time. Furthermore, we prove that in the joint limit as dimension p and sample size n tend to infinity with p/n tending to a constant, our proposed estimators are minimax rate optimal. Results on simulated data support our theoretical analysis.

We study polynomial time algorithms for estimating the mean of a d-dimensional random vector X from n independent samples $X_1, \ldots, X_n$ when X may be heavy-tailed. In particular, we assume only that X has finite mean and covariance. In this setting, the radius of the confidence intervals achieved by the empirical mean is large compared to the case where X is Gaussian or sub-Gaussian. We offer the first polynomial time algorithm to estimate the mean with sub-Gaussian-style confidence intervals even when X has only finite second moments. Our algorithm is based on a new semidefinite programming relaxation of a high-dimensional median. Previous estimators which assumed only existence of O(1) moments of X either sacrifice sub-Gaussian performance or are only known to be computable via brute-force search procedures requiring exp(d) time.

Robust estimation under Huber's $\epsilon$-contamination model has become an important topic in statistics and theoretical computer science. Rate-optimal procedures such as Tukey's median and other estimators based on statistical depth functions are impractical because of their computational intractability. In this talk, I will discuss an intriguing connection between f-GANs and various depth functions through the lens of f-Learning. Similar to the derivation of f-GAN, I will show that these depth functions that lead to rate-optimal robust estimators can all be viewed as variational lower bounds of the total variation distance in the framework of f-Learning. This connection opens the door to computing robust estimators using tools developed for training GANs. In particular, I will show that a JS-GAN that uses a neural network discriminator with at least one hidden layer is able to achieve the minimax rate of robust mean and covariance matrix estimation under Huber's $\epsilon$-contamination model. Interestingly, the hidden layers of the neural net structure in the discriminator class are shown to be necessary for robust estimation.

Optimizing a non-convex function is hard in general. Existing analysis for non-convex optimization then relies on problem specific structures that may not be robust to adversarial perturbations. In this talk, we will see two scenarios where the standard non-convex optimization techniques are not robust, and show how to modify the algorithms to handle adversarial perturbations. The first scenario considers the matrix completion problem against a semi-random adversary that can reveal more entries of the matrix. Although this weak adversary is harmless to convex relaxations, we show that it can ruin non-convex approaches, and give a fast algorithm to fix this problem. The second scenario considers the more general setting where one does not have access to the exact function to optimize, where we show that it is still possible to find an approximate local optimal solution. Based on joint works with Yu Cheng, Chi Jin, Lydia Liu, Michael I. Jordan.

Probabilistic models are widely used in statistical inference, and for understanding the computational tractability of several unsupervised learning tasks. However, a common criticism of most existing learning algorithms is that their guarantees strongly rely on the unrealistic assumption that the instance is generated exactly from the model. Semi-random models provide a framework to reason about robustness of algorithms to modeling errors by incorporating both adversarial and random choices in instance generation. In this talk, I will describe new semi-random models for two different problems in unsupervised learning: dictionary learning, and clustering mixtures of Gaussians. In dictionary learning the task is to learn a hidden incoherent basis/dictionary A given data Y=AX that represents unknown sparse linear combinations (corresponding to columns of X) of the basis elements. Existing algorithms learn A and X efficiently under strong distributional assumptions about both the supports of the columns of X and the values of these non-zero entries. In the first part of the talk, I will describe a more general semi-random model for dictionary learning aimed at capturing almost arbitrary distributions for the sparse supports, and give a new polynomial time algorithm for learning incoherent over-complete dictionaries under the semi-random model. Finally (time permitting), I will describe a natural semi-random model for k-means clustering that generalizes mixtures of k Gaussians, and demonstrate how Lloyd's algorithm successfully recovers the ground-truth clustering up to near optimal accuracy. Based on joint works with Pranjal Awasthi.

Modern applications increasingly involve high-dimensional and heterogeneous data, e.g., datasets formed by combining numerous measurements from myriad sources. Principal Component Analysis (PCA) is a classical method for reducing dimensionality by projecting data onto a low-dimensional subspace capturing most of their variation, but it does not robustly recover underlying subspaces in the presence of heteroscedastic noise. Specifically, PCA suffers from treating all data samples as if they are equally informative. We will discuss the consequences of this on performance, which lead us naturally to consider weighting PCA in such a way that we give less influence to samples with larger noise variance. Doing so better recovers underlying principal components, but precisely how to choose the weights turns out to be an interesting problem. Surprisingly, we show that whitening the noise by using inverse noise variance is sub-optimal. Our analysis provides expressions for the asymptotic recovery of underlying low-dimensional components from samples with heteroscedastic noise in the high-dimensional regime. We derive optimal weights and characterize the performance of optimally weighted PCA. This is work in collaboration with David Hong and Jeff Fessler.

We study transformations of shapes into representations that allow for analysis using standard statistical tools. The transformations are based on Euler integration and are of interest for their mathematical properties as well as their applications to science and engineering, because they provide a way of summarizing shapes in a topological, yet quantitative, way. By using an inversion theorem, we show that both transforms are injective on the space of shapes---each shape has a unique transform. By making use of a stratified space structure on the sphere, induced by hyperplane divisions, we prove additional uniqueness results in terms of distributions on the space of Euler curves. The main theoretical result provides the first (to our knowledge) finite bound required to specify any shape in certain uncountable families of shapes, bounded below by curvature. This result is perhaps best appreciated in terms of shattering number or the perspective that any point in these particular moduli spaces of shapes is indexed using a tree of finite depth. We also show how these transformations can be used in practice for medical imaging applications as well as for evolutionary morphology questions.

We present a new method for high-dimensional linear regression when the scale parameter of the error is unknown. The proposed estimator is based on a penalized Huber M-estimator, for which theoretical results on estimation error have recently been established in the high-dimensional statistics literature. However, the variance of the error term in the linear model is intricately connected to the parameter governing the shape of the Huber loss. The main idea is to use an adaptive technique, based on Lepski's method, to overcome the difficulties in solving a joint nonconvex optimization problem with respect to the location and scale parameters.
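For readers unfamiliar with the Huber loss, the sketch below shows the standard definition and a penalized objective built on it. The names `huber_loss` and `penalized_huber_objective`, the threshold symbol `tau`, and the choice of an l1 penalty are assumptions for illustration; the abstract does not specify the penalty, and the Lepski-based adaptive choice of the shape parameter is not reproduced here.

```python
import numpy as np

def huber_loss(r, tau):
    """Standard Huber loss: quadratic for |r| <= tau, linear beyond."""
    r = np.asarray(r, dtype=float)
    abs_r = np.abs(r)
    return np.where(abs_r <= tau, 0.5 * r**2, tau * (abs_r - 0.5 * tau))

def penalized_huber_objective(beta, X, y, tau, lam):
    """Mean Huber loss of the residuals plus an l1 penalty (assumed penalty)."""
    residuals = y - X @ beta
    return huber_loss(residuals, tau).mean() + lam * np.abs(beta).sum()
```

The threshold `tau` controls where the loss switches from quadratic to linear, which is why a sensible value depends on the (unknown) noise scale.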

k-means and k-medians under Dimension Reduction, Yury Makarychev (Toyota Technological Institute at Chicago)

Consider an instance of Euclidean k-means or k-medians clustering. We prove that a dimension reduction projection onto a space of dimension d ~ log k preserves the cost of the optimal clustering within a factor of 1 + epsilon w.h.p. Crucially, the dimension d does not depend on the total number of points n in the instance. The result also applies to other variants of the k-clustering problem. The result strengthens the result by Cohen, Elder, Musco, Musco, and Persu, who showed that the value of k-means is approximately preserved when d ~ k. No bounds on d were previously known for k-medians. Joint work with Konstantin Makarychev and Ilya Razenshteyn.
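As a rough illustration of the setting, the sketch below applies a random Gaussian projection and compares the k-means cost of a fixed partition before and after. This is only a sanity check under stated assumptions: the helper names `kmeans_cost` and `gaussian_projection` are hypothetical, the Gaussian map is one common choice of dimension-reduction projection, and checking one fixed partition is weaker than the talk's guarantee about the optimal clustering.

```python
import numpy as np

def kmeans_cost(X, labels):
    """Sum of squared distances from each point to its cluster mean."""
    cost = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        cost += ((pts - pts.mean(axis=0)) ** 2).sum()
    return cost

def gaussian_projection(X, target_dim, rng):
    """Project the rows of X to target_dim dimensions with a scaled Gaussian matrix."""
    d = X.shape[1]
    G = rng.standard_normal((d, target_dim)) / np.sqrt(target_dim)
    return X @ G

# Hypothetical usage: 4 well-separated clusters in 100 dimensions, projected to 40.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(4), 50)
centers = 5.0 * rng.standard_normal((4, 100))
X = centers[labels] + rng.standard_normal((200, 100))
ratio = kmeans_cost(gaussian_projection(X, 40, rng), labels) / kmeans_cost(X, labels)
```

Note that the target dimension here is chosen by hand; the point of the result is that it needs to grow only with log k, not with the number of points n.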

Symmetric properties of distributions arise in multiple settings. For each of these, separate estimators and analysis techniques have been developed. Recently, Orlitsky et al. showed that a single estimator that maximizes profile maximum likelihood (PML) is sample-competitive for all symmetric properties. Further, they showed that even a 2^{n^{1-delta}}-approximate maximizer of the PML objective can serve as such a universal plug-in estimator (here n is the size of the sample). Unfortunately, no polynomial-time computable PML estimator with such an approximation guarantee was known. We provide the first such estimator and show how to compute it in time nearly linear in n. We also present some preliminary experimental results. Joint work with Kiran Shiragur and Aaron Sidford.
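The "profile" that PML works with is simple to compute: it records how many distinct symbols appear once, twice, and so on, and forgets which symbols they are. The short sketch below only computes the profile (the function name `profile` is an assumption); it does not attempt the approximate PML maximization described in the talk.

```python
from collections import Counter

def profile(sample):
    """Map each multiplicity to the number of distinct symbols with that multiplicity."""
    multiplicities = Counter(sample)         # symbol -> number of occurrences
    return Counter(multiplicities.values())  # occurrences -> number of symbols

# Relabeling the symbols leaves the profile unchanged.
same = profile("aab") == profile("bba")
```

Because a symmetric property does not depend on symbol labels, the profile is a natural statistic on which to base an estimator for it.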

Realizing Robustness, Gautam Kamath (Massachusetts Institute of Technology)

Over the last few years, there has been significant theoretical work in robust high-dimensional statistical estimation. These results seem aligned with modern goals in practical machine learning, where high-dimensional data is ubiquitous, and robustness and security are paramount. This raises the question: are these advances purely theoretical, or can we reap their benefits in the real world? In this talk, I will describe some first steps towards answering this question positively, including evidence that these theoretical advances may be realizable. I will discuss applications to exploratory data analysis and robust stochastic optimization on synthetic and real-world datasets.
Based on joint works with Ilias Diakonikolas, Daniel Kane, Jerry Li, Ankur Moitra, Jacob Steinhardt, and Alistair Stewart. Papers available at https://arxiv.org/abs/1703.00893 and https://arxiv.org/abs/1803.02815.
