## Tuesday, December 11, 2012

### #NIPS2012 Workshop presentations

Yesterday, I pointed to two blog entries by Suresh (NIPS ruminations I) and Hal (NIPS stuff...) with some very thoughtful views on NIPS2012, which just took place in Lake Tahoe. Go read them, I'll wait. Here is a list of NIPS2012 workshop presentations that had PDFs and are of interest to issues mentioned here on Nuit Blanche (all the abstracts of the conferences are here). The main lectures in most of these workshops are not available, but they will surely show up in some shape or form when the videos of the NIPS conference come out. Stay tuned. In the meantime, enjoy:

Other papers included:
• T. Jung and D. Ernst, Biorthogonalization Techniques for Least Squares Temporal Difference Learning
Abstract
We consider Markov reward processes and study OLS-LSTD, a framework for selecting basis functions from a set of candidates to obtain a sparse representation of the value function in the context of least squares temporal difference learning. To support both efficient updating and downdating operations, OLS-LSTD uses a biorthogonal representation for the selected basis vectors. Empirical comparisons with the recently proposed MP and LARS frameworks for LSTD are made.
• M. Yaghoobi and M. Davies, Relaxed Analysis Operator Learning
Abstract
The problem of analysis operator learning can be formulated as a constrained optimisation problem. This problem has been approximately solved using projected gradient or geometric gradient descent methods. We will propose a relaxation for the constrained analysis operator learning in this paper. The relaxation has been suggested here to a) reduce the computational complexity of the optimisation and b) include a larger set of admissible operators. We will show here that an appropriate relaxation can be useful in presenting a projection-free optimisation algorithm, while preventing the problem from becoming ill-posed. The relaxed optimisation objective is not convex and it is thus not always possible to find the global optimum. However, when a rich set of training samples is given, we empirically show that the desired synthetic analysis operator is recoverable, using the introduced sub-gradient descent algorithm.
• Y. Chen, T. Pock, and H. Bischof
Abstract
We consider the analysis operator and synthesis dictionary learning problems based on the $\ell_1$ regularized sparse representation model. We first reveal the internal relations between analysis-prior and synthesis-prior based models, and then introduce an approach to learn both analysis operator and synthesis dictionary simultaneously by using a unified framework of bi-level optimization. Our aim is to learn a meaningful operator (dictionary) such that the minimum energy solution of the analysis (synthesis)-prior based model is as close as possible to the ground-truth. Moreover, we demonstrate the effectiveness of our learning model by applying the learned analysis operator (dictionary) to an image denoising task and comparing the performance with state-of-the-art methods. Under this unified framework, we can compare the performance of the two types of priors.

• B. Mailhé and M. Plumbley, Large Step Gradient "Descent" for Dictionary Learning
Abstract
This work presents a new algorithm for dictionary learning. Existing algorithms such as MOD and K-SVD often fail to find the best dictionary because they get trapped in a local minimum. Olshausen and Field’s Sparsenet algorithm relies on a fixed step projected gradient descent. With the right step, it can avoid local minima and converge towards the global minimum. The problem then becomes to find the right step size. In this work we provide the expression of the optimal step for the gradient descent, but the step we use for the descent is twice as large as the optimal step. That large step allows the descent to bypass local minima and yields significantly better results than existing algorithms. The algorithms are compared on synthetic data. Our method outperforms existing algorithms both in approximation quality and in perfect recovery rate if an oracle support for the sparse representation is provided.
• P. Sprechmann, A. Bronstein, and G. Sapiro, Learning Robust Low-Rank Representations
Abstract
In this paper we present a comprehensive framework for learning robust low-rank representations by combining and extending recent ideas for learning fast sparse coding regressors with structured non-convex optimization techniques. This approach connects robust principal component analysis (RPCA) with dictionary learning methods and allows its approximation via trainable encoders. We propose an efficient feed-forward architecture derived from an optimization algorithm designed to exactly solve robust low dimensional projections. This architecture, in combination with different training objective functions, allows the regressors to be used as online approximants of the exact offline RPCA problem or as RPCA-based neural networks. Simple modifications of these encoders can handle challenging extensions, such as the inclusion of geometric data transformations. We present several examples with real data from audio and video processing. When used to approximate RPCA, our basic implementation shows several orders of magnitude speedup compared to the exact solvers with almost no performance degradation. We show the strength of the inclusion of learning into the RPCA approach on a music source separation application, where the encoders outperform the exact RPCA algorithms, which are already reported to produce state-of-the-art results on a benchmark database. Video applications are demonstrated as well. Our preliminary implementation on an iPad shows faster-than-real-time performance with minimal latency.
• M. Lopes, Estimating Unknown Sparsity in Compressed Sensing
Abstract
Within the framework of compressed sensing, the sparsity of the unknown signal $x\in\mathbb{R}^p$ determines how many measurements $n$ are needed for reliable recovery, e.g. $n\gtrsim \|x\|_0 \log(p/\|x\|_0)$. However, when $\|x\|_0$ is unknown, the choice of $n$ is problematic. In recent work [1], we have shown that estimating $\|x\|_0$ from linear measurements is generally intractable. This has led us to consider an alternative measure of sparsity $s(x):=\|x\|_1^2/\|x\|_2^2$, which is more amenable to estimation, and is a sharp lower bound on $\|x\|_0$. We show that $s(x)$ is an effective surrogate for many reasons. First, $s(x)$ is a sensible measure of sparsity even when all coordinates of $x$ are non-zero. Second, we prove that for general signals $x$, the condition $n\gtrsim s(x)\log(p/s(x))$ is necessary for reliable recovery, and the condition $n\gtrsim s(x)\log(p/s(x))\log p$ is sufficient. Third, we prove that when $x$ is a non-negative signal, $s(x)$ can be estimated from $O(1)$ measurements. In particular, we propose a simple estimator $\hat{s}(x)$, and derive dimension-free concentration bounds that imply strong ratio-consistency as $(n,p)\to\infty$. Lastly, we confirm with simulations that a reliable data-driven choice of $n$ can be derived from $\hat{s}(x)$.
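
To make the sparsity measure in the abstract above concrete, here is a minimal sketch that computes $s(x)=\|x\|_1^2/\|x\|_2^2$ directly from a known vector and compares it with $\|x\|_0$. The function names and the two example vectors are my own illustrations, and this is not the estimator $\hat{s}(x)$ from the paper, which works from linear measurements rather than from $x$ itself.

```python
def l0_norm(x):
    """Number of non-zero coordinates of x."""
    return sum(1 for v in x if v != 0)


def sparsity_s(x):
    """s(x) = ||x||_1^2 / ||x||_2^2, defined for any non-zero vector x."""
    l1 = sum(abs(v) for v in x)
    l2_squared = sum(v * v for v in x)
    if l2_squared == 0:
        raise ValueError("s(x) is undefined for the zero vector")
    return l1 * l1 / l2_squared


# Hypothetical example vectors (not from the paper).
flat = [1.0, 1.0, 1.0, 0.0, 0.0]      # three non-zeros of equal magnitude
spiky = [10.0, 0.1, 0.1, 0.1, 0.1]    # no zeros, but dominated by one entry

for name, x in [("flat", flat), ("spiky", spiky)]:
    print(name, "l0 =", l0_norm(x), "s(x) =", round(sparsity_s(x), 3))
```

By the Cauchy-Schwarz inequality, $s(x)\le\|x\|_0$, with equality when all non-zero entries have the same magnitude (the `flat` vector). The `spiky` vector has no zero coordinate, yet $s(x)$ stays close to 1, which is the sense in which the measure remains meaningful for signals that are not exactly sparse.
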
• R. Mehrotra, D. Chu, S.A. Haider, and I.A. Kakadiaris, It takes two to tango: Coupled Dictionary Learning for Cross Lingual Information Retrieval
Abstract
Automatic text understanding has been an unsolved research problem for many years. This partially results from the dynamic and diverging nature of human languages, which ultimately results in many different varieties of natural language. These variations range from the individual level, to regional and social dialects, and up to seemingly separate languages and language families. However, in recent years there have been considerable achievements in data driven approaches to computational linguistics exploiting the redundancy in the encoded information and the structures used. Those approaches are mostly not language specific or can even exploit redundancies across languages. Representing documents by vectors that are independent of languages enhances the performance of cross-lingual tasks such as *comparable document retrieval* and *mate retrieval*.
In this paper, we explore the use of dictionary-based approaches to solve the task of cross-lingual information retrieval. We propose a new dictionary learning algorithm for learning a pair of coupled dictionaries representing basis atoms in a pair of languages, alongside learning two mapping functions which help in transforming representations learnt in one language to the other. Such transformations are necessary for the task of finding similar documents in a different language and hence find immense application in various cross-lingual information retrieval tasks. We present an optimization procedure that iterates between two objectives and uses the K-SVD formulation to efficiently compute the parameters involved. We evaluate our algorithm on the task of cross-lingual comparable document retrieval and compare our results with existing approaches; the results highlight the efficacy of our method.

• C. Gao and B. Engelhardt, A Sparse Factor Analysis Model for High Dimensional Latent Spaces
Abstract
Inducing sparsity in factor analysis has become increasingly important as applications have arisen that are best modeled by a high dimensional, sparse latent space, and the interpretability of this latent space is critical. Applying latent factor models with a high dimensional latent space but without sparsity yields nonsense factors that may be artifactual and are prone to overfitting the data. In the Bayesian context, a number of sparsity-inducing priors have been proposed, but none that specifically address the context of a high dimensional latent space. Here we describe a Bayesian sparse factor analysis model that uses a general three parameter beta prior, which, given specific settings of hyperparameters, can recapitulate sparsity inducing priors with appropriate modeling assumptions and computational properties. We apply the model to simulated and real gene expression data sets to illustrate the model properties and to identify large numbers of sparse, possibly correlated factors in this space.
• M. Seibert, S. Hawe, and M. Kleinsteuber, Riemannian Optimization for Analysis Operator and Dictionary Learning
Abstract
In this work, we propose a learning technique that can be applied to both dictionary and analysis operator learning scenarios. We employ geometric optimization methods on product of spheres that enable us to update either the dictionary or the analysis operator as a whole. On one hand, this approach offers more flexibility than using a hard restriction to tight frames by regularizing the coherence with a penalty term. On the other hand, it allows to impose certain structural constraints on the learned dictionary/analysis operator that lead to improved numerical properties in applications.
• Roman Marchant and Fabio Ramos.
• Bayesian Optimisation for Intelligent Environmental Monitoring. [.pdf]
• Remi Bardenet, Mátyás Brendel, Balazs Kegl and Michele Sebag.
• SCOT: surrogate-based collaborative tuning for hyperparameter learning that remembers the past. [.pdf]
• R. Girdziusas, J. Janusevskis, and R. Le Riche.
• On Integration of Multi-Point Improvements [.pdf]
• R. Le Riche, R. Girdziusas, and J. Janusevskis.
• A Study of Asynchronous Budgeted Optimization. [.pdf]
• Xuezhi Wang, Roman Garnett, and Jeff Schneider.
• An Impact Criterion for Active Graph Search. [.pdf]
• Clement Chevalier and David Ginsbourger.
• Fast computation of the multipoint Expected Improvement with applications in batch selection. [.pdf]
• Ali Jalali, Javad Azimi and Xiaoli Fern.
• Exploration vs Exploitation in Bayesian Optimization. [.pdf]
• Nadjib Lazaar, Youssef Hamadi, Said Jabbour, and Michele Sebag.
• Cooperation control in Parallel SAT Solving: a Multi-armed Bandit Approach. [.pdf]
• Christopher Amato and Emma Brunskill.
• Diagnose and Decide: An Optimal Bayesian Approach. [.pdf]
• Matthew Tesch, Jeff Schneider and Howie Choset.
• Expensive Multiobjective Optimization and Validation with a Robotics Application. [.pdf]
• Zheng Wen, Branislav Kveton, and Sandilya Bhamidipati
• Learning to Discover: A Bayesian Approach. [.pdf]

Big Data Meets Computer Vision:
First International Workshop on Large Scale Visual Recognition and Retrieval

• A k-NN Approach for Scalable Image Annotation Using General Web Data. Mauricio Villegas (Universidad Politecnica de Val), Roberto Paredes (Universidad Politecnica de Valencia)
• Abstract: This paper presents a simple k-NN based image annotation method that relies only on automatically gathered Web data. It can easily change or scale the list of concepts for annotation, without requiring labeled training samples for the new concepts. In terms of MAP the performance is better than the results from the ImageCLEF 2012 Scalable Web Image Annotation Task on the same dataset. Although, in terms of F-measure they are equivalent, suggesting that a better method for choosing how many concepts to select per image is required. Large-scale issues are considered by means of linear hashing techniques. The use of dictionary definitions has been observed to be a useful resource for image annotation without manually labeled training data.
• URL to the latest version: http://mvillegas.info/pub/Villegas12_BIGVIS_kNN-Annotation.pdf
• Adaptive representations of scenes based on ICA mixture model. Wooyoung Lee (Carnegie Mellon University), Michael Lewicki (Case Western Reserve University)
• Abstract: To develop an adaptive representation based on rich statistical distributions of very large databases of scene images, we train a mixture model based on independent component analysis for full color scene images. The learned features of the model result in improved scene category classification performance when compared with previous methods. Furthermore, the unsupervised classification of scene images performed by the model suggests that perceptual categories of scene images are to some extent based on the statistics of natural scenes. Our results show that features tailored for subgroups of data can be beneficial for more efficient representation for a large number of images.
• Workshop Paper: pdf
• Aggregating descriptors with local Gaussian metrics. Hideki Nakayama (The University of Tokyo)
• Abstract: Recently, large-scale image classification has made remarkable progress because of the significant advancement in the representation of image features. To realize scalable systems that can handle millions of training samples and tens of thousands of categories, it is crucially important to develop discriminative image signatures that are compatible with linear classifiers. One of the promising approaches to realize this is to encode high-level statistics of local features. Many state-of-the-art large-scale systems are following this approach and have made remarkable progress over the past few years. However, while first-order statistics are frequently used in many methods, the power of higher-order statistics has not received much attention. In this work, we propose an efficient method to exploit the second-order statistics of local features. For each visual word, the local features of training samples are modeled with a Gaussian, and descriptors from two images are compared using a Fisher vector with respect to the Gaussian. In experiments, we show the promising performance of our method.
• Workshop Paper: pdf
• Beyond Classification -- Large-scale Gaussian Process Inference and Uncertainty Prediction. Alexander Freytag (Friedrich Schiller University), Erik Rodner (UC Berkeley, University of Jena), Paul Bodesheim (Computer Vision Group, University of Jena), Joachim Denzler (Computer Vision Group, University of Jena)
• Abstract: Due to the massive (labeled) data available on the web, a tremendous interest in large-scale machine learning methods has emerged in the last years. Whereas, most of the work done in this new area of research focused on fast and efficient classification algorithms, we show in this paper how other aspects of learning can also be covered using massive datasets. The paper briefly presents techniques allowing for utilizing the full posterior obtained from Gaussian process regression (predictive mean and variance) with tens of thousands of data points and without relying on sparse approximation approaches. Experiments are done for active learning and one-class classification showing the benefits in large-scale settings.
• Workshop Paper: pdf
• Classifier-as-a-Service: Online Query of Cascades and Operating Points. Brandyn White (University of Maryland: Colleg), Andrew Miller (University of Central Florida), Larry Davis (University of Maryland: College Park)
• Abstract: We introduce a classifier and parameter selection algorithm for Classifier-as-a-Service applications where there are many components (e.g., features, kernels, classifiers) available to construct classification algorithms. Queries specify varying requirements (i.e., quality and execution time), some of which may require combining classification algorithms to satisfy; each query may have a different set of quality and execution time requirements (e.g., fast and precise, slow and thorough) and the set of images to which the classifier is to be applied may be small (e.g., even a single image), necessitating a query resolution method that takes negligible time in comparison. When operating on large datasets, meeting design requirements automatically becomes essential to reducing costs associated with unnecessary computation and expert assistance. As queries specify requirements and not implementation details, additional components can be utilized naturally. Our query resolution method combines classifiers with complementary operating points (e.g., high recall algorithmic filter, followed by high precision human verification) in a rejection-chain configuration. Experiments are conducted on the SUN397 [1] dataset; we achieve state-of-the-art classification results and 1 ms query resolution times.
• URL to the latest version: http://bw-school.s3.amazonaws.com/nips2012-bigvision-classifier-as-a-service.pdf
• Creating a Big Data Resource from the Faces of Wikipedia. Md. Kamrul Hasan (Ecole Polytechnique Montreal), Christopher Pal (Ecole Polytechnique de Montreal)
• Abstract: We present the Faces of Wikipedia data set in which we have used Wikipedia to create a large database of identities and faces. To automatically extract faces for over 50,000 identities we have developed a state of the art face extraction pipeline and a novel facial co-reference technique. Our approach is based on graphical models and uses the text of Wikipedia pages, face attributes and similarities, as well as clues from various other sources. Our method resolves the name-face association problem jointly for all detected faces on a Wikipedia page. We provide this dataset to the community for further research in various forms including: manually labeled faces, automatically labeled faces using our co-reference technique, raw and processed faces as well as text and meta data features for further evaluations of extraction and co-reference methods.
• Large-scale image classification with lifted coordinate descent. Zaid Harchaoui (INRIA), Matthijs Douze (INRIA), Mattis Paulin (INRIA), Miro Dudik (Microsoft Research), Jerome Malick (CNRS)
• Abstract: With the advent of larger image classification datasets such as ImageNet, designing scalable and efficient multi-class classification algorithms is now an important challenge. We introduce a new scalable learning algorithm for large-scale multi-class image classification, using the trace-norm-type regularization penalties. Reframing the challenging non-smooth optimization problem into a surrogate infinite-dimensional optimization problem with a regular $\ell_1$-regularization penalty, we propose a simple and provably efficient "lifted" coordinate descent algorithm. Furthermore, we show how to perform efficient matrix computations in the compressed domain for quantized dense visual features, scaling up to 100,000s of examples, 1,000s-dimensional features, and 100s of categories. Promising experimental results on the subsets of ImageNet are presented.
• Workshop Paper: pdf
• Learning from Incomplete Image Tags. Minmin Chen (Washington University), Kilian Weinberger, Alice Zheng
• Abstract: Obtaining high-quality training labels for learning can be an onerous task. In this paper, we look at the task of automatic image annotation, trained with only partial supervision. We propose MARCO, a novel algorithm that learns to predict the complete tag set of an image with the help of an auxiliary task that recovers the semantic relationship between tags. We formulate this as a convex programming problem and present an efficient optimization routine that iterates between two closed-form solution steps. We demonstrate on two real datasets that our approach outperforms all competitors, especially with very sparsely labeled training images.
• URL to the latest version: www.cse.wustl.edu/~mchen/papers/msdajoint.pdf
• Loss-Specific Learning of Complex Hash Functions. Mohammad Norouzi (University of Toronto), David Fleet (University of Toronto), Ruslan Salakhutdinov (University of Toronto)
• Abstract: Motivated by large-scale multimedia applications we propose a framework for learning mappings from high-dimensional data to binary codes, while preserving semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit fast exact kNN search. The framework is applicable to broad families of mappings, and two flexible classes of loss function. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss. Experiments show strong retrieval and classification results using no more than kNN on the binary codes.
• Workshop Paper: pdf
• Overcoming Dataset Bias: An Unsupervised Domain Adaptation Approach. Boqing Gong (U. of Southern California), Fei Sha (University of Southern California), Kristen Grauman (University of Texas at Austin)
• Abstract: Recent studies have shown that recognition datasets are biased. Paying no heed to those biases, learning algorithms often result in classifiers with poor cross-dataset generalization. We are developing domain adaptation techniques to overcome those biases and yield classifiers with significantly improved performance when generalized to new testing datasets. Our work enables us to continue to harvest the benefits of existing vision datasets for the time being. Moreover, it also sheds insights about how to construct new ones. In particular, we have raised the bar of collecting data --- the most informative data are those which cannot be classified well by learning algorithms adapting from existing datasets.
• Workshop Paper: pdf
• Picture Tags and World Knowledge. Lexing Xie (Australian National University)
• Abstract: This paper studies the use of everyday words to describe images. The common saying has it that *a picture is worth a thousand words*; here we ask *which thousand*? We propose a new method to exploit visual semantic structure by jointly analyzing three distinct resources: Flickr, ImageNet/WordNet, and ConceptNet. This allows us to quantify the visual relevance of both tags and their relationships, which in turn leads to an algorithm for image annotation that takes into account both image and tag features. We analyze over 5 million semantically tagged photos; their statistics allow us to observe tag utility and meanings. We have also obtained good results for image tagging, including generalizing to unseen tags. We believe leveraging real-world knowledge is a very promising direction for image retrieval. Potential other applications include generating natural language descriptions of pictures, and validating the quality of commonsense knowledge.
• URL to the latest version: http://users.cecs.anu.edu.au/~xlx/proj/tagnet/
• Randomly Multi-view Clustering for Hashing. Caiming Xiong (SUNY at Buffalo), Jason Corso (SUNY at Buffalo)
• Abstract: This paper addresses the problem of efficiently learning similarity-preserving binary codes for fast retrieval in large-scale data collections. We propose a simple and efficient randomly multi-view clustering scheme for finding hash functions so as to decrease the Hamming distance of relatively close points: we first use PCA to reduce the dimensionality of the data points and obtain compact representations, then multiply the new representation by a Hadamard matrix so as to equalize the variance of each dimension; second, we find the $l$-bit binary code for all data points via randomly multi-view clustering, which extracts $l$ different views of the data distribution by randomly choosing $k$ dimensions, then, for each view, we obtain 1 bit by partitioning the data points into two clusters; finally, we obtain $l$ classifiers using a max-margin SVM, with the clustering result of each view as training data, to predict the binary code for any query point. Our experiments show that our binary coding scheme results in better performance than several other state-of-the-art methods.
• Workshop Paper: pdf
• Semantic Kernel Forests from Multiple Taxonomies. Sung Ju Hwang (University of Texas, Austin), Fei Sha (University of Southern California), Kristen Grauman (University of Texas at Austin)
• Abstract: We propose a discriminative feature learning approach that leverages multiple hierarchical taxonomies representing different semantic views. For each taxonomy, we first learn a tree of semantic kernels, where each node has a Mahalanobis kernel optimized to distinguish between the classes in its children nodes. Then, using the resulting semantic kernel forest, we learn class-specific kernel combinations to select only those kernels relevant for category recognition, with a novel hierarchical regularizer that exploits the taxonomies’ structure. We demonstrate our method on challenging object recognition datasets.
• Workshop Paper: pdf
• Visually-Grounded Bayesian Word Learning. Yangqing Jia (UC Berkeley), Joshua Abbott (UC Berkeley), Joseph Austerweil (UC Berkeley), Thomas Griffiths (UC Berkeley), Trevor Darrell (UC Berkeley)
• Abstract: Learning the meaning of a novel noun from a few labelled objects is one of the simplest aspects of learning a language, but approximating human performance on this task is still a significant challenge. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for given stimulus. Recent work in cognitive science on Bayesian word learning partially addresses this challenge, but assumes that objects are perfectly recognized and has only been evaluated in small domains. We present a system for learning words directly from images, using probabilistic predictions generated by visual classifiers as the input to Bayesian word learning, and compare this system to human performance in a large-scale automated experiment. Combining the uncertain outputs of the visual classifiers with the ability to identify an appropriate level of abstraction that comes from Bayesian word learning allows the system to better capture the human word learning behaviors than previous approaches.
• URL to the latest version: http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-202.pdf
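
Several of the abstracts above (Loss-Specific Learning of Complex Hash Functions, Randomly Multi-view Clustering for Hashing) rely on the fact that binary codes permit fast exact kNN search. As a minimal sketch of why that is: once each item is a short bit string, the distance is one XOR plus a bit count. The codes below are hand-written placeholders, not learned ones, and `hamming` and `knn` are illustrative names of my own; learning good codes is the actual contribution of those papers and is not shown here.

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as Python ints."""
    return bin(a ^ b).count("1")


def knn(query, codes, k):
    """Indices of the k codes nearest to query, by exact linear scan.

    Ties in distance are broken by the lower index.
    """
    order = sorted(range(len(codes)),
                   key=lambda i: (hamming(query, codes[i]), i))
    return order[:k]


# Hypothetical 8-bit codes for a tiny database of four items.
codes = [0b10110010, 0b10110011, 0b01001101, 0b11110000]
query = 0b10110000

for i, code in enumerate(codes):
    print(i, format(code, "08b"), "distance:", hamming(query, code))
print("2 nearest neighbours:", knn(query, codes, 2))
```

A linear scan like this stays exact while touching only a few bytes per item, which is what makes it practical at large scale; faster sub-linear schemes exist but are outside the scope of this sketch.
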

Algebraic Topology and Machine Learning

1. The Generative Simplicial Complex to extract Betti numbers from unlabeled data - Maxime Maillot, Michael Aupetit and Gerard Govaert
2. Towards Multi-scale Heat Kernel Signatures for Point Cloud Models of Engineering Artifacts - Reed M. Williams and Horea T. Ilies
3. Hyekyung Lee, Matthew Arnold
4. Topological Analysis of Recurrent Systems - Vin De Silva, Primoz Skraba and Mikael Vejdemo-Johansson
5. Weighted Functional Brain Network Modeling via Network Filtration - Hyekyoung Lee, Hyejin Kang, Moo K. Chung, Bung-Nyun Kim and Dong Soo Lee
6. Topological Constraints and Kernel-Based Density Estimation - Florian T. Pokorny, Carl Henrik Ek,  Hedvig Kjellström, Danica Kragic
7. The topology of politics: voting connectivity in the US House of Representatives - Pek Yee Lum, Alan Lehmann, Gurjeet Singh, Tigran Ishkhanov, Gunnar Carlsson, Mikael Vejdemo-Johansson
8. Parallel & scalable zig-zag persistent homology - Primoz Skraba and Mikael Vejdemo-Johansson
9. A new metric on the manifold of kernel matrices - Suvrit Sra
10. Persistent homology for natural language processing - Jerry Zhu
11. Persistent homology of collaboration networks - Jacobien Carstens and Kathy Horadam

Probabilistic Numerics

• Algorithmic inference approach to learn copulas - Bruno Apolloni & Simone Bassis [download pdf]
• Achieving optimization invariance w.r.t. monotonous transformations of the objective function and orthogonal transformations of the representation - Ilya Loshchilov, Marc Schoenauer, & Michèle Sebag [download pdf]

Personalizing Education With Machine Learning

Poster

poster session

Posters

Paper Title: Tensor Analyzers [pdf]
Author Names: Yichuan Tang*, University of Toronto; Ruslan Salakhutdinov, University of Toronto; Geoffrey Hinton, University of Toronto
Abstract: Factor Analysis is a statistical method that seeks to explain linear variations in data by using unobserved latent variables. Due to its additive nature, it is not suitable for modeling data that is generated by multiple groups of latent factors which interact multiplicatively. In this paper, we introduce Tensor Analyzers which are a multilinear generalization of Factor Analyzers. We describe a fairly efficient way of sampling from the posterior distribution over factor values and we demonstrate that these samples can be used in the EM algorithm for learning interesting mixture models of natural image patches and of images containing a variety of simple shapes that vary in size and color. Tensor Analyzers can also accurately recognize a face under significant pose and illumination variations when given only one previous image of that face. We also show that mixtures of Tensor Analyzers outperform mixtures of Factor Analyzers at modeling natural image patches and artificial data produced using multiplicative interactions.

Paper Title: Temporal Autoencoding Restricted Boltzmann Machine [arxiv]
Author Names: Chris Hausler, Freie Universität Berlin; Alex Susemihl*, Berlin Institute of Technology
Abstract: Much work has been done refining and characterizing the kinds of receptive fields learned by deep learning algorithms. A lot of this work has focused on the development of Gabor-like filters learned when enforcing sparsity constraints on a natural image dataset. Little work however has investigated how these filters might expand to the temporal domain, namely through training on natural movies. Here we investigate exactly this problem in established temporal deep learning algorithms as well as a new learning paradigm suggested here, the Temporal Autoencoding Restricted Boltzmann Machine (TARBM).

Paper Title: Deep Gaussian processes [arxiv]
Author Names: Andreas Damianou*, University of Sheffield; Neil Lawrence, University of Sheffield
Abstract: In this paper we introduce deep Gaussian process (GP) models. Deep GPs are a deep belief network based on Gaussian process mappings. Data is modeled as the output of a multivariate GP. The inputs to that Gaussian process are then governed by another GP. A single layer model is equivalent to a standard GP or the GP latent variable model (GPLVM). We perform inference in the model by approximate variational marginalization. This results in a strict lower bound on the marginal likelihood of the model which we use for model selection (number of layers and nodes per layer). Deep belief networks are typically applied to relatively large data sets using stochastic gradient descent for optimization. Our fully Bayesian treatment allows for the application of deep models even when data is scarce. Model selection by our variational bound shows that a five layer hierarchy is justified even when modelling a digit data set containing only 150 examples.
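
The layered construction described in this abstract (the output of one GP becomes the input of the next) can be sketched by drawing one sample from a two-layer prior. This is only an illustration under my own assumptions: a zero mean, a one-dimensional RBF kernel, a small diagonal jitter for numerical stability, and hypothetical function names. It shows the generative model only, not the variational inference that the paper is about.

```python
import math
import random


def rbf_gram(xs, lengthscale=1.0, jitter=1e-6):
    """RBF kernel matrix on scalar inputs, with jitter added to the diagonal."""
    n = len(xs)
    return [[math.exp(-0.5 * ((xs[i] - xs[j]) / lengthscale) ** 2)
             + (jitter if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def cholesky(a):
    """Lower-triangular factor low such that low * low^T = a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                # The floor guards against tiny negative pivots from rounding.
                low[i][j] = math.sqrt(max(a[i][i] - s, 1e-12))
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_gp(xs, rng, lengthscale=1.0):
    """One draw of zero-mean GP function values at the inputs xs."""
    low = cholesky(rbf_gram(xs, lengthscale))
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    return [sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(len(xs))]


def sample_two_layer_deep_gp(xs, seed=0):
    """Layer 1 maps xs to hidden values; layer 2 is a GP over those values."""
    rng = random.Random(seed)
    hidden = sample_gp(xs, rng)
    output = sample_gp(hidden, rng)
    return hidden, output


xs = [i / 4.0 for i in range(12)]
hidden, output = sample_two_layer_deep_gp(xs)
for x, h, y in zip(xs, hidden, output):
    print(f"x={x:5.2f}  hidden={h:8.4f}  output={y:8.4f}")
```

Stopping after the first `sample_gp` call gives an ordinary single-layer GP draw; stacking more calls in the same way would give deeper hierarchies.
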

Paper Title: Modeling Laminar Recordings from Visual Cortex with Semi-Restricted Boltzmann Machines [pdf]
Author Names: Urs Koster*, UC Berkeley; Jascha Sohl-Dickstein, UC Berkeley; Bruno Olshausen, UC Berkeley
Abstract: The proliferation of high density recording techniques presents us with new challenges for characterizing the statistics of neural activity over populations of many neurons. The Ising model, which is the maximum entropy model for pairwise correlations, has been used to model the instantaneous state of a population of neurons.  This model suffers from two major limitations: 1) Estimation for large models becomes computationally intractable, and 2) it cannot capture higher-order dependencies.  We propose applying a more general maximum entropy model, the semi-restricted Boltzmann machine (sRBM), which extends the Ising model to capture higher order dependencies using hidden units. Estimation of large models is made practical using minimum probability flow, a recently developed parameter estimation method for energy-based models. The partition functions of the models are estimated using annealed importance sampling, which allows for comparing models in terms of likelihood.  Applied to 32-channel polytrode data recorded from cat visual cortex, these higher order models significantly outperform Ising models. In addition, extending the model to spatiotemporal sequences of states allows us to predict spiking based on network history. Our results highlight the importance of modeling higher order interactions across space and time to characterize activity in cortical networks.
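
For readers unfamiliar with the Ising baseline mentioned in the abstract above, here is a minimal sketch of a pairwise model over a handful of binary units. The ±1 state coding, the function names and the toy parameters are assumptions made for illustration and are not taken from the paper. The exact normalization enumerates all 2^n states, which hints at why estimation becomes impractical for large populations.

```python
import itertools
import math

def ising_energy(state, h, J):
    """E(s) = -sum_i h[i]*s[i] - sum_{i<j} J[i][j]*s[i]*s[j], with s[i] in {-1, +1}."""
    n = len(state)
    energy = -sum(h[i] * state[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy -= J[i][j] * state[i] * state[j]
    return energy

def ising_distribution(h, J):
    """Exact probabilities by brute-force enumeration of all 2**n states."""
    n = len(h)
    states = list(itertools.product((-1, 1), repeat=n))
    weights = [math.exp(-ising_energy(s, h, J)) for s in states]
    Z = sum(weights)  # partition function
    return {s: w / Z for s, w in zip(states, weights)}

# Two units with a positive coupling: aligned states are more probable.
dist = ising_distribution([0.0, 0.0], [[0.0, 1.0], [0.0, 0.0]])
```

Only the upper triangle of `J` is read, so each pair of units is counted once.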

Paper Title: A Two-stage Pretraining Algorithm for Deep Boltzmann Machines [pdf]
Author Names: Kyunghyun Cho*, Aalto University; Tapani Raiko, Aalto University; Alexander Ilin, Aalto University; Juha Karhunen, Aalto University
Abstract: A deep Boltzmann machine (DBM) is a recently introduced Markov random field model that has multiple layers of hidden units. It has been shown empirically that it is difficult to train a DBM with approximate maximum-likelihood learning using the stochastic gradient unlike its simpler special case, restricted Boltzmann machines (RBM). In this paper, we propose a novel pretraining algorithm that consists of two stages; obtaining approximate posterior distributions over hidden units from a simpler model and maximizing the variational lower-bound given the fixed hidden posterior distributions. We show empirically that the proposed method overcomes the difficulty in training DBMs from randomly initialized parameters and results in a better, or comparable, generative model when compared to the conventional pretraining algorithm.

Paper Title: Linear-Nonlinear-Poisson Neurons Can Do Inference On Deep Boltzmann Machines [pdf]
Author Names: Louis Shao*, The Ohio State University
Abstract: One conjecture in both deep learning and classical connectionist viewpoint is that the biological brain implements certain kinds of deep networks as its back-end. However, to our knowledge, a detailed correspondence has not yet been set up, which is important if we want to bridge between neuroscience and machine learning. Recent researches emphasized the biological plausibility of Linear-Nonlinear-Poisson (LNP) neuron model. We show that with neurally plausible choices of parameters, the whole neural network is capable of representing any Boltzmann machine and performing a semi-stochastic Bayesian inference algorithm lying between Gibbs sampling and variational inference.

Paper Title: Crosslingual Distributed Representations of Words [pdf]
Author Names: Alexandre Klementiev*, Saarland University; Ivan Titov, Saarland University; Binod Bhattarai, Saarland University
Abstract: Distributed representations of words have proven extremely useful in numerous natural language processing tasks.  Their appeal is that they can help alleviate data sparsity problems common to supervised learning.  Methods for inducing these representations require only unlabeled language data, which are plentiful for many natural languages.  In this work, we induce distributed representations for a pair of languages jointly.  We treat it as a multitask learning problem where each task corresponds to a single word, and task relatedness is derived from co-occurrence statistics in bilingual parallel data.  These representations can be used for a number of crosslingual learning tasks, where a learner can be trained on annotations present in one language and applied to test data in another.  We show that our representations are informative by using them for crosslingual document classification, where classifiers trained on these representations substantially outperform strong baselines when applied to a new language.

Paper Title: Deep Attribute Networks [arxiv]
Author Names: Junyoung Chung*, KAIST; Donghoon Lee, KAIST; Youngjoo Seo, KAIST; Chang D. Yoo, KAIST
Abstract: Obtaining compact and discriminative features is one of the major challenges in many of the real-world image classification tasks such as face verification and object recognition. One possible approach is to represent input image on the basis of high-level features that carry semantic meaning that humans can understand. In this paper, a model coined deep attribute network (DAN) is proposed to address this issue. For an input image, the model outputs the attributes of the input image without performing any classification. The efficacy of the proposed model is evaluated on unconstrained face verification and real-world object recognition tasks using the LFW and the a-PASCAL datasets. We demonstrate the potential of deep learning for attribute-based classification by showing comparable results with existing state-of-the-art results. Once properly trained, the DAN is fast and does away with calculating low-level features which are maybe unreliable and computationally expensive.

Paper Title: Accelerating sparse restricted Boltzmann machine training using non-Gaussianity measures [pdf]
Author Names: Sander Dieleman*, Ghent University; Benjamin Schrauwen
Abstract: In recent years, sparse restricted Boltzmann machines have gained popularity as unsupervised feature extractors. Starting from the observation that their training process is biphasic, we investigate how it can be accelerated: by determining when it can be stopped based on the non-Gaussianity of the distribution of the model parameters, and by increasing the learning rate when the learnt filters have locked on to their preferred configurations. We evaluated our approach on the CIFAR-10, NORB and GTZAN datasets.

Paper Title: Not all signals are created equal: Dynamic Objective Auto-Encoder for Multivariate Data [pdf]
Author Names: Martin Längkvist*, Örebro University; Amy Loutfi, Örebro University
Abstract: There is a representational capacity limit in a neural network defined by the number of hidden units. For multimodal time-series data, there could exist signals with various complexity and redundancy. One way of getting a higher representational capacity for such input data is to increase the number of units in the hidden layer. We propose a step towards dynamically change the number of units in the visible layer so that there is less focus on signals that are difficult to reconstruct and more focus on signals that are easier to reconstruct with the goal to improve classification accuracy and also better understand the data itself. A comparison with state-of-the-art architectures show that our model achieves a slightly better classification accuracy on the task of classifying various styles of human motion.

Paper Title: When Does a Mixture of Products Contain a Product of Mixtures? [arxiv]
Author Names: Guido Montufar Cuartas*, Pennsylvania State University; Jason Morton, Pennsylvania State University
Abstract: We prove results on the relative representational power of mixtures of products and products of mixtures; more precisely restricted Boltzmann machines. In particular we find that an exponentially larger mixture model, requiring an exponentially larger number of parameters, is required to represent the distributions that can be represented by the restricted Boltzmann machine. This formally confirms a common intuition. Tools of independent interest are mode-based polyhedral approximations sensitive enough to compare even full-dimensional models, and characterizations of possible mode and support sets of both model classes. The title question is intimately related to questions in coding theory and the theory of hyperplane arrangements.

Paper Title: Kernels and Submodels of Deep Belief Networks [arxiv]
Author Names: Guido Montufar Cuartas*, Pennsylvania State University; Jason Morton, Pennsylvania State University
Abstract: We describe mixture of products represented by layered networks from the perspective of linear stochastic maps, or kernel transitions of probability distributions. This gives a unified picture of distributed representations arising from  Deep Belief Networks (DBN) and other networks without lateral interactions. We describe combinatorial and geometric properties of the set of kernels and concatenations of kernels realizable by DBNs as the parameters vary. We present explicit classes of probability distributions that can be learned by DBNs depending on the number of hidden layers and units that they contain. We use these submodels to bound the maximal and the expected Kullback-Leibler approximation errors of DBNs from above.

Paper Title: Online Representation Search and Its Interactions with Unsupervised Learning [pdf]
Author Names: Ashique Mahmood*, University of Alberta; Richard Sutton, University of Alberta
Abstract: We consider the problem of finding good hidden units, or features, for use in multilayer neural networks. Solution methods that generate candidate features, evaluate them, and retain the most useful ones (such as cascade correlation and NEAT), we call representation search methods. In this paper, we explore novel representation search methods in an online setting, compare them with two simple unsupervised learning algorithms that also scale online. We demonstrate that the unsupervised learning methods are effective only at the initial learning period. However, when combined with search strategies, they are able to improve representation with more data and perform better than either of search and unsupervised learning alone. We conclude that search has enabling effects on unsupervised learning in continual learning tasks.

Paper Title: Learning global properties of scene images from hierarchical representations [pdf]
Author Names: Wooyoung Lee*, Carnegie Mellon University; Michael Lewicki, Case Western Reserve University
Abstract: Scene images with similar spatial layout properties often display characteristic statistical regularities on a global scale. In order to develop an efficient code for these global properties that reflects their inherent regularities, we train a hierarchical probabilistic model to infer conditional correlational information from scene images. Fitting a model to a scene database yields a compact representation of global information that encodes salient visual structures with low dimensional latent variables. Using perceptual ratings and scene similarities based on spatial layouts of scene images, we demonstrate that the model representation is more consistent with perceptual similarities of scene images than the metrics based on the state-of-the-art visual features.

Paper Title: Theano: new features and speed improvements [arxiv]
Author Names: Pascal Lamblin*, Université de Montréal; Frédéric Bastien, Université de Montréal; Razvan Pascanu, Universite de Montreal; James Bergstra, Harvard University; Ian Goodfellow, Université de Montréal; Arnaud Bergeron, Université de Montréal; Nicolas Bouchard, Université de Montréal; David Warde-Farley, Université de Montréal; Yoshua Bengio, University of Montreal
Abstract: Theano is a linear algebra compiler that optimizes a user's symbolically-specified mathematical computations to produce efficient low-level implementations.  In this paper, we present new features and efficiency improvements to Theano, and benchmarks demonstrating Theano's performance relative to Torch7, a recently introduced machine learning library, and to RNNLM, a C++ library targeted at recurrent neural networks.

Paper Title: Deep Target Algorithms for Deep Learning [pdf]
Author Names: Pierre Baldi*, UCI; Peter Sadowski, UCI
Abstract: There are many algorithms for training shallow architectures, such as perceptrons, SVMs, and shallow neural networks. Backpropagation (gradient descent) works well for a few layers but breaks down beyond a certain depth due to the well-known problem of vanishing or exploding gradients, and similar observations can be made for other shallow training algorithms. Here we introduce a novel class of algorithms for training deep architectures. This class reduces the difficult problem of training a deep architecture to the easier problem of training many shallow architectures by providing suitable targets for each hidden layer without backpropagating gradients, hence the name of deep target algorithms. This approach is very general, in that it works with both differentiable and non-differentiable functions, and can be shown to be convergent under reasonable assumptions. It is demonstrated here by training a four-layer autoencoder of non-differentiable threshold gates and a 21-layer neural network on the MNIST handwritten digit dataset.

Paper Title: Knowledge Matters: Importance of Prior Information for Optimization [pdf]
Author Names: Caglar Gulcehre*, University of Montreal; Yoshua Bengio, University of Montreal
Abstract: We explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all the state-of-the-art machine learning algorithms tested failed to learn. We motivate our work from the hypothesis that humans learn such intermediate concepts from other individuals via a form of supervision or guidance using a curriculum. The experiments we have conducted provide positive evidence in favor of this hypothesis. In our experiments, a two-tiered MLP architecture is trained on a dataset with 64x64 binary input images, each image with three sprites. The final task is to decide whether all the sprites are the same or one of them is different. Sprites are pentomino tetris shapes and they are placed in an image with different locations using scaling and rotation transformations. The first level of the two-tiered MLP is pre-trained with intermediate level targets being the presence of sprites at each location, while the second level takes the output of the first level as input and predicts the final task target binary event. The two-tiered MLP architecture, with a few tens of thousand examples, was able to learn the task perfectly, whereas all other algorithms (including unsupervised pre-training, but also traditional algorithms like SVMs, decision trees and boosting) all perform no better than chance. We hypothesize that the optimization difficulty involved when the intermediate pre-training is not performed is due to the composition of two highly non-linear tasks. Our findings are also consistent with hypotheses on cultural learning inspired by the observations of optimization problems with deep learning, presumably because of effective local minima.

Paper Title: Understanding the exploding gradient problem [pdf]
Author Names: Razvan Pascanu*, Universite de Montreal; Tomas Mikolov, Brno University of Technology; Yoshua Bengio, University of Montreal
Abstract: The process of training Recurrent Neural Networks suffers from several issues, making this otherwise elegant model hard to use in practice. In this paper we focus on one such issue, namely the exploding gradient. Besides a careful and insightful description of the problem, we propose a simple yet efficient solution, which by altering the direction of the gradient avoids taking large steps while still following a descent direction.
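
As a toy illustration of the exploding gradient issue named in the abstract above, the sketch below pushes a gradient back through a scalar linear recurrence and then applies norm clipping, a widely used generic remedy. Both the scalar recurrence and the clipping rule are assumptions chosen for simplicity; norm clipping rescales the gradient rather than redirecting it, so it should not be read as the specific solution proposed in the paper.

```python
import math

def backprop_norms(w, steps):
    """Gradient magnitude after each step back through h_t = w * h_{t-1}.

    In this scalar toy model the Jacobian of every step is just w, so the
    magnitude after k steps is |w|**k: it explodes for |w| > 1 and vanishes
    for |w| < 1.
    """
    grad = 1.0
    norms = []
    for _ in range(steps):
        grad *= w
        norms.append(abs(grad))
    return norms

def clip_by_norm(grad, threshold):
    """Rescale grad (a list of floats) so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]

growth = backprop_norms(1.5, 20)          # grows geometrically with depth
clipped = clip_by_norm([3.0, 4.0], 1.0)   # norm 5 rescaled down to norm 1
```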

Paper Title: Joint Training of Partially-Directed Deep Boltzmann Machines [pdf]
Author Names: Ian Goodfellow*,  Universite de Montreal; Aaron Courville, Universite de Montreal; Yoshua Bengio, Universite de Montreal
Abstract: We introduce a deep probabilistic model which we call the partially directed deep Boltzmann machine (PD-DBM). The PD-DBM is a model of real-valued data based on the deep Boltzmann machine (DBM) and the spike-and-slab sparse coding (S3C) model. We offer a hypothesis for why DBMs may not be trained successfully without greedy layerwise training, and motivate the PD-DBM as a modified DBM that can be trained jointly.

Paper Title: Robust Subspace Clustering [pdf]
Author Names: Mahdi Soltanolkotabi*, Stanford; Ehsan Elhamifar; Emmanuel Candes, Stanford
Abstract: Subspace clustering is the problem of finding a multi-subspace representation that best fits a collection of points taken from a high-dimensional space. In this paper, we show that robust subspace clustering is possible using a tractable algorithm, which is a natural extension of Sparse Subspace Clustering (SSC). We prove that our methodology can learn the underlying subspaces under minimal requirements on the orientation of the subspaces, and on the number of samples needed per subspace. Stated differently, this work shows that it is possible to denoise a full-rank matrix if the columns lie close to a union of lower dimensional subspaces. We also provide synthetic as well as real data experiments demonstrating the effectiveness of our approach.

Paper Title: Regularized Auto-Encoders Estimate Local Statistics [pdf]
Author Names: Guillaume Alain, Universite de Montreal; Yoshua Bengio*, University of Montreal
Abstract: What do auto-encoders learn about the underlying data generating distribution? Recent work suggests that some auto-encoder variants do a good job of capturing the local manifold structure of the unknown data generating density.  This paper clarifies these previous intuitive observations by showing that minimizing a particular form of regularized reconstruction error yields a reconstruction function that locally characterizes the shape of the data generating density. More precisely, we show that the auto-encoder captures the local mean and local covariance (the latter being related to the tangent plane of a manifold near which density concentrates) as well as the first and second derivatives of the density, thereby connecting to previous work linking denoising auto-encoders and   score matching for a particular form of energy function.  Instead, the theorems provided here are completely generic and do not depend on the parametrization of the auto-encoder: they show what the auto-encoder would tend to if given enough capacity and examples. These results are for a training criterion that is locally equivalent to the denoising auto-encoder training criterion, and involves a contractive penalty, but applied on the whole reconstruction function rather than just on the encoder.  One can consider the proposed training criterion as a convenient alternative to maximum likelihood, i.e., without partition function, similarly to score matching. Finally, we make the connection to existing sampling algorithms for such autoencoders, based on an MCMC walking near the high-density manifold.

Paper Title: Attribute Based Object Identification [pdf]
Author Names: Yuyin Sun, University of Washington; Liefeng Bo*, Intel Science and Technology C; Dieter Fox
Abstract: Over the last years, the robotics community has made substantial progress in detection and 3D pose estimation of known and unknown objects. However, the question of how to identify objects based on language descriptions has not been investigated in detail. While the computer vision community recently started to investigate the use of attributes for object recognition, these approaches do not consider the task settings typically observed in robotics, where a combination of appearance attributes and object names might be used to identify specific objects in a scene. In this paper, we introduce an approach for identifying objects based on appearance and name attributes. To learn rich RGB-D features needed for attribute classification, we extend recently introduced sparse coding techniques so as to automatically learn attribute specific color and depth features. We use Mechanical Turk to collect a large data set of attribute descriptions of objects in the RGB-D object dataset. Our experiments show that learned attribute classifiers outperform previous instance based techniques for object identification. We also demonstrate that attribute specific features provide significantly better generalization to previously unseen attribute values, thereby enabling more rapid learning of new attribute values.

Paper Title: Learning High-Level Concepts by Training A Deep Network on Eye Fixations [pdf]
Author Names: Chengyao Shen*, National University of Singapo; Mingli Song, Zhejiang University; Qi Zhao, National University of Singapore
Abstract: Visual attention is the ability to select visual stimuli that are most behaviorally relevant among the many others. It allows us to allocate our limited processing resources to the most informative part of the visual scene. In this paper, we learn general high-level concepts with the aid of selective attention in a principled unsupervised framework, where a three layer deep network is built and greedy layer-wise training is applied to learn mid- and high- level features from salient regions of images. The network is demonstrated to be able to successfully learn meaningful high-level concepts such as faces and texts in the third-layer and mid-level features like junctions, textures, and parallelism in the second-layer. Unlike pre-trained object detectors that are recently included in saliency models to predict semantic objects, the higher-level features we learned are general base features that are not restricted to one or few object categories. A saliency model built upon the learned features demonstrates its competitive predictive power in natural scenes compared with existing methods.

Paper Title: Multipath Sparse Coding Using Hierarchical Matching Pursuit [pdf]
Author Names: Liefeng Bo*, Intel Science and Technology C; Xiaofeng Ren; Dieter Fox
Abstract: Complex real-world signals, such as images, contain discriminative structures that differ in many aspects including scale, invariance, and data channel. While progress in deep learning shows the importance of learning features through multiple layers, it is equally important to learn features through multiple paths. We propose Multipath Hierarchical Matching Pursuit (M-HMP), a novel feature learning architecture that combines a collection of hierarchical sparse features for image classification to capture multiple aspects of discriminative structures. Our building blocks are KSVD and batch orthogonal matching pursuit (OMP), and we apply them recursively at varying layers and scales. The result is a highly discriminative image representation that leads to large improvements to the state-of-the-art on many standard benchmarks, e.g. Caltech-101, Caltech-256, MIT-Scenes and Caltech-UCSD Bird-200.
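
Since orthogonal matching pursuit is named as a building block in the abstract above, here is a minimal sketch of plain OMP for a single signal. It is not the batch variant, it does not include KSVD dictionary learning, and the helper names, the normal-equations solver and the toy dictionary are assumptions made for this illustration. The dictionary atoms are assumed to have unit norm.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_linear(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def omp(atoms, signal, sparsity):
    """Greedy sparse coding: returns {atom index: coefficient}, at most `sparsity` entries."""
    residual = list(signal)
    support, coeffs = [], []
    for _ in range(sparsity):
        # Pick the atom most correlated with the current residual.
        scores = [abs(dot(a, residual)) for a in atoms]
        best = max(range(len(atoms)), key=lambda k: scores[k])
        if scores[best] < 1e-12 or best in support:
            break
        support.append(best)
        # Refit all selected coefficients by least squares (normal equations).
        gram = [[dot(atoms[i], atoms[j]) for j in support] for i in support]
        rhs = [dot(atoms[i], signal) for i in support]
        coeffs = solve_linear(gram, rhs)
        # Recompute the residual from the refitted coefficients.
        residual = list(signal)
        for c, i in zip(coeffs, support):
            residual = [r - c * a for r, a in zip(residual, atoms[i])]
    return dict(zip(support, coeffs))

code = omp([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [2.0, 0.0, -3.0], 2)
```

The least-squares refit over the whole support at every step is what distinguishes OMP from plain matching pursuit, which only updates the newest coefficient.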

Paper Title: Jointly Learning and Selecting Features via Conditional Point-wise Mixture RBMs  [pdf]
Author Names: Kihyuk Sohn*, University of Michigan; Guanyu Zhou; Honglak Lee, University of Michigan
Abstract: Feature selection is an important technique for finding relevant features from high-dimensional data. However, the performance of feature selection methods is often limited by the raw feature representation. On the other hand, unsupervised feature learning has recently emerged as a promising tool for extracting useful features from data. Although supervised information can be exploited in the process of supervised fine-tuning (preceded by unsupervised pre-training), the training becomes challenging when the unlabeled data contain significant amounts of irrelevant information. To address these issues, we propose a new generative model, the conditional point-wise mixture restricted Boltzmann machine, which attempts to perform feature grouping while learning the features. Our model represents each input coordinate as a mixture model when conditioned on the hidden units, where each group of hidden units can generate the corresponding mixture component. Furthermore, we present an extension of our method that combines bottom-up feature learning and top-down feature selection in a coherent way, which can effectively handle irrelevant input patterns by focusing on relevant signals and thus learn more informative features. Our experiments show that our model is effective in learning separate groups of hidden units (e.g., that correspond to informative signals vs. irrelevant patterns) from complex, noisy data.

N00198979.jpg was taken on December 09, 2012 and received on Earth December 09, 2012. The camera was pointing toward ENCELADUS at approximately 434,725 miles (699,622 kilometers) away, and the image was taken using the BL1 and CL2 filters.

Image Credit: NASA/JPL/Space Science Institute
