Algorithms for modern Big Data analysis must deal with both a massive number of samples and a large number of features (high dimensionality). One way to cope with these challenges is to assume and discover the existence of localization in the data by uncovering its intrinsic geometry. This approach suggests that different data segments can be analyzed separately and then unified in order to gain an understanding of the whole phenomenon. Methods that efficiently exploit localized data are attractive for high-dimensional big data analysis because they can be parallelized, which keeps the computational resources they require realistic and affordable. These methods can also explore local properties, such as the intrinsic dimension, that vary among different pieces of the data. This thesis presents two different methods to locally analyze large datasets for classification, clustering and anomaly detection. The first method localizes dictionary learning based on matrix factorization techniques. We utilize randomized LU decomposition and QR decomposition algorithms to build dictionaries that describe different types of data. These dictionaries are then used to assign new samples to their respective classes. One application in cyber security deals with learning computer files and detecting executable code hidden in PDF files. In a different application, a dictionary learned from the data of a normally behaving computer network is used to detect anomalies in test data, which may indicate a cyber threat.
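To make the dictionary idea concrete, here is a minimal sketch in Python with NumPy. It is not the thesis's algorithm: it assumes each class is described by an orthonormal dictionary obtained from a Gaussian random projection followed by QR, and that a new sample is assigned to the class whose dictionary leaves the smallest reconstruction residual. The function names (`randomized_basis`, `classify`), the synthetic data and the rank are all illustrative assumptions.

```python
import numpy as np

def randomized_basis(X, rank, rng):
    """Orthonormal dictionary approximating the column space of X.

    X has shape (n_features, n_samples). Assumption: a Gaussian random
    projection followed by QR stands in for the thesis's factorizations.
    """
    # Sample the range of X with a random Gaussian test matrix.
    omega = rng.standard_normal((X.shape[1], rank))
    Y = X @ omega
    # QR orthonormalizes the sampled range; Q has shape (n_features, rank).
    Q, _ = np.linalg.qr(Y)
    return Q

def classify(x, dictionaries):
    """Return the label whose dictionary reconstructs x with the smallest residual."""
    residuals = {
        label: np.linalg.norm(x - Q @ (Q.T @ x))
        for label, Q in dictionaries.items()
    }
    return min(residuals, key=residuals.get)

# Synthetic data (hypothetical): two classes, each lying in its own
# random 3-dimensional subspace of a 50-dimensional feature space.
rng = np.random.default_rng(0)
n_features, n_samples, rank = 50, 200, 3
bases = {label: rng.standard_normal((n_features, rank)) for label in ("a", "b")}
train = {label: B @ rng.standard_normal((rank, n_samples)) for label, B in bases.items()}

# One dictionary per class, learned independently (and hence parallelizable).
dictionaries = {label: randomized_basis(X, rank, rng) for label, X in train.items()}

# A new sample drawn from class "b".
test_sample = bases["b"] @ rng.standard_normal(rank)
predicted = classify(test_sample, dictionaries)
print(predicted)
```

Because each dictionary is built from its own class's data only, the per-class factorizations are independent of one another, which is the property that makes this kind of localized approach easy to parallelize. The same residual could also serve as an anomaly score against a single dictionary learned from normal data, although the threshold would have to be chosen for the data at hand.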
The second method is the localized diffusion process (LDP), which constitutes a coarse-graining of the classic Diffusion Maps algorithm. In LDP, a Markov walk is calculated on small clouds of data points instead of on the original data points. This work establishes a theoretical foundation for the Localized Diffusion Folders method for hierarchical data analysis.
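The coarse-graining idea can be illustrated with a small sketch. This is not the LDP construction from the thesis: it assumes a Gaussian-kernel Markov matrix on the points, as in classic Diffusion Maps, and lumps it into a cloud-level walk by averaging, over the points of one cloud, the probability of jumping into another cloud. The function names, the kernel width `epsilon` and the predefined, non-empty clouds are all assumptions made for illustration.

```python
import numpy as np

def markov_matrix(points, epsilon):
    """Row-stochastic Markov matrix from a Gaussian kernel on the points."""
    # Pairwise squared Euclidean distances, shape (n_points, n_points).
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / epsilon)
    # Normalize each row so it sums to one.
    return K / K.sum(axis=1, keepdims=True)

def coarse_grain(P, cloud_of_point, n_clouds):
    """Lump a point-level walk P into a cloud-level walk.

    Q[i, j] is the average, over the points in cloud i, of the probability
    of stepping to any point in cloud j. Assumes every cloud is non-empty.
    """
    n_points = P.shape[0]
    # Membership matrix: S[p, c] = 1 if point p belongs to cloud c.
    S = np.zeros((n_points, n_clouds))
    S[np.arange(n_points), cloud_of_point] = 1.0
    sizes = S.sum(axis=0)
    return (S.T @ P @ S) / sizes[:, None]

# Hypothetical data: three small point clouds along a line in 2D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.0]])
points = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])
cloud_of_point = np.repeat(np.arange(3), 20)

P = markov_matrix(points, epsilon=4.0)
Q = coarse_grain(P, cloud_of_point, n_clouds=3)
print(Q.shape)
```

The resulting `Q` is again row-stochastic, so it defines a Markov walk on three clouds rather than sixty points. Repeating the lumping step on the clouds themselves would give a hierarchy of progressively coarser walks.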