Wednesday, May 26, 2004

Managing complexity through compression

This paper on language trees and zipping was one of the first to show that the elaborate algorithms inside compression tools can also be used to measure how close one file is to another. In that case, the authors compared the languages of one specific text, the Declaration of Human Rights. They were able to sort the different languages very convincingly and show how each of them is related to the others. A very impressive feat. Later, Rudi Cilibrasi at CWI in the Netherlands has been looking for other ways to do similar sorting; rhythm and melody was the first experiment, and he is also trying others. Instead of using the entropy discussion of the language tree zipping paper, he uses a different "distance" between objects. Much can be understood by reading this paper.

All these publications show that some of the research in compression can be directly applied to the sorting of very complex objects. The better an algorithm compresses, the more likely it is to capture the most important features of a document. Hence it would seem important to be able to use different compression schemes, as can be shown here or here. However, the "clustering by compression" article mentions that this is only tangentially relevant. My guess is that it becomes irrelevant only when one uses a sufficiently good tool in the first place, i.e. when the zipping capabilities of WinZip or gzip are good enough. But as the Silesia corpus results show, different compression tools perform differently on different types of documents, so maybe there is a need to call on different compressors to get better sorting within specific sets. One has to note that schemes like MPEG for movies are also compression engines and could be used equivalently.
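To make the idea of a compression-based "distance" concrete, here is a minimal sketch of the normalized compression distance used in the "clustering by compression" work: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. Using Python's zlib as the compressor is my assumption (any of the tools mentioned above could stand in), and the sample inputs are made up for illustration.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical inputs: two near-identical texts and one block of pseudo-random bytes.
text_a = b"All human beings are born free and equal in dignity and rights. " * 10
text_b = b"All human beings are born free and equal in dignity and in rights. " * 10
noise = bytes(random.Random(0).randrange(256) for _ in range(640))

print("text_a vs text_b:", round(ncd(text_a, text_b), 3))
print("text_a vs noise: ", round(ncd(text_a, noise), 3))
```

The two texts should come out much closer to each other than either is to the noise. Building a tree of languages would then mean computing this distance for every pair of documents and feeding the resulting matrix to a clustering method.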

In this respect, what I find very interesting is the possibility of using these compression tools on raw CCD images, thereby eliminating aspects of the pictures that do not contain enough information. It so happens that another branch of engineering tries to do the same by putting meaning into the pictures at hand, namely by using a variety of wavelets, curvelets and other bases to analyze the content of an image.

In terms of day-to-day engineering, there is a whole field that could benefit from these techniques, namely multiphase flows. In multiphase flow, the topology of the different phases in pipes or containers, i.e. where each phase sits, is important for many engineering processes. For instance, when one wants to pump very crude oil from deep down (as in Canada), there is a need to lubricate the pipelines with water so that a concentric film of water touches the pipe wall and the core of the pipe is filled with oil. The pressure needed to pump the mixture in this configuration is much lower than if, for instance, the oil were to stick to the wall and the water were in the core of the pipeline. Hence, the fact that the water is on the outside is tremendously important for the pumping of that oil. This is called lubricated pipelining (vertical flow is available here). The configuration I am referring to is called annular flow in two-phase flow. Depending on the flow rate of either the water or the oil, the configuration may be far from the optimal topology described above and may yield very small bubbles of water in oil or vice versa.

One of the main jobs of the engineer is to figure out, by devising maps, when a specific flow regime will occur and whether it will hold or be unstable. Devising flow regime maps means building experiments that explore the flow rates and the configurations of the fluids. Once the experiment is done, a much needed step is categorizing the flows. In two-phase flow, for instance, we have names like bubbly flow, churn flow, annular flow, slug flow and so on. We also have pictures to describe them, and much guesswork goes into deciding how the pictures in the literature relate to what one sees in the experiment. I would not be surprised to find out that a more automatic way of classifying these flows, through the compression of the movies taken of the flow, would yield a more universal classification technique.
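As a rough illustration of what such an automatic classification could look like, the sketch below labels a recording with the name of the reference recording it is closest to under a compression distance. Everything here is an assumption of mine: zlib stands in for a movie compressor, the "frames" are synthetic byte textures rather than real flow movies, and the helper names (synthetic_frames, perturb, classify) are hypothetical.

```python
import random
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib as an assumed stand-in compressor."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def synthetic_frames(alphabet: bytes, seed: int, size: int = 2000) -> bytes:
    """Toy replacement for recorded frames: a reproducible byte texture."""
    rng = random.Random(seed)
    return bytes(rng.choice(alphabet) for _ in range(size))


def perturb(data: bytes, flips: int, seed: int) -> bytes:
    """Overwrite a few randomly chosen bytes to mimic a new, noisy recording."""
    rng = random.Random(seed)
    out = bytearray(data)
    for _ in range(flips):
        out[rng.randrange(len(out))] = rng.choice(b".o|=")
    return bytes(out)


def classify(sample: bytes, references: dict) -> str:
    """Return the label of the reference with the smallest distance to the sample."""
    return min(references, key=lambda label: ncd(sample, references[label]))


# Hypothetical labelled references, one per flow regime.
references = {
    "bubbly": synthetic_frames(b".o", seed=1),
    "annular": synthetic_frames(b"|=", seed=2),
}

# A new "recording" that is a noisy variant of the bubbly reference.
sample = perturb(references["bubbly"], flips=20, seed=3)
print(classify(sample, references))
```

With real movies, the references would be recordings that have already been labelled by hand, and a compressor suited to video would presumably be a better choice than zlib. Whether the distances separate the regimes cleanly is exactly the open question.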

Other sources of information on compression can be found here.

1 comment:

Anonymous said...

Those are some insightful points. I and several others have also thought that wavelets are appropriate for image coding and analysis with compression, and we have even begun investigating it. I think the idea of using compression to categorize fluid flow regimes is a great one; I have also looked at those diagrams, e.g. for the different types of airflow around an airfoil, and there is definitely an opportunity for a certain type of texture recognition. I think most compressors do well at texture recognition through normal, local, statistical analysis and so I expect this idea would work well if somebody tries it. Thanks for your interesting ideas about this work. Cheers,

Rudi Cilibrasi
