Wednesday, May 26, 2004

Managing complexity through compression

This paper on language trees and zipping was one of the first to show that compression algorithms, elaborate as they are, can also be used to measure how close one file is to another. In that case, the authors compared translations of a single text, the Universal Declaration of Human Rights, into different languages. They were able to sort the languages very convincingly and show how each one relates to the others. A very impressive feat. Later, Rudi Cilibrasi at CWI in the Netherlands started looking for other ways to do similar sorting; rhythm and melody was the first experiment, and he is trying others. Instead of the entropy argument used in the language tree zipping paper, he uses a different "distance" between objects. Much can be understood by reading this paper. All these publications show that some of the research in compression can be applied directly to the sorting of very complex objects. The better an algorithm compresses, the more likely it is to capture the most important features of a document. Hence it would seem important to be able to use different compression schemes, as can be shown here or here. However, the "clustering by compression" article mentions that the choice of compressor is only tangentially relevant. My guess is that it becomes irrelevant only when one uses a sufficiently good tool in the first place, i.e. the zipping capabilities of WinZip or gzip are good enough. But as the Silesia corpus results show, different compression tools perform differently on different types of documents, so there may be a need to call on different compressors to sort better within specific sets. Note that schemes like MPEG for movies are also compression engines and could be used in the same way.
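
To make the idea of a compression-based "distance" concrete, here is a minimal sketch of the normalized compression distance described in the "clustering by compression" article. The choice of zlib as the compressor, the function names and the sample strings are all assumptions made for illustration; this is not the code used in either paper.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size. Values near 0 mean "similar",
    values near 1 mean "unrelated".
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical inputs: a repeated English sentence and pseudo-random noise.
    text = b"All human beings are born free and equal in dignity and rights. " * 20
    rng = random.Random(0)
    noise = bytes(rng.randrange(256) for _ in range(len(text)))
    print("text vs. itself:", ncd(text, text))
    print("text vs. noise: ", ncd(text, noise))
```

Computing this distance for every pair of files gives a distance matrix that any standard clustering method can turn into a tree. A stronger compressor can be swapped in by changing `compressed_size` only.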

In this respect, what I find very interesting is the possibility of using these compression tools on raw CCD images, thereby eliminating the aspects of the pictures that do not contain enough information. It so happens that another branch of engineering tries to do the same by putting meaning into the pictures at hand, namely by using a variety of wavelets, curvelets and other bases to analyze the content of an image.

In terms of day-to-day engineering, there is a whole field that could benefit from these techniques, namely multiphase flows. In multiphase flow, the topology of the different phases in pipes or containers matters for many engineering processes. For instance, when one wants to pump heavy crude oil from deep down (as in Canada), the pipeline needs to be lubricated with water so that a concentric film of water touches the pipe wall while the core of the pipe is filled with oil. The pressure needed to pump the mixture in this configuration is much lower than if, for instance, the oil were to stick to the wall with the water in the core of the pipeline. Hence, the fact that the water is on the outside is tremendously important for pumping that oil. This is called lubricated pipelining (vertical flow is available here). The configuration I am referring to is called annular flow in two-phase flow. Depending on the flow rate of the water or of the oil, the configuration may be far from the optimal topology described above and may yield very small bubbles of water in oil, or vice versa. One of the main jobs of the engineer is to figure out, by devising maps, when a specific flow regime will occur and whether it will hold or be unstable. Devising flow regime maps means building experiments that explore the flow rates and the configurations of the fluids. Once the experiment is done, a much-needed step is categorizing the flows. In two-phase flow, for instance, we have names like bubbly flow, churn flow, annular flow, slug flow and so on. We also have pictures to describe them, and much guesswork goes into deciding how the pictures in the literature relate to what one sees in the experiment. I would not be surprised to find out that a more automatic way of classifying these flows, through the compression of movies taken of the flow, would yield a more universal classification technique.

Other sources of information on compression can be found here.

Sunday, May 23, 2004

In search of life on June 8th, 2004.

In order to find exoplanets, astronomers look for configurations where planets pass in front of their own stars. In the solar system, since we live on the third planet and have only one moon, it is rare to see the same type of phenomenon close by. It will be the case, however, for the Venus transit on June 8th, 2004. This is a rare occurrence because, even though Venus orbits the Sun faster than Earth does, its orbit does not lie in the same plane as Earth's. The next Venus transit will occur in 2012. What makes Venus transits fascinating, as opposed to Mercury transits, is that Venus has an atmosphere. The astronomy community will probably use this event to improve interferometry techniques in order to recognize exoplanets light-years away from us.

Unfortunately this Venus transit won't be observable from Texas. Since Moon eclipses and Mercury or Venus transits are rare occurrences, asteroids whose orbits are smaller than Earth's are probably good candidates for calibrating some of these interferometry techniques.

Thursday, May 20, 2004

Who's close to me

I had been looking for a service like this for a long time. If you live in one of the fifty largest cities in France, you give it your address and the type of business you are interested in, and it provides a map with ten possibilities and directions on how to get there. Some people are trying this in the U.S. as well.

Wednesday, May 19, 2004

TEX-MEMS VI has been scheduled

After starting the first meeting back in 1999, we enjoyed seeing it held in other places and take on a life of its own. So after five years, it comes back home. TEX-MEMS VI will be at Texas A&M University on September 9th, 2004. You can register online right now.

Can your old Game Boy be a savior?

I recently read about a professor, Marcel Cremmel, in Strasbourg, France, who had to devise a cheap cardiograph in order to help his brother, who was in Madagascar and had come down with malaria. In his words: "It all started with a visit to my brother in Madagascar. He was just coming out of a fairly acute bout of malaria and had been prescribed a drug that can have adverse effects on the heart." The drug given to malaria patients is halofantrine, which can trigger cardiac problems, hence the need for a way to measure heart rhythms. The old Game Boy console just needs some reengineering to provide the needed measurement. According to this article (my translation), the console itself is not changed in the process; only the cartridge is modified. The new card is connected to three electrodes, which are themselves attached to one foot and the two wrists. This set-up lets one see on the screen whether the heart is relaxing properly and at the right time after its contraction.

Tuesday, May 18, 2004

See it from the ground, see it from the sky

See it from the ground or see it from the sky

Urban legends and the propagation of rumors

So now I am reading this and I am thinking there must be some urban legend to it. First, there is no real date and no author, and I find it rather odd for a researcher to wait 20 years to publish a finding of some sort. It has all the makings of an urban legend. To find out whether it is one, I generally use Google and type in some of the words from the story; if the first three links mention hoaxes, it almost certainly is one. Three sites generally show up: snopes, Hoaxbusters and hoaxbuster, a French resource on hoaxes. I tried the words "1245 of them" and the first result that comes up is this. So I looked up the book on Amazon and read the customer reviews. By this time, I am pretty much convinced this is a hoax. This one is not so bad because everybody really wants to believe the story, but sometimes the propagation of rumors has a direct negative impact on people, countries or companies. I found a lot of good references on the propagation of rumors here. What I find fascinating when reading Kapferer's book is the ability of these stories to change while remaining as potent as the initial rumor.

Wednesday, May 12, 2004

The readiness is all

It looks like the mechanism by which Grand Challenge contestants could fund themselves while building the technology to win the Grand Challenge race is now here. This Learning Applied to Ground Robots (LAGR) program comes from the following observation taken from the PIP:

"...Current systems for autonomous ground robot navigation typically rely on hand-crafted, hand-tuned algorithms for the tasks of obstacle detection and avoidance. While current systems may work well in open terrain or on roads with no traffic, performance falls short in obstacle-rich environments. In LAGR, algorithms will be created that learn how to navigate based on their own experience and by mimicking human teleoperation. It is expected that systems developed in LAGR will provide a performance breakthrough in navigation through complex terrain....Because of the inherent range limitations of both stereo and LADAR, current systems tend to be “near-sighted,” and are unable to make good judgments about the terrain beyond the local neighborhood of the vehicle. This near-sightedness often causes the vehicles to get caught in cul-de-sacs that could have been avoided if the vehicle had access to information about the terrain at greater distances. Furthermore, the pattern recognition algorithms tend to be non-adaptive and tuned for particular classes of obstacles. The result is that most current systems do not learn from their own experience, so that they may repeatedly lead a vehicle into the same obstacle, or unnecessarily avoid a class of “traversable obstacles” such as tall weeds..."

Tall weeds, huh? Sounds like experience talking.

Impressive numbers 2004.

The BBC has an archive that is about 10 PB large. That's 10 billion Mbytes.

Tuesday, May 11, 2004

The Davalos-Carron Law, or the DC Law.

Our internal application provides a way of tracking how action items evolve within our organization and also serves as a repository for documents exchanged between different parties (internal and external). In other words, one part of this application can be considered the equivalent of all E-mail/Messenger discussions minus the attachments, while the other part can be considered the union of all E-mail attachments and hard drives in the company, as well as the FTP and internal web sites (intranet). Over the past three years, we have found a simple relationship between the amount of data in the first part (text of E-mail/Messenger discussions) posted to the application and the data in the second part (attachments of E-mails, FTP sites, internal library...): the second part is a thousand times bigger than the first. This means that when Google offers 1 GB of E-mail storage, it should be equivalent to about 1 MB of E-mails without attachments. Since most E-mails are no more than about 300 bytes each plus headers (so an E-mail is about 1 to 2 KB), there is a good chance that the average Gmail account will not contain more than about 1,000 E-mails. The main reason we think this law holds is that the user is told there is no storage limit, which is effectively the case with Google's new Gmail service (as opposed to the storage limits on Yahoo and Hotmail). When there is a storage limit, the user occasionally purges her/his system by removing the bigger items first and therefore compresses this ratio between light threads/communication and documentation (self-generated movies and audio, presentations, ...).
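
The back-of-the-envelope arithmetic above can be written out as a short sketch. The function name and the use of decimal units (1 GB = 10^9 bytes, 1 KB = 1000 bytes) are assumptions for illustration; the 1000-to-1 ratio and the 1 to 2 KB message size are the figures quoted in the post.

```python
def dc_law_estimate(storage_bytes, ratio=1000,
                    email_bytes_low=1000, email_bytes_high=2000):
    """Estimate how much plain E-mail text fits a given storage quota.

    Assumes attachments take roughly `ratio` times more room than the
    message text, so the text share is about storage_bytes / ratio.
    Returns (text_bytes, fewest_emails, most_emails).
    """
    text_bytes = storage_bytes // ratio
    fewest_emails = text_bytes // email_bytes_high  # every E-mail is 2 KB
    most_emails = text_bytes // email_bytes_low     # every E-mail is 1 KB
    return text_bytes, fewest_emails, most_emails


if __name__ == "__main__":
    text_bytes, fewest, most = dc_law_estimate(10**9)  # a 1 GB account
    print(f"text: {text_bytes} bytes, between {fewest} and {most} E-mails")
```

For a 1 GB account this gives 1 MB of text, or somewhere between 500 and 1,000 messages, which is where the "not more than about 1,000 E-mails" figure comes from.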

Not your average DSL

DSL refers to Domain Specific Languages, not your broadband connection. Anyway, ever since I discovered languages like LISP or Caml, I have been trying to figure out how to produce languages specific to a rather broad area of science. Broad is a rather loose term, so let me define it a little further. When the underlying physics of a specific phenomenon is common to many different fields, one expects the description of that phenomenon to be pretty much understood by every specialist. An example of this can be found in a product provided by Lexifi, in this case a domain specific language for contracts (or here for a general presentation). My intent is more along the lines of using the same type of concept for multilayer media and the linear transport equation; more on that later....

Sunday, May 09, 2004

Grand Challenges...

It looks like Wired obtained an explanation from each of the teams involved in the DARPA Grand Challenge. As one can see in one of my previous posts, it looks like they did not really get the right explanation. Oh well, you cannot blame the teams for not divulging their weaknesses. On a related note, NASA is coming up with grand challenges of its own. I hope they are not asking people to do a Mars mission for only $250K of prize money....

GPU programming, part II

It looks like the programming of GPUs is taking off. This article shows a 3.5 times speedup for a 1500 by 1500 matrix multiplication. This is not enough, for several reasons. One of them is Moore's law, which basically implies that within a few years this gain will be erased by faster CPUs. Well, one could say that with a 10000 by 10000 matrix the speedup would have been more impressive, and I would have to agree. However, there are not that many engineering problems that require a 10000 by 10000 matrix multiplication. Generally one wants to rely on sparse solvers, because a full system with a large number of elements is likely to have precision issues. The only class of problems that deals with that many elements is integral equations; however, there again, projecting the problem onto a reasonable basis like wavelets or curvelets should sparsify the system at hand.
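
As a rough check on the Moore's law argument, here is a small sketch of how long a fixed speedup survives if CPU performance keeps doubling at a steady pace. The doubling periods used (18 and 24 months) are assumptions based on the usual statements of the law, not figures from the article, and the function name is made up for illustration.

```python
import math


def years_until_cpu_catches_up(speedup, doubling_period_years):
    """Years of steady doubling needed for a CPU to erase a given speedup.

    A speedup of s corresponds to log2(s) doublings, each of which
    takes doubling_period_years.
    """
    return doubling_period_years * math.log2(speedup)


if __name__ == "__main__":
    for period in (1.5, 2.0):  # assumed doubling periods, in years
        years = years_until_cpu_catches_up(3.5, period)
        print(f"doubling every {period} years: caught up in {years:.1f} years")
```

Under these assumptions a 3.5 times gain is a little under two doublings, so it lasts on the order of three years; a shorter doubling period shrinks that proportionally.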

Bee annihilation

This movie is pretty tough to watch. Damn the hornets.

Saturday, May 08, 2004

Unreasonable faith in modeling makes us burn witches. Of earthquakes and spam.

As some of you know, the team working with Keilis-Borok has predicted that an earthquake of magnitude larger than 6.5 will hit Southern California before September 5th of this year. Their model has many parameters, but they show that even when these parameters vary it still predicts earthquakes with high reliability (only one false alarm out of five earthquakes). For those of us not entirely convinced, they ran the algorithm on Italian earthquakes from 1979 to 2001. The results are overall pretty compelling, as can be seen here in the maps all the way at the bottom. The only miss seems to have been the Assisi earthquake, which was categorized as a magnitude 6+ earthquake but predicted by the M8 algorithm to be only a 5.5. The worst that can happen is really a false alarm, a situation not unlike that of spam filters. In the fight against spam, everybody wants a spam filter that never discards a genuine E-mail. Customers expect 100% accuracy there. They can deal with some spam getting through to their mailbox, but they absolutely do not want real E-mails to fail to reach them. In the earthquake situation, people can deal with a model that misses an earthquake, but they absolutely cannot deal with a false alarm, because it destroys faith in the algorithm altogether.

Friday, May 07, 2004

Thursday, May 06, 2004

Everything leads to two-phase flow

I have already mentioned how tools like Visicalc were shown to give the user a false impression of control. Most of that discussion was based on a piece by Peter Coffee on how to go beyond the usual spreadsheet analysis by adding probability distributions to cells in a spreadsheet environment like Excel. DecisionTools Pro seems to be one of these products, as does Crystal Ball Pro. Both are priced at 1500 dollars and up. Not a bad price, but does it give you an additional impression of control without really providing any? What seems very awkward is the assumption that one knows the probability distribution of an event. Indeed, the idea that a particular process has a probability distribution known in advance (for instance a Gaussian) is very suspicious. One knows the law for a specific process only after having gathered a lot of data on that same process. This means that either one is a specialist in this very specific process in an unchanging environment, or the company has been gathering data on it over the years, both situations being very particular and overall pretty unusual. Either way, as Coffee points out, it really makes one treat the prospect of failure as an issue rather than believe that everything will work. Another piece of software mentioned by Coffee is Projected Financials, a new way of doing business forecasts using a different interface than the traditional error-prone Excel spreadsheet. In his words: "A revenue stream, for example, has certain characteristics, such as when it starts and what trend it exhibits. A financial statement aggregates some numbers, such as monthly profits that sum to yearly profits, but reports others, such as accounts receivable, as levels rather than flows." It so happens that this is a little bit like what we do with our application (task manager): a task is a very specific object with many attributes and a history, and it can hardly be reduced to a number.
I am currently evaluating how to use this concept to make our software more universal. This approach is akin to many of the approaches used in building a Domain Specific Language (DSL). Funnily enough, one of the case studies for this software (Projected Financials) is that of Joe Marsala, who in turn does two-phase flow for a living. For those of you who never wanted to ask me the difference between two-phase flow and single-phase flow cooling, here is a presentation that you might find interesting.
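
The idea of putting probability distributions in spreadsheet cells, discussed above, boils down to a Monte Carlo simulation. Here is a minimal sketch of what such tools might do under the hood. The two "cells" (revenue and cost), their Gaussian distributions and all the numbers are made up for illustration, which is exactly the kind of assumed knowledge the post is suspicious of. It is not how DecisionTools Pro or Crystal Ball Pro are actually implemented.

```python
import random


def simulate_profit(n_trials=20000, seed=0):
    """Monte Carlo version of a two-cell spreadsheet: profit = revenue - cost.

    Each cell is drawn from an assumed Gaussian instead of holding one number.
    Returns (mean_profit, probability_of_loss).
    """
    rng = random.Random(seed)
    profits = []
    for _ in range(n_trials):
        revenue = rng.gauss(100.0, 15.0)  # assumed mean and standard deviation
        cost = rng.gauss(80.0, 10.0)      # assumed mean and standard deviation
        profits.append(revenue - cost)
    mean_profit = sum(profits) / n_trials
    loss_probability = sum(p < 0 for p in profits) / n_trials
    return mean_profit, loss_probability


if __name__ == "__main__":
    mean_profit, loss_probability = simulate_profit()
    print(f"mean profit: {mean_profit:.1f}")
    print(f"chance of a loss: {loss_probability:.1%}")
```

A plain spreadsheet would report a single profit figure; the simulation also reports how often the result is negative. That is the point Coffee makes about putting failure on the table, but the answer is only as good as the distributions fed in.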

Tuesday, May 04, 2004

Orders of magnitude 2004.

It looks like it took at least 500 terabytes to make The Lord of the Rings trilogy. Google offers 1 GB E-mail accounts and seems to have over 79,000 CPUs running your favorite searches while indexing 4 billion pages.
To put things in perspective, that dwarfs some of the world's current top supercomputers doing classified work such as intelligence processing and nuclear stockpile stewardship (until Blue Gene/L comes online), or even environmental computations. In 2001, it took roughly 50 terabytes to produce Final Fantasy.