Tuesday, July 17, 2007

How does the Rice one pixel camera work ?

I have been asked this question several times by colleagues and friends, so I decided to use my talents in Microsoft Paint to try to provide an explanation with some beautifully handcrafted pictures.

The problem to solve is the following: Let's say you have only one sensitive pixel/photodetector/radiation detector/Teraherz detector at your disposal but you need to take a 10 Megapixel image like the one you can get from a Best Buy or an Amazon point-and-shoot cameras how would you go about it ? There are many ways to do this but let us also imagine that you can also put your hands on a DMD chip that is made of 10 million oscillating mirrors (an example include the famous Texas Instrument DMD) like the ones you can find in digital projectors and you can command the action of each and everyone of these tiny (15 micrometer by 15 micrometer) mirrors. In other words, with a proper set-up, every milliseconds, you can decide to shine each of these mirrors on your detector .... or not. There are now two options for obtaining this 10MP image.

First option (the raster mode):
The raster mode is simple. Just shine one mirror at a time onto the detector and let all the other mirros shine elsewhere. Do this once
twice (with another mirror)

thrice (with yet another mirror)
four times (....)
five times (...)
....5 millions times (...)
until you reach the last 10 millionth mirror.

After doing all this, you now have ten million information which put together piece by piece provides you with a 10 MP image. Generally, you then use a small CPU to perform the Discrete Cosine Transform so that eventually you are now the proprietor of a JPEG image (i.e. a compressed version of this 10MP image).


Second option (the Compressive Sensing mode):

You tell the set of mirrors, on that DMD chip, to display a set of random tilings. That way, a random set of mirrors are shining the incoming light unto the detector.

You do this once with an initial random tiling and obtain your first CS measurement
then you do this again with a second random tiling,  in order to obtain your second CS measurement

then you do this again with a third random tiling,  this is your third CS measurement
and so on.

Compressed sensing tells you that with very high probability, you will get the same result as the raster mode (first method) above but with many fewer CS measurements than the 10 million raster mode measurements obtained in the first method. In fact, instead of taking 10 million raster mode measurements, you are likely to need only 20 percent of that in the form of CS measurements, maybe even less.

The reason this second method works stems from the idea that most natural images are sparse in bases ranging from cosines, wavelets to curvelets (this is also why JPEG does a tremendous job in decreasing the size of most images). Functions that represent random tilings of reflective and non reflective mirrors (0s and 1s) are said to be mathematically "incoherent" with these bases thereby allowing an automatic compression at the detector level (here in the second mode, there is no need for compression with JPEG at the very end since the CS measurements are already compressed version of the image). A computational steps is required to obtain a human viewable image from these CS measurements. That step uses these solvers.

What are the pros and cons of the second option compared to the first one ?
Pros:
  1. The overall sensor requires very low power because there is no CPU/GPU/FPGA trying to perform the compression stage at the very end (JPEG).
  2. The sensor is dedicated to acquiring information. Information processing can be done somewhere else ( think on some other planet)
  3. Compared to raw images (raster mode output), the information to be transmitted is very much compressed albeit not optimally (compared to JPEG).
  4. Instead you can spend all your money designing the very best sensitive pixel you want, it may even act as a spectrometer (looking at many different energy bands), radiation detector and so forth.
  5. Last but not least, the amount of light that goes to the detector is about half of the 10 million mirrors, which is quite high compared to the single mirror exposure in the raster mode (first method). In other words, the signal to noise ratio is pretty high (a good thing) in the CS mode as opposed to the raster mode.
Cons:

  1. Faint signals could be submerged in the CS measurements. You first need to have a high signal to noise ratio signal to detect small signals.
  2. Your sensor has to have a much larger dynamic range than in the raster mode in order for the A/D quantization algorithm to not mess with the CS measurements.
  3. The number of CS measurements will be higher than the number of bits required to store the JPEG of the raster mode (about 4 to 5 times higher). Then again (and this is a pro-argument, the transformation from raster mode to JPEG is the power hungry stage of your camera and the reason you always need to recharge it, memory on the other hand is cheap.)


Terry Tao provides a much clearer mathematical explanation albeit with no beautifully hand crafted images such as the ones found here. All information about this single pixel camera system can be found at Rice University.

[ Update 2013: Inview Corporation develops single pixel cameras using the intellectual property from Rice University ]

8 comments:

Aleks said...

How would one be able to use this concept for compression: you would have to communicate the locations of those pixels too, which would take a lot of bits?

Second, why sample randomly, why not sample intelligently?

Igor said...

Aleks,

I guess your question is:
- The number of measurements is lower
- but you need to also transmit all the "masks" that produced the random measurement.
-> the total amount of information is much higher, because you have to send both the "masks" and the measurements.

Random needs to be taken in the following understanding:
- at the lab, you use your favorite random number generator and produce these masks ahead of time.
- you install those "masks" as tables/memory in the sensor
- when the sensor is in the field, it recalls these masks from its memory.
- when transmitting the compressed measurement somewhere, you can add a bit or two to let the receiver know which of the random mask was used.

So in effect, it is random, but it is not generated on the fly.

For the random intelligently, some people are doing it and the bayesian compressive sensing would fall in that category. However, in these types of system, doing any computation uses a lot of power and so the thinking is that if you want to have those for years, you do not want them to draw on the power side of things.

And then there is the kicker as mentioned by Richard Baraniuk and his team. Since you know these random measurements are pretty much universal, the information you are retrieving from them may not be the same as the ones you are retrieving from them ten years down the road where we will have much better reconstruction techniques. If you put intelligence in it today, you are most likely decreasing your ability to get anything interesting out these data in the future.

Igor.

Andrew Gelman said...

Thanks.

abhishek said...

Do we need a longer exposure time for these single pixel cameras? If 't' is the exposure time needed for traditional camera, then for single pixel camera do we need 'm*t' exposure time where 'm' is the number of distinct random masks?

Igor said...

Abhishek,

Yes, as far as I can tell you do need m*t exposure time.

Cheers,


Igor.

abhishek said...

Thanks for the answer. I have one more question.

As I understand, the random mask represents a projection matrix 'U' consisting of 0's and 1's, and image data 'x' (from spatial domain) is projected onto these bases.

Do we need that the data in original space should be sparse (i.e. 'x' should be sparse in the current representation)? or it can be sparse in any basis?

Another question is - suppose 'x' is not sparse in current representation. Do we need to design the random projection matrices so that they bring out the sparsity in data? or it can be any random matrix of 0's and 1's - which are ofcourse incoherent with cosine bases?

thanks a lot,
abhishek.

austin said...

I know this is a bit old, but I wanted to point out that a larger single pixel with back thinning, and an electron multiplying register could do well to improve single exposures, thereby reducing the acquisition time. - I have some descriptions of EM cameras at www.austinblanco.com/blog

Anonymous said...

Thanks for the very clear explanation.