Monday, June 06, 2011

Are Perceptual Hashes an instance of Compressive Sensing?

On Hacker News, iandanfortth wondered if perceptual hashes were related to compressed sensing. The answer is, I think, yes. What are perceptual hashes? According to the web page:

Perceptual hash algorithms describe a class of comparable hash functions. Features in the image are used to generate a distinct (but not unique) fingerprint, and these fingerprints are comparable.

Perceptual hashes are a different concept compared to cryptographic hash functions like MD5 and SHA1. With cryptographic hashes, the hash values are random. The data used to generate the hash acts like a random seed, so the same data will generate the same result, but different data will create different results. Comparing two SHA1 hash values really only tells you two things. If the hashes are different, then the data is different. And if the hashes are the same, then the data is likely the same. (Since there is a possibility of a hash collision, having the same hash values does not guarantee the same data.) In contrast, perceptual hashes can be compared -- giving you a sense of similarity between the two data sets.
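This comparability is the key point, and it can be made concrete with a tiny sketch: the Hamming distance between two perceptual hashes measures how similar the underlying images are (the hash values below are made up for illustration, except the first, which reuses the fingerprint quoted later in the post):

```python
def hamming_distance(h1, h2):
    """Number of differing bits between two 64-bit hashes."""
    return bin(h1 ^ h2).count("1")

a = 0x8A0303769B3EC8CD        # hash of an image
b = 0x8A0303769B3EC88D        # hypothetical hash of a slightly altered copy
c = 0x1234567890ABCDEF        # hypothetical hash of an unrelated image

print(hamming_distance(a, b))  # small distance: similar images
print(hamming_distance(a, c))  # large distance: different images
```

With a cryptographic hash, by contrast, the distance between hashes of near-identical inputs is no smaller than between hashes of unrelated inputs.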

A more robust algorithm is used by pHash. (I use my own variation of the algorithm, but it's the same concept.) The pHash approach extends the average approach to the extreme, using a discrete cosine transform (DCT) to reduce the frequencies.
  1. Reduce size. Like Average Hash, pHash starts with a small image. However, the image is larger than 8x8; 32x32 is a good size. This is really done to simplify the DCT computation and not because it is needed to reduce the high frequencies.
  2. Reduce color. The image is reduced to a grayscale just to further simplify the number of computations.
  3. Compute the DCT. The DCT separates the image into a collection of frequencies and scalars. While JPEG uses an 8x8 DCT, this algorithm uses a 32x32 DCT.
  4. Reduce the DCT. This is the magic step. While the DCT is 32x32, just keep the top-left 8x8. Those represent the lowest frequencies in the picture.
  5. Compute the average value. Like the Average Hash, compute the mean DCT value (using only the 8x8 DCT low-frequency values and excluding the first term, since the DC coefficient can be significantly different from the other values and will throw off the average). Thanks to David Starkweather for the added information about pHash. He wrote: "the dct hash is based on the low 2D DCT coefficients starting at the second from lowest, leaving out the first DC term. This excludes completely flat image information (i.e. solid colors) from being included in the hash description."
  6. Further reduce the DCT. This is the magic step. Set the 64 hash bits to 0 or 1 depending on whether each of the 64 DCT values is above or below the average value. The result doesn't tell us the actual low frequencies; it just tells us the very-rough relative scale of the frequencies to the mean. The result will not vary as long as the overall structure of the image remains the same; this can survive gamma and color histogram adjustments without a problem.
  7. Construct the hash. Set the 64 bits into a 64-bit integer. The order does not matter, just as long as you are consistent. To see what this fingerprint looks like, simply set the values (this uses +255 and -255 based on whether the bits are 1 or 0) and convert from the 32x32 DCT (with zeros for the high frequencies) back into the 32x32 image: 8a0303769b3ec8cd

At first glance, this might look like some random blobs... but look closer. There is a dark ring around her head and the dark horizontal line in the background (right side of the picture) appears as a dark spot.
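The steps above can be sketched in a few lines of Python. This is a sketch, not the actual pHash code: it assumes steps 1 and 2 have already been done (the image arrives as a 32x32 grayscale NumPy array), and the function names are mine:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] /= np.sqrt(2)
    return m * np.sqrt(2.0 / n)

def phash64(gray32):
    """64-bit pHash-style fingerprint of a 32x32 grayscale array."""
    d = dct_matrix(32)
    coeffs = d @ gray32 @ d.T        # step 3: 2D DCT of the image
    low = coeffs[:8, :8].flatten()   # step 4: keep the 8x8 low frequencies
    avg = low[1:].mean()             # step 5: mean, excluding the DC term
    bits = low > avg                 # step 6: one bit per coefficient
    h = 0
    for b in bits:                   # step 7: pack bits into a 64-bit integer
        h = (h << 1) | int(b)
    return h
```

Any consistent bit order works in the last step, as the quoted text notes; only comparisons between hashes matter.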

The differences with traditional Compressed Sensing are in fact minimal, and the algorithm is in line with some of the work undertaken in manifold signal processing, LSH, random projections, etc. The differences are:
  • The proposed hash is deterministic. In Compressed Sensing, you generally want to rely on random projections in order to get some sort of universality and, by the same token, some sort of robustness. The random nature of the projection is preferred but not a necessary condition, as there are some results for deterministic compressive sensing problems.
  • Steps 3 and 4 are the most fascinating steps because they clearly mirror one of the approaches used in manifold signal processing for images. In short, because the image manifold is non-differentiable, you really want to defocus the pictures so that similar images become neighbors on the now-smooth image manifold (the pHash implementation just uses the low-frequency components of the image). The defocusing idea in compressive sensing was, for instance, featured in Mike Wakin's PhD thesis and attendant papers.
  • For one image, the hash consists of 64 measurements (the 64 bits of the hash result), so to a certain extent this is achieving some sort of compression. But this has to be weighed against the fact that the algorithm is already lossy: the initial size and color of the picture are lost forever after steps 1 and 2. This information loss brings us to the next point.
  • Step 6 also loses some information; however, it is very similar to some CS "lossy" schemes such as the 1-bit compressed sensing approach (there you retain only the sign of the measurement). Let us also mention the use of a 1-bit scheme in a previous entry on seismic noise this weekend.
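The 1-bit idea in that last bullet is easy to demonstrate numerically: keep only the signs of a few random Gaussian measurements, and the bit patterns of a signal and a slightly perturbed copy agree on far more coordinates than those of two unrelated signals (a toy sketch; the sizes and noise level are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 1024, 64                       # signal length, number of 1-bit measurements

A = rng.standard_normal((m, n))       # random Gaussian measurement matrix
x = rng.standard_normal(n)            # some signal
x_close = x + 0.05 * rng.standard_normal(n)   # a small perturbation of it
x_other = rng.standard_normal(n)              # an unrelated signal

def one_bit(v):
    return np.sign(A @ v)             # 1-bit measurements: keep only the sign

agree_close = np.mean(one_bit(x) == one_bit(x_close))
agree_other = np.mean(one_bit(x) == one_bit(x_other))
print(agree_close, agree_other)       # similar signals agree on far more bits
```

This is exactly the property a perceptual hash needs: small changes in the input flip few bits.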

All in all, I am not overly surprised that this algorithm works, and it looks very easy to implement. Compressive sensing could probably provide faster and more robust ways to encode images. Why, you ask? Because:

  • the random mixing of DCT components makes the measurement pretty robust
  • the redundancy brought by the use of all colors makes the hash robust to comparisons between similar scenes with different colors (do we care about a hair color feature?)

The other fascinating aspect of the pHash algorithm is that it probably works because the underlying image is sparse in the DCT basis. However, nothing in the algorithm hints at that; it just works. I have not tried it, but this is how a CSHash would work:
  • Step 1: Same as pHash step 1
  • Step 2: Keep the colors
  • Step 3: Blur the image (with a Gaussian kernel, for instance) and remove the average value from the image for each color
  • Step 4: Turn your three-color image into one large vector (size is 32x32x3 = 3072), and multiply that vector by a 16 x 3072 rectangular random Gaussian matrix (the same matrix needs to be used for all the photos you want to analyze/compare).
  • Step 5: Take the most significant bit (the sign) of each element of the new vector (of size 16).
  • Step 6: Same as pHash step 7.
One advantage of this algorithm is that the hash is specific to your implementation, since you choose a random Gaussian matrix in the algorithm. In other words, you have your own encryption-like scheme, and your database of hashes cannot be guessed easily. It might even happen that a specific random matrix does a better job on the specific types of images in your collection. Finally, since the encoding is pretty universal, it can be directly applied to movies as well.
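A sketch of this CSHash in Python, with some assumptions of mine: the input is a 32x32 RGB array with values in [0, 1] (step 1 already done), the blur is a crude box blur with wrap-around edges rather than a Gaussian kernel, and step 5's bit is read as the sign of each measurement:

```python
import numpy as np

def cs_hash(img, A, blur_passes=2):
    """16-bit CS-style hash of a 32x32 RGB array; A is a 16 x 3072 matrix."""
    img = img.astype(float)
    for _ in range(blur_passes):              # step 3a: crude box blur
        img = (img
               + np.roll(img, 1, axis=0) + np.roll(img, -1, axis=0)
               + np.roll(img, 1, axis=1) + np.roll(img, -1, axis=1)) / 5.0
    img = img - img.mean(axis=(0, 1))         # step 3b: remove each color's mean
    v = img.reshape(-1)                       # step 4a: one 3072-long vector
    y = A @ v                                 # step 4b: 16 random measurements
    bits = y > 0                              # step 5: keep only the sign
    h = 0
    for b in bits:                            # step 6: pack into a 16-bit hash
        h = (h << 1) | int(b)
    return h

# The same random matrix must be reused for every image you compare.
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 32 * 32 * 3))
```

Hashes are then compared by Hamming distance. Note that because the blur, mean removal, and projection are all linear and only signs are kept, a uniform brightness rescaling leaves the hash unchanged.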


Eric Tramel said...


This is a pretty interesting application, thanks for putting it up :)

I kind of want to just go ahead and slap together your suggestion and see what happens, maybe tonight...

Igor said...

Cool, I look forward to hearing about it.