Detecting duplicate and near-duplicate files

This web page describes research I did for Google from 2000 through 2003, most of it in 2000.

This work resulted in US Patent 6658423, by William Pugh and Monika Henzinger, assigned to Google.

The information here does not reflect any information about Google business practices or technology, other than that described in the patent. I have no knowledge as to whether or how Google is currently applying the techniques described in the patent. This information is not approved or sanctioned by Google, other than by giving me permission to discuss the research I did for them that is described in the patent.

The patent describes techniques to find near-duplicate documents in a collection. Google is obviously considering applying these techniques to web pages, but they could be applied to other documents as well. It might even be possible to apply them to sequences that are not documents (such as DNA sequences), although that raises some questions that aren't covered here.

I'll get more information up shortly, but for now:

US Patent 6658423
A PDF version of a presentation on the technique
A screenshot of a Google search for [Aeron Chair] in 2000. Note that Google now returns much better search results, with no obvious duplicates.