Going through my “huge” archive of digital camera images, I noticed that I had some duplicate images. The filenames differ but the content is the same. Now I could just download a tool from the almighty internet that could compare all my images, but that would be going too far! Why not give it a go myself?
Several approaches come to mind. I could start by cataloging all the files and then compare file sizes. If the file sizes match, I would then compare the content of the files. If the content matches, the file gets marked as a duplicate.
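For illustration, the size-then-content idea could look roughly like the sketch below. This is not the approach used in the rest of this post, and the names `SizeThenContent`, `SameContent` and `FindDuplicates` are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper class; the names are made up for this example.
public static class SizeThenContent
{
    // Returns true when both files have the same length and the same bytes.
    public static bool SameContent( string pathA, string pathB )
    {
        if( new FileInfo( pathA ).Length != new FileInfo( pathB ).Length )
        {
            return false; // files of different size can never be duplicates
        }
        using( FileStream a = File.OpenRead( pathA ) )
        using( FileStream b = File.OpenRead( pathB ) )
        {
            int byteA;
            do
            {
                // ReadByte returns -1 once the end of the file is reached
                byteA = a.ReadByte();
                if( byteA != b.ReadByte() )
                {
                    return false;
                }
            } while( byteA != -1 );
        }
        return true;
    }

    // Returns (original, duplicate) pairs; only files of equal size are compared.
    public static List<KeyValuePair<string, string>> FindDuplicates( IEnumerable<string> files )
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach( var sameSize in files.GroupBy( f => new FileInfo( f ).Length ) )
        {
            var originals = new List<string>();
            foreach( string file in sameSize )
            {
                string match = originals.FirstOrDefault( o => SameContent( o, file ) );
                if( match == null )
                {
                    originals.Add( file ); // first file seen with this content
                }
                else
                {
                    result.Add( new KeyValuePair<string, string>( match, file ) );
                }
            }
        }
        return result;
    }
}
```

Grouping by size first means the byte-by-byte comparison only runs for files that could possibly be identical.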
I could also Google for an advanced image comparison algorithm and spend the next week trying to implement it.
I took the easy way out 🙂 First I cataloged all the image files (*.jpg). Then I calculated the MD5 hash of every file. Before all that, I had created an SQL Compact Edition 3.5 database with two fields: filename and MD5 hash. Only one index was created: a unique index on the MD5 hash.
So every time I tried to insert a record whose MD5 hash already existed in the database, I got an exception. In the exception handling code I renamed the file to “DUP_” + the original filename. Naturally I could have deleted it on the spot, but I wanted to make sure the file really was a dup 🙂
Anyway, code speaks louder than words:
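The whole trick is that a unique index refuses a second row with the same hash. As a rough in-memory stand-in for that behaviour (not the SQL Compact code itself), a `Dictionary` keyed on the hash does the same job. `HashCatalog`, `TryInsert` and `FirstFileWith` are hypothetical names for this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the table with a unique MD5 index.
public class HashCatalog
{
    // Maps an MD5 hex string to the first file recorded with that hash.
    // Hex strings are compared case-insensitively ("ab" equals "AB").
    private readonly Dictionary<string, string> fileByHash =
        new Dictionary<string, string>( StringComparer.OrdinalIgnoreCase );

    // Returns true if the hash was new and has been recorded,
    // false if a file with the same hash was recorded earlier.
    public bool TryInsert( string fileName, string md5Hash )
    {
        if( fileByHash.ContainsKey( md5Hash ) )
        {
            return false;
        }
        fileByHash.Add( md5Hash, fileName );
        return true;
    }

    // Returns the first file recorded for this hash, or null if there is none.
    public string FirstFileWith( string md5Hash )
    {
        string fileName;
        return fileByHash.TryGetValue( md5Hash, out fileName ) ? fileName : null;
    }
}
```

Unlike the database, this catalog is gone when the program exits, but it also makes it easy to look up which file a suspected dup collided with before deleting anything.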
[sourcecode language=”csharp”]
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Get all .jpg files, including those in subdirectories (recursively)
string[] files = Directory.GetFiles( @"path_to_imagefiles_here/",
    "*.jpg",
    SearchOption.AllDirectories );

// Create the MD5 hasher; it is disposable, so wrap it in a using block
using( MD5 md5Hasher = MD5.Create() )
{
    // loop through all files
    foreach( string file in files )
    {
        try
        {
            byte[] md5Data;
            // open the file read-only and hash it straight from the stream,
            // so the whole file never has to fit into a single byte array
            using( FileStream fs = File.OpenRead( file ) )
            {
                md5Data = md5Hasher.ComputeHash( fs );
            }
            // create the human-readable MD5 string by formatting
            // each byte of the hash as two hexadecimal digits
            StringBuilder sBuilder = new StringBuilder();
            for( int i = 0; i < md5Data.Length; i++ )
            {
                sBuilder.Append( md5Data[i].ToString( "x2" ) );
            }
            // insertRow is not shown here; a false return value is
            // treated as "this hash is already in the database"
            if( !insertRow( file, sBuilder.ToString() ) )
            {
                // rename the file to DUP_<original name> in the same directory
                FileInfo fi = new FileInfo( file );
                File.Move( file, Path.Combine( fi.DirectoryName, "DUP_" + fi.Name ) );
            }
        }
        catch( IOException ex )
        {
            // skip files that cannot be read or renamed
            Console.WriteLine( "Skipping {0}: {1}", file, ex.Message );
        }
    }
}
[/sourcecode]
Granted, this might not be the best way but it only took about 10 minutes to write… and it works for me 🙂