Need a super-fast filestream hashing algorithm

  • Alex Franke
    Veteran Member
    • Feb 2007
    • 2641
    • Chapel Hill, NC
    • Ryobi BT3100

    Need a super-fast filestream hashing algorithm

    Okay -- I promise I'll get back to the woodworking posts soon!

    It's been a long time since I took algorithms, and I'm not really up to date on the latest technologies. I need a very fast filestream hashing algorithm. It doesn't have to be cryptographic, but I need to be able to calculate it very quickly without knowing the length of the file before I start. And it ideally needs to have zero collisions over only 500 or so files ranging in size from 1 byte to over 4GB.

    Speed is the most important consideration, no collisions is secondary, and security isn't even on the list.

    Any ideas?
    online at http://www.theFrankes.com
    while ( !( succeed = try() ) ) ;
    "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates
  • LCHIEN
    Internet Fact Checker
    • Dec 2002
    • 21124
    • Katy, TX, USA.
    • BT3000 vintage 1999

    #2
    I've hashed filenames and variable names before but not filestreams. I thought hashing was a way to compute a specific address for looking up a textual word in a table with a single computation as opposed to a search with multiple compares. I'm not even sure what a filestream is. Maybe I've been out of software too long.

    In software you can very often trade between speed and memory...i.e. larger hash table length would result in fewer collisions to resolve.

    I found this:
    "The FileStream class inherits the Stream class to provide byte reading/writing operations to and from a source file. The FileStream class provides synchronous reading/writing methods as well asynchronous reading/writing methods. This class also provides a seeking functionality; I'll explain more about that later. You can construct a FileStream object in many ways, so let's look at some of them."

    So it appears to be a C# class for reading data from and writing data to a source file.
    Last edited by LCHIEN; 05-26-2009, 03:44 PM.
    Loring in Katy, TX USA
    If your only tool is a hammer, you tend to treat all problems as if they were nails.
    BT3 FAQ - https://www.sawdustzone.org/forum/di...sked-questions

    • Alex Franke
      Veteran Member
      • Feb 2007
      • 2641
      • Chapel Hill, NC
      • Ryobi BT3100

      #3
      Originally posted by LCHIEN
      I've hashed filenames and variable names before but not filestreams. I thought hashing was a way to compute a specific address for looking up a textual word in a table with a single computation as opposed to a search with multiple compares. I'm not even sure what a filestream is.
      Yeah, you can use hashes for that, too. It's basically saying something like: string x translates to the much shorter string x'. String y translates to the much shorter string y'. If x' changes, it means that x must have changed, too. It's possible but statistically very unlikely that y' = x'.

      So you're right -- they're great for lookup tables.

      A file stream is basically like opening up a fire hose to read the raw file data in (or write it out). In this case, I want to calculate the hash as I'm reading it in so I don't have to hold it all in memory (because there are some big files out there).
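      Just to make the "hash it as I read it" part concrete, here's a rough C# sketch of what I have in mind (untested; the buffer size is arbitrary and MD5 is just a stand-in for whatever algorithm I end up using):

      using System.IO;
      using System.Security.Cryptography;

      static class StreamHasher
      {
          // Feed the file through the hash one buffer at a time, so the
          // whole file never has to sit in memory.
          public static byte[] HashFile(string path)
          {
              using (MD5 md5 = MD5.Create())
              using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
              {
                  byte[] buffer = new byte[64 * 1024];   // 64 KB chunks; size is arbitrary
                  int read;
                  while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                      md5.TransformBlock(buffer, 0, read, null, 0);
                  md5.TransformFinalBlock(buffer, 0, 0); // finish the digest
                  return md5.Hash;                       // 16 bytes
              }
          }
      }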
      online at http://www.theFrankes.com
      while ( !( succeed = try() ) ) ;
      "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

      • cgallery
        Veteran Member
        • Sep 2004
        • 4503
        • Milwaukee, WI
        • BT3K

        #4
        How large is your hash?

        As the files become progressively larger, the hash needs to grow to accommodate the seemingly random distribution of the data.

        4GB is an awful lot of data to hash.

        • jkristia
          Established Member
          • Jan 2006
          • 114
          • Simi Valley, CA

          #5
          This is .NET's GetHashCode for a string (using Reflector)

          public override unsafe int GetHashCode()
          {
              fixed (char* str = ((char*) this))
              {
                  char* chPtr = str;
                  int num = 0x15051505;
                  int num2 = num;
                  int* numPtr = (int*) chPtr;
                  for (int i = this.Length; i > 0; i -= 4)
                  {
                      num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                      if (i <= 2)
                      {
                          break;
                      }
                      num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
                      numPtr += 2;
                  }
                  return (num + (num2 * 0x5d588b65));
              }
          }

          You probably could create a long hash value for every 8 bytes received and then keep exclusive OR'ing the values.

          Not sure if that is what you were looking for.
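          Something along these lines, maybe (rough sketch only -- the seed, the rotation and the buffer size are guesses, and I haven't checked the collision behavior):

          using System;
          using System.IO;

          static class XorFolder
          {
              // Fold the stream into a 64-bit value, 8 bytes at a time,
              // exclusive OR'ing each chunk into the running hash.
              public static ulong Hash(Stream stream)
              {
                  ulong hash = 0x1505150515051505UL;     // arbitrary seed
                  byte[] buffer = new byte[8192];        // multiple of 8
                  int count;
                  while ((count = FillBuffer(stream, buffer)) > 0)
                  {
                      int i = 0;
                      for (; i + 8 <= count; i += 8)
                      {
                          ulong chunk = BitConverter.ToUInt64(buffer, i);
                          // rotate before the XOR so reordered chunks don't simply cancel out
                          hash = ((hash << 5) | (hash >> 59)) ^ chunk;
                      }
                      for (; i < count; i++)             // leftover bytes in the final buffer
                          hash = ((hash << 5) | (hash >> 59)) ^ buffer[i];
                  }
                  return hash;
              }

              // Fill the buffer completely (or to end of stream) so that partial reads
              // don't shift where the 8-byte boundaries fall.
              private static int FillBuffer(Stream s, byte[] buf)
              {
                  int total = 0, read;
                  while (total < buf.Length && (read = s.Read(buf, total, buf.Length - total)) > 0)
                      total += read;
                  return total;
              }
          }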

          • Alex Franke
            Veteran Member
            • Feb 2007
            • 2641
            • Chapel Hill, NC
            • Ryobi BT3100

            #6
            Originally posted by cgallery
            How large is your hash?

            As the files become progressively larger, the hash needs to grow to accommodate the seemingly random distribution of the data.

            4GB is an awful lot of data to hash.
            I'm hoping for 32-64 bits, actually. I only need no collisions (if possible) among 500 or so files at a time, and usually far fewer. I'm comparing files of the same size, and I want to be able to say that two files with the same hash are identical with, say, 99.9% confidence.

            Kids are in bed now, so I'm back at it again.
            online at http://www.theFrankes.com
            while ( !( succeed = try() ) ) ;
            "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

            • Alex Franke
              Veteran Member
              • Feb 2007
              • 2641
              • Chapel Hill, NC
              • Ryobi BT3100

              #7
              Originally posted by jkristia
              This is .NETs GetHashCode for a string (using reflector)
              [snip]
              You probably could create a long hash value for every 8 bytes received and then keep exclusive OR'ing the values.
              Hey! Good thinking!!!

              This might be perfect. Let me see how fast I can make it at hashing a file...
              online at http://www.theFrankes.com
              while ( !( succeed = try() ) ) ;
              "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

              • LCHIEN
                Internet Fact Checker
                • Dec 2002
                • 21124
                • Katy, TX, USA.
                • BT3000 vintage 1999

                #8
                Oh, I see, you're hashing the entire contents of each file to get a hopefully unique ID for each.

                Typically we used to do a truncated checksum (which is basically an exclusive OR) or a CRC code, which is a polynomial with feedback, to check files for damage during transmission.

                These would get you a value of 8, 16, or 32 bits with a more or less random distribution within the range, so unless you had some diabolical files (all zeros, for example) the chance of getting a duplicate is 1/2^nbits.

                Checksums and CRC codes are very quick to compute.
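                For example, a table-driven CRC-32 (the usual 0xEDB88320 polynomial) is just one table lookup, one shift and one XOR per byte. This is only the textbook version, not tuned or tested for speed:

                static class Crc32
                {
                    private static readonly uint[] Table = BuildTable();

                    private static uint[] BuildTable()
                    {
                        uint[] table = new uint[256];
                        for (uint i = 0; i < 256; i++)
                        {
                            uint c = i;
                            for (int k = 0; k < 8; k++)
                                c = ((c & 1) != 0) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
                            table[i] = c;
                        }
                        return table;
                    }

                    // Update a running CRC (start it at 0) with another buffer of data.
                    public static uint Update(uint crc, byte[] buffer, int count)
                    {
                        crc = ~crc;
                        for (int i = 0; i < count; i++)
                            crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
                        return ~crc;
                    }
                }

                Calling Update on each buffer as it comes off the stream gives the same result as running it once over the whole file.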
                Loring in Katy, TX USA
                If your only tool is a hammer, you tend to treat all problems as if they were nails.
                BT3 FAQ - https://www.sawdustzone.org/forum/di...sked-questions

                • Alex Franke
                  Veteran Member
                  • Feb 2007
                  • 2641
                  • Chapel Hill, NC
                  • Ryobi BT3100

                  #9
                  Originally posted by LCHIEN
                  Oh, I see, you're hashing the entire contents of each file to get a hopefully unique ID for each.
                  [snip]
                  Checksums and CRC codes are very quick to compute.
                  Yeah, that's it. 2^32 is a big number. But this sounds a lot like the "birthday problem" from statistics class -- you know, that there's about a 70% chance that two people in a class of 30 share the same birthday. I don't remember stats very well, but I wonder what the probability would be of two of 500 files sharing the same checksum...

                  I also found this one: http://en.wikipedia.org/wiki/Rabin_fingerprint -- while it looks pretty darn promising for uniquely identifying any file, I don't think it'll be as quick to compute...

                  Y'all are being a great help here! Thanks so much!!!
                  online at http://www.theFrankes.com
                  while ( !( succeed = try() ) ) ;
                  "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

                  • crokett
                    The Full Monte
                    • Jan 2003
                    • 10627
                    • Mebane, NC, USA.
                    • Ryobi BT3000

                    #10
                    Jeebers! And I thought what I did for a living confused the heck out of other people!
                    David

                    The chief cause of failure in this life is giving up what you want most for what you want at the moment.

                    • jackellis
                      Veteran Member
                      • Nov 2003
                      • 2638
                      • Tahoe City, CA, USA.
                      • BT3100

                      #11
                      Lots of English words but it reads like Geek to me.

                      Note to self. Don't expect to understand Geek posts on WW forum after drinking (1) beer.

                      • Alex Franke
                        Veteran Member
                        • Feb 2007
                        • 2641
                        • Chapel Hill, NC
                        • Ryobi BT3100

                        #12
                        Originally posted by Alex Franke
                        I don't remember stats very well, but I wonder what the probability would be of two of 500 files sharing the same checksum...
                        (Yes, I know I'm quoting myself here! )

                        I think I found the answer to this one here -- and without any math: http://en.wikipedia.org/wiki/Birthday_attack - in the chart about halfway down. If I'm reading it right, if I want a 99.9% success rate and I have a 32-bit hash, then I need to be comparing fewer than 2,900 files against each other.

                        That sounds pretty close...
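                        For anyone who wants the math behind that chart: the usual approximation is p ≈ 1 - e^(-n(n-1) / (2 × 2^32)). Plugging in n = 500 gives a collision probability of roughly 0.003% (about 1 in 34,000), and solving for p = 0.1% gives n ≈ 2,930, which lines up with the ~2,900 figure above.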
                        online at http://www.theFrankes.com
                        while ( !( succeed = try() ) ) ;
                        "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

                        • cgallery
                          Veteran Member
                          • Sep 2004
                          • 4503
                          • Milwaukee, WI
                          • BT3K

                          #13
                          One of the things I used to do when I had variable file sizes was to use the file size itself as my primary test. I only computed checksums once subsequent files were discovered with the same file size as a file in my database. You don't even need to go back and compute the checksum on the first file in the set; you can find it by the process of elimination.
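                          Roughly like this, as a sketch (hashFile stands in for whatever checksum routine you end up using, and this version just hashes everything in a size-collision group rather than using the process-of-elimination trick):

                          using System;
                          using System.Collections.Generic;
                          using System.IO;

                          static class DuplicateFinder
                          {
                              // Group files by size first; only hash groups where at least two
                              // files share a size, since a file with a unique size can't be a duplicate.
                              public static List<List<string>> FindDuplicateSets(
                                  IEnumerable<string> paths, Func<string, string> hashFile)
                              {
                                  Dictionary<long, List<string>> bySize = new Dictionary<long, List<string>>();
                                  foreach (string path in paths)
                                  {
                                      long size = new FileInfo(path).Length;
                                      List<string> sameSize;
                                      if (!bySize.TryGetValue(size, out sameSize))
                                          bySize[size] = sameSize = new List<string>();
                                      sameSize.Add(path);
                                  }

                                  List<List<string>> result = new List<List<string>>();
                                  foreach (List<string> group in bySize.Values)
                                  {
                                      if (group.Count < 2) continue;   // unique size, no hash needed

                                      Dictionary<string, List<string>> byHash = new Dictionary<string, List<string>>();
                                      foreach (string path in group)
                                      {
                                          string key = hashFile(path);
                                          List<string> sameHash;
                                          if (!byHash.TryGetValue(key, out sameHash))
                                              byHash[key] = sameHash = new List<string>();
                                          sameHash.Add(path);
                                      }
                                      foreach (List<string> bucket in byHash.Values)
                                          if (bucket.Count > 1)
                                              result.Add(bucket);
                                  }
                                  return result;
                              }
                          }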

                          • Alex Franke
                            Veteran Member
                            • Feb 2007
                            • 2641
                            • Chapel Hill, NC
                            • Ryobi BT3100

                            #14
                            Originally posted by cgallery
                            One of the things I used to do when I had variable file sizes was to use the file size itself as my primary test. I only computed checksums once subsequent files were discovered with the same file size as a file in my database. You don't even need to go back and compute the checksum on the first file in the set; you can find it by the process of elimination.
                            Yeah -- that's actually why there are only up to 500 files (usually fewer) in each batch. First they're hashed by file size, and then by whatever I end up using (a checksum right now).

                            I think what I also might try is to grab just as many bytes as I need in order to rule out a match. So to test a 5MB file against 5 other 5MB files, I'd just grab a few hundred bytes or so (hopefully skipping over any consistent file header information for files of the same type), and only keep grabbing if they keep matching. That way the only ones completely hashed will be the matches.

                            Currently I'm doing a full hash for each file in the group until I find a match, and then I stop. That's just killing me on the large files.
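                            The pairwise version of what I'm picturing looks something like this (rough sketch, block size is arbitrary): compare the candidates a block at a time and bail out at the first difference, so only the real matches get read end to end.

                            using System.IO;

                            static class FileComparer
                            {
                                // Compare two same-size files a block at a time; stop at the
                                // first difference so only true duplicates are read completely.
                                public static bool ContentsEqual(string pathA, string pathB)
                                {
                                    const int BlockSize = 64 * 1024;
                                    byte[] bufA = new byte[BlockSize];
                                    byte[] bufB = new byte[BlockSize];

                                    using (FileStream a = new FileStream(pathA, FileMode.Open, FileAccess.Read))
                                    using (FileStream b = new FileStream(pathB, FileMode.Open, FileAccess.Read))
                                    {
                                        while (true)
                                        {
                                            int readA = Fill(a, bufA);
                                            int readB = Fill(b, bufB);
                                            if (readA != readB) return false;        // sizes differed after all
                                            if (readA == 0) return true;             // both ended: identical
                                            for (int i = 0; i < readA; i++)
                                                if (bufA[i] != bufB[i]) return false; // bail out early
                                        }
                                    }
                                }

                                // Fill the buffer completely (or to end of stream).
                                private static int Fill(Stream s, byte[] buf)
                                {
                                    int total = 0, read;
                                    while (total < buf.Length && (read = s.Read(buf, total, buf.Length - total)) > 0)
                                        total += read;
                                    return total;
                                }
                            }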
                            online at http://www.theFrankes.com
                            while ( !( succeed = try() ) ) ;
                            "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

                            • Alex Franke
                              Veteran Member
                              • Feb 2007
                              • 2641
                              • Chapel Hill, NC
                              • Ryobi BT3100

                              #15
                              (just in case some of you are interested...)

                              Well, as it turns out, Microsoft's built-in MD5 algorithm is pretty darn fast -- faster than several managed checksum implementations that I tried, including Adler-32, which looked to be pretty fast itself.

                              I actually hit an unfortunate bottleneck where I obtain the file size for the "outer" file-size hash -- in several tests, calculating MD5s was even quicker than just checking the file size.

                              I think I'm going to try a partial hash approach to rule out most collisions. For 55,000 files (about 10GB) I hashed the first 8K of each file in about 7 seconds on one thread. Adding in a test for the file size, that went up to 9 seconds. I think I'll try a WinAPI call there and see if that helps... The first 1K hashed about a second quicker.
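                              In case anyone wants to try the same thing, the quick key I'm testing looks roughly like this (simplified sketch; the 8K prefix is arbitrary, and a single Read is good enough here):

                              using System;
                              using System.IO;
                              using System.Security.Cryptography;

                              static class QuickKey
                              {
                                  // Hash only the first 8 KB of the file and combine it with the
                                  // file length, as a cheap first-pass key before any full hash.
                                  public static string Compute(string path)
                                  {
                                      using (MD5 md5 = MD5.Create())
                                      using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
                                      {
                                          byte[] buffer = new byte[8 * 1024];
                                          int read = fs.Read(buffer, 0, buffer.Length);
                                          byte[] digest = md5.ComputeHash(buffer, 0, read);
                                          return fs.Length + ":" + BitConverter.ToString(digest).Replace("-", "");
                                      }
                                  }
                              }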

                              Next I need to test the speed of the built-in hashtables with the 32-byte keys and see if I really do get any benefit from nesting them by filesize...
                              online at http://www.theFrankes.com
                              while ( !( succeed = try() ) ) ;
                              "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates
