Need a super-fast filestream hashing algorithm

  • Alex Franke
    Veteran Member
    • Feb 2007
    • 2641
    • Chapel Hill, NC
    • Ryobi BT3100

    Need a super-fast filestream hashing algorithm

    Okay -- I promise I'll get back to the woodworking posts soon!

    It's been a long time since I took algorithms, and I'm not really up to date on the latest technologies. I need a very fast filestream hashing algorithm. It doesn't have to be cryptographic, but I need to be able to calculate it very quickly without knowing the length of the file before I start. And it ideally needs to have zero collisions over only 500 or so files ranging in size from 1 byte to over 4GB.

    Speed is the most important consideration, no collisions is secondary, and security isn't even on the list.

    Any ideas?
    online at http://www.theFrankes.com
    while ( !( succeed = try() ) ) ;
    "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates
  • LCHIEN
    Internet Fact Checker
    • Dec 2002
    • 21124
    • Katy, TX, USA.
    • BT3000 vintage 1999

    #2
    I've hashed filenames and variable names before but not filestreams. I thought hashing was a way to compute a specific address for looking up a textual word in a table with a single computation as opposed to a search with multiple compares. I'm not even sure what a filestream is. Maybe I've been out of software too long.

    In software you can very often trade between speed and memory...i.e. larger hash table length would result in fewer collisions to resolve.

    I found this:
    "The FileStream class inherits the Stream class to provide byte reading/writing operations to and from a source file. The FileStream class provides synchronous reading/writing methods as well asynchronous reading/writing methods. This class also provides a seeking functionality; I'll explain more about that later. You can construct a FileStream object in many ways, so let's look at some of them."

    So it appears to be a C# class for reading data from and writing data to a source file.
    Last edited by LCHIEN; 05-26-2009, 03:44 PM.
    Loring in Katy, TX USA
    If your only tool is a hammer, you tend to treat all problems as if they were nails.
    BT3 FAQ - https://www.sawdustzone.org/forum/di...sked-questions

    • Alex Franke
      Veteran Member
      • Feb 2007
      • 2641
      • Chapel Hill, NC
      • Ryobi BT3100

      #3
      Originally posted by LCHIEN
      I've hashed filenames and variable names before but not filestreams. I thought hashing was a way to compute a specific address for looking up a textual word in a table with a single computation as opposed to a search with multiple compares. I'm not even sure what a filestream is.
      Yeah, you can use hashes for that, too. It's basically saying something like: string x translates to the much shorter string x'. String y translates to the much shorter string y'. If x' changes, it means that x must have changed, too. It's possible but statistically very unlikely that y' = x'.

      So you're right -- they're great for lookup tables.

      A file stream is basically like opening up a fire hose to read the raw file data in (or write it out). In this case, I want to calculate the hash as I'm reading it in so I don't have to hold it all in memory (because there are some big files out there).
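      Just to make the "hash it as I read it" part concrete, here's a rough C# sketch of what I have in mind (untested; the buffer size is arbitrary and MD5 is just a stand-in for whatever algorithm I end up using):

      using System.IO;
      using System.Security.Cryptography;

      static class StreamHasher
      {
          // Feed the file through the hash one buffer at a time, so the
          // whole file never has to sit in memory.
          public static byte[] HashFile(string path)
          {
              using (MD5 md5 = MD5.Create())
              using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
              {
                  byte[] buffer = new byte[64 * 1024];   // 64 KB chunks; size is arbitrary
                  int read;
                  while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                      md5.TransformBlock(buffer, 0, read, null, 0);
                  md5.TransformFinalBlock(buffer, 0, 0); // finish the digest
                  return md5.Hash;                       // 16 bytes
              }
          }
      }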
      online at http://www.theFrankes.com
      while ( !( succeed = try() ) ) ;
      "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

      • cgallery
        Veteran Member
        • Sep 2004
        • 4503
        • Milwaukee, WI
        • BT3K

        #4
        How large is your hash?

        As the files become progressively larger, the hash needs to grow to accommodate the seemingly random distribution of the data.

        4GB is an awful lot of data to hash.

        • jkristia
          Established Member
          • Jan 2006
          • 114
          • Simi Valley, CA

          #5
          This is .NET's GetHashCode for a string (using Reflector)

          public override unsafe int GetHashCode()
          {
              fixed (char* str = ((char*) this))
              {
                  char* chPtr = str;
                  int num = 0x15051505;
                  int num2 = num;
                  int* numPtr = (int*) chPtr;
                  for (int i = this.Length; i > 0; i -= 4)
                  {
                      num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                      if (i <= 2)
                      {
                          break;
                      }
                      num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
                      numPtr += 2;
                  }
                  return (num + (num2 * 0x5d588b65));
              }
          }

          You probably could create a long hash value for every 8 bytes received and then keep exclusive OR'ing the values.

          Not sure if that is what you were looking for.
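          Something along these lines, maybe (rough sketch only -- the seed, the rotation and the buffer size are guesses, and I haven't checked the collision behavior):

          using System;
          using System.IO;

          static class XorFolder
          {
              // Fold the stream into a 64-bit value, 8 bytes at a time,
              // exclusive OR'ing each chunk into the running hash.
              public static ulong Hash(Stream stream)
              {
                  ulong hash = 0x1505150515051505UL;     // arbitrary seed
                  byte[] buffer = new byte[8192];        // multiple of 8
                  int count;
                  while ((count = FillBuffer(stream, buffer)) > 0)
                  {
                      int i = 0;
                      for (; i + 8 <= count; i += 8)
                      {
                          ulong chunk = BitConverter.ToUInt64(buffer, i);
                          // rotate before the XOR so reordered chunks don't simply cancel out
                          hash = ((hash << 5) | (hash >> 59)) ^ chunk;
                      }
                      for (; i < count; i++)             // leftover bytes in the final buffer
                          hash = ((hash << 5) | (hash >> 59)) ^ buffer[i];
                  }
                  return hash;
              }

              // Fill the buffer completely (or to end of stream) so that partial reads
              // don't shift where the 8-byte boundaries fall.
              private static int FillBuffer(Stream s, byte[] buf)
              {
                  int total = 0, read;
                  while (total < buf.Length && (read = s.Read(buf, total, buf.Length - total)) > 0)
                      total += read;
                  return total;
              }
          }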

          • Alex Franke
            Veteran Member
            • Feb 2007
            • 2641
            • Chapel Hill, NC
            • Ryobi BT3100

            #6
            Originally posted by cgallery
            How large is your hash?

            As the files become progressively larger, the hash needs to grow to accommodate the seemingly random distribution of the data.

            4GB is an awful lot of data to hash.
            I'm hoping for 32-64 bits, actually. I only need no collisions (if possible) among 500 or so files at a time, and usually far fewer. I'm comparing files of the same size, and I want to be able to say that two files with the same hash are identical with, say, 99.9% confidence.

            Kids are in bed now, so I'm back at it again.
            online at http://www.theFrankes.com
            while ( !( succeed = try() ) ) ;
            "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

            • Alex Franke
              Veteran Member
              • Feb 2007
              • 2641
              • Chapel Hill, NC
              • Ryobi BT3100

              #7
              Originally posted by jkristia
              This is .NETs GetHashCode for a string (using reflector)
              [snip]
              You probably could create a long hash value for every 8 bytes received and then keep exclusive OR'ing the values.
              Hey! Good thinking!!!

              This might be perfect. Let me see how fast I can make it at hashing a file...
              online at http://www.theFrankes.com
              while ( !( succeed = try() ) ) ;
              "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

              • LCHIEN
                Internet Fact Checker
                • Dec 2002
                • 21124
                • Katy, TX, USA.
                • BT3000 vintage 1999

                #8
                Oh, I see, you're hashing the entire contents of each file to get a hopefully unique ID for each.

                Typically we used to do a truncated checksum (which is basically an exclusive OR) or a CRC code, which is a polynomial with feedback, to check files for damage during transmission.

                These would get you a value of 8, 16, or 32 bits with a more or less random distribution within the range, so unless you had some diabolical files (all zeros, for example) the chance of getting a duplicate is 1/2^nbits.

                Checksums and CRC codes are very quick to compute.
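                For example, a table-driven CRC-32 (the usual 0xEDB88320 polynomial) is just one table lookup, one shift and one XOR per byte. This is only the textbook version, not tuned or tested for speed:

                static class Crc32
                {
                    private static readonly uint[] Table = BuildTable();

                    private static uint[] BuildTable()
                    {
                        uint[] table = new uint[256];
                        for (uint i = 0; i < 256; i++)
                        {
                            uint c = i;
                            for (int k = 0; k < 8; k++)
                                c = ((c & 1) != 0) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
                            table[i] = c;
                        }
                        return table;
                    }

                    // Update a running CRC (start it at 0) with another buffer of data.
                    public static uint Update(uint crc, byte[] buffer, int count)
                    {
                        crc = ~crc;
                        for (int i = 0; i < count; i++)
                            crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
                        return ~crc;
                    }
                }

                Calling Update on each buffer as it comes off the stream gives the same result as running it once over the whole file.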
                Loring in Katy, TX USA
                If your only tool is a hammer, you tend to treat all problems as if they were nails.
                BT3 FAQ - https://www.sawdustzone.org/forum/di...sked-questions

                • Alex Franke
                  Veteran Member
                  • Feb 2007
                  • 2641
                  • Chapel Hill, NC
                  • Ryobi BT3100

                  #9
                  Originally posted by LCHIEN
                  Oh, I see, you're hashing the entire contents of each file to get a hopefully unique ID for each.
                  [snip]
                  Checksums and CRC codes are very quick to compute.
                  Yeah, that's it. 2^32 is a big number. But this sounds a lot like the "birthday problem" from statistics class -- you know, that there's about a 70% chance that two people in a class of 30 share the same birthday. I don't remember stats very well, but I wonder what the probability would be of two of 500 files sharing the same checksum...

                  I also found this one: http://en.wikipedia.org/wiki/Rabin_fingerprint -- while it looks pretty darn promising for uniquely identifying any file, I don't think it'll be as quick to compute...

                  Y'all are being a great help here! Thanks so much!!!
                  online at http://www.theFrankes.com
                  while ( !( succeed = try() ) ) ;
                  "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

                  • crokett
                    The Full Monte
                    • Jan 2003
                    • 10627
                    • Mebane, NC, USA.
                    • Ryobi BT3000

                    #10
                    Jeebers! And I thought what I did for a living confused the heck out of other people!
                    David

                    The chief cause of failure in this life is giving up what you want most for what you want at the moment.

                    • jackellis
                      Veteran Member
                      • Nov 2003
                      • 2638
                      • Tahoe City, CA, USA.
                      • BT3100

                      #11
                      Lots of English words but it reads like Geek to me.

                      Note to self. Don't expect to understand Geek posts on WW forum after drinking (1) beer.

                      • Alex Franke
                        Veteran Member
                        • Feb 2007
                        • 2641
                        • Chapel Hill, NC
                        • Ryobi BT3100

                        #12
                        Originally posted by Alex Franke
                        I don't remember stats very well, but I wonder what the probability would be of two of 500 files sharing the same checksum...
                        (Yes, I know I'm quoting myself here! )

                        I think I found the answer to this one here -- and without any math: http://en.wikipedia.org/wiki/Birthday_attack - in the chart about halfway down. If I'm reading it right, if I want a 99.9% success rate and I have a 32-bit hash, then I need to be comparing fewer than 2,900 files against each other.

                        That sounds pretty close...
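                        For anyone who wants the math behind that chart: the usual approximation is p ≈ 1 - e^(-n(n-1) / (2 × 2^32)). Plugging in n = 500 gives a collision probability of roughly 0.003% (about 1 in 34,000), and solving for p = 0.1% gives n ≈ 2,930, which lines up with the ~2,900 figure above.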
                        online at http://www.theFrankes.com
                        while ( !( succeed = try() ) ) ;
                        "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

                        • cgallery
                          Veteran Member
                          • Sep 2004
                          • 4503
                          • Milwaukee, WI
                          • BT3K

                          #13
                          One of the things I used to do when I had variable file sizes was to use the file size itself as my primary test. I only computed checksums once subsequent files were discovered with the same file size as a file in my database. You don't even need to go back and compute the checksum on the first file in the set; you can find it by the process of elimination.
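                          Roughly like this, as a sketch (hashFile stands in for whatever checksum routine you end up using, and this version just hashes everything in a size-collision group rather than using the process-of-elimination trick):

                          using System;
                          using System.Collections.Generic;
                          using System.IO;

                          static class DuplicateFinder
                          {
                              // Group files by size first; only hash groups where at least two
                              // files share a size, since a file with a unique size can't be a duplicate.
                              public static List<List<string>> FindDuplicateSets(
                                  IEnumerable<string> paths, Func<string, string> hashFile)
                              {
                                  Dictionary<long, List<string>> bySize = new Dictionary<long, List<string>>();
                                  foreach (string path in paths)
                                  {
                                      long size = new FileInfo(path).Length;
                                      List<string> sameSize;
                                      if (!bySize.TryGetValue(size, out sameSize))
                                          bySize[size] = sameSize = new List<string>();
                                      sameSize.Add(path);
                                  }

                                  List<List<string>> result = new List<List<string>>();
                                  foreach (List<string> group in bySize.Values)
                                  {
                                      if (group.Count < 2) continue;   // unique size, no hash needed

                                      Dictionary<string, List<string>> byHash = new Dictionary<string, List<string>>();
                                      foreach (string path in group)
                                      {
                                          string key = hashFile(path);
                                          List<string> sameHash;
                                          if (!byHash.TryGetValue(key, out sameHash))
                                              byHash[key] = sameHash = new List<string>();
                                          sameHash.Add(path);
                                      }
                                      foreach (List<string> bucket in byHash.Values)
                                          if (bucket.Count > 1)
                                              result.Add(bucket);
                                  }
                                  return result;
                              }
                          }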

                          • Alex Franke
                            Veteran Member
                            • Feb 2007
                            • 2641
                            • Chapel Hill, NC
                            • Ryobi BT3100

                            #14
                            Originally posted by cgallery
                            One of the things I used to do when I had variable file sizes was to use the file size itself as my primary test. I only computed checksums once subsequent files were discovered with the same file size as a file in my database. You don't even need to go back and compute the checksum on the first file in the set; you can find it by the process of elimination.
                            Yeah -- that's actually why there are only up to 500 files (usually fewer) in each batch. First they're hashed by file size, and then by whatever I end up using (a checksum right now).

                            I think what I also might try is to grab just as many bytes as I need in order to rule out a match. So to test a 5MB file against 5 other 5MB files, I'd just grab a few hundred bytes or so (hopefully skipping over any consistent file header information for files of the same type), and only keep grabbing if they keep matching. That way the only ones completely hashed will be the matches.

                            Currently I'm doing a full hash for each file in the group until I find a match, and then I stop. That's just killing me on the large files.
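                            The pairwise version of what I'm picturing looks something like this (rough sketch, block size is arbitrary): compare the candidates a block at a time and bail out at the first difference, so only the real matches get read end to end.

                            using System.IO;

                            static class FileComparer
                            {
                                // Compare two same-size files a block at a time; stop at the
                                // first difference so only true duplicates are read completely.
                                public static bool ContentsEqual(string pathA, string pathB)
                                {
                                    const int BlockSize = 64 * 1024;
                                    byte[] bufA = new byte[BlockSize];
                                    byte[] bufB = new byte[BlockSize];

                                    using (FileStream a = new FileStream(pathA, FileMode.Open, FileAccess.Read))
                                    using (FileStream b = new FileStream(pathB, FileMode.Open, FileAccess.Read))
                                    {
                                        while (true)
                                        {
                                            int readA = Fill(a, bufA);
                                            int readB = Fill(b, bufB);
                                            if (readA != readB) return false;        // sizes differed after all
                                            if (readA == 0) return true;             // both ended: identical
                                            for (int i = 0; i < readA; i++)
                                                if (bufA[i] != bufB[i]) return false; // bail out early
                                        }
                                    }
                                }

                                // Fill the buffer completely (or to end of stream).
                                private static int Fill(Stream s, byte[] buf)
                                {
                                    int total = 0, read;
                                    while (total < buf.Length && (read = s.Read(buf, total, buf.Length - total)) > 0)
                                        total += read;
                                    return total;
                                }
                            }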
                            online at http://www.theFrankes.com
                            while ( !( succeed = try() ) ) ;
                            "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates

                            • Alex Franke
                              Veteran Member
                              • Feb 2007
                              • 2641
                              • Chapel Hill, NC
                              • Ryobi BT3100

                              #15
                              (just in case some of you are interested...)

                              Well, as it turns out, Microsoft's built-in MD5 algorithm is pretty darn fast -- faster than several managed checksum implementations that I tried, including Adler-32, which looked to be pretty fast itself.

                              I actually hit an unfortunate bottleneck where I obtain the file size for the "outer" file-size hash -- in several tests, calculating MD5s was even quicker than just checking the file size.

                              I think I'm going to try a partial hash approach to rule out most collisions. For 55,000 files (about 10GB) I hashed the first 8K of each file in about 7 seconds on one thread. Adding in a test for the file size, that went up to 9 seconds. I think I'll try a WinAPI call there and see if that helps... The first 1K hashed about a second quicker.
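                              In case anyone wants to try the same thing, the quick key I'm testing looks roughly like this (simplified sketch; the 8K prefix is arbitrary, and a single Read is good enough here):

                              using System;
                              using System.IO;
                              using System.Security.Cryptography;

                              static class QuickKey
                              {
                                  // Hash only the first 8 KB of the file and combine it with the
                                  // file length, as a cheap first-pass key before any full hash.
                                  public static string Compute(string path)
                                  {
                                      using (MD5 md5 = MD5.Create())
                                      using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
                                      {
                                          byte[] buffer = new byte[8 * 1024];
                                          int read = fs.Read(buffer, 0, buffer.Length);
                                          byte[] digest = md5.ComputeHash(buffer, 0, read);
                                          return fs.Length + ":" + BitConverter.ToString(digest).Replace("-", "");
                                      }
                                  }
                              }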

                              Next I need to test the speed of the built-in hashtables with the 32-byte keys and see if I really do get any benefit from nesting them by filesize...
                              online at http://www.theFrankes.com
                              while ( !( succeed = try() ) ) ;
                              "Life is short, Art long, Occasion sudden and dangerous, Experience deceitful, and Judgment difficult." -Hippocrates
