This discussion is archived
5 Replies. Latest reply: Mar 11, 2013 6:12 PM by Murray9654

problem in comparing 2 files reliably?

Murray9654 Newbie
Hi, I have to check whether the contents of two files are the same. One file is already on a server, and before uploading a second file I want to know whether it already exists there. Uploading the second file just to compare the contents defeats the purpose: if the two files are equal, the upload was useless. So my approach was to compare the checksums of both files, but sometimes the checksums are the same even though the file contents differ. Is there any way to handle this situation with no probability of error?

Edited by: Muralidhar on Mar 11, 2013 11:34 PM
  • 1. Re: problem in comparing 2 files reliably?
    baftos Expert
    I am not sure I understand what is where, but using a checksum seems reasonable. Suppose two different files produce the same checksum. A matching checksum is not proof, so you must still upload and compare the files; you find out they are not the same, and the checksum saved you nothing in that case. Too bad: with a uniformly distributed 2-byte checksum this happens about once in 65,536 times, so just live with it. Or improve your odds astronomically by using a cryptographic hash function instead of a simple checksum.
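For the "you must still compare the files" step, here is a minimal sketch of a byte-for-byte comparison, assuming both files are reachable as local paths once the upload is done. The class and method names (`FileCompare`, `sameContent`) are made up for illustration and do not come from this thread.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class FileCompare {
    /** Returns true only if both files have exactly the same bytes. */
    static boolean sameContent(Path a, Path b) throws IOException {
        // Different sizes can never be equal, so skip reading altogether.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
             InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                if (byteA != inB.read()) {
                    return false; // first differing byte found
                }
            }
            // A reached end of file; B must be at end of file as well.
            return inB.read() == -1;
        }
    }
}
```

Only a full comparison like this gives zero probability of error; the checksum is just a cheap filter in front of it.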
  • 2. Re: problem in comparing 2 files reliably?
    rp0428 Guru
    >
    the problem is if two files are not equal then uploading the second file is useless.
    >
    Did you mean if the two files ARE equal then it is a problem?
    >
    so the approach i took was comparing the checksums of both the files but then sometimes the checksum is the same even when two file contents are not the same so is there any way to deal this situation with no probability of error.
    >
    Then you are using the wrong method to compute your checksum.

    A '1 bit' checksum is going to detect a lot of 'duplicate' files since there are only two possible values for that one bit.
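    To illustrate how easily a weak checksum collides: the thread never says which checksum was actually used, so the additive 16-bit sum below is only an assumed stand-in, and `WeakChecksum` and `sum16` are hypothetical names. Because addition ignores byte order, any two files with the same bytes in a different order get the same value.

```java
final class WeakChecksum {
    /** Sums all byte values and keeps only the low 16 bits (a 2-byte checksum). */
    static int sum16(byte[] data) {
        int sum = 0;
        for (byte b : data) {
            // Treat each byte as unsigned and wrap around at 16 bits.
            sum = (sum + (b & 0xff)) & 0xffff;
        }
        return sum;
    }
}
```

With this sketch, the byte sequences for "ab" and "ba" produce the same checksum although the contents differ, which is the kind of false "duplicate" described in the question.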

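    Use MD5 and, as far as accidental collisions go, you won't have to worry about two different files having the same checksum: a 128-bit digest makes that astronomically unlikely. You could also use SHA-1 or one of the longer SHA-2 digests such as SHA-256; for a use case such as yours that is generally unnecessary. The exception is when someone might deliberately craft colliding files, because practical collision attacks exist for both MD5 and SHA-1; in that case use SHA-256.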

    Here is an article with some simple Java code that shows how to compute both checksums.
    http://www.javablogging.com/sha1-and-md5-checksums-in-java/
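    In case the link goes stale, here is a minimal sketch of the same idea using the standard `java.security.MessageDigest` API. It is not the article's code; the names `FileDigest` and `digest` are invented for this example. The algorithm name is passed in, so "MD5", "SHA-1" and "SHA-256" all work.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class FileDigest {
    /** Streams the file through the named digest algorithm and returns the raw digest bytes. */
    static byte[] digest(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int n;
            // Feed the file to the digest in chunks so large files do not need to fit in memory.
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        return md.digest();
    }
}
```

Two digests produced this way can be compared with `MessageDigest.isEqual(first, second)` or `Arrays.equals(first, second)`.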

    Another performance tip: upload a 'zipped' version of the file when you do the upload. Unless the file is already highly compressed (e.g. an image file or video file) sending the compressed file can be much quicker and the time savings can be much greater than the time it takes to zip the file to begin with.

    Your code to compute the checksum needs to read the entire input file anyway, so just use Java's ZipOutputStream to write it to a new zip file at the same time you compute the checksum.
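    A minimal sketch of that single pass, assuming a one-entry zip file is acceptable and that the checksum should cover the original (uncompressed) bytes. `ZipAndDigest` and `zipAndDigest` are hypothetical names; `DigestInputStream` and `ZipOutputStream` are standard JDK classes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipAndDigest {
    /**
     * Reads the source file once, writes it into a single-entry zip file,
     * and returns the digest of the uncompressed content.
     */
    static byte[] zipAndDigest(Path source, Path zipTarget, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        // DigestInputStream updates the digest with every byte that is read through it.
        try (InputStream in = new DigestInputStream(Files.newInputStream(source), md);
             ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(zipTarget))) {
            zip.putNextEntry(new ZipEntry(source.getFileName().toString()));
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                zip.write(buffer, 0, n);
            }
            zip.closeEntry();
        }
        return md.digest();
    }
}
```

Digesting the uncompressed bytes, rather than the zip file itself, keeps the checksum independent of compression settings and zip metadata such as timestamps.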

    That technique is especially effective for data files (CSV or delimited files, xml files, etc) since they are highly compressible. Also if a csv file is to later be loaded into Oracle the EXTERNAL TABLE functionality in 11g supports a pre-processor directive that can be used to load the zipped file without having to first unzip it.

    Then compare the local checksum to the checksum at the remote site. If they are different send the zipped file and either leave it zipped until it needs to be used (saves storage space) or unzip it after it is on the remote box.
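    The decision itself can be sketched as below. How the remote checksum is obtained is not covered in this thread, so the sketch simply assumes it arrives as a hex string, or as null when the server has no such file. `UploadDecision` and `needsUpload` are hypothetical names.

```java
final class UploadDecision {
    /**
     * @param localHex  hex digest of the local file
     * @param remoteHex hex digest reported by the remote site, or null if no file exists there
     * @return true if the file should be sent
     */
    static boolean needsUpload(String localHex, String remoteHex) {
        if (remoteHex == null) {
            return true; // nothing on the server yet
        }
        // Hex digests may differ in letter case or carry stray whitespace.
        return !localHex.trim().equalsIgnoreCase(remoteHex.trim());
    }
}
```

A matching digest is treated here as "same file"; if you need absolute certainty rather than astronomically good odds, a full content comparison is still required.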
  • 3. Re: problem in comparing 2 files reliably?
    Murray9654 Newbie
    >
    Here is an article with some simple Java code that shows how to compute both checksums.
    http://www.javablogging.com/sha1-and-md5-checksums-in-java/
    >
    I have understood the above example. The only question is: why is he converting the hash into a hexadecimal value? Is it necessary?
  • 4. Re: problem in comparing 2 files reliably?
    rp0428 Guru
    >
    The only question is why is he converting the hash into hexadecimal value? is it necessary?
    >
    Necessary? No - but the 'digest' method returns a byte array and outside of Java that isn't very useful.
            // get the hash value as byte array
            byte[] hash = algorithm.digest();
    The 'normal' use is to provide the digest value as a hex string. If you download files from the web a lot of times they will provide a digest value (MD5 or SHA1) you can use to verify that the downloaded file was not corrupted and is, in fact, the correct file. Those sites provide the digest value as a hex string. That makes it easy to cut & paste to compare with another value.

    Also, for this code the 'calculateHash' method is declared to return a string
        public static String calculateHash(MessageDigest algorithm,
    Having the digest value as a hex string makes it easier to manipulate.
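    For reference, a minimal sketch of the byte-array-to-hex conversion. It is not the article's code; `HexUtil` and `toHex` are names made up for this example.

```java
final class HexUtil {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    /** Converts each byte to two lowercase hex characters. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // High nibble first, then low nibble; the masks discard sign extension.
            sb.append(DIGITS[(b >> 4) & 0x0f]);
            sb.append(DIGITS[b & 0x0f]);
        }
        return sb.toString();
    }
}
```

If both sides of the comparison are Java code, you can skip the conversion and compare the raw byte arrays with `MessageDigest.isEqual` or `Arrays.equals`; the hex form matters when the value is stored, logged, or exchanged with other tools.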
  • 5. Re: problem in comparing 2 files reliably?
    Murray9654 Newbie
    Thank you so much.
