I am not sure I understand what is where, but using a checksum seems reasonable. Suppose two different files produce the same checksum. You do an unnecessary upload. You must still compare the files. OK, you find out that they are not the same and you did an unnecessary upload. Too bad; this will happen about once in 65,536 times with a 2-byte checksum, so just live with it. Or improve your chances astronomically by using a cryptographic hash function instead of a simple checksum.
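As a toy illustration of that point (my own sketch, not from this thread): a naive 16-bit additive checksum cannot even distinguish two inputs whose bytes are merely reordered, while a cryptographic hash such as SHA-256 does.

```java
import java.security.MessageDigest;
import java.util.Arrays;

public class ChecksumCollision {
    // Naive 16-bit checksum: sum of all bytes, modulo 2^16.
    static int checksum16(byte[] data) {
        int sum = 0;
        for (byte b : data) sum = (sum + (b & 0xFF)) & 0xFFFF;
        return sum;
    }

    public static void main(String[] args) throws Exception {
        byte[] a = "ab".getBytes();
        byte[] b = "ba".getBytes();
        // Different contents, identical checksum: a trivial collision.
        System.out.println(checksum16(a) == checksum16(b)); // true
        // A cryptographic hash separates them.
        boolean sameSha = Arrays.equals(
                MessageDigest.getInstance("SHA-256").digest(a),
                MessageDigest.getInstance("SHA-256").digest(b));
        System.out.println(sameSha); // false
    }
}
```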
The problem is that if the two files are not equal, then uploading the second file is useless.
Did you mean if the two files ARE equal then it is a problem?
The approach I took was comparing the checksums of both files, but sometimes the checksum is the same even when the two file contents are not. Is there any way to deal with this situation with no probability of error?
Then you are using the wrong method to compute your checksum.
A '1 bit' checksum is going to detect a lot of 'duplicate' files since there are only two possible values for that one bit.
Use MD5 and you won't have to worry about two different files having the same checksum. You could also use SHA1 or one of the many other higher-bit algorithms but that is generally unnecessary, especially for a use case such as yours.
Here is an article with some simple Java code that shows how to compute both checksums.
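In case the article link is lost, a minimal sketch of what such code looks like (class and method names are my own) uses `java.security.MessageDigest`, which supports both MD5 and SHA-1:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class FileHash {
    // Compute a file's digest with the named algorithm ("MD5" or "SHA-1").
    public static String hashFile(Path file, String algorithm) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n); // feed the file through the digest in chunks
            }
        }
        // Convert the raw digest bytes into the usual lowercase hex string.
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "hello".getBytes());
        System.out.println("MD5:   " + hashFile(tmp, "MD5"));
        System.out.println("SHA-1: " + hashFile(tmp, "SHA-1"));
    }
}
```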
Another performance tip: upload a 'zipped' version of the file when you do the upload. Unless the file is already highly compressed (e.g. an image file or video file) sending the compressed file can be much quicker and the time savings can be much greater than the time it takes to zip the file to begin with.
Your code to compute the checksum needs to read the entire input file anyway, so just use Java's ZipOutputStream to write it to a new zip file at the same time you compute the checksum.
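A sketch of that single-pass idea (names are my own): wrap the file's input stream in a `DigestInputStream` so the MD5 accumulates while the same bytes are copied into the `ZipOutputStream`:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipAndHash {
    // Zip the source file and return its MD5 checksum, reading the file only once.
    public static String zipWithChecksum(Path source, Path zipFile) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(source), md);
             ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            zip.putNextEntry(new ZipEntry(source.getFileName().toString()));
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                zip.write(buf, 0, n); // one read feeds both the digest and the zip
            }
            zip.closeEntry();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}
```

Note the checksum here is of the original (uncompressed) file, which is what you want to compare against the remote copy.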
That technique is especially effective for data files (CSV or delimited files, XML files, etc.) since they are highly compressible. Also, if a CSV file is to later be loaded into Oracle, the EXTERNAL TABLE functionality in 11g supports a pre-processor directive that can be used to load the zipped file without having to first unzip it.
Then compare the local checksum to the checksum at the remote site. If they are different send the zipped file and either leave it zipped until it needs to be used (saves storage space) or unzip it after it is on the remote box.
"Here is an article with some simple Java code that shows how to compute both checksums."

I have understood the above example. The only question is: why is he converting the hash into a hexadecimal value? Is it necessary?
"The only question is why is he converting the hash into hexadecimal value? is it necessary?"
Necessary? No - but the 'digest' method returns a byte array and outside of Java that isn't very useful.
The 'normal' use is to provide the digest value as a hex string. If you download files from the web, the site will often provide a digest value (MD5 or SHA1) you can use to verify that the downloaded file was not corrupted and is, in fact, the correct file. Those sites provide the digest value as a hex string, which makes it easy to cut and paste to compare with another value.
// get the hash value as a byte array
byte[] hash = algorithm.digest();
Also, for this code the 'calculateHash' method is declared to return a String:

public static String calculateHash(MessageDigest algorithm,

Having the digest value as a hex string makes it easier to manipulate.
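To make the hex point concrete, here is a small self-contained sketch (my own, not the article's code) of the usual byte-array-to-hex conversion. The raw digest bytes can be unprintable; the hex string is stable text you can paste and compare:

```java
import java.security.MessageDigest;

public class HexDemo {
    // Two lowercase hex characters per byte, e.g. {0x00, 0xff} -> "00ff".
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest("hello".getBytes());
        System.out.println(toHex(digest)); // a 32-character hex string
    }
}
```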
Thank you so much.