1 2 Previous Next 20 Replies Latest reply: Dec 3, 2007 3:11 PM by jschellSomeoneStoleMyAlias RSS

    two way Directory replicator in bit level

    807603
      Currently working on my desertation and developing an application that replicates two located directories to have the same data, for example if data manupilated in a file in directory A, directory B will be modified by synchronizing the modified part in the file only (Not to copy the complete file) and the other way round B to A.

      Obviously I will need the I/O package to use InputStream, OutputStream. But I need to find out an effiecent way how to chop a file in bits and then compare the two bit structure.

      Thought about using bittorrent functionality and did some research about the Bencode encoder and decoder. Am I on the right path, or is their other way to do this.

      Have anyone tried to do anything similar, could give me suggestions, point to a helpful source or a book.

      Thanks in Advance ;)
        • 1. Re: two way Directory replicator in bit level
          796440
          Ozie02 wrote:
          But I need to find out an effiecent way how to chop a file in bits and then compare the two bit structure.
          You're comparing bitwise, not byte-wise or character-wise? Why?

          In any case, what's wrong with just reading with BufferedInputStream?

          As for writing out "only the changed part," that will only be possible if the file after the change has exactly the same number of bytes as the file before the change. If you insert a byte into the middle of a file, everything after that byte will have to be written back out to disk.
          • 2. Re: two way Directory replicator in bit level
            807603
            As I looked into bittorrent materials I thought about going in Bitwise, so do you prefer me to go in bytewise?

            About BufferedInputStream, so that would do the job to read and get into bytewise?

            Okay, thanks for the infromation about writing out requirements where the file size has to be the same as before the changes (I didn't know that), okay if the file size is the same after the modifications, how that would be done, do I have to use BufferedOutputStream and what precautions I have to keep in mind.

            Sorry for sounding silly but this is new to me.

            Appreciate your response.

            Thanks.
            • 3. Re: two way Directory replicator in bit level
              796440
              Ozie02 wrote:
              As I looked into bittorrent materials I thought about going in Bitwise, so do you prefer me to go in bytewise?
              I don't care how you do it. I don't see the point of comparing a file bit by bit though. And you'll have to take extra steps to compare bit by bit rather than just comparing whole bytes or chars at a time.
              About BufferedInputStream, so that would do the job to read and get into bytewise?
              Read the docs, and the following tutorial if needed, and post back specific questions if something's not clear.
              http://java.sun.com/docs/books/tutorial/essential/io/index.html
              Okay, thanks for the infromation about writing out requirements where the file size has to be the same as before the changes (I didn't know that), okay if the file size is the same after the modifications, how that would be done, do I have to use BufferedOutputStream and what precautions I have to keep in mind.
              You can use RandomAccessFile to quickly jump to a specific position in a file for reading and writing.
              • 4. Re: two way Directory replicator in bit level
                jschellSomeoneStoleMyAlias
                About BufferedInputStream, so that would do the job to read and get into bytewise?
                Just to be sure - you understand the difference between bit and byte right?

                If you do then the question is basically non-sensical. Even if you find a bit difference the OS file IO is still going to require writing bytes anyways.

                And unless you have fixed format files or the files are only appended to you are going to have to re-write the files anyways. Although perhaps determing that algrorithmically is the point of your excercise.
                • 5. Re: two way Directory replicator in bit level
                  796440
                  jschell wrote:
                  And unless you have fixed format files or the files are only appended to you are going to have to re-write the files anyways. Although perhaps determing that algrorithmically is the point of your excercise.
                  I initially read that as "point of your existence," leading me to think, "Damn, and I thought *I* had no life!"
                  • 6. Re: two way Directory replicator in bit level
                    807603
                    I have implemented that kind of stuff before, and I will assume that you are trying to synchronize stuff over a network (otherwise, it doesn't make sense, just copying the files is probably more efficient).

                    Let's say that computer A contains the files that are up to date and computer B is the one that has to be synchronized. Basically, what you have to do first is have B send a representation of his folder hierachy to a and then compare the hierarchies of files from both directories. Then send to B instructions to delete files it has that aren't on A. Then send to B all the files that B doesn't have and A has.

                    The tricky part is what happens when both A and B have a file. Let A and B chop up each of their files into small pieces (crumbs) of say between 16kB and 128kB (you decide the size, but make sure all the crumbs are the same size). Now hash each of these pieces on both computers. Let B send to A to list of all the hashes all of the crumbs it has. Let A compare these crumbs to the corresponding ones A has calculated. Then have A send to be the data of all the crumbs for which the hash produced by B differs from the hash produced by A.

                    This is the basic principle. You might want to add some logic to avoid sending files that haven't been modified since the last synchronization (by comparing the last modified time). And you will of course need some logic to send crumbs that A has and that B doesn't (if file "foo.bar" on A is 100MB and "foo.bar" on B is 50MB, then A has a lot more crumbs for that file than B has...
                    • 7. Re: two way Directory replicator in bit level
                      807603
                      About the part *'I don't care how you do it.'* you are the one asked why I am doing it in bitwise.


                      I have tried to read and write files using BufferedInputStream, it worked great with text files. Here comes the bit what file type I wanted to replicate. I wanted to replicate virtual disks (VMWare) where I tried basically to read and write it to find out how it goes, the results were that the output virtual disk was created but corrupted (Can't read it).

                      I tried to read and write mp3 files, the same thing the output file created and when played you would hear cracking and listening down the tunnel (corrupted).

                      Then tried using FileInputStream the output file was created (but dead slow), when played the file it was okay. For the virtual disk 2 hours passed and not completed yet.
                      • 8. Re: two way Directory replicator in bit level
                        807603
                        Yes that's right I am trying to synchronize files over the network. Thanks for the guidance was really helpful.

                        About choping up the file into small pieces, how would I do that?


                        Thankyou.
                        • 9. Re: two way Directory replicator in bit level
                          796440
                          Ozie02 wrote:
                          About the part *'I don't care how you do it.'* you are the one asked why I am doing it in bitwise.
                          I asked that to try to get clarification of the problem. I couldn't imagine why you'd be doing that way, so perhaps you misunderstood something in the problem statement or made a bad assumption about the approach, or perhaps I misunderstood something about what you're trying to do. I think it makes more sense to work at a byte or char level, but It's your project, so I have no investment in which approach you take.
                          I have tried to read and write files using BufferedInputStream, it worked great with text files. Here comes the bit what file type I wanted to replicate. I wanted to replicate virtual disks (VMWare) where I tried basically to read and write it to find out how it goes, the results were that the output virtual disk was created but corrupted (Can't read it).
                          Then there's a bug in your code. You can read and write any file type with BufferedInputStream and BufferedOutputStream.
                          Then tried using FileInputStream the output file was created (but dead slow), when played the file it was okay. For the virtual disk 2 hours passed and not completed yet.
                          You woulud typically wrap a BIS/BOS around a FIS/FOS, and use the BufferedStreams' method that reaad and write large chunks at a time. If the file got corrupted, there's a bug in your read/write code.
                          • 10. Re: two way Directory replicator in bit level
                            796440
                            Ozie02 wrote:
                            About choping up the file into small pieces, how would I do that?
                            Read pieces of it at a time. E.g., read 16k bytes, calculate the hash, read the next 16k, calculate the hash, etc.
                            • 11. Re: two way Directory replicator in bit level
                              807603
                              Ozie02 wrote:
                              I have tried to read and write files using BufferedInputStream, it worked great with text files. Here comes the bit what file type I wanted to replicate. I wanted to replicate virtual disks (VMWare) where I tried basically to read and write it to find out how it goes, the results were that the output virtual disk was created but corrupted (Can't read it).

                              I tried to read and write mp3 files, the same thing the output file created and when played you would hear cracking and listening down the tunnel (corrupted).
                              This sound like you are using a Reader on top of your output/input streams. Readers will interpret all the data they receive as text. As you know, a stream of bytes can be encoded and decoded using several different encodings (ASCII, UTF-8, UTF-16, iso-8359-1, etc...). If you convert from one of these encodings to another, you won't get a perfect byte-for-byte copy of your input stream. Although I'm not totally sure about it, it is also very possible that even if you don't convert between encodings, special characters like the null character (byte value 0) might be outputted in a different way than it was read. In short: Don't read binary data using stuff that's meant for text (XxxxxReader) or you will run into trouble.
                              Then tried using FileInputStream the output file was created (but dead slow), when played the file it was okay. For the virtual disk 2 hours passed and not completed yet.
                              That kind of stuff happens when you read one byte at a time. Each time you call a read method, it basically amounts to one Hard drive access. A Hard drive access is really slow, 10-20 ms, (due to the mechanical parts in your HDD), and you have to read a lot of bytes. Reading (and then writing) 1 byte at a time is making you lose a tremendous amount of time. This means that you need to read a lot of bytes at the same time.

                              I find that the easiest way to read and write files is to use DataInputStream / DataOutputStream:
                              public void copyFile(File fromFile, File toFile) throws IOException {
                                DataInputStream in = new DataInputStream(new FileInputStream(fromFile));
                                DataOutputStream out = new DataOutputStream(new FileOutputStream(toFile));
                                
                                byte[] buffer = new byte[16384]; // we will read 16kB of data at once
                                int bytesRead = 0;
                                
                                do{
                                  bytesRead = in.read(buffer); // read the block of data
                              
                                  // do any kind of processing you need (like calculating hashes...)
                              
                                  out.write(buffer, 0, bytesRead); // write the data we just read into the destination file
                              
                                }while(bytesRead == buffer.length);
                              }
                              That's basically it. The bigger the chunk of data you read at once, the better copy performance you will get, but remember that you keep that much data in memory. With buffers bigger than 4 or 5 megabytes, you will probably not get any significant performance improvements by making the buffer bigger. Remember that data is usually stored on a HDD in clusters that are minimum 4kB in size, so you don't want your buffer to be smaller than that. Additionally, if you only read your file sequentially (and don't write it anywhere), the buffer size will have less impact on performance than if you both read and write to the disk at the same time (hard drives are very fast at reading data sequentially, but suck at accessing data randomly).

                              Edited by: npiguet on Dec 2, 2007 11:55 PM Corrected the loop exit condition
                              • 12. Re: two way Directory replicator in bit level
                                jschellSomeoneStoleMyAlias
                                Now hash each of these pieces on both computers.
                                Myself I don't think that I would rely on this for file transfer - given that hashes aren't unique.

                                I might use timestamp and size if I was confident of how the systems were being run.
                                • 13. Re: two way Directory replicator in bit level
                                  jschellSomeoneStoleMyAlias
                                  That kind of stuff happens when you read one byte at a time. Each time you call a read method, it basically amounts to one Hard drive access. A Hard drive access is really slow, 10-20 ms, (due to the mechanical parts in your HDD),
                                  Not on any modern OS at least not at the level high order language calls. The OS will buffer reads.

                                  So excluding perhaps some real time javas that will not happen.
                                  • 14. Re: two way Directory replicator in bit level
                                    796440
                                    jschell wrote:
                                    Now hash each of these pieces on both computers.
                                    Myself I don't think that I would rely on this for file transfer - given that hashes aren't unique.

                                    I might use timestamp and size if I was confident of how the systems were being run.
                                    I think some backup systems offer this as an option--use hash or size and timestamp for performance, or byte-by-byte comparison when you absolutely can't afford to miss something.

                                    For any decent hash, the probability of two different versions of a given file hashing to the same value is exeedingly tiny, so if it's for, say, a daily backup of something where losing a day is expensive but not crippling, it might be a fair tradeoff.
                                    1 2 Previous Next