57 Replies Latest reply: Mar 21, 2013 9:30 AM by 998363

    Removing duplicate file entries from LinkedHashSet

    843793
      Sorry, but I'm just coming back to Java, and I'm having a little trouble.

      I'm trying to write a tool which will recursively obtain a list of files in a directory. I'm using a helper method which does the recursion and passes back a Collection<File>, so I'm taking this collection and inserting it into my LinkedHashList with addAll().

      The problem I'm having is that I want the user to be able to feed in files as well as directories, and I want to filter out any duplicate occurrences. But for some reason, LinkedHashList is not filtering these duplicate File objects. I've tried several methods, including creating a list of files from the command line and adding them one by one at the end, but I'm not having a lot of luck. Is this possible?
      import java.io.File;
      import java.util.LinkedHashSet;
      
      public class MyClass
      {
          public static void main( String[] args )
          {
              File filename = null;
              LinkedHashSet<File> directories = new LinkedHashSet<File>();
              LinkedHashSet<File> fileList = new LinkedHashSet<File>();
      
              for ( int i = 0; i < args.length; i++ )
              {
                  filename = new File( args[i] );
                  if ( filename.isDirectory() )   // isDirectory is a method, not a field
                  {
                      directories.add( filename );
                  }
                  else if ( filename.isFile() )
                  {
                      fileList.add( filename );
                  }
              }
      
              if ( !directories.isEmpty() )
              {
                  for ( File directory : directories )
                  {
                      fileList.addAll( FileListHelper.listFiles( directory ) );
                  }
              }
      
              File[] fileArray = new File[ fileList.size() ];
              fileList.toArray( fileArray );
      
              for ( int j = 0; j < fileArray.length; j++ )
              {
                  System.out.println( fileArray[j].getName() );   // index the element, not the array
              }
          }
      }

      public class FileListHelper
      {
          public static Collection<File> listFiles( File directory )
          {
              // ....do stuff....
              // return Collection of files
          }
      }
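      For reference, a minimal sketch of what such a helper might look like; the body below is an assumption, since the OP's actual implementation is elided:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;

public class FileListHelper
{
    // Recursively collects every regular file under the given directory.
    public static Collection<File> listFiles( File directory )
    {
        Collection<File> result = new ArrayList<File>();
        File[] entries = directory.listFiles();
        if ( entries == null )   // not a directory, or an I/O error occurred
        {
            return result;
        }
        for ( File entry : entries )
        {
            if ( entry.isDirectory() )
            {
                result.addAll( listFiles( entry ) );  // recurse into subdirectory
            }
            else if ( entry.isFile() )
            {
                result.add( entry );
            }
        }
        return result;
    }
}
```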
      So if my_directory/ contains: 
      MyClass.java
      MyClass.class
      FileListHelper.java
      FileListHelper.class
      
      And I pass in 
      $ java MyClass FileListHelper.java my_directory/
      
      I'm going to get FileListHelper.java twice in the list.
      
      I've also tried starting out with ArrayLists, converting to LinkedHashSets and back again. Nothing seems to filter out the duplicates. 
      
      Also, I sort of care about the order of the files, because I'm going to do some more processing with the fileArray later. I mainly need the order to be the same every time I run it.
      
      Thanks for your help.
        • 1. Re: Removing duplicate file entries from LinkedHashSet
          843793
          What do you mean by duplicate? Are these two files with the same name only? Maybe I'm not right, but you can extend the File class and override the equals and hashCode methods that are used to compare objects. By overriding these methods you'll be able to compare Files in whatever way you're interested in, and duplicates will be automatically removed when added to an instance of Set. Also, there's no such collection as LinkedHashList.

          Edited by: elOpalo on Mar 5, 2010 12:12 PM
          • 2. Re: Removing duplicate file entries from LinkedHashSet
            DrClap
            elOpalo wrote:
            What do You mean by saying duplicate? Are these two files with the same name only? Maybe I'm not right but You can extend File class and override equals and hashcode methods that are used to compare objects.
            That's sort of not right; the File class already has suitable equals() and hashCode() methods. You don't need to subclass it to provide them.
            • 3. Re: Removing duplicate file entries from LinkedHashSet
              843793
              I'm talking about duplicates as in "it's the same file", not just the same name. Same path, same name, same size, same created date, etc. I had hoped that since you can pull all that information from the file handle, and since File is basically built into Java, that the File class would be smart enough to know what was meant by equals() and hashCode().

              Writing my own subclass just to implement something that should be in there already seems kinda obnoxious. But if that's my only choice, then maybe I'm stuck with it.

              Any good references on how to override the hashCode() method?
              • 4. Re: Removing duplicate file entries from LinkedHashSet
                843793
                DrClap wrote: That's sort of not right; the File class already has suitable equals() and hashCode() methods. You don't need to override it to provide them.
                Thanks, that's what I thought the case should be. Any thoughts on why the LinkedHashSet isn't able to differentiate the file?
                Edited by: mreeves on Mar 5, 2010 12:40 PM
                • 5. Re: Removing duplicate file entries from LinkedHashSet
                  DrClap
                  mreeves wrote:
                  I'm talking about duplicates as in "it's the same file", not just the same name. Same path, same name, same size, same created date, etc.
                  That doesn't make sense. If A and B are the same file, i.e. they have the same path, then they can't possibly have different sizes. There's only one file from the underlying file system involved, and it only has one size. And one creation date. And one last change date. And so on for any other properties of a file you might think of.
                  • 6. Re: Removing duplicate file entries from LinkedHashSet
                    DrClap
                    $ java MyClass FileListHelper.java my_directory/
                    So first you consider FileListHelper.java in your current directory, whatever that might be. Then you consider the files in my_directory, which presumably isn't the current directory. It's possible that this recursive search finds another file named FileListHelper.java, but since it isn't in the current directory, it has a different path from the one you passed in as the first parameter. Hence it's a different file. You could check that by simply displaying the full path of each file.
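                    That check might look something like the following sketch (PathDebug is a made-up name, not part of the OP's program); it prints the three different path views of each argument, and it's the abstract path string on the first line that equals() and hashCode() actually compare:

```java
import java.io.File;
import java.io.IOException;

public class PathDebug
{
    public static void main( String[] args ) throws IOException
    {
        for ( String arg : args )
        {
            File f = new File( arg );
            // File.equals()/hashCode() compare the abstract path string,
            // so two handles to the same physical file can still differ here.
            System.out.println( "path:      " + f.getPath() );
            System.out.println( "absolute:  " + f.getAbsolutePath() );
            System.out.println( "canonical: " + f.getCanonicalPath() );
        }
    }
}
```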
                    • 7. Re: Removing duplicate file entries from LinkedHashSet
                      843793
                      DrClap wrote:
                      $ java MyClass FileListHelper.java my_directory/
                      So first you consider FileListHelper.java in your current directory, whatever that might be. Then you consider the files in my_directory, which presumably isn't the current directory. It's possible that this recursive search finds another file named FileListHelper.java, but since it isn't in the current directory, it has a different path from the one you passed in as the first parameter. Hence it's a different file. You could check that by simply displaying the full path of each file.
                      Actually, it is in the same directory. I'm trying to idiot-proof my code, so I'm testing a case where the user would give me input that is the same file as something that also exists in the directory I'm passing in.

                      For some reason, it's not seeing those as the same file.

                      I just tested doing this:
                      $ java MyClass MyClass.java MyClass.java

                      In that case, my code is only listing the File once. But when I insert the current directory, I'm getting the Collection back which includes the file, so then I'm getting the file in the list twice.

                      Edited by: mreeves on Mar 5, 2010 12:52 PM
                      • 8. Re: Removing duplicate file entries from LinkedHashSet
                        843793
                        DrClap wrote: That's sort of not right; the File class already has suitable equals() and hashCode() methods. You don't need to override it to provide them.
                        Right, thanks for the hint.
                        • 9. Re: Removing duplicate file entries from LinkedHashSet
                          843793
                          Ok, so I also decided to test out this.
                          java MyClass my_directory/ my_directory/

                          In this case, I'm getting a listing of each file, only once as well. So it's some weird case where you enter a directory and a filename that is also in the directory.

                          Maybe this has something to do with using Collection?
                          • 10. Re: Removing duplicate file entries from LinkedHashSet
                            843793
                            Alright, I need to apologize to you guys. I haven't actually been entering my_directory/. I've been using ./ as my directory, thinking it was the same thing.

                            It turns out that something else is going on. The only situation where I get this problem is when I use ./ as an input directory. I just tested this situation.

                            $ java MyClass ./ MyClass.java

                            However, if I do this:
                            $ java MyClass sub_dir/ sub_dir/myfilename.txt

                            Then myfilename.txt only shows up one time in the list.

                            again, still not really understanding what is going on here.

                            Edited by: mreeves on Mar 5, 2010 1:30 PM
                            • 11. Re: Removing duplicate file entries from LinkedHashSet
                              DrClap
                              mreeves wrote:
                              again, still not really understanding what is going on here.
                              Did you consider my suggestion to do some debugging? Print out paths which you think should be the same but which the code says are not the same?
                              • 12. Re: Removing duplicate file entries from LinkedHashSet
                                796085
                                I haven't written any code to test this, but if you look through the equals() method of File you will find (ultimately, inside Win32FileSystem) that it just does a string comparison on the output of File.getPath(). I don't believe this method turns your paths into canonical paths; I think it just normalises them (essentially, it sorts out the / characters).

                                So, you could easily have two paths that point to the same file and don't equals() each other. You might have to write a custom comparator which compares based on the canonical name.
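                                This is easy to confirm. The sketch below (a standalone demo, not the OP's code) shows that equals() follows the path string, so the "./" prefix makes two handles to the same file compare unequal, while getCanonicalFile() resolves both to the same path:

```java
import java.io.File;
import java.io.IOException;

public class FileEqualsDemo
{
    public static void main( String[] args ) throws IOException
    {
        File a = new File( "./MyClass.java" );
        File b = new File( "MyClass.java" );

        // Different abstract path strings, so File.equals() says they differ...
        System.out.println( a.equals( b ) );                                       // false
        // ...but both resolve to the same canonical path.
        System.out.println( a.getCanonicalFile().equals( b.getCanonicalFile() ) ); // true
    }
}
```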
                                • 13. Re: Removing duplicate file entries from LinkedHashSet
                                  843793
                                  :) That's what I wrote at the beginning: the files aren't being compared in a suitable way. Maybe I will test it later myself.
                                  • 14. Re: Removing duplicate file entries from LinkedHashSet
                                    796085
                                    You can ask a File for its canonical name, so writing a comparator should be one line of code.
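                                    Rather than a comparator (which would require a sorted set and lose the insertion order the OP wants to keep), one sketch under the same idea is to canonicalise each File before adding it to the LinkedHashSet, so "./MyClass.java" and "MyClass.java" collapse into one entry (CanonicalDedup and dedupe are hypothetical names for illustration):

```java
import java.io.File;
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

public class CanonicalDedup
{
    // Returns the input files with duplicates (by canonical path) removed,
    // keeping the first occurrence of each and the original insertion order.
    public static Set<File> dedupe( Iterable<File> files ) throws IOException
    {
        Set<File> unique = new LinkedHashSet<File>();
        for ( File f : files )
        {
            // getCanonicalFile() resolves "./", "..", and (on most systems)
            // symlinks, so equal physical files get equal path strings.
            unique.add( f.getCanonicalFile() );
        }
        return unique;
    }
}
```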