15 Replies. Latest reply: Jun 25, 2008 1:35 AM by mlk

Remove duplicate lines from a text file and split a reference number

807589 Newbie
Hi,

This is my first post.

I am trying to remove duplicate lines from a text file. To make things difficult, the lines contain non-unique timestamps but a unique reference number. Some sets of duplicates run to 10 lines, whereas others are only 2 lines.

1. Here are some examples of duplicate lines, in the format <timestamp>,<reference>,<error message>:

08:47:22,95847170050,Problem inputting data.
08:53:28, 96672540040, More problems inputting data.
08:47:29,95847170050,Problem inputting data.
08:53:35, 96672540040, More problems inputting data.
08:47:35,95847170050,Problem inputting data.
08:53:41, 96672540040, More problems inputting data.

I want to delete all but the most recent duplicate line.

2. The reference number is a series of 11 digits which I need to split into two numbers (one 7 digits long and the other 4) separated by a comma.

Before: 96672540040
After: 9667254,0040

Appreciate all the help in advance.

Thanks
  • 1. Re: Remove duplicate lines from a text file and split a reference number
    807589 Newbie
    > I want to delete all but the most recent duplicate line.
    >
    > 2. The reference number is a series of 11 digits which I need to split into two numbers (one 7 digits long and the other 4) separated by a comma.

    So, how far have you gotten on the problems? Have you written any code at all? If so, which specific problems do you need help with? You don't expect us to write a complete program for you, do you?

    Get started on the program, and when you encounter a specific problem you just can't solve by yourself, come back and post that.
  • 2. Re: Remove duplicate lines from a text file and split a reference number
    mlk Newbie
    sed 's/,/\t/g' test.txt | sort -r -k3 | uniq -f2 | cut -f2,3 | sed 's/\([0-9]\{7\}\)\([0-9]\{4\}\)/\1,\2/' | sed 's/\t/,/g'
    I'm sure the splitting can be done better, and I'd check the sort with more data if I were you, but I think it should work.

    Edited by: mlk on 23-Jun-2008 12:45
  • 3. Re: Remove duplicate lines from a text file and split a reference number
    807589 Newbie
    Thank you for your comments which I will take on board.

    I am new to Java, so can you tell me what the best way of doing this is?

    As for part one of my problem, should I use an array or an ArrayList? And if so, how would I go about sorting the duplicate lines together before deleting all but the most recent?

    Thank you.
  • 4. Re: Remove duplicate lines from a text file and split a reference number
    807589 Newbie
    Thanks mlk.

    I was hoping for something in Java rather than Unix. :-)
  • 5. Re: Remove duplicate lines from a text file and split a reference number
    mlk Newbie
    MuverGooz wrote:
    > Thanks mlk.
    np
    > I was hoping for something in Java rather than Unix. :-)
    Why? A shell script is ideal for this kind of problem.

    You need to tell us what part you are stuck on.
    > As for part one of my problem, should I use an array or an ArrayList?
    Up to you. I'd use neither. You only need the current line and the last distinct lines.
    > And if so how would I go about sorting the duplicate lines
    Have you looked at the "Collections" class?
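
    One possible reading of the "current line plus last distinct lines" idea is a map keyed by reference number that always holds the latest line seen for that reference. The sketch below is an illustration only: the class name LatestPerReference and the method dedupe are hypothetical, and it assumes every timestamp is a fixed-width HH:mm:ss string (so plain string comparison orders them) and that spaces around the reference can be trimmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not from the thread: keeps one line per reference number.
final class LatestPerReference {

    // Returns the surviving lines, ordered by each reference's first appearance.
    static List<String> dedupe(List<String> lines) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split(",", 3);
            if (fields.length < 3) {
                continue; // assumption: malformed lines are skipped silently
            }
            String reference = fields[1].trim();
            String previous = latest.get(reference);
            // HH:mm:ss is fixed width, so comparing the strings compares the times.
            if (previous == null
                    || timestamp(line).compareTo(timestamp(previous)) >= 0) {
                latest.put(reference, line);
            }
        }
        return new ArrayList<>(latest.values());
    }

    // Extracts the text before the first comma.
    private static String timestamp(String line) {
        return line.substring(0, line.indexOf(',')).trim();
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "08:47:22,95847170050,Problem inputting data.",
                "08:53:28, 96672540040, More problems inputting data.",
                "08:47:29,95847170050,Problem inputting data.",
                "08:53:35, 96672540040, More problems inputting data.");
        for (String line : dedupe(sample)) {
            System.out.println(line);
        }
    }
}
```

    A LinkedHashMap is used here so the output keeps the order in which references first appear; no sorting step is needed with this approach.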
  • 6. Re: Remove duplicate lines from a text file and split a reference number
    807589 Newbie
    A shell script would be ideal but the file in question resides on a windoze box.

    I will look into the Collections class.
  • 7. Re: Remove duplicate lines from a text file and split a reference number
    807589 Newbie
    This is the code I have so far to split the reference 7/4, with a comma separating the two numbers.
    Although I am successfully adding a comma in the right place, the reference still shows as one 11-digit number in amendedMapErrors.Log.
    Please help.
    import java.io.*;

    public class Reduce {
         public static void main (String[] args) throws IOException {

              File Filemapinfo =
                   new File ("MapErrors.Log");
              FileReader fileReader =
                   new FileReader(Filemapinfo);
              BufferedReader fileMapBUFF =
                   new BufferedReader(fileReader);
              FileWriter writer =
                   new FileWriter("amendedMapErrors.Log");

              String line;
              while (( line = fileMapBUFF.readLine()) != null) {
                   // split reference to order/operation comma separated
                   String[] csvSplit = line.split(",");
                   int i = 7;
                   // trim first: some references have a leading space
                   StringBuffer csv = new StringBuffer(csvSplit[1].trim());
                   csv = csv.insert(i,',');
                   csvSplit[1] = csv.toString();
                   System.out.println(csvSplit[1]);
                   writer.write(line +"\n");
              } //while
              fileMapBUFF.close();
              writer.close();
         }//main
    }//Reduce
    Edited by: MuverGooz on Jun 24, 2008 2:55 PM
  • 8. Re: Remove duplicate lines from a text file and split a reference number
    mlk Newbie
    MuverGooz wrote:
    > A shell script would be ideal but the file in question resides on a windoze box.
    A developer without Cygwin installed?!
    > I will look into the Collections class.
  • 9. Re: Remove duplicate lines from a text file and split a reference number
    mlk Newbie
    > writer.write(line +"\n");
    What does "line" contain?
  • 10. Re: Remove duplicate lines from a text file and split a reference number
    807589 Newbie
    "line" contains each line of the input file "MapErrors.Log"

    OK, how do I get "csvSplit[1]" back into "line" please?
  • 11. Re: Remove duplicate lines from a text file and split a reference number
    mlk Newbie
    You don't. You build up a new line from the data you have in the array and write that out.
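
    A minimal sketch of that idea, for illustration only: split the line into its three fields, split the reference 7/4, and join the pieces back into a new string. The class name ReferenceSplitter and the method rebuild are hypothetical, and the sketch assumes the reference is exactly 11 characters once trimmed; lines that do not fit that shape are returned unchanged.

```java
// Hypothetical helper, not from the thread: rebuilds one log line.
final class ReferenceSplitter {

    // Assumption: a line is <timestamp>,<reference>,<message>.
    static String rebuild(String line) {
        // Limit of 3 keeps any commas inside the message intact.
        String[] fields = line.split(",", 3);
        if (fields.length < 3) {
            return line; // not in the expected format, leave untouched
        }
        String reference = fields[1].trim();
        if (reference.length() != 11) {
            return line; // assumption: only 11-character references are split
        }
        // First 7 characters, a comma, then the remaining 4.
        String splitReference =
                reference.substring(0, 7) + "," + reference.substring(7);
        // Build the new line from the array instead of reusing the old one.
        return fields[0].trim() + "," + splitReference + "," + fields[2].trim();
    }

    public static void main(String[] args) {
        System.out.println(
                rebuild("08:53:28, 96672540040, More problems inputting data."));
    }
}
```

    In the loop from the earlier post, the call writer.write(line + "\n") would then write the rebuilt string rather than the original line.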
  • 12. Re: Remove duplicate lines from a text file and split a reference number
    807589 Newbie
    I'm stuck.
    What is the best way to do that?
  • 13. Re: Remove duplicate lines from a text file and split a reference number
    807589 Newbie
    import java.io.*;
    import java.util.*;

    public class Reduce {
         public static void main (String[] args) throws IOException
         {
              String line = null;
              File Filemapinfo =
                   new File ("D:/samples/java/data/MapErrors.Log");
              FileReader fileReader =
                   new FileReader(Filemapinfo);
              BufferedReader fileMapBUFF =
                   new BufferedReader(fileReader);
              FileWriter writer =
                   new FileWriter("D:/samples/java/data/amendedMapErrors.Log");

              // reference number -> fields of the most recent line seen so far;
              // LinkedHashMap keeps the references in order of first appearance
              Map<String, String[]> map = new LinkedHashMap<String, String[]>();
              Calendar cal = new GregorianCalendar();
              // clone so both calendars share the same date and milliseconds
              Calendar cal1 = (Calendar) cal.clone();
              String[] csvSplit = null;
              String[] existingStr = null;
              String[] time = null;
              while (( line = fileMapBUFF.readLine()) != null)
              {
                   // split into timestamp, reference and message
                   csvSplit = line.split(",", 3);
                   if( csvSplit.length < 3 )
                   {
                        continue; // skip blank or malformed lines
                   }
                   // the sample data has optional spaces after the commas
                   csvSplit[1] = csvSplit[1].trim();

                   time = csvSplit[0].trim().split( ":" );
                   // HOUR_OF_DAY is the 24-hour field; HOUR is the 12-hour one
                   cal.set( Calendar.HOUR_OF_DAY, Integer.parseInt(time[0]));
                   cal.set( Calendar.MINUTE, Integer.parseInt(time[1]));
                   cal.set( Calendar.SECOND, Integer.parseInt(time[2]));

                   if( !map.containsKey( csvSplit[1] ) )
                   {
                        map.put( csvSplit[1], csvSplit );
                   }
                   else
                   {
                        System.out.println( "inside else..." );
                        existingStr = map.get( csvSplit[1] );
                        time = existingStr[0].trim().split( ":" );
                        cal1.set( Calendar.HOUR_OF_DAY, Integer.parseInt(time[0]) );
                        cal1.set( Calendar.MINUTE, Integer.parseInt(time[1]) );
                        cal1.set( Calendar.SECOND, Integer.parseInt(time[2]) );
                        // keep only the most recent line for this reference
                        if( cal.after( cal1 ) )
                        {
                             map.put( csvSplit[1], csvSplit );
                        }
                   }

              } //while
              int commaIndex = 7;
              for( String[] newStr : map.values() )
              {
                   System.out.println( "***" );
                   // split the 11-digit reference 7/4 with a comma
                   StringBuffer csv = new StringBuffer(newStr[1]);
                   csv.insert(commaIndex,',');
                   // rebuild the line, restoring the commas between the fields
                   line = newStr[0] + "," + csv + "," + newStr[2];
                   writer.write(line +"\n");
              }
              fileMapBUFF.close();
              writer.close();
         }
    }
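
    As an aside, the timestamp comparison in the program above can also be written with java.time.LocalTime. This is only a sketch and assumes Java 8 or later, which postdates this thread; the class name TimestampCompare and the method isMoreRecent are hypothetical. LocalTime.parse accepts the HH:mm:ss form used in the log and treats the hour as a 24-hour value.

```java
import java.time.LocalTime;

// Hypothetical helper, not from the thread: compares two HH:mm:ss timestamps.
final class TimestampCompare {

    // True when candidate is strictly later in the day than existing.
    static boolean isMoreRecent(String candidate, String existing) {
        LocalTime candidateTime = LocalTime.parse(candidate.trim());
        LocalTime existingTime = LocalTime.parse(existing.trim());
        return candidateTime.isAfter(existingTime);
    }

    public static void main(String[] args) {
        System.out.println(isMoreRecent("08:47:35", "08:47:22"));
    }
}
```

    Because the log carries no date, this comparison (like the Calendar version) assumes all lines fall on the same day.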
  • 14. Re: Remove duplicate lines from a text file and split a reference number
    807589 Newbie
    The best program to work with duplicates (http://www.moleskinsoft.com/) is Clone Remover. It is multifunctional and easy to use.

    Edited by: Cleaner007 on Jul 8, 2008 1:38 PM