8 Replies Latest reply: Sep 13, 2009 10:26 AM by Darryl Burke

    parsing a text file.

    843789
      OK, here is the objective: I need to read a file that has a bunch of lines, parse out some of them, and store them in an array in order. My text file looks something like this:

      FT VARIANT 115 115 G -> V (in allele DRB1*0804).
      FT /FTId=VAR_016685.
      FT VARIANT 236 236 V -> M (in dbSNP:rs2230816).
      FT /FTId=VAR_056535.
      FT VARIANT 262 262 T -> R (in dbSNP:rs9269744).
      FT /FTId=VAR_056536.
      SQ SEQUENCE 266 AA; 30004 MW; D452D1C3A75CEA31 CRC64;
      (indented) MVCLRLPGGS CMAVLTVTLM VLSSPLALAG DTRPRFLEYS TGECYFFNGT ERVRFLDRYF
      (indented) YNQEEYVRFD SDVGEYRAVT ELGRPSAEYW NSQKDFLEDR RALVDTYCRH NYGVGESFTV
      (indented) QRRVHPKVTV YPSKTQPLQH HNLLVCSVSG FYPGSIEVRW FRNGQEEKTG VVSTGLIHNG
      (indented) DWTFQTLVML ETVPRSGEVY TCQVEHPSVT SPLTVEWSAR SESAQSKMLS GVGGFVLGLL
      (indented) FLGAGLFIYF RNQKGHSGLQ PTGFLS
      //
      ID A16A1_HUMAN Reviewed; 802 AA.
      AC Q8IZ83; Q86YF0; Q8IYL4; Q8TEI8;
      DT 15-JAN-2008, integrated into UniProtKB/Swiss-Prot.
      DT 01-MAR-2003, sequence version 1.
      DT 28-JUL-2009, entry version 49.
      DE RecName: Full=Aldehyde dehydrogenase family 16

      p.s (indented=there is nothing there originally it should be blank but the forum makes all the lines start from the beginning)

      Here is the thing: the text file contains all sorts of lines, grouped into entries that each end with "//", as you can see. I need to get the indented lines that come after the line starting with "SQ", split them, and get rid of the spaces in between, and then store the result in the zeroth index of an array. There are currently 25 of those indented parts in my file, but there might be more. Here is what I have done so far: I created a method that reads the file and gets the very first line after each "SQ" line, but I can't get the rest, concatenate them, and store them in the first element of the array. My method returns this array. Here is my code:

      public static String [] sqFinder(int length, String readfile) throws FileNotFoundException{

                String [] sq= new String [length];
           boolean b=true;
                Scanner readline=new Scanner(new File(readfile));
      int count=0;
                while (readline.hasNextLine())
                {
                     String line=readline.nextLine();
                     if(b==true&&line.startsWith("SQ")){
                     b=false;
                     
                     String line1=readline.nextLine();
                     if(line1.startsWith(" "))
                     {
                          
                          System.out.println(line1);
                     b=true;
                     sq[count]=line1;
                     count++;
                     }
                
                     }
      }     
                     return sq;
                     


      Here is the instructions again;

      The id given by uniprot to this protein is P69905. This id is called an accession number
      (AC). The Uniprot knowledgebase is available in a variety of formats, including a simple
      flat file format that you are going to parse. Part of the entry for the hemoglobin alpha globin is
      as follows:
      ID HBA_HUMAN Reviewed; 142 AA.
      AC P69905; P01922; Q1HDT5; Q3MIF5; Q53F97; Q96KF1; Q9NYR7; Q9UCM0;
      DT 21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
      DT 23-JAN-2007, sequence version 2.
      DT 28-JUL-2009, entry version 75.
      DE RecName: Full=Hemoglobin subunit alpha;
      DE AltName: Full=Hemoglobin alpha chain;
      DE AltName: Full=Alpha-globin;
      GN Name=HBA1;
      GN and
      GN Name=HBA2;
      OS Homo sapiens (Human).
      OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
      OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
      OC Catarrhini; Hominidae; Homo.
      OX NCBI_TaxID=9606;
      ....
      SQ SEQUENCE 142 AA; 15258 MW; 15E13666573BBBAE CRC64;
      MVLSPADKTN VKAAWGKVGA HAGEYGAEAL ERMFLSFPTT KTYFPHFDLS HGSAQVKGHG
      KKVADALTNA VAHVDDMPNA LSALSDLHAH KLRVDPVNFK LLSHCLLVTL AAHLPAEFTP
      AVHASLDKFL ASVSTVLTSK YR
      //
      Things you should know about this format:
      Each type of information is specified by a two-letter code at the beginning of the line.
      The accession number for this protein is found in the AC line (and it does not include
      the semi-colon). Note that the protein has several accession numbers, and the first one
      is the primary accession number.
      The sequence is found in the lines following SQ line until the end-of-entry terminator
      (//). Spaces are not part of the sequence, and are there only for human readability.
      Each entry is terminated by //.
      You can find the complete Uniprot entry for this protein at http://www.uniprot.org/uniprot/P69905.
      The Uniprot knowledgebase is also available in XML, which is a modern,
      generic language for representing data.
      Your task is to parse a Uniprot flat file containing entries for an unknown number of
      proteins (currently the database contains information on 495880 proteins coming from 5208
      species). For each protein you will need to extract its primary accession number and sequence.
      The data that you extracted then needs to be written to another file which contains
      the accession numbers and sequences, and should have the following format:
      -header1
      sequence1
      -header2
      sequence2


      So I got the part with accession numbers but not the sequences.

      Methods: your program needs to have at least two methods in addition to the Main
      method: a method for extracting the accession numbers and sequences out of the
      Uniprot file. This method needs to return the data as two arrays (think how to do
      that, since a method can only have a single return value). Also, you don't know in
      advance how many proteins are represented in the Uniprot file, so in order to define
      arrays of the right size you need to count the number of entries in the file. This is the
      second method your program needs to have. Java has data structures that can grow
      dynamically, but you shouldn't use them.
      ....

      THANKS.

      Edited by: granum on Sep 4, 2009 2:13 PM

      Edited by: granum on Sep 4, 2009 2:16 PM
        • 1. Re: parsing a text file.
          843789
          If you need to parse a whitespace delimited line into a String array then it's best to go with the string.split(delimiter) method.
          The line "(indented) ABC ABC ABC" would give "ABC", "ABC", "ABC".
          I'm not sure if the starting whitespace affects the parsing, but you might want to do string.trim anyway.
          String line = ...
          String[] tokens = line.trim().split("\\s+");
          Where "\\s" means whitespace and the "+" is a regex operator meaning "1 or more".
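For the sequence lines in the original question, the same trim-and-split idea can be taken one step further by joining the tokens back together with nothing between them. This is only a minimal sketch: the class and method names (SequenceLineDemo, tokens, compact) are made up for illustration, and String.join assumes Java 8 or later.

```java
import java.util.Arrays;

final class SequenceLineDemo {

    /** Splits a whitespace-delimited line into tokens, ignoring leading and trailing blanks. */
    static String[] tokens(String line) {
        return line.trim().split("\\s+");
    }

    /** Joins the tokens of a line back together with nothing in between. */
    static String compact(String line) {
        return String.join("", tokens(line));
    }

    public static void main(String[] args) {
        // A sequence line as it appears in the file: leading blanks, blocks of ten residues.
        String line = "     MVLSPADKTN VKAAWGKVGA HAGEYGAEAL";
        System.out.println(Arrays.toString(tokens(line)));
        System.out.println(compact(line));
    }
}
```

If only the compacted string is needed, line.replaceAll("\\s+", "") does the same job without the intermediate array.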
          • 2. Re: parsing a text file.
            3004
            granum wrote:
            p.s (indented=there is nothing there originally it should be blank but the forum makes all the lines start from the beginning)
            Use the CODE button or [code] and [/code] tags to preserve spacing and alignment. You'll have to re-paste from the original source, as the one here has already lost all that.
            • 3. Re: parsing a text file.
              3004
              TuringPest wrote:
              If you need to parse a whitespace delimited line
              Oh, is that all he needs to do? I'm far too lazy to read all that.
              • 4. Re: parsing a text file.
                843789
                jverd wrote:
                TuringPest wrote:
                If you need to parse a whitespace delimited line
                Oh, is that all he needs to do? I'm far too lazy to read all that.
                Well I didn't read past the 2nd sentence, so we'll see. ;)
                • 5. Re: parsing a text file.
                  843789
                  OK, here is the actual homework. The due date has already passed, so you guys can help me figure out what I am missing.

                  http://www.cs.colostate.edu/~asa/courses/cs161/fall09/assignments/assignment1.pdf

                  The part that I can't get is the sequences. This is the text file that our program should work on:

                  http://www.cs.colostate.edu/~asa/courses/cs161/fall09/assignments/uniprot_human_sample.dat

                  Here is what I have done so far;

                  import java.io.*;
                  import java.util.*;
                  import java.io.BufferedWriter;
                  import java.io.FileWriter;
                  import java.io.IOException;
                  
                  public class UniprotParser {
                  
                       
                  
                  
                       /// main method
                       public static void main(String[] args) throws IOException {
                  
                            String uniprot=args[0];
                            String fasta=args[1];
                            int entryno = entryFinder(uniprot);
                            String [] accession=accFinder(entryno,uniprot); 
                            String [] sequence=sqFinder(entryno,uniprot);
                  
                  
                            /// prints out the primary accession numbers and the corresponding sequences.
                            for (int i=0; i<entryno; i++){
                                 System.out.println(">"+accession[i]);
                                 System.out.println(sequence[i]);
                            }

                            /// writes the output to the file named by "fasta"
                            FileWriter fstream = new FileWriter(fasta);
                            BufferedWriter out = new BufferedWriter(fstream);


                            out.write("there will be accession numbers here");
                            out.newLine();
                            out.write("there will be sequences here ");
                            out.close();

                       }


                       /// this method finds out how many entries there are in the file.
                       public static int entryFinder(String readfile)throws FileNotFoundException{

                            int sqnumber=0;
                            Scanner readline=new Scanner(new File(readfile));

                            while (readline.hasNextLine())
                            {
                                 String line=readline.nextLine();
                                 String [] seq= line.split(" ");

                                 if (seq[0].equals("SQ")){
                                      sqnumber++; }
                            }     
                            readline.close();

                            return sqnumber;
                       }


                       /// this method gets the primary accession numbers.
                       public static String [] accFinder(int length, String readfile) throws FileNotFoundException{

                            String [] acc= new String [length];
                            boolean b=true;
                            String strg="";
                            int count=0;

                            Scanner readline=new Scanner(new File(readfile));
                            while (readline.hasNextLine())
                            {

                                 String line=readline.nextLine();
                                 String []spltline=line.split(" ");
                                 if (b==true&&spltline[0].equals("AC"))
                                 {

                                      String [] c=spltline[3].split(";");

                                      b=false;

                                      strg=c[0];

                                      acc[count]=strg;
                                      count++;
                                 }

                                 if (b==false&&spltline[0].equals("//"))
                                 {
                                      b=true;
                                 }
                            }
                            readline.close();
                            return acc;
                       }


                       /// this method gets the sequences assigned for each primary accession number.
                       public static String [] sqFinder(int length, String readfile) throws FileNotFoundException{

                            String [] sq= new String [length];
                       boolean b=true;
                            Scanner readline=new Scanner(new File(readfile));
                  int count=0;
                            while (readline.hasNextLine())
                            {
                                 String line=readline.nextLine();
                                 if(b==true&&line.startsWith("SQ")){
                                 b=false;
                                 
                                 String line1=readline.nextLine();
                                 if(line1.startsWith(" "))
                                 {
                                      
                                      //System.out.println(line1);
                                 b=true;
                                 sq[count]=line1;
                                 count++;
                                 }
                            
                                 }






                                 
                            }     
                                 return sq;
                                 
                                 

                            }



                  }
                  Edited by: granum on Sep 4, 2009 2:47 PM
                  • 6. KISS IT!
                    800308
                    granum,

                    It took me a while to work out the issues with your implementation. Generally you're working pretty much in the right direction, within the constraints you were given.

                    Firstly, I feel I need to offer a critique of the lecturer's "requirements"... The technique of creating two or more arrays to store the "fields" of one "row" is called "parallel arrays", and it's considered bad practice... google it for an explanation of the issues, and the alternative approaches.

                    Secondly, your code has a number of "style issues" including meaningless variable (boolean b=false; don't tell me sheeite!) and method names (WTF is an accFinder FFS!), and formatting (esp indentation), blank lines, and inconsistent brace placement. These "niceties" are important to readability of your code; and you are asking us to read your code in order to help you; hence we expect that you invest some time and effort into getting these basics right. Reading badly "styled" code is a pain in the arse, so help us help you, and fix it. To make myself crystal clear, If I was your university lecturer, and you handed this code in to me, I would summarily dismiss your submission with the comment "Unmaintainable. 1/10. EPIC Fail.", regardless of whether or not it works.

                    Thirdly, you are over-complicating your sqFinder (another "oh FFS" name, BTW) method with that extra nextLine() call. From experience, I find it's better/simpler to use one and only one readline in a "read loop", because it simplifies the exit-loop-test (yours is badly broken if the input-file deviates from the expected format).

                    *Pseudocode for my readSequencesFromUniprotFile*
                    for each line read from the input file do
                        if the line starts with "SQ " then it's start of a new sequence, right?, so create a new sequence.
                        otherwise if the line starts with "     " (5 spaces) then this is the continuation of the current sequence, so append it to the sequence.
                        otherwise if the line equals "//" then it's the end of the current sequence, so add the current sequence to the list of sequences.
                        otherwise just ignore this line.
                    next line
                    Tip: Store the "current sequence" in a StringBuilder because it's more efficient than appending strings using the + or += operators.
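The pseudocode above could be sketched in Java roughly as follows. This is one possible reading, not a definitive implementation: the names SequenceReader and readSequences are made up, the entry count is assumed to come from a separate counting pass (as the assignment asks), and the extra "current != null" guard is an addition so that stray indented lines outside an SQ block are ignored. It strips the blanks from each sequence line before appending, and uses a fixed-size array because the assignment rules out growable collections.

```java
import java.util.Scanner;

final class SequenceReader {

    /**
     * Collects one sequence per entry. entryCount is assumed to come from a
     * separate counting pass over the same file.
     */
    static String[] readSequences(Scanner in, int entryCount) {
        String[] sequences = new String[entryCount];
        int count = 0;
        StringBuilder current = null; // null while we are outside an SQ block

        while (in.hasNextLine()) {
            String line = in.nextLine(); // the only read call in the loop
            if (line.startsWith("SQ ")) {
                // Start of a new sequence.
                current = new StringBuilder();
            } else if (current != null && line.startsWith("     ")) {
                // Continuation line: drop the blanks and append.
                current.append(line.replaceAll("\\s+", ""));
            } else if (line.trim().equals("//")) {
                // End of entry: store the finished sequence, if there is one.
                if (current != null && count < entryCount) {
                    sequences[count++] = current.toString();
                }
                current = null;
            }
            // Every other line is ignored.
        }
        return sequences;
    }

    public static void main(String[] args) {
        // Shortened sample built from lines in the original question.
        String sample =
              "AC   P69905; P01922;\n"
            + "SQ   SEQUENCE   142 AA;  15258 MW;  15E13666573BBBAE CRC64;\n"
            + "     MVLSPADKTN VKAAWGKVGA\n"
            + "     AVHASLDKFL YR\n"
            + "//\n";
        for (String sequence : readSequences(new Scanner(sample), 1)) {
            System.out.println(sequence);
        }
    }
}
```

To run it against a real file, pass new Scanner(new File(path)) instead of a Scanner over a string, and close the Scanner afterwards.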

                    Also I'd eliminate those split calls. As a general rule of thumb, don't attempt to "parse" a string unless you're already pretty sure that the string is in the expected format... in this case that means "Just use String.startsWith(String) to test if this is a line-of-interest." I'd be tempted to use split on the AC line, once I'd established that I was dealing with an AC line, but I would NOT split and then check if the first field equals "AC"... Q: What if a line in the file was blank? A: ArrayIndexOutOfBoundsException.
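A minimal sketch of that "test first, split second" idea for the AC line is below. The names AccessionLineDemo and primaryAccession are hypothetical, and returning null for lines that are not usable AC lines is just one possible convention.

```java
final class AccessionLineDemo {

    /**
     * Returns the primary (first) accession number from an AC line,
     * or null if the line is not an AC line or carries no accession.
     */
    static String primaryAccession(String line) {
        // Cheap test first: blank lines and other line types never reach split.
        if (!line.startsWith("AC ")) {
            return null;
        }
        // Drop the two-letter code, then split the rest on semicolons.
        String[] ids = line.substring(2).trim().split(";");
        // split can return an empty array (e.g. for ";"), so check before indexing.
        if (ids.length == 0 || ids[0].trim().isEmpty()) {
            return null;
        }
        return ids[0].trim();
    }

    public static void main(String[] args) {
        System.out.println(primaryAccession("AC   P69905; P01922; Q1HDT5;"));
        System.out.println(primaryAccession(""));
    }
}
```

Because the line type is checked before anything is indexed, a blank line simply yields null instead of throwing.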

                    Cheers. Keith.
                    • 7. Re: parsing a text file.
                      843789
                      Here is an alternate solution.

                      You have an input text file at /input.txt. You want to extract lines

                      1. starting at the line after the line that begins with SQ
                      2. ending at the line before the line that contains just //


                      Here is a script.



                      # Script SQ.txt
                      # Read file in.
                      var str input ; cat "/input.txt" > $input
                      # Strip off all lines up to (and including) the 'SQ' line.
                      stex -c -r "^\nSQ&\n^]" $input > null
                      # Strip off all lines starting with (and including) the '//' line.
                      stex "[^\n//\n^" $input > null
                      # $input only has the lines you want. Print them,
                      # or, do something else with them.
                      echo $input
                      Script is in biterscripting ( [http://www.biterscripting.com] ). To try, save the script as /Scripts/SQ.txt, enter the following command in biterscripting.

                      script "/Scripts/SQ.txt"
                      Script can also be called directly from any other language with the following command.

                      "/biterScripting/biterScripting" "/Scripts/SQ.txt"
                      Sen
                      • 8. Re: parsing a text file.
                        Darryl Burke
                        ScriptingTeacher, this is a forum for the Java programming language, not an advertising billboard for your non-Java solution. I'm blocking your post and the similar one you posted in February which elicited a question to which you didn't respond:
                        kajbj wrote:

                        In what way is that related to Java programming?

                        Kaj
                        Any further off topic posts and your account will be blocked.

                        db