7 Replies Latest reply: Jan 4, 2008 6:08 AM by 800282

    Using regex... problems with groups

    807603
      Ok, I am trying to extract information from HTML code taken from reddit.com. My lecturer has stated he wants us to turn:

      onclick="return morechildren(this, 't3_2mg72', ['c2mo4k', 'c2mn1u', 'c2mmsp', 'c2mmrw', 'c2mm6d', 'c2mlbx', 'c2ml08', 'c2mkv5', 'c2mkqb', 'c2mkn3', 'c2mk92', 'c2mjwd', 'c2mjt7', 'c2mjsu', 'c2mjs2', 'c2mjcy', 'c2mj9p', 'c2mj4b', 'c2mil2', 'c2mih8', 'c2mhxe', 'c2mhl0', 'c2mgvj', 'c2mh6r', 'c2mhhf', 'c2mh4u', 'c2mh5t', 'c2mhal', 'c2mhq2', 'c2mhak', 'c2mhi6', 'c2mg9f', 'c2miax', 'c2mgrh', 'c2mgx8', 'c2mgyc', 'c2mhac', 'c2mhn5', 'c2mgt1', 'c2mi7p', 'c2mic3', 'c2mlkh', 'c2mix7', 'c2mkz2', 'c2mjec', 'c2mh1z', 'c2mklc', 'c2mi1r', 'c2mi38', 'c2mm36', 'c2mkcw', 'c2mlbr', 'c2mldz', 'c2mle6', 'c2mlk3', 'c2mqxe', 'c2mta0'], 0)">load more comments

      into:

      Thread: t3_2mg72
      Posts: c2mn1u, c2mmsp, c2mmrw, etc etc..............

      I used:

      String t="onclick=\"return morechildren\\(this, '(t3_2mg72)', \\[(('c2m\\w{3}',?\\s?)+)\\], ([0-9])\\)\">load more comments";

      and

      System.out.println("Thread: "+cmatcher.group(1));
      System.out.println("Posts: "+cmatcher.group(2));

      And I get:

      Thread: t3_2mg72
      Posts: 'c2mo4k', 'c2mn1u', 'c2mmsp', 'c2mmrw', 'c2mm6d', 'c2mlbx', 'c2ml08', 'c2mkv5', 'c2mkqb', 'c2mkn3', 'c2mk92', 'c2mjwd', 'c2mjt7', 'c2mjsu', 'c2mjs2', 'c2mjcy', 'c2mj9p', 'c2mj4b', 'c2mil2', 'c2mih8', 'c2mhxe', 'c2mhl0', 'c2mgvj', 'c2mh6r', 'c2mhhf', 'c2mh4u', 'c2mh5t', 'c2mhal', 'c2mhq2', 'c2mhak', 'c2mhi6', 'c2mg9f', 'c2miax', 'c2mgrh', 'c2mgx8', 'c2mgyc', 'c2mhac', 'c2mhn5', 'c2mgt1', 'c2mi7p', 'c2mic3', 'c2mlkh', 'c2mix7', 'c2mkz2', 'c2mjec', 'c2mh1z', 'c2mklc', 'c2mi1r', 'c2mi38', 'c2mm36', 'c2mkcw', 'c2mlbr', 'c2mldz', 'c2mle6', 'c2mlk3', 'c2mqxe', 'c2mta0'

      Now, I don't want all the 's in there. But I don't know how to avoid them. I need it widely grouped so that the + is included and I get all the codes instead of just one. Is there any kind of way to group the code within the ''s and then state that I want all of those bits within group 2, but not the outer bits?

      Hmm, I feel I am not making much sense! But don't know how to explain it any better. Hope someone knows what I am on about and can help... :)

      S xx
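A note on the "all the codes instead of just one" part of the question: in java.util.regex, a capturing group inside a repetition keeps only the text of its last iteration. The sketch below is a hypothetical illustration, not code from the thread; the class name, method name, and the shortened three-item list are assumptions. It moves the quotes outside an inner group and shows that the group then holds only the final id.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Group 1: the whole bracketed list (quotes included).
    // Group 2: one list item with its quotes and optional separator.
    // Group 3: the bare id, because the quotes sit outside its parentheses.
    private static final Pattern LIST =
        Pattern.compile("\\[(('(c2m\\w{3})',?\\s?)+)\\]");

    // Returns what group 3 holds after the match: only the last repetition.
    static String lastRepetition(String text) {
        Matcher m = LIST.matcher(text);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String sample = "['c2mo4k', 'c2mn1u', 'c2mmsp']";
        // Earlier iterations of group 3 are overwritten by later ones.
        System.out.println(lastRepetition(sample)); // prints c2mmsp
    }
}
```

So a single repeated group cannot return every id without the quotes; the ids have to be collected some other way, for example by post-processing group 2.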
        • 1. Re: Using regex... problems with groups
          800282
          eurythmic wrote:
          Ok, I am trying to extract information from HTML code taken from reddit.com. My lecturer has stated he wants us to turn:

          onclick="return morechildren(this, 't3_2mg72', ['c2mo4k', 'c2mn1u', 'c2mmsp', 'c2mmrw', 'c2mm6d', 'c2mlbx', 'c2ml08', 'c2mkv5', 'c2mkqb', 'c2mkn3', 'c2mk92', 'c2mjwd', 'c2mjt7', 'c2mjsu', 'c2mjs2', 'c2mjcy', 'c2mj9p', 'c2mj4b', 'c2mil2', 'c2mih8', 'c2mhxe', 'c2mhl0', 'c2mgvj', 'c2mh6r', 'c2mhhf', 'c2mh4u', 'c2mh5t', 'c2mhal', 'c2mhq2', 'c2mhak', 'c2mhi6', 'c2mg9f', 'c2miax', 'c2mgrh', 'c2mgx8', 'c2mgyc', 'c2mhac', 'c2mhn5', 'c2mgt1', 'c2mi7p', 'c2mic3', 'c2mlkh', 'c2mix7', 'c2mkz2', 'c2mjec', 'c2mh1z', 'c2mklc', 'c2mi1r', 'c2mi38', 'c2mm36', 'c2mkcw', 'c2mlbr', 'c2mldz', 'c2mle6', 'c2mlk3', 'c2mqxe', 'c2mta0'], 0)">load more comments

          into:

          Thread: t3_2mg72
          Posts: c2mn1u, c2mmsp, c2mmrw, etc etc..............
          What happened to "c2mo4k"?
          I used:

          String t="onclick=\"return morechildren\\(this, '(t3_2mg72)', \\[(('c2m\\w{3}',?\\s?)+)\\], ([0-9])\\)\">load more comments";

          and

          System.out.println("Thread: "+cmatcher.group(1));
          System.out.println("Posts: "+cmatcher.group(2));

          And I get:

          Thread: t3_2mg72
          Posts: 'c2mo4k', 'c2mn1u', 'c2mmsp', 'c2mmrw', 'c2mm6d', 'c2mlbx', 'c2ml08', 'c2mkv5', 'c2mkqb', 'c2mkn3', 'c2mk92', 'c2mjwd', 'c2mjt7', 'c2mjsu', 'c2mjs2', 'c2mjcy', 'c2mj9p', 'c2mj4b', 'c2mil2', 'c2mih8', 'c2mhxe', 'c2mhl0', 'c2mgvj', 'c2mh6r', 'c2mhhf', 'c2mh4u', 'c2mh5t', 'c2mhal', 'c2mhq2', 'c2mhak', 'c2mhi6', 'c2mg9f', 'c2miax', 'c2mgrh', 'c2mgx8', 'c2mgyc', 'c2mhac', 'c2mhn5', 'c2mgt1', 'c2mi7p', 'c2mic3', 'c2mlkh', 'c2mix7', 'c2mkz2', 'c2mjec', 'c2mh1z', 'c2mklc', 'c2mi1r', 'c2mi38', 'c2mm36', 'c2mkcw', 'c2mlbr', 'c2mldz', 'c2mle6', 'c2mlk3', 'c2mqxe', 'c2mta0'

          Now, I don't want all the 's in there. But I don't know how to avoid them. I need it widely grouped so that the + is included and I get all the codes instead of just one. Is there any kind of way to group the code within the ''s and then state that I want all of those bits within group 2, but not the outer bits?
          What do you mean by "I don't want all the 's in there"?
          • 2. Re: Using regex... problems with groups
            807603
            So you want to extract the thread as the quoted string just prior to the '[' and then all the values until the ']' BUT you want to get rid of the quotes. Is that right?
            • 3. Re: Using regex... problems with groups
              807603
              Yes, that's right. I pretty much have the answer required, I just want to eliminate all the quotes
              • 4. Re: Using regex... problems with groups
                800282
                sabre150 wrote:
                So you want to extract the thread as the quoted string just prior to the '[' and then all the values until the ']' BUT you want to get rid of the quotes. Is that right?
                Ah, I see: the OP wasn't talking about the character 's', but about the single quotes!
                • 5. Re: Using regex... problems with groups
                  800282
                  eurythmic wrote:
                  Yes, that's right. I pretty much have the answer required, I just want to eliminate all the quotes
                  Well, when matching the thread you placed the single quotes outside the parentheses...
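The same trick (quotes outside the parentheses) can be applied to the post ids with a second pass over the captured list. This is a minimal sketch under assumptions: the class and method names are hypothetical, the patterns are looser than the OP's (any word characters for the ids, any digits for the last argument), and the sample input is shortened to three ids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MoreChildrenParser {
    // Outer pattern: group 1 is the thread id (quotes outside the group),
    // group 2 is the raw contents of the [...] list, quotes included.
    private static final Pattern CALL = Pattern.compile(
        "morechildren\\(this, '(\\w+)', \\[([^\\]]*)\\], \\d+\\)");
    // Inner pattern: quotes outside the group, so group 1 is the bare id.
    private static final Pattern ID = Pattern.compile("'(\\w+)'");

    static String threadOf(String html) {
        Matcher m = CALL.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    static List<String> postsOf(String html) {
        List<String> posts = new ArrayList<>();
        Matcher m = CALL.matcher(html);
        if (m.find()) {
            // Second pass: find() walks the list one quoted id at a time.
            Matcher ids = ID.matcher(m.group(2));
            while (ids.find()) {
                posts.add(ids.group(1));
            }
        }
        return posts;
    }

    public static void main(String[] args) {
        String html = "onclick=\"return morechildren(this, 't3_2mg72', "
            + "['c2mo4k', 'c2mn1u', 'c2mmsp'], 0)\">load more comments";
        System.out.println("Thread: " + threadOf(html));
        System.out.println("Posts: " + String.join(", ", postsOf(html)));
    }
}
```

With the shortened sample this prints "Thread: t3_2mg72" and "Posts: c2mo4k, c2mn1u, c2mmsp". Collecting the ids into a list also makes it easy to change the separator later.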
                  • 6. Re: Using regex... problems with groups
                    807603
                    eurythmic wrote:
                    Yes, that's right. I pretty much have the answer required, I just want to eliminate all the quotes
                    Then do a
                    replaceAll("'","") 
                    on the result.
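For reference, the suggestion above might look like the following when applied to the string that group(2) returned. The class and method names are hypothetical and the input is shortened to three ids; only String.replaceAll itself comes from the thread.

```java
class StripQuotes {
    // Removes every single quote from the captured list.
    // replaceAll takes a regex, but "'" has no special meaning there,
    // so it is matched literally.
    static String stripQuotes(String rawPosts) {
        return rawPosts.replaceAll("'", "");
    }

    public static void main(String[] args) {
        // Stand-in for cmatcher.group(2) from the original post.
        String group2 = "'c2mo4k', 'c2mn1u', 'c2mmsp'";
        System.out.println("Posts: " + stripQuotes(group2));
        // prints Posts: c2mo4k, c2mn1u, c2mmsp
    }
}
```

This keeps the OP's single pattern unchanged and only cleans up the output, which is probably the smallest change that gives the requested format.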
                    • 7. Re: Using regex... problems with groups
                      807603
                      Thanks!
                      ~S~