2 Replies. Latest reply: Jan 19, 2011 12:52 AM by 802340

RegEx issue

832304 Newbie
Hi,

I am struggling with a regular expression. I am trying to split a paragraph into sentences. I tried using the regex [?!.] as the delimiter, but it also breaks the sentence at decimal points.

e.g. I am using iphone 3.2 but I am not happy with it.

This sentence was broken down into

I am using iphone 3
2 but I am not happy with it

I then tried the regex [.!?]^[[0-9].[0-9]] to exclude every '.' that has a digit before and after it, but it still failed. I have just started learning regex, so I am not sure what the right way to do this is. Any suggestions?

Thanks.
  • 1. Re: RegEx issue
    sabre150 Expert
    Assuming that the punctuation character will always be followed by one or more whitespace characters, use a 'positive lookahead' to require that.
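
    A minimal sketch of that lookahead idea in Java, assuming sentences end in '.', '!' or '?' followed by whitespace or the end of the text. The class and method names are made up for illustration, and the exact pattern is one possible reading of the suggestion, not necessarily the one sabre150 had in mind:

```java
import java.util.ArrayList;
import java.util.List;

class LookaheadSplit {
    // Hypothetical helper: split on '.', '!' or '?' only when the next
    // character is whitespace or the input ends. The (?=...) lookahead checks
    // what follows without consuming it, so the '.' in "3.2" is left alone.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.split("[.!?](?=\\s|$)")) {
            String s = part.trim();   // drop the whitespace left at the start
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "I am using iphone 3.2 but I am not happy with it. Is it slow? Yes!";
        for (String s : sentences(text)) {
            System.out.println(s);
        }
    }
}
```

    Note that split(...) discards the delimiter, so the closing punctuation is not part of the returned sentences.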
  • 2. Re: RegEx issue
    802340 Newbie
    As sabre150 already suggested, you could split on one of [.?!] followed directly by a whitespace character, but tokenizing English (or any other human language) cannot be done reliably with a simple split(...).

    Consider the sentence:
    Imagine mr. H. G. Wells using an iPhone 3.2.
    (note the periods after `mr` and in the initials: each is followed by whitespace, so a split on "punctuation plus whitespace" would break there too)
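
    To make the problem concrete, here is a small sketch that runs a "punctuation followed by whitespace" split over that sentence. The class and method names are hypothetical, and the pattern is just one plausible version of such a split:

```java
import java.util.ArrayList;
import java.util.List;

class AbbreviationPitfall {
    // Hypothetical naive splitter: break at '.', '!' or '?' when followed by
    // whitespace or the end of the input, then trim each fragment.
    static List<String> naiveSplit(String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.split("[.!?](?=\\s|$)")) {
            String s = part.trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One sentence goes in, four fragments come out:
        // "Imagine mr", "H", "G", "Wells using an iPhone 3.2"
        for (String s : naiveSplit("Imagine mr. H. G. Wells using an iPhone 3.2.")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

    The decimal in "3.2" survives, but the abbreviation and the initials are still treated as sentence ends, which is why a plain split(...) is not enough for real text.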
