3 Replies. Latest reply: Feb 13, 2013 5:25 AM by 990573

    Simple Regex problem with symbols

    990573
      I need to split a text and keep only words, numbers, and hyphenated compound words. I also need to keep Latin words with accented characters, so I used \p{L}, which matches é, ú, ü, ã, and so forth. The example is:

      String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! @ # $ % ^& * ( ) + - _ #$% \" ' : ; > < / \\ | , here some is wrong… * + () e -";

      Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
      String words[] = pattern.split( myText );

      What is wrong with this regex? Why do symbols like "(", "+", "-", "*" and "|" end up in the results?

      Some of the results are:

      dresse // OK
      sud-est // OK
      occident) // WRONG
      987 // OK
      () // WRONG
      (a // WRONG
      * // WRONG
      - // WRONG
      + // WRONG
      ( // WRONG
      | // WRONG
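
A likely explanation for most of these results, offered as a sketch rather than a definitive diagnosis: inside square brackets, `+`, `(`, `)` and `*` are not quantifiers or grouping operators but literal characters. The negated class therefore reads as "any character that is not a letter, `+`, `(`, `-`, `)`, `*` or a digit", so those symbols are never treated as separators and stay attached to the tokens. The class name `CharClassDemo` and the helper `splitWithOriginal` are illustrative names, not part of the original post; the pattern itself is copied unchanged.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class CharClassDemo {
    // The pattern from the question, unchanged. Everything between [^ and ]
    // is a set of single characters, not a sequence with quantifiers.
    static final Pattern ORIGINAL = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");

    // Hypothetical helper: split the text using the original pattern.
    static String[] splitWithOriginal(String text) {
        return ORIGINAL.split(text);
    }

    public static void main(String[] args) {
        // ")" is listed inside the class, so it is not a separator here.
        System.out.println(Arrays.toString(splitWithOriginal("occident) : ! sud-est")));
        // prints [occident), sud-est]

        // "*", "+", "(" and ")" survive as tokens for the same reason.
        System.out.println(Arrays.toString(splitWithOriginal("a * b + (c)")));
        // prints [a, *, b, +, (c)]
    }
}
```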

      P.S.:
      My intended reading of the regex is:

      [^\p{L}+(\-\p{L}+)*\d]+

      * Word separator will be:
      * [^  ...  ] No sequence in:
      * \p{L}+ Any letter (including accented Latin letters), one or more
      * (\-\p{L}+)* Optionally hyphenated
      * \d or numbers
      * [ ... ]+ once or more.
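
One way to get the behaviour described in this breakdown, sketched under the assumption that matching the wanted tokens is acceptable instead of splitting on the unwanted ones: a sequence such as "letters, optionally followed by hyphen plus letters" cannot be expressed inside a character class, but it can be written as an ordinary pattern and collected with `Matcher.find()`. The class `WordExtractor`, the method `extractTokens` and the token pattern below are illustrative assumptions, not code from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // Assumed token definition: letters optionally joined by single hyphens,
    // or a run of digits. (?: ... ) is a non-capturing group.
    static final Pattern TOKEN = Pattern.compile("\\p{L}+(?:-\\p{L}+)*|\\d+");

    // Hypothetical helper: collect every token the pattern finds, in order.
    static List<String> extractTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u00ee is the letter î, escaped so the source file stays ASCII.
        String sample = "l'\u00eele (Notre-Dame) en 1330 : + - * | sud-est";
        // Six tokens are found: l, île, Notre-Dame, en, 1330, sud-est
        System.out.println(extractTokens(sample));
    }
}
```

Note that with this pattern an apostrophe ends a token, so "l'île" yields "l" and "île" separately, and a hyphen that is not between two letters is dropped. Whether that is the desired behaviour depends on the requirements.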