3 Replies. Latest reply: Feb 13, 2013 3:25 AM by 990573

Simple Regex problem with symbols

990573 Newbie
I need to split a text and keep only words, numbers, and hyphenated compound words. I also need to keep Latin words with accented letters, so I used \p{L}, which matches é, ú, ü, ã, and so forth. The example is:

// The double quote and the backslash inside the literal must be escaped for this to compile.
String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! @ # $ % ^& * ( ) + - _ #$% \" ' : ; > < / \\ | , here some is wrong… * + () e -";

Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String words[] = pattern.split( myText );

What is wrong with this regex? Why does it match symbols like "(", "+", "-", "*" and "|"?

Some of the results are:

dresse // OK
sud-est // OK
occident) // WRONG
987 // OK
() // WRONG
(a // WRONG
* // WRONG
- // WRONG
+ // WRONG
( // WRONG
| // WRONG
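
One way to see where the stray symbols come from is to test single characters against the separator pattern. This is only a minimal sketch, assuming the pattern is exactly the one posted above; the class name SeparatorClassDemo and the helper isSeparator are made up for illustration. Inside [ ... ] the characters +, (, ), * and - are treated as literal members of the set, not as quantifiers or grouping, so the negated class refuses to treat them as separators and they survive the split.

```java
import java.util.regex.Pattern;

public class SeparatorClassDemo {
    // The separator pattern from the post, unchanged.
    static final Pattern SEPARATOR =
            Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");

    // Hypothetical helper: true if the whole string is consumed as a separator.
    static boolean isSeparator(String s) {
        return SEPARATOR.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Symbols listed literally inside the class are NOT separators,
        // while symbols that are absent from the class are.
        for (String s : new String[] {"(", ")", "+", "-", "*", ",", "?", " "}) {
            System.out.println("'" + s + "' separator? " + isSeparator(s));
        }
    }
}
```

With this pattern, "(", ")", "+", "-" and "*" report false, while ",", "?" and a space report true.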

P.S.:
This is how I intended the regex to be read:

[^\p{L}+(\-\p{L}+)*\d]+

* The word separator will be:
* [^  ...  ] any sequence that is not:
* \p{L}+ one or more Latin letters
* (\-\p{L}+)* optionally followed by hyphenated parts
* \d or digits
* [ ... ]+ repeated one or more times.
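
The reading above describes a sequence (letters, then optional hyphenated parts), but a character class can only describe a set of single characters. A possible alternative, sketched here under the assumption that the goal is simply to collect the tokens, is to match the words themselves with Matcher.find() instead of describing the separators for split(). The class name WordTokenizer and the method tokens are hypothetical, and the sample string is a shortened version of the text in the post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordTokenizer {
    // A token is a run of letters with optional hyphen-joined letter runs,
    // or a run of digits. Outside a character class, + and * are quantifiers
    // and (?: ... ) is a non-capturing group.
    static final Pattern TOKEN =
            Pattern.compile("\\p{L}+(?:-\\p{L}+)*|\\d+");

    // Collects every token found in the text, in order of appearance.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00ee is the letter i with circumflex, written as an escape so the
        // source compiles regardless of file encoding.
        String sample = "? 987 (sud-est de l'\u00eele) * + () e -";
        System.out.println(tokens(sample));
    }
}
```

For the sample string this yields 987, sud-est, de, l, île and e. Note that the apostrophe splits l'île into two tokens; whether that is wanted depends on the use case, and the pattern would need an extra alternative to keep elided forms together.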
