1 Reply. Latest reply: Feb 19, 2009 11:18 PM by EJP

    Problem with parsing XML with SAX

      I am parsing one of the large files distributed by the Wikimedia Foundation. It is a dictionary, and I would like to extract the relevant definitions. However, I am not able to get the entire string, only roughly the first 100 characters of it.
      public void characters(char[] ch, int start, int length) {
          if (title) {
              // this is the title
              titleString = new String(ch, start, length);
              for (int i = 0; i < wordsArray.length; i++) {
                  // Watch out: the stupidest whitespace at the end of the string!
                  if (wordsArray[i].equalsIgnoreCase(titleString + " ")) {
                      System.out.println("discover title: " + titleString);
                      // assumed: remember which word matched
                      whichWord = i;
                  }
              }
          } else if (text) {
              // this is the text
              if (whichWord != -1) {
                  defString = new String(ch, start, length);
                  defArray[whichWord] = defString;
              }
          }
      }
      Inside defString I get something like this:
      whereas it should be:
      {{wikipedia|dab=free}} {{also|-free}} ==English== {{rank|became|second|United|351|free|return|call|speak}} ===Etymology=== {{etyl|enm}} {{term|fre||lang=enm}} < {{etyl|ang}} {{term|freo|frēo|lang=ang}}. ===Pronunciation=== * {{IPA|/fɹiː/}}, {{SAMPA|/fri:/}} * {{audio|en-us-free.ogg|Audio (US)}} *: {{rhymes|iː}} ===Adjective=== {{en-adj|freer|freest}} # Not [[imprisoned]] or [[enslaved]]. #: ''a '''free''' [[man]]'' # Obtainable without [[payment]]. #: ''All drinks are '''free''''' <!--#:''free of [[charge]]'' this example not unambiguous; could be the adverb--> # [[unconstrained|Unconstrained]]. #: ''He was given '''free''' rein to do whatever he wanted'' <!--"rein", not "reign"--> # {{mathematics}} [[unconstrained|Unconstrained]]. #: ''The '''free''' group on three generators'' # Unobstructed, without [[blockage]]s. #: ''the drain was '''free''''' # Not in use #: ''go sit on this chair, it's '''free''''' # Without [[obligation]]s. #: '''''free''' time'' # {{software}} With very few [[limitations]] on distribution or improvement compared to [[proprietary software]]. #: ''[[free software]]'' # Without; not containing (what is specified). #: ''We had a wholesome, filling meal, '''free''' of meat'' # {{programming}} Of [[identifier]]s, not [[bound]]. # {{mycology}} Not attached to the [[stipe]]. 
#: ''In this group of mushrooms, the gills are '''free'''.'' ====Synonyms==== * {{sense|not imprisoned or enslaved}} * {{sense|obtainable without payment}} [[free of charge]], [[gratis]] * {{sense|unconstrained}} [[unconstrained]], [[unfettered]], [[unhindered]] * {{sense|mathematics: unconstrained}} * {{sense|unobstructed}} [[clear]], [[unobstructed]] * {{sense|without obligations}} * {{sense|software: with very few limitations on distribution or improvement}} [[libre]] * {{sense|without, not containing}} [[without]] * {{sense|of identifiers, not bound}} [[unbound]] * {{sense|mycology: not attached to the stipe}} ====Antonyms==== * {{sense|not imprisoned or enslaved}} [[bound]], [[enslaved]], [[imprisoned]] * {{sense|unconstrained}} [[constrained]], [[restricted]] * {{sense|unobstructed}} [[blocked]], [[obstructed]] * {{sense|of identifiers, not bound}} [[bound]] ====Derived terms==== {{rel-top|Terms derived from ''free''}} * [[-free]] <!--Terms such as "tax-free" should be added to the page for "-free"--> * [[free Abelian group]]<!--UK spelling-->, [[free abelian group]]<!--US spelling--> * [[free algebra]] * [[free as a bird]] * [[freeball]] * [[freebooter]] * [[free fall]] * [[free group]] * [[freelance]] * [[freeloader]] * [[Freemason]] * [[free module]] {{rel-mid}} * [[free object]] * [[free of charge]] * [[free rein]]<!--Note: this is the only correct spelling of this expression--> * [[free semigroup]] * [[free-thinker]] * [[free time]] * [[free variable]] * [[freeware]] * [[freewheel]] * [[free will]] * [[unfree]] {{rel-bottom}} ====Related terms==== * [[freedom]]<!--from Old English--> * [[friend]] 
      Can you tell me what is wrong with my code? I am relatively new to Java.
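
      One likely cause, going by the SAX ContentHandler contract: a parser is allowed to deliver the text of a single element in several characters() calls, so assigning a new String in each call keeps only one chunk of the text. The sketch below appends every chunk to a StringBuilder and reads the result in endElement(). It is only an illustration under assumptions: the class name DefinitionCollector, the extract() helper and the sample XML are made up for this example, and the element names title and text are taken from the flags used in the code above.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: the class name and the element names are assumptions.
final class DefinitionCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final Map<String, String> definitions = new LinkedHashMap<>();
    private String currentTitle;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // A new element starts: discard whatever was collected before it.
        buffer.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append instead of assigning.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only at this point is the text of the element complete.
        if ("title".equals(qName)) {
            currentTitle = buffer.toString().trim();
        } else if ("text".equals(qName) && currentTitle != null) {
            definitions.put(currentTitle, buffer.toString());
        }
        buffer.setLength(0);
    }

    // Convenience helper for the example: parses an XML string, returns title -> text.
    static Map<String, String> extract(String xml) {
        try {
            DefinitionCollector handler = new DefinitionCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.definitions;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<page><title>free</title>"
                + "<text>Not [[imprisoned]] &amp; not [[enslaved]]</text></page>";
        System.out.println(extract(xml));
    }
}
```

      For the real dump you would pass a file or stream to parse() instead of a string and keep the lookup in wordsArray inside endElement(). Trimming the title there may also remove the need to append " " before comparing.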