2 Replies Latest reply: Oct 12, 2007 8:05 PM by 807605

    How to truncate a utf string to desired boundaries

    807605
      How do I truncate a UTF-8 string at appropriate boundaries?

      The string "ఆకాశ దేశాన" {Telugu, {c06,c15,c3e,c36,20,c26,c47,c36,c3e,c28}} when run through
      the following test
      @Test
      public void truncateString() {
          final String truncated = truncateStringUtf("ఆకాశ దేశాన", 2);
          System.out.println(truncated);
      }
      
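      static final String truncateStringUtf(final String s, final int len) {
          // Number of code points to keep, capped at what the string contains.
          final int codePoints = Math.min(len, s.codePointCount(0, s.length()));
          if (codePoints > 0) {
              // Convert the code point count into a char index so that a
              // surrogate pair is never split (codePointAt(i) takes a char
              // index, and casting its result to char loses supplementary
              // characters).
              final int end = s.offsetByCodePoints(0, codePoints);
              return s.substring(0, end) + "...";
          } else {
              return s;
          }
      }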
      gives the output "ఆక..." {c06,c15}.
      This is not a properly readable string; I expected "ఆకా", which is more complete.

      Further tests with the Unicode APIs of String and Character produced the following output:
      @Test
      public void normalize() {
          final String original = "ఆకాశ దేశాన";
          final int originalLength = original.length();
          System.out.println("originalLength = " + originalLength);
          final String normalized = Normalizer.normalize(original, 
              Normalizer.COMPOSE, originalLength);
          final int normalizedLength = normalized.length();
          System.out.println("normalizedLength= " + normalizedLength);
          final int codePoints = normalized.codePointCount(0, normalizedLength);
          System.out.println("# code points in normalized str = " + codePoints);
      }
      
      @Test
      public void codePoints() {
          final String sourceStr = "ఆకాశ దేశాన";
          final int length = sourceStr.length();
          System.out.println("sourceStr.length() = " + length);
          final int codePoints = sourceStr.codePointCount(0, length);
          System.out.println("sourceStr.codePointCount(" + '0' + ", " + length + ") = " + codePoints);
          final StringBuilder sb = new StringBuilder();
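          // Walk the string by char index, advancing by the width of each code point.
          for (int i = 0; i < length; ) {
              final int cp = sourceStr.codePointAt(i);
              System.out.println(Integer.toHexString(cp));
              sb.appendCodePoint(cp);
              i += Character.charCount(cp);
          }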
          assertEquals(sourceStr, sb.toString());
      }
      
      @Test
      public void testChars() {
          char c1 = '\u0c15', c2 = '\u0c3e';
          boolean isHighSurrogate = Character.isHighSurrogate(c1);
          boolean isLowSurrogate = Character.isLowSurrogate(c1);
          boolean isSupplementaryCodePoint = Character.isSupplementaryCodePoint(c1);
          System.out.println("" + Integer.toHexString(c1) + 
                                 "-> isHighSurrogate = " + isHighSurrogate + 
                                 ", isLowSurrogate = " + isLowSurrogate + 
                                 ", isSupplementaryCodePoint = " + isSupplementaryCodePoint);
          isHighSurrogate = Character.isHighSurrogate(c2);
          isLowSurrogate = Character.isLowSurrogate(c2);
          isSupplementaryCodePoint = Character.isSupplementaryCodePoint(c2);
          System.out.println("" + Integer.toHexString(c2) + 
                             "-> isHighSurrogate = " + isHighSurrogate + 
                             ", isLowSurrogate = " + isLowSurrogate + 
                             ", isSupplementaryCodePoint = " + isSupplementaryCodePoint);
          boolean isSurrogatePair = Character.isSurrogatePair(c1, c2);
          System.out.println("isSurrogatePair(" + Integer.toHexString(c1) + ", "
                             + Integer.toHexString(c2) + ") = "+ isSurrogatePair);
          isSurrogatePair = Character.isSurrogatePair(c2, c1);
          System.out.println("isSurrogatePair(" + Integer.toHexString(c2) + ", "
                             + Integer.toHexString(c1) + ") = "+ isSurrogatePair);
          String x = c1 + "" + c2;
          System.out.println(x);
      }
      Output:

      originalLength = 10
      normalizedLength= 10
      # code points in normalized str = 10

      sourceStr.length() = 10
      sourceStr.codePointCount(0, 10) = 10

      c15-> isHighSurrogate = false, isLowSurrogate = false, isSupplementaryCodePoint = false
      c3e-> isHighSurrogate = false, isLowSurrogate = false, isSupplementaryCodePoint = false
      isSurrogatePair(c15, c3e) = false
      isSurrogatePair(c3e, c15) = false
        • 1. Re: How to truncate a utf string to desired boundaries
          DrClap
          Okay. There aren't any surrogate pairs in that string. (And it's not a "utf-8" string either: a Java String is always a sequence of UTF-16 chars, whatever encoding the text originally came from.) All the characters are in the BMP, so each char is exactly one code point. You wrote code to extract the first two chars, then, so why did you expect it to extract three?
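
          To make the char versus code point distinction concrete, here is a small sketch. The class name CodePointVsChar and the helper describe are made up for illustration; U+10330 is just an arbitrary character outside the BMP chosen for contrast.

```java
class CodePointVsChar {
    // Hypothetical helper: reports UTF-16 length and code point count.
    static String describe(String s) {
        return "chars=" + s.length()
             + " codePoints=" + s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // First three chars of the Telugu sample: all in the BMP.
        String telugu = "\u0C06\u0C15\u0C3E";
        // U+10330 lies outside the BMP, so it needs a surrogate pair.
        String supplementary = new String(Character.toChars(0x10330));

        System.out.println(describe(telugu));        // chars=3 codePoints=3
        System.out.println(describe(supplementary)); // chars=2 codePoints=1
        System.out.println(Character.isSurrogatePair(
                supplementary.charAt(0), supplementary.charAt(1))); // true
    }
}
```

          Only the second string has a length that differs from its code point count, which is why the surrogate tests on the Telugu sample all come back false.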

          Edit: As you might guess I don't know much about Telugu. How do you determine "appropriate boundaries" for substrings?

          Edited by: DrClap on Oct 12, 2007 3:24 PM
          • 2. Re: How to truncate a utf string to desired boundaries
            807605
            I think what the OP is talking about are grapheme clusters, rather than surrogate pairs. It's a higher-level concept, and not a simple one, from what I've seen. :-( This page has some info about them:

            http://www.unicode.org/reports/tr29/

            According to the Telugu character chart, \u0C3E is a vowel sign, which is always used in combination with one or more other characters, never as a character in its own right. Judging by the OP's sample, if a letter is followed by a vowel sign, the two codepoints are treated as a single character, both when they're displayed and for purposes of counting characters, matching, or collating. There doesn't seem to be a single-codepoint alternative representation for these combinations, as there is for accented Western letters, so normalization doesn't apply. As far as I know, the Java standard libraries don't support grapheme clusters, but ICU4J probably does:

            http://icu-project.org/index.html
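
            For what it's worth, java.text.BreakIterator.getCharacterInstance() in the standard library iterates over user-perceived character boundaries and may be enough for this case. The following is only a sketch: the class name GraphemeTruncate and the method truncate are made up here, and how closely the boundaries follow the TR29 rules depends on the JDK version, so check the results for your script before relying on it.

```java
import java.text.BreakIterator;

class GraphemeTruncate {
    // Hypothetical helper: keeps at most maxClusters user-perceived
    // characters, as delimited by BreakIterator's character instance.
    static String truncate(String s, int maxClusters) {
        if (maxClusters <= 0) {
            return "";
        }
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int end = it.first();
        for (int i = 0; i < maxClusters; i++) {
            int next = it.next();
            if (next == BreakIterator.DONE) {
                return s; // fewer clusters than requested: keep everything
            }
            end = next;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // The OP's sample string, written with escapes to avoid
        // source-encoding problems.
        String sample = "\u0C06\u0C15\u0C3E\u0C36 \u0C26\u0C47\u0C36\u0C3E\u0C28";
        // The vowel sign U+0C3E is a non-spacing mark, so it should stay
        // attached to the preceding letter U+0C15.
        System.out.println(truncate(sample, 2) + "...");
    }
}
```

            The idea is to count boundaries reported by the iterator instead of chars or code points, so a letter and its vowel sign are kept or dropped together.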