HTMLEntityCodec destroys 32-bit CJK (Chinese, Japanese and Korean) characters

Description

From ri.j...@gmail.com on April 04, 2013 08:36:58

What steps will reproduce the problem?
1. Escape "𡘾𦴩𥻂" with org.owasp.esapi.Encoder#encodeForHTML
2. View the result in a browser

What is the expected output? What do you see instead?
Expected: 𡘾𦴩𥻂
Current: ������

What version of the product are you using? On what operating system?
2.0.1 on Mac OS X 10.8.3

Does this issue affect only a specified browser or set of browsers?
It's the same in Chrome, Firefox and IE.

Please provide any additional information below.
The reason is that 32-bit characters (supplementary code points above U+FFFF) do not fit in a single 16-bit Java char/Character; Java stores each one as a surrogate pair of two chars. Encoding char by char therefore emits two invalid surrogate references instead of one reference for the real code point. Here is some code to illustrate it:

String s = "𡘾𦴩𥻂";
// Wrong: iterates over UTF-16 chars, so each surrogate half is encoded separately
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); i++) {
    sb.append("&#x").append(Integer.toHexString(s.charAt(i))).append(';');
}
System.out.println(sb); // &#xd845;&#xde3e;&#xd85b;&#xdd29;&#xd857;&#xdec2;

// Correct: iterates over code points and advances by 1 or 2 chars as needed
sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
    int codePoint = s.codePointAt(i);
    sb.append("&#x").append(Integer.toHexString(codePoint)).append(';');
    i += Character.charCount(codePoint);
}
System.out.println(sb); // &#x2163e;&#x26d29;&#x25ec2;
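
For anyone who wants to try this outside ESAPI, the same idea can be wrapped in a small standalone class. This is only a sketch: the class name CodePointEscapeSketch and the method escapeNonAscii are made up for illustration and are not part of ESAPI, and the method only demonstrates code-point iteration. It does not escape HTML special characters such as < or &, so it is not a replacement for encodeForHTML.

```java
public class CodePointEscapeSketch {

    // Hypothetical helper (not an ESAPI API): leaves ASCII untouched and writes
    // every other code point as one hexadecimal numeric character reference.
    static String escapeNonAscii(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (codePoint < 0x80) {
                sb.append((char) codePoint);
            } else {
                sb.append("&#x").append(Integer.toHexString(codePoint)).append(';');
            }
            // Advance by 2 chars for a surrogate pair, by 1 otherwise.
            i += Character.charCount(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "a" followed by U+2163E, written as an explicit surrogate pair.
        String s = "a\uD845\uDE3E";
        System.out.println(s.length());                      // 3 (UTF-16 chars)
        System.out.println(s.codePointCount(0, s.length())); // 2 (code points)
        System.out.println(escapeNonAscii(s));               // a&#x2163e;
    }
}
```

The difference between length() and codePointCount() is a quick way to check whether a given input contains supplementary characters at all.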

Original issue: http://code.google.com/p/owasp-esapi-java/issues/detail?id=297

Environment

None

Status

Assignee

Unassigned

Reporter

Max Gelman

Priority
