I am working with JavaScript for one of the first times, and it's for a SHA-1 hash. I have found code to do this, but one of its dependencies is a method to convert the string to UTF-8; however, the server I am comparing against uses UTF-16. I have looked around and all my results keep showing UTF-8. Can anybody at least point me in the right direction? Thanks.
1 Answer
JavaScript already uses UTF-16 internally - use charCodeAt() to get the values.
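A minimal sketch of that approach, assuming the server hashes the string as UTF-16LE without a BOM (the name toUtf16leBytes is just for illustration; flip the byte order if your server expects big-endian):

    // Turn a string into UTF-16LE bytes using charCodeAt().
    function toUtf16leBytes(str) {
      var bytes = new Uint8Array(str.length * 2);
      for (var i = 0; i < str.length; i++) {
        var unit = str.charCodeAt(i);     // 16-bit UTF-16 code unit
        bytes[i * 2]     = unit & 0xFF;   // low byte first (little-endian)
        bytes[i * 2 + 1] = unit >> 8;     // then the high byte
      }
      return bytes;
    }

    // Example: "ab" -> bytes 0x61 0x00 0x62 0x00
    console.log(toUtf16leBytes("ab"));

Feed those bytes to your SHA-1 routine in place of the UTF-8 conversion step.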
6 Comments
Mike 'Pomax' Kamermans
Note: charCodeAt() will not give you the UTF-16 byte codes; it will give you the encoding-less Unicode code point number, so it's not particularly useful unless you also have the codepoint-to-UTF-16-bytecode conversion algorithm available.
Christoph
@Mike'Pomax'Kamermans: that's incorrect - charCodeAt() does return UTF-16 code units - see the linked documentation or the ECMA spec; what you describe is codePointAt(), an ES6 addition.
Mike 'Pomax' Kamermans
I read the ECMA spec fairly frequently, so here's the spec for it: "String.prototype.charCodeAt(pos) -- Returns a Number (a nonnegative integer less than 2^16) representing the code unit value of the character at position pos in the String resulting from converting this object to a String. If there is no character at that position, the result is NaN." Code unit refers to the Unicode point, not a specific encoding pattern (Unicode itself is encodingless; it's just a list of glyph-X-has-list-number-...)
Christoph
@Mike'Pomax'Kamermans: there are three levels involved: (1) codepoints (aka Unicode characters), which go up to 0x10FFFF (~21 bits), (2) what the ECMA spec calls code unit values, which you get by encoding Unicode characters via UTF-16 and where higher codepoints are encoded as surrogate pairs (21 > 16), and (3) the byte level, which is just the decision to encode the 16-bit values in little-endian or big-endian order; ECMAScript 5 only gives access to the 2nd level, but that's fine, as that's what SwiftStriker00 was looking for.
Mike 'Pomax' Kamermans
I was testing with characters that were too low; "🀀".charCodeAt() does indeed give the surrogate value. Apologies.
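For anyone reading along, a quick sketch of the levels being discussed, using that same character (codePointAt() needs an ES6-capable engine):

    // "🀀" is U+1F000, outside the Basic Multilingual Plane, so UTF-16
    // stores it as a surrogate pair of two code units.
    var s = "🀀";
    console.log(s.length);                      // 2 (two UTF-16 code units)
    console.log(s.charCodeAt(0).toString(16));  // "d83c" (high surrogate)
    console.log(s.charCodeAt(1).toString(16));  // "dc00" (low surrogate)
    console.log(s.codePointAt(0).toString(16)); // "1f000" (the code point itself)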