Byte array output differs between PHP and JavaScript

Question

I'm using a function that converts a string into a byte array. I have implemented it in both PHP and JavaScript, but the two versions behave differently when I pass in these characters: 㬁愃膘ƘჀ䚐⦀飠噋&ӡ๨㏃棱쌌ص䌠

How can I make the results the same?

My code:

function bytesFromWords($string) {
    $bytes = array();
    $j = strlen($string);

    for($i = 0; $i < $j; $i++) {
        $char = ord(mb_substr($string, $i, 1));
        $bytes[] = $char >> 8;
        $bytes[] = $char & 0xFF;
    }
    return $bytes;
}
echo implode(',', bytesFromWords('㬁愃膘ƘჀ䚐⦀飠噋&ӡ๨㏃棱쌌ص䌠')); // result: 0,227,0,172,0,129,0,230,0,132,0,131,0,232,0,134,0,152,0,198,0,152,0,225,0,131,0,128,0,228,0,154,0,144,0,226,0,166,0,128,0,233,0,163,0,160,0,229,0,153,0,139,0,38,0,211,0,161,0,224,0,185,0,168,0,227,0,143,0,131,0,230,0,163,0,177,0,236,0,140,0,140,0,216,0,181,0,228,0,140,0,160


function bytesFromWords (string) {
    var bytes = [];
    for(var i = 0; i < string.length; i++) {
        var char = string.charCodeAt(i);
        bytes.push(char >>> 8);
        bytes.push(char & 0xFF);
    }
    return bytes;
}
console.log(bytesFromWords('㬁愃膘ƘჀ䚐⦀飠噋&ӡ๨㏃棱쌌ص䌠').toString()); // result: 59,1,97,3,129,152,1,152,16,192,70,144,41,128,152,224,86,75,0,38,4,225,14,104,51,195,104,241,195,12,6,53,67,32

Community · Accepted Answer · 2017-05-23 10:24:05Z

3

Issues:

strlen does not count Unicode characters as expected.
ord does not work with Unicode as expected.
chr does not work with Unicode as expected.

Problem with `strlen`

'㬁愃膘ƘჀ䚐⦀飠噋&ӡ๨㏃棱쌌ص䌠'.length returns 17 in JavaScript, while strlen('㬁愃膘ƘჀ䚐⦀飠噋&ӡ๨㏃棱쌌ص䌠') returns 46 in PHP, because strlen counts bytes rather than characters. To fix it, use:

$j = preg_match_all('/.{1}/us', $string, $data);
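The two counts can be reproduced from the JavaScript side alone. This is a minimal sketch, assuming a runtime that provides the standard TextEncoder global (modern browsers and Node.js); the three-character sample is taken from the question's string and written with escapes so it survives copy and paste.

```javascript
// "㬁Ƙ&": three characters from the question's string.
const sample = '\u3B01\u0198&';

// .length counts UTF-16 code units, which is what the JavaScript loop iterates over.
const codeUnits = sample.length;

// TextEncoder always encodes to UTF-8; this byte count is what PHP's strlen()
// reports for the same string when it is stored as UTF-8.
const utf8ByteCount = new TextEncoder().encode(sample).length;

console.log(codeUnits, utf8ByteCount); // 3 6
```

The same effect on the full 17-character string gives the 46 that strlen reports.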

Problem with `ord`

'㬁'.charCodeAt(0) returns 15105, while ord('㬁') returns 227, which is only the first byte of the character's UTF-8 sequence. To fix it, use:

function unicode_ord($char) {
    list(, $ord) = unpack('N', mb_convert_encoding($char, 'UCS-4BE', 'UTF-8'));
    return $ord;
}

Source: https://stackoverflow.com/a/10333307/1518921
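To see where 227 comes from, the sketch below derives the UTF-8 bytes of 㬁 by hand and compares them with the value charCodeAt returns. The helper name utf8ThreeBytes is hypothetical and only handles code points in the three-byte range U+0800 to U+FFFF (surrogates excluded).

```javascript
const codeUnit = '\u3B01'.charCodeAt(0); // "㬁"
console.log(codeUnit); // 15105

// What the JavaScript bytesFromWords pushes for this character: high byte, low byte.
console.log(codeUnit >>> 8, codeUnit & 0xFF); // 59 1

// Hypothetical helper: UTF-8 encoding of a code point in U+0800..U+FFFF.
function utf8ThreeBytes(cp) {
  return [
    0xE0 | (cp >> 12),          // lead byte carries the top 4 bits
    0x80 | ((cp >> 6) & 0x3F),  // continuation byte, middle 6 bits
    0x80 | (cp & 0x3F),         // continuation byte, low 6 bits
  ];
}

// The first element is the value PHP's ord() reports for the UTF-8 string.
console.log(utf8ThreeBytes(codeUnit)); // [ 227, 172, 129 ]
```

These three bytes are also the first three non-zero values in the PHP output shown in the question.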

Problem with `chr`

String.fromCharCode(15105) returns 㬁, while chr(15105) returns a single byte rather than the character, because chr only handles values from 0 to 255. To fix it, use:

function unicode_chr($u) {
    return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
}

Source: https://stackoverflow.com/a/9878531/1518921
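The difference can be illustrated in JavaScript as well. This is only a sketch: the modulo step imitates how PHP's chr reduces its argument to a single byte value, it is not PHP itself.

```javascript
const code = 15105;

// JavaScript maps the 16-bit value straight to a character.
const fromJs = String.fromCharCode(code);
console.log(fromJs === '\u3B01'); // true ("㬁")

// chr() in PHP only keeps a value in the range 0..255, imitated here with % 256.
const reduced = code % 256;
console.log(reduced); // 1
```

So the PHP side ends up with the byte 0x01 instead of the three-byte UTF-8 sequence for 㬁.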

Full code:

<?php
function unicode_ord($char) {
    list(, $ord) = unpack('N', mb_convert_encoding($char, 'UCS-4BE', 'UTF-8'));
    return $ord;
}

function unicode_chr($u) {
    return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
}

function bytesToWords($bytes) {
    $str = '';
    $j = count($bytes);

    for($i = 0; $i < $j; $i += 2) {
        $char = $bytes[$i] << 8;
        if ($bytes[$i + 1]) {
            $char |= $bytes[$i + 1];
        }
        $str .= unicode_chr($char);
    }
    return $str;
}

function bytesFromWords($string) {
    $bytes = array();
    $j = preg_match_all('/.{1}/us', $string, $data);
    $data = $data[0];

    foreach ($data as $char) {
        $char = unicode_ord($char);
        $bytes[] = $char >> 8;
        $bytes[] = $char & 0xFF;
    }
    return $bytes;
}


$data = bytesFromWords('㬁愃膘ƘჀ䚐⦀飠噋&ӡ๨㏃棱쌌ص䌠');

echo implode(', ', $data), '<br>';
echo bytesToWords($data);
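For comparing both sides, a JavaScript counterpart of the round trip might look like the sketch below. bytesFromWords is the function from the question; wordsFromBytes is a hypothetical name for the inverse and is not part of the original code.

```javascript
// From the question: split each UTF-16 code unit into a high and a low byte.
function bytesFromWords(string) {
  const bytes = [];
  for (let i = 0; i < string.length; i++) {
    const unit = string.charCodeAt(i);
    bytes.push(unit >>> 8);
    bytes.push(unit & 0xFF);
  }
  return bytes;
}

// Hypothetical inverse: join each byte pair back into one code unit.
function wordsFromBytes(bytes) {
  let str = '';
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    str += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return str;
}

const original = '\u3B01\u6103&'; // "㬁愃&"
const roundTripBytes = bytesFromWords(original);
console.log(roundTripBytes.join(', ')); // 59, 1, 97, 3, 0, 38
console.log(wordsFromBytes(roundTripBytes) === original); // true
```

The PHP bytesFromWords above should print the same numbers for the same input.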

edited May 23, 2017 at 10:24

CommunityBot


answered Apr 22, 2015 at 1:25

Korvo


6 Comments

Ja͢ck Over a year ago

Getting the length of a Unicode string is typically done with mb_strlen() and not with preg_match_all() :)

Korvo Over a year ago

@Ja͢ck I used preg_match_all instead of mb_strlen and mb_substr to simplify things :) Note that this way it is not necessary to fetch one character at a time.

Ja͢ck Over a year ago

Oh, I see what you did there; I would have expected you to just foreach ($data[0] as $char) { ... } :)

Korvo Over a year ago

@Ja͢ck I typically use foreach only for associative arrays. Do you agree? Thanks :)

Ja͢ck Over a year ago

foreach can be used for either and is actually faster, too :)


Ja͢ck · Accepted Answer · 2015-04-22 05:46:39Z

2

JavaScript strings are sequences of 16-bit code units (UTF-16, which matches UCS-2 for characters in the Basic Multilingual Plane), so to get the same ordinal values you first have to convert your string, e.g. with mb_convert_encoding(), or iconv() if you prefer.

A quick way to get the ordinal values from a string is unpack().

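function bytesFromWords($string)
{
    // Big-endian, so the high byte of each character comes first,
    // matching the order of the JavaScript output.
    $x = mb_convert_encoding($string, 'UCS-2BE', 'UTF-8');
    // 'C*' unpacks every byte as an unsigned integer (keys start at 1).
    $data = unpack('C*', $x);
    // Re-index from 0.
    return array_values($data);
}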
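One caveat, sketched here as an assumption about inputs the question does not actually use: for characters outside the Basic Multilingual Plane, charCodeAt returns two surrogate code units, so the JavaScript function emits four bytes per character. UCS-2 cannot represent such characters, so matching that output in PHP would need a UTF-16 target encoding (for example 'UTF-16BE') instead.

```javascript
// Same function as in the question.
function bytesFromWords(string) {
  const bytes = [];
  for (let i = 0; i < string.length; i++) {
    const unit = string.charCodeAt(i);
    bytes.push(unit >>> 8);
    bytes.push(unit & 0xFF);
  }
  return bytes;
}

// U+1F600 lies outside the BMP and is stored as the surrogate pair D83D DE00.
const astral = '\u{1F600}';
console.log(astral.length);          // 2 code units for one character
console.log(bytesFromWords(astral)); // [ 216, 61, 222, 0 ]
```

None of the characters in the question's string fall into this range, so the UCS-2 conversion is sufficient there.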


edited Apr 22, 2015 at 5:46

answered Apr 22, 2015 at 1:46

Ja͢ck


Community · Accepted Answer · 2017-05-23 11:47:33Z

You use mb_substr(), which may return a multibyte string (even if it's just one code point).

But ord() can't handle that: it only looks at the first byte it is passed, not the first character.

To get what you want, you should just split the string and take the single bytes:

$bytes = str_split($string);
foreach ($bytes as &$chr) {
    $chr = ord($chr);
}
unset($chr); // break the reference left over from the loop

Yes, this is not the same as what you have in JavaScript. There, string.charCodeAt() gives you the UTF-16 code unit (the code point, for characters in the Basic Multilingual Plane), not the UTF-8 byte sequence.

A trick to get the UTF-8 bytes in JavaScript (copied from https://stackoverflow.com/a/18729536 ~ Jonathan Lonowski):

var utf8 = unescape(encodeURIComponent(string));

var arr = [];
for (var i = 0; i < utf8.length; i++) {
    arr.push(utf8.charCodeAt(i));
}
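In current runtimes the same bytes can also be obtained without unescape, which is a legacy function. A minimal sketch, assuming the standard TextEncoder global is available; the helper name utf8Bytes is hypothetical.

```javascript
// Hypothetical helper: UTF-8 bytes of a string as a plain array of numbers.
function utf8Bytes(string) {
  // TextEncoder always encodes to UTF-8 and returns a Uint8Array.
  return Array.from(new TextEncoder().encode(string));
}

console.log(utf8Bytes('\u3B01')); // [ 227, 172, 129 ] for "㬁"
```

For the question's characters this should produce the same values as the unescape(encodeURIComponent(...)) loop.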

But if you want the Unicode code point in PHP, a quick search will turn it up (e.g. How to get code point number for a given character in a utf-8 string?).
