6

I need to convert large UTF-8 strings into ASCII. It should be reversible, and ideally a quick/lightweight algorithm.

How can I do this? I need the source code (using loops) or the JavaScript code. (should not be dependent on any platform/framework/library)

Edit: I understand that the ASCII representation will not look correct and will be larger (in bytes) than its UTF-8 counterpart, since it's an encoded form of the UTF-8 original.

7
  • I'm getting confused by your edits. It's starting to sound like what you actually want to do is URL encoding. Is that right? Commented May 7, 2009 at 12:30
  • 1
    I didn't downvote you. And I don't care about the binary format of UTF-8. Commented May 7, 2009 at 15:28
  • 2
    If I didn't know what I was asking for, I wouldn't even have gotten a few correct answers. (such as Escaping/Base64) Commented May 7, 2009 at 15:37
  • 1
    You should consider going with David's answer - encodeURI()/decodeURI() are better suited to solve your problem than quote()/eval() Commented May 7, 2009 at 17:08
  • 1
    Jeremy, take a look at what people are commenting and update your question, currently the title and description are very wrong. Otherwise you will continue to get downvotes from others. Commented Dec 23, 2009 at 13:42

11 Answers

12

You could use an ASCII-only version of Douglas Crockford's json2.js quote function, which would look like this:

    var escapable = /[\\\"\x00-\x1f\x7f-\uffff]/g,
        meta = {    // table of character substitutions
            '\b': '\\b',
            '\t': '\\t',
            '\n': '\\n',
            '\f': '\\f',
            '\r': '\\r',
            '"' : '\\"',
            '\\': '\\\\'
        };

    function quote(string) {

// If the string contains no control characters, no quote characters, and no
// backslash characters, then we can safely slap some quotes around it.
// Otherwise we must also replace the offending characters with safe escape
// sequences.

        escapable.lastIndex = 0;
        return escapable.test(string) ?
            '"' + string.replace(escapable, function (a) {
                var c = meta[a];
                return typeof c === 'string' ? c :
                    '\\u' + ('0000' + a.charCodeAt(0).toString(16)).slice(-4);
            }) + '"' :
            '"' + string + '"';
    }

This will produce a valid ASCII-only, JavaScript-quoted version of the input string,

e.g. quote("Doppelgänger!") will be "Doppelg\u00e4nger!"

To revert the encoding you can pass the result to JSON.parse (or simply eval it):

var encoded = quote("Doppelgänger!");
var back = JSON.parse(encoded); // eval(encoded);
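If you would rather not rely on eval() or JSON.parse, the escapes can also be undone with a regex. The following is a minimal sketch, assuming the input was produced by quote() above; unquote is a hypothetical name and not part of json2.js.

```javascript
// Hypothetical helper: reverses the escapes that quote() emits.
function unquote(quoted) {
    // Maps the character after a backslash to the character it stands for.
    var unmeta = { b: '\b', t: '\t', n: '\n', f: '\f', r: '\r', '"': '"', '\\': '\\' };

    // Strip the surrounding double quotes, then decode each escape sequence.
    return quoted.slice(1, -1).replace(/\\(?:u([0-9a-fA-F]{4})|(["\\btnfr]))/g,
        function (match, hex, ch) {
            return hex !== undefined
                ? String.fromCharCode(parseInt(hex, 16)) // \uXXXX escape
                : unmeta[ch];                            // short escape such as \n
        });
}
```

Because the regex consumes an escaped backslash as one unit, a literal backslash followed by "u0041" in the original text is not mistaken for a \u escape.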

5 Comments

Why not use something other than eval() ? Like say, html entities?
Mostly because you don't need to implement anything for reversion and it will be pretty fast. You could just as well use a regex-based unquote method very much like the quote function.
.. or you could secure the eval based unquote with regex validation like json2.js does for complete JSON.
Note that strictly speaking this is not "conversion to ASCII". You're actually implementing your own encoding scheme on top of ASCII. This may be perfectly ok for the requirements (and it seems to be for you), but it's not just a simple "conversion to ASCII".
instead of eval(encoded) you can use JSON.parse(encoded) (which is similar under the covers, but safer)
6

Any UTF-8 string that is reversibly convertible to ASCII is already ASCII.

UTF-8 can represent any unicode character - ASCII cannot.

3 Comments

"ASCII cannot" - Of course it can! look at the accepted answer above.
@Jeremy: Then state your question less sneakily! "UTF-8 to ASCII conversion" sounds like a character encoding conversion problem, while what you really want is a way to represent Unicode (that's not the same as UTF-8) characters using the ASCII charset and a known character escaping syntax.
@Pat That's one of the most common misconceptions about UTF-8. UTF-8 and UTF-16 actually have variable bit lengths and either one can represent any unicode character. en.wikipedia.org/wiki/UTF-8
6

As others have said, you can't convert UTF-8 text/plain into ASCII text/plain without dropping data.

You could convert UTF-8 text/plain into ASCII someother/format. For instance, HTML lets any character in UTF-8 be represented in an ASCII data file using character references.

If we continue with that example, in JavaScript, charCodeAt could help with converting a string to a representation of it using HTML character references.
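A minimal sketch of that idea follows. The names toCharRefs and fromCharRefs are hypothetical, and the sketch uses codePointAt rather than charCodeAt so that characters outside the Basic Multilingual Plane become a single reference instead of two surrogate halves. The ampersand is escaped as well, which is what keeps the encoding reversible.

```javascript
// Hypothetical helper: encode a string as ASCII with numeric character references.
function toCharRefs(str) {
    var out = '';
    for (var ch of str) { // for...of iterates by code point, not by UTF-16 unit
        var cp = ch.codePointAt(0);
        // Escape everything above printable ASCII, plus "&" itself.
        out += (cp > 126 || ch === '&') ? '&#' + cp + ';' : ch;
    }
    return out;
}

// Hypothetical helper: turn the numeric references back into characters.
function fromCharRefs(str) {
    return str.replace(/&#(\d+);/g, function (match, digits) {
        return String.fromCodePoint(Number(digits));
    });
}
```

For instance, toCharRefs("Doppelgänger") gives "Doppelg&#228;nger", and fromCharRefs restores the original.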

Another approach is taken by URLs, and implemented in JS as encodeURIComponent.
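A short sketch of the round trip with the built-in functions (the variable names are only for illustration):

```javascript
// Percent-encode the UTF-8 bytes of every character outside the unreserved set.
var encoded = encodeURIComponent('Doppelgänger!');

// decodeURIComponent reverses the encoding.
var decoded = decodeURIComponent(encoded);

console.log(encoded);
console.log(decoded === 'Doppelgänger!');
```

Note that encodeURIComponent throws a URIError if the string contains a lone surrogate, so it only suits well-formed text.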


4

It is impossible to convert a UTF-8 string into ASCII, but it is possible to encode Unicode as an ASCII-compatible string.

Probably you want to use Punycode - this is already a standard Unicode encoding that encodes all Unicode characters into ASCII. For JavaScript code check this question

Please edit your question title and description in order to prevent others from down-voting it - do not use the term conversion, use encoding.


2

If the string is encoded as UTF-8, it's not a string any more. It's binary data, and if you want to represent the binary data as ASCII, you have to format it into a string that can be represented using the limited ASCII character set.

One way is to use base-64 encoding (example in C#):

string original = "asdf";
// encode the string into UTF-8 data:
byte[] encodedUtf8 = Encoding.UTF8.GetBytes(original);
// format the data into base-64:
string base64 = Convert.ToBase64String(encodedUtf8);

If you want the string encoded as ASCII data:

// encode the base-64 string into ASCII data:
byte[] encodedAscii = Encoding.ASCII.GetBytes(base64);
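The same approach can be sketched in JavaScript. This assumes an environment that provides TextEncoder, TextDecoder, btoa and atob (modern browsers and recent Node.js versions); toBase64 and fromBase64 are hypothetical names.

```javascript
// Hypothetical helper: UTF-8 encode the string, then format the bytes as base-64.
function toBase64(str) {
    const bytes = new TextEncoder().encode(str); // UTF-8 bytes
    let binary = '';
    for (const b of bytes) {
        binary += String.fromCharCode(b); // one char per byte, as btoa expects
    }
    return btoa(binary);
}

// Hypothetical helper: decode the base-64 text back into the original string.
function fromBase64(b64) {
    const binary = atob(b64);
    const bytes = Uint8Array.from(binary, c => c.charCodeAt(0));
    return new TextDecoder().decode(bytes);
}
```

The output uses only letters, digits, "+", "/" and "=", and is roughly a third larger than the UTF-8 bytes it represents.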

1 Comment

Great idea, though I wanted JS. Thanks.
2
function utf8ToAscii(str) {
    /**
     * ASCII contains 128 characters (code points 0 to 127).
     *
     * JavaScript strings are sequences of UTF-16 code units, so a
     * char code never exceeds 2^16 - 1; larger values wrap around. E.g.:
     * `String.fromCharCode(0) === String.fromCharCode(2**16)`
     *
     * @see https://developer.mozilla.org/en-US/docs/Web/API/DOMString/Binary
     */
    const reg = /[\x7f-\uffff]/g; // charCode: [127, 65535]
    const replacer = (s) => {
        const charCode = s.charCodeAt(0);
        const unicode = charCode.toString(16).padStart(4, '0');
        return `\\u${unicode}`;
    };

    return str.replace(reg, replacer);
}

Better way

See Uint8Array to string in Javascript also. You can use TextEncoder and Uint8Array:

function utf8ToAscii(str) {
    const enc = new TextEncoder(); // takes no arguments; always encodes as UTF-8
    const u8s = enc.encode(str);

    return Array.from(u8s).map(v => String.fromCharCode(v)).join('');
}
// For ascii to string
// new TextDecoder().decode(new Uint8Array(str.split('').map(v=>v.charCodeAt(0))))


1

Do you want to strip all non-ASCII chars (or replace them with '?', etc.) or to store Unicode code points in a non-Unicode system?

The first can be done in a loop checking for values > 127 and replacing them.
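A minimal sketch of that loop, treating every UTF-16 code unit above 127 as non-ASCII (replaceNonAscii is a hypothetical name). This variant is lossy, so it does not meet the reversibility requirement in the question.

```javascript
// Hypothetical helper: replace every non-ASCII code unit with "?".
function replaceNonAscii(str) {
    var out = '';
    for (var i = 0; i < str.length; i++) {
        // ASCII covers char codes 0 to 127; anything above is replaced.
        out += str.charCodeAt(i) > 127 ? '?' : str.charAt(i);
    }
    return out;
}
```

Since the loop works on code units, a character stored as a surrogate pair turns into two question marks.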

If you don't want to use "any platform/framework/library" then you will need to write your own encoder. Otherwise I'd just use jQuery's .html();


1

Your requirement is pretty strange.

Converting UTF-8 into ASCII would lose all information about Unicode codepoints > 127 (i.e. everything that's not in ASCII).

You could, however try to encode your Unicode data (no matter what source encoding) in an ASCII-compatible encoding, such as UTF-7. This would mean that the data that is produced could legally be interpreted as ASCII, but it is really UTF-7.

3 Comments

"lose all information" - It can be lossless! look at the accepted answer above.
Good idea about the UTF-7 though.
@Jeremy: it can be lossless, but then you're no longer just "converting to ASCII", you're then converting to some encoding scheme implemented on top of the ASCII character set ...
0

Here is a function that converts accented UTF-8 characters (à, é, è, î, etc.) to ASCII. If the string contains an accented character, it is replaced by "%" followed by a numeric code (for example, ï becomes %139). On the other side, I parse the string, so I know where there is an accent and which character it stands for.

I used it in a javascript software to send data to a microcontroller that works in ASCII.

var convertUtf8ToAscii = function (str) {
    var asciiStr = "";
    var refTable = { // Reference table Unicode vs ASCII
        199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 231: 135, 234: 136, 235: 137, 232: 138,
        239: 139, 238: 140, 236: 141, 196: 142, 201: 144, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151
    };
    for (var i = 0; i < str.length; i++) {
        var ascii = refTable[str.charCodeAt(i)];
        if (ascii !== undefined)
            asciiStr += "%" + ascii; // accented character: emit "%" plus its code
        else
            asciiStr += str[i];      // anything else passes through unchanged
    }
    return asciiStr;
};
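For completeness, here is a sketch of what the parsing on the receiving side could look like if it were also written in JavaScript. The name convertAsciiToUtf8 is hypothetical, and the table simply repeats the mapping from convertUtf8ToAscii above.

```javascript
// Same mapping as in convertUtf8ToAscii: Unicode char code -> transmitted code.
var refTable = {
    199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 231: 135, 234: 136, 235: 137, 232: 138,
    239: 139, 238: 140, 236: 141, 196: 142, 201: 144, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151
};

// Build the inverse table: transmitted code -> Unicode char code.
var reverseTable = {};
for (var key in refTable) {
    reverseTable[refTable[key]] = Number(key);
}

// Hypothetical inverse of convertUtf8ToAscii.
function convertAsciiToUtf8(asciiStr) {
    // Every transmitted code has exactly three digits (128 to 151).
    return asciiStr.replace(/%(\d{3})/g, function (match, digits) {
        var code = reverseTable[digits];
        // Unknown codes are left untouched.
        return code !== undefined ? String.fromCharCode(code) : match;
    });
}
```

One caveat of the scheme: a literal "%" followed by one of these codes in the original text (for example "%130") cannot be told apart from an encoded accent, so the round trip is only safe when the input never contains such a sequence.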


0

If you are using Node.js, you can use the TextDecoder class.

const decoder = new TextDecoder('ascii');
let text = decoder.decode(buffer);


-1

An implementation of the quote() function might do what you want. My version can be found here

You can use eval() to reverse the encoding:

var foo = 'Hägar';
var quotedFoo = quote(foo);
var unquotedFoo = eval(quotedFoo);
alert(foo === unquotedFoo);

2 Comments

@Jeremy: not really - same thing, different implementation; if I'd seen fforw's answer before posting my own, I wouldn't have bothered; my version has a few more options (choice between single or double quotes, optionally doesn't escape non-ascii characters), but most likely it will be slower
Dead link -----
