Thanks Hank Green.

  • nycki@lemmy.world
    10 days ago

    Almost all web traffic now uses the utf-8 encoding, a clever hack which works because ascii is a seven-bit code but web traffic uses 8-bit bytes.

    • If the first (highest) bit is 0, treat the byte as ascii.
    • If the first (highest) bit is 1, treat the byte as part of a multi-byte unicode character.
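As a rough sketch of the two rules above, here is one way to label each byte of a UTF-8 string in Python. The helper name `classify_byte` is made up for illustration, and it splits the "first bit is 1" case into lead and continuation bytes, a detail the comment does not spell out. It does not validate the input.

```python
def classify_byte(b: int) -> str:
    """Label one byte by its leading bits (hypothetical helper, no validation)."""
    if (b & 0x80) == 0:
        return "ascii"         # 0xxxxxxx: plain seven-bit ascii
    if (b & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx: later byte of a multi-byte character
    return "lead"              # 11xxxxxx: first byte of a multi-byte character


data = "aé€".encode("utf-8")   # 1-byte, 2-byte and 3-byte characters
labels = [classify_byte(b) for b in data]
print(data.hex(" "), labels)
```

Only the first byte here has its high bit clear, which is why old ascii-only text is already valid utf-8.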

    multi-byte characters in utf-8 can officially be up to four bytes long, with 11 of those 32 bits used for tracking the size of the multi-byte block. That leaves 21 bits for the code point itself, so 2^21 possible values, about two million in total (unicode itself stops at U+10FFFF, so about 1.1 million of them are valid code points). That is easily enough for every alphabet you could need to write on a website, and all without breaking ascii.
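To make the bit counting concrete, here is a minimal sketch that packs a code point into the four-byte pattern by hand and compares it with Python's built-in encoder. The function name `encode_four_bytes` is an assumption for illustration, and it only handles the four-byte range.

```python
def encode_four_bytes(cp: int) -> bytes:
    """Pack a code point in U+10000..U+10FFFF into four utf-8 bytes (sketch only)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("this sketch only handles the four-byte range")
    return bytes([
        0xF0 | (cp >> 18),           # 11110xxx: 5 marker bits, 3 payload bits
        0x80 | ((cp >> 12) & 0x3F),  # 10xxxxxx: 2 marker bits, 6 payload bits
        0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx
        0x80 | (cp & 0x3F),          # 10xxxxxx
    ])


payload_bits = 3 + 6 + 6 + 6         # bits that carry the code point
marker_bits = 32 - payload_bits      # bits that only describe the layout
encoded = encode_four_bytes(0x1F600)
print(encoded.hex(" "), payload_bits, marker_bits, 2 ** payload_bits)
```

The marker bits add up as 5 + 2 + 2 + 2 = 11, matching the count in the comment.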

      • nycki@lemmy.world
        10 days ago

        yep! the ascii standard was originally invented for teletypewriters, and includes four ‘blocks’ of 32 codes each, for 128 in total, so it only uses seven bits per code.

        the first block, hex 00 - 1F, contains control codes for the typewriter. stuff like “newline”, “backspace”, and “ring bell” all go in here.
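Since each block holds 32 codes, the top two of the seven bits pick the block. A small sketch, with `block_index` as a made-up helper name, checking that the control codes mentioned above all land in block 0:

```python
def block_index(code: int) -> int:
    """Return which 32-code block (0..3) a seven-bit ascii code falls in."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a seven-bit ascii code")
    return code >> 5  # drop the low five bits, keep the top two


# Python escape sequences for the three control codes named in the comment.
controls = {"newline": "\n", "backspace": "\b", "ring bell": "\a"}
codes = {name: ord(ch) for name, ch in controls.items()}

for name, code in codes.items():
    print(f"{name}: hex {code:02X}, block {block_index(code)}")
```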

        The second block, hex 20 - 3F, has the digits in order, from hex 30 = ‘0’ all the way to hex 39 = ‘9’, so the low four bits of each digit's code are the digit's value.
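Because the digits sit in order starting at hex 30, turning a digit character into its number is a single subtraction (or a mask of the low four bits). A minimal sketch, with `digit_value` as a hypothetical helper:

```python
def digit_value(ch: str) -> int:
    """Convert one ascii digit character to its numeric value."""
    code = ord(ch)
    if not 0x30 <= code <= 0x39:
        raise ValueError(f"{ch!r} is not an ascii digit")
    return code - 0x30  # for these ten codes this equals code & 0x0F


values = [digit_value(c) for c in "0123456789"]
print(values)
```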

        The uppercase alphabet starts at hex 41 = ‘A’, and exactly one block later, the lowercase alphabet starts at hex 61 = ‘a’. This means their binary codes are 100 0001 and 110 0001, differing only in a single bit (hex 20). So you can easily convert between upper and lowercase ascii by flipping that bit.
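The bit flip can be sketched with XOR. The name `flip_case` is made up for illustration, and the guard matters: flipping that bit on digits or punctuation would produce unrelated characters.

```python
CASE_BIT = 0x20  # the single bit that differs between 'A' (0x41) and 'a' (0x61)


def flip_case(ch: str) -> str:
    """Swap the case of one ascii letter by flipping bit 0x20; leave others alone."""
    if ch.isascii() and ch.isalpha():
        return chr(ord(ch) ^ CASE_BIT)
    return ch


flipped = "".join(flip_case(c) for c in "Hello, World 42")
print(flipped)
```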

        The remaining space in the last three blocks is filled with various punctuation marks. I’m not sure if these are in any particular order.

        The final ascii code, 7F, is reserved for “delete”, because its binary representation is 111 1111, perfect for “deleting” data on punched paper tape by punching out every hole over it.
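Punching can only add holes, never remove them, so overpunching behaves like a bitwise OR. A small sketch of why 7F works as the erase code, with `punch_over` as a hypothetical name:

```python
DEL = 0x7F  # binary 111 1111: every one of the seven positions punched


def punch_over(existing: int, new: int) -> int:
    """Model punching a new code on top of an existing one: holes accumulate."""
    return existing | new


# Whatever was there before, punching DEL over it leaves only DEL.
results = {punch_over(code, DEL) for code in range(128)}
print(results)
```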