Japanese text on the web is a lot like politics and sausage – it’s a messy process that nobody should ever have to see. But in the time I’ve been working at Tofugu, I’ve had to bear witness to some horrible, horrible things.

Let me pull back the curtain for a bit and show the absolute nightmare that exists behind Japanese text on the internet.

Kanji, Kanji Everywhere

In case you’re unfamiliar with Asian history and culture: ancient Chinese culture had a tremendous impact on virtually every culture in East Asia. Other countries in Asia adopted Chinese food (like ramen!), customs, and parts of the language.

Most of you probably already know that the complicated characters in Japanese called kanji come from Chinese characters, but it doesn’t stop there. Korean has its own adaptation of Chinese characters called hanja, and until colonialism, Vietnamese used Chinese characters in its language.

These are all known as han characters, or sometimes CJK (Chinese, Japanese, Korean) characters.

You’d think that this would be a good thing, right? All these different countries and cultures using han characters; it’s like everybody’s joined hands and is singing We Are The World, right?

Oh, if only.

Why Kanji Doesn’t Look Quite Right On The Internet

Why are these han characters a problem? It has something to do with Unicode, a widely used standard for representing text from different languages on computers.

Somewhere down the line, somebody thought that it would be a great idea to save time and space by saying that, in Unicode, all of these han characters are, for all intents and purposes, exactly the same. This process was called Han Unification, and would soon become the bane of my existence.

Han Unification is a problem because han characters can look different and mean different things in each language.

Just take a look at this picture: it’s the same Unicode character, rendered in five different languages:
The Chinese versions look completely different from the other languages.
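If you want to see what “the same Unicode character” means in practice, here is a minimal Python sketch. It is my own illustration, and the variable names are made up: the point is only that the character in the picture is stored as a single code point, with nothing in the text itself saying which language it belongs to.

```python
import unicodedata

# Hypothetical variables: the "same" character as a Japanese writer and a
# Chinese writer would type it. After Han Unification both are one code point.
japanese_text = "冷"
chinese_text = "冷"

code_point = ord(japanese_text)
print(f"U+{code_point:04X}")             # U+51B7
print(unicodedata.name(japanese_text))   # CJK UNIFIED IDEOGRAPH-51B7
print(japanese_text == chinese_text)     # True: the string carries no language
```

Which shape you actually see is decided later, by the font and by whatever language information the page or app supplies.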

Unless the website explicitly says that a piece of kanji text is Japanese (with the HTML lang attribute), it won’t look quite right. It might use the Chinese style and make everything else (i.e. kana) look out of place.
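As a rough sketch of what “explicitly says” looks like, here is a small Python helper that wraps a piece of text in a span with the HTML lang attribute. The function name `tag_language` is hypothetical, and a real site would more likely set lang once on the whole page or template; this only shows the markup a browser needs in order to pick Japanese-style glyphs.

```python
from html import escape


def tag_language(text: str, lang: str = "ja") -> str:
    """Wrap text in a span that declares its language (hypothetical helper)."""
    # Escape both the attribute value and the text so the output is safe HTML.
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'


print(tag_language("冷たい"))
# <span lang="ja">冷たい</span>
```

Whether the glyphs actually change still depends on the fonts installed and on how the browser maps languages to fonts.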

This becomes a problem more often than you might think, whether it’s on your phone, when using electronic flashcards, or just reading the news.

And when you throw different fonts, operating systems, and browsers into the mix, all bets are off.

Worse still, people argue that this is exactly what Unicode should be like. The argument is that, despite stylistic and cultural differences, underneath it all these characters are essentially the same.

I can understand the rationale behind Han Unification but, since I have the emotional capacity of a child and just want things to work, I’m going to say that it’s dumb and stupid and I hate it.

Why Japanese Isn’t Readable On The Internet

But hey, if your kanji looks wrong, all’s not lost. You can always use furigana, the simple, little characters you see above kanji to help you read them. Right?

Wrong.

While the technology to do this on the web exists (the HTML ruby element), you won’t see it much. It doesn’t work in every web browser (Firefox, for one, doesn’t support it), and few people choose to use it on their websites.
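For reference, here is a hedged Python sketch that builds ruby markup for one word. The helper name `ruby` is made up for illustration. It uses the standard ruby, rt, and rp elements; the rp parts are fallback parentheses that browsers with ruby support hide and browsers without it show as plain text.

```python
from html import escape


def ruby(base: str, reading: str) -> str:
    """Return ruby markup for a word and its furigana (hypothetical helper)."""
    return (
        f"<ruby>{escape(base)}"
        # <rp> holds the fallback parentheses, <rt> holds the reading itself.
        f"<rp>(</rp><rt>{escape(reading)}</rt><rp>)</rp>"
        "</ruby>"
    )


print(ruby("友達", "ともだち"))
# <ruby>友達<rp>(</rp><rt>ともだち</rt><rp>)</rp></ruby>
```

In a browser without ruby support, that markup degrades to 友達(ともだち), which is readable if not pretty.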

I would love to include furigana in the kanji I write to make it easier for beginners, but right now it’s not really an option.

Unfortunately, web developers seem much more interested in tech demos and proof-of-concept sites than in making sure the web looks as good in other languages as it does in English.

Maybe someday Japanese will get the first-class treatment on the web that it deserves, but right now I think we have a long way to go.

  • M_Nimon

    Interesting.  Now, if only someone can explain to me why in the world someone out there thought half-width katakana was a good idea…

  • Mescale

    Text Encoding of any kind is crazy nuts, I don’t rightly recall exactly what I was doing, but it was hell, I must have blacked out or something. Oh God, there’s a whole portion of my life I don’t remember, but I do remember it involved Unicode.

    WHAT THE HELL HAPPENED TO ME!

    I think it involved code pages, and UTF bits and things.

  • Peptron

    Funny, one of the main reasons I don’t use Chrome is because it defaults to Chinese Style whenever it sees East-Asian stuff. So reading Japanese makes your eyes bleed…

  • Belthazar

    Ruby text in my copy of firefox displays as 友達(ともだち). At least, I’m pretty sure it’s actually ruby text, because it displays properly when I view the same page in Safari. Not pretty, but not exactly nightmarish either. I tried to install a ruby text add on for Firefox, but it just made Firefox crash on startup.

  • Peptron

    Other than that, I also cannot stand how some sites use small fonts and default to serifed instead of non-serifed. Small font + serif means that you won’t be able to get your delivery of biang-biang-mian today, as your eyes will have exploded.

  • http://www.tofugu.com/ Hashi

    Yeah, I’ve heard that Chrome is the worst offender when it comes to this stuff.

  • Guest


    “I’m going to say that it’s dumb and stupid and I hate it.”
    qq moar

  • http://www.tofugu.com/ Hashi

     The example I used broke down the furigana per-character, rather than using the furigana for the entire word.

    I’m not really sure which is better, tbh, though per-word looks nicer on browsers without Ruby support.

  • http://www.tofugu.com/ Hashi

    *looks up biang-biang-mian*

    *head erupts*

  • http://www.tofugu.com/ Hashi

    Hmm, I’ll have to get back to you on that one.

  • トラビス

    OH MY GOD FURIGANA ON THE INTERNET WOULD ROCK I CAN’T EVEN IMAGINE HOW GREAT IT WOULD BE. Finally being able to actually read something after three years.

  • kuyaChristian

    No wonder why the text doesn’t make sense when I use Rikaikun for each kanji. T_______T

  • Guest

    This app for Firefox displays pronunciations and definitions of any Japanese word you hover over with your mouse: 
    http://www.polarcloud.com/rikaichan/

  • http://www.facebook.com/profile.php?id=100000511548857 Hobbid Hobbin

    While we’re on the topic of readable English on sites, I should probably point out that when I use Chrome, sentences run into (and over) each other on the Tofugu site. Not complaining though, I can always use Firefox.

  • dh10110

    In my own projects, I’ve faked ruby support in Firefox by putting the furigana in a data-ruby attribute in a span around the kanji, then using CSS to show and position the content of the attribute.

    [data-ruby] {
        display: inline-block;
        text-align: center;
    }
    [data-ruby]:before {
        font-size: 60%;
        content: attr(data-ruby);
        display: block;
        text-align: center;
    }

  • http://thepretentiousgamer.blogspot.com Rachel

    there’s one for chrome, too–furigana injector
    https://chrome.google.com/webstore/detail/cbahnmcliajmanjkaolemjelphicnein

  • http://www.facebook.com/profile.php?id=66701403 Jace ‘Lion’ Repshire

    There’s always Rikaikun/Rikaichan to spell out the Kanji in furigana! At least we have that to help! I think I pulled that from one of Tofugu’s resource pages, but if I didn’t, check it out. It’s a great browser add on that will explain Kanji when you hover over them with your mouse. It shows the furigana and meaning. :3

  • Berry Phillips

    It’s hardly surprising that this issue exists when we don’t even have any standardisation between browsers and operating systems for Latin character sets. So frustrating for us web developers (and designers, of course) who want to make websites that look the same on all platforms.

  • Berry Phillips

    Furigana is supported on at least the latest dev channel release of Chrome (v19.x): http://share.berryp.com/0S323z0P0I0R0f2q413N (The example is real text, not images.)

  • ianclarksmith

    I don’t suppose this is related to this being posted on HN the other day: http://www.reigndesign.com/blog/love-hotels-and-unicode/

  • http://www.tofugu.com/ Hashi

    A little bit! Han Unification is an issue that’s come up a few times before for me (usually on work with TextFugu), but that article inspired me a bit to actually write about it.

  • Meagan McClendon

    This exact thing used to trip me up all the time when I first started and I still hate it…
    Here’s an anecdote : D
    When I tried my hand at Chinese (after 5 years learning Japanese) I had my tests marked down because I wrote the way the book said instead of their study guide. I showed them to my professor and they said they had never noticed it before. *facepalm*
    At least I brought it to their attention… although I don’t know if they appreciated this Linguistics-Japanese undergrad showing them up on their own language’s supposedly-impossible-to-learn-as-a-foreigner-characters over fonts. (((<.<;;)

  • Zeldaskitten

    This happens to me too! Always thought it was weird. If I can’t read something, scrolling up and down usually fixes it..

  • coldcaption

    I use OS X and Windows and everything is a bit more harmonious on the Mac. I get into Mandarin a little bit as well and I don’t think I’ve had this problem on it, or I just haven’t noticed yet. Things are miserable on the PC, though. You’ll get bold, fun kana and then unforgiving serif’d kanji. Widths don’t match up. Everything is aliased. When I first got my Windows laptop and used it for Japanese, I spent a while trying to figure out what a kanji was. I couldn’t find it, but did eventually learn that it was 書 with some strokes minced.
    I have had some breakdowns with my Chinese school book, though. Sometimes /it/ has its characters rendered differently than the way I’m used to, so I won’t notice that I already know one for a while.

    Though if you think computers are bad, you should see my Chinese handwriting. My Japanese handwriting is fine (as long as I don’t have kanji amnesia~), but my Chinese handwriting is a hodgepodge of Chinese traditional characters, Japanese simplifications, and Chinese simplified characters.

  • coldcaption

    What? I use Chrome. Have I been corrupting myself? ;-;

  • http://www.tofugu.com koichi

    Half width is nice when you need to make it look like romaji, which I want to say is half width. I’m a fan, anyways.

  • ZXNova

     You’d think they’d put more effort into something like this…

  • linguarum

    While we’re on the subject, what are those odd-looking characters at the top of Tofugu posts lately? The one at the top of this article looks like some Korean characters and an upside-down Roman capital “A.”

  • linguarum

    ((유∀유|||))

  • http://twitter.com/Meroigo Johannes / ヨハネス

    Funny how you chose the 冷. I remember in Japanese school when I had looked up a word with that in it on my iPod Touch (in the app “Japanese”) and was writing it in my notebook. The teacher stopped me and asked me to write it another way, saying that the way I was writing it was only used by computers and stuff. She told me the correct handwritten form is how the Chinese version looks. ;D Pretty confusing……. But these confused moments are really rare and I can’t say I have bumped into any “problems” while surfing the Japanese side of the internet. Except not knowing all the kanji, and having to look them up with Rikaikun (in Chrome). :P

  • Alvin B.

    It’s horrible on Android as well. I gave up trying to use Ankidroid because of font issues.

    Curious though, on the PC, how does Chrome show your furigana example?

  • http://www.tofugu.com/ Hashi

    Furigana renders fine on Chrome on all platforms as far as I know, but the default Japanese font is the big difference.

    Anyway, here’s what it looks like on my Windows 7 machine with the newest Chrome: http://i.imgur.com/HD09f.png

  • ಠ_ರೃ

     ( •̀ .̫ •́ )✧

  • ಠ_ರೃ

     It’s ’cause Google made Chrome too fast and now things can’t stop in time!

  • ಠ_ರೃ

     So how ’bout those line breaks, eh? Just poppin’ up wherever.

  • ಠ_ರೃ

     I think it looks adorable. It just tries so hard!

    ガンバレ、ハンカクカナサン!

  • http://www.tofugu.com koichi

    (*●⁰ꈊ⁰●)ノ

  • http://mobobe.com 13xforever

    I’ve been using  HTML Ruby Firefox addon for ages now (https://addons.mozilla.org/en-US/firefox/addon/html-ruby/ ).

    Yes, you have to install it separately, but strictly speaking, Ruby isn’t a part of HTML – it’s a module, so it’s up to browser vendor to decide whether to include it in their HTML engine or not.

  • http://twitter.com/arleas_ Lee Rolfing

    Actually I was using iknow.jp to learn a few words and was confused when they chose to use two different versions (apparently) in the same flash application no less.  I’d see 命令 and the 2nd kanji in that word would look like the Chinese version in one part, and the Japanese version in the other part.  I guess I can’t even suppose who sees it as what version.

    For the record, as I typed it I see it as the “no country” version (I suppose).  冷たい as well… I just figured it was a different style and went with it though… I mean, the first time I saw a cursive Q in English I was hella confused.

  • Mescale

    I can’t believe you made me start messing around with encodings and fonts on my web browser again. 

    Well played.

    (,,#゚Д゚):∴;’・,;`:ゴルァ!!

  • dthunt

     Y

  • dthunt

    Given I now want to tear my eyes out, I now see the problem with half-width katakana.

  • dthunt

    So, I’m going to come out against han unification here.

    People were IDIOTS to assume that utf-16 was a good idea, which I will happily assert without basis is the rationale behind han unification (I don’t see anyone saying that the 32-bit UCS-4/UTF-32 space is inadequate, yet, and variable-width encodings like utf-8 don’t care).

    These are different characters. Why would you collapse them merely because they look alike and (likely) have some common ancestor character?
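For anyone who wants to check the encoding sizes being argued about here, this is a small Python sketch (an illustration only, with variable names of my own choosing). It encodes 冷 in three encodings, then does the same for one ideograph from outside the original 16-bit range, which UTF-16 can only store as a surrogate pair.

```python
# Byte lengths of one common kanji in three Unicode encodings.
char = "冷"
utf8 = char.encode("utf-8")
utf16 = char.encode("utf-16-be")
utf32 = char.encode("utf-32-be")
print(len(utf8), len(utf16), len(utf32))   # 3 2 4

# U+20000 lies beyond the 16-bit range, so UTF-16 needs two 16-bit units.
rare = "\U00020000"
print(len(rare.encode("utf-16-be")))       # 4
print(len(rare.encode("utf-8")))           # 4
```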

  • Victorianakashima

    Hi
    I love your website, I think that it is very
    complete and super interesting.
    Keep it!
     

  • Stroopwafel

    I noticed the same thing as well. For instance; kanji like 家 and 糸 are both written slightly differently when you see them on a computer, in comparison to hand-writing. I guess there are just certain radicals that look slightly different when you write kanji in Chinese. Funny, eh?
    Maybe you could write Tofugu Guide on this subject, Hashi.

  • http://www.tofugu.com/ Hashi

    I’m definitely thinking about writing a guide.

  • http://www.tofugu.com/ Hashi

    THE ULTIMATE TROLL

  • http://www.tofugu.com/ Hashi

    It’s crazy where yo
    u’ll find them!

  • http://www.tofugu.com/ Hashi

    Yeah, fake ruby via CSS wizardry is definitely always an option, but I’d really like if Firefox supported ruby natively.

  • http://www.tofugu.com/ Hashi

    ¯_(ツ)_/¯

  • http://www.tofugu.com/ Hashi

    Huh, I didn’t know that ruby was an HTML module – worse yet, I wasn’t aware that the W3C made distinctions like that.

    Even still, it seems like every browser vendor at least plans to implement ruby, as far as I can tell.

  • http://mobobe.com 13xforever

    You can skim the recommendation (it’s not so big) and it’ll be more clear what’s the state of this feature: http://www.w3.org/TR/ruby/

  • ultraali453

    I guess us beginners can use furigana addons with browsers to make reading easier

  • IndigoSelvedge

    Rikaikun works for Chrome, as well.

  • ಠ_ರೃ

     I agree. I’ve seen a lot of complete websites, but this? This is VERY complete. I also vote for keeping it.

  • Zealot_04

    About 冷… Now this is probably going to frustrate you even more Koichi, but in Japanese text you’ll see it appear as the Korean version, but never this way in handwriting because it would be improper. In fact, everyone in Japan writes it the Chinese way. I asked professors about this at my university (南山大学) in Nagoya, and they insisted that only the Chinese way is acceptable in handwriting. Yamasa also supports this:

    http://yamasa.cc/members/ocjs/kanjidic.nsf/SortedByKanji2THEnglish/%E5%86%B7?OpenDocument 

  • Eee

    A few understand the Unicode standard in the wrong way. Unicode is just a binary representation of a character that can be implemented in different character encodings. By itself, it is NOT a character for “display”; it is a character codepoint – a binary index to represent a character in a binary array that represents arbitrary characters. Note “arbitrary” there since I can design a font that represents something different than the normal characters like Wingdings font.

    In Unicode, the first 128 codepoints are the same in any character encoding – they are called ASCII character set and they are present in codepage-based ASCII encoding (Latin-1 (Windows-1252), Cyrillic, Thai, etc.) and UTF-xx encoding.

    Anything above the first 128 codepoints are either codepage-based specific or Unicode-based. 

    The next 128 codepoints (128 to 255) can be codepage-specific or Unicode-based.

    However, codepage-based encodings only cover the first 256 codepoints. Codepoints above 255 fall under the Unicode encodings: UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE and UTF-32-BE.

    A “display” character that you can see is represented by a font symbol (a combination of one or more glyphs).

    For example, an apostrophe symbol in Arial font is U+0027 in Unicode, which is 0x27 in hex (39 in decimal). It sits at position 0x27 in the Arial font’s character map and it contains one glyph. A symbol in any CJK-based font can have more than one glyph. If you use Wingdings font, codepoint 0x27 is not an “apostrophe”; it is a “candle”.

    Note that there are non-Unicode-supported fonts and Unicode-supported fonts. Non-Unicode-supported fonts cover up to 256 codepoints (0x00 to 0xFF): on Windows, the fonts are Symbol, Courier, Serif with .FON extension

    Unicode-supported fonts are fonts that include the first 256 codepoints (128 ASCII set + codepage specific set) and Unicode codepoints; they can be character set specific or full character set. These fonts are provided in different extensions: TTF (TrueType Font) and OTF (OpenType Font)

    Character-set specific Unicode-supported fonts are Mincho (Chinese-Japanese-Korean (CJK) charset), Latha (Tamil charset), Mangal (Devanagari charset).

    Full Unicode-supported fonts are fonts that cover all Unicode charsets. On Windows, it is Arial Unicode MS (available when Microsoft Office is installed)

    All browsers support changing fonts and encoding to suit how you want to display characters in the web pages if the CSS font style attribute is not set.

    Han unification? Probably, that is why it is called CJK charset.

    If anything is wrong in what I’ve just said, please correct me.

  • Pmhausen

    Hi, all,

    since I did not fully understand the concepts in the article at first, I looked up Han Unification – there is a
    dedicated Wikipedia entry on it.

    Oddly enough, in the tables that are supposed to feature examples of successful unifications and not so
    successful ones, to me currently all characters in one row look exactly the same!
    I’m reading and writing on an iPad at the moment.

    So ,yes, now at least I do understand the problem ;-)

    Best regards,
    Patrick

  • Jonathan Harston

    Agree with you there. I had to fight to get a page on my site (mdfs.net/JGH/Docs/Me/iknow1.htm.sjis) to render *one* name in Japanese correctly. I had to put a Content-Type in the header *and* change the page name to end with “.sjis” *and* constantly cut’n’paste the characters from other pages to try and get the correct characters pasted in. The thing is, the Content-Type is all you should need. Grrr.

  • Justin Stressman

    Hashi-san, one thing you might want to note is that you apparently missed part of the ruby mark-up.

    Take for instance the following;
    <ruby>子供<rp>(</rp><rt>こども</rt><rp>)</rp></ruby>

    The <rp> tags are the “Ruby Parentheses” tags that allow you to specify which optional characters you’d like to use to enclose the ruby text in browsers that don’t properly support ruby (like Firefox).

    You stick the whole thing in a <ruby> container, stick your kanji first, and following the kanji you’d like the furigana to be associated with you stick the <rp> tags for the left parenthesis, the <rt> tags enclosing the furigana ruby text, and then the right parenthesis <rp> tags.

    So the example above would actually show up as 子供(こども) in Firefox, which really isn’t that bad.

    And while it might be more accurate to do furigana per kanji, until Firefox includes official support, for now it’s visually better to just do whole word ruby I think.

    (IE9 actually has nice support for Ruby as well. So Chrome, Safari, IE9+ support it, but Opera and Firefox don’t yet. And in my experience trying to hack in support with CSS gets really messy because of inconsistent browser support for inline-table placement and the other types of CSS generally used for placement. So I think it’s extensions for now for Firefox or live with the fallback ruby as shown above.)

  • http://www.tofugu.com/ Hashi

    Glad that the wiki article helped you out! Han Unification is definitely a complicated issue and I wasn’t sure if I explained it very clearly :x

  • http://www.tofugu.com/ Hashi

    Good point, I should probably stop dragging my feet waiting for Firefox (and Opera, I guess) to have proper Ruby support and just use the kind of markup you talk about.

  • http://www.tofugu.com/ Hashi

    I’m not sure if anything you’ve said is wrong, because I think it was mostly over my head, sorry :x

  • iryw

    Well you could include furigana and limit those who can see it by applying to only those with Safari browsers.

  • Guest

     Thanks for the tip! As a junior web-designer myself, info like this is very useful and I’ll bear it in mind for future sites. However, another crushing blow to Firefox.. moving to Chrome is starting to look close to inevitable now.

  • Zaywex

    Well, there’s already a highlight and look up kanji app which makes things easier.

  • Melissa

    Thanks for explaining the Unicode. I remember looking up the stroke order for the kanji in the picture and being confused because the drawing looked nothing like the text…

  • Takashiro

    That would defeat the purpose of the web, we need to work within the standards that we have and push the browser developers to continue improving support for certain features. I mostly use Chrome, but Tofugu, and other sites work best with Safari, as Apple’s had the benefit of creating multi-lingual OSes.

  • http://mistersanity.blogspot.com Jonadab

    The ruby element wouldn’t be needed in HTML, and every single other application and file format ever wouldn’t also need to make special provisions for ruby text, if only Unicode had included combining furigana characters (and pinyin), similar to how it includes combining diacritics. Unlike Han Unification (which cut the total number of Unicode codepoints in half or better and is probably the only reason there’s any such thing as a complete Unicode font), the failure to support furigana saved, what, fifty characters? Maybe closer to eighty if you don’t want to nest combinations (e.g., to add combining-furigana-dakuten onto a combining-furigana-fu when you need a bu pronunciation). That would’ve been totally manageable, and it would make dealing with Japanese text *majorly* easier.

  • AnadyLi

    I’m thirteen, Chinese, and I have NEVER noticed this.

  • Johannes Löthberg

    “Half-width kana were used in the early days of Japanese computing, to
    allow Japanese characters to be displayed on the same grid as monospaced fonts of Latin characters”

    http://en.wikipedia.org/wiki/Half-width_kana
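As a side note on how half-width kana behave today: Unicode treats them as compatibility variants of the regular kana, so compatibility normalization (NFKC) folds them back into full-width characters. Here is a minimal Python sketch using the standard unicodedata module; the sample string is just an example I picked.

```python
import unicodedata

# "Ganbare" typed in half-width katakana. The voiced marks are separate
# characters here, so the string is six code points long.
half_width = "ｶﾞﾝﾊﾞﾚ"

# NFKC replaces each half-width kana with its full-width form and then
# combines base kana + voiced mark into single characters.
full_width = unicodedata.normalize("NFKC", half_width)

print(full_width)                          # ガンバレ
print(len(half_width), len(full_width))    # 6 4
```

This is handy for search and comparison, though it is lossy: after normalizing you can no longer tell that the original text was half-width.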

  • http://twitter.com/Silgrond Silgrond

    I’m really late, but I’ve always hated the font rendering on Windows. I’ve found this app, which basically changes the Cleartype rendering to mac-like. You can even change it to be registry-based, so everything will be smooo~th. (There are “themes” too) http://code.google.com/p/mactype/