How are 'Words' and 'Characters' counted for Asian (Chinese) languages?

This is a very simple, and very important, question, but I can't find a clear answer anywhere.

In my project, 'Use word-based tokenization for Asian languages' is NOT ticked. It would be good to know how this affects the word count. How are Asian words 'tokenised', and what does that mean?

My language is Chinese. I don't see how 'words' are defined here. 


4 Replies Latest Replies: 10 Jan 2019 10:17 AM by Steven Whale
  • Hello,

    The following Gateway article explains this well (since Studio 2017):

    SDL Trados Studio Application

    For users to get more complete information when translating from Asian languages, in light of the new Asian tokenization option, there is now a Word column in the Analysis report for Asian source languages:

    > If character-based tokenization (active by default) is used, the Word column reports each Asian-language character as one word and each Western-language word as one word.

    > If the new word-based tokenization is used, the Word column reports Asian-language words as the words identified by the new tokenization engine, and each Western-language word as one word. This typically results in a lower word count.
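    To make the character-based rule concrete, here is a rough, hypothetical sketch in Python of how such counting could work. This is an illustration only, not SDL's actual engine; the regex ranges and the function name are my own assumptions:

    ```python
    import re

    def char_based_word_count(text):
        # Hypothetical illustration of character-based counting:
        # each CJK character counts as one word, and each run of
        # Latin letters counts as one word. Not SDL's real engine.
        cjk_chars = re.findall(r'[\u4e00-\u9fff]', text)
        latin_words = re.findall(r'[A-Za-z]+', text)
        return len(cjk_chars) + len(latin_words)

    # 汽车 ("car") is two characters, so character-based counting
    # reports 2 words; a word-based tokenizer would likely report 1.
    print(char_based_word_count("我有一辆汽车 and a train"))  # 6 + 3 = 9
    ```

    Under word-based tokenization the engine would instead try to segment 汽车 as a single word, which is why that mode typically gives a lower word count.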

  • In reply to Steven Whale:

    Hi Steven,

    Thanks for your reply.

    So, 'words' means Asian characters + English words? And 'characters' means Asian characters and all English language characters?

    Can you provide more information on the engine for 'Asian-language words as words identified by the new tokenization engine'? What does this mean exactly? Does it mean it recognizes 车 as one word and 汽车 as one word also? If so, how reliable is the engine?
  • In reply to Frances Nichol:

    You are amazing.
    I can't work out any logic from the linked article; it goes on far too long.
    It made me feel like an idiot, so I gave up on reading it in full.

    A few brief examples (car vs train) would be enough to satisfy me.
  • In reply to kellyedward:

    I agree, the language is not very clear or helpful, and examples would help a lot!

    Also, for me the word-based tokenisation (if it is what I think it is) is not useful at all, since I'm used to calculating work based on characters, not on meaning units (which is what I think they mean by 'word').