Is there a more automatic way to remove the extra space between English text/numbers and a Chinese character?

Dear Studio 2017 users,

I just completed a big editing job where the client no longer wants a space between English text/numbers and a Chinese character, which was the style used previously.  I ended up spending 20-25% of the editing time manually removing these extra spaces from the matches coming from the project TM.

Is there a way or an app that can automatically remove these spaces for me?  Also, in some cases, the text/Chinese character is surrounded by a formatting tag pair, and the extra space is either before the opening tag or after it.  I wonder if there is a way/expression that can ignore the tags and find/delete the extra space next to the tag.

Thank you in advance for any suggestions!

Chunyi 

  • Hi Chunyi,

    Did you try this with the SDLXLIFF Toolkit? We made some fixes to this tool a few months ago to handle something like this where Studio fails with the out of the box search and replace. It would be really helpful if you created some small examples of the text in a file when you ask questions like this because then we won't spend so much time going backwards and forwards suggesting things that don't work for you. Just create a small word file containing the examples in a few segments and attach to your post.

    Regards

    Paul
  • Hi Paul,

    I pasted some sample sentences below. I will send the files to your email address, as this reply box is in simple mode and I can't attach them using the image icon.
    I forgot to include a sample sentence that contains a proper name, say, Aefer Bee. I do want to keep the space between English words:)
    Thanks a lot for testing. I will also check out SDLXLIFF Toolkit later.

    This is a test. For emergencies, please call 911.
    Call toll free at 1 800 123 4567 to learn more.
    It is effective from July 1, 2017 to June 30, 2018.

    Chunyi
  • Hi Chunyi,

    You could click on "Advanced Editing Options" and then you can load your files there too.

    It's quite tricky to find rules for all cases, and some of them, around the tags for example, I can only solve by working directly on the SDLXLIFF in a decent text editor.  So if I search for this:

    ([\u4E00-\u9FA5])\s(<[^/]+>\d)|([\u4E00-\u9FA5]<[^/]+>)\s(\d)

    And replace with this:

    $1$2$3$4

    Then this will resolve the ones with spaces either side of the tags.  The ones without tags at all I presume you have no problem with as these are easily handled in the Studio Editor or the Toolkit for example.
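
For anyone who wants to try the tag-aware expression outside a text editor first, here is a minimal Python sketch of the same search and replace. The pattern is copied unchanged from the post above; the `<g id="1">` tag and both sample strings are hypothetical stand-ins, not taken from the real files. Python has no `$1$2$3$4` syntax, so the sketch joins whichever groups matched, which should give the same result.

```python
import re

# Pattern from the post above: a Chinese character and a digit separated by
# one whitespace character that sits just before or just after an opening tag.
TAG_SPACE = re.compile(
    r"([\u4E00-\u9FA5])\s(<[^/]+>\d)|([\u4E00-\u9FA5]<[^/]+>)\s(\d)"
)

def strip_space_around_tag(text: str) -> str:
    # Stand-in for the $1$2$3$4 replacement: keep every group that matched
    # and drop the whitespace between them.
    return TAG_SPACE.sub(lambda m: "".join(g or "" for g in m.groups()), text)

# Hypothetical segment contents (assumed tag markup, invented sentences).
space_before_tag = '请拨打 <g id="1">911</g>。'
space_after_tag = '请拨打<g id="1"> 911</g>。'

print(strip_space_around_tag(space_before_tag))
print(strip_space_around_tag(space_after_tag))
```

Because `[^/]+` cannot cross a slash, closing tags such as `</g>` are never treated as the tag in the pattern. Test on a copy of the file before touching the real one.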

  • Hi Paul,

    Thanks for letting me know of the advanced editing options. I am sure I used it before but forgot:)
    I am very happy to learn that the ones without tags can be handled using the Toolkit. I am always concerned when I move the file to a text editor. I could accidentally delete stuff that would end up destroying the file:)
    I am going to look at your video now.
    And one last question: is there a difference between 32-bit and 64-bit Notepad++? My computer is 64-bit, so does that mean I should download the 64-bit Notepad++? I opened a Studio file in the 32-bit Notepad++ and saw a huge chunk of strange characters at the top of the file. That has made me more hesitant about handling Studio files in an external text editor.

    Chunyi
  • 64-bit is the way to go... more performant!

    The huge chunk of strange characters is just the encoded embedded file. This is what allows you to send the sdlxliff alone to someone and they will be able to save the target file. The text you'll be changing is below all of that. So don't worry about it, but don't change any of it.
  • Chunyi Chen said:
    I am very happy to learn that the ones without tags can be handled using the Toolkit.

    Could also be handled using the out of the box search and replace in Studio!

  • OK, I will take your word for it and give Notepad++ another try!
    Thanks for your advice on the version to choose!

    Chunyi
  • Paul, in fact I don't know how to remove the instances of extra spaces that do not involve tags using search and replace in Studio. Could you show me the way?
    Thanks a million!

    Chunyi
  • Interesting! I like it that I can use RegexBuddy to check what was replaced before committing these changes in Notepad++. I have both programs and should use them to boost my productivity!
    Thank you for the video, Paul!

    Chunyi
  • Chunyi Chen said:
    Paul, in fact I don't know how to remove the instances of extra spaces that do not involve tags using search and replace in Studio.

    Use Ctrl+H to bring up the replace dialogue, then do two operations.  First, find Chinese characters followed by a space and a number.

    Search for:

    ([\u4E00-\u9FA5])\s(\d)

    Replace with:

    $1$2

    Then repeat for numbers followed by a Chinese character (this looked appropriate from your test files... but you be the judge of that).

    Search for:

    (\d)\s([\u4E00-\u9FA5])

    Replace with:

    $1$2

    You've got RegexBuddy... try it out!

    Regards

    Paul
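
The two operations above can be sketched in Python to preview their effect before running them in Studio. This is only an illustration under assumptions: the helper name `remove_space_near_digits` is made up, and the Chinese sample sentences are invented renderings of the English test sentences, not the actual target text. Python writes the replacement as `\1\2` where Studio uses `$1$2`.

```python
import re

# Same character range as in the expressions above.
CJK = r"[\u4E00-\u9FA5]"

def remove_space_near_digits(text: str) -> str:
    # Operation 1: Chinese character, one whitespace character, digit.
    text = re.sub(rf"({CJK})\s(\d)", r"\1\2", text)
    # Operation 2: digit, one whitespace character, Chinese character.
    text = re.sub(rf"(\d)\s({CJK})", r"\1\2", text)
    return text

# Hypothetical target segments.
samples = [
    "如遇紧急情况，请拨打 911。",
    "请致电 1 800 123 4567 了解详情。",
    "有效期为 2017 年 7 月 1 日至 2018 年 6 月 30 日。",
]
for segment in samples:
    print(remove_space_near_digits(segment))
```

Spaces between two digits, as inside the phone number, are left alone because neither pattern matches a digit on both sides.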

  • And let me add, as I've only done this for numbers again! Use this to catch text/numbers. Search for:

    ([\u4E00-\u9FA5])\s(\w)

    Replace with:

    $1$2

    Then repeat for text/numbers followed by a Chinese character (this looked appropriate from your test files... but you be the judge of that).

    Search for:

    (\w)\s([\u4E00-\u9FA5])

    Replace with:

    $1$2

    I'm sure we already had this conversation... or was I dreaming?
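
A quick way to check that the `\w` versions keep the space inside an English proper name is a small Python sketch like the one below. The helper name and the Chinese sample sentences are assumptions made for illustration. One thing to be aware of: in Python, as in .NET, `\w` also matches Chinese characters, so a stray space between two Chinese characters would be removed as well.

```python
import re

# Same character range as in the expressions above.
CJK = r"[\u4E00-\u9FA5]"

def remove_space_near_word_chars(text: str) -> str:
    # Chinese character, one whitespace character, any word character.
    text = re.sub(rf"({CJK})\s(\w)", r"\1\2", text)
    # Any word character, one whitespace character, Chinese character.
    text = re.sub(rf"(\w)\s({CJK})", r"\1\2", text)
    return text

# Hypothetical target segments; "Aefer Bee" is the proper name from the thread.
print(remove_space_near_word_chars("请联系 Aefer Bee 了解详情。"))
print(remove_space_near_word_chars("请拨打 911。"))
```

The space in "Aefer Bee" survives because neither side of it is a Chinese character.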
  • Hi Paul,

    I tried the regexes using Find and Replace in Studio but I don't think Studio's Find & Replace accepts Regex.
  • It sure does... always has done! Make sure you activate that you are using regular expressions at the bottom of the window. Might say "wildcards" by default.
  • Hi Paul,

    These two expressions work great in the SDLXLIFF Toolkit! I just tried both and they successfully removed the extra space before and after a number in the file!!

    Chunyi
  • So try this too: