Is there any way to get Studio's segmentation to include closing punctuation following a break plus a hard space?

I work in a French/English environment where we translate in both directions and, to maximize leverage from our respective memories, want to respect each language's house style. Our house style for French is to insert a space after the opening French quotation mark («) and before the closing one (»). I cannot figure out how to make Studio 2015 break after the closing quotation mark when it is separated from a full stop, question mark or exclamation point by a space. I am new to regular expressions, but I tried to force Studio to include a closing quotation mark preceded by a space by using (?:\p{Zs}\p{Pe})|[\s] in the After break field. I tried with and without Include closing punctuation, but it did not work. Any suggestions?

  • Hi Janis,

    Can you share an example of the text you have, what happens now, and what you would like to see? I'm not too clear.

    Thanks

    Paul
  • Hi Paul; thanks for the quick response.

    Here is some sample text with the three stops that might be followed by a closing French quotation mark:

    « Voici un exemple avec un point final. » « Pourquoi est-ce que Studio ne marche pas comme il faut pour nous? » « Zut, alors! »

    French-Canadian style is to separate the quotation marks from the text they surround with a space. (Our house rules require a hard space, and incoming source documents are normally edited to change soft spaces to hard.) The default segmentation rules for our French-English memories cause Studio to place the closing quotation mark in a segment all on its own, requiring us to manually edit the source segments.

    My latest test regex, @"(?:\u0020\u00BB[\s])|[\s]", with Include closing punctuation unchecked, put all of the above three sentences in one segment. The same string without the @ and quotation marks gave me the same results as the default segmentation. (I guess I need to paste from Regex Buddy as is? I have .NET flavour selected).

    The result is four improperly punctuated segments instead of the three properly punctuated ones:

    « Voici un exemple avec un point final.
    » « Pourquoi est-ce que Studio ne marche pas comme il faut pour nous?
    » « Zut, alors!
    »
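
As a rough illustration (a Python sketch, not Studio's actual engine), the four-segment result can be reproduced with a simplified stand-in for the default rules: break after ., ? or ! whenever whitespace follows. The names `split_default` and `DEFAULT_BREAK` are made up for this sketch, and Python's `re` module stands in for the .NET flavour. The sketch also prints the Unicode general categories of the two marks: » is in category Pf (final quote punctuation), not Pe (close punctuation), so a pattern built on \p{Pe} would not be expected to match it.

```python
import re
import unicodedata

NBSP = "\u00a0"  # hard (no-break) space, as required by the house style

# Sample text from the post, with hard spaces inside the quotation marks.
sample = (
    f"«{NBSP}Voici un exemple avec un point final.{NBSP}» "
    f"«{NBSP}Pourquoi est-ce que Studio ne marche pas comme il faut pour nous?{NBSP}» "
    f"«{NBSP}Zut, alors!{NBSP}»"
)

# Simplified stand-in for the default rules (an assumption, not Studio's
# real configuration): break after . ? or ! when whitespace follows.
DEFAULT_BREAK = re.compile(r"[.?!](?=\s)")


def split_default(text):
    """Cut the text after every match of DEFAULT_BREAK."""
    segments, start = [], 0
    for match in DEFAULT_BREAK.finditer(text):
        segments.append(text[start:match.end()].strip())
        start = match.end()
    segments.append(text[start:].strip())
    return [s for s in segments if s]


# General categories of the opening and closing French quotation marks.
print(unicodedata.category("«"), unicodedata.category("»"))
for segment in split_default(sample):
    print(repr(segment))
```

Because the hard space counts as whitespace, each stop triggers a break and the closing » is pushed to the start of the next segment.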

  • Hi Janis,

    It sounds like you need an exception to the full stop and other terminating punctuation rules, rather than a new segmentation rule, as explained here:

    To achieve what you want, add an exception like this:

    Make sure to add it both to the full stop rule and to the other terminating punctuation rule. That will do the trick. : )

    Edit: To elaborate on the above, make sure you have selected both ? and ! when you add the exception to the "Other terminating punctuation" rule, so that it will look like this:

    You will also need a segmentation rule to break before «, like this:

  • Thanks, Nora, but I don't see how this is going to work for us, as 1) not every sentence that ends with a closing quotation mark is going to be followed by one that starts with an opening quotation mark, and 2) French quotation marks can occur within a sentence as well, just like English quotation marks. (J'ai dit « bleu », pas « rouge »!/I said "blue", not "red"!) I think what we need is a rule that recognizes and includes a single space plus closing punctuation as closing punctuation.

  • Oh I see, sorry, I went by the example you provided. The problem with a closing rule for space plus closing quote is that you would need to get rid of the other rules (full stop and other terminating punctuation) for it to work. In other words, you're getting the segmentation you're seeing because the full stop, ? and ! rules are applied regardless of what comes after them, so I really think exceptions are the best solution. Then you would just need to tweak the "Break before «" rule so that the break is only applied when the « is followed by an uppercase letter, something like this:

    And then a separate rule would need to be added for a break after » + space plus an uppercase letter, like this:

    Using the above rules in your provided sample text would result in this:

  • Wow, Nora, it looks like you have done it! I admit I have worn out my brain trying to wrap it around this business, but I can't think of any reason why this won't do the trick. Thanks, you're a gem!
  • Happy to help : ) I'm still learning RegEx myself and I know how confusing it can get.
  • Well this is interesting. We were having trouble getting this solution to work for us until we deselected the new filetype for Word 2007-2016 and used the old one. Just out of curiosity, Nora, which filetype do you use as your default?
  • Hi Janis,

    I use the new file type as my default (WordprocessingML v. 2).

    There was a recent discussion about this issue on the forum, which you can find here: community.sdl.com/.../7153.

    If I recall correctly, the segmentation rules were applied differently when Word's automatic spellchecking function was enabled: apparently some kind of marker (a spellchecking tag of some sort) gets inserted that can disrupt the regex pattern. The workaround is either to process the files with an old Word filter or to disable "spell check as you go" in Word.
  • Thanks, Nora. I had in fact printed off the xml document, and saw the very thing discussed in that thread, but didn't know how to interpret it, so again, many thanks!

  • Hi Nora. Can I ask you a related question? I have a very similar problem and it's probably easy for you but my brain doesn't work that way!

    Scenario: I get files where punctuation is often not tidy or where !! is used to separate segments, like this:

    A sentence missing a space at the end!And another sentence.

    A sentence.!!Another sentence without ending punctuation!!And another at the end

    The default settings for ?! have "Whitespace" selected as the After Break. I played around with it and discovered that if I set this to Anything, it splits at these ! but creates problems elsewhere, in particular with placeholders such as these:

    Open {1}!s! and then open {2}!s!.

    which it now splits as

    Open {1}!

    s!

    and then open {2}!

    s!

    .

    I can manually merge them but that gets a bit tedious. I get the feeling from your above solution that I should be able to tell it not to split before s! but I can't figure how to. Any suggestions would be highly welcome!

  • It depends on what you want to do. For example, if you have this source:

    You could achieve this which looks sensible to me:

    I would tag the {nr}!s! rather than try and segment here. I only tried this once and it seemed to work, but I guess you could be smarter with the expression. I used this:

    Before break:

    (?:!(?!s!))[!]

    After break

    .

    So basically used a negative lookahead to suppress the segmentation if the construct was the part to tag.
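
The general idea of using lookarounds to suppress a break inside the {n}!s! construct can be tried out in Python. This is a hypothetical sketch, not the expression above and not a Studio rule: it uses two negative lookbehinds (Python requires them to be fixed-width) and assumes every placeholder has the exact form `{n}!s!`. The names `BANG_BREAK` and `split_on_bang` are made up for the example.

```python
import re

# Break after a run of "!" unless that "!" is part of a {n}!s! placeholder:
#   (?<!\})    the "!" does not directly follow "}"   (first "!" of {n}!s!)
#   (?<!\}!s)  the "!" does not directly follow "}!s" (second "!" of {n}!s!)
BANG_BREAK = re.compile(r"(?<!\})(?<!\}!s)!+")


def split_on_bang(text):
    """Cut the text after every run of "!" matched by BANG_BREAK."""
    segments, start = [], 0
    for match in BANG_BREAK.finditer(text):
        segments.append(text[start:match.end()].strip())
        start = match.end()
    segments.append(text[start:].strip())
    return [s for s in segments if s]


examples = [
    "A sentence missing a space at the end!And another sentence.",
    "A sentence.!!Another sentence without ending punctuation!!And another at the end",
    "Open {1}!s! and then open {2}!s!.",
]
for text in examples:
    print(split_on_bang(text))
```

The sketch only handles breaks on "!"; full stops and question marks would still need their own rules.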

  • Actually that's awful... will take another look.

  • ok - quick dinner and brain refreshed. Easiest way is to use two rules:

    Rule #1

    Rule #2

    This gets you this:

    Depending on your filetype tag the rest with an appropriate method to get this:

    I used the SDL Data Protection Suite to do this, but if your filetype supports embedded content in some way, just use that.