Segmentation Rule

Dears,

 

I would like to ask the below questions.

  1. I created a TM with full-stop segmentation and translated a file against it. Can I export the TM as TMX and create a new TM from the exported TMX, but with paragraph-mark segmentation? If yes, how can I see the changes? I can't see any change, and the segments are still split at full stops.
  2. Can we change the segmentation of an already translated bilingual file from full stop to paragraph mark?

 

 

Best Regards,

Samar

  • Hi  

    A TMX file doesn't hold segmentation rules.

    If you want to use paragraph segmentation for all future files then creating a new TM with this option will work for all future files. But if you import your TMX into this new paragraph-based TM, the segments will still be sentence based, as these are already defined in the TMX. I'm not aware of any tools that can go from sentence to paragraph... only a few that go the other way around. Part of the problem, I guess, is that a TM is not a true reflection of the original documents, so making sure the paragraphs were really correct would be tricky if not impossible. Technically, the TMX puts each segment, whether sentence based or paragraph based, into its own TU, so there is nothing in the TMX to tell any tool whether the TUs were part of a larger entity or not.

    What may be useful is that, if you do this, the fragment matching feature can pick out the TUs. So whilst you won't get proper pretranslation leverage, at least you would still be able to leverage the work interactively.

    Once you have converted your bilingual file that's it.  You can't change the segmentation at this point, you need the source file for that.  Perhaps a potential solution would be to align the source and target files with a TM set up for paragraph based segmentation instead of trying to change the bilingual files... although I wouldn't hold my breath!

    If there is a solution for this out there I'd also be interested to learn.
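
    To illustrate the point about TUs standing on their own, here is a minimal Python sketch that reads a tiny, made-up TMX sample with the standard library. The sample content and the `read_units` helper are assumptions for illustration, not a real Studio export. The TMX header does carry a descriptive `segtype` attribute, but each `<tu>` is just a source/target pair with no link to a paragraph or to its neighbours.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-unit sample; a real export carries many more attributes.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en-US" srclang="en-US" datatype="xml"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>First sentence.</seg></tuv>
      <tuv xml:lang="de-DE"><seg>Erster Satz.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-US"><seg>Second sentence.</seg></tuv>
      <tuv xml:lang="de-DE"><seg>Zweiter Satz.</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_units(tmx_text):
    """Return one {language: text} dict per translation unit."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline tags.
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(unit)
    return units


units = read_units(TMX_SAMPLE)
for unit in units:
    print(unit)
```

    Both sentences come back as separate units, and nothing in the file says they once belonged to the same paragraph, which is why importing into a paragraph-based TM cannot merge them.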

  • Thank you for your reply.

    I would like to ask you a question. When I create an SDL project with a new paragraph-based TM, the newly created file is not segmented by paragraph, as the screenshots below show.

    I expected each highlighted paragraph to be presented as a single segment in Studio, but this didn't happen, and the text is still segmented at full stops.

     

     

    Best Regards,

    Samar

  • Looks like you did not add the TM to the project (AND turned it on) BEFORE performing the 'Convert to translatable format' batch task.
    This task (and eventually also the 'Copy to target languages' task) actually does the segmentation.
  • Hi ,

    The short answer is you’ve done something wrong. Make sure you are really using the TM set up for para segmentation to prepare your project.

    Make sure you set this up using the option for para segmentation and not something based on some rule that you think might be driving this kind of segmentation.

    I’d check these things first.
  • Dears,

    Thanks a lot for your support. I found the wrong step: I had created the TM under All Language Pairs. It worked once I created the TM under the project's language pair instead.

    Best Regards,
    Samar
  • Dears,

    Sorry to interrupt, but I have another question about segmentation. I tried to create a WinAlign project by creating a TM with paragraph segmentation and adding the source and target files, but the bilingual file is segmented by sentence, not by paragraph, as the screenshots below show.

     

     

    Best Regards,

    Samar

  • Then you probably did NOT use TM with paragraph-based segmentation during the alignment process.
  • Hi  

    This is a bit of a tricky problem because only the source language will be segmented correctly.  There is no way to segment the target based on a target TM as you can only set a TM to act on the source.  So you have to do this sort of thing when you align:

    The source gets segmented correctly, by paragraph, but the target will be by sentence as the default rules are used for target whether you specify a TM or not.
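
    A rough Python sketch of why that mismatch hurts alignment. The two splitter functions are naive stand-ins I made up for illustration, not Studio's actual segmentation rules, and the sample text is invented.

```python
import re


def segment_by_paragraph(text):
    """Naive paragraph rule: one segment per non-empty line."""
    return [line.strip() for line in text.split("\n") if line.strip()]


def segment_by_sentence(text):
    """Naive sentence rule: split each paragraph after ., ! or ?"""
    segments = []
    for paragraph in segment_by_paragraph(text):
        parts = re.split(r"(?<=[.!?])\s+", paragraph)
        segments.extend(part for part in parts if part)
    return segments


source = "The cat sat. It purred.\nThe dog barked."
target = "Die Katze saß. Sie schnurrte.\nDer Hund bellte."

source_segments = segment_by_paragraph(source)  # paragraph rule from the TM
target_segments = segment_by_sentence(target)   # default sentence rule

print(len(source_segments), source_segments)
print(len(target_segments), target_segments)
```

    With the paragraph rule on one side and the sentence rule on the other, the two sides produce a different number of segments, so they cannot pair up one to one. Applying the same rule to both sides gives matching counts for this sample.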

  • Paul said:
    This is a bit of a tricky problem because only the source language will be segmented correctly.  There is no way to segment the target based on a target TM as you can only set a TM to act on the source.

    You can set segmentation rules for BOTH the source and target language in TM.

    And if I remember correctly, the last time I experimented with paragraph-based segmentation in alignment (autumn 2017), it worked pretty well.

  • Hi ,

    I'm waiting for your instructions... I really hope you're right as this would be very useful indeed. I have no idea how to do that for two reasons based on what I "thought I knew".

    - segmentation rules only apply to source on a TM
    - if you align an EN -> DE for example you select your own EN -> DE which segments the source file, but the DE file is segmented using a TM created by Studio in the background, DE to something.

    So very happy to be educated by you Evzen.
  • It could be that the source format played a big role in my case... it was MadCap Flare XML/HTML, so the segmentation is pretty much defined by the file type, rather than the TM-defined rules.

    Of course I cannot be expected to know how Studio works internally... all I know is that I:

    - created new empty TM where I changed the segmentation to Paragraph based for both source- and target language

    - used this TM for running the alignment

    That's all. I don't (and can't) know exactly which "magic" (or coincidence) made it align just as one would expect ;-). Perhaps the internal "reversed" TM is created by reversing the actual TM (similarly to what AnyTM does)? That would make sense...

    I didn't explore it any deeper as we ended up not going further with paragraph-based segmentation and went the harder way of sentence-based  segmentation.

  • Hi  

    I would never have thought of doing that, as I usually select an existing TM and it's too late at that point. But you are absolutely right... and I'm really happy to see this.

    Thank you for sharing this information.... something we should definitely document somewhere as I'm sure it will be useful to many users.  Or maybe I was the only one who didn't know this!!

    Thanks

    Paul

  • I believe that not many users actually know this... because the TM creation/settings GUI hides the fact that there are more languages than just the source one, and even discovering the dropdown content does not make the consequences immediately clear to the user.
    I'm not quite sure this is intentional... one may presume that 'not making it too complex for the user' was the driver, but in that case I would expect some elementary settings (at least the segmentation type) to be "synchronized" between the source and target language automatically in the background.
  • Hi ,

    I spent some time testing this tonight and I have to say I think we were lucky with the paragraph segmentation rule. Try it with any other kind of segmentation rule and the effect is mindblowing... at least I'm really struggling to see any logic in how this works. I certainly think my original assumption was correct in every other case I tested this evening. I don't think you can affect the target segmentation rules in the way described at all. I also looked at retrofit and this seems to do something else again.

    I can only conclude that whilst it seemed to make perfect sense, and I really wanted that to work, it does not. At least not in every case. I'm going to try and get to the bottom of this so I can understand what's going on.
  • Paul said:
    Try it with any other kind of segmentation rules and the effect is mindblowing... at least I'm really struggling to see any logic in how this works.

    Hmmmm... just out of curiosity, you are testing it with the latest version, I suppose... the one with god-knows-how broken segmentation rules...
    Would you mind doing same tests with "last (kind-of)sensibly-behaving version", i.e. 2017 CU5 (last pre-SR1) and 2015 SR3 (vanilla, w/o CUs)?
    I feel that it may behave differently...

  • Hi Evzen,

    I have actually found a few more interesting things... albeit embarrassing!

    1. I was aligning a completely different target file (same name, different location) and didn't notice
    2. Studio 2017, current version, actually handles the custom rules exactly as you said and works as expected
    3. Studio 2015 completely ignores the use of custom rules so actually 2017 SR1 CU9 works correctly for me. It's an improvement.
    4. Retrofit alignment works differently and won't apply any custom rules