XPath parser to exclude part of text from content

Hello,

I am trying to figure out how to exclude part of a text using XPath.

I have sample text in a specific structure:

<main>

<section tag="1">

<sub-section id="a">sample_content</sub-section>

<sub-section id="b">sample_content</sub-section>

</section>

<section tag="2">

<sub-section id="c">sample_content</sub-section>

<sub-section id="d">[value_text] sample content</sub-section>

</section>

</main>

I've tried to get the text using XPath:

  • //section[@tag='2']/sub-section[@id='d']

However, it is not enough to exclude "sample_content" from this line.

Result is:

[value_text] sample content.

My goal is:

value_text

I looked for a solution on the internet (including this website) but didn't find one.

I know that Trados Studio only uses XPath 1.0, which doesn't allow mixing XPath with regular expressions. I also couldn't find any XPath functions useful for my problem.

Do you have any ideas how to handle this problem?

I use Trados Studio 2019 SR2. I created an XML (embedded content) file type.

Kind Regards,

Adrian
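
To make the goal concrete outside Studio, here is a minimal Python sketch, not a Studio configuration. It assumes the sample above with its closing tags written as </sub-section>, selects the node with the same path, and then pulls the bracketed text out with a regular expression as a separate step. The names SAMPLE, node and bracketed are only illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the sample from the post, with closing tags written correctly.
SAMPLE = """<main>
  <section tag="1">
    <sub-section id="a">sample_content</sub-section>
    <sub-section id="b">sample_content</sub-section>
  </section>
  <section tag="2">
    <sub-section id="c">sample_content</sub-section>
    <sub-section id="d">[value_text] sample content</sub-section>
  </section>
</main>"""

root = ET.fromstring(SAMPLE)

# Step 1: the path selects the whole node; it cannot trim the node's text.
# ElementTree implements only a subset of XPath, which this path stays within.
node = root.find(".//section[@tag='2']/sub-section[@id='d']")

# Step 2: a regular expression extracts whatever sits between '[' and ']'.
bracketed = re.findall(r"\[([^\[\]]*)\]", node.text)

print(node.text)
print(bracketed)
```

The two steps mirror the split described in the replies: the path picks the element, and a regex-based rule decides which part of its text counts.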

  • It is possible, but not with xpath alone, at least not xpath 1.0.  First of all you create your parser rule, exactly as you have done.  Then you add some structure context like this for example:

    Then activate the embedded content processor and create a rule using regex with one of the ways available.  I used the "Defined by document structure information" as I added the "Paragraph" context above:

    I based this on your specific example, but it might give you an idea if your actual files are a little different.  This then gets me the following:

    Which seems to be what you're after.

  • Hello Paul,

    Thank you for your answer. It is very helpful. You are a very experienced user.

    I've started to test it; however, it seems to work only when "[...]" appears at the beginning of the content.

    What about a more complicated sentence that contains more bracketed text? Like in this example:

    <sub-section id="d">[value_text] sample content [value_text] sample content [value_text]</sub-section>

    Should I add a RegEx so that the text between ']' and '[' is not translated? Is it the proper way to add every possible case separately? Do you know a better way to handle this problem?

     

    Kind Regards,
    Adrian

  • however, it seems to work only when "[...]" appears at the beginning of the content.

    Correct... that's because I based the expression on your simple example only.

    What about a more complicated sentence that contains more bracketed text? Like in this example:

    <sub-section id="d">[value_text] sample content [value_text] sample content [value_text]</sub-section>

    Should I add a RegEx so that the text between ']' and '[' is not translated? Is it the proper way to add every possible case separately?

    This starts to get tricky for several reasons:

    1. you would really need multiple rules for each case
    2. even with multiple rules it would still be hard... is this possible for example?
      <sub-section id="d">[value_text] sample content [value_text] sample content [value_text] sample content [value_text] sample content [value_text]</sub-section>
    3. then you also have to deal with segmentation, because how would this be translated without proper segmentation?
      [value_text][value_text][value_text][value_text][value_text]
    Do you know a better way to handle this problem?

    Tell us more about the file as a whole.

    • does the rest of the file need to be translated?
    • are the texts in the square brackets consistent and repeat themselves?

    Without the whole story it's very difficult to do what you are suggesting or to try and offer a sensible solution.

  • Paul,

     

    I have a very big .xml file that contains a lot of cases like the one in the example.
    I figured out how to find only the required content using XPath, as in my first post.

    The rest of the file shouldn't be translated, only specific cases. For example: "Please translate ONLY the text in brackets (the rest of the text should be ignored by Studio) located in section tag="5" and sub-section id="c"." The texts are random and inconsistent. They don't repeat themselves.

    This example should make it clearer:

    <section tag="2">
    <sub-section id="e">Lorem ipsum [text 1] dolor sit amet, consectetur adipiscing elit. [text 2] Integer id ullamcorper magna,...</sub-section>
    </section>

    My goal is to get (in this case) 2 segments in Studio:

    text 1

    text 2

    I am trying to automate this for every case in the file. There are only a few patterns that can appear in the content:

    [text] normal text
    normal text [text]
    normal text [text] normal text
    normal text
    [text] normal text [text]
    [text] normal text [text] normal text [text]

    or

    [text][text][text]

    Do I need to create a separate RegEx for each of them?

     

    This is not an actual translation. It is a kind of localization skills test that shows whether a specific problem can be resolved.
    I am trying to improve my Studio skills. XPath and regular expressions seem to be the core of good localization knowledge.

     

    Kind Regards,
    Adrian 
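
As a way to reason about the cases listed above, the following Python sketch checks whether one pattern, rather than one per case, can pick out the bracketed pieces. It is only an illustration in Python's `re` module, not a Studio rule; the helper name bracket_segments is hypothetical, and the pattern assumes that brackets are never nested.

```python
import re

# One pattern for every case: '[', then any run of non-bracket characters, then ']'.
# Assumption: brackets are never nested and always come in matched pairs.
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def bracket_segments(text: str) -> list[str]:
    """Return the text found inside each [...] pair, in order of appearance."""
    return BRACKETED.findall(text)


cases = [
    "[text] normal text",
    "normal text [text]",
    "normal text [text] normal text",
    "normal text",
    "[text] normal text [text]",
    "[text] normal text [text] normal text [text]",
    "[text][text][text]",
    "Lorem ipsum [text 1] dolor sit amet, consectetur adipiscing elit. [text 2] Integer id ullamcorper magna,...",
]

for case in cases:
    print(repr(case), "->", bracket_segments(case))
```

Because the character class is negated, it does not matter whether the bracketed text contains letters, digits, spaces or punctuation.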

  • In this case you'd be better off doing something like this:

    1. create your filetype with the xpath expression previously agreed
    2. Use this expression to create a placeholder instead of the tag pair:
      (?<!\[)\b[\w\s]+\b(?![\)])

    This will select everything apart from the text in the brackets... like this for example where I even used a really extreme example:

    And if you then set the embedded content rule to "exclude" you can even get the segmentation:

    Looks like it's what you needed?

  • Paul,

    That's almost it. To make this easier I am going to paste some real examples here. The text is totally random and doesn't make much sense. Getting the text from the brackets is the goal:

    I have created an XML file type (legacy embedded content).

    Here are the parser rules:

    And the embedded content for Paragraph:

    Everything seems to be correct; however, this formula doesn't recognise digits and non-word characters.

    Result:

    Adding \d to this RegEx should resolve part of the missing digits.
    Dots and commas are a bigger problem for me. Is \W enough to handle them?

    Kind regards,

    Adrian
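
One possible direction for the digits and punctuation question, sketched in Python rather than as a tested Studio rule: instead of listing what the text outside the brackets may contain (\w, \d, \W and so on), match any run of characters that are not brackets. The pattern and the helper name outside_brackets are assumptions for illustration, and the regex engine Studio uses may differ from Python's `re` in details.

```python
import re

# Text outside [...]: it begins at the start of the string or right after ']',
# consists of anything except brackets, and runs up to the next '[' or the end.
# Assumption: brackets are never nested and always come in matched pairs.
OUTSIDE = re.compile(r"(?:^|(?<=\]))[^\[\]]+(?=\[|$)")


def outside_brackets(text):
    """Return every run of text that is not enclosed in square brackets."""
    return OUTSIDE.findall(text)


sample = "Lorem 12 ipsum, [text 1] dolor sit amet. [text 2] Integer id, 3.5..."

for piece in outside_brackets(sample):
    print(repr(piece))
```

Digits, dots and commas are covered without being named, because the class only says what to stop at. Whether such a pattern can be dropped into the embedded content rule unchanged would need to be checked in Studio itself.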