HTML Codes within Excel File

Hello All

I want Trados Studio to split Excel cell contents into segments based on the embedded HTML codes.

e.g:

<B>Product Features</B><BR>Throws are acrylic knitted<BR>Product size : 130x170 cm <BR>Product colour is beige. <BR><BR><B>Washing Recommendations</B><BR>Washable at 30 degrees.<BR>Do not bleach.<BR>Do not iron.

----------------------

Product Features
Throws are acrylic knitted
Product size : 130x170 cm
Product colour is beige. 
Washing Recommendations
Washable at 30 degrees.
Do not bleach.
Do not iron.

--------

this is a sample

sample-br.xlsx

thanks 
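For readers who want to check the expected segmentation outside Studio, here is a minimal Python sketch of the splitting described above. It is only an illustration, not how Studio or any file type filter works internally. The function name `split_cell` and both regex patterns are assumptions made for this example: break at any `<br>` variant and drop every other tag.

```python
import re

# Assumed pattern: any <br> variant (<BR>, <br/>, <br />), case-insensitive.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
# Assumed pattern: any remaining tag; removed here purely for illustration.
OTHER_TAG = re.compile(r"<[^>]+>")


def split_cell(cell_text):
    """Split one cell's text at <br> tags and drop the remaining tags."""
    segments = []
    for part in BR_TAG.split(cell_text):
        text = OTHER_TAG.sub("", part).strip()
        if text:  # consecutive <BR><BR> yield empty parts; skip them
            segments.append(text)
    return segments


cell = (
    "<B>Product Features</B><BR>Throws are acrylic knitted<BR>"
    "Product size : 130x170 cm <BR>Product colour is beige. <BR><BR>"
    "<B>Washing Recommendations</B><BR>Washable at 30 degrees.<BR>"
    "Do not bleach.<BR>Do not iron."
)

for segment in split_cell(cell):
    print(segment)
```

With the sample cell from the question this prints the eight lines listed above. A regex like this is only suitable for simple markup such as BR and B tags; it does not decode HTML character codes.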

  • Maybe this is an 'overkill' solution, but I can see that your sample also seems to contain non-HTML columns. And while the Embedded Content solution provided by Paul does the trick to identify HTML tags, it does not recognise HTML character codes (if you have any in your files).

    So, just my two cents:

    For such files, we use the XML options in Excel's Developer tab:

    1. Use Notepad++ to create a simple file with the names of the Excel columns, which looks like this:

              <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
              <File xmlns:xsi="www.w3.org/.../XMLSchema-instance">
              <Element>
              <Column1>a</Column1>
              <Column2>a</Column2>
              </Element>
              <Element>
              <Column1>a</Column1>
              <Column2>a</Column2>
              </Element>
              </File>

    2. Use the Source button in Developer > XML in Excel to add this XML map to the Excel file for translation. Then drag and drop each of the XML elements from the XML map onto the respective Excel column heading (which will create a table).

    3. Click Export in Developer > XML to export your table to an XML file.

    4. In Studio, create specific XML file type settings. That way you can configure which elements / columns need to be translated (maybe not all Excel columns need to be translated), and you can add document structure information to elements. You can then use this document structure information to have HTML content processed by Studio's embedded "Html Embedded Content 5 2.0.0.0" processor, which recognises both HTML tags and HTML character codes. Non-HTML columns will not be processed as containing HTML.

    5. After translation of the XML file, just open the Excel file and click Developer > XML > Import to import the translated XML file.

    Bit of a long process, I know, but once you know how it works, we have found the results make it worth the effort.

    Won't work either for Excel files with multiple languages, of course...

    Best,
    Lieven
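If typing the skeleton file from step 1 by hand in Notepad++ gets tedious, a small script can generate it. The following is a rough Python sketch under a few assumptions: the function name `build_map_skeleton` is made up for this example, the column names must be valid XML element names (no spaces), and the namespace attribute from the sample above is left out because its URL is truncated in the post. The sample repeats `Element` twice, so the sketch does the same.

```python
import xml.etree.ElementTree as ET


def build_map_skeleton(column_names, rows=2, placeholder="a"):
    """Return a skeleton XML document with one child element per column."""
    root = ET.Element("File")
    for _ in range(rows):
        element = ET.SubElement(root, "Element")
        for name in column_names:
            # Each Excel column becomes one child element with dummy text.
            ET.SubElement(element, name).text = placeholder
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9 or later
        ET.indent(root)
    declaration = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    return declaration + ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_map_skeleton(["Column1", "Column2"]))
```

The printed text can be saved as an .xml file and then added through the Source button as described in step 2.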

  • Wow... I did say the default approach would help give the user the idea they needed to improve it for the content they have. So adding a few rules here and there for this specific content is trivial.

    I find this regex-based approach rather amateurish, cumbersome and, most importantly, failing big time with just slightly more complex HTML code, not to mention anything more complicated (containing entities, comments, inline scripts, etc.).

    Well... in this case the file is not more complex.  I'm a firm believer in economy of accuracy and in this case your more professional approach is not needed.  Interesting discussion though and for other files it is a sensible way to go given the lack of a better embedded content handler for Excel in Studio.

  • in this case your more professional approach is not needed

    This is questionable.
    I've seen such decisions based on a short sample (or on seeing just a few lines of one file, without bothering to look thoroughly through the WHOLE file) make people's lives hell, because just a few pages down (or in another file) there was messy, complex HTML code.

    So I prefer robust solutions that work reliably in all cases, rather than endlessly solving issues caused by simple solutions.

  • So, as we all seem to agree, it all depends on the contents of the entire file.

    And, in reply: if your file only contains BR and P tags, Paul's solution is definitely the best way to go as it will do the trick perfectly. If, however, the rest of the file also contains many other tags and tag pairs and HTML codes instead of characters, an Excel-to-XML solution may be the better option as you can then profit from Studio's integrated HTML processor.

    In any event, no disrespect intended to anyone here from my side...
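To make the difference between the two approaches concrete, here is a small Python sketch of parser-based segmentation. It uses the standard library's `html.parser` only as a stand-in; it is not Studio's HTML processor, and the class name `SegmentExtractor` and helper `extract_segments` are invented for this example. Unlike a plain regex, it decodes HTML character codes and skips comments. Inline tags such as B are simply dropped here, whereas a CAT tool would keep them as tags.

```python
from html.parser import HTMLParser


class SegmentExtractor(HTMLParser):
    """Collect text segments, breaking at <br> and <p> tags."""

    def __init__(self):
        # convert_charrefs=True decodes codes such as &eacute; or &#233;.
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._buffer = []

    def _flush(self):
        text = "".join(self._buffer).strip()
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <BR>, <br/> and <br /> all match.
        if tag in ("br", "p"):
            self._flush()

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit whatever follows the last break


def extract_segments(html_text):
    parser = SegmentExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.segments


sample = "<B>Caf&eacute; &amp; bar</B><BR/><!-- note -->Size: 130&times;170 cm<br>Do not iron."
print(extract_segments(sample))
```

For the sample string this yields three segments with the character codes already decoded, which is the kind of result a BR-only regex rule cannot give.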

  • Thanks.
    I will try your solution.
    Paul's method resolved the tags completely.
    Regards.

  • Hi 

    What about this:

    Would that work for you?

    Daniel

    PS: Here is your sample file:

    sample-br.xlsx.zip

  • Dear @Daniel Hug
    I really appreciate your solution.
    How did you do the segmentation?

    But @Paul's solution resolves the contents without any tags.


    Thanks to all members who shared their experiences 
    Best Regards 

  • Hi

    I just used Okapi (https://okapiframework.org/wiki/index.php/Main_Page) out of the box, so to say.

    (no custom filter configurations)

    This is the only custom setting I used (an additional segmentation rule to segment at all the variations of <br/> tags that exist in your document):

    That's all. So you end up with an XLF file, but Okapi will convert it back into the source format once you're done. So it's XLSX -> XLF (Okapi) -> SDLXLIFF (Studio) -> XLF (Studio) -> XLSX (Okapi). I've never had that fail so far.

    Daniel

  • Dear @Daniel Hug

    thanks a million,
    it is a new tool (Okapi)

    I will try it.
    Regards
