Html Codes within Excel File

Hello All

I want Trados Studio to split Excel cell contents into segments based on the embedded HTML codes.

e.g.:

<B>Product Features</B><BR>Throws are acrylic knitted<BR>Product size : 130x170 cm <BR>Product colour is beige. <BR><BR><B>Washing Recommendations</B><BR>Washable at 30 degrees.<BR>Do not bleach.<BR>Do not iron.

----------------------

Product Features
Throws are acrylic knitted
Product size : 130x170 cm
Product colour is beige. 
Washing Recommendations
Washable at 30 degrees.
Do not bleach.
Do not iron.

--------

this is a sample

sample-br.xlsx

thanks 

  • The default rules will handle this and should give you an idea of how to improve it if you wish:

  • Or you can use a more distinctive approach by adding individual rules for the HTML elements.

    This would be for example

    <b> </b> as a tag pair with no extra segmentation hint, but with the feature "Tag acts as word end" selected (this is really VERY important when translating and should be selected by default, dear SDL)

    <br> as placeable with the segmentation hint "exclude"

    and so on...
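As a rough illustration of what such rules amount to, here is a minimal Python sketch (not Studio's actual parser). It treats `<br>` as a standalone tag that ends a segment and is excluded from it, and `<b>...</b>` as an inline pair that stays with the text. The pattern names and the `split_cell` helper are hypothetical.

```python
import re

# Hypothetical patterns mirroring the rules described above:
# <br> ends a segment and is excluded; <b>/</b> is an inline tag pair.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
BOLD_TAG = re.compile(r"</?b>", re.IGNORECASE)

def split_cell(cell_text, keep_inline_tags=False):
    """Split one cell on <br> tags, dropping empty pieces."""
    segments = []
    for piece in BR_TAG.split(cell_text):
        if not keep_inline_tags:
            piece = BOLD_TAG.sub("", piece)
        piece = piece.strip()
        if piece:
            segments.append(piece)
    return segments

sample = ("<B>Product Features</B><BR>Throws are acrylic knitted<BR>"
          "Product size : 130x170 cm <BR>Product colour is beige. <BR><BR>"
          "<B>Washing Recommendations</B><BR>Washable at 30 degrees.<BR>"
          "Do not bleach.<BR>Do not iron.")

for segment in split_cell(sample):
    print(segment)
```

On the sample from the question this prints the eight lines the original poster asked for. It only knows about `<b>` and `<br>`, so any other markup would pass through untouched.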

  • I find this regex-based approach rather amateurish, cumbersome and, most importantly, failing big time with HTML code that is just a little more complex, not to mention anything more complicated (containing entities, comments, inline scripts, etc.).

    Therefore I use a simple script which exports the HTML content to a very simple XML structure like this:

    <string cell="A1">blablabla, some complicated HTML code</string>

    This is then easily processed as XML with embedded HTML content, with all the comfort of the embedded content parser.

    And then it's just as easily injected back into the appropriate places in the original Excel sheet using another, rather dumb script, which only reads the cell location from the XML element's attribute and puts the string in that cell.

    Not a method for average Joe Translator, I know...
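The poster's scripts are not shown, so the following is only a sketch of the round trip under stated assumptions: the cell contents are already available as a dict keyed by cell reference (reading and writing the real workbook would need a spreadsheet library such as openpyxl), and the wrapper element name `strings` is invented. Only the `<string cell="...">` shape comes from the post.

```python
import xml.etree.ElementTree as ET

def export_cells(cells):
    """Serialise {cell_ref: html_text} as <string cell="..."> elements.

    The root element name "strings" is an assumption; the post only
    shows the <string cell="A1"> element itself.
    """
    root = ET.Element("strings")
    for ref, text in cells.items():
        ET.SubElement(root, "string", {"cell": ref}).text = text
    # ElementTree escapes <, > and & in the text, so the HTML ends up
    # as escaped embedded content inside each element.
    return ET.tostring(root, encoding="unicode")

def import_cells(xml_text):
    """Read the (translated) XML back into {cell_ref: html_text}."""
    root = ET.fromstring(xml_text)
    return {el.get("cell"): el.text or "" for el in root.iter("string")}

cells = {"A1": "<B>Product Features</B><BR>Do not iron.", "A2": "Tom &amp; Co"}
xml_text = export_cells(cells)
restored = import_cells(xml_text)
```

The import side only needs the `cell` attribute to know where each string goes, which is why that script can stay "rather dumb".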

  • Why not simply save the Excel file as XML and process that in Studio? It's not as complicated as it seems...

  • Not usable with multilingual Excel files, where the target (or rather multiple targets... like 15 languages or so) has to be placed in cells other than the source.

  • Indeed. I rarely have to deal with such files, sorry - this is why I usually go the easiest way for a freelancer.

  • I will try this as well.
    Thank you for sharing your ideas.

  • Maybe this is an 'overkill' solution, but I can see that your sample also seems to contain non-HTML columns. And while the Embedded Content solution provided by Paul does the trick to identify HTML tags, it does not recognise HTML character codes (if you should have any in your files).

    So, just my two cents:

    For such files, we use the XML options in Excel's Developer tab:

    1. Use Notepad++ to create a simple file with the names of the Excel columns, which looks like this:

              <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
              <File xmlns:xsi="www.w3.org/.../XMLSchema-instance">
              <Element>
              <Column1>a</Column1>
              <Column2>a</Column2>
              </Element>
              <Element>
              <Column1>a</Column1>
              <Column2>a</Column2>
              </Element>
              </File>

    2. Use the Source button in Developer > XML in Excel to add this XML map to the Excel file for translation. Then drag and drop each of the XML elements from the XML map onto the respective Excel column heading (which will create a table).

    3. Click Export in Developer > XML to export your table to an XML file.

    4. In Studio, create specific XML file type settings. That way you can configure which elements / columns need to be translated (maybe not all Excel columns need to be translated) and you can add document structure information to elements. You can then use this document structure information to have HTML content processed by Studio's embedded "Html Embedded Content 5 2.0.0.0" processor, which will recognise both HTML tags and HTML character codes. Non-HTML columns will not be processed as containing HTML.

    5. After translation of the XML file, just open the Excel file and click Developer > XML > Import to import the translated XML file.

    Bit of a long process, I know, but once you know how it works, we have found the results to be worth the effort.

    Won't work either for Excel files with multiple languages, of course...

    Best,
    Lieven
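Step 1 above can also be generated rather than typed. This is a minimal sketch, assuming the column names are valid XML element names. `build_map_template` is a hypothetical helper, and the `xmlns:xsi` attribute is left out because its URL is truncated in the post. The row element is written twice, as in the hand-made file, presumably so that Excel treats it as a repeating row.

```python
import xml.etree.ElementTree as ET

XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'

def build_map_template(column_names, rows=2):
    """Return a sample XML file with one child element per Excel column.

    Column names must be valid XML element names (no spaces, no
    leading digits); this sketch does not check that.
    """
    root = ET.Element("File")
    for _ in range(rows):
        element = ET.SubElement(root, "Element")
        for name in column_names:
            # "a" is just placeholder content, as in the post.
            ET.SubElement(element, name).text = "a"
    return XML_DECLARATION + ET.tostring(root, encoding="unicode")

template = build_map_template(["Column1", "Column2"])
print(template)
```

The output is written on one line without indentation, which should not matter for use as a sample file, although that has not been verified against Excel here.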

  • Yes, that is basically where my method with the export/import script originated from.
    It then just evolved to a smarter script with more functionality... it can e.g. skip cells already containing a translated text... or, as mentioned, handle multiple languages (export/import separate XMLs for each target language).
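The script itself is not shown. As a sketch of the two extras mentioned (skipping already translated cells and producing one export per target language), assume a hypothetical in-memory layout where each row number maps to a dict of column letter to cell text. The sample rows and the language-to-column mapping are invented for illustration.

```python
def pending_cells(rows, source_col, target_col):
    """Return {target_cell_ref: source_text} for rows still to translate.

    A row is skipped when its source cell is empty or when its target
    cell already contains text.
    """
    pending = {}
    for row_number, row in rows.items():
        source = row.get(source_col) or ""
        target = row.get(target_col) or ""
        if source.strip() and not target.strip():
            pending[f"{target_col}{row_number}"] = source
    return pending

# Invented sample: column A is the source, B and C are target languages.
rows = {
    2: {"A": "Do not bleach.", "B": "[already translated]", "C": ""},
    3: {"A": "Do not iron.", "B": "", "C": None},
}
language_columns = {"de": "B", "fr": "C"}  # hypothetical mapping

# One export per target language, keyed by the cell to write back into.
exports = {lang: pending_cells(rows, "A", col)
           for lang, col in language_columns.items()}
```

Keying each export by the target cell reference means the import side can stay unchanged: it still just writes each string into the cell named in the attribute.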

  • Wow... I did say the default approach would help give the user the idea they needed to improve it for the content they have. So adding a few rules here and there for this specific content is trivial.

    I find this regex-based approach rather amateurish, cumbersome and most importantly failing big time with just a little bit more complex HTML code, not mentioning anything more complicated (containing entities, comments, inline scripts, etc.)

    Well... in this case the file is not more complex.  I'm a firm believer in economy of accuracy and in this case your more professional approach is not needed.  Interesting discussion though and for other files it is a sensible way to go given the lack of a better embedded content handler for Excel in Studio.

  • in this case your more professional approach is not needed

    This is questionable.
    I've seen such decisions made on the basis of a short sample (or of just a few lines of one file, without bothering to look thoroughly through the WHOLE file), and they made people's lives hell because just a few pages down (or in another file) there was messy, complex HTML code.

    So I prefer using robust solutions that work reliably in all cases, rather than endlessly solving issues caused by simple ones.
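One way to avoid deciding from a short sample is to inventory the markup in every cell before choosing an approach. The sketch below is a rough triage scan, not an HTML parser: it counts tag names (opening and closing tags alike), character entities and comment openers across all cell strings. The `inventory` helper and its patterns are hypothetical.

```python
import re
from collections import Counter

# Rough triage patterns; they only need to spot markup, not parse it.
TAG = re.compile(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)")
ENTITY = re.compile(r"&(#\d+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")
COMMENT = re.compile(r"<!--")

def inventory(cells):
    """Count tag names, entities and comments across all cell strings."""
    tags, entities, comments = Counter(), Counter(), 0
    for text in cells:
        tags.update(name.lower() for name in TAG.findall(text))
        entities.update(ENTITY.findall(text))
        comments += len(COMMENT.findall(text))
    return tags, entities, comments

cells = [
    "<B>Product Features</B><BR>Do not bleach.<br/>Do not iron.",
    "Caf&eacute; &amp; bar<!-- note --><script>x</script>",
]
tags, entities, comments = inventory(cells)
print(sorted(tags), sorted(entities), comments)
```

If the whole file turns up nothing but `b`, `br` and `p`, a few tag rules are probably enough. Entities, comments or `script` anywhere in the file point towards the XML route with a proper embedded content parser.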

  • So, as we all seem to agree, it all depends on the contents of the entire file.

    And, in reply: if your file only contains BR and P tags, Paul's solution is definitely the best way to go, as it will do the trick perfectly. If, however, the rest of the file also contains many other tags, tag pairs and HTML character codes instead of characters, an Excel-to-XML solution may be the better option, as you can then benefit from Studio's integrated HTML processor.

    In any event, no disrespect intended to anyone here from my side...