I am trying to import a TMX with HTML formatting. The HTML is escaped in the TUs ("&lt;" instead of "<", etc.).
After importing, the HTML tags are in the TUs, like this:
Hallo, <em>dies</em> ist ein sch&ouml;ner Tag.
These tags persist even if I select "import without formatting".
I would like Studio to recognize the styles and treat them as tags, as it does with the HTML5 file type or the embedded content processor. Should that not be possible, I would like to eliminate all inline tags and decode all HTML-encoded characters. Is that possible within Studio?
Please post a sample... since it seems that you might be misunderstanding some of the applicable principles of TMX, escaping XML-reserved characters, HTML character entities, etc.
Embedding HTML inside XML requires that the XML-reserved characters are escaped (unless the entire HTML is enclosed in CDATA, which is not the case in TMX). Therefore the < character must be encoded as &lt;... and, although it's not strictly required, the > character is usually encoded too, as &gt;
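As a rough illustration of that escaping, here is a minimal Python sketch using the standard library's xml.sax.saxutils. The sample sentence is made up for the example and is not taken from any real TMX.

```python
from xml.sax.saxutils import escape, unescape

# Made-up HTML fragment that should end up as plain text inside a <seg>.
html_fragment = "Hallo, <em>dies</em> ist ein Tag & mehr."

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
stored = escape(html_fragment)
print(stored)

# An XML parser reverses this when it reads the segment text back.
restored = unescape(stored)
print(restored == html_fragment)
```

Note that this only makes the HTML safe to store as text; it does not turn the tags into TMX inline elements.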
The HTML character entities (like the one for "ö" in your example) are a separate matter and have nothing to do with TMX... or at least usually, since TMX files are nowadays UTF-8 encoded, i.e. they do not need to store extended characters as HTML entities. According to the TMX specification, if a TMX is 7-bit encoded, the extended characters should be encoded using numeric entities, not named entities. If your file contains HTML character entities, you would have to convert it manually before importing.
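One possible way to do that conversion, sketched in Python with only the standard library. It assumes the named entities sit inside the segment text with their ampersand escaped (e.g. &amp;uuml;), so that the file still parses as XML; the function name and the tiny sample document are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

def decode_entities_in_tmx(tmx_text: str) -> str:
    """Replace HTML character entities in all <seg> text with real characters."""
    root = ET.fromstring(tmx_text)
    for seg in root.iter("seg"):
        for node in seg.iter():
            if node.text:
                node.text = html.unescape(node.text)
            # Text following an inline element lives in that element's tail.
            if node is not seg and node.tail:
                node.tail = html.unescape(node.tail)
    return ET.tostring(root, encoding="unicode")

# Hypothetical, heavily shortened TMX used only to exercise the function.
sample = (
    '<tmx version="1.4"><body><tu>'
    '<tuv xml:lang="de-DE"><seg>D&amp;uuml;rreperioden und Regenf&amp;auml;lle</seg></tuv>'
    '</tu></body></tmx>'
)
result = decode_entities_in_tmx(sample)
print(result)
```

Working on the parsed tree rather than on the raw file keeps &lt; and &amp; properly escaped when the document is written out again.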
Anyway... In order for the tags to be imported as tags into the TM, they must already be stored as tags in the TMX (i.e. marked up with the appropriate TMX elements, like "bpt" and "ept", etc.). If they are not - i.e. the HTML is in fact stored inside the segments as plain text - the import feature cannot help you: it simply expects proper TMX content and does not do any magical transformation.
Hi Evzen Polenka
That makes perfect sense. I am sure I am doing something wrong, since this is the first time I have really had anything to do with the TMX format. Since you asked, here's a snippet of the TMX I produced:
<tuv xml:lang="en-GB"><seg>&lt;p&gt;Since the year 2000, historic droughts and flooding rains have hit Australia&amp;rsquo;s farmers hard.</seg></tuv>
<tuv xml:lang="de-DE"><seg>&lt;p&gt;Seit der Jahrtausendwende haben historische D&amp;uuml;rreperioden und sintflutartige Regenf&amp;auml;lle den Landwirten in Australien hart zugesetzt.</seg></tuv>
To convert the HTML character entities manually is no problem, but the tags look like a bit of work...
Yeah, well, it's basically completely wrong...
First, <p> is a paired tag, i.e. even if it were to be included in the segment, the closing tag is missing...
If this <p> tag at the very beginning of the segments is your entire problem, I would simply delete it completely... since it does not add any value to the segment; rather the opposite, since it's unclosed.
And there is also the second problem, the double-escaped HTML character entities: for example, &amp;rsquo; is wrong, it should be ’... but the application which created the TMX double-encoded the entity and converted the & at the beginning to &amp;... It's just completely screwed up... :( :(
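To make the double-encoding visible, here is a small Python sketch. It assumes the exporting process simply XML-escaped raw HTML text; that is a guess at the cause, not something confirmed about the tool used here.

```python
import html
from xml.sax.saxutils import escape

# Text as it appears in the HTML source, with a named entity.
html_text = "Australia&rsquo;s farmers"

# XML-escaping the raw HTML text also escapes the entity's ampersand,
# which gives the double-encoded form.
double_encoded = escape(html_text)
print(double_encoded)

# Decoding the HTML entity first yields plain text with the real character,
# which then needs no entity at all in a UTF-8 file.
plain = html.unescape(html_text)
clean = escape(plain)
print(clean)
```

The order matters: decode the HTML layer first, then apply XML escaping once when writing the file.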
The point is that the cause - i.e. the stupid application/process creating such invalid crap - must be fixed in the first place. Not that you should keep doing god-knows-what harakiri every time you receive such crap, over and over again...
I thought writing something as TMX would include converting the format, but I was wrong. It seems to work fine now:
<tuv xml:lang="en-GB"><seg>Since the year 2000, historic droughts and flooding rains have hit Australia’s farmers hard.</seg></tuv>
<tuv xml:lang="de-DE"><seg>Seit der Jahrtausendwende haben historische Dürreperioden und sintflutartige Regenfälle den Landwirten in Australien hart zugesetzt.</seg></tuv>
<tuv xml:lang="en-GB"><seg><bpt i="1" type="em">&lt;em&gt;</bpt>Where does this leave farmers who rely on crops for their livelihood?<ept i="1">&lt;/em&gt;</ept></seg></tuv>
<tuv xml:lang="de-DE"><seg><bpt i="1" type="em">&lt;em&gt;</bpt>Wie sieht das für Landwirte aus, die ihren Lebensunterhalt mit dem Anbau von Feldfrüchten bestreiten?<ept i="1">&lt;/em&gt;</ept></seg></tuv>
Looks a bit better and imports nicely into Studio's TMs.
Thank you very much for your help!
Daniel Hug said: I thought writing something as TMX would include converting the format
Do you mean on the side of the application/process which creates the TMX? Yes, that definitely DOES require doing it the right way... the problem is nowadays' young lame developers who don't understand what they are doing :(
Well, I don't think I qualify as young anymore, but since I am not claiming to be a developer I can still escape the "lame"?
However that is... ;) I am working on a workflow to handle bilingual XML files, as we handle them every day here. Extracting the bilingual text e.g. with the Bilingual Excel file type is a bit cumbersome and won't perform well if a node contains lots of text, because that text won't be segmented. This bilingual Excel file type is really versatile, but it has its limits and I can customize the workflow because it is so repetitive.
I don't know whether I can automate Studio's Aligner, but with the Okapi toolkit I can extract both languages separately, *convert them to TMX* (ja, ja), then segment and align. (So I was using the Okapi toolkit incorrectly. My fault.)
Using the Okapi toolkit gives me a lot of options, either to convert the bilingual XMLs to segmented XLIFFs (and Okapi's segmentation and alignment work really well, at least in these first trials) or to convert the XMLs to TMXs.
I want the translator to see the former translation, but not to be stuck with it.
That's the whole background... and thank you again for helping.
BTW, this construct is actually incorrect:
<seg><bpt i="1" type="em">&lt;em&gt;</bpt>Where does this leave farmers who rely on crops for their livelihood?<ept i="1">&lt;/em&gt;</ept></seg>
The tags themselves should NOT be part of the TM. The TM holds only the "here goes a tag" information, but NOT the information about which tag it is, i.e. the construct should be like this:
<seg><bpt i="1" type="em"/>Where does this leave farmers who rely on crops for their livelihood?<ept i="1"/></seg>
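If you want to produce segments of that shape from text with inline HTML, something along these lines might serve as a starting point. It is only a sketch: the function name is made up, it handles simple paired tags only (a <br> or other unpaired tag raises an error), and it derives the "type" value from the tag name, which is an assumption rather than a requirement.

```python
import re
from xml.sax.saxutils import escape

# Matches simple opening and closing HTML tags, with or without attributes.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")

def html_to_tmx_seg(text: str) -> str:
    """Replace paired inline HTML tags with empty <bpt>/<ept> placeholders."""
    out = []
    open_tags = []  # stack of (tag name, i value)
    counter = 0
    pos = 0
    for m in TAG.finditer(text):
        out.append(escape(text[pos:m.start()]))
        pos = m.end()
        is_closing = m.group(1) == "/"
        name = m.group(2).lower()
        if not is_closing:
            counter += 1
            open_tags.append((name, counter))
            out.append(f'<bpt i="{counter}" type="{name}"/>')
        elif open_tags and open_tags[-1][0] == name:
            _, i = open_tags.pop()
            out.append(f'<ept i="{i}"/>')
        # A closing tag without a matching opener is dropped in this sketch.
    out.append(escape(text[pos:]))
    if open_tags:
        raise ValueError(f"unclosed tag(s): {[n for n, _ in open_tags]}")
    return "<seg>" + "".join(out) + "</seg>"

seg = html_to_tmx_seg("<em>Where does this leave farmers?</em>")
print(seg)
```

Raising an error on an unclosed tag is deliberate: a stray opening <p> would otherwise produce a <bpt> without its <ept>.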
Moreover, it's strongly suggested to also include the "x" attribute on the <bpt> tags. The attribute serves as a cross-reference between the tags in the source and target segments... because in certain languages the word order of the sentence may be different, and therefore the order in which the tags appear in the sentence may be different too. The "x" attribute then serves as a "link" between a tag in the source segment and the corresponding tag in the target segment.
Like this (in this example the tags are in the same order, but you get the point...):
<tuv xml:lang="en-US"><seg>In the Actions dialog box, click <bpt i="1" type="100" x="1" />Add Action<ept i="1" /> and then select <bpt i="2" type="101" x="2" />URL<ept i="2" />.</seg></tuv>
<tuv xml:lang="de-DE"><seg>Klicken Sie im Dialogfeld "Aktionen" auf <bpt i="1" type="100" x="1" />Aktion hinzufügen<ept i="1" />, und wählen Sie <bpt i="2" type="101" x="2" />URL<ept i="2" /> aus.</seg></tuv>
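To show what the "x" link buys you when the order does differ, here is a small Python check. The reordered German segment is invented for this illustration (it is not from a real TM), and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

def bpt_links(seg_xml: str) -> dict:
    """Map each <bpt>'s x value to its type attribute."""
    seg = ET.fromstring(seg_xml)
    return {bpt.get("x"): bpt.get("type") for bpt in seg.iter("bpt")}

source = (
    '<seg>Click <bpt i="1" type="100" x="1" />Add Action<ept i="1" /> '
    'and then select <bpt i="2" type="101" x="2" />URL<ept i="2" />.</seg>'
)
# Invented target with the two tag pairs in the opposite order:
# "i" is renumbered within the segment, "x" still points at the source tag.
target = (
    '<seg>Wählen Sie <bpt i="1" type="101" x="2" />URL<ept i="1" /> aus, '
    'nachdem Sie auf <bpt i="2" type="100" x="1" />Aktion hinzufügen<ept i="2" /> '
    'geklickt haben.</seg>'
)

links_match = bpt_links(source) == bpt_links(target)
print(links_match)
```

Here "i" only pairs a <bpt> with its <ept> inside one segment, while "x" is what ties the target tag back to the source tag.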