This has been driving me nuts for a while now. I'm using embedded content (legacy) in an XML file type to catch some common html tags, among other things. All in all this is pretty straightforward, but html lists are giving me quite a headache.
Right now I have this in there:
Start tag: <[o|u]l>
End tag: </[o|u]l>
Segmentation hint: Exclude
What happens is that everything after the first list in a file--either ordered or unordered--is not extracted. I tried all kinds of variations of the above expressions, and also used separate tag pairs for ordered and unordered lists, but the result is always the same.
I should maybe also mention that, unfortunately, the embedded content processors that were introduced recently are not an option, because they are not available in WorldServer.
I'd be grateful for any pointers to fix this.
What happens if you use this <(ol|ul)> and </(ol|ul)>? These should be more precise...
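For anyone wondering why the second form is more precise, here is a minimal sketch of the difference. It uses Python's `re` module purely for illustration; I'm assuming Studio's regex engine treats character classes and groups the way most regex flavours do. Inside square brackets the `|` is a literal character, not an "or", so `<[o|u]l>` also accepts `<|l>`. The sample strings are made up.

```python
import re

# Original pattern: a character class containing 'o', '|' and 'u'.
char_class = re.compile(r"<[o|u]l>")
# Suggested pattern: a group that matches either "ol" or "ul".
alternation = re.compile(r"<(ol|ul)>")

# Made-up sample tags for illustration only.
samples = ["<ol>", "<ul>", "<|l>", "<dl>"]

class_hits = [s for s in samples if char_class.fullmatch(s)]
alt_hits = [s for s in samples if alternation.fullmatch(s)]

print(class_hits)  # ['<ol>', '<ul>', '<|l>']
print(alt_hits)    # ['<ol>', '<ul>']
```

Both patterns match real `<ol>` and `<ul>` tags identically, so switching between them would not be expected to change how the lists themselves are handled.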
Thanks, but unfortunately this one also produces the same result. I also tested this in other applications and it works just fine. It's just Studio that won't cooperate, and I have no clue where I'm going wrong.
I also tried escaping pretty much every single character just in case it does something unexpected, but to no avail.
This is indeed very strange. Maybe the element which includes the embedded content (such as CDATA or so) is not defined properly? From my experience that way of using embedded content should work. What you can also try is to change the "exclude" to "may exclude"...
It might help to see a sample of the xml file with the elements containing the html code you wish to handle? Perhaps also mention the other rules you have created? One of the biggest problems with using the legacy embedded content processor is when you start to add many rules as you can easily get some overlap which can cause unexpected behaviour when you parse the file.
Thank you both for your help. The overlap was a good hint. So I figured I'd add the rules one by one to see where they go awry, and sure enough it wasn't until the last one.
All embedded content is in a CDATA section. Basically, this section can hold any html formatting. For example:
<![CDATA[<ol><li>Punkt 1</li><li>Punkt 2</li><li>Punkt 3</li></ol><ul><li>Punkt 4</li><li>Punkt 5</li><li>Punkt 6</li></ul>]]>
These are the rules. Everything works fine until I add \n, at which point any content that comes after the first list disappears.
The trouble is that for the output it doesn't seem to make a difference if a new line is triggered by a manual line break or a break tag, so there's no consistency in the source files and I have to segment at both <br> and manual breaks.
I'm a bit lost now, because now that I was able to isolate \n as the cause of the problem, I have no idea how to prevent it.
\n is wrong, the "\" must be escaped, so it should read \\n
Now you have me confused. When I escape the backslash, content after the list is parsed correctly, but this isn't:
Line 1 w/ manual break
Line 2 w/ manual break
Line 3 w/ manual break
With \\n all three lines end up in the same segment. Note there's a break tag after line 3 in the example below. With just \n, lines one to three are parsed correctly with each line in a new segment, as is the list (Punkt 1 to Punkt 3), but everything after the list is gone.
Set \\n to external (structure).
Why not take a different approach? You were correct the first time I think because here the \n did not need to be escaped unless you were trying to find the \ in \n specifically as opposed to a line feed. So perhaps if you remove the \n rule altogether and then create a segmentation rule in the TM instead you will have more success.
I have not tried to recreate your situation but \n is quite a catch all and this may be causing a problem elsewhere in the file tagging. If you add it as a segmentation rule, which is what you are trying to achieve then you may find you have the desired result.
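To make the \n versus \\n point concrete, here is a small sketch in Python (used only for illustration; I'm assuming Studio's engine reads these two escapes the same way, which matches the behaviour reported above). The sample text is modelled on the three lines with manual breaks followed by a break tag.

```python
import re

# Three lines separated by real line feeds, with a break tag after line 3.
text = "Line 1\nLine 2\nLine 3<br>"

# \n in a regex matches an actual line feed character.
line_feed = re.compile(r"\n")
# \\n in a regex matches a literal backslash followed by the letter n.
literal_backslash_n = re.compile(r"\\n")

split_on_line_feed = line_feed.split(text)
split_on_literal = literal_backslash_n.split(text)

print(split_on_line_feed)  # ['Line 1', 'Line 2', 'Line 3<br>']
print(split_on_literal)    # ['Line 1\nLine 2\nLine 3<br>']
```

The escaped form never matches because the text contains line feeds, not the two characters backslash and n, which would explain why all three lines ended up in one segment with \\n.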
Technically, this sounds like a good solution, but alas, again WorldServer doesn't work that way. There simply are no TM-level segmentation rules. It's one of those cases where I wished the two were integrated better.
I might just have to take this up with the authors and see if they can achieve some consistency here and always use break tags, or indeed maybe even with the developers to see if they can automatically convert the line feeds to break tags when creating the xmls so I can do away with the \n rule.
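If the developers go for the conversion, it could be as simple as the sketch below. This is only an illustration in Python: the helper name `line_feeds_to_breaks` is hypothetical, and I'm assuming that every line feed in the embedded content (with or without a preceding carriage return) should become a break tag.

```python
import re

def line_feeds_to_breaks(html: str) -> str:
    """Replace each line feed, with an optional carriage return, by a <br> tag."""
    return re.sub(r"\r?\n", "<br>", html)

# Made-up sample mixing manual breaks, a break tag and a list.
sample = "Line 1\nLine 2\r\nLine 3<br><ol><li>Punkt 1</li></ol>"
converted = line_feeds_to_breaks(sample)

print(converted)  # Line 1<br>Line 2<br>Line 3<br><ol><li>Punkt 1</li></ol>
```

With the source normalised like this, only the break tag rule would be needed for segmentation.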
Thanks again, I appreciate all your help.
Do you see any possibility to contact me off forum (jerzy at czopik dot com) and let me try to work with you on your settings?
Absolutely. I'm just having a crazy day and I didn't get round to compiling some sample files and the FTD yet. Also, it's not that big an issue anymore. For now I'll just drop the list tags. All that will do is give us translatable segments with a single plain text html tag, so no biggie, especially since lists don't come up all that often.
It's a lot more important that all the other segmentation rules are adhered to.
I'm inclined to select Paul's suggestion with the TM-level segmentation rule as the answer, because in any other scenario this is probably what I would do.
Either way I'm still curious about what's causing the conflict and how to avoid it. I seem to be unable to figure this out myself. So if you still want to help me out here even though it's not really necessary I'll be happy to send you some files.
Thanks. I won't be around tomorrow, but you can send me some samples and I'll look at them over the weekend.