How does SDL Edge language detection work?

The goal is to understand how SDL language detection works and what corpus it is based on. For example, is it based on the subset of languages installed, or on the full set of languages supported by SDL? Is it N-gram based? Does the number of installed language pairs affect the detection result?
This is necessary in order to decide whether our workflow can switch to the SDL detection API, or whether this step should keep being performed by another module that is already in place.

Support informed me that you are working on SRQ-14045, which is related:

  • Hi Dimitrios,

    I'm sorry for the delay.

    We are using a combination of language identification models to offer optimal language detection.

    The LPs installed don't affect the language detection. By default, language identification can detect any language, even one not supported by SDL (the full list is available in the API documentation at /docs/api/rest/index.html#language-and-script-codes).

    However, with the latest 8.5 release, it is possible to reduce the scope of language identification to prevent LangID from returning a language that is either not supported or not installed (see the language detection settings in the API documentation at /docs/api/rest/index.html#update-language-detection-settings).

    Note that when performing language identification through the API, we return up to three identified languages with their probability/score (/docs/api/rest/index.html#language-detections).

    We are aware of the challenges related to language identification, especially for short segments, but we are constantly improving our detection methods and models.

    Did you perform a comparison between our language identification and the one you are currently using? I'd be interested in the results, and any feedback to help us improve our language identification would be very welcome.

    Thanks
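Based on the reply above, a workflow consuming the detection API would receive up to three candidate languages, each with a probability/score, and would need to decide which (if any) to trust. The sketch below shows one way to handle such a response. Note that the field names (`language`, `score`) and the sample values are assumptions for illustration; the exact response shape is documented at /docs/api/rest/index.html#language-detections on your installation.

```python
from typing import Optional

def pick_language(detections: list[dict], min_score: float = 0.5) -> Optional[str]:
    """Return the highest-scoring detected language, or None if no
    candidate clears the confidence threshold (e.g. short segments
    where detection is known to be less reliable)."""
    if not detections:
        return None
    best = max(detections, key=lambda d: d["score"])
    return best["language"] if best["score"] >= min_score else None

# Hypothetical response body: up to three candidates with scores.
sample = [
    {"language": "fra", "score": 0.91},
    {"language": "ita", "score": 0.06},
    {"language": "spa", "score": 0.03},
]
print(pick_language(sample))  # fra
```

A threshold like this lets the workflow fall back to the existing detection module when SDL's top candidate is low-confidence, rather than switching over unconditionally.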