Language Identification

This article describes how languages are automatically detected for each document.

Level of Language Detection

During Reveal processing, language detection is performed separately on each segment of the document. If one segment contains multiple languages, our language detector detects each of them separately, so a single segment may have multiple languages assigned to it.

Language detection runs on the segment body content only. Languages are not detected from the following sections:

Greeting
Signature
Disclaimer
Computer-generated content

Special Cases

In some scenarios the language detector cannot produce a confident language assignment. The special labels defined below are used when language detection is not possible. In these cases, the entire segment is assigned the special language label.

Unknown_EmptySegment

Segment content is empty.

Unknown_TooShort

The language detector performs an internal cleanup to remove digits, special characters, URLs, email addresses, etc.
After the cleanup, the remaining text is too short for accurate detection.
- For CJK text, the too-short threshold is 20 characters.
- For all other languages, the threshold is 50 characters.
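
The cleanup and thresholds described above can be sketched as follows. This is a minimal illustration, not the detector's actual implementation. The function names, the regular expressions, the rule that text counts as CJK when more than half of its remaining characters are CJK, and the choice to treat "too short" as strictly fewer characters than the threshold are all assumptions made for the example.

```python
import re

# Thresholds taken from the article; everything else here is an assumption.
CJK_THRESHOLD = 20
DEFAULT_THRESHOLD = 50

# Hypothetical patterns for the cleanup step.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
# CJK Unified Ideographs, Hiragana/Katakana, and Hangul syllables.
CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def clean_text(text: str) -> str:
    """Drop URLs and email addresses, then keep only letter characters."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return "".join(ch for ch in text if ch.isalpha())


def is_too_short(text: str) -> bool:
    """Return True if the cleaned text falls below the applicable threshold."""
    cleaned = clean_text(text)
    cjk_count = len(CJK_RE.findall(cleaned))
    # Assumption: the text is treated as CJK when most of its letters are CJK.
    is_cjk = cjk_count * 2 > len(cleaned)
    threshold = CJK_THRESHOLD if is_cjk else DEFAULT_THRESHOLD
    return len(cleaned) < threshold
```

Under these assumptions, a segment such as "Call 555-1234 or email bob@example.com" keeps only 11 letters after cleanup and would be labeled Unknown_TooShort.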

Unknown_FailedLetterModel

The language detector performs an internal cleanup to remove digits, special characters, URLs, email addresses, etc.
The language detector then detects a language, but the secondary test fails.
- The secondary test checks whether the character distribution in the segment is comparable to the character distribution of the suggested language.
- For example, a segment can contain only the letter ‘a’. The language detector may suggest that this is English, but the secondary test will fail because English text made up of a single letter is very unlikely.
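
One way to picture such a secondary test is a similarity check between the segment's letter counts and a reference profile for the suggested language. The article does not say how the comparison is made, so the sketch below is only an illustration: the cosine-similarity measure, the 0.6 cutoff, the function names, and the rounded English letter frequencies are all assumptions.

```python
import math
from collections import Counter

# Approximate English letter frequencies in percent (illustrative values only).
ENGLISH_LETTER_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.1, "z": 0.07,
}

# Hypothetical cutoff; the real detector's criterion is not documented here.
MIN_SIMILARITY = 0.6


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_letter_model(text, profile=ENGLISH_LETTER_FREQ,
                        min_similarity=MIN_SIMILARITY) -> bool:
    """Return True if the text's letter distribution resembles the profile."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    return cosine_similarity(counts, profile) >= min_similarity
```

With this sketch, a segment made up only of the letter ‘a’ scores well below the assumed cutoff, while an ordinary English sentence scores well above it.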

Unknown_AssumedSpreadsheet

We do not attempt language detection on segments with a processing status of ‘CharacterBasedFilterException’. Instead, we assign the language ‘Unknown_AssumedSpreadsheet’.

Unknown

In all other cases where language detection does not produce a high-confidence assignment, we set the language to ‘Unknown’.
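
The special labels can be summarized as a simple decision sequence. The sketch below is an assumption-laden illustration: the article does not state the order in which the checks run, the function name and the three callables passed in are hypothetical, and for brevity it returns a single label even though a real segment may be assigned several languages.

```python
from typing import Callable, Optional


def assign_language_label(
    body: str,
    processing_status: str,
    is_too_short: Callable[[str], bool],
    detect: Callable[[str], Optional[str]],
    passes_letter_model: Callable[[str, str], bool],
) -> str:
    """Map a segment to a language or to one of the special labels.

    The callables are hypothetical stand-ins for the detector's internals:
    detect() returns a language name, or None when it is not confident.
    """
    # Segments with this processing status are never sent to the detector.
    if processing_status == "CharacterBasedFilterException":
        return "Unknown_AssumedSpreadsheet"
    if not body.strip():
        return "Unknown_EmptySegment"
    if is_too_short(body):
        return "Unknown_TooShort"
    language = detect(body)
    if language is None:
        return "Unknown"
    # Secondary test on the character distribution of the suggested language.
    if not passes_letter_model(body, language):
        return "Unknown_FailedLetterModel"
    return language
```

Passing the checks in as callables keeps the sketch independent of any particular detector, so each step can be swapped or tested in isolation.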

Last Updated 8/18/2022