Reveal Analytics Requirements

This article describes the data ingestion and field mapping required to use Reveal's analytics tools.


Analytics Requirements

To use Reveal's data analytics features, two minimum requirements must be met. First, you must ingest at least 300 documents with usable text. Second, you must map and ingest the metadata fields listed in the table below:

| Reveal Field Name | Reveal Display Name | Requirement | Type / Purpose |
|---|---|---|---|
| Body Text (the body text used is based upon order <OCR / Loaded / Extracted>) | | Required | At least 300 documents with text are required. The text is populated from the Extracted, OCR, or Loaded text set (in that order, based on availability). |
| BEGDOC | Begin Number | Required | Control Number: the text identifier of the document. |
| BEGATTACH | Begin Number Attach | Required | Used to group email and attachments together; defaults to Control Number if empty. Note: For a Parent, the Group Identifier must equal the Control Number. |
| CUSTODIAN_NAME | Custodian | Required | Name of the person or entity producing the document; default value is “empty_custodian”. |
| MD5_HASH | Duplicate ID | Required | MD5Hash value; defaults to Control Number if empty. |
| ITEMID | Item ID | Required | ID: the numeric identifier of the document. |
| SUBJECT_OTHER* | Email Subject | Highly Recommended | Required for faster processing time. Email only; the message's "Subject" field. |

* It is not required to populate the subject field with data, but doing so will speed up the data intake and processing.
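
The minimum requirements above can be checked before ingestion with a short script. The sketch below is illustrative only and is not part of Reveal. It assumes each document has already been read into a Python dict keyed by the Reveal field names; the keys `EXTRACTED_TEXT`, `OCR_TEXT`, and `LOADED_TEXT` are hypothetical names for the three text sets. It picks the first available text set, fills in the documented defaults, and reports whether the 300-document threshold is met.

```python
MIN_DOCS_WITH_TEXT = 300
# Hypothetical keys for the three text sets, in the documented priority order.
TEXT_SETS = ("EXTRACTED_TEXT", "OCR_TEXT", "LOADED_TEXT")


def body_text(doc):
    """Return the first non-empty text set, or an empty string."""
    for key in TEXT_SETS:
        text = (doc.get(key) or "").strip()
        if text:
            return text
    return ""


def apply_defaults(doc):
    """Return a copy of doc with the documented default values filled in."""
    out = dict(doc)
    control_number = out.get("BEGDOC", "")
    if not out.get("BEGATTACH"):
        out["BEGATTACH"] = control_number  # group ID defaults to Control Number
    if not out.get("MD5_HASH"):
        out["MD5_HASH"] = control_number  # Duplicate ID defaults to Control Number
    if not out.get("CUSTODIAN_NAME"):
        out["CUSTODIAN_NAME"] = "empty_custodian"
    return out


def check_minimums(docs):
    """Summarize whether a document set meets the minimum requirements."""
    with_text = sum(1 for d in docs if body_text(d))
    missing_ids = [d for d in docs if not d.get("BEGDOC") or not d.get("ITEMID")]
    return {
        "docs_with_text": with_text,
        "enough_text": with_text >= MIN_DOCS_WITH_TEXT,
        "docs_missing_ids": len(missing_ids),
    }
```

Running `check_minimums` on the parsed load file before mapping gives an early warning if too few documents carry usable text.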

Communications

The Communications data visualization in Reveal allows users to view and analyze thousands of messages quickly. Users can arrange clusters of messages, examine and update alias email addresses for communicators, and readily view the frequency and other attributes of messaging between them. These are the metadata fields required to implement communications analytics.

| Reveal Field Name | Reveal Display Name |
|---|---|
| SENDER | From |
| RECIPIENT | To |
| CC_ADDRESSES | Cc |
| BCC | Bcc |
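
To show why these four fields matter, here is a minimal sketch of the kind of sender-to-recipient frequency count that a communications view could be built on. It is not Reveal's implementation. It assumes documents are Python dicts keyed by Reveal field name and that multiple addresses in a field are separated by semicolons; both are assumptions made for illustration.

```python
from collections import Counter

RECIPIENT_FIELDS = ("RECIPIENT", "CC_ADDRESSES", "BCC")


def split_addresses(value, sep=";"):
    """Split a delimited address string into lowercase, trimmed addresses."""
    return [a.strip().lower() for a in (value or "").split(sep) if a.strip()]


def message_counts(docs):
    """Count messages for each (sender, recipient) pair."""
    counts = Counter()
    for doc in docs:
        senders = split_addresses(doc.get("SENDER"))
        if not senders:
            continue  # no From value, so the message cannot be attributed
        recipients = set()
        for field in RECIPIENT_FIELDS:
            recipients.update(split_addresses(doc.get(field)))
        for recipient in recipients:
            counts[(senders[0], recipient)] += 1
    return counts
```

A message with an empty SENDER contributes nothing to the counts, which is one reason to confirm these fields are mapped before processing.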

Email Threading

Email Threading identifies the unique messages that belong to the same email thread, marks the unique content of each message, and determines the sort order and hierarchy of the messages in each thread. This reduces the time and complexity of review.

Overview

The purpose of email threading is to:

  1. Identify email messages and attachments in the dataset.
  2. Identify duplicate email messages.
  3. Find messages that belong to the same email thread.
  4. Mark which messages contain unique content not present in any other message.
  5. Determine the hierarchy and the sort order of messages within each thread.

Email threading relies on specific document metadata fields during processing, such as the From, To, and CC email headers. Before initial processing, it is essential to examine the data to be processed and properly configure the field map settings to make these metadata fields available to the processing engine.

The schema assigns each field a field category, which controls how the contents of that field are treated.

Email Threading Field Mapping

| Reveal Field Name | Reveal Display Name | Type / Purpose |
|---|---|---|
| Body Text (the body text used is based upon order <OCR / Loaded / Extracted>) | | At least 300 documents with text are required. The text is populated from the Extracted, OCR, or Loaded text set (in that order, based on availability). |
| SUBJECT_OTHER | Email Subject | Required for faster processing time. Email only, Email's "Subject" field. |
| ATTACHMENT_LIST | Attachment List | “Attachment” |
| BCC | Bcc | Email only, Email's "BCC" field. |
| CC_ADDRESSES | Cc | Email only, Email's "CC" field. |
| SENT_DATE | Date Sent | All known (non-custom) DATE, TIME pairs are combined into a single field value when ingested for the histogram. |
| SENDER | From | Email only, Email's "From" field. |
| PARENT_ITEMID | Parent ID | BEGATTACH should be populated, as Parent ID is typically autopopulated from the contents of BEGATTACH. |
| RECIPIENT | To | Email only, Email's "To" field. |
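
The relationship between BEGATTACH, BEGDOC, and Parent ID can be illustrated with a short sketch. This is not how Reveal populates Parent ID internally; it only demonstrates the documented rules that BEGATTACH groups an email with its attachments and that the parent is the document whose Control Number (BEGDOC) equals the group identifier. Documents are assumed to be Python dicts keyed by Reveal field name.

```python
from collections import defaultdict


def group_families(docs):
    """Group documents by BEGATTACH, falling back to BEGDOC when it is empty."""
    families = defaultdict(list)
    for doc in docs:
        group_id = doc.get("BEGATTACH") or doc["BEGDOC"]
        families[group_id].append(doc)
    return dict(families)


def assign_parent_ids(docs):
    """Return {ITEMID: parent ITEMID or None} derived from BEGATTACH."""
    parents = {}
    for group_id, members in group_families(docs).items():
        # The parent is the member whose Control Number equals the group ID.
        parent = next((d for d in members if d["BEGDOC"] == group_id), None)
        for doc in members:
            is_child = parent is not None and doc is not parent
            parents[doc["ITEMID"]] = parent["ITEMID"] if is_child else None
    return parents
```

If BEGATTACH is left empty, every document falls into its own single-member family and no parent can be derived, which is why the field should be populated.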

Duplicate Detection Methods

Duplicative Type System fields are created and updated with the assigned duplicative category.

Analytics uses two levels of duplicate detection. In decreasing order of strictness, they are:

  • Exact Duplicate Detection (EDD): Exact duplicates are based on body text and duplicate ID (if applicable). Two documents are considered exact duplicates of each other if they are identical on all of those fields. An exact duplicate group (EDG) consists of all documents in a data set that are duplicates of a designated pivot document.

  • Near Duplicate Detection (NDD): Near duplicate detection compares body text to determine whether a document is highly similar (80%) to the pivot document. A near duplicate group (NDG) is a group of documents in which each document has high similarity to the pivot document of the NDG.
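
The two levels can be sketched as a comparison against a pivot document. The code below is a simplified illustration, not Reveal's algorithm: it uses a SHA-256 digest of the body text plus the Duplicate ID for the exact test, and Python's `difflib.SequenceMatcher` as a stand-in similarity measure for the near test. The `text` key and the choice to treat 80% as an inclusive lower bound are assumptions.

```python
import hashlib
from difflib import SequenceMatcher

NEAR_DUP_THRESHOLD = 0.80  # the 80% figure quoted above, assumed inclusive


def exact_key(doc):
    """Key on which two documents must match to count as exact duplicates."""
    text_hash = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
    return (text_hash, doc.get("MD5_HASH", ""))


def classify(pivot, candidate):
    """Return 'exact', 'near', or 'distinct' for candidate relative to pivot."""
    if exact_key(pivot) == exact_key(candidate):
        return "exact"
    # Stand-in similarity measure; the product's actual measure is not documented here.
    similarity = SequenceMatcher(None, pivot["text"], candidate["text"]).ratio()
    return "near" if similarity >= NEAR_DUP_THRESHOLD else "distinct"
```

In this sketch, two documents with identical text but different Duplicate IDs fail the exact test and fall through to the near test, which reflects the stricter-first ordering described above.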

Duplicate Detection Field Mapping

| Reveal Field Name | Reveal Display Name | Type / Purpose | Dup Group |
|---|---|---|---|
| Body Text (the body text used is based upon order <OCR / Loaded / Extracted>) | | At least 300 documents with text are required. The text is populated from the Extracted, OCR, or Loaded text set (in that order, based on availability). | EDD, NDD |
| MD5_HASH | Duplicate ID | MD5Hash value; defaults to Control Number if empty. | EDD |

Candy Bar

The candy bar is a graphic display of Originals, Near Duplicates, Exact Duplicates, and documents Not Analyzed (because they are encrypted, lack text, or contain excessive text) in the current view. A user may select Originals to examine only that subset of documents. The table below maps each candy bar category to the values of the system-created duplicate type field (BD_DupType).

| Candy Bar Category | Analytic System Dup Type field (BD_DupType) |
|---|---|
| Originals | unique, exactorig, exactorignearorig, nearorig |
| Near Duplicates | neardup, exactorigneardup |
| Exact Duplicates | exactdup |
| Not Analyzed | excluded |
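
The category mapping above translates directly into a lookup table. The sketch below is an illustration for anyone post-processing exported BD_DupType values; the function name is hypothetical, and the values are assumed to appear in lowercase exactly as listed.

```python
from collections import Counter

# BD_DupType value -> candy bar category, as listed in the table above.
CANDY_BAR_CATEGORY = {
    "unique": "Originals",
    "exactorig": "Originals",
    "exactorignearorig": "Originals",
    "nearorig": "Originals",
    "neardup": "Near Duplicates",
    "exactorigneardup": "Near Duplicates",
    "exactdup": "Exact Duplicates",
    "excluded": "Not Analyzed",
}


def candy_bar_counts(dup_types):
    """Count documents per candy bar category; unknown values raise KeyError."""
    return Counter(CANDY_BAR_CATEGORY[value] for value in dup_types)
```

Raising on an unknown value, rather than silently skipping it, makes it obvious when an export contains a duplicate type that the table does not cover.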

Additional Recommended Fields

There are several fields used and mapped for the analytics process. The basic requirements to use analytics can vary, with each function requiring text or specific metadata fields. The following table lists other recommended fields.

| Reveal Field Name | Reveal Display Name | Requirement | Type / Purpose |
|---|---|---|---|
| ATTACHMENT_COUNT | Attachment Count | Recommended | Email only; number of attachments in the email. |
| AUTHOR | Author | Recommended | Last author of the efile or attachments, if one exists. |
| CONVERSATION_ID | Conversation ID | Recommended | “Conversation Index” |
| DOCUMENT_NAME | Document Name | Recommended | Filename of the efile or attachments. |
| FILE_EXTENSION | Extension | Recommended | |
| MASTERDATETIME | Master Date Time | Recommended | Date and time the email was sent, or last modified date and time for attachments / efiles. It is important to include the “Time” part for this field. |
| SENT_TIME | Time Sent | Recommended | All known (non-custom) DATE, TIME pairs are combined into a single field value when ingested for the histogram. |
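
Combining a DATE, TIME pair into a single value can be sketched as follows. This is only an illustration of the idea; the function name and the default date and time formats are assumptions, and load files may use other formats.

```python
from datetime import datetime


def combine_sent(date_value, time_value, date_fmt="%m/%d/%Y", time_fmt="%H:%M:%S"):
    """Combine separate date and time strings into one datetime value."""
    sent_date = datetime.strptime(date_value, date_fmt).date()
    sent_time = datetime.strptime(time_value, time_fmt).time()
    return datetime.combine(sent_date, sent_time)
```

For example, `combine_sent("03/22/2024", "14:05:09")` yields a single datetime for 22 March 2024 at 14:05:09, which keeps the time component that a date-only field would lose.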

The Complete Analytics Data Mapping

This table provides all fields required or recommended for use in analytics. These fields are mapped and used when data is included in the field.

| Reveal Field Name | Reveal Display Name | Requirement | Type / Purpose | Email Thread | Dup Group |
|---|---|---|---|---|---|
| Body Text (the body text used is based upon order <OCR / Loaded / Extracted>) | | Required | At least 300 documents with text are required. The text is populated from the Extracted, OCR, or Loaded text set (in that order, based on availability). | x | EDD, NDD |
| BEGATTACH | Begin Number Attach | Required | Used to group email and attachments together; defaults to Control Number if empty. Note: For a Parent, the Group Identifier must equal the Control Number. | | |
| BEGDOC | Begin Number | Required | Control Number: the text identifier of the document. | | |
| CUSTODIAN_NAME | Custodian | Required | Name of the person or entity producing the document; default value is “empty_custodian”. | | |
| ITEMID | Item ID | Required | ID: the numeric identifier of the document. | x | |
| MD5_HASH | Duplicate ID | Required | MD5Hash value; defaults to Control Number if empty. | | EDD |
| SUBJECT_OTHER | Email Subject | Highly Recommended | Required for faster processing time. Email only, Email's "Subject" field. | x | |
| ATTACHMENT_LIST | Attachment List | Recommended | “Attachment” | x | |
| BCC | Bcc | Recommended | Email only, Email's "BCC" field. | x | |
| CC_ADDRESSES | Cc | Recommended | Email only, Email's "CC" field. | x | |
| PARENT_ITEMID | Parent ID | Recommended | | x | |
| RECIPIENT | To | Recommended | Email only, Email's "To" field. | x | |
| SENDER | From | Recommended | Email only, Email's "From" field. | x | |
| SENT_DATE | Date Sent | Recommended | All known (non-custom) DATE, TIME pairs are combined into a single field value when ingested for the histogram. | x | |
| ATTACHMENT_COUNT | Attachment Count | Recommended | Email only; number of attachments in the email. | | |
| AUTHOR | Author | Recommended | Last author of the efile or attachments, if one exists. | | |
| CONVERSATION_ID | Conversation ID | Recommended | “Conversation Index” | | |
| DOCUMENT_NAME | Document Name | Recommended | Filename of the efile or attachments. | | |
| FILE_EXTENSION | Extension | Recommended | | | |
| MASTERDATETIME | Master Date Time | Recommended | Date and time the email was sent, or last modified date and time for attachments / efiles. It is important to include the “Time” part for this field. | | |
| SENT_TIME | Time Sent | Recommended | All known (non-custom) DATE, TIME pairs are combined into a single field value when ingested for the histogram. | | |

 

Last Updated 3/22/2024