Candy Bar

Interactive visual that filters data based on duplicate type


The Candy Bar is an interactive data visualization within the Dashboard that provides users with the ability to filter documents based on duplicate type.

130 - 01 - Candy Bar-1

The Candy Bar provides a document total that updates dynamically after each applied filter or search. The pill-shaped visual breaks down the document population into four duplicate types: Originals (Non-Duplicates), Near Duplicates, Exact Duplicates, and Not Analyzed. The size of each section conveys the number of documents identified for that duplicate type: a longer section indicates more documents of that type. In the screenshot above, the Originals section of the Candy Bar is longer than the other sections, indicating that a large number of documents were identified as Originals.

  • Originals are unique documents: the set of documents whose content differs from that of all other documents in the data set. The Originals section of the Candy Bar does not contain any duplicates. Documents in this category are classified under the BD DupType categories (see below):
    • Nearorig 
    • Unique 
    • Exactorig 
    • Exactorignearorig
      130 - 03 - CandyBar Originals BD DupType-1
  • Near Duplicates are documents analyzed as highly similar (a rating of 80 or higher) to an original “pivot” document. Documents in this category are classified under the BD DupType categories:
    • Neardup 
    • Exactorigneardup 
      130 - 04 - CandyBar NearDup BD DupType-1
  • Exact Duplicates are documents that are exact duplicates of other documents within the data set and have identical checksum hash values. Documents in this category are classified under the BD DupType category: 
    • Exactdup 
      130 - 05 - CandyBar Duplicates BD DupType-1
  • Not Analyzed documents may have had no text to analyze, may have been encrypted, or may otherwise have lacked content that would allow them to be categorized as one of the other duplicate types. Documents in this category are classified under the BD DupType category:
    • Excluded 
      130 - 06 - CandyBar NotAnalyzed BD DupType-1
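
For illustration only, the grouping described above can be written as a lookup table from BD DupType label to Candy Bar section. The Python sketch below is not part of the product: the names SECTION_BY_DUPTYPE, candy_bar_counts, and section_shares are hypothetical, and matching labels case-insensitively is an assumption. It shows how per-section totals, and the relative section lengths they imply, could be derived from a list of BD DupType values.

```python
# Hypothetical sketch: BD DupType label -> Candy Bar section,
# following the four groupings listed above.
SECTION_BY_DUPTYPE = {
    "nearorig": "Originals",
    "unique": "Originals",
    "exactorig": "Originals",
    "exactorignearorig": "Originals",
    "neardup": "Near Duplicates",
    "exactorigneardup": "Near Duplicates",
    "exactdup": "Exact Duplicates",
    "excluded": "Not Analyzed",
}

SECTION_ORDER = ["Originals", "Near Duplicates", "Exact Duplicates", "Not Analyzed"]


def candy_bar_counts(dup_types):
    """Count documents per Candy Bar section from BD DupType labels."""
    counts = {section: 0 for section in SECTION_ORDER}
    for label in dup_types:
        # Assumption: labels are compared case-insensitively.
        section = SECTION_BY_DUPTYPE[label.strip().lower()]
        counts[section] += 1
    return counts


def section_shares(counts):
    """Return each section's fraction of the total (its relative length)."""
    total = sum(counts.values())
    if total == 0:
        return {section: 0.0 for section in counts}
    return {section: n / total for section, n in counts.items()}


# Made-up sample labels, not real data.
sample = ["Unique", "Unique", "Nearorig", "Neardup",
          "Exactdup", "Excluded", "Exactorigneardup", "Exactorig"]
counts = candy_bar_counts(sample)
shares = section_shares(counts)
```

With this made-up sample, half of the documents fall under Originals, so that section would be drawn as the longest part of the bar.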

Hovering the pointer over any section of the Candy Bar opens a bubble showing the section's label and its number of documents in the currently selected set. Clicking a section selects that duplicate status as a filter.

Duplicate Status Metadata

The metadata graphic display in the Dashboard may also be used to break out and select more detailed duplicate type information. Click the drop-down list (which defaults to Custodian) and select BD DupType.

130 - 02 - DB DupType metadata widget-1

Here again, selecting any of the listed bar graph entries will select that duplicate type as a filter. The labels assigned through analysis are:

  • Unique – A document in a Normal cluster (i.e., no duplicates or near duplicates).
  • Exactdup – A document in an ExactDup cluster, where the ‘pivot’ document is an ExactOrig.
  • Neardup – A document in a NearDup cluster, where the ‘pivot’ document is a NearOrig.
  • Nearorig – The pivot document in a NearDup cluster, to which all the NearDup documents in the cluster are compared.
  • Exactorig – The pivot document in an ExactDup cluster, to which all the ExactDup documents in the cluster are compared.
  • Exactorigneardup – The pivot document where the cluster is ExactDup AND (isPivotDocument = false) for that cluster.
  • Exactorignearorig – The pivot document where the cluster is ExactDup AND (isPivotDocument = true) for that cluster.
  • Excluded – In an Excluded cluster, all documents are classified as Unique.

 

Last Updated 9/19/2023