Reveal Query Language

Details on the syntax and use of RQL in Reveal


Required Query Features

In addressing all the features needed in the Reveal Query Language (RQL), an attempt is being made to remain backwards compatible with Review 10’s query language. Here are the known places where RQL is not backwards compatible with the Review 10 query language:

  1. In RQL you must use quotes for phrases. Review 10 assumed that a sequence of terms was a phrase, while the initial RQL implementation will call that an error.

  2. Review 10 supports both w/3 and /3/ as syntax for proximity queries. RQL only supports w/3.

  3. Review 10 supports ?, !, and * as wildcard characters. The initial RQL release will only support *, and will interpret * as zero or more characters.

  4.  Special characters can make the index interpret what looks like a single word, for example “AT&T”, as a phrase, so RQL users will need to put quotes around these terms to specify the correct phrase search behavior. Please note, Reveal ElasticSearch does not support searching of individual special characters, e.g., "%". If your project data has been processed via Reveal Processing Discovery Manager, there are capabilities utilizing its dtSearch engine for searching special characters. More information can be found here: https://processing-help.revealdata.com/en/APPENDIX-F---dtSearch-Syntax-Guide.html

  5. The Reveal 10 documentation talks about only supporting binary NOT, and specifically says it does not support AND NOT. To keep the binary NOT from being ambiguous, RQL supports A AND NOT B, and A OR NOT B. RQL does not support A NOT B, nor does it support A NOT W/N B.

  6. Common words will be treated differently in Reveal and RQL. Instead of using the StopFilter to keep common words from being indexed, the Elasticsearch CommonGramsFilter will be used instead to support efficient searching of common words. This eliminates most of the edge cases associated with common words appearing in queries.

The following tables describe features of the query language.

Operator Precedence - Highest to Lowest

()

HAS, IN, REGEX

W/N, PRE/N, NOT W/N, NOT PRE/N

<, >, <=, >=

EXACT

NOT, EXPAND

AND

OR

Operator precedence is supported to maintain backwards compatibility with Review’s prior query language.

RQL will validate keyword search syntax before submitting a search. It will parse the query as entered in the search bar to identify any errors and notify the user to examine and correct the syntax before submitting. 12 - 01 - RQL syntax validation error

Special RQL Field Names and Search Profiles

Reveal allows users to create search profiles, which are lists of fields to search. When a search profile is used with an RQL query, any query term that does not specify a field name will search the fields specified in the search profile. RQL also supports some special field names that can be used to conveniently search multiple fields and whose behavior won’t change if a search profile is used.

Field Name

Default Behavior

Changes if Search Profile Specified

[blank]

When no field name is specified, all body text fields are searched.

Only search fields specified in search profile. Body text and/or text metadata fields can be in a search profile.

BODY

Search all body text fields.

N/A

METADATA

Search all full text searchable metadata fields.

N/A

ALL

Search all body text fields and full text searchable metadata fields.

N/A

When no field name is specified, the default behavior is to search the fields that the special field name BODY searches. When no field name is specified and a search profile is used, the search profile fields are used.

Search profiles can only contain text and body text fields.

Where field names are used in an RQL search, they must be the TABLE names (no spaces or dashes, underscores allowed). These field names may be looked up and selected using the RQL Helper’s Insert Fields… link under Advanced Keyword Search; a tilde (~) character may also be used to open the field list.

12 - 02 - Insert RQL Field name

RQL Description

The intent is that these operators can be combined to create interesting query expressions. This includes arbitrarily nested parentheses. How deep “arbitrarily nested” goes is negotiable.

Example

Meaning

term

“term” in any of RQL specified fields.

BODY::term

“term" in the field Body, which must be expressed in ALL CAPS or all lowercase (CAPS is better for visibility).

BODY::"two words"

True if the phrase “two words” is found in body text.

BODY::"p&g"

Due to details of how the index works, to get the expected search results for searches for terms like “P&G”, “AT&T”, or "support@revealdata.com" the terms will have to appear in quotes. The index sees these terms as phrases, and the query parser needs the quotes to know to search for these terms as a phrase.

BODY::"dogs and cats"

If query syntax appears within a quoted string, it is treated as text, not query syntax. This query is looking for the phrase “dogs and cats”.

term^1.5

Add a boost to how search results are scored for a term.

BODY::term^1.5

Add a boost to a named field search.

SUBJECT::"cats and dogs"^1.5

Boost the score for searching a particular metadata field for a phrase.

EXACT FROM::support@revealdata.com

Return documents that have a “From” field whose contents are exactly “support@revealdata.com".

EXACT support@revealdata.com

ERROR - a field name must be specified for EXACT queries.

EXACT (name::Bob OR name::Joe)

ERROR - EXACT can only be used with a single term. Use “EXACT name::Bob OR EXACT name::Joe” instead.

A OR B

The term A is found in any RQL specified field AND/OR the term B is found in any RQL field.

A AND B

The term A is found in any RQL specified field AND the term B is found in any RQL field.

NOT A

Supporting unary NOT: The term A is not found in any RQL specified field.

NOT (query-expression)

Typical use of unary NOT: a user saves a query (saved as RQL string). Later the user asks to find all documents NOT matched by the saved query.

A NOT B

ILLEGAL - not allowed in the query language

A OR NOT B

The term A is found in any RQL specified field AND/OR the term B is NOT found in any RQL specified field.

A AND NOT B

The term A is found in any RQL specified field AND the term B is NOT found in any RQL specified field.

HAS Subject

Matches documents that have a non-null and non-empty field (in example the field is “Subject”).

NOT HAS Subject

Unary NOT can be used with HAS to find documents that do not have a non-null or non-empty field of the given name.

A W/0 B

Matches “A B” or “B A”

A W/1 B

Matches "A B”, “B A”, “A word B”, or “B word A” where “word” is any indexed term.

A W/3 B

A is within 3 of B. A and B can be in any order.

A PRE/0 B

Matches "A B"

A PRE/2 B

Matches "A B”, “A word B”, or “A word word B” where “word” is any indexed term.

(A OR B) AND C

True if any RQL specified field contains C and also A OR B is found in any RQL specified field.

A OR B AND C

True if A is found in any RQL specified field or if both B and C are found in any RQL specified fields.

word*

Wildcard query, true if a term that begins with “word” is found.

*word

Leading wildcard query. True if a term that ends with “word” is found.

w*rd

Embedded wildcard query. True if a term that begins with “w” and ends with “rd” is found. Embedded wildcards can be used in phrases, e.g., "som* phrase", "some phra*se", “s*me phrase”. The wildcard must be at the beginning, middle, or end of a word; "some * phrase" is not a valid embedded wildcard query because this wildcard is not at the beginning, middle, or end of a word.

CUSTODIAN IN (“some-id-1”, “some-id-2”, … , “some-id-100”)

Extract the value of CUSTODIAN from items found in the following records. Can be used to specify a list of search values as a single clause or pill.  

SUBJECT::REGEX("[0-9]{5}")

REGEX is used to find documents with any five digits between 0-9, in any order, in the Subject field, as when looking for invoices whose purchase orders are five-digit numbers.

SENT_DATE >= 2001-02-22T00:00:00 AND SENT_DATE < 2001-02-23T00:00:00

The >, <, >=, and <= can be used with dates and numeric fields. Combined with AND these operators can be used to define date ranges and numeric ranges.

SENT_DATE >= 2001

DateSent >= 2001-01-01T00:00:00

SENT_DATE > 2001

DateSent >= 2002-01-01T00:00:00

SENT_DATE >= 2001-02

DateSent >= 2002-02-01T00:00:00

SENT_DATE > 2001-02

DateSent >= 2002-03-01T00:00:00

SENT_DATE >= 2001-02-22

DateSent >= 2002-02-22T00:00:00

SENT_DATE > 2001-02-22

DateSent >= 2002-02-23T00:00:00

SENT_DATE >= 2001-02-22T00

DateSent >= 2002-02-22T00:00:00

SENT_DATE > 2001-02-22T00

DateSent >= 2002-02-22T01:00:00

SENT_DATE >= 2001-02-22T00:00

DateSent >= 2002-02-22T00:00:00

SENT_DATE > 2001-02-22T00:00

DateSent >= 2002-02-22T00:01:00

SENT_DATE >= 2001-02-22T00:00:00

DateSent >= 2002-02-22T00:00:00

SENT_DATE > 2001-02-22T00:00:00

DateSent >= 2002-02-22T00:00:01

SENT_DATE < 2001

DateSent < 2001-01-01T00:00:00

SENT_DATE <= 2001

DateSent < 2002-01-01T00:00:00

SENT_DATE < 2001-02

DateSent < 2001-02-01T00:00:00

SENT_DATE <= 2001-02

DateSent < 2001-03-01T00:00:00

SENT_DATE < 2001-02-22

DateSent < 2001-02-22T00:00:00

SENT_DATE <= 2001-02-22

DateSent < 2001-02-23T00:00:00

SENT_DATE < 2001-02-22T00

DateSent < 2001-02-22T00:00:00

SENT_DATE <= 2001-02-22T00

DateSent < 2001-02-22T01:00:00

SENT_DATE < 2001-02-22T00:00

DateSent < 2001-02-22T00:00:00

SENT_DATE <= 2001-02-22T00:00

DateSent < 2001-02-22T00:01:00

SENT_DATE < 2001-02-22T00:00:00

DateSent < 2001-02-22T00:00:00

SENT_DATE <= 2001-02-22T00:00:00

DateSent < 2001-02-22T00:00:01

SENT_DATE::2001

DateSent >= 2001-01-01T00:00:00 AND DateSent < 2002-01-01T00:00:00

SENT_DATE::2001-02

DateSent >= 2001-02-01T00:00:00 AND DateSent < 2001-03-01T00:00:00

SENT_DATE::2001-02-22

True if date found in interval (end is open) [2001-02-22T00:00:00 to 2001-02-23T00:00:00]

SENT_DATE::2001-02-22T00

True if date found in interval (end is open) [2001-02-22T00:00:00 to 2001-02-22T01:00:00]

SENT_DATE::2001-02-22T00:00

True if date found in interval (end is open) [2001-02-22T00:00:00 to 2001-02-22T00:01:00]

SENT_DATE::2001-02-22T00:00:00

True if date found in interval (end is open) [2001-02-22T00:00:00 to 2001-02-22T00:00:01]

SENT_TIME >= 00 

TimeSent >= 00:00:00

SENT_TIME > 00 

TimeSent >= 01:00:00

SENT_TIME >= 00:00 

TimeSent >= 00:00:00

SENT_TIME > 00:00 

TimeSent >= 00:01:00

SENT_TIME >= 00:00:00 

TimeSent >= 00:00:00

SENT_TIME > 00:00:00 

TimeSent >= 00:00:01

SENT_TIME >= 00:00:00.000 

TimeSent >= 00:00:00.000

SENT_TIME > 00:00:00.000 

TimeSent >= 00:00:00.001

SENT_TIME >= 00:00:00.000Z 

TimeSent >= 00:00:00.000Z

SENT_TIME > 00:00:00.000Z 

TimeSent >= 00:00:00.001Z

SENT_TIME < 14 

TimeSent < 14:00:00

SENT_TIME <= 14 

TimeSent < 15:00:00

SENT_TIME < 14:00 

TimeSent < 14:00:00

SENT_TIME <= 14:00 

TimeSent < 14:01:00

SENT_TIME < 14:00:00 

TimeSent < 14:00:00

SENT_TIME <= 14:00:00 

TimeSent < 14:00:01

SENT_TIME< 14:00:00.000 

TimeSent < 14:00:00.000

SENT_TIME <= 14:00:00.000 

TimeSent < 14:00:00.001

SENT_TIME < 14:00:00.000Z 

TimeSent < 14:00:00.000Z

SENT_TIME <= 14:00:00.000Z 

TimeSent < 14:00:00.001Z

SENT_TIME::00 

TimeSent >= 00:00:00 AND TimeSent < 01:00:00

SENT_TIME::00:00 

TimeSent >= 00:00:00 AND TimeSent < 00:01:00

SENT_TIME::00:00:00 

TimeSent >= 00:00:00 AND TimeSent < 00:00:01


Date, Time, and Date-Time Queries

The details of how Date, Time, and Date-Time fields are indexed ultimately determine what results you will get from the date and time queries RQL supports.


How Date, Time, and Date-Time are Indexed

All of these fields are indexed as milliseconds since epoch, and each value is represented as a single specific millisecond. Here is what you need to know to understand how each of these fields are indexed:

  • Epoch is midnight (12am) January 1, 1970 UTC (in milliseconds since epoch this is 0).
  • Dates and times prior to midnight January 1, 1970 UTC have a negative value for milliseconds since epoch.
  • If only a date is supplied, the time is assumed to be midnight (00:00:00.000).
  • If only a time is supplied, the date is assumed to be January 1, 1970 (1970-01-01).

The following table has some examples.

field type value  ISO date-time equivalent  Milliseconds since epoch
date-time  June 10, 2022 2:29 PM  2022-06-10T14:29:00.000Z 1,654,871,340,000
date  June 10, 2022  2022-06-10T00:00:00.000Z  1,654,819,200,000
time  2:29 PM  1970-01-01T14:29:00.000Z  52,140,000
date  January 1, 1970  1970-01-01T00:00:00.000Z  0
date-time  Dec 31, 1969 1:00 PM  1969-12-31T13:00:00.000Z  -39,600,000


Querying Date and Date-Time Fields

Currently RQL treats Date and Date-Time field queries the same. It is up to the user to understand that dates from a Date field are treated as Date-Time fields where the time is always midnight.

Let’s look at some queries and see which queries find 10 June 2022 (2022-06-10), 11 June 2022 (2022-06-11), or 12 June 2022 (2022-06-12).

# RQL Full ISO RQL Equivalent Dates Found
1 MASTERDATE::2022-06-11 MASTERDATE >= 2022-06-11T00:00:00 AND
MASTERDATE < 2022-06-12T00:00:00
2022-06-11
2 MASTERDATE >= 2022-06-10 AND
MASTERDATE <= 2022-06-12
MASTERDATE >= 2022-06-10T00:00:00 AND
MASTERDATE <= 2022-06-13T00:00:00
2022-06-10
2022-06-11
2022-06-12
3 MASTERDATE > 2022-06-11T00:00:00 AND
MASTERDATE < 2022-06-12T00:00:00
MASTERDATE >= 2022-06-11T00:00:01 AND
MASTERDATE < 2022-06-12T00:00:00
none - midnight not included

Here is what happens in the above queries:

  1. The RQL query MASTERDATE::2022-06-11 searches for date-time values that fall anywhere on the day 2022-06-11. The query finds date fields that contained 2022-06-11 and does not find date fields that contained 2022-06-10 or 2022-06-12.

  2. The RQL query MASTERDATE >= 2022-06-10 AND MASTERDATE <= 2022-06-12 searches for date-time values that fall anywhere on the days 2022-06-10, 2022-06-11, and 2022-06-12. The date fields containing 2022-06-10, 2022-06-11, or 2022-06-12 are found.

  3. The RQL query MASTERDATE > 2022-06-11T00:00:00 AND MASTERDATE < 2022-06-12T00:00:00 includes all but one second of 2022-06-11. Since that excluded second is midnight 2022-06-11 and since no other midnights are included in this query, no date values are found.

Querying Time Fields

In RQL it is illegal to query a time field with date information. Even though times are treated as if they are date-time fields where the date is 1970-01-01, in RQL you will never specify this date when searching a time field. Here are some examples:

# RQL Time ISO RQL Equivalent Is time 14:30:05 found?
1 MASTERTIME::14 MASTERTIME >= 14:00:00 AND
MASTERTIME < 15:00:00
yes
2 MASTERTIME >= 14:30 MASTERTIME >= 14:30:00 yes
3 MASTERTIME >= 14:30 MASTERTIME >= 14:31:00 no
4 MASTERTIME >= 1970-01-01T14:30:00 Error Error - can’t search time with date-time

Expansion Queries using the EXPAND operator

While the initial release of Reveal supports expanding query results to include families, threads, etc., having queries with multiple steps of expanding results, searching within that expansion, then expanding those results, is not easy to do.

Why is RQL support for expansions needed?

Here is an example of a series of searches that includes four expansions of the results:

  1. Perform a search.
  2. Expand to families — adds attachments and/or messages.
  3. Restrict to Excel files — step 2 likely added Excel files not found in step 1.
  4. Expand to families — results in only families with at least 1 Excel file.
  5. Search results of step 4 to find documents out of that set that are marked responsive.
  6. Expand to families — results in families with at least 1 Excel file and at least one family member marked responsive.
  7. Search within results of step 6 for documents that are not marked responsive.
  8. Expand to families — results in families with at least 1 Excel file, at least one family member marked responsive, and at least one family member marked non-responsive.

To do the above without RQL support for expansions, every time you want to search within the expanded results, you would have to first save the expanded results to a folder, so you could start a new search that searches within that folder.

Instead of running a single query, you would have to run four separate queries, creating 3 different folders to hold the temporary results.

If you want to rerun the queries later to see how document additions and more review has changed the results, you again need to run four separate queries and create three new folders. Adding RQL support makes running and rerunning a query like this much easier and potentially more performant.

The EXPAND operator

EXPAND is a unary operator with syntax similar to NOT. One big difference is that you have to specify what fields to use for the expansion.

For this discussion of how the EXPAND operator works, assume that your dataset has the following fields:

Field Description
FAMILYID All family members (message and attachments) have the same unique family ID
THREADID All members of a single thread have the same unique thread ID
NEARDUPSETID All near duplicate set members share the same unique NearDupSetID
EXACTDUPSETID All exact duplicate set members share the same unique ExactDupSetID

What the EXPAND operator looks like

The EXPAND operator itself must include the fields that will be used for the expansion. Here are some examples and their meanings:

  • EXPAND{FAMILYID} — Expand to include all family members based on the FamilyID field
  • EXPAND{FAMILYID, THREADID} — Expand to include all family members and thread members based on FamilyID and THREADID. (NOTE - threads include family members, so you could just EXPAND{THREADID}.)
  • EXPAND{NEARDUPSETID, EXACTDUPSETID} — Expand to include both the near duplicates and exact duplicates of the query’s results
  • EXPAND{NEARDUPSETID, THREADID} — NOTE - both expansions are based on the set of results returned by the query expression this operator is applied to. Near duplicates that are in threads that aren’t in the original set of results won’t expand to include their thread members. You would have to say EXPAND{THREADID} EXPAND{NEARDUPSETID} <expr> to include the threads of all the near duplicate documents.

EXPAND examples

Simple Expansion

EXPAND{FAMILYID} enron

Search for enron then expand the results

Multi-Field Less Simple Expansion

EXPAND{NEARDUPSETID, THREADID} (enron AND gas)

Search for enron AND gas then add near duplicates of search results and thread members of search results.

Nested Expansions
EXPAND{THREADID} EXPAND{NEARDUPSETID} (Enron AND gas)

Search for Enron AND gas then expand to include near duplicates of the results. Finally expand to include the thread members of both the original results and the near duplicates of the original results.

Recreating the Original Example

Here are the fields that this example assumes are in the dataset for this example:

Field Description
FAMILYID All family members (message and attachments) have the same unique family ID.
EXTENSION The file extension of a document.
RESPONSIVE “true” if responsive, false otherwise. (Not how responsive is normally indexed, just a simple representation for the example.)

Earlier the following example was provided as motivation for providing RQL syntax for query expansions:

  1. Perform a search.
  2. Expand to families — adds attachments and/or messages.
  3. Restrict to Excel files — step 2 likely added Excel files not found in step 1.
  4. Expand to families — results in only families with at least 1 Excel file.
  5. Search results of step 4 to find documents out of that set that are marked responsive.
  6. Expand to families — results in families with at least 1 Excel file and at least one family member marked responsive.
  7. Search within results of step 6 for documents that are not marked responsive.
  8. Expand to families — results in families with at least 1 Excel file, at least one family member marked responsive, and at least one family member marked non-responsive.

Here is how the RQL could be built up to finally produce a single query expression:

  1. enron AND gas — an arbitrary query for this example.
  2. EXPAND{FAMILYID} (enron AND gas) — original query expanded to include families.
  3. (EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas) — restrict to Excel files.
  4. EXPAND{FAMILYID} ((EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas)) — families with Excel files.
  5. RESPONSIVE::true AND EXPAND{FAMILYID} ((EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas)) — responsive documents from (4).
  6. EXPAND{FAMILYID} (RESPONSIVE::true AND EXPAND{FAMILYID} ((EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas))) — families with at least 1 Excel file and 1 responsive document.
  7. RESPONSIVE::false AND EXPAND{FAMILYID} (RESPONSIVE::true AND EXPAND{FAMILYID} ((EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas))) — non-responsive documents from (6).
  8. EXPAND{FAMILYID} (RESPONSIVE::false AND EXPAND{FAMILYID} (RESPONSIVE::true AND EXPAND{FAMILYID} ((EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas)))) — families with at least 1 Excel file, 1 responsive document, and 1 non-responsive document.

The nesting of the queries gets complex once we get to step 8, but what used to have to be run as four separate queries in Reveal is now a single RQL query that can be saved and rerun later to see how the results have changed.

 

Last Updated 3/07/2024