ArangoDB v3.4 reached End of Life (EOL) and is no longer supported.
This documentation is outdated. Please see the most recent version here: Latest Docs
ArangoSearch Views in AQL
Views of type arangosearch
are an integration layer meant to seamlessly
integrate with and natively expose the full power of the
IResearch library
to the ArangoDB user.
They provide the capability to:
- evaluate together documents located in different collections
- search documents based on AQL boolean expressions and functions
- sort the result set based on how closely each document matched the search
Overview and Significance
Looking up documents in an ArangoSearch View is done via the FOR
keyword:
FOR doc IN someView
...
FOR
operations over ArangoSearch Views have an additional, optional, SEARCH
keyword:
FOR doc IN someView
SEARCH searchExpression
ArangoSearch views cannot be used as edge collections in traversals:
FOR v IN 1..3 ANY startVertex someView /* invalid! */
SEARCH
SEARCH
expressions look a lot like FILTER
operations, but have some noteable
differences.
First of all, filters and functions in SEARCH
, when applied to documents
emitted from an ArangoSearch View, work only on attributes linked in the
view.
For example, given a collection myCol
with the following documents:
[
{ someAttr: 'One', anotherAttr: 'One' },
{ someAttr: 'Two', anotherAttr: 'Two' }
]
with a view, where someAttr
is indexed by the following view myView
:
{
"type": "arangosearch",
"links": {
"myCol": {
"fields": {
"someAttr": {}
}
}
}
}
Then, a search on someAttr
yields the following result:
FOR doc IN myView
SEARCH doc.someAttr == 'One'
RETURN doc
[ { someAttr: 'One', anotherAttr: 'One' } ]
While a search on anotherAttr
yields an empty result:
FOR doc IN myView
SEARCH doc.anotherAttr == 'One'
RETURN doc
[]
- This only applies to the expression after the
SEARCH
keyword. - This only applies to tests regarding documents emitted from a view. Other tests are not affected.
- In order to use
SEARCH
using all attributes of a linked sources, the specialincludeAllFields
link property was designed.
SORT
The document search via the SEARCH
keyword and the sorting via the
ArangoSearch functions, namely BM25()
and TFIDF()
, are closely intertwined.
The query given in the SEARCH
expression is not only used to filter documents,
but also is used with the sorting functions to decide which document matches
the query best. Other documents in the view also affect this decision.
Therefore the ArangoSearch sorting functions can work only on documents
emitted from a view, as both the corresponding SEARCH
expression and the view
itself are consulted in order to sort the results.
The BOOST()
function, described below, can be used to fine-tune the resulting
ranking by weighing sub-expressions in SEARCH
differently.
Arrays and trackListPositions
Unless trackListPositions
is set to true
, which it is not by default, arrays behave differently. Namely
they behave like a disjunctive superposition of their values - this is best
shown with an example.
With trackListPositions: false
, which is the default, and given a document
doc
containing
{ attr: [ 'valueX', 'valueY', 'valueZ' ] }
in a SEARCH
clause, the expression
doc.attr == 'valueX'
will be true, as will be
doc.attr == 'valueY'
and == valueZ
. With trackListPositions: true
,
doc.attr[0] == 'valueX'
would work as usual.
Comparing analyzed fields
As described in value analysis, when a field is
processed by a specific analyzer, comparison tests are done per word. For
example, given the field text
is analyzed with "text_en"
and contains the
string "a quick brown fox jumps over the lazy dog"
, the following expression
will be true:
ANALYZER(d.text == 'fox', "text_en")
Note also, that the words analyzed in the text are stemmed, so this is also true:
ANALYZER(d.text == 'jump', "text_en")
So a comparison will actually test if a word is contained in the text. With
trackListPositions: false
, this means for arrays if the word is contained in
any element of the array. For example, given
d.text = [ "a quick", "brown fox", "jumps over the", "lazy dog"]
the following will be true:
ANALYZER(d.text == 'jump', "text_en")
ArangoSearch value analysis
A concept of value ‘analysis’ that is meant to break up a given value into a set of sub-values internally tied together by metadata which influences both the search and sort stages to provide the most appropriate match for the specified conditions, similar to queries to web search engines.
In plain terms this means a user can for example:
- request documents where the ‘body’ attribute best matches ‘a quick brown fox’
- request documents where the ‘dna’ attribute best matches a DNA sub sequence
- request documents where the ‘name’ attribute best matches gender
- etc. (via custom analyzers)
To a limited degree the concept of ‘analysis’ is even available in non-ArangoSearch AQL, e.g. the TOKENS(…) function will utilize the power of IResearch to break up a value into an AQL array that can be used anywhere in the AQL query.
In plain terms this means a user can match a document attribute when its value matches at least one entry from a set, e.g. to match docs with ‘word == quick’ OR ‘word == brown’ OR ‘word == fox’
FOR doc IN someCollection
FILTER doc.word IN TOKENS('a quick brown fox', 'text_en')
RETURN doc
ArangoSearch filters
The basic ArangoSearch functionality can be accessed via the SEARCH
statement
with common AQL filters and operators, e.g.:
AND
OR
NOT
==
<=
>=
<
>
!=
IN <ARRAY>
IN <RANGE>
However, the full power of ArangoSearch is harnessed and exposed via functions, during both the search and sort stages.
Note, that SEARCH
statement, in contrast to FILTER
, is meant to be treated
as a part of the FOR
operation, not as an individual statement.
The supported AQL context functions are:
ANALYZER()
ANALYZER(searchExpression, analyzer)
Override analyzer in a context of searchExpression with another one, denoted by a specified analyzer argument, making it available for search functions.
- searchExpression - any valid search expression
- analyzer - string with the analyzer to imbue, i.e. “text_en” or one of the other available string analyzers
By default, context contains Identity
analyzer.
BOOST()
BOOST(searchExpression, boost)
Override boost in a context of searchExpression with a specified value, making it available for scorer functions.
- searchExpression - any valid search expression
- boost - numeric boost value
By default, context contains boost value equal to 1.0
.
The supported search functions are:
EXISTS()
Note: Will only match values when the specified attribute has been processed with the link property storeValues set to “id” (by default it’s “none”).
EXISTS(doc.someAttr)
Match documents doc where the attribute someAttr exists in the document.
This also works with sub-attributes, e.g.
EXISTS(doc.someAttr.anotherAttr)
as long as the field is processed by the view with storeValues not none.
EXISTS(doc.someAttr, "analyzer", analyzer)
Match documents where doc.someAttr exists in the document and was indexed
by the specified analyzer. analyzer is optional and defaults to the
current context analyzer (e.g. specified by ANALYZER
function).
EXISTS(doc.someAttr, type)
Match documents where the doc.someAttr exists in the document and is of the specified type.
- doc.someAttr - the path of the attribute to exist in the document
- analyzer - string with the analyzer used, i.e. “text_en” or one of the other available string analyzers
- type - data type as string; one of:
- bool
- boolean
- numeric
- null
- string
In case if analyzer isn’t specified, current context analyzer (e.g.
specified by ANALYZER
function) will be used.
PHRASE()
PHRASE(doc.someAttr,
phrasePart [, skipTokens] [, phrasePart | , phrasePart, skipTokens]*
[, analyzer])
Search for a phrase in the referenced attributes.
The phrase can be expressed as an arbitrary number of phraseParts separated by skipToken number of tokens.
- doc.someAttr - the path of the attribute to compare against in the document
- phrasePart - a string to search in the token stream; may consist of several words; will be split using the specified analyzer
- skipTokens number of words or tokens to treat as wildcards
- analyzer - string with the analyzer used, i.e. “text_en” or one of the other available string analyzers
For example, given a document doc
containing the text "Lorem ipsum dolor sit
amet, consectetur adipiscing elit"
, the following expression will be true
:
PHRASE(doc.text, "ipsum", 1, "sit", 2, "adipiscing", "text_de")
Specifying deep attributes like doc.some.deep.attr
is also allowed. The
attribute has to be processed by the view as specified in the link.
STARTS_WITH()
STARTS_WITH(doc.someAttr, prefix)
Match the value of the doc.someAttr that starts with prefix
- doc.someAttr - the path of the attribute to compare against in the document
- prefix - a string to search at the start of the text
Specifying deep attributes like doc.some.deep.attr
is also allowed. The
attribute has to be processed by the view as specified in the link.
TOKENS()
TOKENS(input, analyzer)
Split the input string with the help of the specified analyzer into an
Array. The resulting Array can i.e. be used in subsequent FILTER
or SEARCH
statements with the IN operator. This can be used to better understand how
the specific analyzer is going to behave.
- input string to tokenize
- analyzer one of the available string_analyzers
MIN_MATCH()
MIN_MATCH(searchExpression [, searchExpression]*, minMatchCount)
Match documents where at least minMatchCount of the specified searchExpressions are satisfied.
- searchExpression - any valid search expression
- minMatchCount - minimum number of searchExpressions that should be satisfied
For example,
MIN_MATCH(doc.text == 'quick', doc.text == 'brown', doc.text == 'fox', 2)
if doc.text
, as analyzed by the current analyzer, contains 2 out of ‘quick’,
‘brown’ and ‘fox’, it will be included as matched one.
Searching examples
to match documents which have a ‘name’ attribute
FOR doc IN someView SEARCH EXISTS(doc.name)
RETURN doc
or
FOR doc IN someView SEARCH EXISTS(doc['name'])
RETURN doc
to match documents where ‘body’ was analyzed via the ‘text_en’ analyzer
FOR doc IN someView SEARCH EXISTS(doc.body, 'analyzer', 'text_en')
RETURN doc
or
FOR doc IN someView SEARCH EXISTS(doc['body'], 'analyzer', 'text_en')
RETURN doc
or
FOR doc IN someView SEARCH ANALYZER(EXISTS(doc['body'], 'analyzer'), 'text_en')
RETURN doc
to match documents which have an ‘age’ attribute of type number
FOR doc IN someView SEARCH EXISTS(doc.age, 'numeric')
RETURN doc
or
FOR doc IN someView SEARCH EXISTS(doc['age'], 'numeric')
RETURN doc
to match documents where ‘description’ contains word ‘quick’ or word ‘brown’ and has been analyzed with ‘text_en’ analyzer
FOR doc IN someView SEARCH ANALYZER(doc.description == 'quick' OR doc.description == 'brown', 'text_en')
RETURN doc
to match documents where ‘description’ contains at least 2 of 3 words ‘quick’, ‘brown’, ‘fox’ and has been analyzed with ‘text_en’ analyzer
FOR doc IN someView SEARCH ANALYZER(
MIN_MATCH(doc.description == 'quick', doc.description == 'brown', doc.description == 'fox', 2),
'text_en'
)
RETURN doc
to match documents where ‘description’ contains a phrase ‘quick brown’
FOR doc IN someView SEARCH PHRASE(doc.description, [ 'quick brown' ], 'text_en')
RETURN doc
or
FOR doc IN someView SEARCH PHRASE(doc['description'], [ 'quick brown' ], 'text_en')
RETURN doc
or
FOR doc IN someView SEARCH ANALYZER(PHRASE(doc['description'], [ 'quick brown' ]), 'text_en')
RETURN doc
to match documents where ‘body’ contains the phrase consisting of a sequence like this: ‘quick’ * ‘fox jumps’ (where the asterisk can be any single word)
FOR doc IN someView SEARCH PHRASE(doc.body, [ 'quick', 1, 'fox jumps' ], 'text_en')
RETURN doc
or
FOR doc IN someView SEARCH PHRASE(doc['body'], [ 'quick', 1, 'fox jumps' ], 'text_en')
RETURN doc
or
FOR doc IN someView SEARCH ANALYZER(PHRASE(doc['body'], [ 'quick', 1, 'fox jumps' ]), 'text_en')
RETURN doc
to match documents where ‘story’ starts with ‘In the beginning’
FOR doc IN someView SEARCH STARTS_WITH(doc.story, 'In the beginning')
RETURN DOC
or
FOR doc IN someView SEARCH STARTS_WITH(doc['story'], 'In the beginning')
RETURN DOC
to watch the analyzer doing its work
RETURN TOKENS('a quick brown fox', 'text_en')
to match documents where ‘description’ best matches ‘a quick brown fox’
FOR doc IN someView SEARCH ANALYZER(doc.description IN TOKENS('a quick brown fox', 'text_en'), 'text_en')
RETURN doc
ArangoSearch sorting
A major feature of ArangoSearch Views is their capability of sorting results
based on the creation-time search conditions and zero or more sorting functions.
The ArangoSearch sorting functions available are TFIDF()
and BM25()
.
Note: The first argument to any ArangoSearch sorting function is always the
document emitted by a FOR
operation over an ArangoSearch View.
Note: An ArangoSearch sorting function is only allowed as an argument to a
SORT
operation. But they can be mixed with other arguments to SORT
.
So the following examples are valid:
FOR doc IN someView
SORT TFIDF(doc)
FOR a IN viewA
FOR b IN viewB
SORT BM25(a), TFIDF(b)
FOR a IN viewA
FOR c IN someCollection
FOR b IN viewB
SORT TFIDF(b), c.name, BM25(a)
while these will not work:
FOR doc IN someCollection
SORT TFIDF(doc) // !!! Error
FOR doc IN someCollection
RETURN BM25(doc) // !!! Error
FOR doc IN someCollection
SORT BM25(doc.someAttr) // !!! Error
FOR doc IN someView
SORT TFIDF("someString") // !!! Error
FOR doc IN someView
SORT BM25({some: obj}) // !!! Error
The following sorting methods are available:
Literal sorting
You can sort documents by simply specifying arbitrary values or expressions, as you do in other places.
BM25()
BM25(doc, k, b)
- k (number, optional): calibrates the text term frequency scaling, the default is 1.2. A k value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency
- b (number, optional): determines the scaling by the total text length, the default
is 0.75. At the extreme values of the coefficient b, BM25 turns into ranking
functions known as BM11 (for b =
1
, corresponds to fully scaling the term weight by the total text length) and BM15 (for b =0
, corresponds to no length normalization)
Sorts documents using the Best Matching 25 algorithm.
See the BM25()
section in ArangoSearch Scorers
for details.
TFIDF()
TFIDF(doc, withNorms)
- doc (document): must be emitted by
FOR doc IN someView
- withNorms (bool, optional): specifying whether scores should be normalized, the default is false
Sorts documents using the
term frequency–inverse document frequency algorithm.
See the
TFIDF()
section in ArangoSearch Scorers
for details.
Sorting examples
to sort documents by the value of the ‘name’ attribute
FOR doc IN someView
SORT doc.name
RETURN doc
or
FOR doc IN someView
SORT doc['name']
RETURN doc
to sort documents via the BM25 algorithm
FOR doc IN someView
SORT BM25(doc)
RETURN doc
to sort documents via the BM25 algorithm with ‘k’ = 1.2 and ‘b’ = 0.75
FOR doc IN someView
SORT BM25(doc, 1.2, 0.75)
RETURN doc
to sort documents via the TFIDF algorithm
FOR doc IN someView
SORT TFIDF(doc)
RETURN doc
to sort documents via the TFIDF algorithm with norms
FOR doc IN someView
SORT TFIDF(doc, true)
RETURN doc
to sort documents by value of ‘name’ and then by the TFIDF algorithm where ‘name’ values are equivalent
FOR doc IN someView
SORT doc.name, TFIDF(doc)
RETURN doc
Use cases
Prefix search
The data contained in our view looks like that:
{ "id": 1, "body": "ThisIsAVeryLongWord" }
{ "id": 2, "body": "ThisIsNotSoLong" }
{ "id": 3, "body": "ThisIsShorter" }
{ "id": 4, "body": "ThisIs" }
{ "id": 5, "body": "ButNotThis" }
We now want to search for documents where the attribute body
starts with “ThisIs”,
A simple AQL query executing this prefix search:
FOR doc IN someView SEARCH STARTS_WITH(doc.body, 'ThisIs')
RETURN doc
It will find the documents with the ids 1
, 2
, 3
, 4
, but not 5
.