Faceted Search with ArangoSearch
Combine aggregation with search queries to retrieve how often values occur overall
A popular method for filtering items in an online shop is to display product categories in a list, together with the number of items in each category. This way, users get an idea of how many items will be left after applying a certain filter before they actually enable it. This concept can be extended to any properties, also called facets.
To implement such a feature in ArangoDB, you can use a COLLECT operation to group and count how many documents share an attribute value. This is also possible with ArangoSearch Views by simply iterating over a View instead of a collection.
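As a point of comparison, the same grouping can be sketched against a plain collection. This minimal example assumes the movie documents live in a collection named imdb_vertices (the collection that the View on this page links to) and that each document has a genre attribute:

```aql
// Sketch: count documents per genre directly on the collection.
// Assumes a collection named imdb_vertices with a genre attribute.
FOR doc IN imdb_vertices
  COLLECT genre = doc.genre WITH COUNT INTO count
  RETURN { genre, count }
```

Replacing the collection name with the name of a View yields the View-based variant of the query.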
Dataset: IMDB movie dataset
View definition:
{
  "links": {
    "imdb_vertices": {
      "fields": {
        "genre": {
          "analyzers": [
            "identity"
          ]
        }
      }
    }
  }
}
AQL Queries:
Find all genre values by grouping by the genre attribute and counting the number of occurrences:
FOR doc IN imdb
  COLLECT genre = doc.genre WITH COUNT INTO count
  RETURN { genre, count }
genre | count |
---|---|
null | 51287 |
Action | 2449 |
Adventure | 312 |
Animation | 426 |
British | 1 |
Comedy | 3188 |
… | … |
The COLLECT operation is applied as a post-operation. In other words, it is not accelerated by the View index. On the other hand, the genre field does not need to be indexed for this query.
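For display in a facet list, it can be useful to drop documents without a genre and to order the facets by size. The following is a minimal sketch of one way to do that. The FILTER and SORT run after the COLLECT, so they are post-operations as well and are not accelerated by the View index:

```aql
// Sketch: facet counts without the null group, largest facets first.
FOR doc IN imdb
  COLLECT genre = doc.genre WITH COUNT INTO count
  FILTER genre != null   // skip documents that have no genre
  SORT count DESC        // most frequent genres first
  RETURN { genre, count }
```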
To look up a specific genre, the field needs to be indexed. The lookup itself utilizes the View index, but the COLLECT operation is still a post-operation:
FOR doc IN imdb
  SEARCH ANALYZER(doc.genre == "Action", "identity")
  COLLECT WITH COUNT INTO count
  RETURN count
[
2690
]
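The same pattern can be extended to a handful of genres at once. The sketch below assumes you are interested in the genres "Action" and "Comedy" (chosen here purely for illustration). The View index is used to find the matching documents, while the grouping and counting remain a post-operation:

```aql
// Sketch: counts for a selected subset of genres in one query.
// The genre values in the array are illustrative choices.
FOR doc IN imdb
  SEARCH ANALYZER(doc.genre IN ["Action", "Comedy"], "identity")
  COLLECT genre = doc.genre WITH COUNT INTO count
  RETURN { genre, count }
```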
For the above query with a single, simple condition, there is an optimization you can enable that accurately determines the count from index data faster than the standard COLLECT. Also see Count Approximation.
FOR doc IN imdb
  SEARCH ANALYZER(doc.genre == "Action", "identity")
  OPTIONS { countApproximate: "cost" }
  COLLECT WITH COUNT INTO count
  RETURN count
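For comparison, the default behavior can also be requested explicitly. This sketch sets countApproximate to "exact", which enumerates the matching documents to produce the count and should behave like the earlier query without the option:

```aql
// Sketch: explicitly request the default, enumerating count mode.
FOR doc IN imdb
  SEARCH ANALYZER(doc.genre == "Action", "identity")
  OPTIONS { countApproximate: "exact" }
  COLLECT WITH COUNT INTO count
  RETURN count
```

Running both variants against your own data is a simple way to check whether the cost-based mode returns the same number for a given condition.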
To apply this optimization to the faceted search paradigm over all genres, you can run and cache the following query that determines all unique genre values:
FOR doc IN imdb
  RETURN DISTINCT doc.genre
You may use the AQL query cache if the data does not change much, or you could execute the above query periodically and store the result in a separate collection. The numbers will not be fully up to date in that case, but this is often an acceptable tradeoff for a faceted search.
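Storing the genre list in a separate collection could look like the following sketch. The collection name genreCache and the attribute name are hypothetical and not part of the dataset; the collection has to exist before the query runs:

```aql
// Sketch: refresh a hypothetical cache collection with the distinct genres.
// genreCache and the name attribute are assumptions for illustration.
FOR doc IN imdb
  COLLECT genre = doc.genre
  FILTER genre != null          // do not cache the "no genre" group
  UPSERT { name: genre }        // look for an existing cache entry
    INSERT { name: genre }      // add the genre if it is new
    UPDATE {}                   // leave existing entries unchanged
    IN genreCache
```

This variant only adds genres. Entries for genres that no longer occur in the data would need to be removed separately.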
You can then use the genre list to look up each genre and retrieve the count while utilizing the count approximation optimization:
LET genres = [ "Action", "Adventure", "Animation", /* ... */ ]
FOR genre IN genres
  LET count = FIRST(
    FOR doc IN imdb
      SEARCH ANALYZER(doc.genre == genre, "identity")
      OPTIONS { countApproximate: "cost" }
      COLLECT WITH COUNT INTO count
      RETURN count
  )
  RETURN { genre, count }
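Instead of hard-coding the list in the query string, the cached genres can be passed in from the application. The sketch below assumes a bind parameter named genres (a hypothetical name) that holds an array of strings, and it additionally sorts the facets by count:

```aql
// Sketch: per-genre counts with the genre list supplied as a bind parameter.
// @genres is an assumed bind parameter, e.g. ["Action", "Adventure"].
FOR genre IN @genres
  LET count = FIRST(
    FOR doc IN imdb
      SEARCH ANALYZER(doc.genre == genre, "identity")
      OPTIONS { countApproximate: "cost" }
      COLLECT WITH COUNT INTO count
      RETURN count
  )
  SORT count DESC
  RETURN { genre, count }
```

Keeping the list out of the query text means the same query string can be reused whenever the cached genre list changes.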