Fuzzy Search with ArangoSearch
Loosely match strings to find partial congruences and to compensate for typing errors
Fuzzy search is an umbrella term for various approximate matching algorithms. What they allow you to do is to find matches even if the search terms are not spelled exactly like the words in the stored text. This will include terms that are similar, alternatively spelled, or mistyped but potentially relevant for the search request as well.
If you want to try out fuzzy search with ArangoDB for yourself, then check out our interactive Fuzzy Search tutorial.
Similarity Measures
How the approximate string matching algorithms usually work is that they measure how similar (or different) two strings are. The following similarity measures are available in ArangoDB:
- Levenshtein distance and a variant called Damerau-Levenshtein distance, which work best with short strings
- n-gram similarity and a variant called positional n-gram similarity, that are well-suited to assess the similarity of longer strings that share subsequences
Levenshtein Distance
The Levenshtein distance is an edit distance string metric. It is informally defined as the minimum number of single-character edits to go from one string to the other. The allowed operations are:
- insertions
- deletions
- substitutions
The lowest possible number of the required operations is the Levenshtein distance. Consider the following two strings:
galaxy
galxy
The latter word is misspelled (it is missing the second a
). We can correct it
by adding an a
. One operation in total means that the Levenshtein distance is
1
. It also is for the following pair of strings, but this time requiring the
removal of a character (an extra l
):
galaxy
gallaxy
In the following example, the e
needs to be replaced with an a
:
galaxy
gelaxy
This is again one operation, so the edit distance is 1
. Here is an example that
requires more than one operation:
galaxy
glaaxy
The first a
and the l
are in the wrong order, perhaps cause the latter
string was entered hastily. We can remove the l
from the second position and
insert an l
at the third position to correct the string. Therefore, the
Levenshtein distance is 2
. We could also substitute the second and third
character instead, but the edit distance remains the same.
Damerau-Levenshtein Distance
The Damerau-Levenshtein distance is like the Levenshtein distance
but with transpositions as additional operation. Consider the following example
where the latter string has the a
and l
in the wrong order:
galaxy
glaaxy
This can be corrected by shifting the l
one character to the right. Hence,
the Damerau-Levenshtein distance is only 1
. The pure Levenshtein distance
is 2
.
N-Gram Similarity
The n-grams of a string are all of its possible substrings of a given length.
The n
in n-gram stands for that substring size. N-grams of size 2 are
called 2-grams or bigrams, and n-grams of size 3 are called 3-grams or
trigrams.
Here is an example for the word avocado
that has three trigrams:
avocado
avo
voc
oca
cad
ado
To compare the similarity of two sets of n-grams, one can simply count how many n-grams of the target string match the source string (the target string being the search term and the source string being the stored string), and divide the sum by the number of the target string’s n-grams. Only fully matching n-grams count.
Consider the following example with avocado
as source string and vocals
as
target string:
avocado vocals
avo voc
voc oca
oca cal
cad als
ado
We find the trigrams voc
and oca
from the right side also on the left side.
The other two do not have a corresponding trigram on the left side. That means,
2 / 4
trigrams match, resulting in a similarity of 0.5
using this
similarity measure.
Positional N-Gram Similarity
Instead of considering only full n-gram matches, one can also consider partial matches where the characters are in the same positions and counting the longest common sequence.
avocado vocals
avo voc
voc oca
oca cal
cad als
ado
In above example, voc
and oca
on the right side have a fully matching
trigram on the left side. cal
and als
do not, but there is cad
which has
a c
in the first position and an a
in the second position, and both avo
and ado
have a matching a
in the first position.
voc voc 3 / 3 = 1
oca oca 3 / 3 = 1
cad cal 2 / 3 = 0.666…
ado als 1 / 3 = 0.333…
We sum up the highest similarity we found for each trigram and divide by the
total n-gram count, which ever of the two is higher (here the left side with
5 trigrams). The result is 3 / 5
or 0.6
in positional n-gram similarity.
Fuzzy Search in ArangoDB
There are the following AQL string functions to calculate the similarity of two arbitrary strings:
RETURN [
LEVENSHTEIN_DISTANCE("galaxy", "glaaxy"), // 1 (with transpositions)
NGRAM_SIMILARITY("avocado", "vocals", 3) // 0.5 (using trigrams)
NGRAM_POSITIONAL_SIMILARITY("avocado", "vocals", 3) // 0.6 (using trigrams)
]
To perform fuzzy searches, there are two ArangoSearch functions:
They can be used in conjunction with a View for index-accelerated fuzzy search
queries. The string similarity affects the overall BM25()
/ TFIDF()
score
when ranking results.
Searching with LEVENSHTEIN_MATCH()
Dataset: IMDB movie dataset
Custom Analyzer:
Create a text
Analyzer in arangosh to tokenize text, normalize case to all
lowercase and to remove diacritics, but with stemming disabled unlike the
built-in text_en
Analyzer. The search will be accent- and case-insensitive,
and the Levenshtein distance between the stored and searched text will be taken
into account accurately. With stemming enabled, it would be less accurate with
respect to the original strings, but potentially find more matches that are
also relevant.
View definition:
{
"links": {
"imdb_vertices": {
"fields": {
"description": {
"analyzers": [
"text_en_no_stem"
]
}
}
}
}
}
AQL queries:
Search for the token galxy
in the movie descriptions with some fuzziness.
The maximum allowed Levenshtein distance is set to 1
. Everything with a
Levenshtein distance equal to or lower than this value will be a match and the
respective documents will be included in the search result. The query will find
the token galaxy
as the edit distance to galxy
is 1
:
FOR doc IN imdb
SEARCH ANALYZER(
LEVENSHTEIN_MATCH(
doc.description,
TOKENS("galxy", "text_en_no_stem")[0],
1, // max distance
false // without transpositions
),
"text_en_no_stem"
)
SORT BM25(doc) DESC
RETURN {
title: doc.title,
description: doc.description
}
title | description |
---|---|
Stargate: The Ark of Truth | … in the Ori’s own home galaxy. … SG-1 travels to the Ori galaxy aboard the Odyssey. … finds themselves in a distant galaxy fighting two powerful enemies. |
The Ice Pirates | … the most precious commodity in the galaxy is water. … the unreachable centre of the galaxy at the end of the galactic trade wars. The galaxy is ruled by an evil emperor … |
Star Wars: Episode III: Revenge of the Sith | … leading a massive clone army into a galaxy-wide battle against the Separatists. When the sinister Sith unveil a thousand-year-old plot to rule the galaxy, … |
… | … |
Searching with NGRAM_MATCH()
Dataset: IMDB movie dataset
Custom Analyzer:
NGRAM_MATCH()
requires an ngram
Analyzer. For this example, create a
trigram Analyzer in arangosh with a minimum and maximum n-gram size of 3,
not including the original string:
View definition:
{
"links": {
"imdb_vertices": {
"fields": {
"name": {
"analyzers": [
"trigram"
]
}
}
}
}
}
AQL queries:
Search for actor names with an n-gram similarity of at least 50%.
FOR doc IN imdb
SEARCH NGRAM_MATCH(
doc.name,
"avocado",
0.5, // similarity threshold
"trigram" // custom n-gram Analyzer
)
RETURN {
name: doc.name
}
name |
---|
Nancy Savoca |
John Savoca |