Elasticsearch — Setting up a synonyms search
Hey folks! Throughout this post, we will see how to configure a basic synonyms search using Elasticsearch (:
Regardless of your expertise with Elasticsearch, you should be able to follow this basic implementation.
To keep this post short, we will go through the most basic implementation only. I may write more about this topic in the future. Let’s start! (:
Our Problem
Imagine that we are a Brazilian e-commerce selling laptops. We look at our search statistics and realize our users are searching for the word notebook, and they get angry because they don’t get any results. That’s so bad :(
After some investigation, we noticed that in Brazil notebook is a synonym for laptop, and all our competitors have updated their systems to work with that. It’s our turn to fix it, increase our sales and make our customers happy again :)
With this in mind, let’s start setting up the Elasticsearch environment.
Setting up the environment
We aren’t covering the basic usage of Elasticsearch here. I’m using Docker to start the service and run it.
To start the Elasticsearch cluster, run:
$ docker run -p 9200:9200 docker.elastic.co/elasticsearch/elasticsearch:6.0.0
We are supposed to see something like this on the console:
...
[2017-11-24T21:19:48,835][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [3FHsLdj] publish_address {172.17.0.2:9200}, bound_addresses {0.0.0.0:9200}
[2017-11-24T21:19:48,836][INFO ][o.e.n.Node ] [3FHsLdj] started
[2017-11-24T21:19:48,923][INFO ][o.e.g.GatewayService ] [3FHsLdj] recovered [0] indices into cluster_state
Note: if you get the max virtual memory areas error, you can run sudo sysctl -w vm.max_map_count=262144. I also recommend you search for this error to understand this workaround.
Our Elasticsearch cluster is running. To make sure it’s working properly, we can access http://localhost:9200/ and we should see something like this:
{
"name": "3FHsLdj",
"cluster_name": "docker-cluster",
"cluster_uuid": "d3TUN9siQiWAnziLqK3K7w",
"version": {
"number": "6.0.0",
"build_hash": "8f0685b",
"build_date": "2017-11-10T18:41:22.859Z",
"build_snapshot": false,
"lucene_version": "7.0.1",
"minimum_wire_compatibility_version": "5.6.0",
"minimum_index_compatibility_version": "5.0.0"
},
"tagline": "You Know, for Search"
}
Or we can call the REST API of our cluster:
$ curl -XGET 'localhost:9200/_cat/health?v&pretty'
epoch      timestamp cluster        status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1511623576 15:26:16  docker-cluster yellow          1         1      2   2    0    0        2             0                  -                 50.0%
Yeey!! Our cluster is working as expected, let’s create our index and set it up =)
Creating our index
To create an index we need to call the API using the HTTP method PUT. Let’s create an index called laptops:
$ curl -XPUT 'localhost:9200/laptops?pretty'
{
"acknowledged":true,
"shards_acknowledged":true,
"index":"laptops"
}
Our index was successfully created, it’s time to insert some data!
If you need to understand the basic concepts such as index, access this link:
Inserting data in our index
We are moving forward, huh? That’s great!!! Shall we insert some data into our index? Yaaaz!!!
$ curl -XPUT 'localhost:9200/laptops/doc/1?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop X1 i7 8gb RAM "
}
'
After inserting this doc we will get a result similar to this:
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
Great! Let’s insert more data…
$ curl -XPUT 'localhost:9200/laptops/doc/2?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop X2 i5 4gb RAM "
}
'
$ curl -XPUT 'localhost:9200/laptops/doc/3?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop X3 i3 2gb RAM "
}
'
$ curl -XPUT 'localhost:9200/laptops/doc/4?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop Z1 i7 6gb RAM "
}
'
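Sending one request per document gets tedious quickly. As a rough sketch (not something this post depends on), the four laptops could also go through the _bulk endpoint, which takes a newline-delimited body: one action line followed by one source line per document. The helper name bulk_body is made up for this example, and the titles are written without the trailing space.

```python
import json

# The four laptops used throughout the post, keyed by document id.
TITLES = {
    1: "Laptop X1 i7 8gb RAM",
    2: "Laptop X2 i5 4gb RAM",
    3: "Laptop X3 i3 2gb RAM",
    4: "Laptop Z1 i7 6gb RAM",
}


def bulk_body(index, doc_type, titles):
    """Build a newline-delimited _bulk body: an action line, then a source line, per document."""
    lines = []
    for doc_id, title in titles.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps({"title": title}))
    # The bulk body has to end with a newline character.
    return "\n".join(lines) + "\n"


print(bulk_body("laptops", "doc", TITLES), end="")
```

If we save that output to a file, it should be possible to send it with curl -XPOST 'localhost:9200/_bulk?pretty' -H 'Content-Type: application/x-ndjson' --data-binary @file, where file is whatever name we picked.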
Cool! We have 4 laptops on our Elasticsearch, let’s search for them :D
The notebook treasure map
As I said, in Brazil we use notebook as a synonym for laptop, and when people search for notebook they usually expect a laptop as a result. Crazy, huh? Yaz, we are crazy :D
Ok, first let’s test the basic search request:
$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match": { "title": "notebook i7 8gb" } }
}
'
Using a match with notebook i7 8gb we will get two results:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.6931472,
"hits" : [
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "4",
"_score" : 0.6931472,
"_source" : {
"title" : "Laptop Z1 i7 6gb RAM "
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"title" : "Laptop X1 i7 8gb RAM "
}
}
]
}
}
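We got these results because a match query, by default, returns the documents that contain any of the query terms: i7 and 8gb matched, while the notebook term matched nothing and was effectively ignored. As a user, we expect to get just the laptops with i7 and 8gb, so let’s change our query to return a document only when all 3 requirements match.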
$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{ "match": { "title": "notebook" } },
{ "match": { "title": "i7" } },
{ "match": { "title": "8gb" } }
]
}
}
}
'
At this point we face the problem with synonyms. No results were returned for this search:
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
Why is this happening? Because we now require a match for all 3 terms, notebook, i7 and 8gb, and none of our documents contains the term notebook. So we need to configure our index to treat notebook as a synonym for laptop.
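To make the difference between the two searches concrete, here is a toy model in plain Python. It is not Elasticsearch code: tokenization is just a lowercase whitespace split (only an approximation of the standard analyzer), and the function names match_any and match_all are invented for this sketch.

```python
# Toy model of the two searches above; NOT Elasticsearch code.
TITLES = {
    1: "Laptop X1 i7 8gb RAM",
    2: "Laptop X2 i5 4gb RAM",
    3: "Laptop X3 i3 2gb RAM",
    4: "Laptop Z1 i7 6gb RAM",
}


def tokens(text):
    """Lowercase the text and split it on whitespace."""
    return set(text.lower().split())


def match_any(query):
    """IDs of titles sharing at least one term with the query (default match behaviour)."""
    wanted = tokens(query)
    return sorted(doc_id for doc_id, title in TITLES.items() if wanted & tokens(title))


def match_all(query):
    """IDs of titles containing every query term (one must clause per term)."""
    wanted = tokens(query)
    return sorted(doc_id for doc_id, title in TITLES.items() if wanted <= tokens(title))


print(match_any("notebook i7 8gb"))  # [1, 4]
print(match_all("notebook i7 8gb"))  # []
print(match_all("laptop i7 8gb"))    # [1]
```

The last line shows what we are after: once notebook is treated like laptop, the strict search finds exactly one document.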
Configuring our index
The most basic configuration defines the synonyms directly in the configuration request. Analysis settings can’t be changed on an open index, and the analyzer of an existing field can’t be changed at all, so the simplest thing to do with our tiny index is to delete it:
$ curl -XDELETE 'localhost:9200/laptops/?pretty'
{
"acknowledged" : true
}
Now we can recreate our index with the right analyzer and filter:
$ curl -XPUT 'localhost:9200/laptops/?pretty' -H 'Content-Type: application/json' -d'
{
"settings": {
"index" : {
"analysis" : {
"filter" : {
"synonym_filter" : {
"type" : "synonym",
"synonyms" : [
"laptop, notebook"
]
}
},
"analyzer" : {
"synonym_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "synonym_filter"]
}
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "synonym_analyzer"
}
}
}
}
}'
Note: we can insert as many synonyms as we want. For simplicity we are using just notebook and laptop.
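Each entry in the synonyms list is one rule in the Solr synonym format. Comma-separated terms are treated as equivalent, and a rule written with => replaces the terms on the left with the terms on the right. The small parser below is only a sketch to show how those two rule shapes behave: parse_rules is a made-up helper, it handles single-word terms only, and it assumes the filter's default expand behaviour.

```python
def parse_rules(rules):
    """Turn Solr-style synonym rules into a {term: [replacement terms]} table."""
    table = {}
    for rule in rules:
        if "=>" in rule:
            # Explicit mapping: terms on the left are replaced by the terms on the right.
            left, right = rule.split("=>")
            sources = [term.strip().lower() for term in left.split(",")]
            targets = [term.strip().lower() for term in right.split(",")]
        else:
            # Equivalent terms: every term expands to the whole group.
            sources = targets = [term.strip().lower() for term in rule.split(",")]
        for source in sources:
            replacements = table.setdefault(source, [])
            for target in targets:
                if target not in replacements:
                    replacements.append(target)
    return table


print(parse_rules(["laptop, notebook"]))
# {'laptop': ['laptop', 'notebook'], 'notebook': ['laptop', 'notebook']}
print(parse_rules(["notebook => laptop"]))
# {'notebook': ['laptop']}
```

With the rule we use in this post, both words end up pointing at the same pair, so it doesn't matter which one the user types or which one is in the title.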
Ok, let’s understand what we configured:
[...]
"filter" : {
"synonym_filter" : {
"type" : "synonym",
"synonyms" : [
"laptop, notebook"
]
}
}
[...]
First, we created a filter called synonym_filter with the synonym type and with laptop and notebook as synonyms in the synonyms list. It will be used by our analyzer:
[...]
"analyzer" : {
"synonym_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "synonym_filter"]
}
}
[...]
We created an analyzer called synonym_analyzer. This analyzer uses the standard tokenizer and two filters: the lowercase filter converts all tokens to lowercase, and the synonym_filter introduces the synonyms into the token stream.
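If it helps to picture the chain, here is a toy version in Python. It only imitates the three steps (tokenizer, lowercase, synonym filter) and is not how Elasticsearch is implemented: the regex is a rough stand-in for the standard tokenizer, and the real filter keeps synonyms at the same token position instead of flattening them into a list.

```python
import re

# Hard-coded equivalent of the "laptop, notebook" rule: each term expands to both.
SYNONYMS = {"laptop": ["laptop", "notebook"], "notebook": ["laptop", "notebook"]}


def tokenize(text):
    """Rough stand-in for the standard tokenizer: keep runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    """Lowercase every token."""
    return [token.lower() for token in tokens]


def synonym_filter(tokens):
    """Replace each token by its synonym group, leaving other tokens untouched."""
    expanded = []
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


def synonym_analyzer(text):
    """tokenizer -> lowercase -> synonym_filter, in the same order as in our settings."""
    return synonym_filter(lowercase(tokenize(text)))


print(synonym_analyzer("Laptop X1 i7 8gb RAM"))
# ['laptop', 'notebook', 'x1', 'i7', '8gb', 'ram']
print(synonym_analyzer("notebook"))
# ['laptop', 'notebook']
```

Notice that in this toy version the order of the filters matters: the table is keyed by lowercase words, so Laptop only gets expanded because lowercase runs first.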
Ok, we have our analyzer defined. Now we need to map it to our field to make it possible to search by synonyms:
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "synonym_analyzer"
}
}
}
}
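This configuration maps our title field to use our synonym_analyzer. Because we don’t set a separate search analyzer, the same analyzer is applied both when documents are indexed and when our queries are analyzed.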
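A handy way to check that the analyzer does what we expect is the _analyze endpoint of the index, which returns the tokens produced for a piece of text. Below is a minimal sketch using only the Python standard library. ES_URL assumes the Docker container from the beginning of this post, and the helper names analyze_request and analyze are invented for the example.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: the Docker container started earlier


def analyze_request(index, analyzer, text):
    """Build (but do not send) a request to the index-level _analyze endpoint."""
    body = {"analyzer": analyzer, "text": text}
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(index, analyzer, text):
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(analyze_request(index, analyzer, text)) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]
```

With the cluster running, calling analyze("laptops", "synonym_analyzer", "Notebook X1") should list both laptop and notebook among the tokens. If only notebook shows up, the filter or the mapping is probably not wired the way we think.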
Cool! Now we have our basic settings in place, let’s re-index our data and test our simple implementation :D
Testing
Let’s insert our data again and try to search:
$ curl -XPUT 'localhost:9200/laptops/doc/1?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop X1 i7 8gb RAM"}'
$ curl -XPUT 'localhost:9200/laptops/doc/2?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop X2 i5 4gb RAM"}'
$ curl -XPUT 'localhost:9200/laptops/doc/3?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop X3 i3 2gb RAM"}'
$ curl -XPUT 'localhost:9200/laptops/doc/4?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop Z1 i7 6gb RAM"}'
Ok, once we have our data we can start searching for it. First, we will search for just notebook:
$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match": { "title": "notebook" } }
}
'
And we get… ALL RESULTS!!!!
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 0.41501677,
"hits" : [
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_score" : 0.41501677,
"_source" : {
"title" : "Laptop X1 i7 8gb RAM"
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "3",
"_score" : 0.41501677,
"_source" : {
"title" : "Laptop X3 i3 2gb RAM"
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "2",
"_score" : 0.26302126,
"_source" : {
"title" : "Laptop X2 i5 4gb RAM"
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "4",
"_score" : 0.26302126,
"_source" : {
"title" : "Laptop Z1 i7 6gb RAM"
}
}
]
}
}
YEEEY!!! Let’s test to get a more specific result again, matching notebook, i7, and 8gb:
$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{ "match": { "title": "notebook" } },
{ "match": { "title": "i7" } },
{ "match": { "title": "8gb" } }
]
}
}
}
'
BA DUM TSSS!!!!!!!! We got just one result:
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0324807,
"hits" : [
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_score" : 1.0324807,
"_source" : {
"title" : "Laptop X1 i7 8gb RAM"
}
}
]
}
}
WOOWW!!! Our synonym search is working!!!! A.W.E.S.O.M.E!!!!
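As a side note, the bool query with one match per term is not the only way to ask for all terms. The match query also accepts an operator parameter, so a more compact body could look like the sketch below. I haven't run this variant against the cluster from this post, so treat it as something to try rather than a verified result. The helper name match_all_terms is made up.

```python
import json


def match_all_terms(field, text):
    """Search body for a match query that requires every term of `text` to be present."""
    return {"query": {"match": {field: {"query": text, "operator": "and"}}}}


# Print the body so it can be pasted into a curl -d'...' call.
print(json.dumps(match_all_terms("title", "notebook i7 8gb"), indent=2))
```

Sent to localhost:9200/laptops/_search, this should behave like the three must clauses above and come back with the X1 laptop only.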
Wrapping up
There are other ways to configure the synonyms filter. The most common ones are putting the synonyms directly into the index configuration, as we did here, or using the synonyms_path attribute to point to a text file holding our synonyms. The latter deserves its own post.
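Just to give a taste of the file-based variant, here is a sketch of what the settings could look like. The attribute is spelled synonyms_path in the Elasticsearch reference, and the path is resolved relative to the Elasticsearch config directory. The file name analysis/synonyms.txt is only an example, and the helper settings_with_synonyms_file is invented for this post.

```python
import json

# Hypothetical file location, relative to the Elasticsearch config directory.
SYNONYMS_FILE = "analysis/synonyms.txt"

# What the file would contain: one rule per line, same syntax as the inline list.
SYNONYMS_FILE_CONTENT = "laptop, notebook\n"


def settings_with_synonyms_file(path):
    """Same analysis settings as in this post, except the filter reads its rules from a file."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "filter": {
                        "synonym_filter": {"type": "synonym", "synonyms_path": path}
                    },
                    "analyzer": {
                        "synonym_analyzer": {
                            "tokenizer": "standard",
                            "filter": ["lowercase", "synonym_filter"],
                        }
                    },
                }
            }
        }
    }


print(json.dumps(settings_with_synonyms_file(SYNONYMS_FILE), indent=2))
```

The mappings part of the create request stays the same as before, since the analyzer keeps its name.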
The main advice here is to play around and try to understand how it works. A synonym configuration can be beneficial for an e-commerce system, for instance.
That’s it for today, hope you enjoyed our simple filter and had fun playing with me (:
See you!