Elasticsearch — Setting up a synonyms search

Hey folks! Throughout this post, we will see how to configure a basic synonyms search using Elasticsearch (:

Regardless of your level of expertise with Elasticsearch, you should be able to follow this basic implementation.

To keep this post short, we will go through only the most basic implementation. I may write more about this topic in the future. Let’s start! (:

Our Problem

After some investigation, we noticed that in Brazil notebook is a synonym for laptop, and all our competitors have already updated their systems to handle that. It’s our turn to fix it, increase our sales and make our customers happy again :)

With this in mind, let’s start setting up the Elasticsearch environment.

Setting up the environment

To start the Elasticsearch cluster, run:

$ docker run -p 9200:9200 docker.elastic.co/elasticsearch/elasticsearch:6.0.0

We should see something like this on the console:

...
[2017-11-24T21:19:48,835][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [3FHsLdj] publish_address {172.17.0.2:9200}, bound_addresses {0.0.0.0:9200}
[2017-11-24T21:19:48,836][INFO ][o.e.n.Node ] [3FHsLdj] started
[2017-11-24T21:19:48,923][INFO ][o.e.g.GatewayService ] [3FHsLdj] recovered [0] indices into cluster_state

Note: if you get the max virtual memory areas error, you can run sudo sysctl -w vm.max_map_count=262144. I also recommend searching for this error to understand the workaround.

Our Elasticsearch cluster is running. To make sure it’s working properly, we can open http://localhost:9200/ in a browser, and we should see something like this:

{
name: "3FHsLdj",
cluster_name: "docker-cluster",
cluster_uuid: "d3TUN9siQiWAnziLqK3K7w",
version: {
number: "6.0.0",
build_hash: "8f0685b",
build_date: "2017-11-10T18:41:22.859Z",
build_snapshot: false,
lucene_version: "7.0.1",
minimum_wire_compatibility_version: "5.6.0",
minimum_index_compatibility_version: "5.0.0"
},
tagline: "You Know, for Search"
}

Or we call the REST API of our cluster:

$ curl -XGET 'localhost:9200/_cat/health?v&pretty'
epoch      timestamp cluster        status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1511623576 15:26:16  docker-cluster yellow          1         1      2   2    0    0        2             0                  -                 50.0%
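If you want to check the health from a script instead of reading the table by eye, the output can be parsed into a dictionary. This is only a minimal sketch: parse_cat_health is a hypothetical helper name, and the sample text is simply the output shown above pasted in as a string. On a single-node cluster a yellow status is expected, since replica shards have nowhere to be allocated.

```python
def parse_cat_health(text):
    """Parse the header + single row printed by `_cat/health?v` into a dict."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header, row = lines[0].split(), lines[1].split()
    if len(header) != len(row):
        raise ValueError("header and row have different column counts")
    return dict(zip(header, row))


# Sample copied from the console output above (not fetched from a live cluster).
sample = """\
epoch      timestamp cluster        status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1511623576 15:26:16 docker-cluster yellow 1 1 2 2 0 0 2 0 - 50.0%
"""

health = parse_cat_health(sample)
print(health["cluster"], health["status"])
```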

Yeey!! Our cluster is working as expected. Let’s create our index and set it up =)

Creating our index

$ curl -XPUT 'localhost:9200/laptops?pretty'
{
"acknowledged":true,
"shards_acknowledged":true,
"index":"laptops"
}

Our index was successfully created. It’s time to insert some data!

If you need a refresher on basic concepts such as an index, the official Elasticsearch documentation is a good place to start.

Inserting data in our index

$ curl -XPUT 'localhost:9200/laptops/doc/1?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop X1 i7 8gb RAM "
}
'

After inserting this doc we will get a result similar to this:

{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}

Great! Let’s insert more data…

$ curl -XPUT 'localhost:9200/laptops/doc/2?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop X2 i5 4gb RAM "
}
'
$ curl -XPUT 'localhost:9200/laptops/doc/3?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop X3 i3 2gb RAM "
}
'
$ curl -XPUT 'localhost:9200/laptops/doc/4?pretty' -H 'Content-Type: application/json' -d'
{
"title": "Laptop Z1 i7 6gb RAM "
}
'

Cool! We have 4 laptops in our Elasticsearch index. Let’s search for them :D
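Sending one request per document works fine for four laptops, but for larger catalogs the bulk API is usually a better fit. As a rough sketch, the snippet below builds the newline-delimited body that the _bulk endpoint expects (one action line followed by one source line per document). The helper name build_bulk_body is made up for this example, and the _type field matches the 6.x mapping type used in this post.

```python
import json


def build_bulk_body(index, doc_type, docs):
    """Return an NDJSON string: an action line, then a source line, per document."""
    lines = []
    for doc_id, source in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


laptops = {
    "1": {"title": "Laptop X1 i7 8gb RAM"},
    "2": {"title": "Laptop X2 i5 4gb RAM"},
    "3": {"title": "Laptop X3 i3 2gb RAM"},
    "4": {"title": "Laptop Z1 i7 6gb RAM"},
}

body = build_bulk_body("laptops", "doc", laptops)
print(body)
```

Assuming the output is saved to a file, it could be posted to localhost:9200/_bulk with the Content-Type header set to application/x-ndjson and curl's --data-binary option, which preserves the newlines.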

The notebook treasure map

Ok, first let’s test the basic search request:

$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match": { "title": "notebook i7 8gb" } }
}
'

Using a match with notebook i7 8gb we will get two results:

{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.6931472,
"hits" : [
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "4",
"_score" : 0.6931472,
"_source" : {
"title" : "Laptop Z1 i7 6gb RAM "
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"title" : "Laptop X1 i7 8gb RAM "
}
}
]
}
}

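We got these results because the match query combines its terms with OR by default: any document containing i7 or 8gb matches, and the notebook term simply matches nothing. As a user, we expect to get just the laptops that satisfy all three requirements, so let’s change our query to return a document only when notebook, i7 and 8gb all match.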

$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{ "match": { "title": "notebook" } },
{ "match": { "title": "i7" } },
{ "match": { "title": "8gb" } }
]
}
}
}
'

At this point we run into the synonym problem. No results were returned for this search:

{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}

Why is this happening? We are requiring a match on all 3 terms (notebook, i7 and 8gb), and no document in our index contains the term notebook. So we need to configure our index to treat notebook as a synonym for laptop.
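To make the difference between the two queries concrete, here is a toy model in plain Python. It is not how Elasticsearch works internally (there is no scoring and no inverted index), and the function names match_any and match_all are invented for this sketch. It only mimics the "at least one term" behaviour of the first query and the "every term" behaviour of the bool/must query.

```python
# Toy copy of the four documents indexed in this post.
DOCS = {
    1: "Laptop X1 i7 8gb RAM",
    2: "Laptop X2 i5 4gb RAM",
    3: "Laptop X3 i3 2gb RAM",
    4: "Laptop Z1 i7 6gb RAM",
}


def tokens(text):
    """Very rough analysis: lowercase and split on whitespace."""
    return set(text.lower().split())


def match_any(query):
    """IDs of documents sharing at least one term with the query (OR)."""
    wanted = tokens(query)
    return sorted(i for i, title in DOCS.items() if wanted & tokens(title))


def match_all(query):
    """IDs of documents containing every query term (AND)."""
    wanted = tokens(query)
    return sorted(i for i, title in DOCS.items() if wanted <= tokens(title))


print(match_any("notebook i7 8gb"))  # [1, 4]
print(match_all("notebook i7 8gb"))  # []
```

The second call returns nothing for the same reason the real query does: no title contains the token notebook.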

Configuring our index

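The simplest way to change how the title field is analyzed is to start over, so first we delete the existing index:

$ curl -XDELETE 'localhost:9200/laptops/?pretty'
{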
"acknowledged" : true
}

Now we can recreate our index with the right analyzer and filter:

$ curl -XPUT 'localhost:9200/laptops/?pretty' -H 'Content-Type: application/json' -d'
{
"settings": {
"index" : {
"analysis" : {
"filter" : {
"synonym_filter" : {
"type" : "synonym",
"synonyms" : [
"laptop, notebook"
]
}
},
"analyzer" : {
"synonym_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "synonym_filter"]
}
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "synonym_analyzer"
}
}
}
}
}'

Note: we can add as many synonyms as we want. For simplicity, we are using just notebook and laptop.
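If the synonym list grows, writing the JSON by hand gets tedious. One possible approach, sketched below, is to generate the request body from a list of synonym groups. The function name build_index_body is hypothetical, and the tv/television group is only an illustrative extra that is not part of this post's data.

```python
import json


def build_index_body(synonym_groups, field="title", doc_type="doc"):
    """Build the index-creation body used in this post from groups of equivalent terms."""
    # Each group becomes one comma-separated rule, e.g. "laptop, notebook".
    synonyms = [", ".join(group) for group in synonym_groups]
    return {
        "settings": {
            "index": {
                "analysis": {
                    "filter": {
                        "synonym_filter": {"type": "synonym", "synonyms": synonyms}
                    },
                    "analyzer": {
                        "synonym_analyzer": {
                            "tokenizer": "standard",
                            "filter": ["lowercase", "synonym_filter"],
                        }
                    },
                }
            }
        },
        "mappings": {
            doc_type: {
                "properties": {
                    field: {"type": "text", "analyzer": "synonym_analyzer"}
                }
            }
        },
    }


# The second group is a made-up example of an additional synonym rule.
index_body = build_index_body([["laptop", "notebook"], ["tv", "television"]])
print(json.dumps(index_body, indent=2))
```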

Ok, let’s understand what we configured:

[...]
"filter" : {
"synonym_filter" : {
"type" : "synonym",
"synonyms" : [
"laptop, notebook"
]
}
}

[...]

First, we created a filter called synonym_filter of type synonym, with laptop and notebook as equivalent terms in its synonyms list. It will be used by our analyzer:

[...]
"analyzer" : {
"synonym_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "synonym_filter"]
}
}
[...]

We created an analyzer called synonym_analyzer. It uses the standard tokenizer and two filters: the lowercase filter converts all tokens to lowercase, and synonym_filter introduces the synonyms into the token stream.
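To get a feel for what this pipeline does to a title, here is a small imitation of the three stages in Python. It is only an approximation: the real standard tokenizer follows Unicode text segmentation rules rather than a simple regular expression, and the function names are invented for this sketch. Each output pair is (position, token), so synonyms show up as extra tokens at the same position.

```python
import re

# Groups of equivalent terms, mirroring the "laptop, notebook" rule.
SYNONYM_GROUPS = [{"laptop", "notebook"}]


def tokenize(text):
    """Rough stand-in for the standard tokenizer: split into word characters."""
    return re.findall(r"\w+", text)


def lowercase(toks):
    """Stand-in for the lowercase filter."""
    return [t.lower() for t in toks]


def expand_synonyms(toks):
    """Stand-in for the synonym filter: emit every equivalent term at the same position."""
    out = []
    for position, tok in enumerate(toks):
        group = next((g for g in SYNONYM_GROUPS if tok in g), {tok})
        out.extend((position, term) for term in sorted(group))
    return out


def analyze(text):
    return expand_synonyms(lowercase(tokenize(text)))


print(analyze("Laptop X1"))  # [(0, 'laptop'), (0, 'notebook'), (1, 'x1')]
```

The filter order matters in this sketch for the same reason it does in the real analyzer: the synonym rule is written in lowercase, so Laptop has to be lowercased before the synonym step can recognize it. To see the tokens the real analyzer produces, the _analyze API can be called on the index with the analyzer name and some sample text.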

Ok, we have our analyzer defined. Now we need to map it to our field to make it possible to search by synonyms:

"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "synonym_analyzer"
}
}
}
}

This configuration makes our title field use synonym_analyzer, both when documents are indexed and when search text is analyzed.

Cool! Now we have our basic settings in place. Let’s re-index our data and test our simple implementation :D

Testing

$ curl -XPUT 'localhost:9200/laptops/doc/1?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop X1 i7 8gb RAM"}'
$ curl -XPUT 'localhost:9200/laptops/doc/2?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop X2 i5 4gb RAM"}'
$ curl -XPUT 'localhost:9200/laptops/doc/3?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop X3 i3 2gb RAM"}'
$ curl -XPUT 'localhost:9200/laptops/doc/4?pretty' -H 'Content-Type: application/json' -d'{"title": "Laptop Z1 i7 6gb RAM"}'

Ok, once we have our data we can start searching for it. First, we will search for just notebook:

$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match": { "title": "notebook" } }
}
'

And we get… ALL RESULTS!!!!

{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 0.41501677,
"hits" : [
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_score" : 0.41501677,
"_source" : {
"title" : "Laptop X1 i7 8gb RAM"
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "3",
"_score" : 0.41501677,
"_source" : {
"title" : "Laptop X3 i3 2gb RAM"
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "2",
"_score" : 0.26302126,
"_source" : {
"title" : "Laptop X2 i5 4gb RAM"
}
},
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "4",
"_score" : 0.26302126,
"_source" : {
"title" : "Laptop Z1 i7 6gb RAM"
}
}

]
}
}

YEEEY!!! Let’s run the more specific search again, matching notebook, i7, and 8gb:

$ curl -XGET 'localhost:9200/laptops/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{ "match": { "title": "notebook" } },
{ "match": { "title": "i7" } },
{ "match": { "title": "8gb" } }
]
}
}
}
'

BA DUM TSSS!!!!!!!! We got just one result:

{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0324807,
"hits" : [
{
"_index" : "laptops",
"_type" : "doc",
"_id" : "1",
"_score" : 1.0324807,
"_source" : {
"title" : "Laptop X1 i7 8gb RAM"
}
}

]
}
}

WOOWW!!! Our synonym search is working!!!! A.W.E.S.O.M.E!!!!
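In a real application the search text comes from the user, so the bool/must body has to be built at runtime. Below is one minimal way to do that, splitting the text on whitespace and emitting one match clause per term. The name build_must_query is hypothetical, and the whitespace split is an assumption made for this sketch.

```python
import json


def build_must_query(field, text):
    """Build a bool/must query with one match clause per whitespace-separated term."""
    clauses = [{"match": {field: term}} for term in text.split()]
    return {"query": {"bool": {"must": clauses}}}


query = build_must_query("title", "notebook i7 8gb")
print(json.dumps(query, indent=2))
```

A shorter alternative worth knowing about is a single match query with its operator option set to and, which also requires every term to match.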

Wrapping up

The main advice here is to play around and try to understand how it works. A synonym configuration can be beneficial for an e-commerce system, for instance.

That’s it for today, hope you enjoyed our simple filter and had fun playing with me (:

See you!

Useful links

I will show you the world through my eyes, everything is a point of view. https://www.youtube.com/c/LucasMagnum