Searching period and hyphen-delimited fields in Elasticsearch

I'm trying to find a way to use Elasticsearch to query a field that is both period and hyphen-delimited.

I have a (MySQL) data-set like this (using SQLAlchemy to access it):

id    text        tag
====================================
1     some-text   A.B.c3
2     more. text  A.B-C.c4
3     even more.  B.A-32.D-24.f9

The core reason I use ES for search in the first place is that I want to query against the text field. That part works awesome!

But (I think) I want the tag to appear in the inverted index like this (I probably won't take case into account; I'm just including it for illustration):

A.B.c3            1
A.B-C.c4          2
B.A-32.D-24.f9    3

Then, I want to search the tag field like this:

{
  "query": {
    "prefix" : { "tag" : "A.B" }
  }
}

And have the query return id/rows/documents 1 and 2.

Basically, I want the query to match the documents in this truth table:

"A." = 1, 2
"A-" = 3

How do I match "A." at the beginning, differentiate between a period and a hyphen (possibly boosting on that), and also match mid-phrase based on those same delimiters?

I'd also like to weight these matches higher if they occur at the beginning of the tag field if possible.

How do I do this, or is Elasticsearch not the right tool for the job? It seems like Elasticsearch works great for my text-field comparisons on normally delimited English text, but the tag-based searches seem much harder.

UPDATE: It seems that when I index only a subset of the data, my searches return the results I would expect, but when querying against the full data set I get fewer hits.

edited Nov 21 '18 at 21:33

asked Nov 20 '18 at 3:24

Jonathan Rys

python elasticsearch search


2 Answers

This can be done with an N-gram tokenizer.

Based on what you've provided in the question, I've created a corresponding mapping, sample documents and a sample query that should give you what you are looking for.

Mapping

PUT idtesttag
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "id": {
          "type": "long"
        },
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        },
        "tag": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

What this does: if a document with id = 1 has the tag A.B, the following character groups (tokens) are stored in the inverted index:

A.  -> 1
.B  -> 1
A.B -> 1

So if your query contains any of these three tokens, the document with id = 1 is returned.
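To make the token list above concrete, here is a minimal Python sketch that imitates what an n-gram tokenizer with min_gram 2 and max_gram 5 emits. It is only an emulation for illustration (the `ngrams` helper is a made-up name, not part of any Elasticsearch client), and it assumes the tokenizer keeps every character, periods and hyphens included.

```python
def ngrams(text, min_gram=2, max_gram=5):
    """Roughly emulate an n-gram tokenizer that keeps all characters."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            # Only emit grams that fit entirely inside the text.
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens


print(ngrams("A.B"))     # ['A.', 'A.B', '.B']
print(ngrams("A.B.c3"))  # every 2- to 5-character window of the tag
```

Because the delimiters are part of the tokens, "A." and "A-" end up as different terms, which is what lets the query tell a period from a hyphen.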

Sample Documents

POST idtesttag/mydocs/1
{
  "id": 1,
  "text": "some-text",
  "tag": "A.B.c3"
}

POST idtesttag/mydocs/2
{
  "id": 2,
  "text": "more. text",
  "tag": "A.B-C.c4"
}

POST idtesttag/mydocs/3
{
  "id": 3,
  "text": "even more.",
  "tag": "B.A-32.D-24.f9"
}

POST idtesttag/mydocs/4
{
  "id": 3,
  "text": "even more.",
  "tag": "B.A.B-32.D-24.f9"
}

Sample Query

POST idtesttag/_search
{
  "query": {
    "match": {
      "tag": "A.B"
    }
  }
}

Query Response

{
  "took": 139,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.8630463,
    "hits": [
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "1",
        "_score": 0.8630463,
        "_source": {
          "id": 1,
          "text": "some-text",
          "tag": "A.B.c3"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "2",
        "_score": 0.66078395,
        "_source": {
          "id": 2,
          "text": "more. text",
          "tag": "A.B-C.c4"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "4",
        "_score": 0.46659434,
        "_source": {
          "id": 3,
          "text": "even more.",
          "tag": "B.A.B-32.D-24.f9"
        }
      }
    ]
  }
}

Note that documents 1, 2 and 4 are returned in the response. Document 4 is the mid-string match, while documents 1 and 2 match at the beginning.

Also note the scores: documents 1 and 2, which match at the beginning of the tag, score higher than document 4.
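If it helps to see why exactly these documents come back, the sketch below imitates the match in plain Python: the query string is split into the same 2 to 5 character n-grams as the tags, and a document is a hit when it shares at least one of them. This is an approximation under the assumption that the match query ORs its tokens together; the helper names are hypothetical and no scoring is modelled. Documents are keyed by the `_id` used in the POST requests above.

```python
# Tags keyed by the _id used when indexing the sample documents.
TAGS = {
    1: "A.B.c3",
    2: "A.B-C.c4",
    3: "B.A-32.D-24.f9",
    4: "B.A.B-32.D-24.f9",
}


def ngrams(text, min_gram=2, max_gram=5):
    """Set of all 2- to 5-character windows, delimiters included."""
    return {
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    }


def matching_ids(query, tags=TAGS):
    """Ids of documents sharing at least one n-gram with the query."""
    query_tokens = ngrams(query)
    return [doc_id for doc_id, tag in sorted(tags.items())
            if query_tokens & ngrams(tag)]


print(matching_ids("A.B"))  # [1, 2, 4]
print(matching_ids("A-"))   # [3]
```

Document 3 is not a hit for "A.B" because its tag contains "A-" but none of "A.", ".B" or "A.B". The position of the match plays no role in whether a document is returned.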

Boosting based on hyphen

As for boosting based on the hyphen character, I'd suggest a bool query that combines the match query with a boosted regexp query. Below is the sample query I came up with.

Note that, just for the sake of simplicity, the regex only boosts when a hyphen comes right after A.B.

POST idtesttag/_search
{
  "query": {
    "bool": {
      "must" : {
        "match" : { "tag" : "A.B" }
      },
      "should": [
        {
          "regexp": {
            "tag": {
              "value": "A\\.B-.*",
              "boost": 3
            }
          }
        }
      ]
    }
  }
}

Boosting Query Response

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 3.660784,
    "hits": [
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "2",
        "_score": 3.660784,
        "_source": {
          "id": 2,
          "text": "more. text",
          "tag": "A.B-C.c4"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "4",
        "_score": 3.4665942,
        "_source": {
          "id": 3,
          "text": "even more.",
          "tag": "B.A.B-32.D-24.f9"
        }
      },
      {
        "_index": "idtesttag",
        "_type": "mydocs",
        "_id": "1",
        "_score": 0.8630463,
        "_source": {
          "id": 1,
          "text": "some-text",
          "tag": "A.B.c3"
        }
      }
    ]
  }
}
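For anyone wondering why documents 2 and 4 get the boost and document 1 does not, here is a rough Python emulation. It rests on two assumptions worth stating: a regexp query is tested against each indexed term as a whole (here, each n-gram of the tag), and Python's `re.fullmatch` stands in for the Lucene regular expression engine, whose syntax is similar but not identical. The helper names are hypothetical.

```python
import re

# Tags keyed by the _id used when indexing the sample documents.
TAGS = {
    1: "A.B.c3",
    2: "A.B-C.c4",
    3: "B.A-32.D-24.f9",
    4: "B.A.B-32.D-24.f9",
}


def ngrams(text, min_gram=2, max_gram=5):
    """Set of all 2- to 5-character windows, delimiters included."""
    return {
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    }


def boosted_ids(pattern, tags=TAGS):
    """Ids whose tag has at least one n-gram fully matching the pattern."""
    regex = re.compile(pattern)
    return [doc_id for doc_id, tag in sorted(tags.items())
            if any(regex.fullmatch(token) for token in ngrams(tag))]


print(boosted_ids(r"A\.B-.*"))                # [2, 4]
# An unescaped dot matches any character, so a made-up tag like
# "AxB-1" would be boosted too:
print(boosted_ids(r"A.B-.*", {9: "AxB-1"}))   # [9]
print(boosted_ids(r"A\.B-.*", {9: "AxB-1"}))  # []
```

The last two lines show why it is safer to escape the period in the pattern: without the backslash the dot is a wildcard rather than a literal delimiter.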

Just make sure your testing is thorough when it comes to boosting, because it's all about influencing the score, and make sure you test with production data ingested into a DEV/TEST Elasticsearch index.

That way you won't be spooked by totally different results when you move to PROD.

Sorry for the long answer, but I hope this helps!

answered Nov 22 '18 at 8:08

Kamal


But, (I think) I want the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):

Then, I want to search the tag field like this:

Based on what you've described in your post regarding the 'tag' field, here's my 2 cents.

Your MySQL data should go into a single type (in 6.5 it's 'doc' by default). You do need to define your index mapping explicitly though, especially for the 'tag' field, since you have specific search requirements for it.

I would define your 'tag' field as a multi-field of:

type 'keyword' for aggregations

type 'text' for searches, with a custom analyzer (that might use a 'whitespace' tokenizer and an 'edge ngram' token filter)

(if you don't need aggregations, then just define a 'text' type field with the custom analyzer)
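As a rough illustration of what a 'whitespace' tokenizer plus an 'edge ngram' filter would put in the index, here is a small Python emulation. Several things are assumptions on my part rather than anything stated above: min_gram 1 and max_gram 20, the query being analyzed with the whitespace tokenizer only (no n-grams at search time), and all helper names. The sample tags are the ones from the question.

```python
# Sample tags from the question, keyed by row id.
TAGS = {1: "A.B.c3", 2: "A.B-C.c4", 3: "B.A-32.D-24.f9"}


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of a token, as an edge n-gram filter would emit them."""
    longest = min(max_gram, len(token))
    return {token[:n] for n in range(min_gram, longest + 1)}


def index_tokens(text):
    """Whitespace tokenizer followed by edge n-grams on each token."""
    tokens = set()
    for word in text.split():
        tokens |= edge_ngrams(word)
    return tokens


def prefix_hits(query, tags=TAGS):
    """Ids whose indexed prefixes contain one of the query terms."""
    terms = query.split()  # search side: whitespace only, no n-grams
    return [doc_id for doc_id, tag in sorted(tags.items())
            if any(term in index_tokens(tag) for term in terms)]


print(prefix_hits("A.B"))  # [1, 2]
print(prefix_hits("A-"))   # []
```

Under these assumptions the analyzer covers the "starts with" case from the question, but not a mid-tag match such as "A-" for row 3, because only prefixes of the whole tag are indexed.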

FYI, the Analyze API will show you what ES is doing with your 'tag' data, and will help you define the mapping that meets your requirements.
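A minimal sketch of what such an Analyze API call could look like, built in Python so it can be sent with whatever HTTP client you already use. The index name `idtesttag` and analyzer name `my_analyzer` are borrowed from the other answer purely as placeholders, and `analyze_request` is a hypothetical helper; substitute your own names.

```python
import json


def analyze_request(index, analyzer, text):
    """Build the path and JSON body for an _analyze call on one index."""
    path = f"/{index}/_analyze"
    body = {"analyzer": analyzer, "text": text}
    return path, body


# Placeholder index and analyzer names; replace with your own.
path, body = analyze_request("idtesttag", "my_analyzer", "A.B-C.c4")
print("POST", path)
print(json.dumps(body, indent=2))
```

The response lists every token the analyzer produces for the given text, so you can check whether the periods and hyphens survive before indexing real data.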

answered Nov 22 '18 at 5:55

kevvo83


This can be done via N-Gram tokenizer.

Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.

Mapping

PUT idtesttag

{

  "settings": {

    "analysis": {

      "analyzer": {

        "my_analyzer": {

          "tokenizer": "my_tokenizer"

        }

      },

      "tokenizer": {

        "my_tokenizer": {

          "type": "ngram",

          "min_gram": 2,

          "max_gram": 5

        }

      }

    }

  },

  "mappings": {

    "mydocs": {

      "properties": {

        "id": {

          "type": "long"

        },

        "text": {

          "type": "text",

          "analyzer": "my_analyzer"

        },

        "tag": {

          "type": "text",

          "analyzer": "my_analyzer"

        }

      }

    }

  }

}

What this would do is, if you have a document with id = 1 has a tag A.B it would store following group of characters in its inverted index.

 A. -> 1

 .B -> 1

A.B -> 1

So if your query has any of these three words, your document with id=1 would be returned.

Sample Documents

POST idtesttag/mydocs/1

{

  "id": 1,

  "text": "some-text",

  "tag": "A.B.c3"

}



POST idtesttag/mydocs/2

{

  "id": 2,

  "text": "more. text",

  "tag": "A.B-C.c4"

}



POST idtesttag/mydocs/3

{

  "id": 3,

  "text": "even more.",

  "tag": "B.A-32.D-24.f9"

}



POST idtesttag/mydocs/4

{

  "id": 3,

  "text": "even more.",

  "tag": "B.A.B-32.D-24.f9"

}

Sample Query

POST idtesttag/_search

{

  "query": {

    "match": {

      "tag": "A.B"

    }

  }

}

Query Response

{

  "took": 139,

  "timed_out": false,

  "_shards": {

    "total": 5,

    "successful": 5,

    "skipped": 0,

    "failed": 0

  },

  "hits": {

    "total": 3,

    "max_score": 0.8630463,

    "hits": [

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "1",

        "_score": 0.8630463,

        "_source": {

          "id": 1,

          "text": "some-text",

          "tag": "A.B.c3"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "2",

        "_score": 0.66078395,

        "_source": {

          "id": 2,

          "text": "more. text",

          "tag": "A.B-C.c4"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "4",

        "_score": 0.46659434,

        "_source": {

          "id": 3,

          "text": "even more.",

          "tag": "B.A.B-32.D-24.f9"

        }

      }

    ]

  }

}

Note that the documents 1, 2 and 4 are returned in the response. The document 4 is the mid sentence match while documents 1 & 2 are at the beginning.

Also note the score value as how it appears.

Boosting based on hypen

Now with regards to boosting based on hypen character, I'd suggest you to have Bool query along with Regex Query with Boosting. Below is the sample query I came up with.

Note that just for sake of simplicity I've added regex where it would only boost if hypen is next to A.B.

POST idtesttag/_search

{

  "query": {

    "bool": {

      "must" : {

        "match" : { "tag" : "A.B" }

      },

      "should": [

        {

          "regexp": {

            "tag": {

              "value": "A.B-.*",

              "boost": 3

            }

          }

        }

      ]

    }

  }

}

Boosting Query Response

{

  "took": 2,

  "timed_out": false,

  "_shards": {

    "total": 5,

    "successful": 5,

    "skipped": 0,

    "failed": 0

  },

  "hits": {

    "total": 3,

    "max_score": 3.660784,

    "hits": [

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "2",

        "_score": 3.660784,

        "_source": {

          "id": 2,

          "text": "more. text",

          "tag": "A.B-C.c4"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "4",

        "_score": 3.4665942,

        "_source": {

          "id": 3,

          "text": "even more.",

          "tag": "B.A.B-32.D-24.f9"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "1",

        "_score": 0.8630463,

        "_source": {

          "id": 1,

          "text": "some-text",

          "tag": "A.B.c3"

        }

      }

    ]

  }

}

Just ensure that your testing is thorough when it comes to boosting because its all about influencing the score & make sure you do that with prod data ingested in DEV/TEST Elastic index.

That way you'd not be spooked when you see totally different results if you move to PROD Elastic.

I'm sorry its pretty long answer but I hope this helps!

answered Nov 22 '18 at 8:08

Kamal

1,7681920

add a comment |

This can be done via N-Gram tokenizer.

Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.

Mapping

PUT idtesttag

{

  "settings": {

    "analysis": {

      "analyzer": {

        "my_analyzer": {

          "tokenizer": "my_tokenizer"

        }

      },

      "tokenizer": {

        "my_tokenizer": {

          "type": "ngram",

          "min_gram": 2,

          "max_gram": 5

        }

      }

    }

  },

  "mappings": {

    "mydocs": {

      "properties": {

        "id": {

          "type": "long"

        },

        "text": {

          "type": "text",

          "analyzer": "my_analyzer"

        },

        "tag": {

          "type": "text",

          "analyzer": "my_analyzer"

        }

      }

    }

  }

}

What this would do is, if you have a document with id = 1 has a tag A.B it would store following group of characters in its inverted index.

 A. -> 1

 .B -> 1

A.B -> 1

So if your query has any of these three words, your document with id=1 would be returned.

Sample Documents

POST idtesttag/mydocs/1

{

  "id": 1,

  "text": "some-text",

  "tag": "A.B.c3"

}



POST idtesttag/mydocs/2

{

  "id": 2,

  "text": "more. text",

  "tag": "A.B-C.c4"

}



POST idtesttag/mydocs/3

{

  "id": 3,

  "text": "even more.",

  "tag": "B.A-32.D-24.f9"

}



POST idtesttag/mydocs/4

{

  "id": 3,

  "text": "even more.",

  "tag": "B.A.B-32.D-24.f9"

}

Sample Query

POST idtesttag/_search

{

  "query": {

    "match": {

      "tag": "A.B"

    }

  }

}

Query Response

{

  "took": 139,

  "timed_out": false,

  "_shards": {

    "total": 5,

    "successful": 5,

    "skipped": 0,

    "failed": 0

  },

  "hits": {

    "total": 3,

    "max_score": 0.8630463,

    "hits": [

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "1",

        "_score": 0.8630463,

        "_source": {

          "id": 1,

          "text": "some-text",

          "tag": "A.B.c3"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "2",

        "_score": 0.66078395,

        "_source": {

          "id": 2,

          "text": "more. text",

          "tag": "A.B-C.c4"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "4",

        "_score": 0.46659434,

        "_source": {

          "id": 3,

          "text": "even more.",

          "tag": "B.A.B-32.D-24.f9"

        }

      }

    ]

  }

}

Note that the documents 1, 2 and 4 are returned in the response. The document 4 is the mid sentence match while documents 1 & 2 are at the beginning.

Also note the score value as how it appears.

Boosting based on hypen

Now with regards to boosting based on hypen character, I'd suggest you to have Bool query along with Regex Query with Boosting. Below is the sample query I came up with.

Note that just for sake of simplicity I've added regex where it would only boost if hypen is next to A.B.

POST idtesttag/_search

{

  "query": {

    "bool": {

      "must" : {

        "match" : { "tag" : "A.B" }

      },

      "should": [

        {

          "regexp": {

            "tag": {

              "value": "A.B-.*",

              "boost": 3

            }

          }

        }

      ]

    }

  }

}

Boosting Query Response

{

  "took": 2,

  "timed_out": false,

  "_shards": {

    "total": 5,

    "successful": 5,

    "skipped": 0,

    "failed": 0

  },

  "hits": {

    "total": 3,

    "max_score": 3.660784,

    "hits": [

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "2",

        "_score": 3.660784,

        "_source": {

          "id": 2,

          "text": "more. text",

          "tag": "A.B-C.c4"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "4",

        "_score": 3.4665942,

        "_source": {

          "id": 3,

          "text": "even more.",

          "tag": "B.A.B-32.D-24.f9"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "1",

        "_score": 0.8630463,

        "_source": {

          "id": 1,

          "text": "some-text",

          "tag": "A.B.c3"

        }

      }

    ]

  }

}

Just ensure that your testing is thorough when it comes to boosting because its all about influencing the score & make sure you do that with prod data ingested in DEV/TEST Elastic index.

That way you'd not be spooked when you see totally different results if you move to PROD Elastic.

I'm sorry its pretty long answer but I hope this helps!

answered Nov 22 '18 at 8:08

Kamal

1,7681920

add a comment |

This can be done via N-Gram tokenizer.

Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.

Mapping

PUT idtesttag

{

  "settings": {

    "analysis": {

      "analyzer": {

        "my_analyzer": {

          "tokenizer": "my_tokenizer"

        }

      },

      "tokenizer": {

        "my_tokenizer": {

          "type": "ngram",

          "min_gram": 2,

          "max_gram": 5

        }

      }

    }

  },

  "mappings": {

    "mydocs": {

      "properties": {

        "id": {

          "type": "long"

        },

        "text": {

          "type": "text",

          "analyzer": "my_analyzer"

        },

        "tag": {

          "type": "text",

          "analyzer": "my_analyzer"

        }

      }

    }

  }

}

What this would do is, if you have a document with id = 1 has a tag A.B it would store following group of characters in its inverted index.

 A. -> 1

 .B -> 1

A.B -> 1

So if your query has any of these three words, your document with id=1 would be returned.

Sample Documents

POST idtesttag/mydocs/1

{

  "id": 1,

  "text": "some-text",

  "tag": "A.B.c3"

}



POST idtesttag/mydocs/2

{

  "id": 2,

  "text": "more. text",

  "tag": "A.B-C.c4"

}



POST idtesttag/mydocs/3

{

  "id": 3,

  "text": "even more.",

  "tag": "B.A-32.D-24.f9"

}



POST idtesttag/mydocs/4

{

  "id": 3,

  "text": "even more.",

  "tag": "B.A.B-32.D-24.f9"

}

Sample Query

POST idtesttag/_search

{

  "query": {

    "match": {

      "tag": "A.B"

    }

  }

}

Query Response

{

  "took": 139,

  "timed_out": false,

  "_shards": {

    "total": 5,

    "successful": 5,

    "skipped": 0,

    "failed": 0

  },

  "hits": {

    "total": 3,

    "max_score": 0.8630463,

    "hits": [

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "1",

        "_score": 0.8630463,

        "_source": {

          "id": 1,

          "text": "some-text",

          "tag": "A.B.c3"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "2",

        "_score": 0.66078395,

        "_source": {

          "id": 2,

          "text": "more. text",

          "tag": "A.B-C.c4"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "4",

        "_score": 0.46659434,

        "_source": {

          "id": 3,

          "text": "even more.",

          "tag": "B.A.B-32.D-24.f9"

        }

      }

    ]

  }

}

Note that the documents 1, 2 and 4 are returned in the response. The document 4 is the mid sentence match while documents 1 & 2 are at the beginning.

Also note the score value as how it appears.

Boosting based on hypen

Now with regards to boosting based on hypen character, I'd suggest you to have Bool query along with Regex Query with Boosting. Below is the sample query I came up with.

Note that just for sake of simplicity I've added regex where it would only boost if hypen is next to A.B.

POST idtesttag/_search

{

  "query": {

    "bool": {

      "must" : {

        "match" : { "tag" : "A.B" }

      },

      "should": [

        {

          "regexp": {

            "tag": {

              "value": "A.B-.*",

              "boost": 3

            }

          }

        }

      ]

    }

  }

}

Boosting Query Response

{

  "took": 2,

  "timed_out": false,

  "_shards": {

    "total": 5,

    "successful": 5,

    "skipped": 0,

    "failed": 0

  },

  "hits": {

    "total": 3,

    "max_score": 3.660784,

    "hits": [

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "2",

        "_score": 3.660784,

        "_source": {

          "id": 2,

          "text": "more. text",

          "tag": "A.B-C.c4"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "4",

        "_score": 3.4665942,

        "_source": {

          "id": 3,

          "text": "even more.",

          "tag": "B.A.B-32.D-24.f9"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "1",

        "_score": 0.8630463,

        "_source": {

          "id": 1,

          "text": "some-text",

          "tag": "A.B.c3"

        }

      }

    ]

  }

}

Just ensure that your testing is thorough when it comes to boosting because its all about influencing the score & make sure you do that with prod data ingested in DEV/TEST Elastic index.

That way you'd not be spooked when you see totally different results if you move to PROD Elastic.

I'm sorry its pretty long answer but I hope this helps!

answered Nov 22 '18 at 8:08

Kamal

1,7681920

This can be done via N-Gram tokenizer.

Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.

Mapping

PUT idtesttag

{

  "settings": {

    "analysis": {

      "analyzer": {

        "my_analyzer": {

          "tokenizer": "my_tokenizer"

        }

      },

      "tokenizer": {

        "my_tokenizer": {

          "type": "ngram",

          "min_gram": 2,

          "max_gram": 5

        }

      }

    }

  },

  "mappings": {

    "mydocs": {

      "properties": {

        "id": {

          "type": "long"

        },

        "text": {

          "type": "text",

          "analyzer": "my_analyzer"

        },

        "tag": {

          "type": "text",

          "analyzer": "my_analyzer"

        }

      }

    }

  }

}

What this would do is, if you have a document with id = 1 has a tag A.B it would store following group of characters in its inverted index.

 A. -> 1

 .B -> 1

A.B -> 1

So if your query has any of these three words, your document with id=1 would be returned.

Sample Documents

POST idtesttag/mydocs/1

{

  "id": 1,

  "text": "some-text",

  "tag": "A.B.c3"

}



POST idtesttag/mydocs/2

{

  "id": 2,

  "text": "more. text",

  "tag": "A.B-C.c4"

}



POST idtesttag/mydocs/3

{

  "id": 3,

  "text": "even more.",

  "tag": "B.A-32.D-24.f9"

}



POST idtesttag/mydocs/4

{

  "id": 3,

  "text": "even more.",

  "tag": "B.A.B-32.D-24.f9"

}

Sample Query

POST idtesttag/_search

{

  "query": {

    "match": {

      "tag": "A.B"

    }

  }

}

Query Response

{

  "took": 139,

  "timed_out": false,

  "_shards": {

    "total": 5,

    "successful": 5,

    "skipped": 0,

    "failed": 0

  },

  "hits": {

    "total": 3,

    "max_score": 0.8630463,

    "hits": [

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "1",

        "_score": 0.8630463,

        "_source": {

          "id": 1,

          "text": "some-text",

          "tag": "A.B.c3"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "2",

        "_score": 0.66078395,

        "_source": {

          "id": 2,

          "text": "more. text",

          "tag": "A.B-C.c4"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "4",

        "_score": 0.46659434,

        "_source": {

          "id": 3,

          "text": "even more.",

          "tag": "B.A.B-32.D-24.f9"

        }

      }

    ]

  }

}

Note that the documents 1, 2 and 4 are returned in the response. The document 4 is the mid sentence match while documents 1 & 2 are at the beginning.

Also note the score value as how it appears.

Boosting based on hypen

Now with regards to boosting based on hypen character, I'd suggest you to have Bool query along with Regex Query with Boosting. Below is the sample query I came up with.

Note that just for sake of simplicity I've added regex where it would only boost if hypen is next to A.B.

POST idtesttag/_search

{

  "query": {

    "bool": {

      "must" : {

        "match" : { "tag" : "A.B" }

      },

      "should": [

        {

          "regexp": {

            "tag": {

              "value": "A.B-.*",

              "boost": 3

            }

          }

        }

      ]

    }

  }

}
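If the queries are issued from application code (the question mentions Python and SQLAlchemy), the same bool query can be assembled as a plain dictionary. The following is only a sketch: the helper names escape_lucene_regex and build_boost_query are hypothetical, and escaping the period is a variation on the query above (which leaves it unescaped), so results may differ slightly depending on how the field is analyzed.

```python
import json

# Characters listed as reserved in Lucene regular expression syntax.
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(text):
    """Backslash-escape reserved characters so they match literally."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in text)


def build_boost_query(prefix, boost=3, field="tag"):
    """Build a bool query: 'must' match the prefix, 'should' boost when a
    hyphen directly follows it (hypothetical helper, not an Elasticsearch API)."""
    return {
        "query": {
            "bool": {
                "must": {"match": {field: prefix}},
                "should": [
                    {
                        "regexp": {
                            field: {
                                "value": escape_lucene_regex(prefix) + "-.*",
                                "boost": boost,
                            }
                        }
                    }
                ],
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to idtesttag/_search.
    print(json.dumps(build_boost_query("A.B"), indent=2))
```

The resulting dictionary can be serialized and sent as the request body with whichever HTTP or Elasticsearch client you already use.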

Boosting Query Response

{

  "took": 2,

  "timed_out": false,

  "_shards": {

    "total": 5,

    "successful": 5,

    "skipped": 0,

    "failed": 0

  },

  "hits": {

    "total": 3,

    "max_score": 3.660784,

    "hits": [

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "2",

        "_score": 3.660784,

        "_source": {

          "id": 2,

          "text": "more. text",

          "tag": "A.B-C.c4"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "4",

        "_score": 3.4665942,

        "_source": {

          "id": 3,

          "text": "even more.",

          "tag": "B.A.B-32.D-24.f9"

        }

      },

      {

        "_index": "idtesttag",

        "_type": "mydocs",

        "_id": "1",

        "_score": 0.8630463,

        "_source": {

          "id": 1,

          "text": "some-text",

          "tag": "A.B.c3"

        }

      }

    ]

  }

}

Just make sure your testing is thorough when it comes to boosting, because it is all about influencing the score. Do that testing with production data ingested into a DEV/TEST Elasticsearch index.

That way you won't be spooked by totally different results when you move to PROD.

I'm sorry it's a pretty long answer, but I hope this helps!

answered Nov 22 '18 at 8:08

Kamal

1,7681920


But (I think) I want the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):

Then, I want to search the tag field like this:

Based on what you've described in your post regarding the 'tag' field, here's my 2 cents.

I would define your 'tag' field as a multi-field of:

type 'keyword' for aggregations

type 'text' for searches, with a custom analyzer (that might use the 'whitespace' tokenizer and an 'edge ngram' token filter)

(if you don't need aggregations, then just define a 'text' type field with the custom analyzer)
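As a rough sketch of what such a multi-field definition could look like, here is an index-creation body built as a Python dictionary. Everything named here is an assumption for illustration: the analyzer and filter names (tag_prefix, tag_search, tag_edge_ngram), the sub-field name raw, the gram sizes, the lowercase filter, and the mapping type mydocs (taken from the earlier examples; recent Elasticsearch versions drop the type level from mappings).

```python
import json


def build_tag_index_body(doc_type="mydocs", min_gram=1, max_gram=20):
    """Return settings and mappings for a 'tag' multi-field (hypothetical names)."""
    analysis = {
        "filter": {
            # Emits prefixes of each token: "a", "a.", "a.b", ...
            "tag_edge_ngram": {
                "type": "edge_ngram",
                "min_gram": min_gram,
                "max_gram": max_gram,
            }
        },
        "analyzer": {
            # Index-time analyzer: whitespace tokens, lowercased, then prefixes.
            "tag_prefix": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["lowercase", "tag_edge_ngram"],
            },
            # Search-time analyzer: no n-grams, so the query stays one token.
            "tag_search": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["lowercase"],
            },
        },
    }
    properties = {
        "tag": {
            "type": "text",
            "analyzer": "tag_prefix",
            "search_analyzer": "tag_search",
            # Keyword sub-field for aggregations and exact matches.
            "fields": {"raw": {"type": "keyword"}},
        }
    }
    return {
        "settings": {"analysis": analysis},
        "mappings": {doc_type: {"properties": properties}},
    }


if __name__ == "__main__":
    print(json.dumps(build_tag_index_body(), indent=2))
```

Using a separate search analyzer without the edge n-gram filter is a design choice: it keeps a query such as A.B from being split into its own prefixes, which would otherwise match far more documents than intended.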

FYI, the Analyze API will show you what ES is doing with your 'tag' data, and will help you define the mapping that meets your requirements.
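Before running the Analyze API against a real index, it can help to reason about the tokens offline. The snippet below is a pure-Python approximation, not Elasticsearch itself: it assumes a whitespace tokenizer followed by lowercasing and an edge n-gram filter, and the function names and gram sizes are made up for illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of a token, from min_gram up to max_gram characters."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text, min_gram=1, max_gram=20):
    """Approximate: whitespace tokenizer -> lowercase -> edge n-gram filter."""
    tokens = []
    for word in text.split():
        tokens.extend(edge_ngrams(word.lower(), min_gram, max_gram))
    return tokens


# Tags from the question, keyed by document id.
TAGS = {1: "A.B.c3", 2: "A.B-C.c4", 3: "B.A-32.D-24.f9"}


def matching_ids(query):
    """Ids whose indexed tokens contain the lowercased query as a whole token."""
    needle = query.lower()
    return sorted(doc_id for doc_id, tag in TAGS.items() if needle in analyze(tag))


if __name__ == "__main__":
    print(analyze("A.B.c3"))
    print(matching_ids("A."))
    print(matching_ids("A-"))
```

Under these assumptions a query for A. matches documents 1 and 2, but A- matches nothing, because edge n-grams are anchored to the start of each whitespace-delimited token. Matching in the middle of a tag would need a different analysis chain, for example one that also splits on the delimiters; that is worth confirming with the Analyze API on your actual mapping.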

answered Nov 22 '18 at 5:55

kevvo83

