Searching period and hyphen-delimited fields in Elasticsearch
I'm trying to find a way to use Elasticsearch to query a field that is both period and hyphen-delimited.
I have a (MySQL) data-set like this (using SQLAlchemy to access it):
id text tag
====================================
1 some-text A.B.c3
2 more. text A.B-C.c4
3 even more. B.A-32.D-24.f9
The core reason I use ES for search in the first place is that I want to query against the text field. That part works awesome!
But, (I think) I want the the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):
A.B.c3 1
A.B-C.c4 2
B.A-C2.D-24.f9 3
Then, I want to search the tag field like this:
{ "query": {
"prefix" : { "tag" : "A.B" }
}
}
And have the query return id/rows/documents 1 and 2.
Basically, I want the query to match the index(es) in this truth table:
"A." = 1, 2
"A-" = 3
How do I accomplish both the "A." match at the beginning, differentiate between a period and a hyphen (possibly boost this), and match mid-phrase based on those same delimiters?
I'd also like to weight these matches higher if they occur at the beginning of the tag field if possible.
How do I do this, or is Elasticsearch not the right tool for the job? It seems like Elasticsearch works great for my text-field comparisons on normally delimited English text, but the tag-based searches seem much harder.
UPDATE: It seems that when I index only a subset of the data that my searches return the results I would expect but when querying against the full data-set, I get fewer hits.
python elasticsearch search
add a comment |
I'm trying to find a way to use Elasticsearch to query a field that is both period and hyphen-delimited.
I have a (MySQL) data-set like this (using SQLAlchemy to access it):
id text tag
====================================
1 some-text A.B.c3
2 more. text A.B-C.c4
3 even more. B.A-32.D-24.f9
The core reason I use ES for search in the first place is that I want to query against the text field. That part works awesome!
But, (I think) I want the the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):
A.B.c3 1
A.B-C.c4 2
B.A-C2.D-24.f9 3
Then, I want to search the tag field like this:
{ "query": {
"prefix" : { "tag" : "A.B" }
}
}
And have the query return id/rows/documents 1 and 2.
Basically, I want the query to match the index(es) in this truth table:
"A." = 1, 2
"A-" = 3
How do I accomplish both the "A." match at the beginning, differentiate between a period and a hyphen (possibly boost this), and match mid-phrase based on those same delimiters?
I'd also like to weight these matches higher if they occur at the beginning of the tag field if possible.
How do I do this, or is Elasticsearch not the right tool for the job? It seems like Elasticsearch works great for my text-field comparisons on normally delimited English text, but the tag-based searches seem much harder.
UPDATE: It seems that when I index only a subset of the data that my searches return the results I would expect but when querying against the full data-set, I get fewer hits.
python elasticsearch search
add a comment |
I'm trying to find a way to use Elasticsearch to query a field that is both period and hyphen-delimited.
I have a (MySQL) data-set like this (using SQLAlchemy to access it):
id text tag
====================================
1 some-text A.B.c3
2 more. text A.B-C.c4
3 even more. B.A-32.D-24.f9
The core reason I use ES for search in the first place is that I want to query against the text field. That part works awesome!
But, (I think) I want the the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):
A.B.c3 1
A.B-C.c4 2
B.A-C2.D-24.f9 3
Then, I want to search the tag field like this:
{ "query": {
"prefix" : { "tag" : "A.B" }
}
}
And have the query return id/rows/documents 1 and 2.
Basically, I want the query to match the index(es) in this truth table:
"A." = 1, 2
"A-" = 3
How do I accomplish both the "A." match at the beginning, differentiate between a period and a hyphen (possibly boost this), and match mid-phrase based on those same delimiters?
I'd also like to weight these matches higher if they occur at the beginning of the tag field if possible.
How do I do this, or is Elasticsearch not the right tool for the job? It seems like Elasticsearch works great for my text-field comparisons on normally delimited English text, but the tag-based searches seem much harder.
UPDATE: It seems that when I index only a subset of the data that my searches return the results I would expect but when querying against the full data-set, I get fewer hits.
python elasticsearch search
I'm trying to find a way to use Elasticsearch to query a field that is both period and hyphen-delimited.
I have a (MySQL) data-set like this (using SQLAlchemy to access it):
id text tag
====================================
1 some-text A.B.c3
2 more. text A.B-C.c4
3 even more. B.A-32.D-24.f9
The core reason I use ES for search in the first place is that I want to query against the text field. That part works awesome!
But, (I think) I want the the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):
A.B.c3 1
A.B-C.c4 2
B.A-C2.D-24.f9 3
Then, I want to search the tag field like this:
{ "query": {
"prefix" : { "tag" : "A.B" }
}
}
And have the query return id/rows/documents 1 and 2.
Basically, I want the query to match the index(es) in this truth table:
"A." = 1, 2
"A-" = 3
How do I accomplish both the "A." match at the beginning, differentiate between a period and a hyphen (possibly boost this), and match mid-phrase based on those same delimiters?
I'd also like to weight these matches higher if they occur at the beginning of the tag field if possible.
How do I do this, or is Elasticsearch not the right tool for the job? It seems like Elasticsearch works great for my text-field comparisons on normally delimited English text, but the tag-based searches seem much harder.
UPDATE: It seems that when I index only a subset of the data that my searches return the results I would expect but when querying against the full data-set, I get fewer hits.
python elasticsearch search
python elasticsearch search
edited Nov 21 '18 at 21:33
Jonathan Rys
asked Nov 20 '18 at 3:24
Jonathan RysJonathan Rys
1,156520
1,156520
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
This can be done via N-Gram tokenizer.
Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.
Mapping
PUT idtesttag
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5
}
}
}
},
"mappings": {
"mydocs": {
"properties": {
"id": {
"type": "long"
},
"text": {
"type": "text",
"analyzer": "my_analyzer"
},
"tag": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
What this would do is, if you have a document with id = 1 has a tag A.B it would store following group of characters in its inverted index.
A. -> 1
.B -> 1
A.B -> 1
So if your query has any of these three words, your document with id=1 would be returned.
Sample Documents
POST idtesttag/mydocs/1
{
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
POST idtesttag/mydocs/2
{
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
POST idtesttag/mydocs/3
{
"id": 3,
"text": "even more.",
"tag": "B.A-32.D-24.f9"
}
POST idtesttag/mydocs/4
{
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
Sample Query
POST idtesttag/_search
{
"query": {
"match": {
"tag": "A.B"
}
}
}
Query Response
{
"took": 139,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.8630463,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 0.66078395,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 0.46659434,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
}
]
}
}
Note that the documents 1, 2 and 4 are returned in the response. The document 4 is the mid sentence match while documents 1 & 2 are at the beginning.
Also note the score value as how it appears.
Boosting based on hypen
Now with regards to boosting based on hypen character, I'd suggest you to have Bool query along with Regex Query with Boosting. Below is the sample query I came up with.
Note that just for sake of simplicity I've added regex where it would only boost if hypen is next to A.B.
POST idtesttag/_search
{
"query": {
"bool": {
"must" : {
"match" : { "tag" : "A.B" }
},
"should": [
{
"regexp": {
"tag": {
"value": "A.B-.*",
"boost": 3
}
}
}
]
}
}
}
Boosting Query Response
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 3.660784,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 3.660784,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 3.4665942,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
}
]
}
}
Just ensure that your testing is thorough when it comes to boosting because its all about influencing the score & make sure you do that with prod data ingested in DEV/TEST Elastic index.
That way you'd not be spooked when you see totally different results if you move to PROD Elastic.
I'm sorry its pretty long answer but I hope this helps!
add a comment |
But, (I think) I want the the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):
Then, I want to search the tag field like this:
Based on what you've described in your post reg. the 'tag' field, here's my 2 cents.
Your Mysql data should be in 1 type (in 6.5 it's 'doc' by default). You do need to explicitly define your Index Mapping though - especially on the 'tag' field, as you seem to have search requirements.
I would define your 'tag' field as a multi-field of:
- type 'keyword' for aggregations
- type 'text' for searches, with a custom analyzer (that might use 'whitespace' tokenizer, and an 'edge ngram' token filter
(if you don't need aggregations, then just define a 'text' type field with the custom analyzer)
FYI, The Analyze API will show you what ES is doing with your 'tag' data, and will help you define the Mapping that meets your requirements.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53385768%2fsearching-period-and-hyphen-delimited-fields-in-elasticsearch%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
This can be done via N-Gram tokenizer.
Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.
Mapping
PUT idtesttag
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5
}
}
}
},
"mappings": {
"mydocs": {
"properties": {
"id": {
"type": "long"
},
"text": {
"type": "text",
"analyzer": "my_analyzer"
},
"tag": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
What this would do is, if you have a document with id = 1 has a tag A.B it would store following group of characters in its inverted index.
A. -> 1
.B -> 1
A.B -> 1
So if your query has any of these three words, your document with id=1 would be returned.
Sample Documents
POST idtesttag/mydocs/1
{
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
POST idtesttag/mydocs/2
{
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
POST idtesttag/mydocs/3
{
"id": 3,
"text": "even more.",
"tag": "B.A-32.D-24.f9"
}
POST idtesttag/mydocs/4
{
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
Sample Query
POST idtesttag/_search
{
"query": {
"match": {
"tag": "A.B"
}
}
}
Query Response
{
"took": 139,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.8630463,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 0.66078395,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 0.46659434,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
}
]
}
}
Note that the documents 1, 2 and 4 are returned in the response. The document 4 is the mid sentence match while documents 1 & 2 are at the beginning.
Also note the score value as how it appears.
Boosting based on hypen
Now with regards to boosting based on hypen character, I'd suggest you to have Bool query along with Regex Query with Boosting. Below is the sample query I came up with.
Note that just for sake of simplicity I've added regex where it would only boost if hypen is next to A.B.
POST idtesttag/_search
{
"query": {
"bool": {
"must" : {
"match" : { "tag" : "A.B" }
},
"should": [
{
"regexp": {
"tag": {
"value": "A.B-.*",
"boost": 3
}
}
}
]
}
}
}
Boosting Query Response
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 3.660784,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 3.660784,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 3.4665942,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
}
]
}
}
Just ensure that your testing is thorough when it comes to boosting because its all about influencing the score & make sure you do that with prod data ingested in DEV/TEST Elastic index.
That way you'd not be spooked when you see totally different results if you move to PROD Elastic.
I'm sorry its pretty long answer but I hope this helps!
add a comment |
This can be done via N-Gram tokenizer.
Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.
Mapping
PUT idtesttag
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5
}
}
}
},
"mappings": {
"mydocs": {
"properties": {
"id": {
"type": "long"
},
"text": {
"type": "text",
"analyzer": "my_analyzer"
},
"tag": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
What this would do is, if you have a document with id = 1 has a tag A.B it would store following group of characters in its inverted index.
A. -> 1
.B -> 1
A.B -> 1
So if your query has any of these three words, your document with id=1 would be returned.
Sample Documents
POST idtesttag/mydocs/1
{
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
POST idtesttag/mydocs/2
{
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
POST idtesttag/mydocs/3
{
"id": 3,
"text": "even more.",
"tag": "B.A-32.D-24.f9"
}
POST idtesttag/mydocs/4
{
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
Sample Query
POST idtesttag/_search
{
"query": {
"match": {
"tag": "A.B"
}
}
}
Query Response
{
"took": 139,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.8630463,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 0.66078395,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 0.46659434,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
}
]
}
}
Note that the documents 1, 2 and 4 are returned in the response. The document 4 is the mid sentence match while documents 1 & 2 are at the beginning.
Also note the score value as how it appears.
Boosting based on hypen
Now with regards to boosting based on hypen character, I'd suggest you to have Bool query along with Regex Query with Boosting. Below is the sample query I came up with.
Note that just for sake of simplicity I've added regex where it would only boost if hypen is next to A.B.
POST idtesttag/_search
{
"query": {
"bool": {
"must" : {
"match" : { "tag" : "A.B" }
},
"should": [
{
"regexp": {
"tag": {
"value": "A.B-.*",
"boost": 3
}
}
}
]
}
}
}
Boosting Query Response
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 3.660784,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 3.660784,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 3.4665942,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
}
]
}
}
Just ensure that your testing is thorough when it comes to boosting because its all about influencing the score & make sure you do that with prod data ingested in DEV/TEST Elastic index.
That way you'd not be spooked when you see totally different results if you move to PROD Elastic.
I'm sorry its pretty long answer but I hope this helps!
add a comment |
This can be done via N-Gram tokenizer.
Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.
Mapping
PUT idtesttag
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5
}
}
}
},
"mappings": {
"mydocs": {
"properties": {
"id": {
"type": "long"
},
"text": {
"type": "text",
"analyzer": "my_analyzer"
},
"tag": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
What this would do is, if you have a document with id = 1 has a tag A.B it would store following group of characters in its inverted index.
A. -> 1
.B -> 1
A.B -> 1
So if your query has any of these three words, your document with id=1 would be returned.
Sample Documents
POST idtesttag/mydocs/1
{
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
POST idtesttag/mydocs/2
{
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
POST idtesttag/mydocs/3
{
"id": 3,
"text": "even more.",
"tag": "B.A-32.D-24.f9"
}
POST idtesttag/mydocs/4
{
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
Sample Query
POST idtesttag/_search
{
"query": {
"match": {
"tag": "A.B"
}
}
}
Query Response
{
"took": 139,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.8630463,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 0.66078395,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 0.46659434,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
}
]
}
}
Note that the documents 1, 2 and 4 are returned in the response. The document 4 is the mid sentence match while documents 1 & 2 are at the beginning.
Also note the score value as how it appears.
Boosting based on hypen
Now with regards to boosting based on hypen character, I'd suggest you to have Bool query along with Regex Query with Boosting. Below is the sample query I came up with.
Note that just for sake of simplicity I've added regex where it would only boost if hypen is next to A.B.
POST idtesttag/_search
{
"query": {
"bool": {
"must" : {
"match" : { "tag" : "A.B" }
},
"should": [
{
"regexp": {
"tag": {
"value": "A.B-.*",
"boost": 3
}
}
}
]
}
}
}
Boosting Query Response
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 3.660784,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 3.660784,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 3.4665942,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
}
]
}
}
Just ensure that your testing is thorough when it comes to boosting because its all about influencing the score & make sure you do that with prod data ingested in DEV/TEST Elastic index.
That way you'd not be spooked when you see totally different results if you move to PROD Elastic.
I'm sorry its pretty long answer but I hope this helps!
This can be done via N-Gram tokenizer.
Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.
Mapping
PUT idtesttag
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5
}
}
}
},
"mappings": {
"mydocs": {
"properties": {
"id": {
"type": "long"
},
"text": {
"type": "text",
"analyzer": "my_analyzer"
},
"tag": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
What this would do is, if you have a document with id = 1 has a tag A.B it would store following group of characters in its inverted index.
A. -> 1
.B -> 1
A.B -> 1
So if your query has any of these three words, your document with id=1 would be returned.
Sample Documents
POST idtesttag/mydocs/1
{
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
POST idtesttag/mydocs/2
{
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
POST idtesttag/mydocs/3
{
"id": 3,
"text": "even more.",
"tag": "B.A-32.D-24.f9"
}
POST idtesttag/mydocs/4
{
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
Sample Query
POST idtesttag/_search
{
"query": {
"match": {
"tag": "A.B"
}
}
}
Query Response
{
"took": 139,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.8630463,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 0.66078395,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 0.46659434,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
}
]
}
}
Note that the documents 1, 2 and 4 are returned in the response. The document 4 is the mid sentence match while documents 1 & 2 are at the beginning.
Also note the score value as how it appears.
Boosting based on hypen
Now with regards to boosting based on hypen character, I'd suggest you to have Bool query along with Regex Query with Boosting. Below is the sample query I came up with.
Note that just for sake of simplicity I've added regex where it would only boost if hypen is next to A.B.
POST idtesttag/_search
{
"query": {
"bool": {
"must" : {
"match" : { "tag" : "A.B" }
},
"should": [
{
"regexp": {
"tag": {
"value": "A.B-.*",
"boost": 3
}
}
}
]
}
}
}
Boosting Query Response
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 3.660784,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 3.660784,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 3.4665942,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
}
]
}
}
Just ensure that your testing is thorough when it comes to boosting because its all about influencing the score & make sure you do that with prod data ingested in DEV/TEST Elastic index.
That way you'd not be spooked when you see totally different results if you move to PROD Elastic.
I'm sorry its pretty long answer but I hope this helps!
answered Nov 22 '18 at 8:08
KamalKamal
1,7681920
1,7681920
add a comment |
add a comment |
But, (I think) I want the the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):
Then, I want to search the tag field like this:
Based on what you've described in your post reg. the 'tag' field, here's my 2 cents.
Your Mysql data should be in 1 type (in 6.5 it's 'doc' by default). You do need to explicitly define your Index Mapping though - especially on the 'tag' field, as you seem to have search requirements.
I would define your 'tag' field as a multi-field of:
- type 'keyword' for aggregations
- type 'text' for searches, with a custom analyzer (that might use 'whitespace' tokenizer, and an 'edge ngram' token filter
(if you don't need aggregations, then just define a 'text' type field with the custom analyzer)
FYI, The Analyze API will show you what ES is doing with your 'tag' data, and will help you define the Mapping that meets your requirements.
add a comment |
But, (I think) I want the the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):
Then, I want to search the tag field like this:
Based on what you've described in your post reg. the 'tag' field, here's my 2 cents.
Your Mysql data should be in 1 type (in 6.5 it's 'doc' by default). You do need to explicitly define your Index Mapping though - especially on the 'tag' field, as you seem to have search requirements.
I would define your 'tag' field as a multi-field of:
- type 'keyword' for aggregations
- type 'text' for searches, with a custom analyzer (that might use 'whitespace' tokenizer, and an 'edge ngram' token filter
(if you don't need aggregations, then just define a 'text' type field with the custom analyzer)
FYI, The Analyze API will show you what ES is doing with your 'tag' data, and will help you define the Mapping that meets your requirements.
add a comment |
But, (I think) I want the the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):
Then, I want to search the tag field like this:
Based on what you've described in your post reg. the 'tag' field, here's my 2 cents.
Your Mysql data should be in 1 type (in 6.5 it's 'doc' by default). You do need to explicitly define your Index Mapping though - especially on the 'tag' field, as you seem to have search requirements.
I would define your 'tag' field as a multi-field of:
- type 'keyword' for aggregations
- type 'text' for searches, with a custom analyzer (that might use 'whitespace' tokenizer, and an 'edge ngram' token filter
(if you don't need aggregations, then just define a 'text' type field with the custom analyzer)
FYI, The Analyze API will show you what ES is doing with your 'tag' data, and will help you define the Mapping that meets your requirements.
But, (I think) I want the the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):
Then, I want to search the tag field like this:
Based on what you've described in your post reg. the 'tag' field, here's my 2 cents.
Your Mysql data should be in 1 type (in 6.5 it's 'doc' by default). You do need to explicitly define your Index Mapping though - especially on the 'tag' field, as you seem to have search requirements.
I would define your 'tag' field as a multi-field of:
- type 'keyword' for aggregations
- type 'text' for searches, with a custom analyzer (that might use 'whitespace' tokenizer, and an 'edge ngram' token filter
(if you don't need aggregations, then just define a 'text' type field with the custom analyzer)
FYI, The Analyze API will show you what ES is doing with your 'tag' data, and will help you define the Mapping that meets your requirements.
answered Nov 22 '18 at 5:55
kevvo83kevvo83
11
11
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53385768%2fsearching-period-and-hyphen-delimited-fields-in-elasticsearch%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown