Searching period and hyphen-delimited fields in Elasticsearch



























I'm trying to find a way to use Elasticsearch to query a field that is both period and hyphen-delimited.



I have a (MySQL) data-set like this (using SQLAlchemy to access it):



id  text        tag
====================================
1   some-text   A.B.c3
2   more. text  A.B-C.c4
3   even more.  B.A-32.D-24.f9





The core reason I use ES for search in the first place is that I want to query against the text field. That part works awesome!






But (I think) I want the tag to appear in the inverted index like this (I probably won't take case into account; I'm just including it for illustration):



A.B.c3          1
A.B-C.c4        2
B.A-32.D-24.f9  3


Then, I want to search the tag field like this:



{
  "query": {
    "prefix": { "tag": "A.B" }
  }
}


And have the query return id/rows/documents 1 and 2.



Basically, I want the query to match the documents in this truth table:



"A." = 1, 2
"A-" = 3


How do I match "A." at the beginning of the tag, differentiate between a period and a hyphen (possibly boosting one of them), and also match mid-phrase based on those same delimiters?



I'd also like to weight these matches higher if they occur at the beginning of the tag field if possible.



How do I do this, or is Elasticsearch not the right tool for the job? It seems like Elasticsearch works great for my text-field comparisons on normally delimited English text, but the tag-based searches seem much harder.



UPDATE: It seems that when I index only a subset of the data, my searches return the results I expect, but when I query against the full data-set, I get fewer hits.










      python elasticsearch search






asked Nov 20 '18 at 3:24 by Jonathan Rys (edited Nov 21 '18 at 21:33 by Jonathan Rys)
























          2 Answers














          This can be done with an n-gram tokenizer.

          Based on what you've provided in the question, I've created a corresponding mapping, sample documents and a sample query that should give you what you are looking for.



          Mapping



          PUT idtesttag
          {
            "settings": {
              "analysis": {
                "analyzer": {
                  "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                  }
                },
                "tokenizer": {
                  "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 5
                  }
                }
              }
            },
            "mappings": {
              "mydocs": {
                "properties": {
                  "id": {
                    "type": "long"
                  },
                  "text": {
                    "type": "text",
                    "analyzer": "my_analyzer"
                  },
                  "tag": {
                    "type": "text",
                    "analyzer": "my_analyzer"
                  }
                }
              }
            }
          }


          With this mapping, if a document with id = 1 has the tag A.B, the following groups of characters are stored in the inverted index:



          A.  -> 1
          .B  -> 1
          A.B -> 1


          So if your query contains any of these three tokens, the document with id = 1 is returned.
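To make the tokenization concrete, here is a minimal Python sketch that imitates what an n-gram tokenizer with min_gram 2 and max_gram 5 produces. It is only an illustration, not Elasticsearch code: the helper names ngrams and build_index are made up for this example, and it assumes every character (including periods and hyphens) is kept in the tokens.

```python
from collections import defaultdict


def ngrams(text, min_gram=2, max_gram=5):
    """Return every substring of text with a length between min_gram and max_gram."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens


def build_index(docs, min_gram=2, max_gram=5):
    """Toy inverted index: map each n-gram to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, tag in docs.items():
        for token in ngrams(tag, min_gram, max_gram):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical document with id 1 and tag "A.B", as in the explanation above.
index = build_index({1: "A.B"})
print(index)  # {'A.': [1], 'A.B': [1], '.B': [1]}
```

The real tags are longer, so they produce many more tokens (every 2- to 5-character window), which is why n-gram fields grow the index noticeably.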



          Sample Documents



          POST idtesttag/mydocs/1
          {
            "id": 1,
            "text": "some-text",
            "tag": "A.B.c3"
          }

          POST idtesttag/mydocs/2
          {
            "id": 2,
            "text": "more. text",
            "tag": "A.B-C.c4"
          }

          POST idtesttag/mydocs/3
          {
            "id": 3,
            "text": "even more.",
            "tag": "B.A-32.D-24.f9"
          }

          POST idtesttag/mydocs/4
          {
            "id": 3,
            "text": "even more.",
            "tag": "B.A.B-32.D-24.f9"
          }


          Sample Query



          POST idtesttag/_search
          {
            "query": {
              "match": {
                "tag": "A.B"
              }
            }
          }


          Query Response



          {
            "took": 139,
            "timed_out": false,
            "_shards": {
              "total": 5,
              "successful": 5,
              "skipped": 0,
              "failed": 0
            },
            "hits": {
              "total": 3,
              "max_score": 0.8630463,
              "hits": [
                {
                  "_index": "idtesttag",
                  "_type": "mydocs",
                  "_id": "1",
                  "_score": 0.8630463,
                  "_source": {
                    "id": 1,
                    "text": "some-text",
                    "tag": "A.B.c3"
                  }
                },
                {
                  "_index": "idtesttag",
                  "_type": "mydocs",
                  "_id": "2",
                  "_score": 0.66078395,
                  "_source": {
                    "id": 2,
                    "text": "more. text",
                    "tag": "A.B-C.c4"
                  }
                },
                {
                  "_index": "idtesttag",
                  "_type": "mydocs",
                  "_id": "4",
                  "_score": 0.46659434,
                  "_source": {
                    "id": 3,
                    "text": "even more.",
                    "tag": "B.A.B-32.D-24.f9"
                  }
                }
              ]
            }
          }


          Note that documents 1, 2 and 4 are returned in the response. Document 4 is the mid-tag match, while documents 1 and 2 match at the beginning.



          Also note the scores: in this response the two matches at the beginning (documents 1 and 2) score higher than the mid-tag match (document 4).
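Why document 3 is missing from the response can be checked with a small Python simulation. This is a rough sketch under assumptions: it treats a match query as "any query n-gram also appears among the tag's n-grams" (the default OR behaviour when the same n-gram analyzer is applied at search time) and it ignores scoring entirely. The names TAGS and match are invented for the example.

```python
def ngrams(text, min_gram=2, max_gram=5):
    """Set of all substrings of text with a length between min_gram and max_gram."""
    return {
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    }


# The four sample tags, keyed by their document _id.
TAGS = {
    1: "A.B.c3",
    2: "A.B-C.c4",
    3: "B.A-32.D-24.f9",
    4: "B.A.B-32.D-24.f9",
}


def match(query, tags=TAGS):
    """Ids of documents sharing at least one n-gram with the query (no scoring)."""
    query_tokens = ngrams(query)
    return sorted(doc_id for doc_id, tag in tags.items() if query_tokens & ngrams(tag))


print(match("A.B"))  # [1, 2, 4]
print(match("A-"))   # [3]
```

Document 3 contains neither "A." nor ".B" nor "A.B", so it shares no token with the query "A.B", while "A-" only occurs in document 3.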



          Boosting based on hyphen



          To boost matches that are followed by a hyphen, I'd suggest a bool query that combines the match query with a regexp query carrying a boost. Below is the sample query I came up with.



          Note that, for the sake of simplicity, the regex only boosts tags where a hyphen immediately follows A.B. Also keep in mind that an unescaped . in a regexp matches any character, so escape it if you need a literal period.



          POST idtesttag/_search
          {
            "query": {
              "bool": {
                "must": {
                  "match": { "tag": "A.B" }
                },
                "should": [
                  {
                    "regexp": {
                      "tag": {
                        "value": "A.B-.*",
                        "boost": 3
                      }
                    }
                  }
                ]
              }
            }
          }


          Boosting Query Response



          {
            "took": 2,
            "timed_out": false,
            "_shards": {
              "total": 5,
              "successful": 5,
              "skipped": 0,
              "failed": 0
            },
            "hits": {
              "total": 3,
              "max_score": 3.660784,
              "hits": [
                {
                  "_index": "idtesttag",
                  "_type": "mydocs",
                  "_id": "2",
                  "_score": 3.660784,
                  "_source": {
                    "id": 2,
                    "text": "more. text",
                    "tag": "A.B-C.c4"
                  }
                },
                {
                  "_index": "idtesttag",
                  "_type": "mydocs",
                  "_id": "4",
                  "_score": 3.4665942,
                  "_source": {
                    "id": 3,
                    "text": "even more.",
                    "tag": "B.A.B-32.D-24.f9"
                  }
                },
                {
                  "_index": "idtesttag",
                  "_type": "mydocs",
                  "_id": "1",
                  "_score": 0.8630463,
                  "_source": {
                    "id": 1,
                    "text": "some-text",
                    "tag": "A.B.c3"
                  }
                }
              ]
            }
          }
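Since the question mentions Python, the boosting query can also be assembled as a plain dictionary before it is sent to Elasticsearch. The sketch below is only one way to do it: the helpers escape_regexp and boosted_tag_query are hypothetical, and the list of reserved characters is my reading of the Elasticsearch regexp syntax, so check it against the documentation for your version. Unlike the query above, it escapes the period so that it only matches a literal ".".

```python
import json

# Characters assumed to be operators in Elasticsearch regexp syntax (assumption, verify for your version).
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_regexp(literal):
    """Backslash-escape reserved characters so the literal matches itself."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in literal)


def boosted_tag_query(prefix, delimiter="-", boost=3):
    """Bool query: must match the prefix, should (boosted) be followed by the delimiter."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"tag": prefix}},
                "should": [
                    {
                        "regexp": {
                            "tag": {
                                "value": escape_regexp(prefix + delimiter) + ".*",
                                "boost": boost,
                            }
                        }
                    }
                ],
            }
        }
    }


body = boosted_tag_query("A.B")
print(json.dumps(body, indent=2))
```

The resulting body can be passed to whatever client you use for the _search request; the regexp value it produces is A\.B-.* instead of A.B-.*.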


          Make sure you test boosting thoroughly, because it is all about influencing the score, and do that testing with production data ingested into a DEV/TEST Elasticsearch index.



          That way you won't be surprised by totally different results when you move to PROD.



          Sorry for the rather long answer, but I hope this helps!





















            Quoting the question: "But, (I think) I want the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration)" and "Then, I want to search the tag field like this".



            Based on what you've described in your post regarding the 'tag' field, here's my 2 cents.



            Your MySQL data should be in a single type (in 6.5 it's 'doc' by default). You do need to define your index mapping explicitly, though, especially for the 'tag' field, since you have search requirements on it.



            I would define your 'tag' field as a multi-field of:



            • type 'keyword' for aggregations

            • type 'text' for searches, with a custom analyzer (which might use the 'whitespace' tokenizer and an 'edge ngram' token filter)


            (If you don't need aggregations, just define a 'text' type field with the custom analyzer.)



            FYI, the Analyze API will show you what ES is doing with your 'tag' data, and will help you define the mapping that meets your requirements.
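To see what a whitespace tokenizer followed by an edge n-gram filter would store for the sample tags, here is a small Python imitation. It is a sketch under assumptions, not the actual analyzer: min_gram 1 and max_gram 20 are arbitrary choices, the names edge_ngrams, analyze_tag and prefix_search are invented, and it assumes the query string is looked up as a single term rather than being split into n-grams itself.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of token, from min_gram up to max_gram characters long."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def analyze_tag(tag):
    """Split on whitespace, then expand every token into its edge n-grams."""
    terms = []
    for token in tag.split():
        terms.extend(edge_ngrams(token))
    return terms


# The sample tags from the question and the first answer, keyed by document _id.
TAGS = {
    1: "A.B.c3",
    2: "A.B-C.c4",
    3: "B.A-32.D-24.f9",
    4: "B.A.B-32.D-24.f9",
}


def prefix_search(query, tags=TAGS):
    """Ids of documents whose indexed terms contain the query as an exact term."""
    return sorted(doc_id for doc_id, tag in tags.items() if query in analyze_tag(tag))


print(analyze_tag("A.B.c3"))  # ['A', 'A.', 'A.B', 'A.B.', 'A.B.c', 'A.B.c3']
print(prefix_search("A.B"))   # [1, 2]
print(prefix_search("A."))    # [1, 2]
```

Edge n-grams only cover matches at the start of the tag, so the mid-tag case from the question ("A-" matching document 3) would still need something like the n-gram approach from the other answer.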






            share|improve this answer























              Your Answer






              StackExchange.ifUsing("editor", function () {
              StackExchange.using("externalEditor", function () {
              StackExchange.using("snippets", function () {
              StackExchange.snippets.init();
              });
              });
              }, "code-snippets");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "1"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53385768%2fsearching-period-and-hyphen-delimited-fields-in-elasticsearch%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              2 Answers
              2






              active

              oldest

              votes








              2 Answers
              2






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              1














              This can be done via N-Gram tokenizer.



              Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.



              Mapping



              PUT idtesttag
              {
              "settings": {
              "analysis": {
              "analyzer": {
              "my_analyzer": {
              "tokenizer": "my_tokenizer"
              }
              },
              "tokenizer": {
              "my_tokenizer": {
              "type": "ngram",
              "min_gram": 2,
              "max_gram": 5
              }
              }
              }
              },
              "mappings": {
              "mydocs": {
              "properties": {
              "id": {
              "type": "long"
              },
              "text": {
              "type": "text",
              "analyzer": "my_analyzer"
              },
              "tag": {
              "type": "text",
              "analyzer": "my_analyzer"
              }
              }
              }
              }
              }


              What this would do is, if you have a document with id = 1 has a tag A.B it would store following group of characters in its inverted index.



               A. -> 1
              .B -> 1
              A.B -> 1


              So if your query has any of these three words, your document with id=1 would be returned.



              Sample Documents



              POST idtesttag/mydocs/1
              {
              "id": 1,
              "text": "some-text",
              "tag": "A.B.c3"
              }

              POST idtesttag/mydocs/2
              {
              "id": 2,
              "text": "more. text",
              "tag": "A.B-C.c4"
              }

              POST idtesttag/mydocs/3
              {
              "id": 3,
              "text": "even more.",
              "tag": "B.A-32.D-24.f9"
              }

              POST idtesttag/mydocs/4
              {
              "id": 3,
              "text": "even more.",
              "tag": "B.A.B-32.D-24.f9"
              }


              Sample Query



              POST idtesttag/_search
              {
              "query": {
              "match": {
              "tag": "A.B"
              }
              }
              }


              Query Response



              {
              "took": 139,
              "timed_out": false,
              "_shards": {
              "total": 5,
              "successful": 5,
              "skipped": 0,
              "failed": 0
              },
              "hits": {
              "total": 3,
              "max_score": 0.8630463,
              "hits": [
              {
              "_index": "idtesttag",
              "_type": "mydocs",
              "_id": "1",
              "_score": 0.8630463,
              "_source": {
              "id": 1,
              "text": "some-text",
              "tag": "A.B.c3"
              }
              },
              {
              "_index": "idtesttag",
              "_type": "mydocs",
              "_id": "2",
              "_score": 0.66078395,
              "_source": {
              "id": 2,
              "text": "more. text",
              "tag": "A.B-C.c4"
              }
              },
              {
              "_index": "idtesttag",
              "_type": "mydocs",
              "_id": "4",
              "_score": 0.46659434,
              "_source": {
              "id": 3,
              "text": "even more.",
              "tag": "B.A.B-32.D-24.f9"
              }
              }
              ]
              }
              }


              Note that the documents 1, 2 and 4 are returned in the response. The document 4 is the mid sentence match while documents 1 & 2 are at the beginning.



              Also note the score value as how it appears.



              Boosting based on hypen



              Now with regards to boosting based on hypen character, I'd suggest you to have Bool query along with Regex Query with Boosting. Below is the sample query I came up with.



              Note that just for sake of simplicity I've added regex where it would only boost if hypen is next to A.B.



              POST idtesttag/_search
              {
              "query": {
              "bool": {
              "must" : {
              "match" : { "tag" : "A.B" }
              },
              "should": [
              {
              "regexp": {
              "tag": {
              "value": "A.B-.*",
              "boost": 3
              }
              }
              }
              ]
              }
              }
              }


              Boosting Query Response



              {
              "took": 2,
              "timed_out": false,
              "_shards": {
              "total": 5,
              "successful": 5,
              "skipped": 0,
              "failed": 0
              },
              "hits": {
              "total": 3,
              "max_score": 3.660784,
              "hits": [
              {
              "_index": "idtesttag",
              "_type": "mydocs",
              "_id": "2",
              "_score": 3.660784,
              "_source": {
              "id": 2,
              "text": "more. text",
              "tag": "A.B-C.c4"
              }
              },
              {
              "_index": "idtesttag",
              "_type": "mydocs",
              "_id": "4",
              "_score": 3.4665942,
              "_source": {
              "id": 3,
              "text": "even more.",
              "tag": "B.A.B-32.D-24.f9"
              }
              },
              {
              "_index": "idtesttag",
              "_type": "mydocs",
              "_id": "1",
              "_score": 0.8630463,
              "_source": {
              "id": 1,
              "text": "some-text",
              "tag": "A.B.c3"
              }
              }
              ]
              }
              }


              Just ensure that your testing is thorough when it comes to boosting because its all about influencing the score & make sure you do that with prod data ingested in DEV/TEST Elastic index.



              That way you'd not be spooked when you see totally different results if you move to PROD Elastic.



              I'm sorry its pretty long answer but I hope this helps!






              share|improve this answer




























                1














                This can be done via N-Gram tokenizer.



                Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.



                Mapping



                PUT idtesttag
                {
                "settings": {
                "analysis": {
                "analyzer": {
                "my_analyzer": {
                "tokenizer": "my_tokenizer"
                }
                },
                "tokenizer": {
                "my_tokenizer": {
                "type": "ngram",
                "min_gram": 2,
                "max_gram": 5
                }
                }
                }
                },
                "mappings": {
                "mydocs": {
                "properties": {
                "id": {
                "type": "long"
                },
                "text": {
                "type": "text",
                "analyzer": "my_analyzer"
                },
                "tag": {
                "type": "text",
                "analyzer": "my_analyzer"
                }
                }
                }
                }
                }


                What this would do is, if you have a document with id = 1 has a tag A.B it would store following group of characters in its inverted index.



                 A. -> 1
                .B -> 1
                A.B -> 1


                So if your query has any of these three words, your document with id=1 would be returned.



                Sample Documents



                POST idtesttag/mydocs/1
                {
                "id": 1,
                "text": "some-text",
                "tag": "A.B.c3"
                }

                POST idtesttag/mydocs/2
                {
                "id": 2,
                "text": "more. text",
                "tag": "A.B-C.c4"
                }

                POST idtesttag/mydocs/3
                {
                "id": 3,
                "text": "even more.",
                "tag": "B.A-32.D-24.f9"
                }

                POST idtesttag/mydocs/4
                {
                "id": 3,
                "text": "even more.",
                "tag": "B.A.B-32.D-24.f9"
                }


                Sample Query



                POST idtesttag/_search
                {
                "query": {
                "match": {
                "tag": "A.B"
                }
                }
                }


                Query Response



                {
                "took": 139,
                "timed_out": false,
                "_shards": {
                "total": 5,
                "successful": 5,
                "skipped": 0,
                "failed": 0
                },
                "hits": {
                "total": 3,
                "max_score": 0.8630463,
                "hits": [
                {
                "_index": "idtesttag",
                "_type": "mydocs",
                "_id": "1",
                "_score": 0.8630463,
                "_source": {
                "id": 1,
                "text": "some-text",
                "tag": "A.B.c3"
                }
                },
                {
                "_index": "idtesttag",
                "_type": "mydocs",
                "_id": "2",
                "_score": 0.66078395,
                "_source": {
                "id": 2,
                "text": "more. text",
                "tag": "A.B-C.c4"
                }
                },
                {
                "_index": "idtesttag",
                "_type": "mydocs",
                "_id": "4",
                "_score": 0.46659434,
                "_source": {
                "id": 3,
                "text": "even more.",
                "tag": "B.A.B-32.D-24.f9"
                }
                }
                ]
                }
                }


                Note that the documents 1, 2 and 4 are returned in the response. The document 4 is the mid sentence match while documents 1 & 2 are at the beginning.



                Also note the score value as how it appears.



                Boosting based on hypen



                Now with regards to boosting based on hypen character, I'd suggest you to have Bool query along with Regex Query with Boosting. Below is the sample query I came up with.



                Note that just for sake of simplicity I've added regex where it would only boost if hypen is next to A.B.



                POST idtesttag/_search
                {
                "query": {
                "bool": {
                "must" : {
                "match" : { "tag" : "A.B" }
                },
                "should": [
                {
                "regexp": {
                "tag": {
                "value": "A.B-.*",
                "boost": 3
                }
                }
                }
                ]
                }
                }
                }


                Boosting Query Response



                {
                "took": 2,
                "timed_out": false,
                "_shards": {
                "total": 5,
                "successful": 5,
                "skipped": 0,
                "failed": 0
                },
                "hits": {
                "total": 3,
                "max_score": 3.660784,
                "hits": [
                {
                "_index": "idtesttag",
                "_type": "mydocs",
                "_id": "2",
                "_score": 3.660784,
                "_source": {
                "id": 2,
                "text": "more. text",
                "tag": "A.B-C.c4"
                }
                },
                {
                "_index": "idtesttag",
                "_type": "mydocs",
                "_id": "4",
                "_score": 3.4665942,
                "_source": {
                "id": 3,
                "text": "even more.",
                "tag": "B.A.B-32.D-24.f9"
                }
                },
                {
                "_index": "idtesttag",
                "_type": "mydocs",
                "_id": "1",
                "_score": 0.8630463,
                "_source": {
                "id": 1,
                "text": "some-text",
                "tag": "A.B.c3"
                }
                }
                ]
                }
                }


                Just ensure that your testing is thorough when it comes to boosting because its all about influencing the score & make sure you do that with prod data ingested in DEV/TEST Elastic index.



                That way you'd not be spooked when you see totally different results if you move to PROD Elastic.



                I'm sorry its pretty long answer but I hope this helps!






                share|improve this answer


























                  1












                  1








                  1







                  This can be done via N-Gram tokenizer.



                  Based on what you've provided in question, I've created its corresponding mapping, documents and a sample query to give you what you are looking for.



                  Mapping



                  PUT idtesttag
                  {
                    "settings": {
                      "analysis": {
                        "analyzer": {
                          "my_analyzer": {
                            "tokenizer": "my_tokenizer"
                          }
                        },
                        "tokenizer": {
                          "my_tokenizer": {
                            "type": "ngram",
                            "min_gram": 2,
                            "max_gram": 5
                          }
                        }
                      }
                    },
                    "mappings": {
                      "mydocs": {
                        "properties": {
                          "id": {
                            "type": "long"
                          },
                          "text": {
                            "type": "text",
                            "analyzer": "my_analyzer"
                          },
                          "tag": {
                            "type": "text",
                            "analyzer": "my_analyzer"
                          }
                        }
                      }
                    }
                  }


                  With this mapping, if a document with id = 1 has the tag A.B, the following character groups (tokens) are stored in the inverted index:



                  A.  -> 1
                  .B  -> 1
                  A.B -> 1


                  So if the analyzed query produces any of these three tokens, the document with id = 1 is returned.
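To make the token list above concrete, here is a minimal pure-Python sketch that approximates what an n-gram tokenizer with min_gram = 2 and max_gram = 5 emits. The function name `ngrams` is hypothetical and this is not Elasticsearch code; it only assumes that punctuation is kept as part of the tokens, which matches the A. / .B / A.B listing above.

```python
def ngrams(text, min_gram=2, max_gram=5):
    """Approximate an n-gram tokenizer: every substring whose length
    is between min_gram and max_gram, ordered by start position."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # longer grams from this start would also overflow
            tokens.append(text[start:end])
    return tokens


if __name__ == "__main__":
    print(ngrams("A.B"))      # ['A.', 'A.B', '.B']
    print(ngrams("A.B.c3"))   # includes 'A.', 'A.B' and '.B' among others
```

Because the tag A.B.c3 produces all three query tokens, a match query for A.B finds it; the real tokens for any string can be checked with the Analyze API.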



                  Sample Documents



                  POST idtesttag/mydocs/1
                  {
                    "id": 1,
                    "text": "some-text",
                    "tag": "A.B.c3"
                  }

                  POST idtesttag/mydocs/2
                  {
                    "id": 2,
                    "text": "more. text",
                    "tag": "A.B-C.c4"
                  }

                  POST idtesttag/mydocs/3
                  {
                    "id": 3,
                    "text": "even more.",
                    "tag": "B.A-32.D-24.f9"
                  }

                  POST idtesttag/mydocs/4
                  {
                    "id": 3,
                    "text": "even more.",
                    "tag": "B.A.B-32.D-24.f9"
                  }


                  Sample Query



                  POST idtesttag/_search
                  {
                    "query": {
                      "match": {
                        "tag": "A.B"
                      }
                    }
                  }


                  Query Response



                  {
                    "took": 139,
                    "timed_out": false,
                    "_shards": {
                      "total": 5,
                      "successful": 5,
                      "skipped": 0,
                      "failed": 0
                    },
                    "hits": {
                      "total": 3,
                      "max_score": 0.8630463,
                      "hits": [
                        {
                          "_index": "idtesttag",
                          "_type": "mydocs",
                          "_id": "1",
                          "_score": 0.8630463,
                          "_source": {
                            "id": 1,
                            "text": "some-text",
                            "tag": "A.B.c3"
                          }
                        },
                        {
                          "_index": "idtesttag",
                          "_type": "mydocs",
                          "_id": "2",
                          "_score": 0.66078395,
                          "_source": {
                            "id": 2,
                            "text": "more. text",
                            "tag": "A.B-C.c4"
                          }
                        },
                        {
                          "_index": "idtesttag",
                          "_type": "mydocs",
                          "_id": "4",
                          "_score": 0.46659434,
                          "_source": {
                            "id": 3,
                            "text": "even more.",
                            "tag": "B.A.B-32.D-24.f9"
                          }
                        }
                      ]
                    }
                  }


                  Note that documents 1, 2 and 4 are returned in the response. Document 4 is a mid-string match, while documents 1 and 2 match at the beginning of the tag.



                  Also note the _score values and how they rank the documents.



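                  Boosting based on hyphen



                  As for boosting based on the hyphen character, I'd suggest a Bool query combined with a Regexp query that carries a boost. Below is the sample query I came up with.



                  Note that, for the sake of simplicity, the regex below only boosts when a hyphen immediately follows A.B. Because the regexp query runs against the indexed n-gram terms, any tag that contains A.B- receives the boost, not only tags that start with it. Also, an unescaped . in the pattern matches any character; escape it (\\. inside the JSON string) to require a literal period.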



                  POST idtesttag/_search
                  {
                    "query": {
                      "bool": {
                        "must": {
                          "match": { "tag": "A.B" }
                        },
                        "should": [
                          {
                            "regexp": {
                              "tag": {
                                "value": "A.B-.*",
                                "boost": 3
                              }
                            }
                          }
                        ]
                      }
                    }
                  }


                  Boosting Query Response



                  {
                    "took": 2,
                    "timed_out": false,
                    "_shards": {
                      "total": 5,
                      "successful": 5,
                      "skipped": 0,
                      "failed": 0
                    },
                    "hits": {
                      "total": 3,
                      "max_score": 3.660784,
                      "hits": [
                        {
                          "_index": "idtesttag",
                          "_type": "mydocs",
                          "_id": "2",
                          "_score": 3.660784,
                          "_source": {
                            "id": 2,
                            "text": "more. text",
                            "tag": "A.B-C.c4"
                          }
                        },
                        {
                          "_index": "idtesttag",
                          "_type": "mydocs",
                          "_id": "4",
                          "_score": 3.4665942,
                          "_source": {
                            "id": 3,
                            "text": "even more.",
                            "tag": "B.A.B-32.D-24.f9"
                          }
                        },
                        {
                          "_index": "idtesttag",
                          "_type": "mydocs",
                          "_id": "1",
                          "_score": 0.8630463,
                          "_source": {
                            "id": 1,
                            "text": "some-text",
                            "tag": "A.B.c3"
                          }
                        }
                      ]
                    }
                  }
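To see why documents 2 and 4 pick up the boost while document 1 does not, here is a rough pure-Python sketch. It assumes the regexp clause is evaluated against each indexed n-gram term as a whole, and it uses Python's `re.fullmatch` as a stand-in for the Lucene regex engine (the two syntaxes differ in places, so treat this as an approximation). The helper names `ngrams` and `boosted_ids` are hypothetical.

```python
import re


def ngrams(text, min_gram=2, max_gram=5):
    """Approximate the n-gram tokenizer from the mapping (2 to 5 chars)."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(text):
                break
            tokens.append(text[start:start + size])
    return tokens


# Tags of the four sample documents, keyed by _id.
TAGS = {
    1: "A.B.c3",
    2: "A.B-C.c4",
    3: "B.A-32.D-24.f9",
    4: "B.A.B-32.D-24.f9",
}


def boosted_ids(tags, pattern):
    """Return the ids whose tag has at least one n-gram term that the
    pattern matches in full (regexp clauses match whole terms)."""
    regex = re.compile(pattern)
    return [
        doc_id
        for doc_id, tag in tags.items()
        if any(regex.fullmatch(term) for term in ngrams(tag))
    ]


if __name__ == "__main__":
    print(boosted_ids(TAGS, r"A.B-.*"))  # [2, 4]
```

Both A.B-C.c4 and B.A.B-32.D-24.f9 contain the term A.B-, so both are boosted, which is consistent with the scores in the response above.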


                  Make sure you test boosting thoroughly, because it is all about influencing the score, and do so with production data ingested into a DEV/TEST Elasticsearch index.



                  That way you won't be surprised by totally different results when you move to production Elasticsearch.



                  Sorry for the long answer, but I hope this helps!
















                  answered Nov 22 '18 at 8:08 by Kamal








































                      But, (I think) I want the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):



                      Then, I want to search the tag field like this:




                      Based on what you've described in your post regarding the 'tag' field, here's my 2 cents.



                      Your MySQL data should be in one type (in 6.5 it's 'doc' by default). You do need to define your index mapping explicitly, though, especially for the 'tag' field, since you have search requirements on it.



                      I would define your 'tag' field as a multi-field of:




                      • type 'keyword' for aggregations

                      • type 'text' for searches, with a custom analyzer (one that might use the 'whitespace' tokenizer and an 'edge ngram' token filter)


                      (if you don't need aggregations, then just define a 'text' type field with the custom analyzer)
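As a rough illustration of the multi-field idea above, the index body could be sketched as a Python dict and serialized to JSON. Everything named here is an assumption for illustration: the analyzer name `tag_prefix_analyzer`, the filter name `tag_edge_ngram`, the `raw` sub-field, and the min_gram/max_gram values are not from the question and would need tuning. The mapping is written without a type level; on 6.x the `properties` block would be nested under the type name.

```python
import json

# Hypothetical names and gram sizes, chosen only for illustration.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "tag_edge_ngram": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20,
                }
            },
            "analyzer": {
                "tag_prefix_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["tag_edge_ngram"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "tag": {
                # Analyzed field used for searches.
                "type": "text",
                "analyzer": "tag_prefix_analyzer",
                "fields": {
                    # Exact, unanalyzed copy used for aggregations.
                    "raw": {"type": "keyword"}
                },
            }
        }
    },
}

if __name__ == "__main__":
    # The printed JSON is what would be sent as the body of the
    # index-creation request.
    print(json.dumps(index_body, indent=2))
```

With this layout, searches would target `tag` and aggregations would target `tag.raw`.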



                      FYI, the Analyze API will show you what ES does with your 'tag' data and will help you define a mapping that meets your requirements.
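Before running the Analyze API against a real index, it can help to have an expectation of the output. The sketch below approximates in plain Python what a 'whitespace' tokenizer followed by an 'edge ngram' token filter would produce; the function name `edge_ngrams` and the gram sizes are assumptions, and the Analyze API remains the authoritative check.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Split on whitespace, then emit each word's prefixes whose
    length is between min_gram and max_gram."""
    tokens = []
    for word in text.split():
        longest = min(max_gram, len(word))
        for size in range(min_gram, longest + 1):
            tokens.append(word[:size])
    return tokens


if __name__ == "__main__":
    print(edge_ngrams("A.B-C.c4"))
    # ['A', 'A.', 'A.B', 'A.B-', 'A.B-C', 'A.B-C.', 'A.B-C.c', 'A.B-C.c4']
```

Since only prefixes are indexed, a term such as A.B is present for tags that begin with A.B but not for a tag like B.A-32.D-24.f9, which is the start-of-field behavior described in the question.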
















                          answered Nov 22 '18 at 5:55 by kevvo83





























