Extract words of a certain language out of an XML file












Given the following XML (which in reality consists of many records), I would like to output the unique words found in it and also generate a report listing the records in which each word was found.



    <collection>
      <record>
        <controlfield tag="001">1</controlfield>
        <datafield tag="200" ind1="1" ind2=" ">
          <subfield code="a">Metafore po</subfield>
          <subfield code="e">Δοκίμια</subfield>
          <subfield code="f">Περικλής αρχαία Ελλάδα</subfield>
        </datafield>
        <datafield tag="210" ind1="|" ind2="|">
          <subfield code="a">Η Αθήνα</subfield>
          <subfield code="c">Νοέμβριος</subfield>
          <subfield code="d">1999</subfield>
        </datafield>
        <datafield tag="215" ind1=" " ind2=" ">
          <subfield code="a">263 s.</subfield>
        </datafield>
        <datafield tag="606" ind1="|" ind2=" ">
          <subfield code="3">250000087120140311174609</subfield>
          <subfield code="a">Πλάτων ιστορία</subfield>
        </datafield>
        <datafield tag="700" ind1=" " ind2="1">
          <subfield code="3">200000000120140228092156</subfield>
          <subfield code="4">070</subfield>
          <subfield code="a">Liper</subfield>
          <subfield code="b">Berit von der</subfield>
        </datafield>
      </record>
      <record>
        <controlfield tag="001">here text may also exist</controlfield>
        <datafield tag="200" ind1="1" ind2=" ">
          <subfield code="a">Metafore po</subfield>
          <subfield code="e">Δοκίμια</subfield>
          <subfield code="f">Περικλής</subfield>
        </datafield>
      </record>
    </collection>


Desired output (XML format, or whatever is more easily achieved):

    Δοκίμια: 1, here text may also exist
    Περικλής: 1, here text may also exist
    αρχαία: 1
    Η: 1

and so on. The regex I have tried is:

    /[Α-Ωα-ω]{1,}/










      xslt-3.0






      asked Nov 7 at 10:01









      cazew

          1 Answer (accepted)

          It seems you can treat that like a grouping problem:



            <xsl:template match="collection">
              <xsl:where-populated>
                <ul>
                  <xsl:for-each-group select="record" group-by="datafield/subfield!tokenize(., '\s')[matches(., '\p{IsGreek}')]">
                    <li>
                      {current-grouping-key()} : <xsl:value-of select="current-group()/controlfield" separator=", "/>
                    </li>
                  </xsl:for-each-group>
                </ul>
              </xsl:where-populated>
            </xsl:template>


          Done that way, https://xsltfiddle.liberty-development.net/gWmuiKi/1 outputs

            <ul>
              <li>
              Δοκίμια : 1, here text may also exist
              </li>
              <li>
              Περικλής : 1, here text may also exist
              </li>
              <li>
              αρχαία : 1
              </li>
              <li>
              Ελλάδα : 1
              </li>
              <li>
              Η : 1
              </li>
              <li>
              Αθήνα : 1
              </li>
              <li>
              Νοέμβριος : 1
              </li>
              <li>
              Πλάτων : 1
              </li>
              <li>
              ιστορία : 1
              </li>
            </ul>



          Of course, identifying a "word" by simply tokenizing on white space is going to fail in most texts and languages, due to punctuation characters and language-specific rules. But XSLT/XPath/XQuery regular expressions don't have a word-break metacharacter anyway, so one has to use tokenize or analyze-string in some form.
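
One way to loosen the white-space assumption, sketched here and not taken from the fiddle: split on runs of non-letter characters and keep only tokens made up entirely of characters from the Greek block. The sketch assumes an XSLT 3.0 processor such as Saxon 9.8 or later. The `normalize-unicode()` call, the `\P{L}+` separator and the anchored `^\p{IsGreek}+$` test are illustrative choices, and `expand-text="yes"` is what enables the `{...}` text value templates.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0"
    expand-text="yes">

  <xsl:output method="html" indent="yes"/>

  <xsl:template match="/collection">
    <xsl:where-populated>
      <ul>
        <!-- Split each subfield on runs of non-letters, so punctuation such as
             "(" and ")" never becomes part of a word; then keep only tokens
             consisting entirely of characters from the Greek block. -->
        <xsl:for-each-group select="record"
            group-by="datafield/subfield
                      ! tokenize(normalize-unicode(.), '\P{L}+')[matches(., '^\p{IsGreek}+$')]">
          <li>{current-grouping-key()} : {string-join(current-group()/controlfield, ', ')}</li>
        </xsl:for-each-group>
      </ul>
    </xsl:where-populated>
  </xsl:template>

</xsl:stylesheet>
```

Because the test is anchored, mixed tokens and empty strings produced by leading or trailing separators are dropped. Polytonic text would probably also need `\p{IsGreekExtended}` in the character class, for example `'^[\p{IsGreek}\p{IsGreekExtended}]+$'`.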






          • The above works perfectly. Could one get JSON output somehow, so as to store the results in MongoDB? Also, it seems I need to preprocess the XML and remove characters such as ( and ), for instance.
            – cazew
            Nov 7 at 15:27










          • XSLT 3 has a JSON output method (w3.org/TR/xslt-xquery-serialization-31/#json-output), and the XPath 3.1 data types map and array (w3.org/TR/xpath-31/#id-maps-and-arrays) give it support for constructing JSON(-like) data (w3.org/TR/xslt-30/#json), so creating JSON is possible and rather easy: xsltfiddle.liberty-development.net/gWmuiKi/2. For further problems, I think you need to ask a new question indicating the exact JSON structure you want, once you have digested the links and given it a try on your own.
            – Martin Honnen
            Nov 7 at 15:57
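
For the JSON follow-up, a minimal sketch of one possible shape: a single JSON object that maps each word to an array of record identifiers. This is not a copy of the linked fiddle, and the object layout is an assumption. It relies on XPath 3.1 arrays, which Saxon 9.8 or later supports.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

  <!-- Serialize the resulting map as JSON instead of XML/HTML. -->
  <xsl:output method="json" indent="yes"/>

  <xsl:template match="/collection">
    <xsl:map>
      <!-- Same grouping as in the answer: one group per Greek word. -->
      <xsl:for-each-group select="record"
          group-by="datafield/subfield ! tokenize(., '\s+')[matches(., '\p{IsGreek}')]">
        <!-- string() turns each controlfield element into plain text,
             so the array holds JSON strings rather than serialized nodes. -->
        <xsl:map-entry key="current-grouping-key()"
            select="array { current-group()/controlfield ! string() }"/>
      </xsl:for-each-group>
    </xsl:map>
  </xsl:template>

</xsl:stylesheet>
```

Maps are unordered, so the order of the keys in the serialized object is not guaranteed. If an array of documents suits MongoDB better, the same grouping could emit one small map per word instead.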











          answered Nov 7 at 12:55









          Martin Honnen
