Extract words of a certain language out of an XML file

Given the following XML (which in reality consists of many records),

I would like to output the unique words in it and also generate a report listing the records in which each word was found.

    <collection>
    <record>
      <controlfield tag="001">1</controlfield>
      <datafield tag="200" ind1="1" ind2=" ">
        <subfield code="a">Metafore po</subfield>
        <subfield code="e">Δοκίμια</subfield>
        <subfield code="f">Περικλής αρχαία Ελλάδα</subfield>
      </datafield>
      <datafield tag="210" ind1="|" ind2="|">
        <subfield code="a">Η Αθήνα</subfield>
        <subfield code="c">Νοέμβριος</subfield>
        <subfield code="d">1999</subfield>
      </datafield>
      <datafield tag="215" ind1=" " ind2=" ">
        <subfield code="a">263 s.</subfield>
      </datafield>
      <datafield tag="606" ind1="|" ind2=" ">
        <subfield code="3">250000087120140311174609</subfield>
        <subfield code="a">Πλάτων ιστορία</subfield>
      </datafield>
      <datafield tag="700" ind1=" " ind2="1">
        <subfield code="3">200000000120140228092156</subfield>
        <subfield code="4">070</subfield>
        <subfield code="a">Liper</subfield>
        <subfield code="b">Berit von der</subfield>
      </datafield>
    </record>
    <record>
      <controlfield tag="001">here text may also exist</controlfield>
      <datafield tag="200" ind1="1" ind2=" ">
        <subfield code="a">Metafore po</subfield>
        <subfield code="e">Δοκίμια</subfield>
        <subfield code="f">Περικλής</subfield>
      </datafield>
    </record>
    </collection>

Desired output (XML format, or whatever is more easily achieved):

    Δοκίμια: 1, here text may also exist
    Περικλής: 1, here text may also exist
    αρχαία: 1
    Η: 1
    etc...

The regex I have tried:

    /[Α-Ωα-ω]{1,}/

asked Nov 7 at 10:01

cazew

xslt-3.0


1 Answer

accepted

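It seems you can treat this as a grouping problem: group the records by every whitespace-separated token that contains a Greek character.

  <!-- expand-text="yes" enables the {current-grouping-key()} text value template below -->
  <xsl:template match="collection" expand-text="yes">
      <xsl:where-populated>
          <ul>
              <!-- one group per whitespace-separated token containing a Greek character -->
              <xsl:for-each-group select="record" group-by="datafield/subfield!tokenize(., '\s')[matches(., '\p{IsGreek}')]">
                  <li>
                      {current-grouping-key()} : <xsl:value-of select="current-group()/controlfield" separator=", "/>
                  </li>
              </xsl:for-each-group>
          </ul>
      </xsl:where-populated>
  </xsl:template>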

https://xsltfiddle.liberty-development.net/gWmuiKi/1 outputs

  <ul>
     <li>
        Δοκίμια : 1, here text may also exist
     </li>
     <li>
        Περικλής : 1, here text may also exist
     </li>
     <li>
        αρχαία : 1
     </li>
     <li>
        Ελλάδα : 1
     </li>
     <li>
        Η : 1
     </li>
     <li>
        Αθήνα : 1
     </li>
     <li>
        Νοέμβριος : 1
     </li>
     <li>
        Πλάτων : 1
     </li>
     <li>
        ιστορία : 1
     </li>
  </ul>

Of course, identifying a "word" by simply tokenizing on white space is going to fail in most texts and languages, due to punctuation characters and language-specific rules. But XSLT/XPath/XQuery regular expressions don't have a word-break metacharacter anyway, so one has to use tokenize or analyze-string in some form.
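As a rough illustration of the analyze-string route, the sketch below groups on every maximal run of characters from the Greek block instead of on whitespace-separated tokens, so surrounding punctuation such as parentheses never becomes part of a key. It is an assumption-laden variant, not the code behind the fiddle above: the stylesheet wrapper and the choice of `\p{IsGreek}+` as the "word" pattern are mine. Note that `\p{IsGreek}` covers the whole block, including accented letters such as ή and ί, which fall outside the ranges `Α-Ω` and `α-ω` used in the question's regex; polytonic characters from the Greek Extended block are not covered.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    expand-text="yes">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="collection">
    <ul>
      <!-- Grouping keys: the string value of each match element returned by
           analyze-string, i.e. each maximal run of Greek-block characters.
           Anything else (Latin text, digits, punctuation) is ignored. -->
      <xsl:for-each-group select="record"
          group-by="datafield/subfield ! analyze-string(., '\p{IsGreek}+')/*:match ! string()">
        <li>
          {current-grouping-key()} : <xsl:value-of select="current-group()/controlfield" separator=", "/>
        </li>
      </xsl:for-each-group>
    </ul>
  </xsl:template>

</xsl:stylesheet>
```

With this variant a subfield such as `(Αθήνα),` would contribute the key `Αθήνα`, which may remove the need to preprocess the input just to strip punctuation.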

answered Nov 7 at 12:55

Martin Honnen


The above works perfectly. Could one produce JSON output somehow, so as to store the results in MongoDB? Also, it seems I need to preprocess the XML and remove characters such as ( and ), for instance.
– cazew
Nov 7 at 15:27

XSLT 3 has a JSON output method (w3.org/TR/xslt-xquery-serialization-31/#json-output), and with the XPath 3.1 map and array data types (w3.org/TR/xpath-31/#id-maps-and-arrays) it supports constructing JSON(-like) data (w3.org/TR/xslt-30/#json), so creating JSON is possible and rather easy: xsltfiddle.liberty-development.net/gWmuiKi/2. For further problems, I think you need to ask a new question indicating the exact JSON structure you want, once you have digested the links and given it a try on your own.
– Martin Honnen
Nov 7 at 15:57
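A minimal sketch of what such a JSON-producing stylesheet could look like is shown below. The target structure (one object whose keys are the words and whose values are arrays of record identifiers) is only an assumption for illustration; it is not necessarily what the linked fiddle produces, and the structure MongoDB should receive is up to the asker.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Serialize the resulting map as a JSON object -->
  <xsl:output method="json" indent="yes"/>

  <xsl:template match="collection">
    <xsl:map>
      <!-- Same grouping as in the answer: one group per whitespace-separated
           token that contains a Greek character -->
      <xsl:for-each-group select="record"
          group-by="datafield/subfield ! tokenize(., '\s')[matches(., '\p{IsGreek}')]">
        <!-- key: the word; value: an array of the controlfield values
             of all records in which the word occurs -->
        <xsl:map-entry key="current-grouping-key()"
            select="array { current-group()/controlfield/string() }"/>
      </xsl:for-each-group>
    </xsl:map>
  </xsl:template>

</xsl:stylesheet>
```

Because grouping keys are unique, each word becomes exactly one map entry, so xsl:map does not run into duplicate keys.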
