Extract words of a certain language out of an XML file
Given the following XML (which in practice consists of many records), I would like to output the unique words of one language found in it, and also generate a report listing the records in which each word was found.
<collection>
  <record>
    <controlfield tag="001">1</controlfield>
    <datafield tag="200" ind1="1" ind2=" ">
      <subfield code="a">Metafore po</subfield>
      <subfield code="e">Δοκίμια</subfield>
      <subfield code="f">Περικλής αρχαία Ελλάδα</subfield>
    </datafield>
    <datafield tag="210" ind1="|" ind2="|">
      <subfield code="a">Η Αθήνα</subfield>
      <subfield code="c">Νοέμβριος</subfield>
      <subfield code="d">1999</subfield>
    </datafield>
    <datafield tag="215" ind1=" " ind2=" ">
      <subfield code="a">263 s.</subfield>
    </datafield>
    <datafield tag="606" ind1="|" ind2=" ">
      <subfield code="3">250000087120140311174609</subfield>
      <subfield code="a">Πλάτων ιστορία</subfield>
    </datafield>
    <datafield tag="700" ind1=" " ind2="1">
      <subfield code="3">200000000120140228092156</subfield>
      <subfield code="4">070</subfield>
      <subfield code="a">Liper</subfield>
      <subfield code="b">Berit von der</subfield>
    </datafield>
  </record>
  <record>
    <controlfield tag="001">here text may also exist</controlfield>
    <datafield tag="200" ind1="1" ind2=" ">
      <subfield code="a">Metafore po</subfield>
      <subfield code="e">Δοκίμια</subfield>
      <subfield code="f">Περικλής</subfield>
    </datafield>
  </record>
</collection>
Desired output (XML format, or whatever is more easily achieved):
Δοκίμια: 1, here text may also exist
Περικλής: 1, here text may also exist
αρχαία: 1
Η: 1
etc...
The regex I have tried:
/[Α-Ωα-ω]{1,}/
xslt-3.0
asked Nov 7 at 10:01
cazew
1 Answer
accepted
It seems you can treat that like a grouping problem:
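<xsl:template match="collection">
  <xsl:where-populated>
    <ul>
      <!-- group records by each whitespace-separated token that contains a Greek-block character -->
      <!-- the {...} text value template below needs expand-text="yes" on an ancestor, e.g. on xsl:stylesheet -->
      <xsl:for-each-group select="record" group-by="datafield/subfield!tokenize(., '\s+')[matches(., '\p{IsGreek}')]">
        <li>
          {current-grouping-key()} : <xsl:value-of select="current-group()/controlfield" separator=", "/>
        </li>
      </xsl:for-each-group>
    </ul>
  </xsl:where-populated>
</xsl:template>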
With that template, https://xsltfiddle.liberty-development.net/gWmuiKi/1 outputs
<ul>
  <li>
    Δοκίμια : 1, here text may also exist
  </li>
  <li>
    Περικλής : 1, here text may also exist
  </li>
  <li>
    αρχαία : 1
  </li>
  <li>
    Ελλάδα : 1
  </li>
  <li>
    Η : 1
  </li>
  <li>
    Αθήνα : 1
  </li>
  <li>
    Νοέμβριος : 1
  </li>
  <li>
    Πλάτων : 1
  </li>
  <li>
    ιστορία : 1
  </li>
</ul>
Of course, identifying a "word" by simply tokenizing on whitespace is going to fail in most texts and languages, due to punctuation characters and language-specific rules. But XSLT/XPath/XQuery regular expressions don't have a word-break metacharacter anyway, so one has to use tokenize or analyze-string in some form.
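As a hedged illustration of the analyze-string route mentioned above, the sketch below groups on every run of characters from the Greek block instead of on whitespace-separated tokens, so attached punctuation such as parentheses or commas is left out of the key. It is a variation under stated assumptions, not the stylesheet from the linked fiddle: the fn prefix binding and the string-join call are choices made here, and \p{IsGreek} selects by Unicode block (U+0370 to U+03FF), so polytonic text from other blocks would need an extended pattern. Note also that a hand-written range like [Α-Ωα-ω] skips accented letters such as ί and ή, which sit between Ω and α in code-point order.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    exclude-result-prefixes="#all"
    version="3.0" expand-text="yes">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="collection">
    <ul>
      <!-- analyze-string() wraps every regex hit in an fn:match element;
           each hit's string value becomes one grouping key for the record -->
      <xsl:for-each-group select="record"
          group-by="datafield/subfield ! analyze-string(., '\p{IsGreek}+')/fn:match ! string()">
        <!-- a record that contains the same word twice is still listed once per group -->
        <li>{current-grouping-key()} : {string-join(current-group()/controlfield, ', ')}</li>
      </xsl:for-each-group>
    </ul>
  </xsl:template>

</xsl:stylesheet>
```

Because the key is built from matched runs rather than from tokens, a value like "(Αθήνα)," would contribute the key Αθήνα without any separate clean-up step; characters of the Greek block that are not letters would still be included if they appear inside a run.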
answered Nov 7 at 12:55
Martin Honnen
The above works perfectly. Could one produce JSON output somehow, so as to store the results in MongoDB? Also, it seems I need to preprocess the XML and remove characters such as ( and ), for instance.
– cazew
Nov 7 at 15:27
XSLT 3 has a json output method (w3.org/TR/xslt-xquery-serialization-31/#json-output), and with the XPath 3.1 data types map and array (w3.org/TR/xpath-31/#id-maps-and-arrays) it has support for constructing JSON(-like) data (w3.org/TR/xslt-30/#json), so creating JSON is possible and rather easy: xsltfiddle.liberty-development.net/gWmuiKi/2. For further problems, I think you need to ask a new question indicating the exact JSON structure you want, once you have digested the links and given it a try on your own.
– Martin Honnen
Nov 7 at 15:57
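A minimal sketch of what the JSON route described in the comment could look like, assuming an object that maps each word to an array of record identifiers is an acceptable shape. This is not a copy of the linked fiddle; the structure, the match on the document node, and the replace() step that turns parentheses into spaces before tokenizing are all assumptions made here for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

  <!-- serialize the resulting XPath map as a JSON object -->
  <xsl:output method="json" indent="yes"/>

  <xsl:template match="/">
    <xsl:map>
      <!-- assumed preprocessing: parentheses become spaces so they never stick to a token -->
      <xsl:for-each-group select="collection/record"
          group-by="datafield/subfield ! tokenize(replace(., '[()]', ' '), '\s+')[matches(., '\p{IsGreek}')]">
        <!-- one entry per word: key is the word, value is an array of controlfield strings -->
        <xsl:map-entry key="current-grouping-key()"
            select="array { current-group()/controlfield ! string() }"/>
      </xsl:for-each-group>
    </xsl:map>
  </xsl:template>

</xsl:stylesheet>
```

For the sample input this should serialize to a single JSON object with one property per word, each holding an array of strings; the order of the properties is not something to rely on. Other punctuation would need to be added to the character class in replace(), or the keys could be built from matched runs of Greek characters instead of tokens.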