Parse HTML file, obtaining data from a nested category hierarchy, using XSLT 3
Given HTML files such as the following:
http://bpeck.com/references/DDC/ddc_mine900.htm
http://bpeck.com/references/DDC/ddc_mine200.htm
http://bpeck.com/references/DDC/ddc_mine500.htm
etc.,
how can I get output that shows the hierarchy of the categories?
/---------------------
| ID | Name
| 1 | Main Category
| 3 | Sub Category
| 5 | Sub-Sub Category
| 4 | Sub Category
| 2 | Next Main Category
----------------------
Ideally the output would be in JSON format, but I guess XML would do.
I struggled with a serial parser (SAX) but failed, so I am looking for an elegant solution.
Main categories:
900 World History
910 Geography and travel [see area subdivisions]
920 Biography, genealogy, insignia
930 History of the ancient world
940 General history of Europe [check schedules for date subdivisions]
950 General history of Asia, Far East
etc...
Subcategories of 900:
900 Geography & history
901 Philosophy & theory
902 Miscellany
903 Dictionaries & encyclopedias
904 Collected accounts of events
905 Serial publications
906 Organizations & management
907 Education, research, related topics
908 With respect to kinds of persons
...
Example of sub-subcategories found under 909 World history:
909.7 18th century, 1700-1799
909.8 1800-
909.82 1900-
As for the output format, I would prefer whichever approach you judge best.
Each key would be the ID (900, 901, 902, etc.), and the corresponding value would be the name (Geography & history, Philosophy & theory, Miscellany). The JSON output should be nested, showing the hierarchy of the categories.
I use Saxon HE version 9.8.
xslt-3.0
The result you have shown doesn't contain any sample data, nor is it XML or JSON. So where do you get the ID values from, and what do you consider a main category, a sub category, and so on?
– Martin Honnen
Nov 22 '18 at 15:59
And which XSLT 3 processor do you use, given that the input is not XML but HTML? Do you use Saxon PE or EE, where you have saxonica.com/html/documentation/functions/saxon/parse-html.html, or some other way to plug an HTML parser into the tool chain?
– Martin Honnen
Nov 22 '18 at 16:25
I use Saxon HE 9.8.
– saloda
Nov 22 '18 at 16:29
It depends: do you use the Java version of Saxon HE? How do you run it, from the command line or from Java code? Can you install TagSoup vrici.lojban.org/~cowan/XML/tagsoup? Saxon (even HE) has a command line option -x to name a parser, so if TagSoup is on the class path then -x:org.ccil.cowan.tagsoup.Parser is supposed to use that parser for the input instead of a normal XML parser; you then have the (X)HTML tree TagSoup builds as the input.
– Martin Honnen
Nov 22 '18 at 16:39
asked Nov 22 '18 at 15:02 by saloda, edited Nov 22 '18 at 16:28
1 Answer
The data you have seems poorly structured. I only checked http://bpeck.com/references/DDC/ddc_mine900.htm, but it doesn't pass HTML validation at https://validator.w3.org/check?uri=http%3A%2F%2Fbpeck.com%2Freferences%2FDDC%2Fddc_mine900.htm&charset=%28detect+automatically%29&doctype=Inline&group=0; in particular, the lists of subcategories are not properly nested, so some XSLT plumbing is needed.
As for parsing HTML with XSLT 2 or 3: if you can't set up Saxon to use an HTML parser like TagSoup instead of an XML parser for the input, you can try David Carlisle's htmlparse function, implemented in pure XSLT 2. It is available online at https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl; make sure you download a local copy if you want to use it to parse your HTML in XSLT 2 or 3 with good performance.
Here is an example that uses the online copy and parses the input HTML into an XML format I made up:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <!-- parse the HTML text into an XML tree -->
  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <categories>
      <xsl:apply-templates select="tail($html-doc//table)"/>
    </categories>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <!-- the lists are not properly nested, so group each li with the elements that follow it -->
  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <!-- first token becomes the name (ID), the remaining tokens the title -->
  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud
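The grouping step in the ul template is the "plumbing" mentioned above. As a minimal, self-contained sketch of that technique only: the inline sample below is a hypothetical fragment (I am assuming a list in which a nested ul follows its li as a sibling instead of sitting inside it; I have not verified that the real page looks exactly like this), and the item element name is made up for the illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all"
    version="3.0">

  <!-- Hypothetical sample data: the inner ul is a sibling of the li it belongs to -->
  <xsl:variable name="sample">
    <ul>
      <li>909.7 18th century, 1700-1799</li>
      <li>909.8 1800-</li>
      <ul>
        <li>909.82 1900-</li>
      </ul>
    </ul>
  </xsl:variable>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <items>
      <xsl:apply-templates select="$sample/ul"/>
    </items>
  </xsl:template>

  <!-- Every li starts a new group; whatever follows it up to the next li
       (here: a misplaced ul) lands in the same group and is processed
       recursively as the children of that li. -->
  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <item label="{normalize-space(.)}">
        <xsl:apply-templates select="tail(current-group())"/>
      </item>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

Run with an initial template (for instance Saxon's -it option), the item for 909.82 should end up nested inside the item for 909.8, while 909.7 stays a childless sibling.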
To create JSON with XSLT 3 you have a couple of options. One is to have your stylesheet create the format the xml-to-json function expects (https://www.w3.org/TR/xslt-30/#json-to-xml-mapping); in the following example I extend the above stylesheet with a mode that transforms the previous result XML into the XML input you can feed to xml-to-json:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <!-- step 1: build the intermediate category XML -->
    <xsl:variable name="categories">
      <categories>
        <xsl:apply-templates select="tail($html-doc//table)"/>
      </categories>
    </xsl:variable>
    <!-- step 2: transform it into the XML representation of JSON -->
    <xsl:apply-templates select="$categories" mode="json"/>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

  <xsl:mode name="json" on-no-match="shallow-skip"/>

  <!-- each category becomes { "ID" : { "title" : ..., "children" : [ ... ] } } -->
  <xsl:template match="category | subcategory | sub-sub-category" mode="json">
    <fn:map>
      <fn:map key="{@name}">
        <fn:string key="title">{@title}</fn:string>
        <xsl:where-populated>
          <fn:array key="children">
            <xsl:apply-templates mode="#current"/>
          </fn:array>
        </xsl:where-populated>
      </fn:map>
    </fn:map>
  </xsl:template>

</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/1
The final step then would be to use the function xml-to-json (https://www.w3.org/TR/xpath-functions/#func-xml-to-json) to output JSON instead of XML:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <!-- xml-to-json returns a string, so serialize as text -->
  <xsl:output method="text"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:variable name="categories">
      <categories>
        <xsl:apply-templates select="tail($html-doc//table)"/>
      </categories>
    </xsl:variable>
    <xsl:variable name="json-xml">
      <xsl:apply-templates select="$categories" mode="json"/>
    </xsl:variable>
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

  <xsl:mode name="json" on-no-match="shallow-skip"/>

  <xsl:template match="category | subcategory | sub-sub-category" mode="json">
    <fn:map>
      <fn:map key="{@name}">
        <fn:string key="title">{@title}</fn:string>
        <xsl:where-populated>
          <fn:array key="children">
            <xsl:apply-templates mode="#current"/>
          </fn:array>
        </xsl:where-populated>
      </fn:map>
    </fn:map>
  </xsl:template>

</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/2
https://xsltfiddle.liberty-development.net/3NzcBud/3 is the same code applied to a different input file. At least the XML -> XML -> JSON generation doesn't break, but I haven't checked whether the HTML tables and lists have the same structure as in the previous input.
As another option for creating JSON with XSLT 3, given support for the XPath 3.1 map and array data types (which Saxon 9.8/9.9 in all editions and Altova 2017/2018/2019 provide), you can directly create maps and arrays and serialize them with the output method json:
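To see the xml-to-json route in isolation, here is a reduced sketch without any HTML parsing. The fn:map/fn:string/fn:array elements are the standard XML representation of JSON; the hand-written sample values are taken from the question and stand in for what the json mode above would generate, so treat the data itself as an assumption for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    exclude-result-prefixes="#all"
    version="3.0">

  <!-- xml-to-json returns a string, so the text output method is enough -->
  <xsl:output method="text"/>

  <xsl:template match="/" name="xsl:initial-template">
    <!-- Hand-written XML representation of JSON (illustrative sample values) -->
    <xsl:variable name="json-xml">
      <fn:map>
        <fn:map key="900">
          <fn:string key="title">World History</fn:string>
          <fn:array key="children">
            <fn:map>
              <fn:map key="901">
                <fn:string key="title">Philosophy &amp; theory</fn:string>
              </fn:map>
            </fn:map>
          </fn:array>
        </fn:map>
      </fn:map>
    </xsl:variable>
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>

</xsl:stylesheet>
```

The result should be a JSON object whose key "900" maps to an object with a "title" string and a "children" array, the same shape the full stylesheet produces for each category.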
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string"
    >http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="json" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:apply-templates select="tail($html-doc//table)"/>
  </xsl:template>

  <xsl:template match="table">
    <xsl:sequence select="mf:category-map(tr[1]/td[1], (head(tr)/td[2], tail(tr)/td[1]))"/>
  </xsl:template>

  <xsl:template match="td">
    <xsl:sequence select="mf:category-map(., following-sibling::td[1]/ul)"/>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <xsl:sequence select="mf:category-map(., tail(current-group()))"/>
    </xsl:for-each-group>
  </xsl:template>

  <!-- returns (ID, title): the first token and the remaining tokens joined by a space -->
  <xsl:function name="mf:split-index-title" as="xs:string*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence
      select="
        let $components := tokenize(normalize-space($input))
        return
          (head($components), string-join(tail($components), ' '))"
    />
  </xsl:function>

  <!-- builds { ID : { 'title' : ..., 'children' : [ ... ] } } for one category -->
  <xsl:function name="mf:category-map" as="map(xs:string, item())">
    <xsl:param name="category" as="element()"/>
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:variable name="components" select="mf:split-index-title($category)"/>
    <xsl:map>
      <xsl:map-entry key="$components[1]">
        <xsl:map>
          <xsl:map-entry key="'title'" select="$components[2]"/>
          <xsl:if test="$subcategories">
            <xsl:map-entry key="'children'">
              <xsl:sequence select="array{ mf:child-categories($subcategories) }"/>
            </xsl:map-entry>
          </xsl:if>
        </xsl:map>
      </xsl:map-entry>
    </xsl:map>
  </xsl:function>

  <xsl:function name="mf:child-categories" as="map(xs:string, item())*">
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:apply-templates select="$subcategories"/>
  </xsl:function>

</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/4
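Stripped of the HTML handling, the map/array route comes down to the following sketch. It uses XPath 3.1 map and array constructors in place of the xsl:map and xsl:map-entry instructions used above (a simplification on my part, not a different mechanism), and the values are sample data from the question, not parsed output.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all"
    version="3.0">

  <!-- The json output method serializes a single map or array as JSON -->
  <xsl:output method="json" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <!-- Illustrative sample values: one category with two children -->
    <xsl:sequence select="
      map {
        '900' : map {
          'title' : 'World History',
          'children' : array {
            map { '901' : map { 'title' : 'Philosophy &amp; theory' } },
            map { '902' : map { 'title' : 'Miscellany' } }
          }
        }
      }"/>
  </xsl:template>

</xsl:stylesheet>
```

Compared with xml-to-json, this avoids the intermediate XML representation, but note that the json output method expects the result to be a single map or array (or atomic value), so when several top-level categories are produced you would probably want to wrap them in one array or merge them into one map.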
thank you very much
– saloda
Nov 23 '18 at 10:47
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53433682%2fparse-html-file-obtaining-data-from-nested-categories-hierarchy-using-xslt-3%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
The data you have seems poorly structured (only checked http://bpeck.com/references/DDC/ddc_mine900.htm but that doesn't pass HTML validation at https://validator.w3.org/check?uri=http%3A%2F%2Fbpeck.com%2Freferences%2FDDC%2Fddc_mine900.htm&charset=%28detect+automatically%29&doctype=Inline&group=0, in particular the lists of subcategories are not properly nested so some XSLT plumbing is needed).
As for parsing HTML with XSLT 2 or 3, if you can't set up Saxon to use an HTML parser like TagSoup instead of an XML parser for the input you can try to use David Carlisle's htmlparse
function implemented in pure XSLT 2, it is available online at https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl, make sure you download a local copy if you want to use it to parse your HTML in XSLT 2 or 3 with good performance.
Here is an example that uses the online copy and parses the input HTML into some XML format I made up:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:d="data:,dpc"
xmlns:mf="http://example.com/mf"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
expand-text="yes"
exclude-result-prefixes="#all"
version="3.0">
<xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
<xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
<xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
<xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/" name="xsl:initial-template">
<categories>
<xsl:apply-templates select="tail($html-doc//table)"/>
</categories>
</xsl:template>
<xsl:template match="table">
<category>
<xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
<xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
</category>
</xsl:template>
<xsl:template match="td">
<subcategory>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="following-sibling::td[1]/ul"/>
</subcategory>
</xsl:template>
<xsl:template match="ul">
<xsl:for-each-group select="*" group-starting-with="li">
<sub-sub-category>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="tail(current-group())"/>
</sub-sub-category>
</xsl:for-each-group>
</xsl:template>
<xsl:function name="mf:create-attributes" as="attribute()*">
<xsl:param name="input" as="xs:string"/>
<xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
<xsl:attribute name="name" select="head($input-components)"/>
<xsl:attribute name="title" select="tail($input-components)"/>
</xsl:function>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud
To create JSON with XSLT 3, you have a couple of options, one is to have your stylesheet create the format the xml-to-json
function expects (https://www.w3.org/TR/xslt-30/#json-to-xml-mapping); in the following example I extend the above stylesheet with a mode that takes the previous result XML to create the XML input you can feed to xml-to-json
:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:d="data:,dpc"
xmlns:mf="http://example.com/mf"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
expand-text="yes"
exclude-result-prefixes="#all"
version="3.0">
<xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
<xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
<xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
<xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/" name="xsl:initial-template">
<xsl:variable name="categories">
<categories>
<xsl:apply-templates select="tail($html-doc//table)"/>
</categories>
</xsl:variable>
<xsl:apply-templates select="$categories" mode="json"/>
</xsl:template>
<xsl:template match="table">
<category>
<xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
<xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
</category>
</xsl:template>
<xsl:template match="td">
<subcategory>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="following-sibling::td[1]/ul"/>
</subcategory>
</xsl:template>
<xsl:template match="ul">
<xsl:for-each-group select="*" group-starting-with="li">
<sub-sub-category>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="tail(current-group())"/>
</sub-sub-category>
</xsl:for-each-group>
</xsl:template>
<xsl:function name="mf:create-attributes" as="attribute()*">
<xsl:param name="input" as="xs:string"/>
<xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
<xsl:attribute name="name" select="head($input-components)"/>
<xsl:attribute name="title" select="tail($input-components)"/>
</xsl:function>
<xsl:mode name="json" on-no-match="shallow-skip"/>
<xsl:template match="category | subcategory | sub-sub-category" mode="json">
<fn:map>
<fn:map key="{@name}">
<fn:string key="title">{@title}</fn:string>
<xsl:where-populated>
<fn:array key="children">
<xsl:apply-templates mode="#current"/>
</fn:array>
</xsl:where-populated>
</fn:map>
</fn:map>
</xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/1
The final step then would to use the function xml-to-json
(https://www.w3.org/TR/xpath-functions/#func-xml-to-json) to output JSON instead of XML:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:d="data:,dpc"
xmlns:mf="http://example.com/mf"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
expand-text="yes"
exclude-result-prefixes="#all"
version="3.0">
<xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
<xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
<xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
<xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="text"/>
<xsl:template match="/" name="xsl:initial-template">
<xsl:variable name="categories">
<categories>
<xsl:apply-templates select="tail($html-doc//table)"/>
</categories>
</xsl:variable>
<xsl:variable name="json-xml">
<xsl:apply-templates select="$categories" mode="json"/>
</xsl:variable>
<xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
</xsl:template>
<xsl:template match="table">
<category>
<xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
<xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
</category>
</xsl:template>
<xsl:template match="td">
<subcategory>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="following-sibling::td[1]/ul"/>
</subcategory>
</xsl:template>
<xsl:template match="ul">
<xsl:for-each-group select="*" group-starting-with="li">
<sub-sub-category>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="tail(current-group())"/>
</sub-sub-category>
</xsl:for-each-group>
</xsl:template>
<xsl:function name="mf:create-attributes" as="attribute()*">
<xsl:param name="input" as="xs:string"/>
<xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
<xsl:attribute name="name" select="head($input-components)"/>
<xsl:attribute name="title" select="tail($input-components)"/>
</xsl:function>
<xsl:mode name="json" on-no-match="shallow-skip"/>
<xsl:template match="category | subcategory | sub-sub-category" mode="json">
<fn:map>
<fn:map key="{@name}">
<fn:string key="title">{@title}</fn:string>
<xsl:where-populated>
<fn:array key="children">
<xsl:apply-templates mode="#current"/>
</fn:array>
</xsl:where-populated>
</fn:map>
</fn:map>
</xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/2
https://xsltfiddle.liberty-development.net/3NzcBud/3 applies the same code to a different input file. At least the XML -> XML -> JSON generation doesn't break there; I haven't checked whether the HTML tables and lists in that file have the same structure as in the previous input.
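To make the intermediate format easier to see, here is a minimal sketch that skips the HTML parsing entirely and feeds a hand-written fragment to xml-to-json. The fragment is my own illustration, using the 909 / 909.7 entries quoted in the question, of what the json mode templates are meant to produce; it is not the actual output of the stylesheets above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:fn="http://www.w3.org/2005/xpath-functions"
  exclude-result-prefixes="#all"
  version="3.0">
  <!-- xml-to-json returns a string, so the text output method is enough -->
  <xsl:output method="text"/>
  <xsl:template name="xsl:initial-template">
    <!-- Hand-written sample (an assumption for illustration) in the XML
         vocabulary xml-to-json expects: children of fn:map carry a key
         attribute, children of fn:array do not. -->
    <xsl:variable name="json-xml">
      <fn:map>
        <fn:map key="909">
          <fn:string key="title">World history</fn:string>
          <fn:array key="children">
            <fn:map>
              <fn:map key="909.7">
                <fn:string key="title">18th century, 1700-1799</fn:string>
              </fn:map>
            </fn:map>
          </fn:array>
        </fn:map>
      </fn:map>
    </xsl:variable>
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>
</xsl:stylesheet>
```

Run with the named template xsl:initial-template as the entry point (with Saxon from the command line that is the -it option); the result should be a JSON object keyed by the category ID, each holding a title and, where present, a children array.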
As another option for creating JSON with XSLT 3, given support for the XPath 3.1 map and array data types (which Saxon 9.8/9.9 in all editions as well as Altova 2017/2018/2019 provide), you can create maps and arrays directly and serialize them with the output method json:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:d="data:,dpc"
xmlns:mf="http://example.com/mf"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
xmlns:map="http://www.w3.org/2005/xpath-functions/map"
expand-text="yes"
exclude-result-prefixes="#all"
version="3.0">
<xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
<xsl:param name="html-file" as="xs:string"
>http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
<xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
<xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="json" indent="yes"/>
<xsl:template match="/" name="xsl:initial-template">
<xsl:apply-templates select="tail($html-doc//table)"/>
</xsl:template>
<xsl:template match="table">
<xsl:sequence select="mf:category-map(tr[1]/td[1], (head(tr)/td[2], tail(tr)/td[1]))"/>
</xsl:template>
<xsl:template match="td">
<xsl:sequence select="mf:category-map(., following-sibling::td[1]/ul)"/>
</xsl:template>
<xsl:template match="ul">
<xsl:for-each-group select="*" group-starting-with="li">
<xsl:sequence select="mf:category-map(., tail(current-group()))"/>
</xsl:for-each-group>
</xsl:template>
<xsl:function name="mf:split-index-title" as="xs:string*">
<xsl:param name="input" as="xs:string"/>
<xsl:sequence
select="
let $components := tokenize(normalize-space($input))
return
(head($components), string-join(tail($components), ' '))"
/>
</xsl:function>
<xsl:function name="mf:category-map" as="map(xs:string, item())">
<xsl:param name="category" as="element()"/>
<xsl:param name="subcategories" as="element()*"/>
<xsl:variable name="components" select="mf:split-index-title($category)"/>
<xsl:map>
<xsl:map-entry key="$components[1]">
<xsl:map>
<xsl:map-entry key="'title'" select="$components[2]"/>
<xsl:if test="$subcategories">
<xsl:map-entry key="'children'">
<xsl:sequence select="array{ mf:child-categories($subcategories) }"/>
</xsl:map-entry>
</xsl:if>
</xsl:map>
</xsl:map-entry>
</xsl:map>
</xsl:function>
<xsl:function name="mf:child-categories" as="map(xs:string, item())*">
<xsl:param name="subcategories" as="element()*"/>
<xsl:apply-templates select="$subcategories"/>
</xsl:function>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/4
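Stripped of the HTML handling, the map/array approach boils down to something like the following sketch. The data is hard-coded from the entries quoted in the question purely for illustration, so treat it as a picture of the target structure rather than as output of the stylesheet above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0">
  <!-- the json output method serializes XPath 3.1 maps and arrays directly -->
  <xsl:output method="json" indent="yes"/>
  <xsl:template name="xsl:initial-template">
    <!-- Hard-coded sample data (an assumption for illustration): one
         category map keyed by its ID, with nested child category maps. -->
    <xsl:sequence select="
      map {
        '909' : map {
          'title' : 'World history',
          'children' : array {
            map { '909.7' : map { 'title' : '18th century, 1700-1799' } },
            map { '909.8' : map { 'title' : '1800-' } }
          }
        }
      }"/>
  </xsl:template>
</xsl:stylesheet>
```

One thing to keep in mind with this approach: the json output method expects a single top-level item, so if your input ever yields several category maps at the top level you would probably want to wrap them in an array before serializing.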
thank you very much
– saloda
Nov 23 '18 at 10:47
The final step would then be to use the function xml-to-json
(https://www.w3.org/TR/xpath-functions/#func-xml-to-json) to output JSON instead of XML:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">
  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>
  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:output method="text"/>
  <xsl:template match="/" name="xsl:initial-template">
    <xsl:variable name="categories">
      <categories>
        <xsl:apply-templates select="tail($html-doc//table)"/>
      </categories>
    </xsl:variable>
    <xsl:variable name="json-xml">
      <xsl:apply-templates select="$categories" mode="json"/>
    </xsl:variable>
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>
  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>
  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>
  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>
  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>
  <xsl:mode name="json" on-no-match="shallow-skip"/>
  <xsl:template match="category | subcategory | sub-sub-category" mode="json">
    <fn:map>
      <fn:map key="{@name}">
        <fn:string key="title">{@title}</fn:string>
        <xsl:where-populated>
          <fn:array key="children">
            <xsl:apply-templates mode="#current"/>
          </fn:array>
        </xsl:where-populated>
      </fn:map>
    </fn:map>
  </xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/2
https://xsltfiddle.liberty-development.net/3NzcBud/3 is the same code applied to a different input file. At least the XML -> XML -> JSON generation doesn't break there; I haven't checked whether the HTML tables and lists have the same structure as in the previous input.
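To make the intermediate format easier to picture, here is a minimal sketch of the XML vocabulary that xml-to-json consumes, using hand-written data instead of the parsed HTML. The two entries are taken from the sample listing in the question; the nesting is only meant to illustrate the shape that the json mode templates above generate, it is not actual output of the stylesheet.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    exclude-result-prefixes="#all"
    version="3.0">
  <xsl:output method="text"/>
  <xsl:template name="xsl:initial-template">
    <!-- Hand-written sample data (an assumption for illustration only). -->
    <xsl:variable name="json-xml">
      <!-- The outer map has no key; each category is a keyed map inside it. -->
      <fn:map>
        <fn:map key="900">
          <fn:string key="title">World History</fn:string>
          <!-- Array members must not carry a key attribute. -->
          <fn:array key="children">
            <fn:map>
              <fn:map key="901">
                <fn:string key="title">Philosophy &amp; theory</fn:string>
              </fn:map>
            </fn:map>
          </fn:array>
        </fn:map>
      </fn:map>
    </xsl:variable>
    <!-- Turn the XML representation into a JSON string. -->
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>
</xsl:stylesheet>
```

This should run without any source document when started at the initial template (with Saxon, the -it command line option). The result should be a JSON object with the key "900" whose value holds "title" and a "children" array.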
As another option for creating JSON with XSLT 3 and the XPath 3.1 map and array data types (which Saxon 9.8/9.9 in all editions as well as Altova 2017/2018/2019 support), you can create maps and arrays directly and serialize them with the output method json:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">
  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
  <xsl:param name="html-file" as="xs:string"
    >http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>
  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:output method="json" indent="yes"/>
  <xsl:template match="/" name="xsl:initial-template">
    <xsl:apply-templates select="tail($html-doc//table)"/>
  </xsl:template>
  <xsl:template match="table">
    <xsl:sequence select="mf:category-map(tr[1]/td[1], (head(tr)/td[2], tail(tr)/td[1]))"/>
  </xsl:template>
  <xsl:template match="td">
    <xsl:sequence select="mf:category-map(., following-sibling::td[1]/ul)"/>
  </xsl:template>
  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <xsl:sequence select="mf:category-map(., tail(current-group()))"/>
    </xsl:for-each-group>
  </xsl:template>
  <xsl:function name="mf:split-index-title" as="xs:string*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence
      select="
        let $components := tokenize(normalize-space($input))
        return
          (head($components), string-join(tail($components), ' '))"
    />
  </xsl:function>
  <xsl:function name="mf:category-map" as="map(xs:string, item())">
    <xsl:param name="category" as="element()"/>
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:variable name="components" select="mf:split-index-title($category)"/>
    <xsl:map>
      <xsl:map-entry key="$components[1]">
        <xsl:map>
          <xsl:map-entry key="'title'" select="$components[2]"/>
          <xsl:if test="$subcategories">
            <xsl:map-entry key="'children'">
              <xsl:sequence select="array{ mf:child-categories($subcategories) }"/>
            </xsl:map-entry>
          </xsl:if>
        </xsl:map>
      </xsl:map-entry>
    </xsl:map>
  </xsl:function>
  <xsl:function name="mf:child-categories" as="map(xs:string, item())*">
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:apply-templates select="$subcategories"/>
  </xsl:function>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/4
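Stripped of the HTML parsing, the map/array approach comes down to something like the following sketch. It hard-codes a few entries from the 909 example in the question (an assumption for illustration; the real stylesheet derives them from the table cells and list items) and relies only on the XPath 3.1 map and array constructors plus the json output method.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all"
    version="3.0">
  <xsl:output method="json" indent="yes"/>
  <xsl:template name="xsl:initial-template">
    <!-- One map per category: the key is the ID, the value holds the
         title and, where present, an array of child category maps. -->
    <xsl:sequence select="
      map {
        '909' : map {
          'title' : 'World history',
          'children' : array {
            map { '909.7' : map { 'title' : '18th century, 1700-1799' } },
            map { '909.8' : map { 'title' : '1800-' } }
          }
        }
      }"/>
  </xsl:template>
</xsl:stylesheet>
```

Because the result is a single map, the json output method can serialize it directly; no intermediate XML and no call to xml-to-json is needed.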
edited Nov 23 '18 at 8:38
answered Nov 22 '18 at 21:00
Martin Honnen
Thank you very much.
– saloda
Nov 23 '18 at 10:47
The result you have shown doesn't show any sample data, nor is it XML or JSON. So where do you get the ID values from, and what do you consider a main category, a sub category, and so on?
– Martin Honnen
Nov 22 '18 at 15:59
And which XSLT 3 processor do you use, given that the input is not XML but HTML? Do you use Saxon PE or EE, where you have saxonica.com/html/documentation/functions/saxon/parse-html.html, or some other way to plug an HTML parser into the tool chain?
– Martin Honnen
Nov 22 '18 at 16:25
I use Saxon HE 9.8.
– saloda
Nov 22 '18 at 16:29
It depends: do you use the Java version of Saxon HE? How do you run it, from the command line or from Java code? Can you install TagSoup (vrici.lojban.org/~cowan/XML/tagsoup)? Saxon (even HE) has a command line option -x to name a parser, so if TagSoup is on the class path then -x:org.ccil.cowan.tagsoup.Parser is supposed to use that parser for the input instead of a normal XML parser. You then have the (X)HTML tree that TagSoup builds as the input.
– Martin Honnen
Nov 22 '18 at 16:39