I am new to the library, so I have a couple of questions to make sure I am on the right track. I have a task of converting HTML. Say I have an HTML page (page1.html), and deep in its HTML code it has a certain table with a <td id='content ...>. This <td> element holds the important data that I want to include in another HTML page (page2.html) as a <div> element with the same content, slightly modified. Along with the <td id='content ...>, I need to carry over some fragments of JavaScript from page1.html to page2.html.
Right now I am doing this job with cl-html-parser, but the problem with that parser is that it can't serialize LHTML back to HTML, so I was using some additional functions on top of cl-who to do the job. It works, but I'd like to find a cleaner solution with a single package (I hope closure-html is the one) that does both parsing and serializing.
I see that to parse the page I need to use event methods like start-element, characters and end-element. Theoretically, using only those methods, I could find the <td id='content ...> element, set a marker in my-lhtml-builder, start collecting all the data inside the <td> element, and modify what I need, so that I come up with the HTML string at the end of parsing. I would do the same with the bits of JavaScript. What seems a bit awkward here is that I can't just take the whole parse tree under the <td>, modify it, and insert it as a block into some other HTML page. To be able to do that, there would need to be some event which gives the whole PT along with its name and attrs.
Second, I couldn't find out how to convert <p>nada</p> to its LHTML form and back to <p>nada</p> (I always end up with those extra html and head blocks around it).
Please let me know if there is a better solution to my problem; maybe I am missing some functionality or misunderstand the philosophy of the library.
Thank you, Andrei
Hi,
Quoting Andrei Stebakov (lispercat@gmail.com):
I am new to the library, so I have a couple of questions to make sure I am on the right track. I have a task of converting HTML. Say I have an HTML page (page1.html), and deep in its HTML code it has a certain table with a <td id='content ...>. This <td> element holds the important data that I want to include in another HTML page (page2.html) as a <div> element with the same content, slightly modified. Along with the <td id='content ...>, I need to carry over some fragments of JavaScript from page1.html to page2.html.
okay.
Right now I am doing this job with cl-html-parser, but the problem with html parser is that it can't serialize lhtml back to html, so I was using some additional functions to cl-who to do the job. It works, but I'd like to find a cleaner solution with one package (I hope that closure-html is the one) that I could do parsing + serializing.
Yes, closure-html can also serialize.
I worked on the closure-html release based on the patches in cl-html-parser, and I don't think I forgot any features. So closure-html would be able to do everything that cl-html-parser could do (and slightly more perhaps, due to the cxml integration).
The actual parser is unchanged between the two, and hasn't really changed since Gilbert wrote it, so there should be no difference in that regard.
I see that to parse the page I need to use event methods like start-element, characters and end-element. Theoretically, using only those methods, I could find the <td id='content ...> element, set a marker in my-lhtml-builder, start collecting all the data inside the <td> element, and modify what I need, so that I come up with the HTML string at the end of parsing. I would do the same with the bits of JavaScript. What seems a bit awkward here is that I can't just take the whole parse tree under the <td>, modify it, and insert it as a block into some other HTML page. To be able to do that, there would need to be some event which gives the whole PT along with its name and attrs.
I wouldn't define HAX methods like start-element in your kind of application. You can do it, of course, but I don't see the benefit.
This gives you the PT: (chtml:parse "<p>nada</p>" nil)
This gives you the LHTML: (chtml:parse "<p>nada</p>" (chtml:make-lhtml-builder))
This gives you cxml-stp's representation: (chtml:parse "<p>nada</p>" (stp:make-builder))
I recommend using LHTML or STP, unless you are very familiar with PT.
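To illustrate why a tree representation suits this task, here is a minimal sketch that works on plain LHTML lists. It assumes the LHTML shape (name ((attr-name attr-value) ...) child ...) with keyword names, as produced by the LHTML builder; the helper names lhtml-attribute, find-by-id and rename-element are hypothetical and not part of closure-html.

```lisp
;; Sketch only: these helpers are illustrative, not closure-html API.
;; An LHTML node is either a string or a list of the form
;;   (name ((attr-name attr-value) ...) child ...)

(defun lhtml-attribute (node name)
  "Return the value of attribute NAME (a keyword) on element NODE, or NIL."
  (second (assoc name (second node))))

(defun find-by-id (node id)
  "Depth-first search for the first element whose :id attribute equals ID."
  (when (consp node)
    (if (equal (lhtml-attribute node :id) id)
        node
        (some (lambda (child) (find-by-id child id))
              (cddr node)))))

(defun rename-element (node new-name)
  "Return a copy of element NODE with its name replaced by NEW-NAME."
  (list* new-name (second node) (cddr node)))
```

For example, (rename-element (find-by-id tree "content") :div) would turn the TD with id "content" into a DIV that keeps the same attributes and children, ready to be spliced into another tree.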
Second, I couldn't find how I can convert <p>nada</p> to its lhtml form and back to <p>nada</p> form (I always end up with those extra html and head blocks around it).
That's true. The parser currently follows the HTML DTD and "repairs" input wherever it doesn't match the DTD.
It probably wouldn't be hard to change the parser so that it can accept any elements as-is instead of discarding or augmenting them.
Unfortunately I haven't found an opportunity to work on that yet, so for now, you would have to take the "repaired" LHTML or STP representation and extract the child node under BODY after parsing.
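For the LHTML case, the extraction step might look like the following sketch. It assumes the repaired tree has the shape (:html () (:head ()) (:body () ...)); find-element and body-children are hypothetical helper names, not closure-html functions.

```lisp
;; Sketch only: plain list processing on an LHTML tree.

(defun find-element (node name)
  "Depth-first search for the first element named NAME (a keyword)."
  (when (consp node)
    (if (eq (first node) name)
        node
        (some (lambda (child) (find-element child name))
              (cddr node)))))

(defun body-children (document)
  "Return the list of child nodes of the BODY element in DOCUMENT."
  (cddr (find-element document :body)))
```

Applied to the repaired tree for <p>nada</p>, body-children would return a one-element list holding the P node, which is the fragment that was originally parsed.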
For others reading this (you've probably already used it), here is how to serialize LHTML back to a string:
(chtml:serialize-lhtml '(:p () "nada") (chtml:make-string-sink)) => "<P>nada</P>"
Please, let me know if there is a better solution to my problem and maybe I am missing some functionality or misunderstand the philosophy of the library.
Other than the suggestions above, I can only suggest trying various cxml-related libraries to make those steps easier.
For example, here is how to extract the child of BODY from the repaired HTML using Plexippus XPath:
(defun first-child-of-body (document)
  (xpath:with-namespaces (("xhtml" "http://www.w3.org/1999/xhtml"))
    (xpath:first-node (xpath:evaluate "//xhtml:body/*" document))))
CL-USER> (first-child-of-body (chtml:parse "<p>nada</p>" (stp:make-builder)))
=> #.(CXML-STP:ELEMENT :LOCAL-NAME "p" ...)
Going further, have you considered using XSLT? I know many people aren't XSLT fans, but for the kind of HTML processing you are describing I have found it very helpful.
Here is how to convert your two HTML documents to XHTML:
(defun html2xml (in out)
  (with-open-file (s out
                     :direction :output
                     :if-exists :supersede
                     :element-type '(unsigned-byte 8))
    (chtml:parse in (cxml:make-octet-stream-sink s))))

(html2xml "<table><td id='content'>this is page1.html</td></table>"
          #p"page1.xml")
(html2xml "<p>this is page2.html</p>" #p"page2.xml")
Then you can use Xuriella XSLT to combine them into a single HTML document:
(xuriella:apply-stylesheet #p"test.xsl" #p"page2.xml")
=> "<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head><body><p>this is page2.html</p><td id="content">this is page1.html</td></body></html>"
Note that Xuriella uses closure-html's serializer to generate the HTML.
The stylesheet test.xsl could look like this:
----------------------------------------------------------------------
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               xmlns:xhtml="http://www.w3.org/1999/xhtml"
               version="1.0">

  <!-- copy everything, but strip the XHTML namespace -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>

  <!-- insert the TD from page1 into body -->
  <xsl:template match="xhtml:body">
    <body>
      <xsl:apply-templates select="@*|node()"/>
      <xsl:apply-templates
          select="document('page1.xml', .)//xhtml:td[@id = 'content']"/>
    </body>
  </xsl:template>
</xsl:transform>
----------------------------------------------------------------------
d.
Thank you, David, for the detailed explanation. I'll definitely give STP and XSLT a try. If start-element, characters and end-element couldn't handle my case, what's the typical case they are designed for? Something like finding some element in an HTML document and triggering some action based on it? Maybe we need to extend it so that, while an HTML doc is being parsed, some events give us access to the whole sub-tree under the current node?
Andrei
On Thu, Apr 2, 2009 at 3:23 PM, David Lichteblau david@lichteblau.com wrote:
[...]
Quoting David Lichteblau (david@lichteblau.com):
I worked on the closure-html release based on the patches in cl-html-parser, and I don't think I forgot any features. So closure-html would be able to do everything that cl-html-parser could do (and slightly more perhaps, due to the cxml integration).
Sorry, I was wrong about this.
When I read "cl-html-parser", I assumed that it was the Closure repackaging work by Ignas Mikalajūnas.
Turns out that his work was called "trivial-html-parser", and you probably meant "cl-html-parse", which is just yet another repackaging of the parser written by Franz, this time packaged by Gary King.
So these are actually two entirely different parsers. If you like the API of the Franz parser better, I suggest that you use this for parsing, and use closure-html for serialization.
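If the two are combined this way, the tree shapes differ, so a small conversion step would be needed in between. The sketch below assumes the Franz-style output uses (:tag child ...) for elements without attributes, ((:tag :attr "value" ...) child ...) for elements with attributes, and a bare keyword for an empty element; that shape is an assumption to check against the parser's documentation. The function name franz->lhtml is hypothetical.

```lisp
;; Sketch only: converts one node of (assumed) Franz-style parse output
;; into the LHTML shape (name ((attr value) ...) child ...).

(defun franz->lhtml (node)
  "Convert NODE from Franz-style list HTML to closure-html-style LHTML."
  (cond ((stringp node) node)
        ;; a bare keyword is taken to be an element with no attributes
        ;; and no children
        ((keywordp node) (list node '()))
        ((consp node)
         (destructuring-bind (tag &rest children) node
           (multiple-value-bind (name plist)
               (if (consp tag)
                   (values (first tag) (rest tag))
                   (values tag '()))
             (list* name
                    ;; turn the attribute plist into (key value) pairs
                    (loop for (key value) on plist by #'cddr
                          collect (list key value))
                    (mapcar #'franz->lhtml children)))))
        ;; anything else is passed through unchanged
        (t node)))
```

The converted tree could then be handed to chtml:serialize-lhtml as shown earlier in this thread. Special nodes such as comments or doctype declarations are not treated separately here and would need their own handling.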
d.