Re: [closure-devel] Converting html files

2 Apr 2009

Hi,

Quoting Andrei Stebakov (lispercat@gmail.com):
...
I am new to the library so I have a couple of questions to make sure I
am on the right track.
I have a task of converting html. Say I have an html page (page1.html);
in the depths of its html code it has a certain table with a <td
id='content ...>.
This <td> element holds the important data that I want to include in
another html page (page2.html) as a <div> element with the same
content slightly modified.
Along with the <td id='content ...> I need to borrow some fragments of
javascript from page1.html to page2.html.
okay.
...
Right now I am doing this job with cl-html-parser, but the problem
with html parser is that it can't serialize lhtml back to html, so I
was using some additional functions to cl-who to do the job. It works,
but I'd like to find a cleaner solution with one package (I hope that
closure-html is the one) that can do both parsing and serializing.
Yes, closure-html can also serialize.

I worked on the closure-html release based on the patches in
cl-html-parser, and I don't think I forgot any features.  So
closure-html would be able to do everything that cl-html-parser could do
(and slightly more perhaps, due to the cxml integration).

The actual parser is unchanged between the two, and hasn't really
changed since Gilbert wrote it, so there should be no difference in that
regard.
...
I see that to parse the page I need to use event methods like
start-element, characters and end-element. Theoretically by using only
those methods it's possible for me to find the <td id='content ...>
element, set a marker in my-lhtml-builder and start to collect all the
data inside the <td> element, modify what I need so I come up with the
html string at the end of parsing. Along with that I would do the same
with bits of javascript.
What seems a bit awkward here is that I can't just take the whole
parse tree under the <td>, modify it, and insert it as a block to some
other html page.
To be able to do that there needs to be some event which gives the
whole pt along with its name and attrs.
I wouldn't define HAX methods like start-element in your kind of
application. You can do it, of course, but I don't see the benefit.

This gives you the PT:
  (chtml:parse "<p>nada</p>" nil)

This gives you the LHTML:
  (chtml:parse "<p>nada</p>" (chtml:make-lhtml-builder))

This gives you cxml-stp's representation:
  (chtml:parse "<p>nada</p>" (stp:make-builder))

I recommend using LHTML or STP, unless you are very familiar with PT.
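
To illustrate why LHTML is convenient here: once you have the tree,
taking the whole subtree under the <td> is an ordinary list walk. The
following is only a rough sketch. The function names are made up for
this example (they are not part of closure-html), and it assumes the
usual LHTML node shape (name attributes . children), with attributes as
a list of (name value) pairs and text as plain strings.

```lisp
;; Hypothetical helpers, not part of closure-html.
;; Assumed LHTML node shape: (name attributes . children).

(defun lhtml-attribute (node name)
  "Return the value of attribute NAME (a keyword) on NODE, or NIL."
  (second (assoc name (second node))))

(defun find-lhtml-element (tree tag id)
  "Depth-first search of TREE for the first TAG element whose id is ID."
  (when (consp tree)                      ; strings (text nodes) are skipped
    (if (and (eq (first tree) tag)
             (equal (lhtml-attribute tree :id) id))
        tree
        (some (lambda (child) (find-lhtml-element child tag id))
              (cddr tree)))))

(defun rename-lhtml-element (node new-tag)
  "Return a new node like NODE but tagged NEW-TAG (attributes and children are shared)."
  (list* new-tag (second node) (cddr node)))
```

Given the LHTML for page1.html, (find-lhtml-element page1 :td "content")
would return the <td> subtree, and (rename-lhtml-element td :div) turns
it into a <div> that can be spliced into the LHTML for page2.html before
serializing.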
...
Second, I couldn't find how I can convert <p>nada</p> to its lhtml
form and back to <p>nada</p> form (I always end up with those extra
html and head blocks around it).
That's true. The parser currently follows the HTML DTD and "repairs"
input wherever it doesn't match the DTD.

It probably wouldn't be hard to change the parser so that it can accept
any elements as-is instead of discarding or augmenting them.

Unfortunately I haven't found an opportunity to work on that yet, so for
now, you would have to take the "repaired" LHTML or STP representation
and extract the child node under BODY after parsing.

For others reading this (you've probably already used it), here is how
to serialize LHTML back to a string:

(chtml:serialize-lhtml '(:p () "nada") (chtml:make-string-sink))
 => "<p>nada</p>"
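
For the LHTML case, extracting what sits under BODY could be sketched
roughly as follows. The helper name is hypothetical, and the sketch
assumes the repaired document has the shape
(:html () (:head ()) (:body () ...)).

```lisp
;; Hypothetical helper, not part of closure-html.
(defun lhtml-body-children (document)
  "Return the child nodes of the BODY element of a repaired LHTML DOCUMENT."
  (let ((body (find :body (cddr document)
                    :key (lambda (child)
                           (and (consp child) (first child))))))
    (cddr body)))                         ; skip the tag name and attribute list

;; With closure-html loaded, a round trip might then look like:
;;   (chtml:serialize-lhtml
;;    (first (lhtml-body-children
;;            (chtml:parse "<p>nada</p>" (chtml:make-lhtml-builder))))
;;    (chtml:make-string-sink))
```

The helper returns a list because BODY can have several children; take
the FIRST of it when the input was a single element.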
...
Please, let me know if there is a better solution to my problem and
maybe I am missing some functionality or misunderstand the philosophy
of the library.
Other than the suggestions above, I can only suggest trying various
cxml-related libraries to make those steps easier.

For example, here is how to extract the child of BODY from the repaired
HTML using Plexippus XPath:

  (defun first-child-of-body (document)
    (xpath:with-namespaces (("xhtml" "http://www.w3.org/1999/xhtml"))
      (xpath:first-node (xpath:evaluate "//xhtml:body/*" document))))

CL-USER> (first-child-of-body (chtml:parse "<p>nada</p>" (stp:make-builder)))

=> #.(CXML-STP:ELEMENT :LOCAL-NAME "p" ...)

Going further, have you considered using XSLT?  I know many people
aren't XSLT fans, but for the kind of HTML processing you are describing
I have found it very helpful.

Here is how to convert your two HTML documents to XHTML:

(defun html2xml (in out)
  (with-open-file (s out
                     :direction :output
                     :if-exists :supersede
                     :element-type '(unsigned-byte 8))
    (chtml:parse in (cxml:make-octet-stream-sink s))))

(html2xml "<table><td id='content'>this is page1.html</td></table>"
          #p"page1.xml")
(html2xml "<p>this is page2.html</p>"
          #p"page2.xml")

Then you can use Xuriella XSLT to combine them into a single HTML
document:

(xuriella:apply-stylesheet #p"test.xsl" #p"page2.xml")

=> "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head><body><p>this is page2.html</p><td id=\"content\">this is page1.html</td></body></html>"

Note that Xuriella uses closure-html's serializer to generate the HTML.

The stylesheet test.xsl could look like this:

----------------------------------------------------------------------
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	       xmlns:xhtml="http://www.w3.org/1999/xhtml"
	       version="1.0">

  <!-- copy everything, but strip the XHTML namespace -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>

  <!-- insert the TD from page1 into body -->
  <xsl:template match="xhtml:body">
    <body>
      <xsl:apply-templates select="@*|node()"/>
      <xsl:apply-templates select="document('page1.xml', .)
				   //xhtml:td[@id = 'content']"/>
    </body>
  </xsl:template>
</xsl:transform>
----------------------------------------------------------------------

d.

David Lichteblau