Hello all,
For a project I need to extract some information from a web page and link it with a database, and I have tried several HTML parsers for CL. I would like to use Closure HTML for the task because of its XPath add-ons, but I have a problem parsing my HTML source with it. An example page showing the problem is the HTML source of
http://yellow.local.ch/de/q?ext=1&name=&company=Berufsschule,+Fachsc...
from which I need to extract some of the address/street/phone data. It seems that none of the HTML/XML parsers for CL parse it correctly. Most of the parts missing from the parsed representation come from the HTML source that defines a JavaScript element, which in turn contains HTML as a string parameter in the embedded JS. That is, of course, exactly the text I need from the site :)
I could extract the data with some regexps, but that is really clumsy; if possible, I would like to stay within the HTML/JS-parse/STP/XPath data representation.
I have looked at the source of Closure HTML to try to help, but the project seems to have old and deep roots in SGML, and I don't want to introduce errors there. I don't know SGML, and even if I found the places that need correcting for HTML, the change might break something in the SGML part of Closure HTML. I also think it would take me quite some time to understand the inner workings of the parsers in Closure HTML, so I would greatly appreciate any help getting it to work with the site described above.
With best regards Plamen
Quoting Plamen . (plamen.usenet@gmail.com):
http://yellow.local.ch/de/q?ext=1&name=&company=Berufsschule,+Fachsc...
from which I need to extract some of the address/street/phone data. It seems that none of the HTML/XML parsers for CL parse it correctly. Most of the parts missing from the parsed representation come from the HTML source that defines a JavaScript element, which in turn contains HTML as a string parameter in the embedded JS. That is, of course, exactly the text I need from the site :)
Please reduce the source code of the page to a self-contained example and point out exactly which part of the document it is that gets discarded.
d.
FYI: that page is heavily transformed by its JavaScript. But this works, after a fashion. (Match is from fare-matcher.)
(defun foobar ()
  (let* ((url "http://yellow.local.ch/de/q?ext=1&name=&company=Berufsschule,+Fachsc...")
         (page (http-request-with-cache url))
         (doc (chtml:parse page (chtml:make-lhtml-builder)))
         (result ()))
    (labels ((recure (node)
               (typecase node
                 (list
                  (match node
                    (`(:div ((:class ,c)) ,@children)
                      (when (string= "entrybox" c)
                        (push node result))))
                  (map nil #'recure (cddr node))))))
      (recure doc)
      (nreverse result))))
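For anyone who wants to try the same traversal without fare-matcher or network access, here is a minimal sketch in plain Common Lisp. It assumes the LHTML shape that the pattern above relies on, (tag attribute-list child ...), with attributes as (name value) lists. The names lhtml-attribute, collect-divs-by-class and *sample* are made up for illustration, and the sample tree is hand-written, not taken from the page. Unlike the pattern above, it also accepts divs that carry attributes besides class.

```lisp
;; Sketch: collect every <div class="entrybox"> from an LHTML tree
;; using only standard Common Lisp (no fare-matcher, no HTTP).

(defun lhtml-attribute (node name)
  "Return the value of attribute NAME (a keyword) on LHTML element NODE, or NIL."
  (second (assoc name (second node))))

(defun collect-divs-by-class (tree class)
  "Return, in document order, all :DIV elements in TREE whose class equals CLASS."
  (let ((result '()))
    (labels ((walk (node)
               ;; Text nodes are strings; only conses are treated as elements.
               (when (consp node)
                 (when (and (eq (first node) :div)
                            (equal (lhtml-attribute node :class) class))
                   (push node result))
                 ;; Children follow the tag name and the attribute list.
                 (mapc #'walk (cddr node)))))
      (walk tree))
    (nreverse result)))

;; Hand-written sample tree (hypothetical data, not from the real page).
(defparameter *sample*
  '(:html ()
    (:body ()
     (:div ((:id "a") (:class "entrybox")) "Schule A" (:span () "Tel. 000"))
     (:div ((:class "other")) "skip me")
     (:div ((:class "entrybox")) "Schule B"))))

;; Returns the first and the third div of the sample, in that order.
(collect-divs-by-class *sample* "entrybox")
```

A tree built by hand like this can also serve as a reduced test case: if the walker finds the nodes here but not in the parsed page, the difference lies in what the parser produced.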
Possibly you were using an XML parser rather than an HTML parser.
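If the goal is to stay in the STP/XPath representation, the same HTML parser can feed an STP builder instead of the LHTML one. The following is only a sketch under some assumptions: Quicklisp is available, the systems are closure-html, cxml-stp and xpath (Plexippus), and, as far as I know, the HTML parser puts elements into the XHTML namespace, which is why the query uses a prefix. The names *html* and entrybox-strings and the sample markup are invented for illustration.

```lisp
;; Assumes Quicklisp is installed and can load these three systems.
(ql:quickload '(:closure-html :cxml-stp :xpath))

;; Hypothetical sample markup, not taken from the real page.
(defparameter *html*
  "<html><body><div class=\"entrybox\"><span class=\"phone\">000 000 00 00</span></div><div class=\"other\">skip</div></body></html>")

(defun entrybox-strings (html)
  "Parse HTML with Closure HTML into an STP document and return the
string value of every div whose class attribute is exactly entrybox."
  (let ((doc (chtml:parse html (cxml-stp:make-builder))))
    ;; Assumption: elements end up in the XHTML namespace, so bind a prefix.
    (xpath:with-namespaces (("x" "http://www.w3.org/1999/xhtml"))
      (xpath:map-node-set->list
       #'xpath:string-value
       (xpath:evaluate "//x:div[@class='entrybox']" doc)))))

(entrybox-strings *html*)
```

If the wanted markup really sits inside a JavaScript string in a script element, one possible route is to select that script's text the same way, cut out and unescape the string, and hand it to a second chtml:parse call. Whether that fits depends on how the page builds its content, which is not visible from this thread.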
On Dec 23, 2009, at 8:16 AM, Plamen . wrote:
[...]