Hello all,
For a project I need to extract some information from a web page and link it with a database, and I have tried several HTML parsers for CL. I would like to use Closure HTML for the task because of its add-ons for XPath, but I have a problem parsing my HTML source with it. An example page showing the problem is the HTML source of
http://yellow.local.ch/de/q?ext=1&name=&company=Berufsschule,+Fachsc...
from which I need to extract some of the address/street/phone data. It seems that none of the HTML/XML parsers for CL can parse it correctly, and most of the parts missing from the parsed representation are the ones where the HTML source defines a JavaScript element that itself includes HTML as a string parameter in the embedded JS. That is, of course, exactly the text I need from the site :)
Of course I could extract the data using some regexps, but that is really clumsy, and if possible it would be nice to stay in the HTML/JS-parse/STP/XPath data representation.
I've looked into the source of Closure HTML to try to help, but the project seems to have pretty old and deep roots in SGML, and I don't want to introduce errors there. I don't know SGML, and even if I found the places that need correcting for HTML, the change might break something in the SGML part of Closure HTML. I also think it would take me quite a while to understand the inner workings of the parsers in Closure HTML, so I would greatly appreciate any help getting it to work with the site described above.
With best regards,
Plamen