Hi.
I've noticed that parsing XML that contains many entity references may cause CXML to eat up all available heap space and crash. This can be easily reproduced: (cxml:parse-rod (format nil "<xml>~{~a~}</xml>" (loop repeat 4000 collect "<p>zz")) (cxml-dom:make-dom-builder)) (tested on SBCL / Linux)
Looking at the code, I've noticed that xml/xml-parse.lisp relies on the tail recursion a lot (though always having proper tail recursion is frankly speaking somewhat dubious expectation when working with Common Lisp). What's worse, P/CONTENT function is non-tail-recursive when an entity reference is encountered, and this is exactly the cause of aforementioned crash. I've attached the patch which helped me solve the problem (it's made against cxml-2007-08-05). I've just noticed that recurse-on-entity returns NIL most of time and used that fact to cut off most of recursion. As of now my brain is somewhat foggy due to continuing coding race, so I didn't learn what recurse-on-entity actually does and how often can it return a non-NIL value. Overall I don't think this is a proper solution, most likely xml-parse.lisp needs some serious rework to make it more reliable. But I think at least this hack will help in most cases.
Ivan
Hi,
Quoting Ivan Shvedunov (ivan4th@gmail.com):
I've noticed that parsing XML that contains many entity references may cause CXML to eat up all available heap space and crash. This can be easily reproduced:
thank you for the report. I have replace tail recursion in p/content with a loop and also removed the call to append, which serves no purpose in the current code base.
(In addition, I have fixed string normalization in the DOM builder, reducing the parse time for your example from 9s to around 2s for me.)
Please test cxml CVS again.
Overall I don't think this is a proper solution, most likely xml-parse.lisp needs some serious rework to make it more reliable.
In general, recursion depth in the SAX parser should be at most proportional to the nesting depth of XML elements being parsing (as well as the nesting depth of entity references). This level of recursion should not usually be a problem. Most processing of in-memory models like DOM uses similar algorithms.
If any recursive function exceeds this complexity (in the way p/content did prior to your patch), I would regard that as a bug.
If you find any other similar problems, please report them.
Note that the Klacks parser is available as an alternative to SAX for applications that need stricter guarantees. The klacks parser will only report one event at a time and does not use most of the recursive functions found in the SAX parser.
Thanks, David