Parsing big XML files with klacks and sbcl
I am trying to parse a big XML file (several GB) and I am using klacks because of the size. However, there seems to be a leak during parsing: memory use increases continuously until SBCL runs out of memory. What am I missing?

Regards,
Mark

Some info:

$ sbcl --version
SBCL 1.4.8

The script:

(ql:quickload 'cxml)
(defparameter *src* (cxml:make-source (pathname "huge.xml")))
(loop while t do (klacks:consume *src*))

A (room t) call when breaking to the debugger:

0] (gc)
; No debug variables for current frame: using EVAL instead of EVAL-IN-FRAME.
NIL
0] (room t)
; No debug variables for current frame: using EVAL instead of EVAL-IN-FRAME.
Dynamic space usage is: 231,363,600 bytes.
Immobile space usage is: 15,866,480 bytes (116,720 bytes overhead).
Read-only space usage is: 0 bytes.
Static space usage is: 704 bytes.
Control stack usage is: 9,648 bytes.
Binding stack usage is: 2,064 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Summary of spaces: dynamic immobile static
  CONS: 198,982,960 bytes, 12,436,435 objects, 100% dynamic.
  CODE: 13,755,264 bytes, 22,368 objects, 100% immobile, 0% dynamic.
  SIMPLE-VECTOR: 10,533,136 bytes, 80,217 objects, 100% dynamic.
  INSTANCE: 7,169,776 bytes, 126,568 objects, 2% immobile, 98% dynamic.
  SIMPLE-ARRAY-UNSIGNED-BYTE-64: 3,423,232 bytes, 1,867 objects, 100% dynamic.
  SIMPLE-ARRAY-UNSIGNED-BYTE-8: 2,874,208 bytes, 39,494 objects, 100% dynamic.
  SIMPLE-BASE-STRING: 2,031,264 bytes, 40,187 objects, 100% dynamic.
  SYMBOL: 1,778,320 bytes, 37,048 objects, 0% static, 67% immobile, 33% dynamic.
  BIGNUM: 1,327,760 bytes, 40,630 objects, 100% dynamic.
  SIMPLE-CHARACTER-STRING: 1,156,736 bytes, 13,489 objects, 100% dynamic.
  SIMPLE-ARRAY-UNSIGNED-BYTE-32: 888,800 bytes, 24,915 objects, 100% dynamic.
  FDEFN: 663,360 bytes, 20,730 objects, 0% static, 100% immobile.
  CLOSURE: 595,344 bytes, 16,722 objects, 100% dynamic.
  SIMPLE-ARRAY-UNSIGNED-BYTE-16: 588,608 bytes, 4,832 objects, 100% dynamic.
  SIMPLE-ARRAY-UNSIGNED-BYTE-31: 243,616 bytes, 4 objects, 100% dynamic.
  SIMPLE-ARRAY-SIGNED-BYTE-8: 196,208 bytes, 6,131 objects, 100% dynamic.
  FUNCALLABLE-INSTANCE: 160,368 bytes, 4,149 objects, 44% immobile, 56% dynamic.
  SIMPLE-BIT-VECTOR: 44,544 bytes, 100 objects, 100% dynamic.
  SIMPLE-ARRAY-SIGNED-BYTE-16: 15,072 bytes, 208 objects, 100% dynamic.
  SIMPLE-ARRAY-SIGNED-BYTE-32: 8,096 bytes, 194 objects, 100% dynamic.
  VALUE-CELL: 5,584 bytes, 349 objects, 100% dynamic.
  SIMPLE-ARRAY-FIXNUM: 2,960 bytes, 7 objects, 100% dynamic.
  ARRAY-HEADER: 2,208 bytes, 28 objects, 100% dynamic.
  RATIO: 1,024 bytes, 32 objects, 100% dynamic.
  DOUBLE-FLOAT: 704 bytes, 44 objects, 100% dynamic.
  WEAK-POINTER: 448 bytes, 14 objects, 100% dynamic.
  SAP: 256 bytes, 16 objects, 100% dynamic.
  SIMPLE-ARRAY-UNSIGNED-BYTE-2: 96 bytes, 2 objects, 100% dynamic.
  SIMPLE-ARRAY-UNSIGNED-FIXNUM: 80 bytes, 3 objects, 100% dynamic.
  COMPLEX-DOUBLE-FLOAT: 64 bytes, 2 objects, 100% dynamic.
  COMPLEX: 32 bytes, 1 object, 100% dynamic.
  COMPLEX-SINGLE-FLOAT: 32 bytes, 2 objects, 100% dynamic.
  SIMD-PACK: 32 bytes, 1 object, 100% dynamic.
  SIMPLE-ARRAY-NIL: 32 bytes, 2 objects, 100% dynamic.
  SIMPLE-ARRAY-UNSIGNED-BYTE-4: 16 bytes, 1 object, 100% dynamic.
  SIMPLE-ARRAY-UNSIGNED-BYTE-7: 16 bytes, 1 object, 100% dynamic.
  SIMPLE-ARRAY-UNSIGNED-BYTE-15: 16 bytes, 1 object, 100% dynamic.
  SIMPLE-ARRAY-UNSIGNED-BYTE-63: 16 bytes, 1 object, 100% dynamic.
  SIMPLE-ARRAY-SIGNED-BYTE-64: 16 bytes, 1 object, 100% dynamic.
  SIMPLE-ARRAY-SINGLE-FLOAT: 16 bytes, 1 object, 100% dynamic.
  SIMPLE-ARRAY-DOUBLE-FLOAT: 16 bytes, 1 object, 100% dynamic.
  SIMPLE-ARRAY-COMPLEX-SINGLE-FLOAT: 16 bytes, 1 object, 100% dynamic.
  SIMPLE-ARRAY-COMPLEX-DOUBLE-FLOAT: 16 bytes, 1 object, 100% dynamic.
  Summary total: 246,450,368 bytes, 12,916,800 objects.

Top 10 dynamic instance types:
  COMPILED-DEBUG-FUN 1,534,912 bytes, 23,983 objects.
  COMPILED-DEBUG-FUN-EXTERNAL 1,468,160 bytes, 22,940 objects.
  COMPILED-DEBUG-INFO 986,016 bytes, 20,542 objects.
  DEFINITION-SOURCE-LOCATION 241,056 bytes, 7,533 objects.
  FAST-METHOD-CALL 239,952 bytes, 4,999 objects.
  SLOT-INFO 220,272 bytes, 4,589 objects.
  VOP-PARSE 190,624 bytes, 851 objects.
  VOP-INFO 183,456 bytes, 819 objects.
  FUN-TYPE 155,520 bytes, 1,620 objects.
  ARG-INFO 140,928 bytes, 1,468 objects.
  Other types 1,691,632 bytes, 36,184 objects.
  Dynamic instance total 7,052,528 bytes, 125,528 objects.

Top 10 immobile instance types:
  LAYOUT 119,616 bytes, 1,068 objects.
  PACKAGE 4,224 bytes, 33 objects.
  Immobile instance total 123,840 bytes, 1,101 objects.

Top 10 static instance types:
  Static instance total 0 bytes, 0 objects.
> (loop while t do (klacks:consume *src*))
try adding (sb-ext:gc :full t) inside the loop. if that helps, then you're overwhelming SBCL's gc algorithm by allocating too much garbage between two gc's (or something along those lines; maybe someone else with more knowledge of the details can elaborate).

-- 
attila lendvai
On Tue, May 29, 2018 at 9:20 PM, Attila Lendvai <attila@lendvai.name> wrote:
> > (loop while t do (klacks:consume *src*))
> try to add (sb-ext:gc :full t) inside the loop. if that helps, then you're overwhelming SBCL's gc algorithm by allocating too much garbage between two gc's (or something along that line, maybe someone else with more knowledge of the details can elaborate).
This indeed keeps the memory usage in check. However, a forced GC on every loop iteration sounds less than ideal. I am a bit surprised that a streaming parser generates so much garbage, considering one of the main use cases is handling large files. Also, I am wondering whether the GC can be configured to run more aggressively without further explicit calls in the rest of the code.
> This indeed keeps the memory usage in check. However a forced gc on every loop sounds less than ideal.
you can play with other options. a forced full gc is slow indeed, but if you know your load characteristics it's not unreasonable to give explicit notifications to the gc when to run and on which generations.
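One way to read the suggestion above is to collect only the youngest generation, and only every so many events, rather than forcing a full GC on each iteration. The following is a minimal sketch under that assumption, not a tested fix for the reported growth. CONSUME-WITH-PERIODIC-GC and its GC-INTERVAL parameter are hypothetical names, and the interval and threshold values are arbitrary. It also stops at the end of the document by checking klacks:peek, which the original loop does not do.

```lisp
(ql:quickload :cxml)

;; Sketch only: CONSUME-WITH-PERIODIC-GC and GC-INTERVAL are hypothetical
;; names introduced for illustration; the default interval is arbitrary.
(defun consume-with-periodic-gc (source &key (gc-interval 100000))
  "Consume every event from the klacks SOURCE, collecting generation 0
every GC-INTERVAL events. Returns the number of events consumed."
  (loop for count from 1
        ;; KLACKS:PEEK returns NIL once there are no more events.
        while (klacks:peek source)
        do (klacks:consume source)
           (when (zerop (mod count gc-interval))
             ;; Cheap collection of the youngest generation only.
             (sb-ext:gc :gen 0))
        finally (return (1- count))))

;; Optionally make automatic collections more frequent by lowering the
;; allocation threshold that triggers a GC. 20 MB is an assumed value.
(setf (sb-ext:bytes-consed-between-gcs) (* 20 1024 1024))
```

It could be called as (klacks:with-open-source (s (cxml:make-source #p"huge.xml")) (consume-with-periodic-gc s)). Whether a generation 0 collection is enough, or a higher generation is needed, would have to be measured against the actual file.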
> I am a bit surprised that a streaming parser generates so much garbage, considering one of the main use cases is handling large files. Also I am wondering if the GC can be configured to run more aggressively without further explicit calls in the rest of the code.
IIRC maybe you can play with the configuration of strings? if you don't need to deal with unicode content, then you can maybe spare some memory by using CL:STRING instead of cxml's own unicode support? i seem to remember something like this in cxml. they are called RUNEs and RODs?

hth,

-- 
attila lendvai
I ran into this problem on some unit tests, because they were churning memory as fast as they could. I created a function that checks the current heap size and, if it is above a certain limit, GCs (up to 3 times) until it is below that limit.

You may find this useful: https://gist.github.com/bobbysmith007/8dd2da4483d32ab0d02d334f8b81f1bc

It contains some details that are maybe not relevant to you, but were helpful in my circumstance (such as clearing caches along the way / logging).

I sort of feel like the GC should be slightly more aggressive about what it's doing. You might also find that simply adding a small sleep between rows is sufficient to get the GC to do it on its own.

Cheers,
Russ Tyndall
Acceleration.net
Programmer
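The heap-threshold approach Russ describes could look roughly like the sketch below. This is not the code from the linked gist. GC-IF-ABOVE and its parameters are hypothetical names, and it relies on SB-KERNEL:DYNAMIC-USAGE, an internal SBCL function that may change between releases.

```lisp
;; Sketch only: GC-IF-ABOVE, LIMIT-BYTES and MAX-ATTEMPTS are hypothetical
;; names introduced for illustration; this is not the gist's implementation.
(defun gc-if-above (limit-bytes &key (max-attempts 3))
  "Run up to MAX-ATTEMPTS full GCs while dynamic-space usage exceeds
LIMIT-BYTES. Returns the dynamic-space usage in bytes afterwards."
  (loop repeat max-attempts
        ;; SB-KERNEL:DYNAMIC-USAGE is internal to SBCL (assumed stable here).
        while (> (sb-kernel:dynamic-usage) limit-bytes)
        do (sb-ext:gc :full t))
  (sb-kernel:dynamic-usage))
```

Inside the parsing loop it could be called every few thousand events, for example as (gc-if-above (floor (sb-ext:dynamic-space-size) 2)), so that a full GC only runs once the heap is more than half full. The one-half factor is an arbitrary choice.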
participants (3)
- Attila Lendvai
- Mark Janssen
- Russ Tyndall