Peter Graves pointed me at the language/implementation efficiency shootout at http://www.ffconsultancy.com/languages/ray_tracer/benchmark.html, which implements a ray tracer in a multitude of languages, one of them being Common Lisp.
There are a number of rows, which I conveniently name version 1 (row 1) through version 5 (last row). Peter points out that version 1 isn't optimized at all: all functions lead to function calls (no inlining).
When running this code, it takes around 1440 seconds on my PC. Then I modified the stack management routine to re-use available stack frames (and conses), instead of creating new ones on every call. This took execution time down to around 1240 seconds. A good improvement, I'd say.
However, with the change described above, stack frames won't ever become garbage anymore. My question is: does the loss of GC-ability outweigh the performance gain?
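To make the re-use idea concrete, here is a minimal sketch in Java. The names (Frame, FrameStack) and their fields are illustrative assumptions made up for this mail, not the actual patch:

final class Frame {
    Object operator;   // what was called
    Object[] args;     // with which arguments
    Frame next;        // link to the frame below; doubles as the free-list link
}

final class FrameStack {
    private Frame top;   // live frames
    private Frame free;  // popped frames parked for re-use instead of becoming garbage

    void push(Object operator, Object[] args) {
        Frame f = free;
        if (f != null)
            free = f.next;    // recycle an old frame object: no allocation
        else
            f = new Frame();  // free list empty: allocate as before
        f.operator = operator;
        f.args = args;
        f.next = top;
        top = f;
    }

    void pop() {
        Frame f = top;
        top = f.next;
        f.operator = null;    // clear the contents so they, at least, stay collectible
        f.args = null;
        f.next = free;        // park the frame object itself for the next push
        free = f;
    }
}

The catch, as noted above, is that the free list itself never shrinks, so the frame objects of the deepest call ever made stay live.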
Please note that it's possible to run without a Lisp stack (but that also means without backtraces) by compiling code with a (SPEED 3) declaration.
Comments?
Bye,
Erik.
On Tue, Jul 21, 2009 at 1:07 PM, Erik Huelsmann <ehuels@gmail.com> wrote:
> When running this code, it takes around 1440 seconds on my PC. Then I modified the stack management routine to re-use available stack frames (and conses), instead of creating new ones on every call. This took execution time down to around 1240 seconds. A good improvement, I'd say. However, with the change described above, stack frames won't ever become garbage anymore. My question is: does the loss of GC-ability outweigh the performance gain?
How many frames will then hang around? If I make a deep call, and never do another deep call, will I then have frames lurking around from the deep call?
On Tue, Jul 21, 2009 at 12:45 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
> On Tue, Jul 21, 2009 at 1:07 PM, Erik Huelsmann <ehuels@gmail.com> wrote:
>> When running this code, it takes around 1440 seconds on my PC. Then I modified the stack management routine to re-use available stack frames (and conses), instead of creating new ones on every call. This took execution time down to around 1240 seconds. A good improvement, I'd say. However, with the change described above, stack frames won't ever become garbage anymore. My question is: does the loss of GC-ability outweigh the performance gain?
> How many frames will then hang around? If I make a deep call, and never do another deep call, will I then have frames lurking around from the deep call?
In the current implementation: yes. However, with this improvement, there's probably some room for keeping a very small statistic of some kind which would allow building a shorter list of stack frames (i.e., dispose of all / a few). Of course, that would undo some of the performance gain.
Do you have ideas? I thought of using a weak reference, meaning GC will collect the memory if it needs it; other than that, I certainly think setting an upper bound won't fix the "keep hanging around" issue.
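As a sketch of that idea: Java's SoftReference is arguably a better fit than WeakReference for a cache, because the collector only clears soft references when it actually needs the memory. The holder class below is hypothetical:

import java.lang.ref.SoftReference;
import java.util.ArrayList;

// Hypothetical: reach the recycled-frame list only through a SoftReference,
// so the GC may reclaim the whole cache under memory pressure.
final class FrameCacheHolder {
    private SoftReference<ArrayList<Object>> cacheRef =
            new SoftReference<ArrayList<Object>>(new ArrayList<Object>());

    ArrayList<Object> cache() {
        ArrayList<Object> c = cacheRef.get();
        if (c == null) {   // cleared by the GC: start over with an empty cache
            c = new ArrayList<Object>();
            cacheRef = new SoftReference<ArrayList<Object>>(c);
        }
        return c;
    }
}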
Oh, there's one case where they do get cleaned up even in my current implementation: when the thread gets destroyed. For separate task threads that means relatively soon; for the main thread that equals 'never'.
Bye,
Erik.
On Tue, Jul 21, 2009 at 2:23 PM, Erik Huelsmann <ehuels@gmail.com> wrote:
>> How many frames will then hang around? If I make a deep call, and never do another deep call, will I then have frames lurking around from the deep call?
> In the current implementation: yes. However, with this improvement, there's probably some room for keeping a very small statistic of some kind which would allow building a shorter list of stack frames (i.e., dispose of all / a few). Of course, that would undo some of the performance gain. Do you have ideas? I thought of using a weak reference, meaning GC will collect the memory if it needs it; other than that, I certainly think setting an upper bound won't fix the "keep hanging around" issue.
I don't have a detailed implementation idea in mind, but on the general level, the stack-frame cache should be
1) per-thread (in order to avoid synchronisation overhead)
2) chunked-allocated (set aside a bunch; when all are unwound, remove a chunk; before that, mark them unused for re-use)
I don't yet know what would be a proper data structure for 2).
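One possible shape for 1) and 2), sketched in Java below; all names, the chunk size of 64, and the Frame contents are illustrative assumptions, not a worked-out design:

final class FrameChunkStack {
    private static final int CHUNK_SIZE = 64;   // example value only

    static final class Frame {
        Object operator;
        Object[] args;
    }

    private static final class Chunk {
        final Frame[] slots = new Frame[CHUNK_SIZE];
        Chunk previous;                  // the chunk below this one on the stack
    }

    // One instance per thread, so no synchronisation is needed.
    private static final ThreadLocal<FrameChunkStack> STACK =
            new ThreadLocal<FrameChunkStack>() {
                protected FrameChunkStack initialValue() {
                    return new FrameChunkStack();
                }
            };

    static FrameChunkStack current() { return STACK.get(); }

    private Chunk top = new Chunk();
    private int index = 0;               // next free slot in the top chunk

    // Returns a frame for the new call, recycling the object an earlier
    // pop() left behind in this slot whenever there is one.
    Frame push() {
        if (index == CHUNK_SIZE) {       // top chunk full: chain in a new one
            Chunk c = new Chunk();
            c.previous = top;
            top = c;
            index = 0;
        }
        Frame f = top.slots[index];
        if (f == null)
            f = top.slots[index] = new Frame();
        index++;
        return f;
    }

    // Callers must not pop an empty stack.
    void pop() {
        if (index == 0) {                // top chunk fully unwound: unlink it,
            top = top.previous;          // which makes it eligible for GC
            index = CHUNK_SIZE;
        }
        --index;                         // the slot keeps its Frame for re-use
    }
}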
On Tue, Jul 21, 2009 at 2:33 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
> 1) per-thread (in order to avoid synchronisation overhead)
> 2) chunked-allocated (set aside a bunch; when all are unwound, remove a chunk; before that, mark them unused for re-use)
> I don't yet know what would be a proper data structure for 2).
ArrayDeque would do, but that's Java 1.6. Vector would otherwise do, but it's synchronised. I want something like C++'s std::deque, but I'm not sure an equivalent is available in Java 1.5.
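For reference, this is roughly what ArrayDeque usage would look like if Java 1.6 were acceptable (purely illustrative, given the 1.5 constraint):

import java.util.ArrayDeque;
import java.util.Deque;

// Java 1.6+ only: ArrayDeque as an unsynchronised, array-backed LIFO stack.
public class DequeDemo {
    public static void main(String[] args) {
        Deque<Object> frames = new ArrayDeque<Object>();
        frames.push("frame-1");           // amortised O(1), no locking
        frames.push("frame-2");
        System.out.println(frames.pop()); // prints "frame-2"
    }
}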
On Tue, Jul 21, 2009 at 1:50 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
> On Tue, Jul 21, 2009 at 2:33 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
>> 1) per-thread (in order to avoid synchronisation overhead)
Absolutely. It's like that at this moment.
>> 2) chunked-allocated (set aside a bunch; when all are unwound, remove a chunk; before that, mark them unused for re-use)
>> I don't yet know what would be a proper data structure for 2).
I like that idea. We have no idea about a "normal" stack depth, do we? In that case, especially since the overhead of a single pointer seems small, we should probably use a somewhat large number (like 4k?) until people start to complain?
I think the right structure to use for (2) would be a linked list of arrays. (How about using Conses to store it?)
> ArrayDeque would do, but that's Java 1.6. Vector would otherwise do, but it's synchronised. I want something like C++'s std::deque, but I'm not sure an equivalent is available in Java 1.5.
bye,
Erik.
>> 2) chunked-allocated (set aside a bunch; when all are unwound, remove a chunk; before that, mark them unused for re-use)
> I like that idea. We have no idea about a "normal" stack depth, do we?
Not really. This depends on how big our frame objects are, I think.
> In that case, especially since the overhead of a single pointer seems small, we should probably use a somewhat large number (like 4k?) until people start to complain?
Four thousand frames per chunk? Sounds a bit large off-hand.
> I think the right structure to use for (2) would be a linked list of arrays. (How about using Conses to store it?)
Yeah, that sounds doable.
On Tue, Jul 21, 2009 at 2:21 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
>>> 2) chunked-allocated (set aside a bunch; when all are unwound, remove a chunk; before that, mark them unused for re-use)
>> I like that idea. We have no idea about a "normal" stack depth, do we?
> Not really. This depends on how big our frame objects are, I think.
>> In that case, especially since the overhead of a single pointer seems small, we should probably use a somewhat large number (like 4k?) until people start to complain?
> Four thousand frames per chunk? Sounds a bit large off-hand.
How does the Lisp stack relate to the Java stack, if they're related at all? I don't know the maximum number of stack frames in Java; maybe it's not even fixed. But I doubt I have ever seen stack traces longer than a few hundred frames.
>> I think the right structure to use for (2) would be a linked list of arrays. (How about using Conses to store it?)
> Yeah, that sounds doable.
Conses can only store LispObjects IIRC, and thus an array would need to be wrapped... There's java.util.Stack since 1.4.2; it extends Vector, so it might do?
On Tue, Jul 21, 2009 at 3:59 PM, Alessio Stalla <alessiostalla@gmail.com> wrote:
> Conses can only store LispObjects IIRC, and thus an array would need to be wrapped... There's java.util.Stack since 1.4.2; it extends Vector, so it might do?
Vector, and by extension Stack, is synchronized. That's an overhead we don't want to pay, as small as it may be.
On Tue, Jul 21, 2009 at 3:44 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
> On Tue, Jul 21, 2009 at 3:59 PM, Alessio Stalla <alessiostalla@gmail.com> wrote:
>> Conses can only store LispObjects IIRC, and thus an array would need to be wrapped... There's java.util.Stack since 1.4.2; it extends Vector, so it might do?
> Vector, and by extension Stack, is synchronized. That's an overhead we don't want to pay, as small as it may be.
Yes, I had overlooked that...
Ale
I'm kinda new here, and maybe I'm not fully grasping what you mean...
But there are considerable advantages to having a reference become eligible for collection the instant it stops being reachable, and cause for concern when a hold on it can persist in an invisible form.
I have worked in code where a reference to a sizeable structure ends up at the top of a call stack. Having it persist would basically be a leak (especially if some derived duplicate were made). Having this effect occur at some chunk granularity would make it nondeterministic (e.g. testing the same code fragment in a vacuum would not reproduce the leak).
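To put the concern in (hypothetical) code: if a recycled frame keeps its last argument reference, whatever that argument points at stays strongly reachable long after the call has returned:

// Hypothetical illustration: a frame object parked for re-use still holds
// its old argument, so the 64 MB buffer below never becomes collectible.
final class RecycledFrame {
    Object[] args;
}

public class InvisibleHold {
    static final RecycledFrame cached = new RecycledFrame();

    static void deepCall() {
        cached.args = new Object[] { new byte[64 * 1024 * 1024] };
        // The call returns; the frame is "popped" but parked for re-use.
        // Without an explicit cached.args = null here, the buffer leaks.
    }

    public static void main(String[] args) {
        deepCall();
        System.gc();  // the byte[] is still strongly reachable via 'cached'
    }
}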
On Tue, Jul 21, 2009 at 4:03 PM, Matt Lamari <matt.lamari@gmail.com> wrote:
> I'm kinda new here, and maybe I'm not fully grasping what you mean...
There are no problems for which ASCII graphics don't provide a solution/explanation. :)
Let's say we have a call stack of three functions, i.e. we have three frames:
[FFFUUUUUUU]
I'm assuming we have 10 frames per chunk; that's for explanation only, just an example value. F is a frame object, U is an unused placeholder. We may create the frame objects when allocating the chunk, or we may do it in a delayed fashion; that's something an experiment will reveal. If we make 8 more calls, we would get
[FFFFFFFFFF][FUUUUUUUUU]
So we needed to allocate a new chunk.
Now, when we unwind one frame, we get
[FFFFFFFFFF][RUUUUUUUUU]
where R is a frame object that can be recycled. So, if we call a function there, we get
[FFFFFFFFFF][FUUUUUUUUU]
we re-used an existing frame object, without having to recreate it. By tuning the chunk granularity in a reasonable manner, we get bulk allocation for the chunk, which helps the speed, and we don't need to reallocate when a frame object can be recycled. Moreover, we don't need to hold on to the frame objects forever, as this is an intermediate compromise solution. In theory, this should result in not having leaks, and having fewer allocator calls.
For garbage collecting, if we unwind the last example by, let's say, two frames, we get
[FFFFFFFFFR]
where we get rid of the second chunk and recycle the first. Thus the second chunk becomes collected at this point. This obviously means that if there are many calls at the chunk boundary, we allocate and collect the second chunk continuously. No solution is perfect, I guess.
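In terms of the hypothetical FrameChunkStack sketched earlier in this thread (pretending its chunk size were 10), the diagrams correspond to this sequence:

// Walk-through of the diagrams above, assuming CHUNK_SIZE = 10.
public class DiagramWalkthrough {
    public static void main(String[] args) {
        FrameChunkStack s = FrameChunkStack.current();
        for (int i = 0; i < 3; i++) s.push();  // [FFFUUUUUUU]
        for (int i = 0; i < 8; i++) s.push();  // [FFFFFFFFFF][FUUUUUUUUU]
        s.pop();                               // [FFFFFFFFFF][RUUUUUUUUU]
        s.push();                              // [FFFFFFFFFF][FUUUUUUUUU], frame re-used
        s.pop();                               // back to the chunk boundary...
        s.pop();                               // [FFFFFFFFFR], second chunk collectible
    }
}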
Ville Voutilainen writes:
> where we get rid of the second chunk and recycle the first. Thus the second chunk becomes collected at this point. This obviously means that if there are many calls at the chunk boundary, we allocate and collect the second chunk continuously. No solution is perfect, I guess.
To remedy this problem, you typically make a new chunk twice as large as the last chunk. And you only deallocate if the actually consumed capacity falls below 1/4 of the total capacity. This results in amortized constant time for insertion/deletion.
You don't have to hack that up from scratch, btw: it's what Java's ArrayList does. (Which is not synchronized.)
-T.
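A sketch of the policy Tobias describes, with made-up names (note that the actual growth factor and shrink behaviour of ArrayList may differ across JDK versions):

// Hypothetical grow/shrink stack: double the capacity when full, halve it
// when occupancy drops below a quarter; amortised O(1) push/pop.
final class GrowShrinkStack {
    private Object[] slots = new Object[8];
    private int size = 0;

    void push(Object o) {
        if (size == slots.length)
            resize(slots.length * 2);      // grow: twice as large
        slots[size++] = o;
    }

    // Callers must not pop an empty stack.
    Object pop() {
        Object o = slots[--size];
        slots[size] = null;                // don't retain popped contents
        if (size > 0 && size < slots.length / 4)
            resize(slots.length / 2);      // shrink only below 1/4 occupancy
        return o;
    }

    private void resize(int capacity) {
        Object[] copy = new Object[capacity];
        System.arraycopy(slots, 0, copy, 0, size);
        slots = copy;
    }
}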
On Tue, Jul 21, 2009 at 5:52 PM, Tobias C. Rittweiler <tcr@freebits.de> wrote:
> You don't have to hack that up from scratch, btw: it's what Java's ArrayList does. (Which is not synchronized.)
That's great - I find it a bit hard to find descriptions of how the java.util containers handle allocation.
Ville Voutilainen wrote:
> On Tue, Jul 21, 2009 at 5:52 PM, Tobias C. Rittweiler <tcr@freebits.de> wrote:
>> You don't have to hack that up from scratch, btw: it's what Java's ArrayList does. (Which is not synchronized.)
> That's great - I find it a bit hard to find descriptions of how the java.util containers handle allocation.
Reading the Java source is about the only way to learn implementation details like that. I think it is all in Java, so you won't run into native methods.
On Tue, Jul 21, 2009 at 4:52 PM, Tobias C. Rittweiler <tcr@freebits.de> wrote:
> Ville Voutilainen writes:
>> where we get rid of the second chunk and recycle the first. Thus the second chunk becomes collected at this point. This obviously means that if there are many calls at the chunk boundary, we allocate and collect the second chunk continuously. No solution is perfect, I guess.
> To remedy this problem, you typically make a new chunk twice as large as the last chunk. And you only deallocate if the actually consumed capacity falls below 1/4 of the total capacity. This results in amortized constant time for insertion/deletion.
> You don't have to hack that up from scratch, btw: it's what Java's ArrayList does. (Which is not synchronized.)
Ok. I can easily hack this into my current pending patch. It would also eliminate the need to create a Cons for every stack frame, because I wouldn't be storing the elements in a linked list anymore.
Let's see what this does for performance when I finish later today.
Bye,
Erik.
Parameterized Lisp calls vs. array calls. In the code I added an interface "barrier", just in case there was some inlining.
/*********************** START JAVA ***********************/
package benchtest;

interface ISomeFunParams { int SomeFun(int i, int j); }

interface ISomeFunArray { int SomeFun(int[] ij); }

interface ISomeFunVarArgs { int SomeFun(int... ij); }

class BenchArrays1 {
    public static void main(String[] args) {
        ISomeFunParams sumFun = new SomeFunImpl1();
        int res = 0;
        for (int i = 0; i < 100000; i++)
            for (int j = 0; j < 100000; j++) {
                res += sumFun.SomeFun(i, j);
            }
        // make sure we care about 'res'
        int foo = res;
        if (foo > 0) {
            res--;
        }
    }
}

class BenchArrays2 {
    public static void main(String[] args) {
        ISomeFunVarArgs sumFun = new SomeFunImpl2();
        int res = 0;
        for (int i = 0; i < 100000; i++)
            for (int j = 0; j < 100000; j++) {
                res += sumFun.SomeFun(i, j);
            }
        // make sure we care about 'res'
        int foo = res;
        if (foo > 0) {
            res--;
        }
    }
}

class BenchArrays3 {
    public static void main(String[] args) {
        ISomeFunArray sumFun = new SomeFunImpl3();
        int res = 0;
        for (int i = 0; i < 100000; i++)
            for (int j = 0; j < 100000; j++) {
                res += sumFun.SomeFun(new int[]{i, j});
            }
        // make sure we care about 'res'
        int foo = res;
        if (foo > 0) {
            res--;
        }
    }
}

class SomeFunImpl1 implements ISomeFunParams {
    public int SomeFun(int i, int j) { return i + j; }
}

class SomeFunImpl2 implements ISomeFunVarArgs {
    public int SomeFun(int... ij) { return ij[0] + ij[1]; }
}

class SomeFunImpl3 implements ISomeFunArray {
    public int SomeFun(int[] ij) { return ij[0] + ij[1]; }
}
/*********************** END JAVA ***********************/
[root@titan ArrayBoundsCheckBenchmarks]# time java -cp bin/ benchtest.BenchArrays1

real    0m0.113s
user    0m0.040s
sys     0m0.013s

[root@titan ArrayBoundsCheckBenchmarks]# time java -cp bin/ benchtest.BenchArrays2

real    1m42.898s
user    1m21.279s
sys     0m0.742s

[root@titan ArrayBoundsCheckBenchmarks]# time java -cp bin/ benchtest.BenchArrays3

real    1m40.823s
user    1m21.411s
sys     0m0.651s

(Note: 100000 x 100000 = 10^10 iterations finishing in 0.113s means the JIT must have eliminated the BenchArrays1 loop entirely, while the varargs and array versions allocate a fresh int[] on every call.)