Peter Graves pointed me at the language/implementation efficiency shootout at http://www.ffconsultancy.com/languages/ray_tracer/benchmark.html, which implements a ray tracer in a multitude of languages, one of them being Common Lisp.
There are a number of rows, which I conveniently name version 1 (row 1) through version 5 (last row). Peter points out that version 1 isn't optimized at all: all functions lead to function calls (no inlining).
When running this code, it takes around 1440 seconds on my PC. Then I modified the stack management routine to re-use available stack frames (and conses), instead of creating new ones on every call. This took execution time down to around 1240 seconds. A good improvement, I'd say.
However, with the change described above, stack frames won't ever become garbage anymore. My question is: does the loss of GC-ability outweigh the performance gain?
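To make the re-use idea concrete, here is a minimal sketch in Java. The names (Frame, FrameStack) and their fields are illustrative assumptions made up for this mail, not the actual patch:

final class Frame {
    Object operator;   // what was called
    Object[] args;     // with which arguments
    Frame next;        // link to the frame below; doubles as the free-list link
}

final class FrameStack {
    private Frame top;   // live frames
    private Frame free;  // popped frames parked for re-use instead of becoming garbage

    void push(Object operator, Object[] args) {
        Frame f = free;
        if (f != null)
            free = f.next;    // recycle an old frame object: no allocation
        else
            f = new Frame();  // free list empty: allocate as before
        f.operator = operator;
        f.args = args;
        f.next = top;
        top = f;
    }

    void pop() {
        Frame f = top;
        top = f.next;
        f.operator = null;    // clear the contents so they, at least, stay collectible
        f.args = null;
        f.next = free;        // park the frame object itself for the next push
        free = f;
    }
}

The catch, as noted above, is that the free list itself never shrinks, so the frame objects of the deepest call ever made stay live.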
Please note that it's possible to run without a Lisp stack (but that also means without backtraces) by compiling code with a (SPEED 3) declaration.
Comments?
Bye,
Erik.
On Tue, Jul 21, 2009 at 1:07 PM, Erik Huelsmann <ehuels@gmail.com> wrote:
> When running this code, it takes around 1440 seconds on my PC. Then I modified the stack management routine to re-use available stack frames (and conses), instead of creating new ones on every call. This took execution time down to around 1240 seconds. A good improvement, I'd say. However, with the change described above, stack frames won't ever become garbage anymore. My question is: does the loss of GC-ability outweigh the performance gain?
How many frames will then hang around? If I make a deep call, and never do another deep call, will I then have frames lurking around from the deep call?
On Tue, Jul 21, 2009 at 12:45 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
> On Tue, Jul 21, 2009 at 1:07 PM, Erik Huelsmann <ehuels@gmail.com> wrote:
>> When running this code, it takes around 1440 seconds on my PC. Then I modified the stack management routine to re-use available stack frames (and conses), instead of creating new ones on every call. This took execution time down to around 1240 seconds. A good improvement, I'd say. However, with the change described above, stack frames won't ever become garbage anymore. My question is: does the loss of GC-ability outweigh the performance gain?
> How many frames will then hang around? If I make a deep call, and never do another deep call, will I then have frames lurking around from the deep call?
In the current implementation: yes. However, with this improvement, there's probably some room for keeping a very small statistic of some kind which would allow building a shorter list of stack frames (i.e., dispose of all / a few). Of course, that would undo some of the performance gain.
Do you have ideas? I thought of using a weak reference, meaning GC will collect the memory if it needs it; other than that, I certainly think setting an upper bound won't fix the "keep hanging around" issue.
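As a sketch of that idea: Java's SoftReference is arguably a better fit than WeakReference for a cache, because the collector only clears soft references when it actually needs the memory. The holder class below is hypothetical:

import java.lang.ref.SoftReference;
import java.util.ArrayList;

// Hypothetical: reach the recycled-frame list only through a SoftReference,
// so the GC may reclaim the whole cache under memory pressure.
final class FrameCacheHolder {
    private SoftReference<ArrayList<Object>> cacheRef =
            new SoftReference<ArrayList<Object>>(new ArrayList<Object>());

    ArrayList<Object> cache() {
        ArrayList<Object> c = cacheRef.get();
        if (c == null) {   // cleared by the GC: start over with an empty cache
            c = new ArrayList<Object>();
            cacheRef = new SoftReference<ArrayList<Object>>(c);
        }
        return c;
    }
}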
Oh, there's one case where they do get cleaned up even in my current implementation: when the thread gets destroyed. For separate task threads that means relatively soon; for the main thread that equals 'never'.
Bye,
Erik.
On Tue, Jul 21, 2009 at 2:23 PM, Erik Huelsmann <ehuels@gmail.com> wrote:
>> How many frames will then hang around? If I make a deep call, and never do another deep call, will I then have frames lurking around from the deep call?
> In the current implementation: yes. However, with this improvement, there's probably some room for keeping a very small statistic of some kind which would allow building a shorter list of stack frames (i.e., dispose of all / a few). Of course, that would undo some of the performance gain. Do you have ideas? I thought of using a weak reference, meaning GC will collect the memory if it needs it; other than that, I certainly think setting an upper bound won't fix the "keep hanging around" issue.
I don't have a detailed implementation idea in mind, but on the general level, the stack-frame cache should be
1) per-thread (in order to avoid synchronisation overhead)
2) chunked-allocated (set aside a bunch; when all are unwound, remove a chunk; before that, mark them unused for re-use)
I don't yet know what would be a proper data structure for 2).
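One possible shape for 1) and 2), sketched in Java below; all names, the chunk size of 64, and the Frame contents are illustrative assumptions, not a worked-out design:

final class FrameChunkStack {
    private static final int CHUNK_SIZE = 64;   // example value only

    static final class Frame {
        Object operator;
        Object[] args;
    }

    private static final class Chunk {
        final Frame[] slots = new Frame[CHUNK_SIZE];
        Chunk previous;                  // the chunk below this one on the stack
    }

    // One instance per thread, so no synchronisation is needed.
    private static final ThreadLocal<FrameChunkStack> STACK =
            new ThreadLocal<FrameChunkStack>() {
                protected FrameChunkStack initialValue() {
                    return new FrameChunkStack();
                }
            };

    static FrameChunkStack current() { return STACK.get(); }

    private Chunk top = new Chunk();
    private int index = 0;               // next free slot in the top chunk

    // Returns a frame for the new call, recycling the object an earlier
    // pop() left behind in this slot whenever there is one.
    Frame push() {
        if (index == CHUNK_SIZE) {       // top chunk full: chain in a new one
            Chunk c = new Chunk();
            c.previous = top;
            top = c;
            index = 0;
        }
        Frame f = top.slots[index];
        if (f == null)
            f = top.slots[index] = new Frame();
        index++;
        return f;
    }

    // Callers must not pop an empty stack.
    void pop() {
        if (index == 0) {                // top chunk fully unwound: unlink it,
            top = top.previous;          // which makes it eligible for GC
            index = CHUNK_SIZE;
        }
        --index;                         // the slot keeps its Frame for re-use
    }
}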
On Tue, Jul 21, 2009 at 2:33 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
> 1) per-thread (in order to avoid synchronisation overhead)
> 2) chunked-allocated (set aside a bunch; when all are unwound, remove a chunk; before that, mark them unused for re-use)
> I don't yet know what would be a proper data structure for 2).
ArrayDeque would do, but that's Java 1.6. Vector would otherwise do, but it's synchronised. I want something like C++'s std::deque, but I'm not sure an equivalent is available in Java 1.5.
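For reference, this is roughly what ArrayDeque usage would look like if Java 1.6 were acceptable (purely illustrative, given the 1.5 constraint):

import java.util.ArrayDeque;
import java.util.Deque;

// Java 1.6+ only: ArrayDeque as an unsynchronised, array-backed LIFO stack.
public class DequeDemo {
    public static void main(String[] args) {
        Deque<Object> frames = new ArrayDeque<Object>();
        frames.push("frame-1");           // amortised O(1), no locking
        frames.push("frame-2");
        System.out.println(frames.pop()); // prints "frame-2"
    }
}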
On Tue, Jul 21, 2009 at 1:50 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
> On Tue, Jul 21, 2009 at 2:33 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
>> 1) per-thread (in order to avoid synchronisation overhead)
Absolutely. It's like that at this moment.
>> 2) chunked-allocated (set aside a bunch; when all are unwound, remove a chunk; before that, mark them unused for re-use)
>> I don't yet know what would be a proper data structure for 2).
I like that idea. We have no idea about a "normal" stack depth, do we? In that case, especially since the overhead of a single pointer seems small, we should probably use a somewhat large number (like 4k?) until people start to complain?
I think the right structure to use for (2) would be a linked list of arrays. (How about using Conses to store it?)
> ArrayDeque would do, but that's Java 1.6. Vector would otherwise do, but it's synchronised. I want something like C++'s std::deque, but I'm not sure an equivalent is available in Java 1.5.
bye,
Erik.
>> 2) chunked-allocated (set aside a bunch; when all are unwound, remove a chunk; before that, mark them unused for re-use)
> I like that idea. We have no idea about a "normal" stack depth, do we?
Not really. This depends on how big our frame objects are, I think.
> In that case, especially since the overhead of a single pointer seems small, we should probably use a somewhat large number (like 4k?) until people start to complain?
Four thousand frames per chunk? Sounds a bit large off-hand.
> I think the right structure to use for (2) would be a linked list of arrays. (How about using Conses to store it?)
Yeah, that sounds doable.
On Tue, Jul 21, 2009 at 2:21 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
>>> 2) chunked-allocated (set aside a bunch; when all are unwound, remove a chunk; before that, mark them unused for re-use)
>> I like that idea. We have no idea about a "normal" stack depth, do we?
> Not really. This depends on how big our frame objects are, I think.
>> In that case, especially since the overhead of a single pointer seems small, we should probably use a somewhat large number (like 4k?) until people start to complain?
> Four thousand frames per chunk? Sounds a bit large off-hand.
How does the Lisp stack relate to the Java stack, if they're related at all? I don't know the maximum number of stack frames in Java; maybe it's not even fixed. But I doubt I have ever seen stack traces longer than a few hundred frames.
>> I think the right structure to use for (2) would be a linked list of arrays. (How about using Conses to store it?)
> Yeah, that sounds doable.
Conses can only store LispObjects IIRC, and thus an array would need to be wrapped... There's java.util.Stack since 1.4.2; it extends Vector, so it might do?
On Tue, Jul 21, 2009 at 3:59 PM, Alessio Stalla <alessiostalla@gmail.com> wrote:
> Conses can only store LispObjects IIRC, and thus an array would need to be wrapped... There's java.util.Stack since 1.4.2; it extends Vector, so it might do?
Vector, and by extension Stack, is synchronized. That's an overhead we don't want to pay, as small as it may be.
On Tue, Jul 21, 2009 at 3:44 PM, Ville Voutilainen <ville.voutilainen@gmail.com> wrote:
> On Tue, Jul 21, 2009 at 3:59 PM, Alessio Stalla <alessiostalla@gmail.com> wrote:
>> Conses can only store LispObjects IIRC, and thus an array would need to be wrapped... There's java.util.Stack since 1.4.2; it extends Vector, so it might do?
> Vector, and by extension Stack, is synchronized. That's an overhead we don't want to pay, as small as it may be.
Yes, I had overlooked that...
Ale
I'm kinda new here, and maybe I'm not fully grasping what you mean...
But there are considerable advantages to having a reference become eligible for collection the instant it stops being reachable, and cause for concern when a hold on it can persist in an invisible form.
I have worked in code where a reference to a sizeable structure ends up at the top of a call stack. Having it persist would basically be a leak (especially if some derived duplicate were made). Having this effect occur at some chunk granularity would make it nondeterministic (e.g. testing the same code fragment in a vacuum would not reproduce the leak).
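To put the concern in (hypothetical) code: if a recycled frame keeps its last argument reference, whatever that argument points at stays strongly reachable long after the call has returned:

// Hypothetical illustration: a frame object parked for re-use still holds
// its old argument, so the 64 MB buffer below never becomes collectible.
final class RecycledFrame {
    Object[] args;
}

public class InvisibleHold {
    static final RecycledFrame cached = new RecycledFrame();

    static void deepCall() {
        cached.args = new Object[] { new byte[64 * 1024 * 1024] };
        // The call returns; the frame is "popped" but parked for re-use.
        // Without an explicit cached.args = null here, the buffer leaks.
    }

    public static void main(String[] args) {
        deepCall();
        System.gc();  // the byte[] is still strongly reachable via 'cached'
    }
}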
On Tue, Jul 21, 2009 at 4:03 PM, Matt Lamari <matt.lamari@gmail.com> wrote:
> I'm kinda new here, and maybe I'm not fully grasping what you mean...
There are no problems for which ASCII graphics don't provide a solution/explanation. :)
Let's say we have a call stack of three functions, i.e. we have three frames:
[FFFUUUUUUU]
I'm assuming we have 10 frames per chunk; that's for explanation only, just an example value. F is a frame object, U is an unused placeholder. We may create the frame objects when allocating the chunk, or we may do it in a delayed fashion; that's something an experiment will reveal. If we make 8 more calls, we would get
[FFFFFFFFFF][FUUUUUUUUU]
So we needed to allocate a new chunk.
Now, when we unwind one frame, we get
[FFFFFFFFFF][RUUUUUUUUU]
where R is a frame object that can be recycled. So, if we call a function there, we get
[FFFFFFFFFF][FUUUUUUUUU]
we re-used an existing frame object, without having to recreate it. By tuning the chunk granularity in a reasonable manner, we get bulk allocation for the chunk, which helps the speed, and we don't need to reallocate when a frame object can be recycled. Moreover, we don't need to hold on to the frame objects forever, as this is an intermediate compromise solution. In theory, this should result in not having leaks, and having fewer allocator calls.
For garbage collecting, if we unwind the last example by, let's say, two frames, we get
[FFFFFFFFFR]
where we get rid of the second chunk and recycle the first. Thus the second chunk becomes collected at this point. This obviously means that if there are many calls at the chunk boundary, we allocate and collect the second chunk continuously. No solution is perfect, I guess.
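In terms of the hypothetical FrameChunkStack sketched earlier in this thread (pretending its chunk size were 10), the diagrams correspond to this sequence:

// Walk-through of the diagrams above, assuming CHUNK_SIZE = 10.
public class DiagramWalkthrough {
    public static void main(String[] args) {
        FrameChunkStack s = FrameChunkStack.current();
        for (int i = 0; i < 3; i++) s.push();  // [FFFUUUUUUU]
        for (int i = 0; i < 8; i++) s.push();  // [FFFFFFFFFF][FUUUUUUUUU]
        s.pop();                               // [FFFFFFFFFF][RUUUUUUUUU]
        s.push();                              // [FFFFFFFFFF][FUUUUUUUUU], frame re-used
        s.pop();                               // back to the chunk boundary...
        s.pop();                               // [FFFFFFFFFR], second chunk collectible
    }
}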
Ville Voutilainen writes:
> where we get rid of the second chunk and recycle the first. Thus the second chunk becomes collected at this point. This obviously means that if there are many calls at the chunk boundary, we allocate and collect the second chunk continuously. No solution is perfect, I guess.
To remedy this problem, you typically make a new chunk twice as large as the last chunk. And you only deallocate if the actually consumed capacity falls below 1/4 of the total capacity. This results in amortized constant time for insertion/deletion.
You don't have to hack that up from scratch, btw: it's what Java's ArrayList does. (Which is not synchronized.)
-T.
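A sketch of the policy Tobias describes, with made-up names (note that the actual growth factor and shrink behaviour of ArrayList may differ across JDK versions):

// Hypothetical grow/shrink stack: double the capacity when full, halve it
// when occupancy drops below a quarter; amortised O(1) push/pop.
final class GrowShrinkStack {
    private Object[] slots = new Object[8];
    private int size = 0;

    void push(Object o) {
        if (size == slots.length)
            resize(slots.length * 2);      // grow: twice as large
        slots[size++] = o;
    }

    // Callers must not pop an empty stack.
    Object pop() {
        Object o = slots[--size];
        slots[size] = null;                // don't retain popped contents
        if (size > 0 && size < slots.length / 4)
            resize(slots.length / 2);      // shrink only below 1/4 occupancy
        return o;
    }

    private void resize(int capacity) {
        Object[] copy = new Object[capacity];
        System.arraycopy(slots, 0, copy, 0, size);
        slots = copy;
    }
}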
On Tue, Jul 21, 2009 at 5:52 PM, Tobias C. Rittweiler <tcr@freebits.de> wrote:
> You don't have to hack that up from scratch, btw: it's what Java's ArrayList does. (Which is not synchronized.)
That's great - I find it a bit hard to find descriptions of how the java.util containers handle allocation.
Ville Voutilainen wrote:
> On Tue, Jul 21, 2009 at 5:52 PM, Tobias C. Rittweiler <tcr@freebits.de> wrote:
>> You don't have to hack that up from scratch, btw: it's what Java's ArrayList does. (Which is not synchronized.)
> That's great - I find it a bit hard to find descriptions of how the java.util containers handle allocation.
Reading the Java source is about the only way to learn implementation details like that. I think it is all in Java, so you won't run into native methods.
On Tue, Jul 21, 2009 at 4:52 PM, Tobias C. Rittweiler <tcr@freebits.de> wrote:
> Ville Voutilainen writes:
>> where we get rid of the second chunk and recycle the first. Thus the second chunk becomes collected at this point. This obviously means that if there are many calls at the chunk boundary, we allocate and collect the second chunk continuously. No solution is perfect, I guess.
> To remedy this problem, you typically make a new chunk twice as large as the last chunk. And you only deallocate if the actually consumed capacity falls below 1/4 of the total capacity. This results in amortized constant time for insertion/deletion.
> You don't have to hack that up from scratch, btw: it's what Java's ArrayList does. (Which is not synchronized.)
Ok. I can easily hack this into my current pending patch. It would also eliminate the need to create a Cons for every stack frame, because I wouldn't be storing the elements in a linked list anymore.
Let's see what this does for performance when I finish later today.
Bye,
Erik.
Parameterized Lisp calls vs. array calls. In the code I added an interface "barrier", just in case there was some inlining.
/*********************** START JAVA ***********************/
package benchtest;

interface ISomeFunParams { int SomeFun(int i, int j); }

interface ISomeFunArray { int SomeFun(int[] ij); }

interface ISomeFunVarArgs { int SomeFun(int... ij); }

class BenchArrays1 {
    public static void main(String[] args) {
        ISomeFunParams sumFun = new SomeFunImpl1();
        int res = 0;
        for (int i = 0; i < 100000; i++)
            for (int j = 0; j < 100000; j++) {
                res += sumFun.SomeFun(i, j);
            }
        // make sure we care about 'res'
        int foo = res;
        if (foo > 0) {
            res--;
        }
    }
}

class BenchArrays2 {
    public static void main(String[] args) {
        ISomeFunVarArgs sumFun = new SomeFunImpl2();
        int res = 0;
        for (int i = 0; i < 100000; i++)
            for (int j = 0; j < 100000; j++) {
                res += sumFun.SomeFun(i, j);
            }
        // make sure we care about 'res'
        int foo = res;
        if (foo > 0) {
            res--;
        }
    }
}

class BenchArrays3 {
    public static void main(String[] args) {
        ISomeFunArray sumFun = new SomeFunImpl3();
        int res = 0;
        for (int i = 0; i < 100000; i++)
            for (int j = 0; j < 100000; j++) {
                res += sumFun.SomeFun(new int[]{i, j});
            }
        // make sure we care about 'res'
        int foo = res;
        if (foo > 0) {
            res--;
        }
    }
}

class SomeFunImpl1 implements ISomeFunParams {
    public int SomeFun(int i, int j) { return i + j; }
}

class SomeFunImpl2 implements ISomeFunVarArgs {
    public int SomeFun(int... ij) { return ij[0] + ij[1]; }
}

class SomeFunImpl3 implements ISomeFunArray {
    public int SomeFun(int[] ij) { return ij[0] + ij[1]; }
}
/*********************** END JAVA ***********************/
[root@titan ArrayBoundsCheckBenchmarks]# time java -cp bin/ benchtest.BenchArrays1

real    0m0.113s
user    0m0.040s
sys     0m0.013s

[root@titan ArrayBoundsCheckBenchmarks]# time java -cp bin/ benchtest.BenchArrays2

real    1m42.898s
user    1m21.279s
sys     0m0.742s

[root@titan ArrayBoundsCheckBenchmarks]# time java -cp bin/ benchtest.BenchArrays3

real    1m40.823s
user    1m21.411s
sys     0m0.651s

(Note: 100000 x 100000 = 10^10 iterations finishing in 0.113s means the JIT must have eliminated the BenchArrays1 loop entirely, while the varargs and array versions allocate a fresh int[] on every call.)