Hi Liam,
I recently realised that I completely forgot to submit the forms that I used to check where the problem lies with the slow FFT tests. So my apologies, and hereby some very simple forms (should be run in package :gsll). I quickly re-ran them to verify that the timings are still the same (with updated gsll and now using antik).
;;;; Check 1: Perform FFTs on 1000 different vectors, each with 1000 ;;;; random complex elements (consisting of a real and a complex ;;;; random element). Takes ~2 seconds on my old laptop. (time (progn (loop for i from 0 below 1000 do (forward-fourier-transform (make-urand-vector '(complex double-float) 1000))) nil))
;;;; Check 2: Same as 1, but without the FFTs. Takes ~1.8 seconds. (time (progn (loop for i from 0 below 1000 do (make-urand-vector '(complex double-float) 1000)) nil))
Here, it's clear that the time increases by an order of magnitude by just generating random vectors. 1000 FFTs take less time:
;;;; Check 3: Generate only 1 random vector, FFT 1000 times. Takes ~ ;;;; 0.14 seconds. (time (let ((rand (make-urand-vector '(complex double-float) 1000))) (loop for i from 0 below 1000 do (forward-fourier-transform rand)) nil))
So let's see where the problems lie in making the random vector:
;;;; Check 4: Same as 2, but without making the vector. Takes ~1 second. (time (progn (loop for i from 0 below 1000 do (loop for j from 0 below 1000 do (coerce (complex (urand) (urand)) '(complex double-float)))) nil))
;;;; Check 5: Same as 4, but without coercing. Takes ~0.6 seconds. (time (progn (loop for i from 0 below 1000 do (loop for j from 0 below 1000 do (complex (urand) (urand)))) nil))
You can see that half the time of Check 1 is just spent generating and coercing elements. Then ~40% is making and setting the vectors (non-optimized) and ~10% is the FFTs themselves.
I tried making an optimized version of make-urand-vector using declarations, which only sped up things by a little bit (certainly not an order of magnitude); we discussed this before on #lisp. However, since the transition to antik, I didn't manage to get this working in a reasonable amount of time. Regardless, the fact that it still contained the form:
(coerce (complex (urand) (urand)) '(complex double-float))
is the reason for the fact that even the optimized version is slow. Since the original GSL tests use random vectors a lot, this is a real problem. The second thing is that I suspect that setf-ing array elements is slower than in C as well. The GSL tests also use this a lot, since they initialize the vectors with known numbers, to verify that those elements were untouched when using the stride keyword.
It could be that optimized functions, using declarations and gsetf, will cause those parts of the tests to go faster. I will try that when I have some more time, and when I get it working.
By the way, here's the defun for the attempt at an optimized version of make-urand-vector for complex double-floats. I tried many different things (also without the let around) but couldn't get it to run: I always get a type-error.
(defun make-urand-complex-double-float-vector (dimension &key (stride 1) init-offset) "Make a complex double-float vector with random elements." (let ((vec (make-and-init-vector '(complex double-float) (* stride dimension) :init-offset init-offset))) (declare (type grid:vector-complex-double-float vec)) (loop for i from 0 below (* stride dimension) by stride do (let ((value (coerce (complex (urand) (urand)) '(complex double-float)))) (declare (type (complex double-float) value)) (grid:gsetf (grid:aref* vec i) value))) vec))
Regards, Sumant
Hi Sumant,
Thanks for this interesting analysis. It is encouraging to me that the FFT itself is only a small fraction of the total time taken, and therefore there is opportunity for real speed improvement. Here is my breakdown:
1. FFT only (Checks 2 and 3): 10% 2. make-urand-vector: 90% 1. making and setting the vector (Check 4) (make-and-init-vector, (setf grid:aref)): 40% of the overall 2. pointless coercing of (complex (urand) (urand)) to (complex double-float), which it already is (Check 5): 20% of the overall
I don't understand why the coerce costs so much, as it's essentially a no-op (complex (urand) (urand)) is already a (complex double-float) but regardless, it should be an easy win of 20% to just remove the coerce.
With some care, I bet make-urand-vector could be made much faster than the remaining 70%. One thing is inlining #'urand in make-urand-vector and declaring as much as possible like declaring urand-seed fixnum. Another is to call make-foreign-array with :initial-contents instead of setfing each element, I'm not sure if that's faster or slower, but it would be interesting to find out because it might lead us to rethink how to do that if it is slower. (BTW on SBCL there might be a price paid here in that we are making a static vector). I don't think your declaration (type grid:vector-complex-double-float vec) does anything because the compiler won't use that information, and we don't read the environment (we would need variable-information from CLtL2 but removed from the CL standard; it is used in e.g. grid::declared-type-dimension).
Liam
Hi Liam,
Thanks for the tips. If you remember, though, I previously reported that using :initial-contents (from a list) is slow, in fact, it's slower than setf-ing each element by hand (at least on my system). I don't know (and haven't tried) to use :initial-contents with anything else than a list. E.g. when I read a file containing ascii data, there are two ways I tried to read that into an array: read the whole table into a list and use that as :initial-contents, or first make the array and read value by value into the array. The latter was quite a bit faster.
Regarding the coerce, I know it should be a no-op. The reason it's there, is that the tests call for real and complex single and double float values, but in the previous e-mail I only put the tests where the coerce *should* be a no-op. A coerce could be avoided by making separate functions based on type, I guess.
I don't really understand why the declarations don't do anything; I thought that I had copy-pasted that from the antik manual tips on speed. Apparently I did something wrong.
To me, these tests show that the speed problem is basically a problem in CL handling the grids/foreign arrays, and not with the FFI, as witnessed by the speedy FFTs themselves. A significant part of my data processing only relies for one or two operations on GSL algorithms, and the biggest part are simple operations on data manipulations that I would do directly in CL (reading and writing data, cutting, pasting, transforming rows and columns).
As such, I think the problem exceeds the domain of FFT tests, and aside from tinkering with the code here and there, and trying some things out, I'm not sure of how much use I can be. I will still be looking into it, but I also don't have too much time to spend. So if there's anyone that can help on this issue, or provide more insights, that would be a great help.
-Sumant
On Sat, Nov 26, 2011 at 10:24:18AM -0500, Liam Healy wrote:
Hi Sumant,
Thanks for this interesting analysis.� It is encouraging to me that the FFT itself is only a small fraction of the total time taken, and therefore there is opportunity for real speed improvement. Here is my breakdown:
- FFT only (Checks 2 and 3): 10%
- make-urand-vector: 90%
����� 1. making and setting the vector (Check 4) (make-and-init-vector, (setf grid:aref)): 40% of the overall ����� 2. pointless coercing of (complex (urand) (urand)) to (complex double-float), which it already is (Check 5): 20% of the overall
I don't understand why the coerce costs so much, as it's essentially a no-op (complex (urand) (urand)) is already a (complex double-float) but regardless, it should be an easy win of 20% to just remove the coerce.
With some care, I bet make-urand-vector could be made much faster than the remaining 70%.� One thing is inlining #'urand in make-urand-vector and declaring as much as possible like declaring urand-seed fixnum.�� Another is to call make-foreign-array with :initial-contents instead of setfing each element, I'm not sure if that's faster or slower, but it would be interesting to find out because it might lead us to rethink how to do that if it is slower.� (BTW on SBCL there might be a price paid here in that we are making a static vector).� I don't think your declaration (type grid:vector-complex-double-float vec) does anything because the compiler won't use that information, and we don't read the environment (we would need variable-information from CLtL2 but removed from the CL standard;� it is used in e.g. grid::declared-type-dimension).
Liam