I took a file of about 450MB of characters. Using SBCL, when I read it like this:
```
(defun do-test2 ()
  (with-open-file (stream *text-file*)
    (let ((buffer-size (* 16 1024 1024))) ; 16M
      (time
       (loop with buffer = (make-array buffer-size :element-type 'character)
             for n-characters = (read-sequence buffer stream)
             while (< 0 n-characters))))))
```
It took an average of 1.08125s to read (4 trials).
This procedure, which reads the same file as raw bytes:
```
(defun do-test3 ()
  (with-open-file (stream *text-file* :element-type '(unsigned-byte 8))
    (let ((buffer-size (* 16 1024 1024))) ; 16M
      (time
       (loop with buffer = (make-array buffer-size :element-type '(unsigned-byte 8))
             for n-characters = (read-sequence buffer stream)
             while (< 0 n-characters))))))
```
took an average of 0.07s.
When I modify this to set the `:external-format` to `:iso8859-1` and read into an array of `:element-type 'character`, it takes an average of 0.8095s.
So there seems to be *some* overhead to the Unicode handling. Note that I didn't have a file at hand that actually contained ISO-8859-1 text, so I don't know whether that would have complicated matters.
This suggests that just moving around bits without worrying about their interpretation *may* be faster than treating them as characters. So you could see if that changes your results at all.
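To make the byte-versus-character comparison easier to repeat, here is a minimal sketch of a single helper that takes the element type as a parameter. The name `count-elements` and its keyword arguments are hypothetical (they are not part of the tests above); it uses only standard `with-open-file`, `make-array`, and `read-sequence`, and it returns the total number of elements read so the two runs can be sanity-checked against each other.

```lisp
;; Hypothetical helper for illustration; COUNT-ELEMENTS is not from the tests above.
(defun count-elements (path element-type
                       &key (buffer-size (* 16 1024 1024)) ; 16M, as in the tests
                            (external-format :default))
  "Read PATH in chunks of BUFFER-SIZE elements of ELEMENT-TYPE.
Return the total number of elements read."
  (with-open-file (stream path
                          :element-type element-type
                          :external-format external-format)
    ;; The buffer's element type must match the stream's element type.
    (let ((buffer (make-array buffer-size :element-type element-type)))
      (loop for n = (read-sequence buffer stream)
            while (plusp n)
            sum n))))
```

Assuming `*text-file*` is bound as in the tests above, the two cases could then be timed with `(time (count-elements *text-file* '(unsigned-byte 8)))` and `(time (count-elements *text-file* 'character))`. For a file in a multi-byte encoding the two counts may differ, since one counts octets and the other counts decoded characters.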
I'm not a real expert in CL file I/O, so it's likely that this could be done better.
On 21 Oct 2022, at 16:18, Garrett Dangerfield wrote:
I tried changing (make-array buffer-size :element-type 'character) to (make-array buffer-size :element-type 'byte) and I got additional warnings and it took 70 seconds instead of 20.
Thanks, Garrett.
On Fri, Oct 21, 2022 at 1:47 PM Robert Goldman <rpgoldman@sift.net> wrote:
I don't know what data you are reading, but is there any chance that the difference is this: when you read text in Lisp as ISO-8859-1, Lisp is actually processing the text as Unicode, whereas in Java you are just slamming raw bytes into memory?
Maybe this is relevant? https://stackoverflow.com/questions/979932/read-unicode-text-files-with-java
I don't use Java myself, so I can't say, and I don't have access to your data, but it does seem like the Java code is doing something simpler than the Lisp code.
What happens if you change your Lisp code to read-sequence of type byte instead of character?
On 21 Oct 2022, at 13:43, Garrett Dangerfield wrote:
I don't want to cause a firestorm here, but I was doing some simple benchmarks on file I/O between Java, ABCL, and SBCL and I'm a bit shocked, honestly.
Reading a 2.5M file in 16M chunks (using iso-8859-1):
- abcl takes a tad over 1 second
- sbcl takes 0.04 seconds
Reading a 5.8G file in 16M chunks (using iso-8859-1 for Lisp; for Java it's just bytes):
- abcl takes...too long, I gave up
- sbcl takes between 20 and 21 seconds
- Java takes 1.5 seconds
These are all run on the same computer using the same files, etc.
What's up with this? Thoughts? I'd heard that SBCL should be as fast as C under at least some circumstances. I'd wager that C is at least as fast as Java (probably faster).
Thanks, Garrett Dangerfield. (he/him/his)
P.S. Don't get me wrong, I *LOVE* Lisp, I'm trying to get away from Java as fast as I can (the syntax is killing me slowly). I've used ABCL in projects before (it was wonderful, Java doesn't handle XML well).
Lisp code:
```
(with-open-file (stream "/media/danger/OS/temp/jars.txt"
                        :external-format :iso-8859-1) ; great_expectations.iso
  (let ((size (file-length stream))
        (buffer-size (* 16 1024 1024))) ; 16M
    (time
     (loop with buffer = (make-array buffer-size :element-type 'character)
           for n-characters = (read-sequence buffer stream)
           while (< 0 n-characters)))))
```
Java code:
```
private static final int BUFFER_SIZE = 16 * 1024 * 1024;

try (InputStream in = new FileInputStream("/media/danger/OS/temp/great_expectations.iso"); ) {
    byte[] buff = new byte[BUFFER_SIZE];
    int chunkLen = -1;
    long start = System.currentTimeMillis();
    while ((chunkLen = in.read(buff)) != -1) {
        System.out.println("chunkLen = " + chunkLen);
    }
    double duration = System.currentTimeMillis() - start;
    duration /= 1000;
    System.out.println(String.format("it took %,2f secs", duration));
} catch (Exception e) {
    e.printStackTrace(System.out);
} finally {
    System.out.println("Done.");
}
```
Robert P. Goldman Research Fellow Smart Information Flow Technologies (d/b/a SIFT, LLC)
319 N. First Ave., Suite 400 Minneapolis, MN 55401
Voice: (612) 326-3934 Email: rpgoldman@SIFT.net