ChangeLog:
Version 0.11.1 2007-03-22 More ugliness for a bit of output performance in special cases
Download:
http://weitz.de/files/flexi-streams.tar.gz
Cheers, Edi.
Hi!
Edi, as far as I understand, external formats with multibyte encodings and CR-LF style newlines are not optimized because it's difficult to predict the number of characters that will fit into the buffer.
We can solve this by always keeping a few reserved bytes in the buffer. 20 will be sufficient for any encoding. I.e., loop while we have at least 20 free bytes in the buffer.
This solution is implemented in the attached patch (against 0.11.2).
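The reserved-bytes idea can be sketched in plain Common Lisp as follows. This is an illustration under stated assumptions, not the code from the attached patch: ENCODE-UTF-8-INTO, WRITE-STRING-BUFFERED, *RESERVE* and the FLUSH callback are hypothetical names, and only UTF-8 is handled.

```lisp
;; Sketch only: all names below are hypothetical, not part of flexi-streams.

(defparameter *reserve* 20
  "Number of bytes kept free so that one more character always fits.")

(defun encode-utf-8-into (ch buffer pos)
  "Store the UTF-8 encoding of CH in BUFFER starting at POS; return the new position."
  (let ((code (char-code ch)))
    (flet ((put (octet)
             (setf (aref buffer pos) octet)
             (incf pos)))
      (cond ((< code #x80)
             (put code))
            ((< code #x800)
             (put (logior #xC0 (ash code -6)))
             (put (logior #x80 (logand code #x3F))))
            ((< code #x10000)
             (put (logior #xE0 (ash code -12)))
             (put (logior #x80 (logand (ash code -6) #x3F)))
             (put (logior #x80 (logand code #x3F))))
            (t
             (put (logior #xF0 (ash code -18)))
             (put (logior #x80 (logand (ash code -12) #x3F)))
             (put (logior #x80 (logand (ash code -6) #x3F)))
             (put (logior #x80 (logand code #x3F)))))
      pos)))

(defun write-string-buffered (string flush &key (buffer-size 64))
  "Encode STRING and call FLUSH with (BUFFER END) for each filled chunk.
BUFFER-SIZE must be larger than *RESERVE*."
  (let ((buffer (make-array buffer-size :element-type '(unsigned-byte 8)))
        (used 0))
    (loop for ch across string
          do (setf used (encode-utf-8-into ch buffer used))
             ;; Flush as soon as fewer than *RESERVE* bytes remain free, so the
             ;; next character never has to be checked against the buffer end.
             (when (< (- buffer-size used) *reserve*)
               (funcall flush buffer used)
               (setf used 0)))
    (when (plusp used)
      (funcall flush buffer used))
    string))
```

In a real stream method FLUSH would be a WRITE-SEQUENCE call on the underlying binary stream; here it is a callback so the sketch runs without any library.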
The patch also contains some additions to the tests. As for separate tests for READ-SEQUENCE/WRITE-SEQUENCE, maybe they are useless - I noticed that, at least in CLISP, WRITE-LINE is implemented using WRITE-SEQUENCE. So in case of errors in WRITE-SEQUENCE both the WRITE-LINE and the WRITE-SEQUENCE tests fail.
Regarding the way I reused the existing WRITE-CHAR code in STREAM-WRITE-SEQUENCE: I do not much like working through a temporary stream; for example, we have redundant slot access in WRITE-BYTE*. I've tried to keep the changes small and not disturb other code. Maybe with some refactoring it will be possible to have cleaner and more efficient code.
For example, the WRITE-CHAR code would not use WRITE-BYTE* directly but would instead use some BYTE-WRITER-FUN passed as a parameter. Also, the external format could be split out as a separate entity responsible for byte/character conversions.
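A minimal sketch of that refactoring, assuming a generic function that dispatches on the external format and receives the byte writer as an argument. CHAR-TO-OCTETS*, STRING-TO-OCTET-LIST and the two keyword formats are hypothetical stand-ins for illustration, not the flexi-streams API.

```lisp
;; Sketch only: hypothetical names, not the flexi-streams API.

(defgeneric char-to-octets* (external-format ch sink)
  (:documentation "Call SINK once for each octet that encodes CH in EXTERNAL-FORMAT."))

(defmethod char-to-octets* ((external-format (eql :latin-1)) ch sink)
  ;; Assumes (char-code ch) < 256; no error handling in this sketch.
  (funcall sink (char-code ch)))

(defmethod char-to-octets* ((external-format (eql :utf-16-be)) ch sink)
  ;; Assumes CH is in the Basic Multilingual Plane (no surrogate pairs).
  (let ((code (char-code ch)))
    (funcall sink (ldb (byte 8 8) code))
    (funcall sink (ldb (byte 8 0) code))))

(defun string-to-octet-list (external-format string)
  "Encode STRING by passing a collecting closure as the byte writer."
  (let ((octets '()))
    (loop for ch across string
          do (char-to-octets* external-format ch
                              (lambda (octet) (push octet octets))))
    (nreverse octets)))
```

The same CHAR-TO-OCTETS* methods could then be driven by a closure that calls WRITE-BYTE on a stream or by one that fills a buffer, without the conversion code knowing which.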
BTW, I've started changing STREAM-WRITE-SEQUENCE with the version provided below. It is more efficient, but it isn't thread safe.
(defmacro dyna-let-f-global (symbol func-to-bind &body body)
  "Something similar to FLET for global functions but with dynamic extent"
  `(let ((old-fdef (fdefinition ,symbol)))
     (unwind-protect
          (progn (setf (fdefinition ,symbol) ,func-to-bind)
                 ,@body)
       (setf (fdefinition ,symbol) old-fdef))))

(defmethod stream-write-sequence ((stream flexi-output-stream) (sequence string) start end &key)
  (declare (optimize speed)
           (type (integer 0 *) start end fill-pointer src-ptr))
  ; (declare (optimize speed) (type (integer 0 *) start end))
  (let ((buffer (make-array (+ +buffer-size+ 20) :element-type 'octet)))
    ;; use repeated calls to WRITE-SEQUENCE for arrays of octets
    (loop with src-ptr = start
          while (< src-ptr end)
          do (let ((fill-pointer 0))
               (dyna-let-f-global 'write-byte*
                   (lambda (byte stream)
                     (declare (ignore stream)
                              (type (integer 0 *) fill-pointer)
                              (type octet byte))
                     (setf (aref buffer fill-pointer) byte)
                     (incf fill-pointer))
                 (loop while (and (< src-ptr end)
                                  (< fill-pointer +buffer-size+))
                       do (stream-write-char stream (aref sequence src-ptr))
                          (incf src-ptr)))
               (write-sequence buffer (flexi-stream-stream stream)
                               :start 0 :end fill-pointer))))
  sequence)
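The version above is not thread safe because (SETF FDEFINITION) changes the global function for every thread at once. One possible alternative, sketched here with hypothetical names (*BYTE-WRITER*, EMIT-BYTE, COLLECT-BYTES), is to route bytes through a special variable and rebind it with LET. The ANSI standard does not cover threads, but the multithreaded implementations generally make such bindings thread-local, and the old value is restored automatically on exit.

```lisp
;; Sketch only: hypothetical names, not part of flexi-streams.

(defvar *byte-writer*
  (lambda (byte)
    (declare (ignore byte))
    (error "No byte writer is bound."))
  "Function called with each octet produced by the encoder.")

(defun emit-byte (byte)
  "Stand-in for WRITE-BYTE*: hand BYTE to whatever writer is currently bound."
  (funcall *byte-writer* byte))

(defun collect-bytes (thunk)
  "Run THUNK with EMIT-BYTE redirected into a fresh list; return the octets in order."
  (let* ((octets '())
         (*byte-writer* (lambda (byte) (push byte octets))))
    (funcall thunk)
    (nreverse octets)))
```

The price is one FUNCALL through a special variable per byte, so whether this beats the temporary-stream approach would have to be measured.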
Here are some performance tests. Tested in CLISP on Windows XP.
(asdf:operate 'asdf:load-op :flexi-streams)
(asdf:oos 'asdf:test-op :flexi-streams)

(defparameter long-str "")
;; populate it with some Russian characters
(dotimes (i 10)
  (setf long-str
        (concatenate 'string long-str
                     "Достаточно длинная строка на русском языке. Ее длина составляет порядка ста символов")))
(defun time-test (external-format)
  (with-open-file (stream "/cygdrive/c/usr/projects/flexi-dev/test-output.txt"
                          :direction :output
                          :element-type '(unsigned-byte 8)
                          :buffered nil) ; :buffered is a CLISP extension
    (with-open-stream (fstream (flex:make-flexi-stream stream :external-format external-format))
      (loop for i from 0 below 200
            do (write-sequence long-str fstream)))))
;; original flexi-streams-0.11.2
;; =============================
(time (time-test :utf-8))
;; Real time: 12.178 sec.
;; Run time: 12.093 sec.
;; Space: 430660 Bytes
;; GC: 1, GC time: 0.015 sec.

(time (time-test :koi8-r))
;; Real time: 0.246 sec.
;; Run time: 0.25 sec.
;; Space: 1820584 Bytes
;; GC: 2, GC time: 0.048 sec.

;; with temp flexi stream
;; ======================
(time (time-test :utf-8))
;; Real time: 1.08 sec.
;; Run time: 1.079 sec.
;; Space: 1791632 Bytes
;; GC: 2, GC time: 0.063 sec.

(time (time-test :koi8-r))
;; Real time: 0.861 sec.
;; Run time: 0.875 sec.
;; Space: 1795636 Bytes
;; GC: 2, GC time: 0.048 sec.

;; with dyna-let-f-global
;; ======================
(time (time-test :utf-8))
;; Real time: 0.639 sec.
;; Run time: 0.641 sec.
;; Space: 1734832 Bytes
;; GC: 2, GC time: 0.062 sec.

(time (time-test :koi8-r))
;; Real time: 0.567 sec.
;; Run time: 0.563 sec.
;; Space: 1734836 Bytes
;; GC: 2, GC time: 0.062 sec.
Best regards, -Anton
-----Original Message----- From: Edi Weitz edi@agharta.de To: flexi-streams-devel@common-lisp.net Date: Thu, 22 Mar 2007 22:58:27 +0100 Subject: [flexi-streams-devel] New release 0.11.1
ChangeLog:
Version 0.11.1 2007-03-22 More ugliness for a bit of output performance in special cases
Download:
http://weitz.de/files/flexi-streams.tar.gz
Cheers, Edi.
Hi Anton,
sorry for the delay. I have to think about this a bit more, but here's some preliminary feedback:
On Sat, 21 Apr 2007 22:33:02 +0400, Vodonosov Anton vodonosov@mail.ru wrote:
Edi, as far as I understand, external formats with multibyte encodings and CR-LF style newlines are not optimized because it's difficult to predict the number of characters that will fit into the buffer.
We can solve this by always keeping a few reserved bytes in the buffer. 20 will be sufficient for any encoding. I.e., loop while we have at least 20 free bytes in the buffer.
This solution is implemented in the attached patch (against 0.11.2).
Thanks. The win for UTF-8 is impressive, but I'm a bit concerned that you'll lose a lot of performance for 8-bit encodings. I think it'd be better to leave the 8-bit version in there and only use your code for other strings.
The patch also contains some additions to the tests. As for separate tests for READ-SEQUENCE/WRITE-SEQUENCE, maybe they are useless - I noticed that, at least in CLISP, WRITE-LINE is implemented using WRITE-SEQUENCE. So in case of errors in WRITE-SEQUENCE both the WRITE-LINE and the WRITE-SEQUENCE tests fail.
Thanks, more tests are always good.
Regarding the way I reused the existing WRITE-CHAR code in STREAM-WRITE-SEQUENCE: I do not much like working through a temporary stream; for example, we have redundant slot access in WRITE-BYTE*. I've tried to keep the changes small and not disturb other code. Maybe with some refactoring it will be possible to have cleaner and more efficient code.
Yes, the more we optimize for performance, the more of a mess it becomes... :(
BTW, I've started changing STREAM-WRITE-SEQUENCE with the version provided below. It is more efficient, but it isn't thread safe.
Hmmm...
As I said, I have to ponder this a bit more, but unfortunately I'm too busy right now. More later.
Cheers, Edi.
Hello Edi
sorry for the delay. ... i'm too busy right now.
No problem, I'm in no hurry.
I have a new version, with better performance (see attached diff of output.lisp against 0.11.2).
Results for the same test:
;; original flexi-streams-0.11.2
;; =============================
(time (time-test :utf-8))
;; Real time: 12.178 sec.
;; Run time: 12.093 sec.
;; Space: 430660 Bytes
;; GC: 1, GC time: 0.015 sec.

(time (time-test :koi8-r))
;; Real time: 0.246 sec.
;; Run time: 0.25 sec.
;; Space: 1820584 Bytes
;; GC: 2, GC time: 0.048 sec.

;; with char-to-octets
;; ===================
(time (time-test :utf-8))
;; Real time: 0.328 sec.
;; Run time: 0.328 sec.
;; Space: 1728432 Bytes
;; GC: 2, GC time: 0.047 sec.

(time (time-test :koi8-r))
;; Real time: 0.323 sec.
;; Run time: 0.297 sec.
;; Space: 1728436 Bytes
;; GC: 2, GC time: 0.063 sec.
Surprisingly, UTF-8 now works almost as efficiently as the 8-bit encoding.
I think the difference in performance for 8-bit between this version and the original 0.11.2 is caused by two function calls replacing the inlined character-to-bytes conversion (CLOS dispatch of CHAR-TO-OCTETS, which in turn calls the OCTET-SINK passed to it).
I have to think about this a bit more ... The win for UTF-8 is impressive, but I'm a bit concerned that you'll lose a lot of performance for 8-bit encodings. I think it'd be better to leave the 8-bit version in there and only use your code for other strings.
It's up to you. I hope it will be possible to inline the code as you do in 0.11.2, maybe by redefining the CHAR-TO-OCTETS methods as macros.
Maybe I'll try it when I have some spare time.
Regards, -Anton
On Sun, 29 Apr 2007 22:43:44 +0400, Vodonosov Anton vodonosov@mail.ru wrote:
I have a new version, with better performance (see attached diff of output.lisp against 0.11.2).
I've finally found some time to think about this. I've now released a new version which is based on your ideas but which (I hope) is a bit more elegant and maybe even a bit faster for the case where characters are output individually.
This new release also fixes a glaring bug in STREAM-WRITE-BYTE.
Thanks a lot for your patch, Edi.
PS: And the next time please send a patch without TAB characters... :)
Hi. Glad you committed it.
Sorry about the tabs. You may consider setting the Emacs variable indent-tabs-mode to nil in your sources.
A few days ago I tried to measure the performance improvement gained by buffering in stream-write-sequence. It turned out that in real-world scenarios there is no improvement at all, because the underlying stream always provides the necessary buffering.
For example, to notice a change in performance when working with a file, we must use an unbuffered file stream, which is of course very uncommon (moreover, with a normal, buffered file stream performance is slightly worse because we do some additional work on our side; but you must do a lot of output to notice this).
Therefore, I decided to test it with a network stream. I set up hunchentoot and configured a simple easy-handler that just returns a 10k string. I also changed the make-socket-stream function to create an unbuffered socket (:buffering :none in sbcl).
On the other host there was drakma performing a hundred requests to this handler.
I tested this setup with two versions of flexi-streams: 0.11.0, without buffering, and 0.13.1, with buffering. Surprisingly (for me), there was no performance difference in this case either.
Using a sniffer (Ethereal) I discovered that even if we do no buffering on the sbcl side, buffering is performed anyway, by the TCP implementation. When we do a lot of write-byte calls it sends the first packet with data length = 1, but all subsequent packets are of size > 1k. It looks like there is some micro timeout after our write-byte to see whether we will do write-byte again.
Maybe this buffering in stream-write-sequence is unnecessary...
Best regards, -Anton
P.S. There is a little test for the stream-line-column function in the attachment.
On Fri, 09 Nov 2007 03:27:50 +0300, Anton Vodonosov vodonosov@mail.ru wrote:
Sorry about the tabs. You may consider setting the Emacs variable indent-tabs-mode to nil in your sources.
I have that, but the tabs were in your patch. Emacs won't automatically remove tabs introduced from patches, you'd have to do that manually...
A few days ago I tried to measure the performance improvement gained by buffering in stream-write-sequence. It turned out that in real-world scenarios there is no improvement at all, because the underlying stream always provides the necessary buffering.
For example, to notice a change in performance when working with a file, we must use an unbuffered file stream, which is of course very uncommon (moreover, with a normal, buffered file stream performance is slightly worse because we do some additional work on our side; but you must do a lot of output to notice this).
Therefore, I decided to test it with a network stream. I set up hunchentoot and configured a simple easy-handler that just returns a 10k string. I also changed the make-socket-stream function to create an unbuffered socket (:buffering :none in sbcl).
On the other host there was drakma performing a hundred requests to this handler.
I tested this setup with two versions of flexi-streams: 0.11.0, without buffering, and 0.13.1, with buffering. Surprisingly (for me), there was no performance difference in this case either.
Using a sniffer (Ethereal) I discovered that even if we do no buffering on the sbcl side, buffering is performed anyway, by the TCP implementation. When we do a lot of write-byte calls it sends the first packet with data length = 1, but all subsequent packets are of size > 1k. It looks like there is some micro timeout after our write-byte to see whether we will do write-byte again.
Maybe this buffering in stream-write-sequence is unnecessary...
Hmmm....
ISTR that I did some simple tests with your patch and that there was a noticeable difference, but I don't have the results anymore. Anyway, I think I'll leave it like it is for now.
P.S. There is a little test for the stream-line-column function in the attachment.
Thanks.
Cheers, Edi.
Edi Weitz edi@agharta.de:
On Fri, 09 Nov 2007 03:27:50 +0300, Anton Vodonosov vodonosov@mail.ru wrote:
Sorry about the tabs. You may consider setting the Emacs variable indent-tabs-mode to nil in your sources.
I have that, but the tabs were in your patch. Emacs won't automatically remove tabs introduced from patches, you'd have to do that manually...
I meant adding ;;; -*- ... indent-tabs-mode: nil -*- at the top of the source files. This may decrease the probability that people will introduce tabs into patches.
Best regards, -Anton
On Fri, 09 Nov 2007 12:15:49 +0300, Anton Vodonosov vodonosov@mail.ru wrote:
I meant adding ;;; -*- ... indent-tabs-mode: nil -*- at the top of the source files. This may decrease the probability that people will introduce tabs into patches.
Ah, I see, didn't know that.
Thanks, Edi.
Edi Weitz edi@agharta.de:
Hmmm....
ISTR that I did some simple tests with your patch and that there was a noticeable difference, but I don't have the results anymore. Anyway, I think I'll leave it like it is for now.
It looks like the slow network connection between the hosts was the bottleneck in my test. I've tried with drakma and hunchentoot on the same computer and indeed, there is a difference: the recent flexi-streams is a little faster. This difference doesn't depend on whether the socket stream is buffered or not. Therefore I conclude that the time is saved by avoiding the internal machinery of write-byte (access to internal slots of the stream object, etc., in both the flexi-stream and the sbcl socket stream).
I was confused because I thought that multiple write-byte calls on a socket would result in multiple packets being sent, which would make write-byte on the underlying socket stream extremely expensive, and so I expected a radical performance improvement when individual write-byte calls are avoided.
Best regards, -Anton