#36: file-position broken for utf16 and utf32
--------------------+-------------------------------------------------------
 Reporter:  rtoy    |       Owner:  somebody
     Type:  defect  |      Status:  new
 Priority:  minor   |   Milestone:
Component:  Core    |     Version:  19f
 Keywords:          |
--------------------+-------------------------------------------------------
Consider this code:
{{{
(defun bug (&optional (format :utf16))
  (with-open-file (s "/tmp/bom.txt"
                     :direction :output
                     :if-exists :supersede
                     :external-format format)
    (format s "Hello~%"))
  (with-open-file (s "/tmp/bom.txt"
                     :direction :input
                     :external-format format)
    (print (read-char s))
    (print (file-position s)))
  (values))
}}}
Running {{{(bug :utf16)}}} produces
{{{
#\H
2
}}}
{{{(bug :utf32)}}} produces
{{{
#\H
4
}}}
In both cases, the reported position is wrong. For utf16, the position should be 4; for utf32, 8. The BOM has been ignored.
This is caused by {{{STRING-ENCODE}}} outputting the BOM for these formats. {{{STRING-ENCODE}}} is used to figure out how many octets have been read from the file but not yet processed. If the BOM were not output, the position would be correct.
This bug (will) occur in the 2010-02 snapshot and later.
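The wrong answers above can be reproduced with a small arithmetic model. This is only a sketch under stated assumptions: it assumes the whole file (BOM plus six characters) has already been read into the stream buffer, and that the position is computed as the octets read minus the octets attributed to the unread characters. The function names are hypothetical and are not CMUCL internals.

```lisp
;; Hypothetical model of the computation; these names are NOT CMUCL internals.
(defun bom-size (format)
  "Octets occupied by the BOM for FORMAT."
  (ecase format (:utf16 2) (:utf32 4)))

(defun char-size (format)
  "Octets per character for FORMAT, assuming BMP characters only."
  (ecase format (:utf16 2) (:utf32 4)))

(defun modeled-position (format total-chars chars-consumed
                         &key bom-in-reencoding)
  "Octets read from the file minus the octets attributed to unread characters.
When BOM-IN-REENCODING is true, re-encoding the unread characters is
assumed to emit a BOM as well, which inflates the unread count."
  (let ((octets-read (+ (bom-size format)
                        (* total-chars (char-size format))))
        (unread (* (- total-chars chars-consumed) (char-size format))))
    (- octets-read
       unread
       (if bom-in-reencoding (bom-size format) 0))))

;; "Hello~%" is six characters; one has been consumed by READ-CHAR.
(print (modeled-position :utf16 6 1 :bom-in-reencoding t))   ; the reported value
(print (modeled-position :utf16 6 1))                        ; the expected value
```

Under these assumptions the model gives 2 and 4 for utf16 and 4 and 8 for utf32, matching the reported and expected positions.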
Changes (by rtoy):
* version: 19f => 2010-01
Comment(by rtoy):
One possible solution is to keep track of the number of octets used to create each character. This has a relatively high cost because we need to save this for each character, for all inputs, but the data is only used for file-position. This seems really wasteful of MIPS and memory since file-position probably occurs much less often than reading characters.
Another alternative would be to modify string-encode so that the BOM is not included. But that's a bit tricky too. Either we need a new method for each external format (that needs it) or we need to add an extra parameter to the external format method to say we don't want a BOM. Not too hard to do, but some work to modify every format for this.
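The "extra parameter" idea might look roughly like the following. This is an illustrative, self-contained encoder for big-endian UTF-16 restricted to BMP characters; the function name and the {{{:bom}}} keyword are assumptions made for the sketch and are not the signature of CMUCL's string-encode or of its external format methods.

```lisp
;; Illustrative only: NOT CMUCL's STRING-ENCODE. The :BOM keyword is hypothetical.
(defun encode-utf16-be (string &key (bom t))
  "Encode STRING as big-endian UTF-16 octets, optionally prefixed by a BOM."
  (let ((octets (make-array 0 :element-type '(unsigned-byte 8)
                              :adjustable t :fill-pointer 0)))
    (flet ((put16 (code)
             ;; High octet first, then low octet.
             (vector-push-extend (ldb (byte 8 8) code) octets)
             (vector-push-extend (ldb (byte 8 0) code) octets)))
      (when bom (put16 #xFEFF))
      (loop for ch across string
            for code = (char-code ch)
            do (assert (< code #x10000) ()
                       "Surrogate pairs are not handled in this sketch")
               (put16 code)))
    octets))

;; Counting the octets of the five unread characters after reading #\H:
(print (length (encode-utf16-be (format nil "ello~%"))))           ; BOM included
(print (length (encode-utf16-be (format nil "ello~%") :bom nil)))  ; BOM suppressed
```

With the BOM suppressed, the count covers only the unread characters, which is what the position calculation needs.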
Or maybe string-encode can take a new argument specifying the ef state. But then we would need a new ef function to give us the ef state that will guarantee no BOM.
Or, the most hackish but workable solution: look at the output of string-encode. If the leading octets are the BOM (two octets for utf16, four for utf32), adjust for that. It seems doable.
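A minimal sketch of that check, assuming the encoder's result is available as a vector of octets. The helper names are hypothetical; the byte patterns are the standard UTF-16 and UTF-32 BOMs in both byte orders.

```lisp
;; Hypothetical helpers; OCTETS is whatever the encoder returned.
(defun leading-bom-size (octets format)
  "Number of leading octets in OCTETS that form a BOM for FORMAT, else 0."
  (flet ((starts-with (&rest prefix)
           (and (>= (length octets) (length prefix))
                ;; EVERY stops at the end of PREFIX, the shorter sequence.
                (every #'= prefix octets))))
    (ecase format
      (:utf16 (if (or (starts-with #xFE #xFF)
                      (starts-with #xFF #xFE))
                  2 0))
      (:utf32 (if (or (starts-with 0 0 #xFE #xFF)
                      (starts-with #xFF #xFE 0 0))
                  4 0)))))

(defun octets-without-bom (octets format)
  "Length of OCTETS, not counting a leading BOM."
  (- (length octets) (leading-bom-size octets format)))
```

Part of what makes this hackish: if the unread text itself begins with the character U+FEFF, its octets are indistinguishable from a BOM and would be subtracted by mistake.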
Comment(by rtoy):
Keeping track of the octets is probably the only "correct" solution. There's no guarantee that the input (octet-to-code) state has any relationship to the output (code-to-octet) state, so there may be no consistent way to run string-encode correctly.
Some tests with keeping track of the char lengths indicate that the cost is fairly low, at least when reading characters one at a time (but the conversion is still done a block at a time and doled out one character at a time).
Changes (by rtoy):
* status: new => closed
* resolution: => fixed
Comment:
Fixed by using an array to hold the octet length of each character. Tests show a very small change in speed (about a 1% increase in time).
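The general idea can be sketched in portable Common Lisp as follows. The ticket does not say how CMUCL lays out or fills this array, so everything here is an assumption for illustration: the function names are hypothetical, only big-endian UTF-16 with BMP characters is handled, and a leading BOM is charged to the first character.

```lisp
;; Hypothetical sketch, not the CMUCL implementation.
(defun decode-utf16-be (octets)
  "Decode big-endian UTF-16 OCTETS (BMP only, optional leading BOM).
Returns the string and a vector of octets consumed per character;
a BOM, if present, is charged to the first character."
  (let* ((bom (if (and (>= (length octets) 2)
                       (= (aref octets 0) #xFE)
                       (= (aref octets 1) #xFF))
                  2 0))
         (nchars (floor (- (length octets) bom) 2))
         (string (make-string nchars))
         (lengths (make-array nchars :element-type 'fixnum
                                     :initial-element 2)))
    (dotimes (i nchars)
      (let ((k (+ bom (* 2 i))))
        (setf (char string i)
              (code-char (+ (ash (aref octets k) 8)
                            (aref octets (1+ k)))))))
    (when (and (plusp bom) (plusp nchars))
      (incf (aref lengths 0) bom))
    (values string lengths)))

(defun position-after (lengths chars-consumed &optional (buffer-start 0))
  "File position after CHARS-CONSUMED characters of a buffer that began
at octet offset BUFFER-START."
  (+ buffer-start (reduce #'+ lengths :end chars-consumed)))
```

For the octets of the test file (BOM followed by "Hello" and a newline), the lengths come out as 4, 2, 2, 2, 2, 2, so the position after one character is 4 without re-encoding anything. The summation runs only when the position is requested; the per-character cost during decoding is a single array store.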