Raymond Toy pushed to branch issue-511-update-unicode-tests at cmucl / cmucl

1 changed file:

Changes:

  • src/code/unicode.lisp
    ... ... @@ -200,39 +200,12 @@
    200 200
          (string-downcase-full string :start start :end end))))
    
    201 201
     
    
    202 202
     
    
    203
    -;;;
    
    204
    -;;; This is a Lisp translation of the Scheme code from William
    
    205
    -;;; D. Clinger that implements the word-breaking algorithm.  This is
    
    206
    -;;; used with permission.
    
    207
    -;;;
    
    208
    -;;; This version is modified from the original at
    
    209
    -;;; http://www.ccs.neu.edu/home/will/R6RS/ to conform to CMUCL's
    
    210
    -;;; implementation of the word break properties.
    
    211
    -;;;
    
    212
    -;;;
    
    213
    -;;; Copyright statement and original comments:
    
    214
    -;;;
    
    215
    -;;;--------------------------------------------------------------------------------
    
    216
    -
    
    217
    -;; Copyright 2006 William D Clinger.
    
    218
    -;;
    
    219
    -;; Permission to copy this software, in whole or in part, to use this
    
    220
    -;; software for any lawful purpose, and to redistribute this software
    
    221
    -;; is granted subject to the restriction that all copies made of this
    
    222
    -;; software must include this copyright and permission notice in full.
    
    223
    -;;
    
    224
    -;; I also request that you send me a copy of any improvements that you
    
    225
    -;; make to this software so that they may be incorporated within it to
    
    226
    -;; the benefit of the Scheme community.
    
    227
    -
    
    228 203
     ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    
    229 204
     ;;
    
    230 205
     ;; Word-breaking as defined by Unicode Standard Annex #29.
    
    231 206
     ;;
    
    232 207
     ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    
    233 208
     
    
    234
    -;; Implementation notes.
    
    235
    -;;
    
    236 209
     ;; The string-foldcase, string-downcase, and string-titlecase
    
    237 210
     ;; procedures rely on the notion of a word, which is defined
    
    238 211
     ;; by Unicode Standard Annex 29.
    
    ... ... @@ -247,64 +220,11 @@
    247 220
     ;;
    
    248 221
     ;; Hence the performance of the word-breaking algorithm should
    
    249 222
     ;; not matter too much for this reference implementation.
    
    250
    -;; Word-breaking is more generally useful, however, so I tried
    
    251
    -;; to make this implementation reasonably efficient.
    
    252
    -;;
    
    253
    -;; Word boundaries are defined by 14 different rules in
    
    254
    -;; Unicode Standard Annex #29, and by GraphemeBreakProperty.txt
    
    255
    -;; and WordBreakProperty.txt.  See also WordBreakTest.html.
    
    256
    -;;
    
    257
    -;; My original implementation of those specifications failed
    
    258
    -;; 6 of the 494 tests in auxiliary/WordBreakTest.txt, but it
    
    259
    -;; appeared to me that those tests were inconsistent with the
    
    260
    -;; word-breaking rules in UAX #29.  John Cowan forwarded my
    
    261
    -;; bug report to the Unicode experts, and Mark Davis responded
    
    262
    -;; on 29 May 2007:
    
    263
    -;;
    
    264
    -;;   Thanks for following up on this. I think you have found a problem in the
    
    265
    -;;   formulation of word break, not the test. The intention was to break after a
    
    266
    -;;   Sep character, as is done in Sentence break. So my previous suggestion was
    
    267
    -;;   incorrect; instead, what we need is a new rule:
    
    268
    -;; 
    
    269
    -;;   *Break after paragraph separators.*
    
    270
    -;;    WB3a. Sep
    
    271
    -;;   I'll make a propose to the UTC for this.
    
    272
    -;;
    
    273
    -;; Here is Will's translation of those rules (including WB3a)
    
    274
    -;; into a finite state machine that searches forward within a
    
    275
    -;; string, looking for the next position at which a word break
    
    276
    -;; is allowed.  The current state consists of an index i into
    
    277
    -;; the string and a summary of the left context whose rightmost
    
    278
    -;; character is at index i.  The left context is usually
    
    279
    -;; determined by the character at index i, but there are three
    
    280
    -;; complications:
    
    281
    -;;
    
    282
    -;;     Extend and Format characters are ignored unless they
    
    283
    -;;         follow a separator or the beginning of the text.
    
    284
    -;;     ALetter followed by MidLetter is treated specially.
    
    285
    -;;     Numeric followed by MidNum is treated specially.
    
    286
    -;;
    
    287
    -;; In the implementation below, the left context ending at i
    
    288
    -;; is encoded by the following symbols:
    
    289
    -;;
    
    290
    -;;     CR
    
    291
    -;;     Sep (excluding CR)
    
    292
    -;;     ALetter
    
    293
    -;;     MidLetter
    
    294
    -;;     ALetterMidLetter (ALetter followed by MidLetter)
    
    295
    -;;     Numeric
    
    296
    -;;     MidNum
    
    297
    -;;     NumericMidNum (Numeric followed by MidNum)
    
    298
    -;;     Katakana
    
    299
    -;;     ExtendNumLet
    
    300
    -;;     other (none of the above)
    
    223
    +;; Word-breaking is more generally useful, however.
    
    301 224
     ;;
    
    302
    -;; Given a string s and an exact integer i (which need not be
    
    303
    -;; a valid index into s), returns the index of the next character
    
    304
    -;; that is not part of the word containing the character at i,
    
    305
    -;; or the length of s if the word containing the character at i
    
    306
    -;; extends through the end of s.  If i is negative or a valid
    
    307
    -;; index into s, then the returned value will be greater than i.
    
    225
    +;; Word boundaries are defined by different rules in Unicode Standard
    
    226
    +;; Annex #29, and by GraphemeBreakProperty.txt and
    
    227
    +;; WordBreakProperty.txt.  See also WordBreakTest.html.
    
    308 228
     ;;
    
    309 229
     ;;;--------------------------------------------------------------------------------
    
    310 230