@@ -200,39 +200,12 @@
 (string-downcase-full string :start start :end end))))
 
 
-;;;
-;;; This is a Lisp translation of the Scheme code from William
-;;; D. Clinger that implements the word-breaking algorithm. This is
-;;; used with permission.
-;;;
-;;; This version is modified from the original at
-;;; http://www.ccs.neu.edu/home/will/R6RS/ to conform to CMUCL's
-;;; implementation of the word break properties.
-;;;
-;;;
-;;; Copyright statement and original comments:
-;;;
-;;;--------------------------------------------------------------------------------
-
-;; Copyright 2006 William D Clinger.
-;;
-;; Permission to copy this software, in whole or in part, to use this
-;; software for any lawful purpose, and to redistribute this software
-;; is granted subject to the restriction that all copies made of this
-;; software must include this copyright and permission notice in full.
-;;
-;; I also request that you send me a copy of any improvements that you
-;; make to this software so that they may be incorporated within it to
-;; the benefit of the Scheme community.
-
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;
 ;; Word-breaking as defined by Unicode Standard Annex #29.
 ;;
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
-;; Implementation notes.
-;;
 ;; The string-foldcase, string-downcase, and string-titlecase
 ;; procedures rely on the notion of a word, which is defined
 ;; by Unicode Standard Annex 29.
@@ -247,64 +220,11 @@
 ;;
 ;; Hence the performance of the word-breaking algorithm should
 ;; not matter too much for this reference implementation.
-;; Word-breaking is more generally useful, however, so I tried
-;; to make this implementation reasonably efficient.
-;;
-;; Word boundaries are defined by 14 different rules in
-;; Unicode Standard Annex #29, and by GraphemeBreakProperty.txt
-;; and WordBreakProperty.txt. See also WordBreakTest.html.
-;;
-;; My original implementation of those specifications failed
-;; 6 of the 494 tests in auxiliary/WordBreakTest.txt, but it
-;; appeared to me that those tests were inconsistent with the
-;; word-breaking rules in UAX #29. John Cowan forwarded my
-;; bug report to the Unicode experts, and Mark Davis responded
-;; on 29 May 2007:
-;;
-;; Thanks for following up on this. I think you have found a problem in the
-;; formulation of word break, not the test. The intention was to break after a
-;; Sep character, as is done in Sentence break. So my previous suggestion was
-;; incorrect; instead, what we need is a new rule:
-;;
-;; *Break after paragraph separators.*
-;; WB3a. Sep
-;; I'll make a propose to the UTC for this.
-;;
-;; Here is Will's translation of those rules (including WB3a)
-;; into a finite state machine that searches forward within a
-;; string, looking for the next position at which a word break
-;; is allowed. The current state consists of an index i into
-;; the string and a summary of the left context whose rightmost
-;; character is at index i. The left context is usually
-;; determined by the character at index i, but there are three
-;; complications:
-;;
-;; Extend and Format characters are ignored unless they
-;; follow a separator or the beginning of the text.
-;; ALetter followed by MidLetter is treated specially.
-;; Numeric followed by MidNum is treated specially.
-;;
-;; In the implementation below, the left context ending at i
-;; is encoded by the following symbols:
-;;
-;; CR
-;; Sep (excluding CR)
-;; ALetter
-;; MidLetter
-;; ALetterMidLetter (ALetter followed by MidLetter)
-;; Numeric
-;; MidNum
-;; NumericMidNum (Numeric followed by MidNum)
-;; Katakana
-;; ExtendNumLet
-;; other (none of the above)
+;; Word-breaking is more generally useful, however.
 ;;
-;; Given a string s and an exact integer i (which need not be
-;; a valid index into s), returns the index of the next character
-;; that is not part of the word containing the character at i,
-;; or the length of s if the word containing the character at i
-;; extends through the end of s. If i is negative or a valid
-;; index into s, then the returned value will be greater than i.
+;; Word boundaries are defined by different rules in Unicode Standard
+;; Annex #29, and by GraphemeBreakProperty.txt and
+;; WordBreakProperty.txt. See also WordBreakTest.html.
 ;;
 ;;;--------------------------------------------------------------------------------
 