June 2026 - cmucl-cvs - mailman3.common-lisp.net

[Git][cmucl/cmucl][master] 2 commits: Fix #317: String comparisons done with code-units
by Raymond Toy (＠rtoy) 27 Jun '26

27 Jun '26

Raymond Toy pushed to branch master at cmucl / cmucl Commits: f1c6efeb by Raymond Toy at 2026-06-26T18:29:40-07:00 Fix #317: String comparisons done with code-units - - - - - 38111614 by Raymond Toy at 2026-06-26T18:29:40-07:00 Merge branch 'issue-317-string-compare-by-code-unit' into 'master' Fix #317: String comparisons done with code-units Closes #317 See merge request cmucl/cmucl!383 - - - - - 1 changed file: - src/code/string.lisp Changes: ===================================== src/code/string.lisp ===================================== @@ -280,37 +280,11 @@ ,(if lessp `nil `(- (the fixnum index) ,offset1))) - #-unicode ((,(if lessp 'char< 'char>) (schar string1 index) (schar string2 (+ (the fixnum index) (- start2 start1)))) (- (the fixnum index) ,offset1)) - #-unicode - (t nil) - #+unicode - (t - ;; Compare in code point order. See - ;; http://icu-project.org/docs/papers/utf16_code_point_order.html - (flet ((fixup (code) - (if (>= code #xe000) - (- code #x800) - (+ code #x2000)))) - (declare (inline fixup)) - (let* ((c1 (char-code (schar string1 index))) - (c2 (char-code (schar string2 - (+ (the fixnum index) - (- start2 start1)))))) - (cond ((and (>= c1 #xd800) - (>= c2 #xd800)) - (let ((fix-c1 (fixup c1)) - (fix-c2 (fixup c2))) - (if (,(if lessp '< '>) fix-c1 fix-c2) - (- (the fixnum index) ,offset1) - nil))) - (t - (if (,(if lessp '< '>) c1 c2) - (- (the fixnum index) ,offset1) - nil))))))) + (t nil)) ,(if equalp `(- (the fixnum end1) ,offset1) 'nil)))))) ) ; eval-when View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/9b6135db7475bbf23557e5… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/9b6135db7475bbf23557e5… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][issue-435-add-core-math-lisp-support] Fix typo in KERNEL package definition
by Raymond Toy (＠rtoy) 27 Jun '26

27 Jun '26

Raymond Toy pushed to branch issue-435-add-core-math-lisp-support at cmucl / cmucl Commits: 14f47f94 by Raymond Toy at 2026-06-26T18:20:46-07:00 Fix typo in KERNEL package definition Extra closing paren after "STANDARD-CHAR-TYPE-P" was accidentally added while fixing a merge conflict. Remove it. - - - - - 1 changed file: - src/code/exports.lisp Changes: ===================================== src/code/exports.lisp ===================================== @@ -2193,7 +2193,7 @@ "%SINCOS" "STANDARD-CHAR-TYPE" "MAKE-STANDARD-CHAR-TYPE" - "STANDARD-CHAR-TYPE-P") + "STANDARD-CHAR-TYPE-P" ;; Single-float functions "%ACOSF" View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/14f47f94f60133806291ef9… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/14f47f94f60133806291ef9… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][master] 2 commits: Fix #500: Refactor package-locked-error handling
by Raymond Toy (＠rtoy) 26 Jun '26

26 Jun '26

Raymond Toy pushed to branch master at cmucl / cmucl Commits: f0a597b9 by Raymond Toy at 2026-06-26T15:57:48-07:00 Fix #500: Refactor package-locked-error handling - - - - - 9b6135db by Raymond Toy at 2026-06-26T15:57:48-07:00 Merge branch 'issue-500-common-package-error-restart-function' into 'master' Fix #500: Refactor package-locked-error handling Closes #500 See merge request cmucl/cmucl!376 - - - - - 5 changed files: - src/code/defstruct.lisp - src/code/macros.lisp - src/code/package.lisp - src/i18n/locale/cmucl.pot - + tests/package-lock.lisp Changes: ===================================== src/code/defstruct.lisp ===================================== @@ -452,25 +452,11 @@ name-and-options))) (name (dd-name defstruct)) (pkg (symbol-package name))) - (when (and lisp::*enable-package-locked-errors* - pkg + (when (and pkg (ext:package-definition-lock pkg)) - (restart-case - (error 'lisp::package-locked-error - :package pkg - :format-control (intl:gettext "defining structure ~A") - :format-arguments (list name)) - (continue () - :report (lambda (stream) - (write-string (intl:gettext "Ignore the lock and continue") stream))) - (unlock-package () - :report (lambda (stream) - (write-string (intl:gettext "Disable package's definition lock then continue") stream)) - (setf (ext:package-definition-lock pkg) nil)) - (unlock-all () - :report (lambda (stream) - (write-string (intl:gettext "Unlock all packages, then continue") stream)) - (lisp::unlock-all-packages)))) + (lisp::signal-package-locked-error pkg :definition + (intl:gettext "defining structure ~A") + name)) (when (info declaration recognized name) (error (intl:gettext "Defstruct already names a declaration: ~S.") name)) (when (stringp (car slot-descriptions)) ===================================== src/code/macros.lisp ===================================== @@ -221,6 +221,13 @@ "Syntax like DEFMACRO, but defines a new type." (unless (symbolp name) (simple-program-error (intl:gettext "~S -- Type name not a symbol.") name)) + (let ((pkg (symbol-package name))) + (when (and pkg (ext:package-definition-lock pkg)) + (signal-package-locked-error pkg :definition + (intl:gettext "defining type ~A") + name))) + + #+nil (and lisp::*enable-package-locked-errors* (symbol-package name) (ext:package-definition-lock (symbol-package name)) ===================================== src/code/package.lisp ===================================== @@ -189,8 +189,46 @@ (ext:compiler-let ((*enable-package-locked-errors* nil)) ,@body)))) - -;; trap attempts to redefine a function in a locked package, and +;;; SIGNAL-PACKAGE-LOCKED-ERROR -- Internal +;;; +;;; This encapsulates signaling of package locked errors. LOCK-KIND +;;; should be one of the following which will clear the corresponding +;;; lock when the UNLOCK-PACKAGE restart is selected. +;;; +;;; :definition - resets package-definition-lock +;;; :namespace - resets package-lock +;;; +;;; Error is signaled only if *ENABLE-PACKAGE-LOCKED-ERRORS* is non-NIL. +(defun signal-package-locked-error (package lock-kind message-control &rest message-args) + (declare (type (member :definition :namespace) lock-kind)) + ;; During bootstrap, *ENABLE-PACKAGE-LOCKED-ERRORS* may not be + ;; bound. Treat that is if it were NIL, so nothing is signaled. + (when (and (boundp '*enable-package-locked-errors*) + *enable-package-locked-errors*) + (restart-case + (error 'lisp::package-locked-error + :package package + :format-control message-control + :format-arguments message-args) + (continue () + :report (lambda (stream) + (write-string (intl:gettext "Ignore the lock and continue") + stream))) + (unlock-package () + :report (lambda (stream) + (write-string (intl:gettext "Disable the package's definition-lock then continue") + stream)) + (ecase lock-kind + (:definition + (setf (ext:package-definition-lock package) nil)) + (:namespace + (setf (ext:package-lock package) nil)))) + (unlock-all () + :report (lambda (stream) + (write-string (intl:gettext "Unlock all packages, then continue") stream)) + (unlock-all-packages))))) + +;; Trap attempts to redefine a function in a locked package, and ;; signal a continuable error. (defun redefining-function (function replacement) (declare (ignore replacement)) @@ -201,30 +239,16 @@ (let ((package (symbol-package block-name))) (when package (when (package-definition-lock package) - (when (and (consp function) - (member (first function) - '(pcl::slot-accessor - pcl::method - pcl::fast-method - pcl::effective-method - pcl::ctor))) - (return-from redefining-function nil)) - (restart-case - (error 'package-locked-error - :package package - :format-control (intl:gettext "redefining function ~A") - :format-arguments (list function)) - (continue () - :report (lambda (stream) - (write-string (intl:gettext "Ignore the lock and continue") stream))) - (unlock-package () - :report (lambda (stream) - (write-string (intl:gettext "Disable package's definition-lock, then continue") stream)) - (setf (ext:package-definition-lock package) nil)) - (unlock-all () - :report (lambda (stream) - (write-string (intl:gettext "Disable all package locks, then continue") stream)) - (unlock-all-packages))))))))) + (unless (and (consp function) + (member (first function) + '(pcl::slot-accessor + pcl::method + pcl::fast-method + pcl::effective-method + pcl::ctor))) + (signal-package-locked-error package :definition + (intl:gettext "redefining function ~A") + function)))))))) ;;; This magical variable is T during initialization so Use-Package's of packages @@ -1438,24 +1462,10 @@ (name (symbol-name symbol)) (shadowing-symbols (package-%shadowing-symbols package))) (declare (list shadowing-symbols) (simple-string name)) - (when *enable-package-locked-errors* - (when (ext:package-lock package) - (restart-case - (error 'package-locked-error - :package package - :format-control (intl:gettext "uninterning symbol ~A") - :format-arguments (list name)) - (continue () - :report (lambda (stream) - (write-string (intl:gettext "Ignore the lock and continue") stream))) - (unlock-package () - :report (lambda (stream) - (write-string (intl:gettext "Disable package's lock then continue") stream)) - (setf (ext:package-lock package) nil)) - (unlock-all () - :report (lambda (stream) - (write-string (intl:gettext "Unlock all packages, then continue") stream)) - (unlock-all-packages))))) + (when (ext:package-lock package) + (signal-package-locked-error package :namespace + (intl:gettext "uninterning symbol ~A") + name)) ;; ;; If a name conflict is revealed, give use a chance to shadowing-import ;; one of the accessible symbols. @@ -1620,24 +1630,11 @@ "Makes SYMBOLS no longer exported from PACKAGE." (let ((package (package-or-lose package)) (syms ())) - (when *enable-package-locked-errors* - (when (ext:package-lock package) - (restart-case - (error 'package-locked-error - :package package - :format-control (intl:gettext "unexporting symbols ~A") - :format-arguments (list symbols)) - (continue () - :report (lambda (stream) - (write-string (intl:gettext "Ignore the lock and continue") stream))) - (unlock-package () - :report (lambda (stream) - (write-string (intl:gettext "Disable package's lock then continue") stream)) - (setf (ext:package-lock package) nil)) - (unlock-all () - :report (lambda (stream) - (write-string (intl:gettext "Unlock all packages, then continue") stream)) - (unlock-all-packages))))) + (when (ext:package-lock package) + (signal-package-locked-error package :namespace + (intl:gettext "unexporting symbols ~A") + symbols)) + (dolist (sym (symbol-listify symbols)) (multiple-value-bind (s w) (find-symbol (symbol-name sym) package) (cond ((or (not w) (not (eq s sym))) ===================================== src/i18n/locale/cmucl.pot ===================================== @@ -8196,20 +8196,20 @@ msgstr "" msgid "~&~@<Attempt to modify the locked package ~A, by ~3i~:_~?~:>" msgstr "" -#: src/code/package.lisp -msgid "redefining function ~A" +#: src/code/macros.lisp src/code/package.lisp +msgid "Ignore the lock and continue" msgstr "" -#: src/code/macros.lisp src/code/defstruct.lisp src/code/package.lisp -msgid "Ignore the lock and continue" +#: src/code/macros.lisp src/code/package.lisp +msgid "Disable the package's definition-lock then continue" msgstr "" -#: src/code/package.lisp -msgid "Disable package's definition-lock, then continue" +#: src/code/macros.lisp src/code/package.lisp +msgid "Unlock all packages, then continue" msgstr "" #: src/code/package.lisp -msgid "Disable all package locks, then continue" +msgid "redefining function ~A" msgstr "" #: src/code/package.lisp @@ -8459,14 +8459,6 @@ msgstr "" msgid "uninterning symbol ~A" msgstr "" -#: src/code/package.lisp -msgid "Disable package's lock then continue" -msgstr "" - -#: src/code/macros.lisp src/code/defstruct.lisp src/code/package.lisp -msgid "Unlock all packages, then continue" -msgstr "" - #: src/code/package.lisp msgid "prompt for a symbol to shadowing-import." msgstr "" @@ -14472,10 +14464,6 @@ msgstr "" msgid "defining structure ~A" msgstr "" -#: src/code/defstruct.lisp -msgid "Disable package's definition lock then continue" -msgstr "" - #: src/code/defstruct.lisp msgid "Defstruct already names a declaration: ~S." msgstr "" @@ -14832,10 +14820,6 @@ msgstr "" msgid "defining macro ~A" msgstr "" -#: src/code/macros.lisp -msgid "Disable the package's definition-lock then continue" -msgstr "" - #: src/code/macros.lisp msgid "Define a compiler-macro for NAME." msgstr "" @@ -14868,10 +14852,6 @@ msgstr "" msgid "defining type ~A" msgstr "" -#: src/code/macros.lisp -msgid "Disable package's definition-lock then continue" -msgstr "" - #: src/code/macros.lisp msgid "Deftype already names a declaration: ~S." msgstr "" ===================================== tests/package-lock.lisp ===================================== @@ -0,0 +1,91 @@ +;;; Regression tests for package-locked-error signaling. +;;; +;;; These tests exist as a baseline before refactoring the +;;; package-lock checks into a shared helper. They verify that +;;; package-locked-error is signaled at each call site we expect to +;;; check the lock; restart behavior is tested separately. + +(defpackage :package-lock-tests + (:use :cl :lisp-unit)) + +(in-package "PACKAGE-LOCK-TESTS") + +(defpackage :test-locked-package + (:use :cl)) + +(defmacro with-definition-locked ((package) &body body) + "Run BODY with PACKAGE's definition-lock enabled and namespace-lock + disabled, so failures from BODY can be attributed unambiguously to + the definition lock." + `(let ((p ,package)) + (setf (ext:package-definition-lock p) t + (ext:package-lock p) nil) + (assert-true (ext:package-definition-lock p)) + (assert-false (ext:package-lock p)) + (unwind-protect (progn ,@body) + (setf (ext:package-definition-lock p) nil + (ext:package-lock p) nil)))) + +(defmacro with-namespace-locked ((package) &body body) + "Run BODY with PACKAGE's namespace-lock enabled and definition-lock + disabled, so failures from BODY can be attributed unambiguously to + the namespace lock." + `(let ((p ,package)) + (setf (ext:package-definition-lock p) nil + (ext:package-lock p) t) + (assert-false (ext:package-definition-lock p)) + (assert-true (ext:package-lock p)) + (unwind-protect (progn ,@body) + (setf (ext:package-definition-lock p) nil + (ext:package-lock p) nil)))) + + +;;; ---- Definition-lock tests ---- + +(define-test package-locked.defmacro + (:tag :issues) + (with-definition-locked ((find-package :test-locked-package)) + (assert-error 'lisp::package-locked-error + (eval '(defmacro test-locked-package::a-macro (x) + `(list ,x)))))) + +(define-test package-locked.defun + (:tag :issues) + (with-definition-locked ((find-package :test-locked-package)) + (assert-error 'lisp::package-locked-error + (eval '(defun test-locked-package::a-fn (x) x))))) + +(define-test package-locked.deftype + (:tag :issues) + (with-definition-locked ((find-package :test-locked-package)) + (assert-error 'lisp::package-locked-error + (eval '(deftype test-locked-package::a-type () + 'integer))))) + +(define-test package-locked.defstruct + (:tag :issues) + (with-definition-locked ((find-package :test-locked-package)) + (assert-error 'lisp::package-locked-error + (eval '(defstruct test-locked-package::a-struct + slot-1 slot-2))))) + + +;;; ---- Namespace-lock tests ---- + +(define-test package-locked.unintern + (:tag :issues) + (let ((sym (intern "TO-BE-UNINTERNED" + (find-package :test-locked-package)))) + (with-namespace-locked ((find-package :test-locked-package)) + (assert-error 'lisp::package-locked-error + (unintern sym (find-package :test-locked-package)))))) + +(define-test package-locked.unexport + (:tag :issues) + (let* ((p (find-package :test-locked-package)) + (sym (intern "TO-BE-UNEXPORTED" p))) + (export sym p) + (with-namespace-locked (p) + (assert-error 'lisp::package-locked-error + (unexport sym p))))) + View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/5926490b13936966d04811… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/5926490b13936966d04811… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][master] 2 commits: Fix #504: Read denormals with correct rounding
by Raymond Toy (＠rtoy) 26 Jun '26

26 Jun '26

Raymond Toy pushed to branch master at cmucl / cmucl Commits: 20b8c4eb by Raymond Toy at 2026-06-26T14:12:04-07:00 Fix #504: Read denormals with correct rounding - - - - - 5926490b by Raymond Toy at 2026-06-26T14:12:04-07:00 Merge branch 'issue-504-read-denormals-with-rounding' into 'master' Fix #504: Read denormals with correct rounding Closes #504 See merge request cmucl/cmucl!380 - - - - - 3 changed files: - src/code/float.lisp - src/general-info/release-22a.md - tests/float.lisp Changes: ===================================== src/code/float.lisp ===================================== @@ -885,40 +885,57 @@ ;;; denormalized or underflows to 0. ;;; (defun scale-float-maybe-underflow (x exp) - (multiple-value-bind (sig old-exp) - (integer-decode-float x) + (declare (type (or single-float double-float) x) + (fixnum exp)) + (multiple-value-bind (sig old-exp float-sign) + (integer-decode-float x) (let* ((digits (float-digits x)) + (1+digits (1+ digits)) (new-exp (+ exp old-exp digits (etypecase x (single-float vm:single-float-bias) (double-float vm:double-float-bias)))) - (sign (if (minusp (float-sign x)) 1 0))) + (sign (if (minusp float-sign) 1 0))) (cond - ((< new-exp - (etypecase x - (single-float vm:single-float-normal-exponent-min) - (double-float vm:double-float-normal-exponent-min))) - (when (vm:current-float-trap :inexact) - (error 'floating-point-inexact :operation 'scale-float - :operands (list x exp))) - (when (vm:current-float-trap :underflow) - (error 'floating-point-underflow :operation 'scale-float - :operands (list x exp))) - (let ((shift (1- new-exp))) - ;; Is it necessary to have this IF here? Is there any case - ;; where (ash sig shift) won't return 0 when - ;; shift < -(digits-1)? - (if (< shift (- (1- digits))) + ((< new-exp + (etypecase x + (single-float vm:single-float-normal-exponent-min) + (double-float vm:double-float-normal-exponent-min))) + (when (vm:current-float-trap :inexact) + (error 'floating-point-inexact :operation 'scale-float + :operands (list x exp))) + (when (vm:current-float-trap :underflow) + (error 'floating-point-underflow :operation 'scale-float + :operands (list x exp))) + ;; To round correctly, let the hardware multiplier do the + ;; rounding: build a normal float whose stored exponent is + ;; bumped up by 1+DIGITS (which puts it safely in the normal + ;; range), then multiply by 2^-(1+DIGITS). The multiplier is + ;; an exact power of two, so the multiplication is exact + ;; apart from the unavoidable rounding step that expresses + ;; the product as a denormal, which the FPU performs in the + ;; current rounding mode. If the bumped exponent is zero or + ;; negative the bumped float would itself be a denormal -- + ;; losing the implicit 1 bit of SIG -- so handle that case + ;; explicitly by returning signed zero. + (let ((bumped-exp (+ new-exp 1+digits))) + (cond + ((<= bumped-exp 0) (etypecase x (single-float (single-from-bits sign 0 0)) - (double-float (double-from-bits sign 0 0))) + (double-float (double-from-bits sign 0 0)))) + (t (etypecase x - (single-float (single-from-bits sign 0 (ash sig shift))) - (double-float (double-from-bits sign 0 (ash sig shift))))))) - (t - (etypecase x - (single-float (single-from-bits sign new-exp sig)) - (double-float (double-from-bits sign new-exp sig)))))))) + (single-float + (* (single-from-bits sign bumped-exp sig) + (scale-float 1f0 (- 1+digits)))) + (double-float + (* (double-from-bits sign bumped-exp sig) + (scale-float 1d0 (- 1+digits))))))))) + (t + (etypecase x + (single-float (single-from-bits sign new-exp sig)) + (double-float (double-from-bits sign new-exp sig)))))))) ;;; SCALE-FLOAT-MAYBE-OVERFLOW -- Internal @@ -1135,42 +1152,74 @@ (assert (= len (the fixnum (1+ digits)))) (multiple-value-bind (f0) (floatit (ash bits -1)) - #+nil - (progn - (format t "x = ~A~%" x) - (format t "1: f0, f1 = ~A~%" f0) - (format t " scale = ~A~%" (1+ scale))) - (scale-float f0 (1+ scale)))) (t (multiple-value-bind (f0) (floatit bits) - #+nil - (progn - (format t "2: f0, f1 = ~A~%" f0) - (format t " scale = ~A~%" scale) - (format t "scale-float f0 = ~A~%" (scale-float f0 scale))) - (let ((min-exponent - ;; Compute the min (unbiased) exponent - (ecase format - (single-float - (- vm:single-float-normal-exponent-min - vm:single-float-bias - vm:single-float-digits)) - (double-float - (- vm:double-float-normal-exponent-min - vm:double-float-bias - vm:double-float-digits))))) - ;; F0 is always between 0.5 and 1. If - ;; SCALE is the min exponent, we have a - ;; denormal number just less than the - ;; least-positive float. We want to - ;; return the least-positive-float so - ;; multiply F0 by 2 (without adjusting - ;; SCALE) to get the nearest float. - (if (= scale min-exponent) - (scale-float (* 2 f0) scale) - (scale-float f0 scale)))))))) + (scale-float f0 scale)))))) + (denormal-excess () + ;; How many bits of precision the result loses by being + ;; denormal instead of normal. A normal-precision return + ;; would be BITS*2^(SCALE-DIGITS) with BITS having DIGITS + ;; bits. Once it's known that this representation will + ;; produce a denormal -- equivalently, that SCALE-FLOAT- + ;; MAYBE-UNDERFLOW would take the underflow branch -- + ;; (1 - BIAS) - SCALE bits of the mantissa fall below the + ;; denormal's narrower storage and must be rounded off. + ;; Zero in the normal range. + (let ((bias + (ecase format + (single-float vm:single-float-bias) + (double-float vm:double-float-bias)))) + (declare (fixnum bias)) + (max 0 (the fixnum + (- (the fixnum (- 1 bias)) + scale))))) + (round-denormal (fraction-and-guard rem excess) + ;; FRACTION-AND-GUARD has (1+ DIGITS) bits with one guard + ;; bit; round it to (- DIGITS EXCESS) bits using round-to- + ;; nearest, ties to even, with REM as the sticky tail. + ;; Drops EXCESS+1 low bits in a single step. This is the + ;; one rounding the denormal result undergoes; no + ;; subsequent SCALE-FLOAT call is needed, so there is no + ;; double rounding. + (declare (type unsigned-byte fraction-and-guard rem) + (fixnum excess)) + (let* ((shift (1+ excess)) + (low (ldb (byte shift 0) fraction-and-guard)) + (quot (ash fraction-and-guard (- shift))) + (halfway (ash 1 excess))) + (declare (fixnum shift)) + (cond ((< low halfway) quot) + ((> low halfway) (1+ quot)) + ((not (zerop rem)) (1+ quot)) + ((oddp quot) (1+ quot)) + (t quot)))) + (denormal-from-bits (mantissa excess) + ;; MANTISSA has at most (- DIGITS EXCESS) bits and is the + ;; stored significand of a denormal result. Denormal + ;; storage holds (1- DIGITS) bits, so rounding can carry + ;; into the smallest normal only when EXCESS = 1, in + ;; which case MANTISSA can be exactly (ASH 1 (1- DIGITS)). + (declare (fixnum excess)) + (let ((sign (if plusp 0 1))) + (case format + (single-float + (cond ((and (= excess 1) + (= mantissa + (ash 1 (1- vm:single-float-digits)))) + (single-from-bits + sign vm:single-float-normal-exponent-min 0)) + (t + (single-from-bits sign 0 mantissa)))) + (double-float + (cond ((and (= excess 1) + (= mantissa + (ash 1 (1- vm:double-float-digits)))) + (double-from-bits + sign vm:double-float-normal-exponent-min 0)) + (t + (double-from-bits sign 0 mantissa))))))) (floatit (bits) (let ((sign (if plusp 0 1))) (case format @@ -1188,16 +1237,36 @@ (declare (fixnum extra)) (cond ((/= extra 1) (assert (> extra 1))) - ((oddp fraction-and-guard) - (return - (if (zerop rem) - (float-and-scale - (if (zerop (logand fraction-and-guard 2)) - fraction-and-guard - (1+ fraction-and-guard))) - (float-and-scale (1+ fraction-and-guard))))) (t - (return (float-and-scale fraction-and-guard))))) + (return + (let ((excess (denormal-excess))) + (cond + ((zerop excess) + ;; Normal result: original odd/even tie-break. + (cond ((oddp fraction-and-guard) + (if (zerop rem) + (float-and-scale + (if (zerop + (logand fraction-and-guard 2)) + fraction-and-guard + (1+ fraction-and-guard))) + (float-and-scale + (1+ fraction-and-guard)))) + (t + (float-and-scale fraction-and-guard)))) + (t + ;; Denormal result: re-round directly to the + ;; denormal's narrower precision so the only + ;; rounding step happens here. Rounding to + ;; DIGITS first and re-rounding via + ;; SCALE-FLOAT-MAYBE-UNDERFLOW would double- + ;; round (e.g. 7.290983e-39 would land on an + ;; artifical tie at the 24-bit boundary). + (let ((mantissa + (round-denormal fraction-and-guard rem + excess))) + (declare (type unsigned-byte mantissa)) + (denormal-from-bits mantissa excess))))))))) (setq shifted-num (ash shifted-num -1)) (incf scale))))))) ===================================== src/general-info/release-22a.md ===================================== @@ -58,6 +58,8 @@ public domain. * #463: `double-double-float` is missing comparison operations between `double-double-float` and `double-float` * #474: Add functions to print and parse C-style hex floats. + * #504: Do correct rounding in `scale-float-maybe-underflow`. + This was causing some denormals to be read incorrectly. * Other changes: * Improvements to the PCL implementation of CLOS: * Changes to building procedure: ===================================== tests/float.lisp ===================================== @@ -179,7 +179,9 @@ (kernel::float-ratio-float (* 4 expo) 'double-float)) (assert-equal least-positive-double-float (kernel::float-ratio-float (* 494/100 expo) 'double-float)) - (assert-equal least-positive-double-float + ;; 988/100*10^-324 is very close to 2*least-positive (the exact ratio + ;; is 1.9997 * least-positive), so it rounds to 2*least-positive. + (assert-equal (* 2 least-positive-double-float) (kernel::float-ratio-float (* 988/100 expo) 'double-float))))) (define-test reader-error.small-single-floats @@ -678,8 +680,11 @@ (frob cdiv.maxima-case #c(5.43d-10 1.13d-100) #c(1.2d-311 5.7d-312) - #c(3.691993880674614517999740937026568563794896024143749539711267954d301 - -1.753697093319947872394996242210428954266103103602859195409591583d301) + ;; Compute the expected value using rational arithmetic after + ;; converting the complex numbers above to the equivalent + ;; complex rationals. + (/ (complex (rational 5.43d-10) (rational 1.13d-100)) + (complex (rational 1.2d-311) (rational 5.7d-312))) 52) ;; 12 ;; @@ -766,3 +771,258 @@ (coerce y '(complex single-float))) x y))) + +(define-test scale-float-underflow-rounding.single + (:tag :issues) + ;; SCALE-FLOAT into the denormal range must round to nearest, ties to + ;; even, instead of truncating the discarded bits. Each (X EXP BITS) + ;; triple gives a normal X, an exponent EXP to scale by, and the IEEE + ;; bits of the expected single-float result. + (ext:with-float-traps-masked (:underflow :inexact) + (dolist (case (list + ;; 1.7f0 * 2^-149: between denormals 1 and 2, closer to 2. + (list 1.7f0 -149 #x00000002) + ;; 1.5f0 * 2^-149: halfway between denormals 1 and 2; + ;; ties round to even -> 2. + (list 1.5f0 -149 #x00000002) + ;; 1.1f0 * 2^-149: closer to denormal 1. + (list 1.1f0 -149 #x00000001) + ;; 1.0001f0 * 2^-150: just above halfway between 0 and + ;; smallest denormal; rounds up to 1. + (list 1.0001f0 -150 #x00000001) + ;; 1.0f0 * 2^-150: exactly halfway between 0 and smallest + ;; denormal; ties round to even -> 0. + (list 1.0f0 -150 #x00000000) + ;; Largest single < 2 scaled by 2^-127: rounding carries + ;; into the implicit-1 position and produces the smallest + ;; normal number. + (list (kernel:make-single-float #x3fffffff) + -127 #x00800000))) + (destructuring-bind (x exp bits) case + (let ((result (scale-float x exp))) + (assert-equal bits (kernel:single-float-bits result) + x exp result)))))) + +(define-test scale-float-underflow-rounding.double + (:tag :issues) + ;; Like SCALE-FLOAT-UNDERFLOW-ROUNDING.SINGLE but for double-floats. + ;; Each (X EXP HI LO) gives a normal X, an exponent EXP, and the IEEE + ;; high and low bits of the expected double-float result. + (ext:with-float-traps-masked (:underflow :inexact) + (dolist (case (list + ;; 1.7d0 * 2^-1074: between denormals 1 and 2, closer to 2. + (list 1.7d0 -1074 0 2) + ;; 1.5d0 * 2^-1074: tie, rounds to even -> 2. + (list 1.5d0 -1074 0 2) + ;; 1.1d0 * 2^-1074: closer to denormal 1. + (list 1.1d0 -1074 0 1) + ;; 1.0001d0 * 2^-1075: just above halfway, rounds up. + (list 1.0001d0 -1075 0 1) + ;; 1.0d0 * 2^-1075: tie at the bottom, rounds to even -> 0. + (list 1.0d0 -1075 0 0) + ;; Largest double < 2 scaled by 2^-1023: rounding carries + ;; into the implicit-1 position and produces the smallest + ;; normal number. + (list (kernel:make-double-float #x3fffffff #xffffffff) + -1023 #x00100000 0))) + (destructuring-bind (x exp hi lo) case + (let ((result (scale-float x exp))) + (assert-equal hi (kernel:double-float-high-bits result) + x exp result) + (assert-equal lo (kernel:double-float-low-bits result) + x exp result)))))) + +(define-test scale-float-underflow-rounding.reader-single + (:tag :issues) + ;; The reader uses FLOAT-RATIO-FLOAT, which calls SCALE-FLOAT and + ;; hence SCALE-FLOAT-MAYBE-UNDERFLOW on denormal results. Reading + ;; small float literals must therefore also round to nearest. + (ext:with-float-traps-masked (:underflow :inexact) + ;; 1.1e-44 is closer to 8 * least-positive-single-float than to 7. + (assert-equal #x00000008 + (kernel:single-float-bits 1.1e-44)) + ;; 1.121e-44 (essentially the IEEE representation of 1.1e-44) reads + ;; as the same denormal. + (assert-equal #x00000008 + (kernel:single-float-bits 1.121e-44)))) + +(define-test scale-float-underflow-rounding.reader-double + (:tag :issues) + ;; Like the single-float reader test, in the double denormal range. + ;; 1.1d-322 is closest to denormal #x16 (= 22). + (ext:with-float-traps-masked (:underflow :inexact) + (assert-equal 0 (kernel:double-float-high-bits 1.1d-322)) + (assert-equal #x16 (kernel:double-float-low-bits 1.1d-322)))) + +(define-test float-ratio-float.denormal-double-rounding.single + (:tag :issues) + ;; Regression for the double-rounding bug exposed by reading + ;; "7.290983e-39". The exact rational lies at 5203019.332 * 2^-149, + ;; below halfway between denormals #x4f644b and #x4f644c. An earlier + ;; fix rounded to 24 bits first (lifting it to the artificial tie + ;; 5203019.5 * 2^-149) and then re-rounded ties-to-even to #x4f644c. + ;; FLOAT-RATIO-FLOAT now re-rounds directly to denormal precision in a + ;; single step, yielding the correctly-rounded #x4f644b. + (assert-equal #x4f644b + (kernel:single-float-bits 7.290983e-39))) + +(define-test float-ratio-float.denormal-low-below-halfway.single + (:tag :issues) + ;; Exercise ROUND-DENORMAL's `low < halfway' branch via FLOAT-RATIO- + ;; FLOAT. Value 11/10 * least-positive is 1.1 denormal-units; the + ;; loop sees several bits of low and the comparison must take the + ;; round-down path so the mantissa stays at 1. + (let ((x (* 11/10 (expt 2 -149)))) + (assert-equal #x00000001 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x))) + +(define-test float-ratio-float.denormal-low-above-halfway.single + (:tag :issues) + ;; ROUND-DENORMAL's `low > halfway' branch: 17/10 * least-positive + ;; (= 1.7 denormal-units), past halfway between 1 and 2, rounds up. + (let ((x (* 17/10 (expt 2 -149)))) + (assert-equal #x00000002 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x))) + +(define-test float-ratio-float.denormal-tie-to-even.single + (:tag :issues) + ;; ROUND-DENORMAL's tie path, REM = 0: exact halfway between two + ;; denormals rounds to even. 3/2 * least-positive ties between + ;; denormals 1 and 2; the even neighbour is 2. + (let ((x (* 3/2 (expt 2 -149)))) + (assert-equal #x00000002 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x)) + ;; 5/2 * least-positive ties between 2 and 3; even is 2. + (let ((x (* 5/2 (expt 2 -149)))) + (assert-equal #x00000002 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x)) + ;; 7/2 * least-positive ties between 3 and 4; even is 4. + (let ((x (* 7/2 (expt 2 -149)))) + (assert-equal #x00000004 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x)) + ;; 1/2 * least-positive ties between 0 and 1; even is 0. + (let ((x (* 1/2 (expt 2 -149)))) + (assert-equal #x00000000 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x))) + +(define-test float-ratio-float.denormal-tie-with-sticky.single + (:tag :issues) + ;; ROUND-DENORMAL's tie path with REM != 0: when the loop's + ;; FRACTION-AND-GUARD reaches an exact halfway pattern but the + ;; division has a nonzero remainder, the rounding must go up because + ;; the original rational is strictly above halfway. 3/2 * least- + ;; positive + a tiny fraction of least-positive lands just past tie 1-2, + ;; rounds up to 2. + (let ((x (+ (* 3/2 (expt 2 -149)) + (* (expt 2 -149) 1/1000000000)))) + (assert-equal #x00000002 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x)) + ;; 1/2 + tiny: just past tie 0-1, rounds up to 1. + (let ((x (+ (* 1/2 (expt 2 -149)) + (* (expt 2 -149) 1/1000000000)))) + (assert-equal #x00000001 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x))) + +(define-test float-ratio-float.denormal-excess.single + (:tag :issues) + ;; Exercise DENORMAL-EXCESS over a range of magnitudes. EXCESS + ;; varies inversely with the result's magnitude; each case has the + ;; result land on a chosen denormal so an off-by-one in the EXCESS + ;; computation would shift the result by a factor of two. + ;; EXCESS = 1: value near smallest normal; pick (2^23 - 1) * 2^-149, + ;; the largest denormal. + (let ((x (* (1- (expt 2 23)) (expt 2 -149)))) + (assert-equal #x007fffff + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x)) + ;; EXCESS = 4: small denormal, mantissa 2^19. + (let ((x (expt 2 -130))) + (assert-equal (ash 1 19) + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x)) + ;; EXCESS = 21: very small denormal, mantissa 8. + (let ((x (* 8 (expt 2 -149)))) + (assert-equal #x00000008 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x)) + ;; EXCESS = 23 (the maximum): only the bottom three denormals are + ;; reachable. 3 * least-positive gives mantissa 3. + (let ((x (* 3 (expt 2 -149)))) + (assert-equal #x00000003 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x))) + +(define-test float-ratio-float.denormal-carry-to-normal.single + (:tag :issues) + ;; DENORMAL-FROM-BITS's carry-into-smallest-normal branch. Rounding + ;; promotes the denormal mantissa to 2^(DIGITS-1), which doesn't fit + ;; in the denormal's stored mantissa width; the result must be the + ;; smallest normal (stored exponent = 1, mantissa = 0). + ;; + ;; (2^24 - 1)/2 * 2^-149 is an exact tie between the largest denormal + ;; (mantissa 2^23 - 1) and the smallest normal (2^-126); the latter + ;; has stored mantissa 0 which is even, so the tie rounds to it. + (let ((x (* (1- (expt 2 24)) 1/2 (expt 2 -149)))) + (assert-equal #x00800000 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x)) + ;; Just past that tie, also rounds up to smallest normal. + (let ((x (+ (* (1- (expt 2 24)) 1/2 (expt 2 -149)) + (* (expt 2 -149) 1/1000)))) + (assert-equal #x00800000 + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x)) + ;; Just below that tie, rounds down to the largest denormal. + (let ((x (- (* (1- (expt 2 24)) 1/2 (expt 2 -149)) + (* (expt 2 -149) 1/1000)))) + (assert-equal #x007fffff + (kernel:single-float-bits + (kernel::float-ratio-float x 'single-float)) + x))) + +(define-test float-ratio-float.denormal-double-rounding.double + (:tag :issues) + ;; Double-float equivalents. Pick a rational that lands strictly + ;; below halfway between two denormals after 53-bit rounding would + ;; otherwise lift to the halfway position. Verify a few denormal + ;; cases also work end-to-end through FLOAT-RATIO-FLOAT. + ;; + ;; 1.1d-322 -> mantissa #x16 (= 22). + (assert-equal 0 (kernel:double-float-high-bits 1.1d-322)) + (assert-equal #x16 (kernel:double-float-low-bits 1.1d-322)) + ;; Tie-to-even, double: 3/2 * least-positive ties between denormals + ;; 1 and 2; even neighbour is 2. + (let ((x (* 3/2 (expt 2 -1074)))) + (assert-equal 0 (kernel:double-float-high-bits + (kernel::float-ratio-float x 'double-float))) + (assert-equal 2 (kernel:double-float-low-bits + (kernel::float-ratio-float x 'double-float)))) + ;; Carry into smallest normal: (2^53 - 1)/2 * 2^-1074 exactly halfway + ;; between the largest double denormal and the smallest normal; + ;; rounds up to the smallest normal #x00100000_00000000. + (let* ((x (* (1- (expt 2 53)) 1/2 (expt 2 -1074))) + (r (kernel::float-ratio-float x 'double-float))) + (assert-equal #x00100000 (kernel:double-float-high-bits r) x) + (assert-equal 0 (kernel:double-float-low-bits r) x))) View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/06740b4098b7d3f22bb8ea… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/06740b4098b7d3f22bb8ea… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][issue-317-string-compare-by-code-unit] 8 commits: Fix #508: Save ansi-test log even if it fails CI
by Raymond Toy (＠rtoy) 26 Jun '26

26 Jun '26

Raymond Toy pushed to branch issue-317-string-compare-by-code-unit at cmucl / cmucl Commits: f373711b by Raymond Toy at 2026-05-31T17:57:04-07:00 Fix #508: Save ansi-test log even if it fails CI - - - - - 1b0a461c by Raymond Toy at 2026-05-31T17:57:04-07:00 Merge branch 'issue-508-ansi-test-log-artifact-on-failure' into 'master' Fix #508: Save ansi-test log even if it fails CI Closes #508 See merge request cmucl/cmucl!382 - - - - - fad6dd9e by Raymond Toy at 2026-06-02T17:03:21-07:00 Fix #512: CI retries select stages - - - - - 47971b15 by Raymond Toy at 2026-06-02T17:03:21-07:00 Merge branch 'issue-512-ci-retries-select-stages' into 'master' Fix #512: CI retries select stages Closes #512 See merge request cmucl/cmucl!386 - - - - - 2aa2eff9 by Raymond Toy at 2026-06-26T09:56:12-07:00 Fix #318: Add standard-char type - - - - - 06740b40 by Raymond Toy at 2026-06-26T09:56:13-07:00 Merge branch 'issue-318-add-concrete-standard-char-type' into 'master' Fix #318: Add standard-char type Closes #318 See merge request cmucl/cmucl!377 - - - - - 15598355 by Raymond Toy at 2026-06-26T12:59:09-07:00 Merge branch 'master' into issue-317-string-compare-by-code-unit - - - - - 7cfe5c42 by Raymond Toy at 2026-06-26T12:59:58-07:00 Use normal cmucl-expected-failures branch for ansi-test Since #318 has been fixed, the `etypecase.15` error shouldn't be happening anymore, so we can use the normal cmucl-expected-failures branch for the ansi-tests. - - - - - 7 changed files: - .gitlab-ci.yml - bin/run-ansi-tests.sh - src/code/exports.lisp - src/code/pred.lisp - src/code/type.lisp - src/i18n/locale/cmucl.pot - + tests/standard-char.lisp Changes: ===================================== .gitlab-ci.yml ===================================== @@ -29,6 +29,7 @@ workflow: # for building. .install: stage: install + retry: 1 artifacts: paths: - snapshot/ @@ -85,15 +86,22 @@ workflow: .ansi-test: stage: ansi-test artifacts: + # Always save artifacts even when the job fails. + when: always paths: - ansi-test.out + retry: 1 script: - bin/run-ansi-tests.sh -l dist/bin/lisp + after_script: + # This after_script always runs to save the log file even if + # run-ansi-tests.sh fails. - cp ../ansi-test/test.out ansi-test.out # Default configuration for running unit tests. .unit-test: stage: test + retry: 1 artifacts: paths: - test.log ===================================== bin/run-ansi-tests.sh ===================================== @@ -36,7 +36,7 @@ shift $((OPTIND - 1)) # Use branch cmucl-expected-failures in general since this branch # generally has the list of expected failures. This is the branch to # use on cmucl master in general. -BRANCH=cmucl-expected-failures-issue-317 +BRANCH=cmucl-expected-failures set -x if [ -d ../ansi-test ]; then ===================================== src/code/exports.lisp ===================================== @@ -2190,7 +2190,11 @@ "STANDARD-PPRINT-DISPATCH-TABLE-MODIFIED-ERROR" "%IEEE754-REM-PI/2" - "%SINCOS") + "%SINCOS" + + "STANDARD-CHAR-TYPE" + "MAKE-STANDARD-CHAR-TYPE" + "STANDARD-CHAR-TYPE-P") #+heap-overflow-check (:export "DYNAMIC-SPACE-OVERFLOW-WARNING-HIT" "DYNAMIC-SPACE-OVERFLOW-ERROR-HIT" ===================================== src/code/pred.lisp ===================================== @@ -291,6 +291,9 @@ (and (consp object) (%%typep (car object) (cons-type-car-type type)) (%%typep (cdr object) (cons-type-cdr-type type)))) + (kernel::standard-char-type + (and (characterp object) + (standard-char-p object))) (unknown-type ;; Parse it again to make sure it's really undefined. (let ((reparse (specifier-type (unknown-type-specifier type)))) ===================================== src/code/type.lisp ===================================== @@ -52,6 +52,7 @@ (define-type-class intersection) (define-type-class alien) (define-type-class cons) +(define-type-class standard-char) ;;; The Args-Type structure is used both to represent Values types and ;;; and Function types. @@ -363,6 +364,16 @@ *empty-type* (%make-cons-type car-type cdr-type))) +(defstruct (standard-char-type + (:include ctype + (class-info (type-class-or-lose 'standard-char)) + (:enumerable t)) + (:constructor %make-standard-char-type ()) + (:copier nil) + (:print-function %print-type))) + +(defun make-standard-char-type () + (%make-standard-char-type)) ;;; @@ -3293,6 +3304,121 @@ (cons-type-car-type type2)) cdr-int2))))) + +;;;; Standard-char type +(def-type-translator standard-char () + (make-standard-char-type)) + +(define-type-method (standard-char :unparse) (type) + (declare (ignore type)) + 'standard-char) + +(define-type-method (standard-char :simple-=) (type1 type2) + (declare (ignore type1 type2)) + (values t t)) + +(define-type-method (standard-char :simple-subtypep) (type1 type2) + (declare (ignore type1 type2)) + (values t t)) + +(defconstant +standard-chars+ + '(#\NEWLINE #\SPACE #\! #\" #\# #\$ #\% #\& #\' #$ #$ #\* #\+ #\, + #\- #\. #\/ #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9 #\: #\; #\< #\= + #\> #\? #\@ #\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M + #\N #\O #\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z #\[ #\\ #\] + #\^ #\_ #\` #\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m + #\n #\o #\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z #\{ + #\| #\} #\~) + "The set of characters in the STANDARD-CHAR type") + +(define-type-method (standard-char :simple-union) (type1 type2) + (declare (ignore type2)) + type1) + +(define-type-method (standard-char :simple-intersection) (type1 type2) + (declare (ignore type2)) + type1) + +;; (subtype standard-char other) +(define-type-method (standard-char :complex-subtypep-arg1) (type1 type2) + (declare (ignore type1)) + (cond ((csubtypep (specifier-type 'character) type2) + ;; STANDARD-CHAR is a subtype of CHARACTER/BASE-CHAR + (values t t)) + ((member-type-p type2) + ;; If TYPE2 is a member-type, check whether it contains all standard-chars + (values (let ((members (member-type-members type2))) + (every #'(lambda (c) + (member c members)) + +standard-chars+)) + t)) + (t + (values nil t)))) + +;; (subtypep other standard-char) +(define-type-method (standard-char :complex-subtypep-arg2) (type1 type2) + (declare (ignore type2)) + (cond ((member-type-p type1) + ;; If TYPE1 is a member-type, check whether it contains all + ;; standard-chars. + (values (every #'(lambda (c) + (member c +standard-chars+)) + (member-type-members type1)) + t)) + (t + (values nil t)))) + +(define-type-method (standard-char :complex-union) (type1 type2) + ;; The standard-char type could be in type1 or type2. Figure out + ;; which one is a standard-char. + (multiple-value-bind (sc other) + (if (standard-char-type-p type1) + (values type1 type2) + (values type2 type1)) + (cond + ((csubtypep (specifier-type 'character) other) + other) + ((and (member-type-p other) + ;; Check to see every member of OTHER is a STANDARD-CHAR. + (every #'(lambda (c) + (member c +standard-chars+)) + (member-type-members other))) + sc) + (t nil)))) + +(define-type-method (standard-char :complex-intersection) (type1 type2) + ;; The standard-char type could be in type1 or type2. Figure out + ;; which one is a standard-char. + (multiple-value-bind (sc other) + (if (standard-char-type-p type1) + (values type1 type2) + (values type2 type1)) + (cond + ((csubtypep (specifier-type 'character) other) + ;; STANDARD-CHAR intersect any super-type of CHARACTER is a + ;; STANDARD-CHAR. + sc) + (t + (block punt + ;; Look through OTHER and find OTHER contains any standard + ;; character. If so, collect them all. If there are, the + ;; intersection is a member-type of the collected characters. + (collect ((members)) + (dolist (ch +standard-chars+) + (multiple-value-bind (val win) + (ctypep ch other) + (unless win + (return-from punt nil)) + (when val + (members ch)))) + (cond ((null (members)) + c::*empty-type*) + ((= (length (members)) + (length +standard-chars+)) + sc) + (t + (make-member-type :members (members)))))))))) + ;;; TYPE-DIFFERENCE -- Interface ;;; @@ -3379,7 +3505,8 @@ (declare (type ctype type)) (etypecase type ((or numeric-type named-type member-type array-type - kernel::built-in-class cons-type) + kernel::built-in-class cons-type + standard-char-type) (values (%typep obj type) t)) (class (if (if (csubtypep type (specifier-type 'funcallable-instance)) @@ -3520,16 +3647,6 @@ "Type of characters that aren't base-char's. None in CMU CL." '(and character (not base-char))) -(deftype standard-char () - "Type corresponding to the charaters required by the standard." - '(member #\NEWLINE #\SPACE #\! #\" #\# #\$ #\% #\& #\' #$ #$ #\* #\+ #\, - #\- #\. #\/ #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9 #\: #\; #\< #\= - #\> #\? #\@ #\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M - #\N #\O #\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z #\[ #\\ #\] - #\^ #\_ #\` #\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m - #\n #\o #\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z #\{ - #\| #\} #\~)) - (deftype keyword () "Type for any keyword symbol." '(and symbol (satisfies keywordp))) ===================================== src/i18n/locale/cmucl.pot ===================================== @@ -1225,11 +1225,11 @@ msgid "Array dimensions is not a list, integer or *:~% ~S" msgstr "" #: src/code/type.lisp -msgid "Type of characters that aren't base-char's. None in CMU CL." +msgid "The set of characters in the STANDARD-CHAR type" msgstr "" #: src/code/type.lisp -msgid "Type corresponding to the charaters required by the standard." +msgid "Type of characters that aren't base-char's. None in CMU CL." msgstr "" #: src/code/type.lisp ===================================== tests/standard-char.lisp ===================================== @@ -0,0 +1,367 @@ +;;; Tests for standard-char + +(defpackage :standard-char-tests + (:use :cl :lisp-unit)) + +(in-package "STANDARD-CHAR-TESTS") + +;; For the following tests, we generally want to use +;; kernel::type-intersection and kernel::type-union directly to make +;; sure we test the intersection and union methods for standard-char. + +(define-test standard-char.typep + (:tag :issues) + (assert-true (typep #\a 'standard-char)) + (assert-false (typep #\tab 'standard-char)) + (assert-true (typep #\Z 'standard-char)) + (assert-true (typep #\Space 'standard-char)) + (assert-true (typep #\Newline 'standard-char)) + (assert-false (typep #\Rubout 'standard-char)) + (assert-false (typep 5 'standard-char)) + (assert-false (typep "hello" 'standard-char)) + (assert-false (typep nil 'standard-char)) + (assert-false (typep t 'standard-char)) + + (assert-equal (values t t) + (subtypep 'standard-char 'character)) + (assert-equal (values nil t) + (subtypep 'character 'standard-char)) + (assert-equal (values t t) + (subtypep 'standard-char 'base-char)) + (assert-equal (values nil t) + (subtypep 'base-char 'standard-char))) + +(define-test standard-char.etypecase-15 + (:tag :issues) + (assert-equal (values t t) + (c::type= + (c::specifier-type + '(not (or pathname boolean standard-char standard-object character file-error))) + (c::specifier-type + '(not (or file-error character standard-object standard-char boolean pathname)))))) + +(define-test standard-char.identity + (:tag :issues) + (let ((a (c::specifier-type 'standard-char)) + (b (c::specifier-type 'standard-char))) + ;; Should be EQ due to internal caching. + (assert-eq a b))) + +(define-test standard-char.parsing + (:tag :issues) + (assert-eq 'standard-char + (c::type-specifier (c::specifier-type 'standard-char)))) + +(define-test standard-char.predicate + (:tag :issues) + (assert-true (c::standard-char-type-p (c::specifier-type 'standard-char)))) + +(define-test standard-char.simple-subtypep + (:tag :issues) + (assert-equal (values t t) + (c::type= (c::specifier-type 'standard-char) + (c::specifier-type 'standard-char))) + (assert-equal (values t t) + (subtypep 'standard-char 'standard-char))) + +(define-test standard-char.complex-subtype-arg1 + (:tag :issues) + ;; STANDARD-CHAR is a subtype of CHARACTER and T. + (assert-equal (values t t) + (subtypep 'standard-char 'character)) + (assert-equal (values t t) + (subtypep 'standard-char t)) + + ;; Not a subtype of disjoint types. + (assert-equal (values nil t) + (subtypep 'standard-char 'integer)) + (assert-equal (values nil t) + (subtypep 'standard-char 'symbol)) + (assert-equal (values nil t) + (subtypep 'standard-char 'pathname)) + + ;; Subtype of a member-type that contains all standard chars. + (assert-equal (values t t) + (subtypep 'standard-char + `(member ,@kernel::+standard-chars+))) + ;; Not a subtype of a member-type missing even one standard char. + (assert-equal (values nil t) + (subtypep 'standard-char '(member #\a)))) + +(define-test standard-char.complex-subtypep-arg + (:tag :issues) + ;; All standard chars: subtype. + (assert-equal (values t t) + (subtypep '(member #\a) 'standard-char)) + (assert-equal (values t t) + (subtypep '(member #\Space #\Newline) 'standard-char)) + + ;; Mixed — character but not standard. + (assert-equal (values nil t) + (subtypep '(member #\Tab) 'standard-char)) + (assert-equal (values nil t) + (subtypep '(member #\Rubout) 'standard-char)) + + ;; Mixed — non-character members. This was the crash case. + (assert-equal (values nil t) + (subtypep '(member t) 'standard-char)) + (assert-equal (values nil t) + (subtypep '(member t nil) 'standard-char)) + + ;; Mixed — some standard, some not. + (assert-equal (values nil t) + (subtypep '(member #\a #\Tab) 'standard-char)) + (assert-equal (values nil t) + (subtypep '(member #\a t) 'standard-char)) + + ;; CHARACTER is not a subtype of STANDARD-CHAR (non-standard chars exist). + (assert-equal (values nil t) + (subtypep 'character 'standard-char))) + +(define-test standard-char.complex-union + (:tag :issues) + ;; Absorbed by supertype. + (assert-equal (values t t) + (c::type= (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type 'character)) + (c::specifier-type 'character))) + + (assert-equal (values t t) + (c::type= (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type 't)) + (c::specifier-type 't))) + + ;; All-standard-chars member-type absorbed back into STANDARD-CHAR. + (assert-equal (values t t) + (c::type= (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type '(member #\a #\b))) + (c::specifier-type 'standard-char))) + + ;; Disjoint type stays as a union (the bug-fix case). + ;; The result should NOT be a single member-type containing + ;; T, NIL, and 96 standard chars. + (let ((result (c::specifier-type '(or boolean standard-char)))) + (assert-true (c::union-type-p result)) + (assert-equal 2 (length (c::union-type-types result))) + (assert-true (notany (lambda (m) + (and (c::member-type-p m) + (some #'characterp (c::member-type-members m)) + (some (complement #'characterp) + (c::member-type-members m)))) + (c::union-type-types result)))) + + + ;; Permutation invariance — the original etypecase.15 trigger. + (assert-equal (values t t) + (c::type= (c::specifier-type '(or boolean standard-char)) + (c::specifier-type '(or standard-char boolean)))) + + (assert-equal (values t t) + (c::type= (c::specifier-type + '(not (or pathname boolean standard-char standard-object character file-error))) + (c::specifier-type + '(not (or file-error character standard-object standard-char boolean pathname))))) + + ;; Member-type with non-standard chars — kept symbolically separate. + (let ((result (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type '(member #\Tab))))) + ;; Should not collapse into a 97-element MEMBER. + (assert-false (c::member-type-p result)) + (assert-true (c::union-type-p result)))) + +(define-test standard-char.complex-intersection + (:tag :issues) + ;; Intersection with supertype is STANDARD-CHAR. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 'character)) + (c::specifier-type 'standard-char))) + + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 't)) + (c::specifier-type 'standard-char))) + + ;; Intersection with disjoint type is empty. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 'integer)) + c::*empty-type*)) + + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 'symbol)) + c::*empty-type*)) + + ;; Intersection with member-type — filtered to standard chars. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type '(member #\a #\Tab #\b))) + (c::specifier-type '(member #\a #\b)))) + + ;; All-non-standard members → empty. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type '(member #\Tab #\Rubout))) + c::*empty-type*)) + + ;; All-standard members → that member-type unchanged. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type '(member #\a))) + (c::specifier-type '(member #\a))))) + + + +(define-test standard-char.negation + (:tag :issues) + ;; NOT STANDARD-CHAR catches non-standard characters. + (assert-true (typep #\Tab '(not standard-char))) + (assert-false (typep #\a '(not standard-char))) + + ;; AND CHARACTER (NOT STANDARD-CHAR) is the non-standard chars. + (assert-true (typep #\Tab '(and character (not standard-char)))) + (assert-false (typep #\a '(and character (not standard-char)))) + (assert-false (typep 5 '(and character (not standard-char)))) + + ;; Permutation invariance with negation, multiple types. + (assert-equal (values t t) + (c::type= (c::specifier-type '(and standard-char (not (member #\a)))) + (c::specifier-type '(and (not (member #\a)) standard-char))))) + +(define-test standard-char.etypecase + (:tag :issues) + ;; Test that etypecase works using ASCII characters which will cover + ;; standard-char values and other characters. + (dotimes (k 128) + (let* ((ch (code-char k)) + (expected (if (standard-char-p ch) + :is-standard :is-other)) + (actual (handler-case + (etypecase ch + (standard-char :is-standard) + (character :is-other)) + (error () + :error)))) + (assert-eql expected actual ch)))) + +(define-test standard-char.intersection-character-both-orderings + (:tag :issues) + ;; Standard-char intersect character = standard-char, regardless of argument order. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-intersection (kernel::specifier-type 'standard-char) + (kernel::specifier-type 'character)))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-intersection (kernel::specifier-type 'character) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.intersection-disjoint-both-orderings + (:tag :issues) + (assert-equal (values t t) + (kernel::type= + kernel::*empty-type* + (kernel::type-intersection (kernel::specifier-type 'standard-char) + (kernel::specifier-type 'integer)))) + (assert-equal (values t t) + (kernel::type= + kernel::*empty-type* + (kernel::type-intersection (kernel::specifier-type 'integer) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.intersection-member-both-orderings + (:tag :issues) + ;; Filter member-type to standard chars only. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type '(member #\a #\b)) + (kernel::type-intersection (kernel::specifier-type 'standard-char) + (kernel::specifier-type '(member #\a #\Tab #\b))))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type '(member #\a #\b)) + (kernel::type-intersection (kernel::specifier-type '(member #\a #\Tab #\b)) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.union-character-both-orderings + (:tag :issues) + ;; Standard-char union character = character. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'character) + (kernel::type-union (kernel::specifier-type 'standard-char) + (kernel::specifier-type 'character)))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'character) + (kernel::type-union (kernel::specifier-type 'character) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.union-member-of-standard-both-orderings + (:tag :issues) + ;; Standard-char absorbs all-standard member-type. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-union (kernel::specifier-type 'standard-char) + (kernel::specifier-type '(member #\a #\b))))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-union (kernel::specifier-type '(member #\a #\b)) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.union-disjoint-stays-symbolic-both-orderings + (:tag :issues) + ;; (or boolean standard-char) and reverse — both should stay symbolic + ;; rather than collapsing into a giant member-type. + (let ((r1 (kernel::specifier-type '(or boolean standard-char))) + (r2 (kernel::specifier-type '(or standard-char boolean)))) + (assert-true (kernel::union-type-p r1)) + (assert-true (kernel::union-type-p r2)) + (assert-equal (values t t) + (kernel::type= r1 r2)) + ;; Neither should contain a member-type with both characters + ;; and non-characters. + (dolist (m (kernel::union-type-types r1)) + (assert-false (and (kernel::member-type-p m) + (some #'characterp (kernel::member-type-members m)) + (some (complement #'characterp) + (kernel::member-type-members m))))))) + +(defun assert-commutative-union (type-a-spec type-b-spec) + "Assert that union(A, B) and union(B, A) produce type= results." + (assert-equal (values t t) + (kernel::type= + (kernel::type-union (kernel::specifier-type type-a-spec) + (kernel::specifier-type type-b-spec)) + (kernel::type-union (kernel::specifier-type type-b-spec) + (kernel::specifier-type type-a-spec))))) + +(defun assert-commutative-intersection (type-a-spec type-b-spec) + (assert-equal (values t t) + (kernel::type= + (kernel::type-intersection (kernel::specifier-type type-a-spec) + (kernel::specifier-type type-b-spec)) + (kernel::type-intersection (kernel::specifier-type type-b-spec) + (kernel::specifier-type type-a-spec))))) + +(define-test standard-char.commutativity + (:tag :issues) + (assert-commutative-union 'standard-char 'character) + (assert-commutative-union 'standard-char 'integer) + (assert-commutative-union 'standard-char '(member #\a #\b)) + (assert-commutative-union 'standard-char '(member #\Tab)) + (assert-commutative-union 'standard-char 'boolean) + (assert-commutative-union 'standard-char '(not character)) + (assert-commutative-union 'standard-char 't) + (assert-commutative-intersection 'standard-char 'character) + (assert-commutative-intersection 'standard-char 'integer) + (assert-commutative-intersection 'standard-char '(member #\a #\b)) + (assert-commutative-intersection 'standard-char '(member #\Tab)) + (assert-commutative-intersection 'standard-char 'boolean) + (assert-commutative-intersection 'standard-char '(not character)) + (assert-commutative-intersection 'standard-char 't)) View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/d051b51a054ffd8eb58db9… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/d051b51a054ffd8eb58db9… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][issue-504-read-denormals-with-rounding] 8 commits: Fix #508: Save ansi-test log even if it fails CI
by Raymond Toy (＠rtoy) 26 Jun '26

26 Jun '26

Raymond Toy pushed to branch issue-504-read-denormals-with-rounding at cmucl / cmucl Commits: f373711b by Raymond Toy at 2026-05-31T17:57:04-07:00 Fix #508: Save ansi-test log even if it fails CI - - - - - 1b0a461c by Raymond Toy at 2026-05-31T17:57:04-07:00 Merge branch 'issue-508-ansi-test-log-artifact-on-failure' into 'master' Fix #508: Save ansi-test log even if it fails CI Closes #508 See merge request cmucl/cmucl!382 - - - - - fad6dd9e by Raymond Toy at 2026-06-02T17:03:21-07:00 Fix #512: CI retries select stages - - - - - 47971b15 by Raymond Toy at 2026-06-02T17:03:21-07:00 Merge branch 'issue-512-ci-retries-select-stages' into 'master' Fix #512: CI retries select stages Closes #512 See merge request cmucl/cmucl!386 - - - - - 2aa2eff9 by Raymond Toy at 2026-06-26T09:56:12-07:00 Fix #318: Add standard-char type - - - - - 06740b40 by Raymond Toy at 2026-06-26T09:56:13-07:00 Merge branch 'issue-318-add-concrete-standard-char-type' into 'master' Fix #318: Add standard-char type Closes #318 See merge request cmucl/cmucl!377 - - - - - c8825b9f by Raymond Toy at 2026-06-26T12:31:14-07:00 Merge branch 'master' into issue-504-read-denormals-with-rounding - - - - - 5896e64b by Raymond Toy at 2026-06-26T12:34:24-07:00 Get sign from integer-decode-float Instead of calling `float-sign` to get the sign, `integer-decode-float` already computes the sign (as -1 or +1), so use that to determine the sign bit to use. - - - - - 7 changed files: - .gitlab-ci.yml - src/code/exports.lisp - src/code/float.lisp - src/code/pred.lisp - src/code/type.lisp - src/i18n/locale/cmucl.pot - + tests/standard-char.lisp Changes: ===================================== .gitlab-ci.yml ===================================== @@ -29,6 +29,7 @@ workflow: # for building. .install: stage: install + retry: 1 artifacts: paths: - snapshot/ @@ -85,15 +86,22 @@ workflow: .ansi-test: stage: ansi-test artifacts: + # Always save artifacts even when the job fails. + when: always paths: - ansi-test.out + retry: 1 script: - bin/run-ansi-tests.sh -l dist/bin/lisp + after_script: + # This after_script always runs to save the log file even if + # run-ansi-tests.sh fails. - cp ../ansi-test/test.out ansi-test.out # Default configuration for running unit tests. .unit-test: stage: test + retry: 1 artifacts: paths: - test.log ===================================== src/code/exports.lisp ===================================== @@ -2190,7 +2190,11 @@ "STANDARD-PPRINT-DISPATCH-TABLE-MODIFIED-ERROR" "%IEEE754-REM-PI/2" - "%SINCOS") + "%SINCOS" + + "STANDARD-CHAR-TYPE" + "MAKE-STANDARD-CHAR-TYPE" + "STANDARD-CHAR-TYPE-P") #+heap-overflow-check (:export "DYNAMIC-SPACE-OVERFLOW-WARNING-HIT" "DYNAMIC-SPACE-OVERFLOW-ERROR-HIT" ===================================== src/code/float.lisp ===================================== @@ -887,7 +887,7 @@ (defun scale-float-maybe-underflow (x exp) (declare (type (or single-float double-float) x) (fixnum exp)) - (multiple-value-bind (sig old-exp) + (multiple-value-bind (sig old-exp float-sign) (integer-decode-float x) (let* ((digits (float-digits x)) (1+digits (1+ digits)) @@ -895,7 +895,7 @@ (etypecase x (single-float vm:single-float-bias) (double-float vm:double-float-bias)))) - (sign (if (minusp (float-sign x)) 1 0))) + (sign (if (minusp float-sign) 1 0))) (cond ((< new-exp (etypecase x ===================================== src/code/pred.lisp ===================================== @@ -291,6 +291,9 @@ (and (consp object) (%%typep (car object) (cons-type-car-type type)) (%%typep (cdr object) (cons-type-cdr-type type)))) + (kernel::standard-char-type + (and (characterp object) + (standard-char-p object))) (unknown-type ;; Parse it again to make sure it's really undefined. (let ((reparse (specifier-type (unknown-type-specifier type)))) ===================================== src/code/type.lisp ===================================== @@ -52,6 +52,7 @@ (define-type-class intersection) (define-type-class alien) (define-type-class cons) +(define-type-class standard-char) ;;; The Args-Type structure is used both to represent Values types and ;;; and Function types. @@ -363,6 +364,16 @@ *empty-type* (%make-cons-type car-type cdr-type))) +(defstruct (standard-char-type + (:include ctype + (class-info (type-class-or-lose 'standard-char)) + (:enumerable t)) + (:constructor %make-standard-char-type ()) + (:copier nil) + (:print-function %print-type))) + +(defun make-standard-char-type () + (%make-standard-char-type)) ;;; @@ -3293,6 +3304,121 @@ (cons-type-car-type type2)) cdr-int2))))) + +;;;; Standard-char type +(def-type-translator standard-char () + (make-standard-char-type)) + +(define-type-method (standard-char :unparse) (type) + (declare (ignore type)) + 'standard-char) + +(define-type-method (standard-char :simple-=) (type1 type2) + (declare (ignore type1 type2)) + (values t t)) + +(define-type-method (standard-char :simple-subtypep) (type1 type2) + (declare (ignore type1 type2)) + (values t t)) + +(defconstant +standard-chars+ + '(#\NEWLINE #\SPACE #\! #\" #\# #\$ #\% #\& #\' #$ #$ #\* #\+ #\, + #\- #\. #\/ #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9 #\: #\; #\< #\= + #\> #\? #\@ #\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M + #\N #\O #\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z #\[ #\\ #\] + #\^ #\_ #\` #\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m + #\n #\o #\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z #\{ + #\| #\} #\~) + "The set of characters in the STANDARD-CHAR type") + +(define-type-method (standard-char :simple-union) (type1 type2) + (declare (ignore type2)) + type1) + +(define-type-method (standard-char :simple-intersection) (type1 type2) + (declare (ignore type2)) + type1) + +;; (subtype standard-char other) +(define-type-method (standard-char :complex-subtypep-arg1) (type1 type2) + (declare (ignore type1)) + (cond ((csubtypep (specifier-type 'character) type2) + ;; STANDARD-CHAR is a subtype of CHARACTER/BASE-CHAR + (values t t)) + ((member-type-p type2) + ;; If TYPE2 is a member-type, check whether it contains all standard-chars + (values (let ((members (member-type-members type2))) + (every #'(lambda (c) + (member c members)) + +standard-chars+)) + t)) + (t + (values nil t)))) + +;; (subtypep other standard-char) +(define-type-method (standard-char :complex-subtypep-arg2) (type1 type2) + (declare (ignore type2)) + (cond ((member-type-p type1) + ;; If TYPE1 is a member-type, check whether it contains all + ;; standard-chars. + (values (every #'(lambda (c) + (member c +standard-chars+)) + (member-type-members type1)) + t)) + (t + (values nil t)))) + +(define-type-method (standard-char :complex-union) (type1 type2) + ;; The standard-char type could be in type1 or type2. Figure out + ;; which one is a standard-char. + (multiple-value-bind (sc other) + (if (standard-char-type-p type1) + (values type1 type2) + (values type2 type1)) + (cond + ((csubtypep (specifier-type 'character) other) + other) + ((and (member-type-p other) + ;; Check to see every member of OTHER is a STANDARD-CHAR. + (every #'(lambda (c) + (member c +standard-chars+)) + (member-type-members other))) + sc) + (t nil)))) + +(define-type-method (standard-char :complex-intersection) (type1 type2) + ;; The standard-char type could be in type1 or type2. Figure out + ;; which one is a standard-char. + (multiple-value-bind (sc other) + (if (standard-char-type-p type1) + (values type1 type2) + (values type2 type1)) + (cond + ((csubtypep (specifier-type 'character) other) + ;; STANDARD-CHAR intersect any super-type of CHARACTER is a + ;; STANDARD-CHAR. + sc) + (t + (block punt + ;; Look through OTHER and find OTHER contains any standard + ;; character. If so, collect them all. If there are, the + ;; intersection is a member-type of the collected characters. + (collect ((members)) + (dolist (ch +standard-chars+) + (multiple-value-bind (val win) + (ctypep ch other) + (unless win + (return-from punt nil)) + (when val + (members ch)))) + (cond ((null (members)) + c::*empty-type*) + ((= (length (members)) + (length +standard-chars+)) + sc) + (t + (make-member-type :members (members)))))))))) + ;;; TYPE-DIFFERENCE -- Interface ;;; @@ -3379,7 +3505,8 @@ (declare (type ctype type)) (etypecase type ((or numeric-type named-type member-type array-type - kernel::built-in-class cons-type) + kernel::built-in-class cons-type + standard-char-type) (values (%typep obj type) t)) (class (if (if (csubtypep type (specifier-type 'funcallable-instance)) @@ -3520,16 +3647,6 @@ "Type of characters that aren't base-char's. None in CMU CL." '(and character (not base-char))) -(deftype standard-char () - "Type corresponding to the charaters required by the standard." - '(member #\NEWLINE #\SPACE #\! #\" #\# #\$ #\% #\& #\' #$ #$ #\* #\+ #\, - #\- #\. #\/ #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9 #\: #\; #\< #\= - #\> #\? #\@ #\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M - #\N #\O #\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z #\[ #\\ #\] - #\^ #\_ #\` #\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m - #\n #\o #\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z #\{ - #\| #\} #\~)) - (deftype keyword () "Type for any keyword symbol." '(and symbol (satisfies keywordp))) ===================================== src/i18n/locale/cmucl.pot ===================================== @@ -1225,11 +1225,11 @@ msgid "Array dimensions is not a list, integer or *:~% ~S" msgstr "" #: src/code/type.lisp -msgid "Type of characters that aren't base-char's. None in CMU CL." +msgid "The set of characters in the STANDARD-CHAR type" msgstr "" #: src/code/type.lisp -msgid "Type corresponding to the charaters required by the standard." +msgid "Type of characters that aren't base-char's. None in CMU CL." msgstr "" #: src/code/type.lisp ===================================== tests/standard-char.lisp ===================================== @@ -0,0 +1,367 @@ +;;; Tests for standard-char + +(defpackage :standard-char-tests + (:use :cl :lisp-unit)) + +(in-package "STANDARD-CHAR-TESTS") + +;; For the following tests, we generally want to use +;; kernel::type-intersection and kernel::type-union directly to make +;; sure we test the intersection and union methods for standard-char. + +(define-test standard-char.typep + (:tag :issues) + (assert-true (typep #\a 'standard-char)) + (assert-false (typep #\tab 'standard-char)) + (assert-true (typep #\Z 'standard-char)) + (assert-true (typep #\Space 'standard-char)) + (assert-true (typep #\Newline 'standard-char)) + (assert-false (typep #\Rubout 'standard-char)) + (assert-false (typep 5 'standard-char)) + (assert-false (typep "hello" 'standard-char)) + (assert-false (typep nil 'standard-char)) + (assert-false (typep t 'standard-char)) + + (assert-equal (values t t) + (subtypep 'standard-char 'character)) + (assert-equal (values nil t) + (subtypep 'character 'standard-char)) + (assert-equal (values t t) + (subtypep 'standard-char 'base-char)) + (assert-equal (values nil t) + (subtypep 'base-char 'standard-char))) + +(define-test standard-char.etypecase-15 + (:tag :issues) + (assert-equal (values t t) + (c::type= + (c::specifier-type + '(not (or pathname boolean standard-char standard-object character file-error))) + (c::specifier-type + '(not (or file-error character standard-object standard-char boolean pathname)))))) + +(define-test standard-char.identity + (:tag :issues) + (let ((a (c::specifier-type 'standard-char)) + (b (c::specifier-type 'standard-char))) + ;; Should be EQ due to internal caching. + (assert-eq a b))) + +(define-test standard-char.parsing + (:tag :issues) + (assert-eq 'standard-char + (c::type-specifier (c::specifier-type 'standard-char)))) + +(define-test standard-char.predicate + (:tag :issues) + (assert-true (c::standard-char-type-p (c::specifier-type 'standard-char)))) + +(define-test standard-char.simple-subtypep + (:tag :issues) + (assert-equal (values t t) + (c::type= (c::specifier-type 'standard-char) + (c::specifier-type 'standard-char))) + (assert-equal (values t t) + (subtypep 'standard-char 'standard-char))) + +(define-test standard-char.complex-subtype-arg1 + (:tag :issues) + ;; STANDARD-CHAR is a subtype of CHARACTER and T. + (assert-equal (values t t) + (subtypep 'standard-char 'character)) + (assert-equal (values t t) + (subtypep 'standard-char t)) + + ;; Not a subtype of disjoint types. + (assert-equal (values nil t) + (subtypep 'standard-char 'integer)) + (assert-equal (values nil t) + (subtypep 'standard-char 'symbol)) + (assert-equal (values nil t) + (subtypep 'standard-char 'pathname)) + + ;; Subtype of a member-type that contains all standard chars. + (assert-equal (values t t) + (subtypep 'standard-char + `(member ,@kernel::+standard-chars+))) + ;; Not a subtype of a member-type missing even one standard char. + (assert-equal (values nil t) + (subtypep 'standard-char '(member #\a)))) + +(define-test standard-char.complex-subtypep-arg + (:tag :issues) + ;; All standard chars: subtype. + (assert-equal (values t t) + (subtypep '(member #\a) 'standard-char)) + (assert-equal (values t t) + (subtypep '(member #\Space #\Newline) 'standard-char)) + + ;; Mixed — character but not standard. + (assert-equal (values nil t) + (subtypep '(member #\Tab) 'standard-char)) + (assert-equal (values nil t) + (subtypep '(member #\Rubout) 'standard-char)) + + ;; Mixed — non-character members. This was the crash case. + (assert-equal (values nil t) + (subtypep '(member t) 'standard-char)) + (assert-equal (values nil t) + (subtypep '(member t nil) 'standard-char)) + + ;; Mixed — some standard, some not. + (assert-equal (values nil t) + (subtypep '(member #\a #\Tab) 'standard-char)) + (assert-equal (values nil t) + (subtypep '(member #\a t) 'standard-char)) + + ;; CHARACTER is not a subtype of STANDARD-CHAR (non-standard chars exist). + (assert-equal (values nil t) + (subtypep 'character 'standard-char))) + +(define-test standard-char.complex-union + (:tag :issues) + ;; Absorbed by supertype. + (assert-equal (values t t) + (c::type= (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type 'character)) + (c::specifier-type 'character))) + + (assert-equal (values t t) + (c::type= (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type 't)) + (c::specifier-type 't))) + + ;; All-standard-chars member-type absorbed back into STANDARD-CHAR. + (assert-equal (values t t) + (c::type= (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type '(member #\a #\b))) + (c::specifier-type 'standard-char))) + + ;; Disjoint type stays as a union (the bug-fix case). + ;; The result should NOT be a single member-type containing + ;; T, NIL, and 96 standard chars. + (let ((result (c::specifier-type '(or boolean standard-char)))) + (assert-true (c::union-type-p result)) + (assert-equal 2 (length (c::union-type-types result))) + (assert-true (notany (lambda (m) + (and (c::member-type-p m) + (some #'characterp (c::member-type-members m)) + (some (complement #'characterp) + (c::member-type-members m)))) + (c::union-type-types result)))) + + + ;; Permutation invariance — the original etypecase.15 trigger. + (assert-equal (values t t) + (c::type= (c::specifier-type '(or boolean standard-char)) + (c::specifier-type '(or standard-char boolean)))) + + (assert-equal (values t t) + (c::type= (c::specifier-type + '(not (or pathname boolean standard-char standard-object character file-error))) + (c::specifier-type + '(not (or file-error character standard-object standard-char boolean pathname))))) + + ;; Member-type with non-standard chars — kept symbolically separate. + (let ((result (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type '(member #\Tab))))) + ;; Should not collapse into a 97-element MEMBER. + (assert-false (c::member-type-p result)) + (assert-true (c::union-type-p result)))) + +(define-test standard-char.complex-intersection + (:tag :issues) + ;; Intersection with supertype is STANDARD-CHAR. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 'character)) + (c::specifier-type 'standard-char))) + + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 't)) + (c::specifier-type 'standard-char))) + + ;; Intersection with disjoint type is empty. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 'integer)) + c::*empty-type*)) + + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 'symbol)) + c::*empty-type*)) + + ;; Intersection with member-type — filtered to standard chars. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type '(member #\a #\Tab #\b))) + (c::specifier-type '(member #\a #\b)))) + + ;; All-non-standard members → empty. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type '(member #\Tab #\Rubout))) + c::*empty-type*)) + + ;; All-standard members → that member-type unchanged. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type '(member #\a))) + (c::specifier-type '(member #\a))))) + + + +(define-test standard-char.negation + (:tag :issues) + ;; NOT STANDARD-CHAR catches non-standard characters. + (assert-true (typep #\Tab '(not standard-char))) + (assert-false (typep #\a '(not standard-char))) + + ;; AND CHARACTER (NOT STANDARD-CHAR) is the non-standard chars. + (assert-true (typep #\Tab '(and character (not standard-char)))) + (assert-false (typep #\a '(and character (not standard-char)))) + (assert-false (typep 5 '(and character (not standard-char)))) + + ;; Permutation invariance with negation, multiple types. + (assert-equal (values t t) + (c::type= (c::specifier-type '(and standard-char (not (member #\a)))) + (c::specifier-type '(and (not (member #\a)) standard-char))))) + +(define-test standard-char.etypecase + (:tag :issues) + ;; Test that etypecase works using ASCII characters which will cover + ;; standard-char values and other characters. + (dotimes (k 128) + (let* ((ch (code-char k)) + (expected (if (standard-char-p ch) + :is-standard :is-other)) + (actual (handler-case + (etypecase ch + (standard-char :is-standard) + (character :is-other)) + (error () + :error)))) + (assert-eql expected actual ch)))) + +(define-test standard-char.intersection-character-both-orderings + (:tag :issues) + ;; Standard-char intersect character = standard-char, regardless of argument order. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-intersection (kernel::specifier-type 'standard-char) + (kernel::specifier-type 'character)))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-intersection (kernel::specifier-type 'character) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.intersection-disjoint-both-orderings + (:tag :issues) + (assert-equal (values t t) + (kernel::type= + kernel::*empty-type* + (kernel::type-intersection (kernel::specifier-type 'standard-char) + (kernel::specifier-type 'integer)))) + (assert-equal (values t t) + (kernel::type= + kernel::*empty-type* + (kernel::type-intersection (kernel::specifier-type 'integer) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.intersection-member-both-orderings + (:tag :issues) + ;; Filter member-type to standard chars only. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type '(member #\a #\b)) + (kernel::type-intersection (kernel::specifier-type 'standard-char) + (kernel::specifier-type '(member #\a #\Tab #\b))))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type '(member #\a #\b)) + (kernel::type-intersection (kernel::specifier-type '(member #\a #\Tab #\b)) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.union-character-both-orderings + (:tag :issues) + ;; Standard-char union character = character. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'character) + (kernel::type-union (kernel::specifier-type 'standard-char) + (kernel::specifier-type 'character)))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'character) + (kernel::type-union (kernel::specifier-type 'character) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.union-member-of-standard-both-orderings + (:tag :issues) + ;; Standard-char absorbs all-standard member-type. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-union (kernel::specifier-type 'standard-char) + (kernel::specifier-type '(member #\a #\b))))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-union (kernel::specifier-type '(member #\a #\b)) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.union-disjoint-stays-symbolic-both-orderings + (:tag :issues) + ;; (or boolean standard-char) and reverse — both should stay symbolic + ;; rather than collapsing into a giant member-type. + (let ((r1 (kernel::specifier-type '(or boolean standard-char))) + (r2 (kernel::specifier-type '(or standard-char boolean)))) + (assert-true (kernel::union-type-p r1)) + (assert-true (kernel::union-type-p r2)) + (assert-equal (values t t) + (kernel::type= r1 r2)) + ;; Neither should contain a member-type with both characters + ;; and non-characters. + (dolist (m (kernel::union-type-types r1)) + (assert-false (and (kernel::member-type-p m) + (some #'characterp (kernel::member-type-members m)) + (some (complement #'characterp) + (kernel::member-type-members m))))))) + +(defun assert-commutative-union (type-a-spec type-b-spec) + "Assert that union(A, B) and union(B, A) produce type= results." + (assert-equal (values t t) + (kernel::type= + (kernel::type-union (kernel::specifier-type type-a-spec) + (kernel::specifier-type type-b-spec)) + (kernel::type-union (kernel::specifier-type type-b-spec) + (kernel::specifier-type type-a-spec))))) + +(defun assert-commutative-intersection (type-a-spec type-b-spec) + (assert-equal (values t t) + (kernel::type= + (kernel::type-intersection (kernel::specifier-type type-a-spec) + (kernel::specifier-type type-b-spec)) + (kernel::type-intersection (kernel::specifier-type type-b-spec) + (kernel::specifier-type type-a-spec))))) + +(define-test standard-char.commutativity + (:tag :issues) + (assert-commutative-union 'standard-char 'character) + (assert-commutative-union 'standard-char 'integer) + (assert-commutative-union 'standard-char '(member #\a #\b)) + (assert-commutative-union 'standard-char '(member #\Tab)) + (assert-commutative-union 'standard-char 'boolean) + (assert-commutative-union 'standard-char '(not character)) + (assert-commutative-union 'standard-char 't) + (assert-commutative-intersection 'standard-char 'character) + (assert-commutative-intersection 'standard-char 'integer) + (assert-commutative-intersection 'standard-char '(member #\a #\b)) + (assert-commutative-intersection 'standard-char '(member #\Tab)) + (assert-commutative-intersection 'standard-char 'boolean) + (assert-commutative-intersection 'standard-char '(not character)) + (assert-commutative-intersection 'standard-char 't)) View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/bc6c560e5d4632f897854f… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/bc6c560e5d4632f897854f… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][master] 2 commits: Fix #318: Add standard-char type
by Raymond Toy (＠rtoy) 26 Jun '26

26 Jun '26

Raymond Toy pushed to branch master at cmucl / cmucl Commits: 2aa2eff9 by Raymond Toy at 2026-06-26T09:56:12-07:00 Fix #318: Add standard-char type - - - - - 06740b40 by Raymond Toy at 2026-06-26T09:56:13-07:00 Merge branch 'issue-318-add-concrete-standard-char-type' into 'master' Fix #318: Add standard-char type Closes #318 See merge request cmucl/cmucl!377 - - - - - 5 changed files: - src/code/exports.lisp - src/code/pred.lisp - src/code/type.lisp - src/i18n/locale/cmucl.pot - + tests/standard-char.lisp Changes: ===================================== src/code/exports.lisp ===================================== @@ -2190,7 +2190,11 @@ "STANDARD-PPRINT-DISPATCH-TABLE-MODIFIED-ERROR" "%IEEE754-REM-PI/2" - "%SINCOS") + "%SINCOS" + + "STANDARD-CHAR-TYPE" + "MAKE-STANDARD-CHAR-TYPE" + "STANDARD-CHAR-TYPE-P") #+heap-overflow-check (:export "DYNAMIC-SPACE-OVERFLOW-WARNING-HIT" "DYNAMIC-SPACE-OVERFLOW-ERROR-HIT" ===================================== src/code/pred.lisp ===================================== @@ -291,6 +291,9 @@ (and (consp object) (%%typep (car object) (cons-type-car-type type)) (%%typep (cdr object) (cons-type-cdr-type type)))) + (kernel::standard-char-type + (and (characterp object) + (standard-char-p object))) (unknown-type ;; Parse it again to make sure it's really undefined. (let ((reparse (specifier-type (unknown-type-specifier type)))) ===================================== src/code/type.lisp ===================================== @@ -52,6 +52,7 @@ (define-type-class intersection) (define-type-class alien) (define-type-class cons) +(define-type-class standard-char) ;;; The Args-Type structure is used both to represent Values types and ;;; and Function types. @@ -363,6 +364,16 @@ *empty-type* (%make-cons-type car-type cdr-type))) +(defstruct (standard-char-type + (:include ctype + (class-info (type-class-or-lose 'standard-char)) + (:enumerable t)) + (:constructor %make-standard-char-type ()) + (:copier nil) + (:print-function %print-type))) + +(defun make-standard-char-type () + (%make-standard-char-type)) ;;; @@ -3293,6 +3304,121 @@ (cons-type-car-type type2)) cdr-int2))))) + +;;;; Standard-char type +(def-type-translator standard-char () + (make-standard-char-type)) + +(define-type-method (standard-char :unparse) (type) + (declare (ignore type)) + 'standard-char) + +(define-type-method (standard-char :simple-=) (type1 type2) + (declare (ignore type1 type2)) + (values t t)) + +(define-type-method (standard-char :simple-subtypep) (type1 type2) + (declare (ignore type1 type2)) + (values t t)) + +(defconstant +standard-chars+ + '(#\NEWLINE #\SPACE #\! #\" #\# #\$ #\% #\& #\' #$ #$ #\* #\+ #\, + #\- #\. #\/ #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9 #\: #\; #\< #\= + #\> #\? #\@ #\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M + #\N #\O #\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z #\[ #\\ #\] + #\^ #\_ #\` #\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m + #\n #\o #\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z #\{ + #\| #\} #\~) + "The set of characters in the STANDARD-CHAR type") + +(define-type-method (standard-char :simple-union) (type1 type2) + (declare (ignore type2)) + type1) + +(define-type-method (standard-char :simple-intersection) (type1 type2) + (declare (ignore type2)) + type1) + +;; (subtype standard-char other) +(define-type-method (standard-char :complex-subtypep-arg1) (type1 type2) + (declare (ignore type1)) + (cond ((csubtypep (specifier-type 'character) type2) + ;; STANDARD-CHAR is a subtype of CHARACTER/BASE-CHAR + (values t t)) + ((member-type-p type2) + ;; If TYPE2 is a member-type, check whether it contains all standard-chars + (values (let ((members (member-type-members type2))) + (every #'(lambda (c) + (member c members)) + +standard-chars+)) + t)) + (t + (values nil t)))) + +;; (subtypep other standard-char) +(define-type-method (standard-char :complex-subtypep-arg2) (type1 type2) + (declare (ignore type2)) + (cond ((member-type-p type1) + ;; If TYPE1 is a member-type, check whether it contains all + ;; standard-chars. + (values (every #'(lambda (c) + (member c +standard-chars+)) + (member-type-members type1)) + t)) + (t + (values nil t)))) + +(define-type-method (standard-char :complex-union) (type1 type2) + ;; The standard-char type could be in type1 or type2. Figure out + ;; which one is a standard-char. + (multiple-value-bind (sc other) + (if (standard-char-type-p type1) + (values type1 type2) + (values type2 type1)) + (cond + ((csubtypep (specifier-type 'character) other) + other) + ((and (member-type-p other) + ;; Check to see every member of OTHER is a STANDARD-CHAR. + (every #'(lambda (c) + (member c +standard-chars+)) + (member-type-members other))) + sc) + (t nil)))) + +(define-type-method (standard-char :complex-intersection) (type1 type2) + ;; The standard-char type could be in type1 or type2. Figure out + ;; which one is a standard-char. + (multiple-value-bind (sc other) + (if (standard-char-type-p type1) + (values type1 type2) + (values type2 type1)) + (cond + ((csubtypep (specifier-type 'character) other) + ;; STANDARD-CHAR intersect any super-type of CHARACTER is a + ;; STANDARD-CHAR. + sc) + (t + (block punt + ;; Look through OTHER and find OTHER contains any standard + ;; character. If so, collect them all. If there are, the + ;; intersection is a member-type of the collected characters. + (collect ((members)) + (dolist (ch +standard-chars+) + (multiple-value-bind (val win) + (ctypep ch other) + (unless win + (return-from punt nil)) + (when val + (members ch)))) + (cond ((null (members)) + c::*empty-type*) + ((= (length (members)) + (length +standard-chars+)) + sc) + (t + (make-member-type :members (members)))))))))) + ;;; TYPE-DIFFERENCE -- Interface ;;; @@ -3379,7 +3505,8 @@ (declare (type ctype type)) (etypecase type ((or numeric-type named-type member-type array-type - kernel::built-in-class cons-type) + kernel::built-in-class cons-type + standard-char-type) (values (%typep obj type) t)) (class (if (if (csubtypep type (specifier-type 'funcallable-instance)) @@ -3520,16 +3647,6 @@ "Type of characters that aren't base-char's. None in CMU CL." '(and character (not base-char))) -(deftype standard-char () - "Type corresponding to the charaters required by the standard." - '(member #\NEWLINE #\SPACE #\! #\" #\# #\$ #\% #\& #\' #$ #$ #\* #\+ #\, - #\- #\. #\/ #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9 #\: #\; #\< #\= - #\> #\? #\@ #\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M - #\N #\O #\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z #\[ #\\ #\] - #\^ #\_ #\` #\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m - #\n #\o #\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z #\{ - #\| #\} #\~)) - (deftype keyword () "Type for any keyword symbol." '(and symbol (satisfies keywordp))) ===================================== src/i18n/locale/cmucl.pot ===================================== @@ -1225,11 +1225,11 @@ msgid "Array dimensions is not a list, integer or *:~% ~S" msgstr "" #: src/code/type.lisp -msgid "Type of characters that aren't base-char's. None in CMU CL." +msgid "The set of characters in the STANDARD-CHAR type" msgstr "" #: src/code/type.lisp -msgid "Type corresponding to the charaters required by the standard." +msgid "Type of characters that aren't base-char's. None in CMU CL." msgstr "" #: src/code/type.lisp ===================================== tests/standard-char.lisp ===================================== @@ -0,0 +1,367 @@ +;;; Tests for standard-char + +(defpackage :standard-char-tests + (:use :cl :lisp-unit)) + +(in-package "STANDARD-CHAR-TESTS") + +;; For the following tests, we generally want to use +;; kernel::type-intersection and kernel::type-union directly to make +;; sure we test the intersection and union methods for standard-char. + +(define-test standard-char.typep + (:tag :issues) + (assert-true (typep #\a 'standard-char)) + (assert-false (typep #\tab 'standard-char)) + (assert-true (typep #\Z 'standard-char)) + (assert-true (typep #\Space 'standard-char)) + (assert-true (typep #\Newline 'standard-char)) + (assert-false (typep #\Rubout 'standard-char)) + (assert-false (typep 5 'standard-char)) + (assert-false (typep "hello" 'standard-char)) + (assert-false (typep nil 'standard-char)) + (assert-false (typep t 'standard-char)) + + (assert-equal (values t t) + (subtypep 'standard-char 'character)) + (assert-equal (values nil t) + (subtypep 'character 'standard-char)) + (assert-equal (values t t) + (subtypep 'standard-char 'base-char)) + (assert-equal (values nil t) + (subtypep 'base-char 'standard-char))) + +(define-test standard-char.etypecase-15 + (:tag :issues) + (assert-equal (values t t) + (c::type= + (c::specifier-type + '(not (or pathname boolean standard-char standard-object character file-error))) + (c::specifier-type + '(not (or file-error character standard-object standard-char boolean pathname)))))) + +(define-test standard-char.identity + (:tag :issues) + (let ((a (c::specifier-type 'standard-char)) + (b (c::specifier-type 'standard-char))) + ;; Should be EQ due to internal caching. + (assert-eq a b))) + +(define-test standard-char.parsing + (:tag :issues) + (assert-eq 'standard-char + (c::type-specifier (c::specifier-type 'standard-char)))) + +(define-test standard-char.predicate + (:tag :issues) + (assert-true (c::standard-char-type-p (c::specifier-type 'standard-char)))) + +(define-test standard-char.simple-subtypep + (:tag :issues) + (assert-equal (values t t) + (c::type= (c::specifier-type 'standard-char) + (c::specifier-type 'standard-char))) + (assert-equal (values t t) + (subtypep 'standard-char 'standard-char))) + +(define-test standard-char.complex-subtype-arg1 + (:tag :issues) + ;; STANDARD-CHAR is a subtype of CHARACTER and T. + (assert-equal (values t t) + (subtypep 'standard-char 'character)) + (assert-equal (values t t) + (subtypep 'standard-char t)) + + ;; Not a subtype of disjoint types. + (assert-equal (values nil t) + (subtypep 'standard-char 'integer)) + (assert-equal (values nil t) + (subtypep 'standard-char 'symbol)) + (assert-equal (values nil t) + (subtypep 'standard-char 'pathname)) + + ;; Subtype of a member-type that contains all standard chars. + (assert-equal (values t t) + (subtypep 'standard-char + `(member ,@kernel::+standard-chars+))) + ;; Not a subtype of a member-type missing even one standard char. + (assert-equal (values nil t) + (subtypep 'standard-char '(member #\a)))) + +(define-test standard-char.complex-subtypep-arg + (:tag :issues) + ;; All standard chars: subtype. + (assert-equal (values t t) + (subtypep '(member #\a) 'standard-char)) + (assert-equal (values t t) + (subtypep '(member #\Space #\Newline) 'standard-char)) + + ;; Mixed — character but not standard. + (assert-equal (values nil t) + (subtypep '(member #\Tab) 'standard-char)) + (assert-equal (values nil t) + (subtypep '(member #\Rubout) 'standard-char)) + + ;; Mixed — non-character members. This was the crash case. + (assert-equal (values nil t) + (subtypep '(member t) 'standard-char)) + (assert-equal (values nil t) + (subtypep '(member t nil) 'standard-char)) + + ;; Mixed — some standard, some not. + (assert-equal (values nil t) + (subtypep '(member #\a #\Tab) 'standard-char)) + (assert-equal (values nil t) + (subtypep '(member #\a t) 'standard-char)) + + ;; CHARACTER is not a subtype of STANDARD-CHAR (non-standard chars exist). + (assert-equal (values nil t) + (subtypep 'character 'standard-char))) + +(define-test standard-char.complex-union + (:tag :issues) + ;; Absorbed by supertype. + (assert-equal (values t t) + (c::type= (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type 'character)) + (c::specifier-type 'character))) + + (assert-equal (values t t) + (c::type= (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type 't)) + (c::specifier-type 't))) + + ;; All-standard-chars member-type absorbed back into STANDARD-CHAR. + (assert-equal (values t t) + (c::type= (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type '(member #\a #\b))) + (c::specifier-type 'standard-char))) + + ;; Disjoint type stays as a union (the bug-fix case). + ;; The result should NOT be a single member-type containing + ;; T, NIL, and 96 standard chars. + (let ((result (c::specifier-type '(or boolean standard-char)))) + (assert-true (c::union-type-p result)) + (assert-equal 2 (length (c::union-type-types result))) + (assert-true (notany (lambda (m) + (and (c::member-type-p m) + (some #'characterp (c::member-type-members m)) + (some (complement #'characterp) + (c::member-type-members m)))) + (c::union-type-types result)))) + + + ;; Permutation invariance — the original etypecase.15 trigger. + (assert-equal (values t t) + (c::type= (c::specifier-type '(or boolean standard-char)) + (c::specifier-type '(or standard-char boolean)))) + + (assert-equal (values t t) + (c::type= (c::specifier-type + '(not (or pathname boolean standard-char standard-object character file-error))) + (c::specifier-type + '(not (or file-error character standard-object standard-char boolean pathname))))) + + ;; Member-type with non-standard chars — kept symbolically separate. + (let ((result (c::type-union (c::specifier-type 'standard-char) + (c::specifier-type '(member #\Tab))))) + ;; Should not collapse into a 97-element MEMBER. + (assert-false (c::member-type-p result)) + (assert-true (c::union-type-p result)))) + +(define-test standard-char.complex-intersection + (:tag :issues) + ;; Intersection with supertype is STANDARD-CHAR. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 'character)) + (c::specifier-type 'standard-char))) + + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 't)) + (c::specifier-type 'standard-char))) + + ;; Intersection with disjoint type is empty. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 'integer)) + c::*empty-type*)) + + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type 'symbol)) + c::*empty-type*)) + + ;; Intersection with member-type — filtered to standard chars. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type '(member #\a #\Tab #\b))) + (c::specifier-type '(member #\a #\b)))) + + ;; All-non-standard members → empty. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type '(member #\Tab #\Rubout))) + c::*empty-type*)) + + ;; All-standard members → that member-type unchanged. + (assert-equal (values t t) + (c::type= (c::type-intersection (c::specifier-type 'standard-char) + (c::specifier-type '(member #\a))) + (c::specifier-type '(member #\a))))) + + + +(define-test standard-char.negation + (:tag :issues) + ;; NOT STANDARD-CHAR catches non-standard characters. + (assert-true (typep #\Tab '(not standard-char))) + (assert-false (typep #\a '(not standard-char))) + + ;; AND CHARACTER (NOT STANDARD-CHAR) is the non-standard chars. + (assert-true (typep #\Tab '(and character (not standard-char)))) + (assert-false (typep #\a '(and character (not standard-char)))) + (assert-false (typep 5 '(and character (not standard-char)))) + + ;; Permutation invariance with negation, multiple types. + (assert-equal (values t t) + (c::type= (c::specifier-type '(and standard-char (not (member #\a)))) + (c::specifier-type '(and (not (member #\a)) standard-char))))) + +(define-test standard-char.etypecase + (:tag :issues) + ;; Test that etypecase works using ASCII characters which will cover + ;; standard-char values and other characters. + (dotimes (k 128) + (let* ((ch (code-char k)) + (expected (if (standard-char-p ch) + :is-standard :is-other)) + (actual (handler-case + (etypecase ch + (standard-char :is-standard) + (character :is-other)) + (error () + :error)))) + (assert-eql expected actual ch)))) + +(define-test standard-char.intersection-character-both-orderings + (:tag :issues) + ;; Standard-char intersect character = standard-char, regardless of argument order. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-intersection (kernel::specifier-type 'standard-char) + (kernel::specifier-type 'character)))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-intersection (kernel::specifier-type 'character) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.intersection-disjoint-both-orderings + (:tag :issues) + (assert-equal (values t t) + (kernel::type= + kernel::*empty-type* + (kernel::type-intersection (kernel::specifier-type 'standard-char) + (kernel::specifier-type 'integer)))) + (assert-equal (values t t) + (kernel::type= + kernel::*empty-type* + (kernel::type-intersection (kernel::specifier-type 'integer) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.intersection-member-both-orderings + (:tag :issues) + ;; Filter member-type to standard chars only. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type '(member #\a #\b)) + (kernel::type-intersection (kernel::specifier-type 'standard-char) + (kernel::specifier-type '(member #\a #\Tab #\b))))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type '(member #\a #\b)) + (kernel::type-intersection (kernel::specifier-type '(member #\a #\Tab #\b)) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.union-character-both-orderings + (:tag :issues) + ;; Standard-char union character = character. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'character) + (kernel::type-union (kernel::specifier-type 'standard-char) + (kernel::specifier-type 'character)))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'character) + (kernel::type-union (kernel::specifier-type 'character) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.union-member-of-standard-both-orderings + (:tag :issues) + ;; Standard-char absorbs all-standard member-type. + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-union (kernel::specifier-type 'standard-char) + (kernel::specifier-type '(member #\a #\b))))) + (assert-equal (values t t) + (kernel::type= + (kernel::specifier-type 'standard-char) + (kernel::type-union (kernel::specifier-type '(member #\a #\b)) + (kernel::specifier-type 'standard-char))))) + +(define-test standard-char.union-disjoint-stays-symbolic-both-orderings + (:tag :issues) + ;; (or boolean standard-char) and reverse — both should stay symbolic + ;; rather than collapsing into a giant member-type. + (let ((r1 (kernel::specifier-type '(or boolean standard-char))) + (r2 (kernel::specifier-type '(or standard-char boolean)))) + (assert-true (kernel::union-type-p r1)) + (assert-true (kernel::union-type-p r2)) + (assert-equal (values t t) + (kernel::type= r1 r2)) + ;; Neither should contain a member-type with both characters + ;; and non-characters. + (dolist (m (kernel::union-type-types r1)) + (assert-false (and (kernel::member-type-p m) + (some #'characterp (kernel::member-type-members m)) + (some (complement #'characterp) + (kernel::member-type-members m))))))) + +(defun assert-commutative-union (type-a-spec type-b-spec) + "Assert that union(A, B) and union(B, A) produce type= results." + (assert-equal (values t t) + (kernel::type= + (kernel::type-union (kernel::specifier-type type-a-spec) + (kernel::specifier-type type-b-spec)) + (kernel::type-union (kernel::specifier-type type-b-spec) + (kernel::specifier-type type-a-spec))))) + +(defun assert-commutative-intersection (type-a-spec type-b-spec) + (assert-equal (values t t) + (kernel::type= + (kernel::type-intersection (kernel::specifier-type type-a-spec) + (kernel::specifier-type type-b-spec)) + (kernel::type-intersection (kernel::specifier-type type-b-spec) + (kernel::specifier-type type-a-spec))))) + +(define-test standard-char.commutativity + (:tag :issues) + (assert-commutative-union 'standard-char 'character) + (assert-commutative-union 'standard-char 'integer) + (assert-commutative-union 'standard-char '(member #\a #\b)) + (assert-commutative-union 'standard-char '(member #\Tab)) + (assert-commutative-union 'standard-char 'boolean) + (assert-commutative-union 'standard-char '(not character)) + (assert-commutative-union 'standard-char 't) + (assert-commutative-intersection 'standard-char 'character) + (assert-commutative-intersection 'standard-char 'integer) + (assert-commutative-intersection 'standard-char '(member #\a #\b)) + (assert-commutative-intersection 'standard-char '(member #\Tab)) + (assert-commutative-intersection 'standard-char 'boolean) + (assert-commutative-intersection 'standard-char '(not character)) + (assert-commutative-intersection 'standard-char 't)) View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/47971b155ca22c2637a0d3… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/47971b155ca22c2637a0d3… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][issue-318-add-concrete-standard-char-type] Don't need bootfile anymore
by Raymond Toy (＠rtoy) 26 Jun '26

26 Jun '26

Raymond Toy pushed to branch issue-318-add-concrete-standard-char-type at cmucl / cmucl Commits: 33c5bacb by Raymond Toy at 2026-06-26T08:54:51-07:00 Don't need bootfile anymore - - - - - 1 changed file: - .gitlab-ci.yml Changes: ===================================== .gitlab-ci.yml ===================================== @@ -7,7 +7,7 @@ variables: download_url: "https://common-lisp.net/project/cmucl/downloads/release/$release" version: "$release-x86" tar_ext: "xz" - bootstrap: "-B boot-21f" + bootstrap: "" workflow: rules: View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/33c5bacb79cdc9cb65fdfa6… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/33c5bacb79cdc9cb65fdfa6… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][issue-318-add-concrete-standard-char-type] 5 commits: Remove defypte of standard-char from code/type.lisp
by Raymond Toy (＠rtoy) 26 Jun '26

26 Jun '26

Raymond Toy pushed to branch issue-318-add-concrete-standard-char-type at cmucl / cmucl Commits: 3dfca652 by Raymond Toy at 2026-06-26T08:51:40-07:00 Remove defypte of standard-char from code/type.lisp It's not needed now that we have a separate concrete standard-char type. - - - - - 7d0efaa6 by Raymond Toy at 2026-06-26T08:51:40-07:00 Bootfile is not needed to build this change Not sure why we had this originally, but I just did a normal build without specifying a -B option, and everything built fine and the standard-char test ran fine. - - - - - 2302c6bc by Raymond Toy at 2026-06-26T08:51:40-07:00 Address review comment Reverse subtype args to verify that we're computing the correct result. - - - - - fecdd641 by Raymond Toy at 2026-06-26T08:51:40-07:00 Address review comment Replace `subsetp` with `every` to determine subtype relationship between standard-char and member types. There were two places that were updated. - - - - - a4842b2f by Raymond Toy at 2026-06-26T08:51:40-07:00 Address review comment Forgot to replace one `subsetp` with `every`. - - - - - 4 changed files: - bin/build.sh - − src/bootfiles/21f/boot-21f.lisp - src/code/type.lisp - tests/standard-char.lisp Changes: ===================================== bin/build.sh ===================================== @@ -38,7 +38,7 @@ ENABLE2="yes" ENABLE3="yes" ENABLE4="yes" -version=21f +version=21e SRCDIR=src BINDIR=bin TOOLDIR=$BINDIR ===================================== src/bootfiles/21f/boot-21f.lisp deleted ===================================== @@ -1,14 +0,0 @@ -;; For #318. Define new standard-char type. -(in-package "KERNEL") -(ext:without-package-locks -(define-type-class standard-char) -(defstruct (standard-char-type - (:include ctype - (class-info (type-class-or-lose 'standard-char)) - (:enumerable t)) - (:constructor %make-standard-char-type ()) - (:copier nil))) - -(defun make-standard-char-type () - (%make-standard-char-type)) -) ===================================== src/code/type.lisp ===================================== @@ -3347,7 +3347,10 @@ (values t t)) ((member-type-p type2) ;; If TYPE2 is a member-type, check whether it contains all standard-chars - (values (subsetp +standard-chars+ (member-type-members type2)) + (values (let ((members (member-type-members type2))) + (every #'(lambda (c) + (member c members)) + +standard-chars+)) t)) (t (values nil t)))) @@ -3358,7 +3361,9 @@ (cond ((member-type-p type1) ;; If TYPE1 is a member-type, check whether it contains all ;; standard-chars. - (values (subsetp (member-type-members type1) +standard-chars+) + (values (every #'(lambda (c) + (member c +standard-chars+)) + (member-type-members type1)) t)) (t (values nil t)))) @@ -3374,8 +3379,10 @@ ((csubtypep (specifier-type 'character) other) other) ((and (member-type-p other) - (subsetp (member-type-members other) - +standard-chars+)) + ;; Check to see every member of OTHER is a STANDARD-CHAR. + (every #'(lambda (c) + (member c +standard-chars+)) + (member-type-members other))) sc) (t nil)))) @@ -3640,16 +3647,6 @@ "Type of characters that aren't base-char's. None in CMU CL." '(and character (not base-char))) -#+nil -(deftype standard-char () - "Type corresponding to the charaters required by the standard." - '(member #\NEWLINE #\SPACE #\! #\" #\# #\$ #\% #\& #\' #$ #$ #\* #\+ #\, - #\- #\. #\/ #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9 #\: #\; #\< #\= - #\> #\? #\@ #\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M - #\N #\O #\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z #\[ #\\ #\] - #\^ #\_ #\` #\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m - #\n #\o #\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z #\{ - #\| #\} #\~)) (deftype keyword () "Type for any keyword symbol." '(and symbol (satisfies keywordp))) ===================================== tests/standard-char.lisp ===================================== @@ -10,7 +10,7 @@ ;; sure we test the intersection and union methods for standard-char. (define-test standard-char.typep - (:tag :issues) + (:tag :issues) (assert-true (typep #\a 'standard-char)) (assert-false (typep #\tab 'standard-char)) (assert-true (typep #\Z 'standard-char)) @@ -24,8 +24,12 @@ (assert-equal (values t t) (subtypep 'standard-char 'character)) + (assert-equal (values nil t) + (subtypep 'character 'standard-char)) (assert-equal (values t t) - (subtypep 'standard-char 'base-char))) + (subtypep 'standard-char 'base-char)) + (assert-equal (values nil t) + (subtypep 'base-char 'standard-char))) (define-test standard-char.etypecase-15 (:tag :issues) View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/95fb934cafa3ccd4a65de6… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/95fb934cafa3ccd4a65de6… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][rtoy-unicode-collation-ducet] Add case-insensitive comparison functions
by Raymond Toy (＠rtoy) 17 Jun '26

17 Jun '26

Raymond Toy pushed to branch rtoy-unicode-collation-ducet at cmucl / cmucl Commits: d555c487 by Raymond Toy at 2026-06-17T14:03:47-07:00 Add case-insensitive comparison functions Add a STRENGTH parameter to the collation comparison path -- :PRIMARY, :SECONDARY, :TERTIARY (the default), or :QUATERNARY -- bounding the weight levels included in the sort key, and hence the distinctions the comparison makes. A lower strength makes more strings compare equal: :SECONDARY ignores case, :PRIMARY also ignores accents. COLLATION-SORT-KEY and COLLATION-COMPARE take it, and it is threaded through the UNICODE comparison functions. Add the case-insensitive comparison functions, the Unicode analogs of the COMMON-LISP -EQUAL / -LESSP family: STRING-EQUAL, STRING-NOT-EQUAL, STRING-LESSP, STRING-GREATERP, STRING-NOT-GREATERP, and STRING-NOT-LESSP. They default to :SECONDARY strength, so they ignore case (and other tertiary distinctions) while remaining sensitive to base letters and accents. Shadow and export the six in the UNICODE package. Add tests for the public comparison functions: the boolean result of each predicate, the four strength levels, the case-insensitive variants, string-designator and START/END handling, and the variable-weighting option. - - - - - 3 changed files: - src/code/exports.lisp - src/code/unicode-collation.lisp - tests/unicode-collation.lisp Changes: ===================================== src/code/exports.lisp ===================================== @@ -2313,7 +2313,13 @@ "STRING=" "STRING/=" "STRING>" - "STRING>=") + "STRING>=" + "STRING-EQUAL" + "STRING-NOT-EQUAL" + "STRING-LESSP" + "STRING-GREATERP" + "STRING-NOT-GREATERP" + "STRING-NOT-LESSP") (:import-from "LISP" "CODEPOINT" "SURROGATES" @@ -2337,6 +2343,12 @@ "STRING=" "STRING/=" "STRING>" - "STRING>=")) + "STRING>=" + "STRING-EQUAL" + "STRING-NOT-EQUAL" + "STRING-LESSP" + "STRING-GREATERP" + "STRING-NOT-GREATERP" + "STRING-NOT-LESSP")) ===================================== src/code/unicode-collation.lisp ===================================== @@ -467,18 +467,28 @@ is no fourth level. See COLLATION-WEIGHTS." (unless (zerop te) (push te l3)))) (values (nreverse l1) (nreverse l2) (nreverse l3) nil))) -(defun collation-sort-key (d string &optional (variable-weighting :shifted)) +(defun collation-sort-key (d string &optional (variable-weighting :shifted) + (strength :tertiary)) "Compute the UTS #10 sort key for STRING under DUCET D. Returns a -(simple-array (unsigned-byte 16) (*)) holding the level-1 weights, a 0000 -separator, the level-2 weights, 0000, the level-3 weights, and -- for the -:SHIFTED option -- a further 0000 separator and the level-4 weights. -Binary comparison of two such keys yields the collation order of their -strings. VARIABLE-WEIGHTING is as in COLLATION-WEIGHTS." +(simple-array (unsigned-byte 16) (*)) holding the weight levels separated +by 0000: level 1, level 2, level 3, and -- under the :SHIFTED option -- +level 4. Binary comparison of two such keys yields the collation order +of their strings. VARIABLE-WEIGHTING is as in COLLATION-WEIGHTS. + +STRENGTH bounds the levels included in the key, and hence the +distinctions the comparison makes: :PRIMARY (base letters only), +:SECONDARY (also accents), :TERTIARY (also case; the default), or +:QUATERNARY (also the level-4 weights, which exist only under the +:SHIFTED option and otherwise add nothing). A lower strength makes +more strings compare equal; for example :SECONDARY ignores case." (multiple-value-bind (l1 l2 l3 l4) (collation-weights d string variable-weighting) - (let* ((weights (if (eq variable-weighting :non-ignorable) - (append l1 (list 0) l2 (list 0) l3) - (append l1 (list 0) l2 (list 0) l3 (list 0) l4))) + (let* ((weights (ecase strength + (:primary l1) + (:secondary (append l1 (list 0) l2)) + (:tertiary (append l1 (list 0) l2 (list 0) l3)) + (:quaternary + (append l1 (list 0) l2 (list 0) l3 (list 0) l4)))) (key (make-array (length weights) :element-type '(unsigned-byte 16)))) (loop for w in weights @@ -486,12 +496,13 @@ strings. VARIABLE-WEIGHTING is as in COLLATION-WEIGHTS." do (setf (aref key k) w)) key))) -(defun collation-compare (d s1 s2 &optional (variable-weighting :shifted)) +(defun collation-compare (d s1 s2 &optional (variable-weighting :shifted) + (strength :tertiary)) "Compare strings S1 and S2 under DUCET D. Returns -1, 0, or 1 like a three-way comparison: negative if S1 sorts before S2, zero if equal, 1 -if after. VARIABLE-WEIGHTING is as in COLLATION-WEIGHTS." - (let ((k1 (collation-sort-key d s1 variable-weighting)) - (k2 (collation-sort-key d s2 variable-weighting))) +if after. VARIABLE-WEIGHTING and STRENGTH are as in COLLATION-SORT-KEY." + (let ((k1 (collation-sort-key d s1 variable-weighting strength)) + (k2 (collation-sort-key d s2 variable-weighting strength))) (let ((n (min (length k1) (length k2)))) (dotimes (i n) (let ((a (aref k1 i)) (b (aref k2 i))) @@ -615,7 +626,7 @@ collation section of unidata.bin on first use." (setf *collation-table* (lisp::unidata-ducet)))) (defun %collation-compare (string1 string2 start1 end1 start2 end2 - variable-weighting) + variable-weighting strength) "Three-way collation comparison of the designated substrings of STRING1 and STRING2: returns a negative integer, zero, or a positive integer as the first sorts before, equal to, or after the second." @@ -625,47 +636,87 @@ integer as the first sorts before, equal to, or after the second." (setf s1 (subseq s1 start1 end1))) (when (or (/= start2 0) end2) (setf s2 (subseq s2 start2 end2))) - (lisp::collation-compare (collation-table) s1 s2 variable-weighting))) + (lisp::collation-compare (collation-table) s1 s2 + variable-weighting strength))) -(defmacro %def-collation-predicate (name test docstring) +(defmacro %def-collation-predicate (name test default-strength docstring) "Define a collation comparison predicate NAME whose result is (TEST c) -where c is the three-way comparison of the two string arguments." +where c is the three-way comparison of the two string arguments. +DEFAULT-STRENGTH is the default value of the STRENGTH keyword." `(defun ,name (string1 string2 &key (start1 0) end1 (start2 0) end2 - (variable-weighting :shifted)) + (variable-weighting :shifted) + (strength ,default-strength)) ,docstring (let ((c (%collation-compare string1 string2 start1 end1 start2 end2 - variable-weighting))) + variable-weighting strength))) (,test c)))) -(%def-collation-predicate string= zerop +(%def-collation-predicate string= zerop :tertiary "Return true if STRING1 and STRING2 collate as equal under the Unicode Collation Algorithm. Note that this is collation equality, not code-point identity: canonically equivalent strings, and strings that differ only in collation-ignorable ways, compare equal. START1, END1, -START2 and END2 bound the substrings compared; VARIABLE-WEIGHTING is -:SHIFTED (the default) or :NON-IGNORABLE.") +START2 and END2 bound the substrings compared. VARIABLE-WEIGHTING is +:SHIFTED (the default) or :NON-IGNORABLE. STRENGTH is :PRIMARY, +:SECONDARY, :TERTIARY (the default), or :QUATERNARY, as in +LISP::COLLATION-SORT-KEY; a lower strength makes more strings compare +equal -- :SECONDARY, for instance, ignores case.") -(%def-collation-predicate string/= (lambda (c) (not (zerop c))) +(%def-collation-predicate string/= (lambda (c) (not (zerop c))) :tertiary "Return true if STRING1 and STRING2 do not collate as equal. See UNICODE:STRING= for the meaning of the keyword arguments.") -(%def-collation-predicate string< minusp +(%def-collation-predicate string< minusp :tertiary "Return true if STRING1 collates before STRING2 under the Unicode Collation Algorithm. See UNICODE:STRING= for the meaning of the keyword arguments.") -(%def-collation-predicate string> plusp +(%def-collation-predicate string> plusp :tertiary "Return true if STRING1 collates after STRING2 under the Unicode Collation Algorithm. See UNICODE:STRING= for the meaning of the keyword arguments.") -(%def-collation-predicate string<= (lambda (c) (not (plusp c))) +(%def-collation-predicate string<= (lambda (c) (not (plusp c))) :tertiary "Return true if STRING1 collates before or equal to STRING2 under the Unicode Collation Algorithm. See UNICODE:STRING= for the meaning of the keyword arguments.") -(%def-collation-predicate string>= (lambda (c) (not (minusp c))) +(%def-collation-predicate string>= (lambda (c) (not (minusp c))) :tertiary "Return true if STRING1 collates after or equal to STRING2 under the Unicode Collation Algorithm. See UNICODE:STRING= for the meaning of the keyword arguments.") + +;;; The case-insensitive comparison functions, the Unicode analogs of +;;; the COMMON-LISP -EQUAL/-LESSP/... family. They default to :SECONDARY +;;; strength, which drops the tertiary level where case is encoded, so +;;; they ignore case (and other tertiary distinctions, such as width) +;;; while remaining sensitive to base letters and accents. This is the +;;; closest collation analog of case-folded comparison; the Unicode +;;; Collation Algorithm has no operation that folds case alone. + +(%def-collation-predicate string-equal zerop :secondary + "Return true if STRING1 and STRING2 collate as equal ignoring case, +under the Unicode Collation Algorithm. Like UNICODE:STRING= but +defaulting to :SECONDARY strength; see it for the keyword arguments.") + +(%def-collation-predicate string-not-equal (lambda (c) (not (zerop c))) :secondary + "Return true if STRING1 and STRING2 do not collate as equal ignoring +case. See UNICODE:STRING-EQUAL for the keyword arguments.") + +(%def-collation-predicate string-lessp minusp :secondary + "Return true if STRING1 collates before STRING2 ignoring case, under +the Unicode Collation Algorithm. See UNICODE:STRING-EQUAL for the +keyword arguments.") + +(%def-collation-predicate string-greaterp plusp :secondary + "Return true if STRING1 collates after STRING2 ignoring case. See +UNICODE:STRING-EQUAL for the keyword arguments.") + +(%def-collation-predicate string-not-greaterp (lambda (c) (not (plusp c))) :secondary + "Return true if STRING1 collates before or equal to STRING2 ignoring +case. See UNICODE:STRING-EQUAL for the keyword arguments.") + +(%def-collation-predicate string-not-lessp (lambda (c) (not (minusp c))) :secondary + "Return true if STRING1 collates after or equal to STRING2 ignoring +case. See UNICODE:STRING-EQUAL for the keyword arguments.") ===================================== tests/unicode-collation.lisp ===================================== @@ -145,6 +145,83 @@ must match the expected key in the line's comment." (run-collation-conformance (ducet) *collation-non-ignorable-test* :non-ignorable)) +;;; Tests for the public UNICODE comparison functions. The conformance +;;; tests above already validate the collation weights themselves; these +;;; check the thin wrappers -- that each predicate maps the comparison to +;;; the right boolean, and that the STRENGTH and VARIABLE-WEIGHTING +;;; options and the case-insensitive variants behave as documented. + +(define-test unicode.string-predicates + "The six case-sensitive comparison predicates map the collation order +of a known pair to the correct boolean, including the reflexive case." + (:tag :unicode) + (assert-true (unicode:string< "a" "b")) + (assert-false (unicode:string< "b" "a")) + (assert-false (unicode:string< "a" "a")) + (assert-true (unicode:string> "b" "a")) + (assert-false (unicode:string> "a" "b")) + (assert-true (unicode:string<= "a" "b")) + (assert-true (unicode:string<= "a" "a")) + (assert-false (unicode:string<= "b" "a")) + (assert-true (unicode:string>= "b" "a")) + (assert-true (unicode:string>= "a" "a")) + (assert-false (unicode:string>= "a" "b")) + (assert-true (unicode:string= "a" "a")) + (assert-false (unicode:string= "a" "b")) + (assert-true (unicode:string/= "a" "b")) + (assert-false (unicode:string/= "a" "a"))) + +(define-test unicode.string-designators-and-bounds + "Comparison accepts string designators (characters, symbols) like the +COMMON-LISP functions, and honors START/END substring bounds." + (:tag :unicode) + (assert-true (unicode:string= #\a #\a)) + (assert-true (unicode:string= 'abc "ABC")) + (assert-true (unicode:string= "xab" "yab" :start1 1 :start2 1)) + (assert-false (unicode:string= "xab" "yzb" :start1 1 :start2 1))) + +(define-test unicode.string-strength + "STRENGTH bounds the distinctions made: case is a tertiary difference, +an accent a secondary one." + (:tag :unicode) + ;; Default (tertiary): case matters. Lowercase sorts before uppercase. + (assert-true (unicode:string< "abc" "ABC")) + (assert-false (unicode:string< "ABC" "abc")) + (assert-false (unicode:string= "ABC" "abc")) + ;; Secondary: case ignored, but accents still distinguish. + (assert-true (unicode:string= "ABC" "abc" :strength :secondary)) + (assert-false (unicode:string= "café" "cafe" :strength :secondary)) + ;; Primary: accents ignored too. + (assert-true (unicode:string= "café" "cafe" :strength :primary)) + ;; Quaternary: under the default :SHIFTED option the level-4 weights + ;; make variable elements (here the hyphen) break an otherwise equal + ;; comparison, where :TERTIARY ignores them. + (assert-true (unicode:string= "co-op" "coop")) + (assert-false (unicode:string= "co-op" "coop" :strength :quaternary))) + +(define-test unicode.string-case-insensitive + "The case-insensitive family ignores case (secondary strength) while +remaining sensitive to accents." + (:tag :unicode) + (assert-true (unicode:string-equal "ABC" "abc")) + (assert-false (unicode:string-not-equal "ABC" "abc")) + (assert-false (unicode:string-lessp "ABC" "abc")) + (assert-false (unicode:string-greaterp "ABC" "abc")) + (assert-true (unicode:string-not-greaterp "ABC" "abc")) + (assert-true (unicode:string-not-lessp "ABC" "abc")) + ;; An accent is a secondary difference, so it still distinguishes. + (assert-false (unicode:string-equal "café" "cafe"))) + +(define-test unicode.string-variable-weighting + "Under the default :SHIFTED option variable elements (punctuation) are +ignored; under :NON-IGNORABLE they are significant." + (:tag :unicode) + ;; Shifted (default): the hyphen is ignored, so the strings collate equal. + (assert-true (unicode:string= "co-op" "coop")) + ;; Non-ignorable: the hyphen counts, so they are not equal. + (assert-false (unicode:string= "co-op" "coop" + :variable-weighting :non-ignorable))) + ;; A DEFINE-TEST body is stored as source and run interpreted, and the ;; test runner (tests/run-tests.lisp) loads this file as source, so its ;; functions would otherwise run interpreted. The per-line parsing and View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/d555c487cbc28742ca2967d… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/d555c487cbc28742ca2967d… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][rtoy-unicode-collation-ducet] Serialize the DUCET into unidata.bin
by Raymond Toy (＠rtoy) 17 Jun '26

17 Jun '26

Raymond Toy pushed to branch rtoy-unicode-collation-ducet at cmucl / cmucl Commits: 6a425efa by Raymond Toy at 2026-06-17T13:30:01-07:00 Serialize the DUCET into unidata.bin The collation table was loaded at runtime by parsing allkeys.txt. Store it in unidata.bin instead, as a new section (index 19), and bump the file format version to 2. The section holds the collation elements in three parallel arrays (primary u16, secondary u16, tertiary u8 with the variable flag in bit 7), a single-codepoint index as an ntrie32 mapping a codepoint to a packed (offset << 6) | count into those arrays, a contraction table (four 32-bit words per entry), and the @implicitweights ranges. build-unidata.lisp reads allkeys.txt and builds the section; unidata.lisp reads it back (loader for section 19, plus the COLLATION struct and slot). UNIDATA-DUCET in unicode-collation.lisp builds the runtime DUCET from the loaded section -- structurally identical to one from LOAD-DUCET, so the sort-key construction code is unchanged -- and COLLATION-TABLE now uses it. LOAD-DUCET is kept for regenerating data and cross-checking. The collation conformance tests build the DUCET from unidata.bin and still pass all 437930 assertions. unidata.bin updated with the new collation data. - - - - - 4 changed files: - src/code/unicode-collation.lisp - src/code/unidata.lisp - src/i18n/unidata.bin - tests/unicode-collation.lisp Changes: ===================================== src/code/unicode-collation.lisp ===================================== @@ -518,6 +518,85 @@ if after. VARIABLE-WEIGHTING is as in COLLATION-WEIGHTS." ;;; no meaningful character index of the first difference to return. ;;; ------------------------------------------------------------------- + +;;; ------------------------------------------------------------------- +;;; Building the runtime DUCET from the collation section of +;;; unidata.bin. The resulting table is structurally identical to one +;;; built by LOAD-DUCET from allkeys.txt -- the same MAP / SINGLE / +;;; STARTERS hashes and implicit ranges -- so the sort-key construction +;;; code uses it unchanged. This replaces the runtime use of LOAD-DUCET +;;; (which is kept for regenerating data and for cross-checking). +;;; ------------------------------------------------------------------- + +(defun unidata-ducet () + "Build a DUCET from the collation section of unidata.bin, loading the +section first if necessary." + (unless (unidata-collation *unicode-data*) + (load-collation)) + (let* ((c (unidata-collation *unicode-data*)) + (primv (collation-primv c)) + (secv (collation-secv c)) + (terv (collation-terv c)) + (contractions (collation-contractions c)) + (ranges (collation-ranges c)) + (d (make-ducet :version (format nil "~D.~D.~D" + +unicode-major-version+ + +unicode-minor-version+ + +unicode-update-version+))) + (maxvar 0) + (maxkey 1)) + (flet ((ces-at (packed) + ;; Slice the parallel arrays into a simple-vector of + ;; collation-elements for the packed (offset << 6) | count. + (let* ((off (ash packed -6)) + (n (logand packed #x3f)) + (v (make-array n))) + (dotimes (i n) + (let* ((j (+ off i)) + (te (aref terv j)) + (var (logbitp 7 te)) + (p (aref primv j))) + (when (and var (> p maxvar)) + (setf maxvar p)) + (setf (aref v i) + (make-ce p (aref secv j) (logand te #x7f) var)))) + v))) + ;; Single-codepoint entries: walk the codepoint space and pull the + ;; non-zero values out of the index trie. (Many keys are astral, + ;; so the walk must cover the full range, not just the BMP.) + (dotimes (cp #x110000) + (let ((packed (qref32 c cp))) + (unless (zerop packed) + (let ((ces (ces-at packed))) + (setf (gethash cp (ducet-single d)) ces) + (setf (gethash (make-array 1 :initial-element cp) (ducet-map d)) + ces))))) + ;; Contractions: four 32-bit words each. + (loop for i from 0 below (length contractions) by 4 do + (let* ((cp1 (aref contractions i)) + (cp2 (aref contractions (+ i 1))) + (cp3 (aref contractions (+ i 2))) + (packed (aref contractions (+ i 3))) + (key (if (= cp3 #xFFFFFFFF) + (make-array 2 :initial-contents (list cp1 cp2)) + (make-array 3 :initial-contents (list cp1 cp2 cp3))))) + (setf (gethash key (ducet-map d)) (ces-at packed)) + (setf (gethash cp1 (ducet-starters d)) t) + (setf maxkey (max maxkey (length key))))) + ;; Implicit-weight ranges: four 32-bit words each (start, end, + ;; base, base-origin). + (let ((rl nil)) + (loop for i from 0 below (length ranges) by 4 do + (let ((r (make-implicit-range (aref ranges i) + (aref ranges (+ i 1)) + (aref ranges (+ i 2))))) + (setf (implicit-range-base-origin r) (aref ranges (+ i 3))) + (push r rl))) + (setf (ducet-implicit-ranges d) (nreverse rl))) + (setf (ducet-max-key-length d) maxkey + (ducet-max-variable-primary d) maxvar) + d))) + (in-package "UNICODE") (defvar *collation-table-path* "ext-formats:allkeys.txt" @@ -530,10 +609,10 @@ loaded. Loaded lazily from *COLLATION-TABLE-PATH* the first time a collation function needs it. Set to NIL to force a reload.") (defun collation-table () - "Return the default Unicode collation table, loading it from -*COLLATION-TABLE-PATH* on first use." + "Return the default Unicode collation table, building it from the +collation section of unidata.bin on first use." (or *collation-table* - (setf *collation-table* (lisp::load-ducet *collation-table-path*)))) + (setf *collation-table* (lisp::unidata-ducet)))) (defun %collation-compare (string1 string2 start1 end1 start2 end2 variable-weighting) ===================================== src/code/unidata.lisp ===================================== @@ -56,6 +56,7 @@ case-fold-simple case-fold-full word-break + collation ) (defvar *unicode-data* (make-unidata)) @@ -65,7 +66,7 @@ (defconstant +unicode-magic-number+ #x2A554344) ;; The format version for the unidata.bin file. -(defconstant +unicode-format-version+ 1) +(defconstant +unicode-format-version+ 2) ;; The expected Unicode version. This needs to be synced with ;; build-unidata.lisp. @@ -292,6 +293,27 @@ (defstruct (case-fold-full (:include decomp))) +(defstruct (collation (:include ntrie32)) + ;; Parallel collation-element arrays shared by the single-codepoint + ;; index (whose LVEC packs (offset << 6) | count into these) and the + ;; contraction table. TERV holds the tertiary weight in its low 7 + ;; bits and the variable flag in bit 7. + (primv (ext:required-argument) :read-only t + :type (simple-array (unsigned-byte 16) (*))) + (secv (ext:required-argument) :read-only t + :type (simple-array (unsigned-byte 16) (*))) + (terv (ext:required-argument) :read-only t + :type (simple-array (unsigned-byte 8) (*))) + ;; Contraction table: four 32-bit words per entry -- cp1, cp2, cp3 + ;; (or #xFFFFFFFF when the key has only two codepoints), and the + ;; packed (offset << 6) | count into the collation-element arrays. + (contractions (ext:required-argument) :read-only t + :type (simple-array (unsigned-byte 32) (*))) + ;; @implicitweights ranges: four 32-bit words per entry -- start, + ;; end, base, and base-origin (smallest start sharing the base). + (ranges (ext:required-argument) :read-only t + :type (simple-array (unsigned-byte 32) (*)))) + (defstruct (bidi (:include ntrie16)) (tabl (ext:required-argument) :read-only t :type (simple-array (unsigned-byte 16) (*)))) @@ -718,6 +740,29 @@ (read-ntrie 4 stm) (setf (unidata-word-break *unicode-data*) (make-ntrie4 :split split :hvec hvec :mvec mvec :lvec lvec)))) +(defloader load-collation (stm 19) + (multiple-value-bind (split hvec mvec lvec) + (read-ntrie 32 stm) + (let* ((nce (read32 stm)) + (primv (make-array nce :element-type '(unsigned-byte 16))) + (secv (make-array nce :element-type '(unsigned-byte 16))) + (terv (make-array nce :element-type '(unsigned-byte 8)))) + (read-vector primv stm :endian-swap :network-order) + (read-vector secv stm :endian-swap :network-order) + (read-vector terv stm :endian-swap :network-order) + (let* ((ncontr (read32 stm)) + (contractions (make-array (* 4 ncontr) + :element-type '(unsigned-byte 32)))) + (read-vector contractions stm :endian-swap :network-order) + (let* ((nrange (read-byte stm)) + (ranges (make-array (* 4 nrange) + :element-type '(unsigned-byte 32)))) + (read-vector ranges stm :endian-swap :network-order) + (setf (unidata-collation *unicode-data*) + (make-collation :split split :hvec hvec :mvec mvec :lvec lvec + :primv primv :secv secv :terv terv + :contractions contractions + :ranges ranges))))))) ;;; Accessor functions. @@ -1657,4 +1702,5 @@ unidata.bin." (unidata-case-fold-simple *unicode-data*) (unidata-case-fold-full *unicode-data*) (unidata-word-break *unicode-data*) + (unidata-collation *unicode-data*) t)) ===================================== src/i18n/unidata.bin ===================================== Binary files a/src/i18n/unidata.bin and b/src/i18n/unidata.bin differ ===================================== tests/unicode-collation.lisp ===================================== @@ -19,9 +19,10 @@ "The Default Unicode Collation Element Table, loaded on first use.") (defun ducet () - "Return the DUCET, loading it from *COLLATION-ALLKEYS* the first time." + "Return the DUCET, built from the collation section of unidata.bin on +first use." (or *ducet* - (setf *ducet* (lisp::load-ducet *collation-allkeys*)))) + (setf *ducet* (lisp::unidata-ducet)))) (defun collation-hex-list (string) "Parse all space-separated hexadecimal numbers in STRING into a list of View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/6a425efa239f6911e1b1fc0… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/6a425efa239f6911e1b1fc0… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][rtoy-unicode-collation-ducet] Update build-unidata to build the collation table
by Raymond Toy (＠rtoy) 17 Jun '26

17 Jun '26

Raymond Toy pushed to branch rtoy-unicode-collation-ducet at cmucl / cmucl Commits: b021b2a8 by Raymond Toy at 2026-06-17T13:04:12-07:00 Update build-unidata to build the collation table This reads allkeys.txt and builds the collation table as part of unidata.bin. - - - - - 1 changed file: - src/tools/build-unidata.lisp Changes: ===================================== src/tools/build-unidata.lisp ===================================== @@ -44,6 +44,7 @@ case-fold-full case-fold-simple word-break + collation ) (defvar *unicode-data* (make-unidata)) @@ -146,6 +147,27 @@ (tabl (ext:required-argument) :read-only t :type (simple-array (unsigned-byte 16) (*)))) +(defstruct (collation (:include ntrie32)) + ;; Parallel collation-element arrays shared by the single-codepoint + ;; index (whose LVEC packs (offset << 6) | count into these) and the + ;; contraction table. TERV holds the tertiary weight in its low 7 + ;; bits and the variable flag in bit 7. + (primv (ext:required-argument) :read-only t + :type (simple-array (unsigned-byte 16) (*))) + (secv (ext:required-argument) :read-only t + :type (simple-array (unsigned-byte 16) (*))) + (terv (ext:required-argument) :read-only t + :type (simple-array (unsigned-byte 8) (*))) + ;; Contraction table: four 32-bit words per entry -- cp1, cp2, cp3 + ;; (or #xFFFFFFFF when the key has only two codepoints), and the + ;; packed (offset << 6) | count into the collation-element arrays. + (contractions (ext:required-argument) :read-only t + :type (simple-array (unsigned-byte 32) (*))) + ;; @implicitweights ranges: four 32-bit words per entry -- start, + ;; end, base, and base-origin (smallest start sharing the base). + (ranges (ext:required-argument) :read-only t + :type (simple-array (unsigned-byte 32) (*)))) + (defstruct (bidi (:include ntrie16)) (tabl (ext:required-argument) :read-only t :type (simple-array (unsigned-byte 16) (*)))) @@ -535,11 +557,11 @@ :element-type '(unsigned-byte 8)) ;; The length of the index array is the number of sections to be ;; written. See below for each section. - (let ((index (make-array 19 :fill-pointer 0))) + (let ((index (make-array 20 :fill-pointer 0))) ;; File header (write32 +unicode-magic-number+ stm) ; identification "magic" - ;; File format version (1: dictionary nextv de-packed, keypv added) - (write-byte 1 stm) + ;; File format version (2: collation/DUCET section added) + (write-byte 2 stm) ;; Unicode version (write-byte +unicode-major-version+ stm) (write-byte +unicode-minor-version+ stm) @@ -638,6 +660,18 @@ (let ((data (unidata-word-break *unicode-data*))) (update-index (file-position stm) index) (write-ntrie4 data stm)) + ;; 19. Collation (DUCET) + (let ((data (unidata-collation *unicode-data*))) + (update-index (file-position stm) index) + (write-ntrie32 data stm) + (write32 (length (collation-primv data)) stm) + (write-vector (collation-primv data) stm :endian-swap :network-order) + (write-vector (collation-secv data) stm :endian-swap :network-order) + (write-vector (collation-terv data) stm :endian-swap :network-order) + (write32 (truncate (length (collation-contractions data)) 4) stm) + (write-vector (collation-contractions data) stm :endian-swap :network-order) + (write-byte (truncate (length (collation-ranges data)) 4) stm) + (write-vector (collation-ranges data) stm :endian-swap :network-order)) ;; All components saved. Patch up index table now. (file-position stm 8) (dotimes (i (length index)) @@ -1016,6 +1050,140 @@ ;; ucd-directory should be the directory where UnicodeData.txt is ;; located. +(defun parse-collation-key (string) + "Parse the space-separated hexadecimal codepoints in STRING (the part +of an allkeys.txt line before the semicolon) into a list of integers." + (let ((result nil) (i 0) (n (length string))) + (loop + (loop while (and (< i n) (not (digit-char-p (char string i) 16))) + do (incf i)) + (when (>= i n) (return)) + (let ((j i)) + (loop while (and (< j n) (digit-char-p (char string j) 16)) + do (incf j)) + (push (parse-integer string :start i :end j :radix 16) result) + (setf i j))) + (nreverse result))) + +(defun parse-collation-elements (string) + "Parse the collation elements [.pppp.ssss.tttt] (or [*pppp...] for a +variable element) from STRING into a list of (primary secondary tertiary +variablep) lists." + (let ((result nil) (i 0)) + (loop + (let ((open (position #\[ string :start i))) + (unless open (return)) + (let* ((var (char= (char string (1+ open)) #\*)) + (close (position #\] string :start open)) + (body (subseq string (+ open 2) close)) + (d1 (position #\. body)) + (d2 (position #\. body :start (1+ d1)))) + (push (list (parse-integer body :end d1 :radix 16) + (parse-integer body :start (1+ d1) :end d2 :radix 16) + (parse-integer body :start (1+ d2) :radix 16) + var) + result) + (setf i (1+ close))))) + (nreverse result))) + +(defun build-collation (ucd range ucd-directory) + "Read allkeys.txt (the DUCET) from UCD-DIRECTORY and build the +collation section: the parallel collation-element arrays, the +single-codepoint index (an ntrie32 mapping a codepoint to a packed +(offset << 6) | count into those arrays), the contraction table, and the +@implicitweights ranges." + (let ((path (make-pathname :name "allkeys" :type "txt" :defaults ucd-directory)) + (primv (make-array 65536 :element-type '(unsigned-byte 16) + :adjustable t :fill-pointer 0)) + (secv (make-array 65536 :element-type '(unsigned-byte 16) + :adjustable t :fill-pointer 0)) + (terv (make-array 65536 :element-type '(unsigned-byte 8) + :adjustable t :fill-pointer 0)) + (single (make-hash-table)) + (contractions nil) + (raw-ranges nil)) + (flet ((emit (ces) + ;; Append CES to the parallel arrays; return the packed + ;; (offset << 6) | count referring to them. + (let ((offset (fill-pointer primv)) + (count (length ces))) + (dolist (ce ces) + (destructuring-bind (p s te var) ce + (vector-push-extend p primv) + (vector-push-extend s secv) + (vector-push-extend (logior te (if var #x80 0)) terv))) + (logior (ash offset 6) count)))) + (with-open-file (s path :direction :input :external-format :utf-8) + (loop for line = (read-line s nil) while line do + (cond + ((zerop (length line))) + ((char= (char line 0) #\#)) + ((eql 0 (search "@implicitweights" line)) + (let* ((semi (position #\; line)) + (dd (search ".." line)) + (start (parse-integer line :start (length "@implicitweights") + :end dd :radix 16 :junk-allowed t)) + (end (parse-integer line :start (+ dd 2) :end semi + :radix 16 :junk-allowed t)) + (base (parse-integer line :start (1+ semi) + :radix 16 :junk-allowed t))) + (push (list start end base) raw-ranges))) + ((char= (char line 0) #\@)) + (t + (let ((semi (position #\; line))) + (when semi + (let* ((hash (position #\# line)) + (key (parse-collation-key (subseq line 0 semi))) + (ces (parse-collation-elements + (subseq line (1+ semi) hash))) + (packed (emit ces))) + (if (= (length key) 1) + (setf (gethash (first key) single) packed) + (push (list (first key) (second key) (third key) packed) + contractions)))))))))) + ;; base-origin: smallest start among ranges sharing a base. + (let ((origin (make-hash-table))) + (dolist (r raw-ranges) + (destructuring-bind (start end base) r + (declare (ignore end)) + (when (or (null (gethash base origin)) + (< start (gethash base origin))) + (setf (gethash base origin) start)))) + (let* ((rl (nreverse raw-ranges)) + (rvec (make-array (* 4 (length rl)) :element-type '(unsigned-byte 32))) + (cl (nreverse contractions)) + (cvec (make-array (* 4 (length cl)) :element-type '(unsigned-byte 32))) + (i 0)) + (dolist (r rl) + (destructuring-bind (start end base) r + (setf (aref rvec i) start + (aref rvec (+ i 1)) end + (aref rvec (+ i 2)) base + (aref rvec (+ i 3)) (gethash base origin)) + (incf i 4))) + (setf i 0) + (dolist (c cl) + (destructuring-bind (cp1 cp2 cp3 packed) c + (setf (aref cvec i) cp1 + (aref cvec (+ i 1)) cp2 + (aref cvec (+ i 2)) (or cp3 #xFFFFFFFF) + (aref cvec (+ i 3)) packed) + (incf i 4))) + (multiple-value-bind (hvec mvec lvec) + (pack ucd range + (lambda (ent) (gethash (ucdent-code ent) single 0)) + 0 32 #x54) + (make-collation + :split #x54 :hvec hvec :mvec mvec :lvec lvec + :primv (make-array (length primv) :element-type '(unsigned-byte 16) + :initial-contents primv) + :secv (make-array (length secv) :element-type '(unsigned-byte 16) + :initial-contents secv) + :terv (make-array (length terv) :element-type '(unsigned-byte 8) + :initial-contents terv) + :contractions cvec + :ranges rvec)))))) + (defun build-unidata (&optional (ucd-directory "target:i18n/")) (format t "~&Reading data from ~S~%" (probe-file ucd-directory)) (force-output) @@ -1216,4 +1384,9 @@ 0 4 split) (setf (unidata-word-break *unicode-data*) (make-ntrie4 :split split :hvec hvec :mvec mvec :lvec lvec)))) + + (format t "~&Building collation table~%") + (force-output) + (setf (unidata-collation *unicode-data*) + (build-collation ucd range ucd-directory)) nil)) View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/b021b2a8b48c87ee5203741… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/b021b2a8b48c87ee5203741… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][rtoy-unicode-collation-ducet] Add UNICODE string comparison functions (case-sensitive)
by Raymond Toy (＠rtoy) 17 Jun '26

17 Jun '26

Raymond Toy pushed to branch rtoy-unicode-collation-ducet at cmucl / cmucl Commits: f05b7f0e by Raymond Toy at 2026-06-17T07:11:52-07:00 Add UNICODE string comparison functions (case-sensitive) Add Unicode-aware equivalents of the COMMON-LISP string comparison functions in the UNICODE package: STRING=, STRING/=, STRING<, STRING>, STRING<=, and STRING>=. These compare by the Unicode Collation Algorithm rather than code-unit order, using the collation code in unicode-collation.lisp. They take the usual START1/END1/START2/END2 keywords plus a VARIABLE-WEIGHTING keyword (:SHIFTED by default), and return a generalized boolean rather than a mismatch index, since the comparison is on sort keys derived from the whole normalized string. The Default Unicode Collation Element Table is loaded lazily from allkeys.txt (ext-formats: search list) on first use, pending its serialization into unidata.bin. Shadow and export the six comparison symbols in the UNICODE package. - - - - - 2 changed files: - src/code/exports.lisp - src/code/unicode-collation.lisp Changes: ===================================== src/code/exports.lisp ===================================== @@ -2307,7 +2307,13 @@ (:use "COMMON-LISP") (:shadow "STRING-CAPITALIZE" "STRING-DOWNCASE" - "STRING-UPCASE") + "STRING-UPCASE" + "STRING<" + "STRING<=" + "STRING=" + "STRING/=" + "STRING>" + "STRING>=") (:import-from "LISP" "CODEPOINT" "SURROGATES" @@ -2325,6 +2331,12 @@ (:export "STRING-CAPITALIZE" "STRING-DOWNCASE" "STRING-UPCASE" - "STRING-NEXT-WORD-BREAK")) + "STRING-NEXT-WORD-BREAK" + "STRING<" + "STRING<=" + "STRING=" + "STRING/=" + "STRING>" + "STRING>=")) ===================================== src/code/unicode-collation.lisp ===================================== @@ -500,3 +500,93 @@ if after. VARIABLE-WEIGHTING is as in COLLATION-WEIGHTS." (cond ((< (length k1) (length k2)) -1) ((> (length k1) (length k2)) 1) (t 0))))) + + +;;; ------------------------------------------------------------------- +;;; Public collation API (UNICODE package): the Unicode-aware +;;; equivalents of the COMMON-LISP string comparison functions. +;;; +;;; These compare strings by the Unicode Collation Algorithm rather than +;;; by code-point order, so the result reflects linguistic sort order +;;; (after NFD normalization, with contractions, expansions and the +;;; chosen variable-weighting option). The Default Unicode Collation +;;; Element Table is loaded lazily on first use. +;;; +;;; Unlike the COMMON-LISP functions, these return a generalized boolean +;;; (T or NIL) rather than a mismatch index: the comparison is performed +;;; on sort keys derived from the whole normalized string, so there is +;;; no meaningful character index of the first difference to return. +;;; ------------------------------------------------------------------- + +(in-package "UNICODE") + +(defvar *collation-table-path* "ext-formats:allkeys.txt" + "Pathname of the DUCET data file (allkeys.txt) from which the default +collation table is loaded on first use.") + +(defvar *collation-table* nil + "The default Unicode collation table, or NIL if it has not yet been +loaded. Loaded lazily from *COLLATION-TABLE-PATH* the first time a +collation function needs it. Set to NIL to force a reload.") + +(defun collation-table () + "Return the default Unicode collation table, loading it from +*COLLATION-TABLE-PATH* on first use." + (or *collation-table* + (setf *collation-table* (lisp::load-ducet *collation-table-path*)))) + +(defun %collation-compare (string1 string2 start1 end1 start2 end2 + variable-weighting) + "Three-way collation comparison of the designated substrings of +STRING1 and STRING2: returns a negative integer, zero, or a positive +integer as the first sorts before, equal to, or after the second." + (let ((s1 (string string1)) + (s2 (string string2))) + (when (or (/= start1 0) end1) + (setf s1 (subseq s1 start1 end1))) + (when (or (/= start2 0) end2) + (setf s2 (subseq s2 start2 end2))) + (lisp::collation-compare (collation-table) s1 s2 variable-weighting))) + +(defmacro %def-collation-predicate (name test docstring) + "Define a collation comparison predicate NAME whose result is (TEST c) +where c is the three-way comparison of the two string arguments." + `(defun ,name (string1 string2 &key (start1 0) end1 (start2 0) end2 + (variable-weighting :shifted)) + ,docstring + (let ((c (%collation-compare string1 string2 + start1 end1 start2 end2 + variable-weighting))) + (,test c)))) + +(%def-collation-predicate string= zerop + "Return true if STRING1 and STRING2 collate as equal under the Unicode +Collation Algorithm. Note that this is collation equality, not +code-point identity: canonically equivalent strings, and strings that +differ only in collation-ignorable ways, compare equal. START1, END1, +START2 and END2 bound the substrings compared; VARIABLE-WEIGHTING is +:SHIFTED (the default) or :NON-IGNORABLE.") + +(%def-collation-predicate string/= (lambda (c) (not (zerop c))) + "Return true if STRING1 and STRING2 do not collate as equal. See +UNICODE:STRING= for the meaning of the keyword arguments.") + +(%def-collation-predicate string< minusp + "Return true if STRING1 collates before STRING2 under the Unicode +Collation Algorithm. See UNICODE:STRING= for the meaning of the keyword +arguments.") + +(%def-collation-predicate string> plusp + "Return true if STRING1 collates after STRING2 under the Unicode +Collation Algorithm. See UNICODE:STRING= for the meaning of the keyword +arguments.") + +(%def-collation-predicate string<= (lambda (c) (not (plusp c))) + "Return true if STRING1 collates before or equal to STRING2 under the +Unicode Collation Algorithm. See UNICODE:STRING= for the meaning of the +keyword arguments.") + +(%def-collation-predicate string>= (lambda (c) (not (minusp c))) + "Return true if STRING1 collates after or equal to STRING2 under the +Unicode Collation Algorithm. See UNICODE:STRING= for the meaning of the +keyword arguments.") View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/f05b7f0e709186eefd49b5d… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/f05b7f0e709186eefd49b5d… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][rtoy-unicode-collation-ducet] Speed up collation conformance tests
by Raymond Toy (＠rtoy) 17 Jun '26

17 Jun '26

Raymond Toy pushed to branch rtoy-unicode-collation-ducet at cmucl / cmucl Commits: ca615266 by Raymond Toy at 2026-06-16T19:34:27-07:00 Speed up collation conformance tests The tests ran in ~110s, dominated by COLLATION-HEX-LIST: profiling showed it consing 3.5 GB across 2.2M calls, almost all of it inside PARSE-INTEGER. Accumulate the (16-bit, fixnum) hex values directly instead. The suite now runs in ~14s with identical results. Also compile all the helper functions because they do a lot of processing of the test file and the each test file has over 200K tests. - - - - - 1 changed file: - tests/unicode-collation.lisp Changes: ===================================== tests/unicode-collation.lisp ===================================== @@ -26,16 +26,26 @@ (defun collation-hex-list (string) "Parse all space-separated hexadecimal numbers in STRING into a list of integers, in order. Non-hex runs are skipped." - (let ((result nil) (i 0) (n (length string))) + (let ((result nil) + (i 0) + (n (length string))) (loop - (loop while (and (< i n) (not (digit-char-p (char string i) 16))) + ;; Skip any non-hexadecimal characters. + (loop while (and (< i n) + (null (digit-char-p (char string i) 16))) do (incf i)) (when (>= i n) (return)) - (let ((j i)) - (loop while (and (< j n) (digit-char-p (char string j) 16)) - do (incf j)) - (push (parse-integer string :start i :end j :radix 16) result) - (setf i j))) + ;; Accumulate one hexadecimal number. PARSE-INTEGER is avoided + ;; here because it conses, and this runs several times per line + ;; over hundreds of thousands of conformance lines; the values are + ;; 16-bit and fit in a fixnum. + (let ((val 0) + (d nil)) + (loop while (and (< i n) + (setf d (digit-char-p (char string i) 16))) + do (setf val (+ (* val 16) d)) + (incf i)) + (push val result))) (nreverse result))) (defun collation-split-on-bar (string) @@ -133,3 +143,18 @@ must match the expected key in the line's comment." (:tag :unicode) (run-collation-conformance (ducet) *collation-non-ignorable-test* :non-ignorable)) + +;; A DEFINE-TEST body is stored as source and run interpreted, and the +;; test runner (tests/run-tests.lisp) loads this file as source, so its +;; functions would otherwise run interpreted. The per-line parsing and +;; string building run on every one of several hundred thousand +;; conformance lines, so interpreted they make the suite about ten times +;; slower. Compile the hot functions on load. +(eval-when (:load-toplevel :execute) + (dolist (name '(collation-hex-list + collation-split-on-bar + collation-parse-expected-key + collation-parse-test-line + collation-test-string + run-collation-conformance)) + (compile name))) View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/ca615266d431bc6f92022a1… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/ca615266d431bc6f92022a1… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][rtoy-unicode-collation-ducet] Run collation conformance tests compiled, not interpreted
by Raymond Toy (＠rtoy) 17 Jun '26

17 Jun '26

Raymond Toy pushed to branch rtoy-unicode-collation-ducet at cmucl / cmucl Commits: 49a6cb37 by Raymond Toy at 2026-06-16T18:04:38-07:00 Run collation conformance tests compiled, not interpreted A DEFINE-TEST body is stored as source and run through the interpreter, so the per-line loop over a quarter-million conformance lines ran interpreted -- ~115s for the suite. Move the loop into a compiled helper, RUN-COLLATION-CONFORMANCE, called from the test bodies; the suite now runs in ~10s with identical coverage (437930 assertions). Previously it took 115s. - - - - - 1 changed file: - tests/unicode-collation.lisp Changes: ===================================== tests/unicode-collation.lisp ===================================== @@ -89,26 +89,40 @@ the BMP as UTF-16 surrogate pairs." (vector-push-extend (code-char cp) out))) (coerce out 'simple-string))) +(defun run-collation-conformance (ducet file weighting) + "Check every line of the UCA conformance FILE: the four sort-key levels +produced by LISP::COLLATION-WEIGHTS under WEIGHTING must match the +expected key parsed from the line's comment. Each line is a separate +LISP-UNIT assertion. + +This is a plain function rather than inline in the DEFINE-TESTs below +because a DEFINE-TEST body is stored as source and run interpreted; the +per-line work over a quarter-million lines must run compiled, so it +lives here and the tests just call it." + (with-open-file (s file :direction :input :external-format :utf-8) + (loop for line = (read-line s nil nil) + while line + do + (multiple-value-bind (cps e1 e2 e3 e4) + (collation-parse-test-line line) + (when cps + (multiple-value-bind (g1 g2 g3 g4) + (lisp::collation-weights ducet (collation-test-string cps) + weighting) + ;; For :NON-IGNORABLE the comment has no fourth level + ;; and COLLATION-WEIGHTS returns NIL for L4, so the + ;; same four-level comparison serves both options. + (assert-equalp (list e1 e2 e3 e4) + (list g1 g2 g3 g4) + cps))))))) + (define-test unicode.collation-shifted "Test UTS #10 collation sort keys against the UCA SHIFTED conformance data. For each line, the four sort-key levels produced by LISP::COLLATION-WEIGHTS must match the expected key in the line's comment." (:tag :unicode) - (let ((ducet (ducet))) - (with-open-file (s *collation-shifted-test* :direction :input - :external-format :utf-8) - (loop for line = (read-line s nil nil) - while line - do - (multiple-value-bind (cps e1 e2 e3 e4) - (collation-parse-test-line line) - (when cps - (multiple-value-bind (g1 g2 g3 g4) - (lisp::collation-weights ducet (collation-test-string cps)) - (assert-equalp (list e1 e2 e3 e4) - (list g1 g2 g3 g4) - cps)))))))) + (run-collation-conformance (ducet) *collation-shifted-test* :shifted)) (define-test unicode.collation-non-ignorable "Test UTS #10 collation sort keys against the UCA NON_IGNORABLE @@ -117,18 +131,5 @@ their weights and there is no fourth level, so for each line the three weight levels produced by LISP::COLLATION-WEIGHTS with :NON-IGNORABLE must match the expected key in the line's comment." (:tag :unicode) - (let ((ducet (ducet))) - (with-open-file (s *collation-non-ignorable-test* :direction :input - :external-format :utf-8) - (loop for line = (read-line s nil nil) - while line - do - (multiple-value-bind (cps e1 e2 e3) - (collation-parse-test-line line) - (when cps - (multiple-value-bind (g1 g2 g3) - (lisp::collation-weights ducet (collation-test-string cps) - :non-ignorable) - (assert-equalp (list e1 e2 e3) - (list g1 g2 g3) - cps)))))))) + (run-collation-conformance (ducet) *collation-non-ignorable-test* + :non-ignorable)) View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/49a6cb375555be2f71453f6… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/49a6cb375555be2f71453f6… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][rtoy-unicode-collation-ducet] Add Non-ignorable variable weighting and speed up element lookup
by Raymond Toy (＠rtoy) 16 Jun '26

16 Jun '26

Raymond Toy pushed to branch rtoy-unicode-collation-ducet at cmucl / cmucl Commits: 2bfd281e by Raymond Toy at 2026-06-16T13:59:12-07:00 Add Non-ignorable variable weighting and speed up element lookup Add the Non-ignorable variable-weighting option to the collation code alongside the existing Shifted option. COLLATION-WEIGHTS, COLLATION-SORT-KEY, and COLLATION-COMPARE take an optional VARIABLE-WEIGHTING argument (default :SHIFTED); :NON-IGNORABLE treats variable elements like ordinary elements and produces no fourth level. Both options pass the full UCA conformance suite (CollationTest_SHIFTED and CollationTest_NON_IGNORABLE). In LOAD-DUCET, derive two fast-path tables from the main map: SINGLE, mapping a single codepoint directly to its collation elements, and STARTERS, the set of codepoints that begin a multi-codepoint key. DUCET-ELEMENT-ARRAY uses these so that a codepoint which is not a contraction starter skips the longest-match scan and the discontiguous look-ahead, taking one fixnum-keyed lookup instead. Add tests/unicode-collation.lisp tests for both weighting options. - - - - - 2 changed files: - src/code/unicode-collation.lisp - tests/unicode-collation.lisp Changes: ===================================== src/code/unicode-collation.lisp ===================================== @@ -55,6 +55,15 @@ ;; of collation-elements. Single characters and contractions share ;; this table; the key length distinguishes them. (map (make-hash-table :test 'equalp) :type hash-table) + ;; Fast paths derived from MAP at load time. SINGLE maps a single + ;; codepoint (a fixnum) directly to its collation elements, avoiding an + ;; EQUALP probe with a one-element key for the common case. STARTERS + ;; holds every codepoint that begins some multi-codepoint key; a + ;; codepoint not in STARTERS can neither begin a contraction nor be the + ;; base of a discontiguous match, so the contraction machinery is + ;; skipped for it entirely. + (single (make-hash-table :test 'eql) :type hash-table) + (starters (make-hash-table :test 'eql) :type hash-table) ;; Implicit-weight ranges, in file order. (implicit-ranges nil :type list) ;; Longest key (in codepoints) present in MAP; bounds the contraction @@ -180,6 +189,13 @@ returning an IMPLICIT-RANGE, or NIL if it does not parse." do (setf (ducet-max-variable-primary d) (ce-primary ce)))))))))))) (setf (ducet-implicit-ranges d) (nreverse (ducet-implicit-ranges d))) + ;; Build the fast-path tables: SINGLE for one-codepoint keys, and + ;; STARTERS for the first codepoint of every multi-codepoint key. + (maphash (lambda (key ces) + (if (= (length key) 1) + (setf (gethash (aref key 0) (ducet-single d)) ces) + (setf (gethash (aref key 0) (ducet-starters d)) t))) + (ducet-map d)) ;; Compute, for each base, the smallest range start sharing it, and ;; record it as the BBBB offset origin on every range with that base. (let ((origin (make-hash-table))) @@ -298,74 +314,106 @@ the blocking threshold for later non-starters." (consumed (make-array (max 1 (length cps)) :initial-element nil)) (i 0) (n (length cps)) - (maxlen (ducet-max-key-length d))) + (maxlen (ducet-max-key-length d)) + (map (ducet-map d)) + (single (ducet-single d)) + (starters (ducet-starters d))) (loop while (< i n) do (cond ((aref consumed i) (incf i)) ; folded in by a discontiguous match (t - (let ((best-key nil) (best-len 0) (best-ces nil)) - ;; S2.1: longest contiguous match starting at I, over positions - ;; not already consumed by an earlier discontiguous fold. A - ;; multi-codepoint key may only match an unbroken run of - ;; still-present codepoints, so a span containing a consumed - ;; position (a "hole" left by a fold) is rejected. - (loop for len from (min maxlen (- n i)) downto 1 do - (when (loop for j from i below (+ i len) - never (aref consumed j)) - (let* ((key (subseq cps i (+ i len))) - (ces (gethash key (ducet-map d)))) - (when ces - (setf best-key key best-len len best-ces ces) - (return))))) + (let* ((cp (aref cps i)) + (starter (gethash cp starters)) + (best-key nil) (best-len 0) (best-ces nil)) + ;; S2.1: longest contiguous match starting at I. Only a + ;; codepoint that begins some multi-codepoint key (a STARTER) + ;; can match a contraction, so the multi-length scan is done + ;; only for those; every other codepoint takes the SINGLE + ;; fast path. The scan rejects any span containing a position + ;; already consumed by an earlier discontiguous fold. + (when starter + (loop for len from (min maxlen (- n i)) downto 2 do + (when (loop for j from i below (+ i len) + never (aref consumed j)) + (let* ((key (subseq cps i (+ i len))) + (ces (gethash key map))) + (when ces + (setf best-key key best-len len best-ces ces) + (return)))))) + (unless best-ces + (let ((ces (gethash cp single))) + (when ces + (setf best-ces ces best-len 1) + ;; BEST-KEY is needed only to extend by discontiguous + ;; folds, which apply only to starters. + (when starter + (setf best-key (make-array 1 :initial-element cp)))))) (cond - (best-key + (best-ces ;; S2.1.1-S2.1.3: extend by unblocked following non-starters. + ;; A codepoint that is not a STARTER can begin no multi-key, + ;; so no fold is possible and the scan is skipped entirely. ;; MAXCCC is the highest combining class among still-present ;; codepoints passed since the end of the contiguous match. - (let ((maxccc 0)) - (loop for k from (+ i best-len) below n do - (unless (aref consumed k) - (let ((ccc (lisp::unicode-combining-class (aref cps k)))) - (cond - ((zerop ccc) - (return)) ; starter blocks all further - ((> ccc maxccc) ; unblocked candidate - (let* ((cand (collation-extend-key - best-key (aref cps k))) - (ces (gethash cand (ducet-map d)))) - (if ces - ;; Fold C in and remove it; MAXCCC unchanged - ;; since C is no longer present. - (setf best-key cand - best-ces ces - (aref consumed k) t) - ;; No match: C stays and becomes a blocker. - (setf maxccc ccc)))) - (t - ;; Blocked (ccc <= maxccc): C stays; MAXCCC already - ;; covers it. - nil)))))) + (when (and starter best-key) + (let ((maxccc 0)) + (loop for k from (+ i best-len) below n do + (unless (aref consumed k) + (let ((ccc (lisp::unicode-combining-class (aref cps k)))) + (cond + ((zerop ccc) + (return)) ; starter blocks all further + ((> ccc maxccc) ; unblocked candidate + (let* ((cand (collation-extend-key + best-key (aref cps k))) + (ces (gethash cand map))) + (if ces + ;; Fold C in and remove it; MAXCCC + ;; unchanged since C is no longer present. + (setf best-key cand + best-ces ces + (aref consumed k) t) + ;; No match: C stays and becomes a blocker. + (setf maxccc ccc)))) + (t + ;; Blocked (ccc <= maxccc): C stays; MAXCCC + ;; already covers it. + nil))))))) (loop for ce across best-ces do (push ce result)) (incf i best-len)) (t ;; No entry: derive implicit weights (Section 10.1). - (dolist (ce (derive-implicit-elements d (aref cps i))) + (dolist (ce (derive-implicit-elements d cp)) (push ce result)) (incf i))))))) (nreverse result))) -(defun collation-weights (d string) +(defun collation-weights (d string &optional (variable-weighting :shifted)) "Return four values -- the level-1, level-2, level-3 and level-4 weight -lists for STRING under DUCET D, using the Shifted variable-weighting -option (UTS #10 Section 4). Under Shifted: a variable element con- -tributes nothing at levels 1-3 and its primary at level 4; a non- -variable element takes a level-4 weight of FFFF when it carries level-2 -or level-3 content, but none when it is a primary-only continuation -element ([.XXXX.0000.0000], e.g. the second half of an implicit weight -pair or an expansion tail); a completely ignorable element -([.0000.0000.0000]) contributes nothing anywhere; and a primary- -ignorable element that follows a variable element is shifted away -entirely." +lists for STRING under DUCET D. VARIABLE-WEIGHTING selects the UTS #10 +Section 4 option: + + :SHIFTED (the default) -- a variable element contributes nothing at + levels 1-3 and its primary at level 4; a non-variable element takes + a level-4 weight of FFFF when it carries level-2 or level-3 content, + but none when it is a primary-only continuation element + ([.XXXX.0000.0000], e.g. the second half of an implicit weight pair + or an expansion tail); a completely ignorable element + ([.0000.0000.0000]) contributes nothing anywhere; and a primary- + ignorable element that follows a variable element is shifted away + entirely. + + :NON-IGNORABLE -- variable elements are not treated specially: every + element contributes its non-zero weights at levels 1-3 just like any + other element, and there is no fourth level (the returned L4 is + always NIL)." + (ecase variable-weighting + (:shifted (collation-weights-shifted d string)) + (:non-ignorable (collation-weights-non-ignorable d string)))) + +(defun collation-weights-shifted (d string) + "Compute the four weight levels for STRING under the Shifted option. +See COLLATION-WEIGHTS." (let ((ces (ducet-element-array d (string-to-collation-codepoints string))) (l1 nil) (l2 nil) (l3 nil) (l4 nil) (after-variable nil)) @@ -402,14 +450,35 @@ entirely." (setf after-variable nil))))) (values (nreverse l1) (nreverse l2) (nreverse l3) (nreverse l4)))) -(defun collation-sort-key (d string) - "Compute the UTS #10 sort key for STRING under DUCET D (Shifted -option). Returns a (simple-array (unsigned-byte 16) (*)) holding the -level-1 weights, a 0000 separator, the level-2 weights, 0000, the -level-3 weights, 0000, and the level-4 weights. Binary comparison of -two such keys yields the collation order of their strings." - (multiple-value-bind (l1 l2 l3 l4) (collation-weights d string) - (let* ((weights (append l1 (list 0) l2 (list 0) l3 (list 0) l4)) +(defun collation-weights-non-ignorable (d string) + "Compute the weight levels for STRING under the Non-ignorable option. +Variable elements are treated exactly like ordinary elements, and there +is no fourth level. See COLLATION-WEIGHTS." + (let ((ces (ducet-element-array d (string-to-collation-codepoints string))) + (l1 nil) (l2 nil) (l3 nil)) + (dolist (ce ces) + (let ((p (ce-primary ce)) + (s (ce-secondary ce)) + (te (ce-tertiary ce))) + ;; Every element -- variable or not -- contributes its non-zero + ;; weights at each level; zero weights are passed over. + (unless (zerop p) (push p l1)) + (unless (zerop s) (push s l2)) + (unless (zerop te) (push te l3)))) + (values (nreverse l1) (nreverse l2) (nreverse l3) nil))) + +(defun collation-sort-key (d string &optional (variable-weighting :shifted)) + "Compute the UTS #10 sort key for STRING under DUCET D. Returns a +(simple-array (unsigned-byte 16) (*)) holding the level-1 weights, a 0000 +separator, the level-2 weights, 0000, the level-3 weights, and -- for the +:SHIFTED option -- a further 0000 separator and the level-4 weights. +Binary comparison of two such keys yields the collation order of their +strings. VARIABLE-WEIGHTING is as in COLLATION-WEIGHTS." + (multiple-value-bind (l1 l2 l3 l4) + (collation-weights d string variable-weighting) + (let* ((weights (if (eq variable-weighting :non-ignorable) + (append l1 (list 0) l2 (list 0) l3) + (append l1 (list 0) l2 (list 0) l3 (list 0) l4))) (key (make-array (length weights) :element-type '(unsigned-byte 16)))) (loop for w in weights @@ -417,12 +486,12 @@ two such keys yields the collation order of their strings." do (setf (aref key k) w)) key))) -(defun collation-compare (d s1 s2) +(defun collation-compare (d s1 s2 &optional (variable-weighting :shifted)) "Compare strings S1 and S2 under DUCET D. Returns -1, 0, or 1 like a three-way comparison: negative if S1 sorts before S2, zero if equal, 1 -if after." - (let ((k1 (collation-sort-key d s1)) - (k2 (collation-sort-key d s2))) +if after. VARIABLE-WEIGHTING is as in COLLATION-WEIGHTS." + (let ((k1 (collation-sort-key d s1 variable-weighting)) + (k2 (collation-sort-key d s2 variable-weighting))) (let ((n (min (length k1) (length k2)))) (dotimes (i n) (let ((a (aref k1 i)) (b (aref k2 i))) ===================================== tests/unicode-collation.lisp ===================================== @@ -12,6 +12,8 @@ (defvar *collation-allkeys* "target:i18n/allkeys.txt") (defvar *collation-shifted-test* "target:i18n/CollationTest/CollationTest_SHIFTED.txt") +(defvar *collation-non-ignorable-test* + "target:i18n/CollationTest/CollationTest_NON_IGNORABLE.txt") (defvar *ducet* nil "The Default Unicode Collation Element Table, loaded on first use.") @@ -107,3 +109,26 @@ comment." (assert-equalp (list e1 e2 e3 e4) (list g1 g2 g3 g4) cps)))))))) + +(define-test unicode.collation-non-ignorable + "Test UTS #10 collation sort keys against the UCA NON_IGNORABLE +conformance data. Under the Non-ignorable option variable elements keep +their weights and there is no fourth level, so for each line the three +weight levels produced by LISP::COLLATION-WEIGHTS with :NON-IGNORABLE +must match the expected key in the line's comment." + (:tag :unicode) + (let ((ducet (ducet))) + (with-open-file (s *collation-non-ignorable-test* :direction :input + :external-format :utf-8) + (loop for line = (read-line s nil nil) + while line + do + (multiple-value-bind (cps e1 e2 e3) + (collation-parse-test-line line) + (when cps + (multiple-value-bind (g1 g2 g3) + (lisp::collation-weights ducet (collation-test-string cps) + :non-ignorable) + (assert-equalp (list e1 e2 e3) + (list g1 g2 g3) + cps)))))))) View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/2bfd281ed61a2ab06cb89a0… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/2bfd281ed61a2ab06cb89a0… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl] Pushed new branch rtoy-unicode-collation-ducet
by Raymond Toy (＠rtoy) 16 Jun '26

16 Jun '26

Raymond Toy pushed new branch rtoy-unicode-collation-ducet at cmucl / cmucl -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/tree/rtoy-unicode-collation-du… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][issue-317-string-compare-by-code-unit] Use issue-317 branch for the new expected failures
by Raymond Toy (＠rtoy) 12 Jun '26

12 Jun '26

Raymond Toy pushed to branch issue-317-string-compare-by-code-unit at cmucl / cmucl Commits: d051b51a by Raymond Toy at 2026-06-12T07:33:08-07:00 Use issue-317 branch for the new expected failures Issue #317 fixes RANDOM-STRING-COMPARISON-TESTS, so use the correct ansi-test branch that removes this test from the expected failures list. - - - - - 1 changed file: - bin/run-ansi-tests.sh Changes: ===================================== bin/run-ansi-tests.sh ===================================== @@ -36,7 +36,7 @@ shift $((OPTIND - 1)) # Use branch cmucl-expected-failures in general since this branch # generally has the list of expected failures. This is the branch to # use on cmucl master in general. -BRANCH=cmucl-expected-failures +BRANCH=cmucl-expected-failures-issue-317 set -x if [ -d ../ansi-test ]; then View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/d051b51a054ffd8eb58db94… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/d051b51a054ffd8eb58db94… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][master] 2 commits: Fix #512: CI retries select stages
by Raymond Toy (＠rtoy) 03 Jun '26

03 Jun '26

Raymond Toy pushed to branch master at cmucl / cmucl Commits: fad6dd9e by Raymond Toy at 2026-06-02T17:03:21-07:00 Fix #512: CI retries select stages - - - - - 47971b15 by Raymond Toy at 2026-06-02T17:03:21-07:00 Merge branch 'issue-512-ci-retries-select-stages' into 'master' Fix #512: CI retries select stages Closes #512 See merge request cmucl/cmucl!386 - - - - - 1 changed file: - .gitlab-ci.yml Changes: ===================================== .gitlab-ci.yml ===================================== @@ -29,6 +29,7 @@ workflow: # for building. .install: stage: install + retry: 1 artifacts: paths: - snapshot/ @@ -89,6 +90,7 @@ workflow: when: always paths: - ansi-test.out + retry: 1 script: - bin/run-ansi-tests.sh -l dist/bin/lisp after_script: @@ -99,6 +101,7 @@ workflow: # Default configuration for running unit tests. .unit-test: stage: test + retry: 1 artifacts: paths: - test.log View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/1b0a461cae6966caacc626… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/compare/1b0a461cae6966caacc626… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl] Pushed new branch issue-512-ci-retries-select-stages
by Raymond Toy (＠rtoy) 02 Jun '26

02 Jun '26

Raymond Toy pushed new branch issue-512-ci-retries-select-stages at cmucl / cmucl -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/tree/issue-512-ci-retries-sele… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][issue-511-update-unicode-tests] Cross build needs bootstrap file too
by Raymond Toy (＠rtoy) 02 Jun '26

02 Jun '26

Raymond Toy pushed to branch issue-511-update-unicode-tests at cmucl / cmucl Commits: 4ddcc0c7 by Raymond Toy at 2026-06-02T11:26:06-07:00 Cross build needs bootstrap file too - - - - - 1 changed file: - .gitlab-ci.yml Changes: ===================================== .gitlab-ci.yml ===================================== @@ -7,7 +7,8 @@ variables: download_url: "https://common-lisp.net/project/cmucl/downloads/release/$release" version: "$release-x86" tar_ext: "xz" - bootstrap: "-B boot-2026-06-1" + bootstrap_name: "boot-2026-06-1" + bootstrap: "-B $bootstrap_name" workflow: rules: @@ -229,7 +230,7 @@ linux:cross-build: script: - bin/create-target.sh xtarget $CONFIG - bin/create-target.sh xcross $CONFIG - - bin/cross-build-world.sh -crl xtarget xcross src/tools/cross-scripts/cross-x86-x86.lisp snapshot/bin/lisp + - bin/cross-build-world.sh -B src/bootfiles/21f/$bootstrap_name -crl xtarget xcross src/tools/cross-scripts/cross-x86-x86.lisp snapshot/bin/lisp - bin/build.sh -b xlinux $bootstrap -R -C $CONFIG -o "xtarget/lisp/lisp -lib xtarget/lisp" - bin/make-dist.sh -I xdist xlinux-4 View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/4ddcc0c731480b1ba01cfd2… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/4ddcc0c731480b1ba01cfd2… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][issue-511-update-unicode-tests] Fix typo in bootfile name
by Raymond Toy (＠rtoy) 02 Jun '26

02 Jun '26

Raymond Toy pushed to branch issue-511-update-unicode-tests at cmucl / cmucl Commits: 418f8372 by Raymond Toy at 2026-06-02T09:49:49-07:00 Fix typo in bootfile name - - - - - 1 changed file: - .gitlab-ci.yml Changes: ===================================== .gitlab-ci.yml ===================================== @@ -7,7 +7,7 @@ variables: download_url: "https://common-lisp.net/project/cmucl/downloads/release/$release" version: "$release-x86" tar_ext: "xz" - bootstrap: "-B boot-2026-06-01" + bootstrap: "-B boot-2026-06-1" workflow: rules: View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/418f83725d9eb0798a99fdc… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/418f83725d9eb0798a99fdc… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][issue-511-update-unicode-tests] Fix up comments preceding string-next-word-break
by Raymond Toy (＠rtoy) 02 Jun '26

02 Jun '26

Raymond Toy pushed to branch issue-511-update-unicode-tests at cmucl / cmucl Commits: 7fcf5249 by Raymond Toy at 2026-06-02T09:46:57-07:00 Fix up comments preceding string-next-word-break These no longer apply since the algorithm is completely different now. - - - - - 1 changed file: - src/code/unicode.lisp Changes: ===================================== src/code/unicode.lisp ===================================== @@ -200,39 +200,12 @@ (string-downcase-full string :start start :end end)))) -;;; -;;; This is a Lisp translation of the Scheme code from William -;;; D. Clinger that implements the word-breaking algorithm. This is -;;; used with permission. -;;; -;;; This version is modified from the original at -;;; http://www.ccs.neu.edu/home/will/R6RS/ to conform to CMUCL's -;;; implementation of the word break properties. -;;; -;;; -;;; Copyright statement and original comments: -;;; -;;;-------------------------------------------------------------------------------- - -;; Copyright 2006 William D Clinger. -;; -;; Permission to copy this software, in whole or in part, to use this -;; software for any lawful purpose, and to redistribute this software -;; is granted subject to the restriction that all copies made of this -;; software must include this copyright and permission notice in full. -;; -;; I also request that you send me a copy of any improvements that you -;; make to this software so that they may be incorporated within it to -;; the benefit of the Scheme community. - ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;; ;; Word-breaking as defined by Unicode Standard Annex #29. ;; ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; -;; Implementation notes. -;; ;; The string-foldcase, string-downcase, and string-titlecase ;; procedures rely on the notion of a word, which is defined ;; by Unicode Standard Annex 29. @@ -247,64 +220,11 @@ ;; ;; Hence the performance of the word-breaking algorithm should ;; not matter too much for this reference implementation. -;; Word-breaking is more generally useful, however, so I tried -;; to make this implementation reasonably efficient. -;; -;; Word boundaries are defined by 14 different rules in -;; Unicode Standard Annex #29, and by GraphemeBreakProperty.txt -;; and WordBreakProperty.txt. See also WordBreakTest.html. -;; -;; My original implementation of those specifications failed -;; 6 of the 494 tests in auxiliary/WordBreakTest.txt, but it -;; appeared to me that those tests were inconsistent with the -;; word-breaking rules in UAX #29. John Cowan forwarded my -;; bug report to the Unicode experts, and Mark Davis responded -;; on 29 May 2007: -;; -;; Thanks for following up on this. I think you have found a problem in the -;; formulation of word break, not the test. The intention was to break after a -;; Sep character, as is done in Sentence break. So my previous suggestion was -;; incorrect; instead, what we need is a new rule: -;; -;; *Break after paragraph separators.* -;; WB3a. Sep -;; I'll make a propose to the UTC for this. -;; -;; Here is Will's translation of those rules (including WB3a) -;; into a finite state machine that searches forward within a -;; string, looking for the next position at which a word break -;; is allowed. The current state consists of an index i into -;; the string and a summary of the left context whose rightmost -;; character is at index i. The left context is usually -;; determined by the character at index i, but there are three -;; complications: -;; -;; Extend and Format characters are ignored unless they -;; follow a separator or the beginning of the text. -;; ALetter followed by MidLetter is treated specially. -;; Numeric followed by MidNum is treated specially. -;; -;; In the implementation below, the left context ending at i -;; is encoded by the following symbols: -;; -;; CR -;; Sep (excluding CR) -;; ALetter -;; MidLetter -;; ALetterMidLetter (ALetter followed by MidLetter) -;; Numeric -;; MidNum -;; NumericMidNum (Numeric followed by MidNum) -;; Katakana -;; ExtendNumLet -;; other (none of the above) +;; Word-breaking is more generally useful, however. ;; -;; Given a string s and an exact integer i (which need not be -;; a valid index into s), returns the index of the next character -;; that is not part of the word containing the character at i, -;; or the length of s if the word containing the character at i -;; extends through the end of s. If i is negative or a valid -;; index into s, then the returned value will be greater than i. +;; Word boundaries are defined by different rules in Unicode Standard +;; Annex #29, and by GraphemeBreakProperty.txt and +;; WordBreakProperty.txt. See also WordBreakTest.html. ;; ;;;-------------------------------------------------------------------------------- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/7fcf5249d079811b967b959… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/7fcf5249d079811b967b959… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][issue-511-update-unicode-tests] Implement Unicode 17.0 word break algorithm (UAX #29)
by Raymond Toy (＠rtoy) 02 Jun '26

02 Jun '26

Raymond Toy pushed to branch issue-511-update-unicode-tests at cmucl / cmucl Commits: 1dec5699 by Raymond Toy at 2026-06-02T09:04:22-07:00 Implement Unicode 17.0 word break algorithm (UAX #29) Rewrite string-next-word-break to follow the Unicode 17.0 word boundary rules. The previous implementation predated several rules and had a dead branch (:extendorformat vs :extend-or-format). Decode the string into codepoints and word-break classes, then scan boundaries applying the WB rules in order. Extend/Format/ZWJ are treated as ignorable (WB4) by comparing against the nearest significant element on each side. New rule support: WB3c (ZWJ x Extended_Pictographic, via unicode-extended-pictographic-p), WB3d (WSegSpace), WB7a/b/c (Hebrew_Letter with quotes), and WB15/16 (Regional_Indicator pair counting). Import unicode-extended-pictographic-p into the UNICODE package, with a bootfile to intern the symbol so the building lisp can read the new package definition. Need a new, very simple, bootstrap file (bootstrap-2026-06-1) to build this. Update .gitlab-ci to use the bootstrap file. All 1944 WordBreakTest.txt cases pass. - - - - - 6 changed files: - .gitlab-ci.yml - bin/build.sh - + src/bootfiles/21f/boot-2026-06-1.lisp - src/code/exports.lisp - src/code/unicode.lisp - src/i18n/locale/cmucl.pot Changes: ===================================== .gitlab-ci.yml ===================================== @@ -7,7 +7,7 @@ variables: download_url: "https://common-lisp.net/project/cmucl/downloads/release/$release" version: "$release-x86" tar_ext: "xz" - bootstrap: "" + bootstrap: "-B boot-2026-06-01" workflow: rules: ===================================== bin/build.sh ===================================== @@ -38,7 +38,7 @@ ENABLE2="yes" ENABLE3="yes" ENABLE4="yes" -version=21e +version=21f SRCDIR=src BINDIR=bin TOOLDIR=$BINDIR ===================================== src/bootfiles/21f/boot-2026-06-1.lisp ===================================== @@ -0,0 +1,13 @@ +;; Bootstrap file for adding the Extended_Pictographic word-break +;; support (Unicode 17.0 word-break rule WB3c). +;; +;; The new function LISP::UNICODE-EXTENDED-PICTOGRAPHIC-P is imported +;; into the UNICODE package by code/exports.lisp. When the new +;; exports.lisp is compiled by the bootstrapping lisp, that symbol does +;; not yet exist in the LISP package, so the (:import-from "LISP" ...) +;; clause would fail. Intern it here first so the package definition +;; can be read. + +(in-package :lisp) + +(intern "UNICODE-EXTENDED-PICTOGRAPHIC-P" "LISP") ===================================== src/code/exports.lisp ===================================== @@ -2321,7 +2321,8 @@ "+UNICODE-CATEGORY-UPPER+" "+UNICODE-CATEGORY-TITLE+" "UNICODE-UPPER" - "UNICODE-WORD-BREAK") + "UNICODE-WORD-BREAK" + "UNICODE-EXTENDED-PICTOGRAPHIC-P") (:export "STRING-CAPITALIZE" "STRING-DOWNCASE" "STRING-UPCASE" ===================================== src/code/unicode.lisp ===================================== @@ -315,178 +315,155 @@ character extends to the end of S. If the index is negative or valid index into S, the returned value will be strictly greater than the index." + ;; Implements the word-boundary rules of UAX #29 (Unicode 17.0). + ;; + ;; Decode S into codepoints (collapsing UTF-16 surrogate pairs), + ;; classify each by its word-break property, then scan boundaries + ;; left to right applying the WB rules, returning the first boundary + ;; whose string index is strictly greater than I. + ;; + ;; WB4 makes Extend, Format and ZWJ "ignorable": they attach to the + ;; preceding context, so most rules compare a candidate against the + ;; nearest non-ignorable element on each side rather than the literal + ;; neighbour. ZWJ is also significant for WB3c, and Extend can sit + ;; between paired Regional_Indicators, so those rules look past + ;; ignorables explicitly. + (declare (type simple-string s)) (let ((n (length s))) - (labels - ((char-word-break-category (c) - ;; Map our unicode word break property into what this - ;; algorithm wants. - (let ((cat (unicode-word-break c))) - (case cat - ((:lf :cr :newline) - :sep) - ((:extend :format) - :extend-or-format) - (otherwise cat)))) - (left-context (i) - ;; Given a valid index i into s, returns the left context - ;; at i. - (multiple-value-bind (c widep) - (codepoint s i n) - (let* ((back - ;; If we're at a regular character or a leading - ;; surrogate, decrementing by 1 gets us the to - ;; previous character. But for a trailing - ;; surrogate, we need to decrement by 2! - (if (eql widep -1) - 2 - 1)) - (cat (char-word-break-category c))) - (case cat - ((:sep) - (if (= c (char-code #\return)) :cr cat)) - ((:midletter :midnumlet) - (let ((i-1 (- i back))) - (if (and (<= 0 i-1) - (eq (left-context i-1) :aletter)) - :aletter-midletter - cat))) - ((:midnum :midnumlet) - (let ((i-1 (- i back))) - (if (and (<= 0 i-1) - (eq (left-context i-1) :numeric)) - :numeric-midnum - cat))) - ((:extendorformat) - (if (< 0 i) - (left-context (- i back)) - :other)) - (otherwise cat))))) - - (index-of-previous-non-ignored (j) - ;; Returns the index of the last non-Extend, non-Format - ;; character within (substring s 0 j). Should not be - ;; called unless such a character exists. - - (let* ((j1 (- j 1))) - (multiple-value-bind (c widep) - (codepoint s j1) - (when (eql widep -1) - ;; Back up one more if we're at the trailing - ;; surrogate. - (decf j1)) - (let ((cat (char-word-break-category c))) - (case cat - ((:extend-or-format) - (index-of-previous-non-ignored j1)) - (otherwise j1)))))) - - (lookup (j context) - ;; Given j and the context to the left of (not including) j, - ;; returns the index at the start of the next word - ;; (or before which a word break is permitted). - - (if (>= j n) - (case context - ((:aletter-midletter :numeric-midnum) - (let ((j (index-of-previous-non-ignored n))) - (if (< i j) j n))) - (otherwise n)) - (multiple-value-bind (c widep) - (codepoint s j) - (let* ((next-j - ;; The next character is either 1 or 2 code - ;; units away. For a leading surrogate, it's - ;; 2; Otherwise just 1. - (if (eql widep 1) - 2 - 1)) - (cat (char-word-break-category c))) - (case cat - ((:extend-or-format) - (case context - ((:cr :sep) j) - (otherwise (lookup (+ j next-j) context)))) - (otherwise - (case context - ((:cr) - (if (= c (char-code #\linefeed)) - ;; Rule WB3: Don't break CRLF, continue looking - (lookup (+ j next-j) cat) - j)) - ((:aletter) - (case cat - ((:aletter :numeric :extendnumlet) - ;; Rules WB5, WB9, ? - (lookup (+ j next-j) cat)) - ((:midletter :midnumlet) - ;; Rule WB6, need to keep looking - (lookup (+ j next-j) :aletter-midletter)) - (otherwise j))) - ((:aletter-midletter) - (case cat - ((:aletter) - ;; Rule WB7 - (lookup (+ j next-j) cat)) - (otherwise - ;; Rule WB6 and WB7 were extended, but the - ;; region didn't end with :aletter. So - ;; backup and break at that point. - (let ((j2 (index-of-previous-non-ignored j))) - (if (< i j2) j2 j))))) - ((:numeric) - (case cat - ((:numeric :aletter :extendnumlet) - ;; Rules WB8, WB10, ? - (lookup (+ j next-j) cat)) - ((:midnum :midnumlet) - ;; Rules WB11, need to keep looking - (lookup (+ j next-j) :numeric-midnum)) - (otherwise j))) - ((:numeric-midnum) - (case cat - ((:numeric) - ;; Rule WB11, keep looking - (lookup (+ j next-j) cat)) - (otherwise - ;; Rule WB11, WB12 were extended, but the - ;; region didn't end with :numeric, so - ;; backup and break at that point. - (let ((j2 (index-of-previous-non-ignored j))) - (if (< i j2) j2 j))))) - ((:midletter :midnum :midnumlet) - ;; Rule WB14 - j) - ((:katakana) - (case cat - ((:katakana :extendnumlet) - ;; Rule WB13, WB13a - (lookup (+ j next-j) cat)) - (otherwise j))) - ((:extendnumlet) - (case cat - ((:extendnumlet :aletter :numeric :katakana) - ;; Rule WB13a, WB13b - (lookup (+ j next-j) cat)) - (otherwise j))) - ((:regional_indicator) - (case cat - ((:regional_indicator) - ;; Rule WB13c - (lookup (+ j next-j) cat)) - (otherwise j))) - (otherwise j))))))))) - (declare (notinline lookup left-context)) - (cond ((< i 0) - ;; Rule WB1 - 0) - ((<= n i) - ;; Rule WB2 - n) - (t - (multiple-value-bind (c widep) - (codepoint s i) - (declare (ignore c)) - (lookup (+ i (if (eql widep 1) 2 1)) (left-context i)))))))) + (cond + ((< i 0) 0) ; Rule WB1 (start of text) + ((>= i n) n) ; Rule WB2 (end of text) + (t + (let ((cls (make-array 16 :fill-pointer 0 :adjustable t)) + (idx (make-array 16 :fill-pointer 0 :adjustable t))) + ;; Decode codepoints and record the string index of each. + (let ((k 0)) + (loop while (< k n) do + (multiple-value-bind (cp widep) (codepoint s k) + (vector-push-extend (unicode-word-break cp) cls) + (vector-push-extend k idx) + (incf k (if (eql widep 1) 2 1))))) + (let ((m (fill-pointer cls))) + (labels + ((class (j) (aref cls j)) + (ah-letter-p (c) (or (eq c :aletter) (eq c :hebrew_letter))) + (mid-letter-q-p (c) ; (MidLetter | MidNumLetQ) + (or (eq c :midletter) (eq c :midnumlet) (eq c :single_quote))) + (mid-num-q-p (c) ; (MidNum | MidNumLetQ) + (or (eq c :midnum) (eq c :midnumlet) (eq c :single_quote))) + (ignorable (c) ; WB4: Extend | Format | ZWJ + (or (eq c :extend) (eq c :format) (eq c :zwj))) + (ext-pict-at (j) + (unicode-extended-pictographic-p (codepoint s (aref idx j)))) + (prev-significant (j) ; last non-ignorable element < J, or -1 + (loop for p from (1- j) downto 0 + unless (ignorable (class p)) return p + finally (return -1))) + (next-significant (j) ; first non-ignorable element > J, or -1 + (loop for q from (1+ j) below m + unless (ignorable (class q)) return q + finally (return -1))) + (ri-count-left (j) + ;; Number of significant Regional_Indicator elements + ;; immediately to the left of J (stopping at the first + ;; non-RI significant element; ignorables are skipped). + (let ((count 0)) + (loop for p = (prev-significant j) then (prev-significant p) + while (and (>= p 0) (eq (class p) :regional_indicator)) + do (incf count)) + count)) + (break-before-p (j) + ;; Does the algorithm allow a break immediately before + ;; element J (1 <= J < M)? + (let* ((c (class j)) + (lit (class (1- j))) ; literal previous element + (pj (prev-significant j))) + (cond + ;; WB3: CR x LF + ((and (eq lit :cr) (eq c :lf)) nil) + ;; WB3a: (Newline | CR | LF) div + ((member lit '(:newline :cr :lf)) t) + ;; WB3b: div (Newline | CR | LF) + ((member c '(:newline :cr :lf)) t) + ;; WB3c: ZWJ x \p{Extended_Pictographic} + ((and (eq lit :zwj) (ext-pict-at j)) nil) + ;; WB3d: WSegSpace x WSegSpace + ((and (eq lit :wsegspace) (eq c :wsegspace)) nil) + ;; WB4: x (Extend | Format | ZWJ): never break before + ;; an ignorable (covers ignorables after sot or after + ;; another ignorable too). + ((ignorable c) nil) + ;; Nothing significant to the left: break. + ((< pj 0) t) + (t + (let ((p (class pj))) + (cond + ;; WB5: AHLetter x AHLetter + ((and (ah-letter-p p) (ah-letter-p c)) nil) + ;; WB6: AHLetter x (MidLetter|MidNumLetQ) AHLetter + ((and (ah-letter-p p) (mid-letter-q-p c) + (let ((nx (next-significant j))) + (and (>= nx 0) (ah-letter-p (class nx))))) + nil) + ;; WB7: AHLetter (MidLetter|MidNumLetQ) x AHLetter + ((and (ah-letter-p c) (mid-letter-q-p p) + (let ((pp (prev-significant pj))) + (and (>= pp 0) (ah-letter-p (class pp))))) + nil) + ;; WB7a: Hebrew_Letter x Single_Quote + ((and (eq p :hebrew_letter) (eq c :single_quote)) nil) + ;; WB7b: Hebrew_Letter x Double_Quote Hebrew_Letter + ((and (eq p :hebrew_letter) (eq c :double_quote) + (let ((nx (next-significant j))) + (and (>= nx 0) (eq (class nx) :hebrew_letter)))) + nil) + ;; WB7c: Hebrew_Letter Double_Quote x Hebrew_Letter + ((and (eq c :hebrew_letter) (eq p :double_quote) + (let ((pp (prev-significant pj))) + (and (>= pp 0) (eq (class pp) :hebrew_letter)))) + nil) + ;; WB8: Numeric x Numeric + ((and (eq p :numeric) (eq c :numeric)) nil) + ;; WB9: AHLetter x Numeric + ((and (ah-letter-p p) (eq c :numeric)) nil) + ;; WB10: Numeric x AHLetter + ((and (eq p :numeric) (ah-letter-p c)) nil) + ;; WB11: Numeric (MidNum|MidNumLetQ) x Numeric + ((and (eq c :numeric) (mid-num-q-p p) + (let ((pp (prev-significant pj))) + (and (>= pp 0) (eq (class pp) :numeric)))) + nil) + ;; WB12: Numeric x (MidNum|MidNumLetQ) Numeric + ((and (eq p :numeric) (mid-num-q-p c) + (let ((nx (next-significant j))) + (and (>= nx 0) (eq (class nx) :numeric)))) + nil) + ;; WB13: Katakana x Katakana + ((and (eq p :katakana) (eq c :katakana)) nil) + ;; WB13a: (AHLetter|Numeric|Katakana|ExtendNumLet) x ExtendNumLet + ((and (member p '(:aletter :hebrew_letter :numeric + :katakana :extendnumlet)) + (eq c :extendnumlet)) + nil) + ;; WB13b: ExtendNumLet x (AHLetter|Numeric|Katakana) + ((and (eq p :extendnumlet) + (or (ah-letter-p c) + (member c '(:numeric :katakana)))) + nil) + ;; WB15/WB16: in a Regional_Indicator run, break + ;; only between pairs: break before J iff an even + ;; number of RIs precede it. + ((and (eq p :regional_indicator) + (eq c :regional_indicator)) + (evenp (ri-count-left j))) + ;; WB999: otherwise break. + (t t)))))))) + ;; Find the first allowed boundary whose string index is > I. + (loop for j from 1 below m + when (and (> (aref idx j) i) (break-before-p j)) + do (return-from string-next-word-break (aref idx j))) + ;; WB2: otherwise the word extends to end of text. + n))))))) (defun char-titlecase (char) "Returns CHAR converted to title-case if that is possible." ===================================== src/i18n/locale/cmucl.pot ===================================== @@ -2019,8 +2019,8 @@ msgstr "" msgid "Vector of all callbacks." msgstr "" -#: src/compiler/tn.lisp src/compiler/main.lisp src/code/describe.lisp -#: src/code/debug-int.lisp src/code/debug-info.lisp +#: src/compiler/tn.lisp src/compiler/main.lisp src/code/unicode.lisp +#: src/code/describe.lisp src/code/debug-int.lisp src/code/debug-info.lisp #: src/code/foreign-linkage.lisp src/code/reader.lisp src/code/stream.lisp #: src/code/hash-new.lisp src/code/array.lisp src/code/alien-callback.lisp msgid "~S is not an array with a fill-pointer." View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/1dec5699a176f10dd48fa16… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/1dec5699a176f10dd48fa16… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0

[Git][cmucl/cmucl][issue-511-update-unicode-tests] Add Extended_Pictographic property for word-break rule WB3c
by Raymond Toy (＠rtoy) 02 Jun '26

02 Jun '26

Raymond Toy pushed to branch issue-511-update-unicode-tests at cmucl / cmucl Commits: c6ff297a by Raymond Toy at 2026-06-02T08:06:39-07:00 Add Extended_Pictographic property for word-break rule WB3c Word-break rule WB3c (do not break within emoji ZWJ sequences) needs the Extended_Pictographic property, which comes from emoji-data.txt -- a file we did not previously read. Read emoji-data.txt in build-unidata (only the Extended_Pictographic property; the other emoji properties are ignored) and pack the flag into bit 5 of the word-break table value, alongside the class index in the low 5 bits. On the runtime side, mask the class to 5 bits in unicode-word-break-code and add unicode-extended-pictographic-p to test the flag. unidata.bin regenerated. This adds the data only; string-next-word-break does not yet implement WB3c. - - - - - 3 changed files: - src/code/unidata.lisp - src/i18n/unidata.bin - src/tools/build-unidata.lisp Changes: ===================================== src/code/unidata.lisp ===================================== @@ -1190,11 +1190,21 @@ (gethash (composition-table-key c1 c2) *composition-pair-table* nil)))) (defun unicode-word-break-code (code) + ;; Low 5 bits hold the word-break class index; bit 5 is the + ;; Extended_Pictographic flag (see unicode-extended-pictographic-p). (unless (unidata-word-break *unicode-data*) (load-word-break)) (let* ((data (unidata-word-break *unicode-data*)) (n (qref8 data code))) - n)) + (logand n #x1F))) + +(defun unicode-extended-pictographic-p (code) + ;; The Extended_Pictographic property (from emoji-data.txt) is packed + ;; into bit 5 of the word-break table value. Needed for word-break + ;; rule WB3c. + (unless (unidata-word-break *unicode-data*) + (load-word-break)) + (logbitp 5 (qref8 (unidata-word-break *unicode-data*) code))) (defun unicode-word-break (code) ;; The order of the array here MUST match the order used in ===================================== src/i18n/unidata.bin ===================================== Binary files a/src/i18n/unidata.bin and b/src/i18n/unidata.bin differ ===================================== src/tools/build-unidata.lisp ===================================== @@ -663,6 +663,7 @@ case-fold-full case-fold-simple word-break + ext-pictographic ;; ... ) @@ -874,6 +875,18 @@ as ent = (find-ucd i) do (when ent (setf (ucdent-word-break ent) code)))))) + + ;; Extended_Pictographic (from emoji-data.txt) is needed for + ;; word-break rule WB3c. It is the only emoji property we use; + ;; ignore the others in the file. + (foreach-ucd "emoji-data" + ucd-directory + (lambda (min max prop) + (when (string= prop "Extended_Pictographic") + (loop for i from min to max + as ent = (find-ucd i) do + (when ent + (setf (ucdent-ext-pictographic ent) t)))))) (values vec (make-range :codes range))))) @@ -1008,17 +1021,20 @@ (defun pack-word-break (ucdent) ;; The code is the index in the list. :OTHER is a dummy value and ;; used to represent the default case. - (or (position (ucdent-word-break ucdent) - '(:other :cr :lf :newline :extend :format - :katakana :aletter :midnumlet :midletter :midnum - :numeric :extendnumlet :regional_indicator - ;; Classes added since Unicode 6.2 (6.3: hebrew_letter, - ;; single_quote, double_quote; 9.0: zwj; 11.0: wsegspace). - ;; Appended so existing indices are preserved; the array in - ;; unicode-word-break MUST match this order. - :hebrew_letter :single_quote :double_quote - :zwj :wsegspace)) - 0)) + ;; Low 5 bits: word-break class index (the array in unicode-word-break + ;; MUST match this order). Bit 5 (#x20): Extended_Pictographic, for + ;; word-break rule WB3c. + (logior + (or (position (ucdent-word-break ucdent) + '(:other :cr :lf :newline :extend :format + :katakana :aletter :midnumlet :midletter :midnum + :numeric :extendnumlet :regional_indicator + ;; Classes added since Unicode 6.2 (6.3: hebrew_letter, + ;; single_quote, double_quote; 9.0: zwj; 11.0: wsegspace). + :hebrew_letter :single_quote :double_quote + :zwj :wsegspace)) + 0) + (if (ucdent-ext-pictographic ucdent) #x20 0))) ;; ucd-directory should be the directory where UnicodeData.txt is ;; located. View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/c6ff297a87dd6e3e23e594e… -- View it on GitLab: https://gitlab.common-lisp.net/cmucl/cmucl/-/commit/c6ff297a87dd6e3e23e594e… You're receiving this email because of your account on gitlab.common-lisp.net. Manage all notifications: https://gitlab.common-lisp.net/-/profile/notifications | Help: https://gitlab.common-lisp.net/help

1 0