On Sun, 26 Nov 2006 14:55:21 +0200, "Alex Mizrahi" <alex.mizrahi@gmail.com> wrote:
I have an implementation that reports a CHAR-CODE-LIMIT lower than what it can actually handle -- it's ABCL (running on top of Java): only 256 char codes are officially supported, but it uses Java strings, so there's no problem handling Unicode strings -- I set *regex-char-code-limit* to some 10000 (thanks, Edi!). However, there are characters like 0xFEFF (the BOM), so I'd have to set *regex-char-code-limit* to 65535. I think it's overkill to do that -- I see CL-PPCRE creates an array of that size to do the matching.
How do people cope with this on Unicode-enabled Lisps? (AFAIK SBCL uses UCS-4 char codes, so there's definitely no sane CHAR-CODE-LIMIT there.)
Does CL-PPCRE create such an array for each scanner? If there's one global array, that's OK, but an array per scanner is too much...
Does *use-bmh-matchers* affect the usage of this array?
Yes. If you set it to NIL, no BMH matchers are created, and those are the only place where the arrays are needed.
The limit is also used in a few cases related to hash tables for character classes, but I don't think that's really important.
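So, a minimal sketch of what I'd try (untested; the value 65536 is just one choice that covers the whole BMP, including the BOM):

  (setf cl-ppcre:*use-bmh-matchers* nil)        ; don't build BMH matchers,
                                                ; so no CHAR-CODE-LIMIT-sized
                                                ; arrays are allocated
  (setf cl-ppcre:*regex-char-code-limit* 65536) ; still bounds the hash tables
                                                ; used for character classes

Note that both variables are consulted when a scanner is created, so set them before calling CREATE-SCANNER (or bind them around it).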
If so, would it be much slower if I disable them?
BMH matchers will only help you if your regular expression starts or ends with a constant string (the longer, the better) /and/ if your target strings are very long.
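For instance, something like this (an untested illustration -- the pattern and target here are made up):

  ;; The constant prefix "Content-Length: " is what a BMH matcher can
  ;; exploit to skip quickly through a long target string.
  (let* ((cl-ppcre:*use-bmh-matchers* t)  ; consulted at scanner creation time
         (scanner (cl-ppcre:create-scanner "Content-Length: \\d+"))
         (target (concatenate 'string
                              (make-string 100000 :initial-element #\x)
                              "Content-Length: 42")))
    (cl-ppcre:scan scanner target))

With *use-bmh-matchers* bound to NIL instead, the same call still matches; you only lose the fast literal pre-scan.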
HTH, Edi.