Hi,
I just subscribed to this mailing list, which I believe is not only for cl-ppcre but also for cl-unicode. If I am wrong, please point me in the right direction :-)
My name is Juanjo and I am the maintainer of ECL (http://ecls.sourceforge.net). I am currently interested in completing the support for Unicode in ECL which is, more or less, at the level of what SBCL provides and, in my opinion, far from optimal.
I have been pondering several options, but all of them seem like reinventing the wheel, so I finally came to the conclusion that the most sensible strategy would be to turn cl-unicode into a full (optional) replacement of the ANSI Common Lisp functions for dealing with characters and strings, and hope that this would become a de-facto standard. Perhaps that is too ambitious a goal, or maybe it is even futile, given the level of adoption of Unicode among lispers.
My concerns now center on several questions.
1) Optimize the database information that is built into cl-unicode. ECL currently uses the SBCL procedure for compressing the database and I believe this can be optimized even further. Instead of binary trees or hash tables, that procedure produces a two-stage byte table that encodes the currently 209 different combinations of properties. This is important for ECL because we need it to stay lean and simple and because our procedures for exporting data structures in compiled code are not efficient, due to constraints in C compilers. One possibility is that CL-UNICODE reuses the SBCL and ECL databases. Another possibility is to look for even more efficient data structures.
2) Add support for the most important Unicode algorithms, which are canonical decomposition of strings, string upper/lower/titlecasing, and string collation. Ideally this should be transparently incorporated into new Common-Lisp functions that can be used to replace the old ones, such as char-upcase, string-equal, etc. Of course, due to the differences between Unicode and ANSI CL, the specifications would change.
3) Add support for the locales database provided by the Unicode consortium. This is essential for implementing string collation, since the ordering of characters is locale dependent.
4) Integration and shipping of cl-unicode with different implementations, if possible. I would be interested in having CL-UNICODE as a contributed package in the ECL source tree, so that it can be activated with a simple configuration option. I believe there are no license issues, and there is only the problem that CL-UNICODE depends on CL-PPCRE (is this dependency essential? could it be eliminated?)
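To make the two-stage table from point 1 concrete, here is a minimal sketch in Python (chosen only for brevity; this is not the SBCL or ECL code). It assumes 256-code-point blocks and an arbitrary choice of three properties taken from Python's `unicodedata` module, so the number of combinations it finds will not match the 209 mentioned above. All names (`build_tables`, `lookup`, `STAGE1`, `STAGE2`, `COMBOS`) are hypothetical.

```python
import unicodedata

BLOCK_BITS = 8                    # assumed block size: 256 code points
BLOCK_SIZE = 1 << BLOCK_BITS
MAX_CODE_POINT = 0x110000         # one past the last Unicode code point


def properties(cp):
    """Property combination for one code point (illustrative selection)."""
    ch = chr(cp)
    return (unicodedata.category(ch),
            unicodedata.bidirectional(ch),
            unicodedata.combining(ch))


def build_tables():
    """Return (stage1, stage2, combos) for a two-stage lookup.

    combos : list of distinct property tuples
    stage2 : concatenation of the distinct blocks of combo indices
    stage1 : for each block of code points, the number of its block in stage2
    """
    combos = {}   # property tuple -> small integer index
    blocks = {}   # tuple of combo indices -> block number in stage2
    stage1, stage2 = [], []
    for start in range(0, MAX_CODE_POINT, BLOCK_SIZE):
        block = tuple(combos.setdefault(properties(cp), len(combos))
                      for cp in range(start, start + BLOCK_SIZE))
        if block not in blocks:          # store each distinct block only once
            blocks[block] = len(blocks)
            stage2.extend(block)
        stage1.append(blocks[block])
    return stage1, stage2, list(combos)


STAGE1, STAGE2, COMBOS = build_tables()


def lookup(cp):
    """Two array reads plus one read of the combination table."""
    block = STAGE1[cp >> BLOCK_BITS]
    index = STAGE2[(block << BLOCK_BITS) | (cp & (BLOCK_SIZE - 1))]
    return COMBOS[index]
```

Because large ranges of code points share identical blocks (unassigned planes, CJK ideographs, private use areas), `STAGE2` ends up far shorter than a flat table with one entry per code point. Whether the indices fit in a byte depends on how many distinct combinations and blocks the chosen properties produce.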
Well, maybe this is all BS, but I would like to read your opinions on the topic.
Juanjo
-- Instituto de Física Fundamental, CSIC c/ Serrano, 113b, Madrid 28009 (Spain) http://juanjose.garciaripoll.googlepages.com
On Mon, Jan 12, 2009 at 1:56 PM, Juan Jose Garcia-Ripoll juanjose.garciaripoll@googlemail.com wrote:
- Integration and shipping of cl-unicode with different implementations, if possible. I would be interested in having CL-UNICODE as a contributed package in the ECL source tree, so that it can be activated with a simple configuration option. I believe there are no license issues, and there is only the problem that CL-UNICODE depends on CL-PPCRE (is this dependency essential? could it be eliminated?)
Hi Juanjo,
Sorry for the delay. It's fine with me if you distribute CL-UNICODE with ECL and I also think there should be no licensing issues.
CL-PPCRE is used in a couple of places for parsing. These could be replaced with hand-crafted parsers, but it'd be a bit of work to do that a) correctly, b) without blowing up the code base enormously, and c) without significant sacrifices w.r.t. speed. Having said that, I'm open to accepting patches to get rid of this dependency... :)
Cheers, Edi.
Hi Edi,
On Wed, Jan 14, 2009 at 9:42 PM, Edi Weitz edi@agharta.de wrote:
Sorry for the delay. It's fine with me if you distribute CL-UNICODE with ECL and I also think there should be no licensing issues.
Great to read that.
CL-PPCRE is used in a couple of places for parsing. These could be replaced with hand-crafted parsers, but it'd be a bit of work to do that a) correctly, b) without blowing up the code base enormously, and c) without significant sacrifices w.r.t. speed. Having said that, I'm open to accepting patches to get rid of this dependency... :)
I think I will leave that for the end. I am now learning the normalization algorithms to implement string equality comparisons using cl-unicode as-is. Once it passes the test suites, I hope to move to string collation and then look into things related to dependencies, databases, etc. As I said, my goal is to integrate this in cl-unicode, so if and when I get moving I will send patches either to you or to the mailing list.
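As a rough illustration of what normalization-based string equality means, here is a minimal sketch in Python using its standard `unicodedata` module (not cl-unicode; the function names `canonical_equal` and `caseless_equal` are hypothetical). Two strings are treated as equal when their canonical decompositions (NFD) match, and the caseless variant follows the Unicode definition of a canonical caseless match, NFD(casefold(NFD(s))).

```python
import unicodedata


def canonical_equal(a, b):
    """True if a and b are canonically equivalent (equal after NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def caseless_equal(a, b):
    """Canonical caseless match: compare NFD(casefold(NFD(s)))."""
    def key(s):
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)
    return key(a) == key(b)


precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                # code-point comparison
print(canonical_equal(precomposed, decomposed)) # canonical equivalence
print(caseless_equal("Stra\u00dfe", "STRASSE")) # full case folding
```

The last call shows why a Unicode-aware string-equal cannot work character by character: case folding maps "ß" to "ss", so the two strings differ in length before folding.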
Juanjo
cl-ppcre-devel@common-lisp.net