Re: [cl-ppcre-devel] CL-PPCRE Split behaviour

27 Aug 2007

      Sébastien Saint-Sevin <seb-cl-mailist@matchix.com> writes:
...
> While using cl-ppcre:split recently, I discovered that when the regex
> matches at pos 0, the function returns an empty string in the first
> position, where I think it should not, as I do not consider the empty
> string to be a substring of the original one.
> Ex : (cl-ppcre:split "\\s+" " foo bar baz ") ==>  ("" "foo" "bar" "baz")

It is an interesting question, but I believe that the current split
behavior of returning the leading empty string is the rational
behavior.  In my mind it comes down to the definition of split:
"returns a list of the substrings between the matches".  When a match
starts at position 0, the substring in front of it is the empty string.

Having said that, I often have real-world needs to *not* have the
leading empty string around.  I wish there were explicit keyword args
to omit any leading and trailing empty strings.  If I get motivated, I
might even make a patch!  Perl's version of split doesn't have keyword
args, so it tries to fit several behavior changes into its positional
arguments.
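
Until such keyword args exist, a thin wrapper around SPLIT gets most of
the way there.  The sketch below is an assumption, not part of CL-PPCRE:
the name SPLIT-WITHOUT-LEADING-EMPTIES is made up, and it relies on the
documented default (LIMIT of NIL) under which SPLIT already removes
trailing empty strings, so only the front of the list needs trimming.

  ;; Hypothetical helper, not part of CL-PPCRE.
  (defun split-without-leading-empties (regex target-string)
    "Like CL-PPCRE:SPLIT, but drops empty strings at the front of the
     result.  Internal empty strings are kept."
    ;; MEMBER-IF returns the tail of the list that starts at the first
    ;; non-empty string, or NIL if every element is empty.
    (member-if (lambda (s) (plusp (length s)))
               (cl-ppcre:split regex target-string)))

  (split-without-leading-empties "\\s+" " foo bar baz ") => ("foo" "bar" "baz")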

Here's some more practical advice: If you know your problem domain
well, you can try the inverse match trick.  Instead of calling SPLIT,
call ALL-MATCHES-AS-STRINGS with the inverse regex.  In this case:

  (all-matches-as-strings "\\S+" " foo  bar  baz ") => ("foo" "bar" "baz")

(This will skip internal empty strings in the general case, but that
doesn't matter for your example.)
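
To make that caveat concrete, here is a small comparison on a
comma-separated input.  The input "a,,b" is an illustrative assumption,
not taken from the original question.

  ;; SPLIT keeps the empty string between the two adjacent commas.
  (cl-ppcre:split "," "a,,b") => ("a" "" "b")

  ;; The inverse match only sees runs of non-comma characters, so the
  ;; empty field silently disappears.
  (cl-ppcre:all-matches-as-strings "[^,]+" "a,,b") => ("a" "b")

If empty fields carry meaning in your data (a CSV-like format, say), the
inverse match trick is the wrong tool.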

It's also easy to write your own split that does what you want.
An untested version is below.

Cheers,
Chris Dean

(defun simple-split (regex target-string)
  "A simple version of split that doesn't handle registers in any
   special way and drops the empty string produced by a match at the
   very start or very end of TARGET-STRING.  Untested!"
  (let ((res nil)                       ; The result, built in reverse
        (last-end 0))                   ; The end position of the last match
    (cl-ppcre:do-matches (mstart mend regex target-string)
      (unless (zerop mstart)
        (push (subseq target-string last-end mstart) res))
      (setf last-end mend))
    (when (< last-end (length target-string))
      (push (subseq target-string last-end) res))
    (nreverse res)))
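
A usage sketch for SIMPLE-SPLIT, with results traced by hand from the
definition above.  The second input is an illustrative assumption; it
shows that internal empty strings survive, unlike with the inverse
match approach.

  (simple-split "\\s+" " foo bar baz ") => ("foo" "bar" "baz")
  (simple-split "," "a,,b")             => ("a" "" "b")

Note that only a match that starts at position 0 or ends at the end of
the string is handled specially, so an input such as ",,a" would still
yield a leading "" from the second comma.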