The String-extensions Library

Designed by the Gwydion Project


1. Introduction


String-extensions is a library of routines for working with characters and strings. String-extensions exports these modules:

Conversions
This module consists of various useful conversions involving strings.
Character-type
This module is a Dylanized version of the C library ctype.h.
String-hacking
This module exports miscellaneous functions and data structures that are useful when working with strings and characters.
Regular-expressions
This module contains various functions that deal with regular expressions (regexps).
Substring-search
This module contains methods for searching for fixed substrings rather than general regular expressions.

2. The Conversions Module


The Conversions module consists of various useful conversions involving strings. They are:

string-to-integer(string, #key base) => integer [Function]
integer-to-string(integer, #key base) => string [Function]
digit-to-integer(character) => integer [Function]
integer-to-digit(integer) => character [Function]

Base defaults to 10, and is the radix for the number system to convert from/to. Bases below 2 are errors, as are bases above 36. When converting from a string, the string must exactly describe a number, with no excess characters. Digit-to-integer will signal an error if the digit is non-alphanumeric. Errors will be signalled for all invalid input.

as(<string>, character) [G.F. Method]

Turns a character into the appropriate string of length one.
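
The conversions above can be sketched as follows. This is a minimal illustration, assuming the enclosing module uses the Dylan and Conversions modules; conversions-demo is a hypothetical name and not part of the library.

```dylan
// Assumes the enclosing module uses the Dylan and Conversions modules.
// conversions-demo is a hypothetical name used only for this sketch.
define method conversions-demo () => ()
  let n = string-to-integer("1010", base: 2);  // binary digits, so n is 10
  let s = integer-to-string(n);                // base defaults to 10, so "10"
  let d = digit-to-integer('7');               // 7
  let c = integer-to-digit(7);                 // '7'
  let one = as(<string>, 'x');                 // "x", a string of length one

  // According to the text above, calls like these signal an error:
  //   string-to-integer("12 apples");      // excess characters
  //   string-to-integer("10", base: 37);   // base above 36
end method conversions-demo;
```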

3. The Character-type Module


Character-type is a Dylanized version of the C library ctype.h. It contains the following functions:

------------------------------------------------------------------------
Function and Argument Type  Returns #t for these characters
------------------------------------------------------------------------
alpha?(character)         a-zA-Z                                          
digit?(character)         0-9                                             
alphanumeric?(character)  a-zA-Z0-9                                       
whitespace?(character)    Space, tab, newline, formfeed, carriage return  
uppercase?(character)     A-Z                                             
lowercase?(character)     a-z                                             
hex-digit?(character)     0-9a-f                                          
punctuation?(character)   ,./<>?;\:"|'[]{}!@#$%^&*()-=_+`~            
graphic?(character)       alphanumeric or punctuation                     
printable?(character)     graphic or whitespace                           
control?(character)       not printable                                   
------------------------------------------------------------------------
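
As an illustration of how these predicates might be used, the sketch below counts the decimal digits in a string. It assumes the enclosing module uses the Dylan and Character-type modules; count-digits is a hypothetical helper, not something the library exports.

```dylan
// Assumes the enclosing module uses the Dylan and Character-type modules.
// count-digits is a hypothetical helper, not part of the library.
define method count-digits (text :: <string>) => (count :: <integer>)
  let count = 0;
  for (ch in text)
    if (digit?(ch))
      count := count + 1;
    end if;
  end for;
  count;
end method count-digits;

// count-digits("route 66, exit 4") evaluates to 3
```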

4. The String-hacking Module


The String-hacking module exports miscellaneous functions and data structures that are useful when working with strings and characters.

add-last(stretchy-sequence, object) => stretchy-sequence [Generic Function]
add-last(string, character) => string [G.F. Method]

Like add, except it's guaranteed to add the character to the end of the string.

predecessor(character) => character [Function]

Get the character before this character. Equivalent to
as(<character>, -1 + as(<integer>, character))

successor(character) => character [Function]

Get the character after this character. Equivalent to
as(<character>, 1 + as(<integer>, character))

case-insensitive-equal(object1, object2) [Generic Function]
case-insensitive-equal(string1, string2) [G.F. Method]
case-insensitive-equal(character1, character2) [G.F. Method]

Does a case-insensitive equality test. Methods are provided only for strings and characters, not general collections.
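
The functions described so far in this module can be exercised as in the following sketch. It assumes the enclosing module uses the Dylan and String-hacking modules; string-hacking-demo is a hypothetical name chosen for the example.

```dylan
// Assumes the enclosing module uses the Dylan and String-hacking modules.
// string-hacking-demo is a hypothetical name used only for this sketch.
define method string-hacking-demo () => ()
  let next = successor('a');                             // 'b'
  let previous = predecessor('b');                       // 'a'
  let extended = add-last("abc", 'd');                   // the 'd' goes at the end
  let same? = case-insensitive-equal("Dylan", "DYLAN");  // a true value
  let also? = case-insensitive-equal('q', 'Q');          // a true value
end method string-hacking-demo;
```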

<character-set> [Sealed Abstract Class]
<case-sensitive-character-set> [Class]
<case-insensitive-character-set> [Class]

A <character-set> is a non-mutable subclass of <collection>, and is conceptually an unordered set of characters. Dylan collection elements always have keys, so to fit sets into Dylan, the key of an element of a character set is the element itself. There are two instantiable subclasses of <character-set>, <case-sensitive-character-set> and <case-insensitive-character-set>. <character-set> is not instantiable; one must always specify one of the instantiable subclasses when creating a character set.
There are two ways of making a character set. The first is a method for make using the description: keyword. The value that follows the description: keyword is a string that describes the set using a notation like a regular expression character set, except without the `[' and `]' delimiters. For example,
make(<case-sensitive-character-set>, description: "a-z")
would be the set of all lowercase alphabetic characters.
A second way to create character sets is to use an as method. The as method basically takes a collection of characters and discards the keys of these characters. Example:
as(<case-insensitive-character-set>, "abcdefghijklmnopqrstuvwxyz")
is again the set of all lowercase alphabetic characters. It is important to realize that the as method does not take a description:
as(<case-sensitive-character-set>, "a-z")
returns the set of `a', `-', and `z', not the set of all alphabetic characters.
The most useful operation on character sets is member?, which does what one would expect. Another useful operation is the forward-iteration-protocol. This basically calls member? on every possible character until it finds a character that is a member of the set. This means that in a <case-insensitive-character-set>, both `a' and `A' will come up.
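
The difference between the two construction methods, and the effect of case insensitivity on member?, can be sketched as follows. This assumes the enclosing module uses the Dylan and String-hacking modules; character-set-demo is a hypothetical name.

```dylan
// Assumes the enclosing module uses the Dylan and String-hacking modules.
// character-set-demo is a hypothetical name used only for this sketch.
define method character-set-demo () => ()
  // Built from a description: the string is read like a regexp character set.
  let vowels = make(<case-insensitive-character-set>, description: "aeiou");
  let upper-e? = member?('E', vowels);   // true, because the set ignores case

  // Built with as: the string is taken literally, character by character.
  let literal = as(<case-sensitive-character-set>, "a-z");
  let has-m? = member?('m', literal);    // false: only 'a', '-', and 'z' are members
  let has-dash? = member?('-', literal); // true
end method character-set-demo;
```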

<byte-character-table> [Class]

A byte-character-table is a vector that uses byte characters as indices instead of integers. The following are equivalent:
regular-vector[as(<integer>, character)]
byte-character-table[character]
<byte-character-table> has absolutely no relation to <table>. It is simply a <mutable-explicit-key-collection>.

5. The Regular-expressions Module


The Regular-expressions module contains various functions that deal with regular expressions (regexps). The module is based on Perl (version 4), and has the same semantics unless otherwise noted. The syntax for Perl-style regular expressions can be found on page 103 of Programming Perl by Larry Wall and Randal L. Schwartz. There are some differences in the way String-extensions handles regular expressions. The biggest difference is that regular expressions in Dylan are case insensitive by default. Also, when given an unparsable regexp, String-extensions will produce undefined behavior, while Perl would give an error message.

A regular expression that is grammatically correct may still be illegal if it contains an infinitely quantified sub-regexp that may match the empty string. That is, if R is a regexp that can match the empty string, then any regexp containing R*, R+, or R{n,} is illegal. In this case, the Regular-expressions module will signal an <illegal-regexp> error when the regexp is parsed. Note: Perl also has this restriction, even though it isn't mentioned in Programming Perl.

There is some work involved in analyzing a regular expression, and if the same regexp is used repeatedly with different target strings, this will result in wasted computation. For this reason, each basic function in the Regular-expressions module comes with a companion function that makes using a regular expression more efficient when it is used more than once. For example, the regexp-replace function has the make-regexp-replacer companion function. There is one exception: the join function has no make-joiner function. The "make-fooer" function analyzes the regular expression exactly once, and provides a function that makes use of this pre-analyzed regular expression. For example, the following two pieces of code yield the same result:

            regexp-position("This is a string", "is");

            let is-finder = make-regexp-positioner("is");
            is-finder("This is a string");

However, the second form is more efficient if is-finder is called multiple times.
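
A typical case where the companion function pays off is a loop over many target strings. The sketch below builds the positioner once and reuses it for every line. It assumes the enclosing module uses the Dylan and Regular-expressions modules; count-lines-with-is is a hypothetical helper.

```dylan
// Assumes the enclosing module uses the Dylan and Regular-expressions modules.
// count-lines-with-is is a hypothetical helper, not part of the library.
define method count-lines-with-is (lines :: <sequence>) => (count :: <integer>)
  // The regexp is analyzed once, outside the loop.
  let is-finder = make-regexp-positioner("is");
  let count = 0;
  for (line in lines)
    // The first return value is a start mark or #f; any mark counts as true.
    if (is-finder(line))
      count := count + 1;
    end if;
  end for;
  count;
end method count-lines-with-is;
```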

regexp-position [Generic Function] (big-string, regexp, #key start, end, case-sensitive)
=> variable-number-of-marks-or-#f

This function returns the index of the start of the regular expression in the big-string, or #f if the regular expression is not found. As a second value, it returns the index of the end of the regular expression in the big-string (assuming it was found; otherwise there is no second value). These values are called marks, and they come in pairs, a start-mark and an end-mark. If there are groups in the regular expression, regexp-position will return an additional pair of marks (a start and an end) for each group. If the group is matched, these marks will be integers; if the group is not matched, the marks will be #f. So
regexp-position("This is a string", "is");
returns values(2, 4), and
regexp-position("This is a string", "(is)(.*)ing");
returns values(2, 16, 2, 4, 4, 13), while
regexp-position("This is a string", "(not found)(.*)ing");
returns #f. Marks are always given relative to the start of big-string, not relative to the start: keyword.
Start: and end: specify what part of big-string to look at, and they default to the beginning and end of the string, respectively. Case-sensitive defaults to false.
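
Because the marks come back as multiple values, a caller will usually bind them with let. The following sketch shows one way to do that, assuming the enclosing module uses the Dylan and Regular-expressions modules; marks-demo is a hypothetical name. If the regexp is not found, Dylan binds every variable to #f, so testing the first mark is enough.

```dylan
// Assumes the enclosing module uses the Dylan and Regular-expressions modules.
// marks-demo is a hypothetical name used only for this sketch.
define method marks-demo () => ()
  let text = "This is a string";

  // One pair of marks for the whole match, one pair for the single group.
  let (match-start, match-end, group-start, group-end)
    = regexp-position(text, "(is)");
  // match-start is 2 and match-end is 4; the group marks are also 2 and 4.

  if (match-start)
    let matched = copy-sequence(text, start: match-start, end: match-end);
    // matched is "is"
  end if;

  // Marks stay relative to the start of text, even when start: is given.
  let (later-start, later-end) = regexp-position(text, "is", start: 3);
  // later-start is 5 and later-end is 7.
end method marks-demo;
```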

make-regexp-positioner [Generic Function] (regexp, #key byte-characters-only, need-marks, maximum-compile, case-sensitive)
=> an anonymous positioner
method (big-string, #key start, end)

Make-regexp-positioner can return several different types of positioners, and it is up to the user to specify what kind of positioner is wanted. By default, it returns a positioner that works like regexp-position. However, if need-marks is #f, it may give a positioner that only returns #t or #f, with no marks. (Then again, it may still return marks.) If byte-characters-only is specified, the positioner may only work on big-strings that consist only of byte characters (characters whose numerical value is between 0 and 255, inclusive). And if maximum-compile is #t, it will take a long time to return a positioner, but the positioner will run really fast.
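
When only a yes-or-no answer is needed, a positioner built with need-marks: #f can be wrapped in a predicate. This is a sketch under the assumption that the enclosing module uses the Dylan and Regular-expressions modules; $digit-finder and contains-digit? are hypothetical names.

```dylan
// Assumes the enclosing module uses the Dylan and Regular-expressions modules.
// $digit-finder and contains-digit? are hypothetical names for this sketch.
define constant $digit-finder
  = make-regexp-positioner("[0-9]", need-marks: #f);

define method contains-digit? (text :: <string>) => (found? :: <boolean>)
  // The positioner may return #t or a mark; either way, only #f is false.
  if ($digit-finder(text)) #t else #f end;
end method contains-digit?;

// contains-digit?("room 101") is #t; contains-digit?("lobby") is #f
```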

regexp-replace [Generic Function] (big-string, search-for-regexp, replace-with-string, #key count, case-sensitive, start, end)
=> new-string

This replaces all occurrences of search-for-regexp in big-string with replace-with-string. If count: is specified, it replaces only the first count occurrences. (This is different from Perl, which replaces only the first occurrence unless /g is specified.) Replace-with-string can contain backreferences to the regexp. For instance,
regexp-replace("The rain in spain and some other text", "the (.*) in (\\w*\\b)", "\\2 has its \\1")
returns "spain has its rain and some other text". If the subgroup referred to by the backreference was not matched, the reference is interpreted as the null string. For instance,
regexp-replace("Hi there", "Hi there(, Bert)?", "What do you think\\1?")
returns "What do you think?" because ", Bert" wasn't found.
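
The effect of count: can be seen by comparing two calls on the same input. This is a minimal sketch, assuming the enclosing module uses the Dylan and Regular-expressions modules; replace-demo is a hypothetical name.

```dylan
// Assumes the enclosing module uses the Dylan and Regular-expressions modules.
// replace-demo is a hypothetical name used only for this sketch.
define method replace-demo () => ()
  // Without count:, every occurrence is replaced.
  let all = regexp-replace("a-b-c-d", "-", "+");                  // "a+b+c+d"
  // With count: 2, only the first two occurrences are replaced.
  let first-two = regexp-replace("a-b-c-d", "-", "+", count: 2);  // "a+b+c-d"
end method replace-demo;
```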

make-regexp-replacer [Generic Function] (regexp, #key replace-with, case-sensitive)
=> an anonymous replacer function that is either
method (big-string, #key count, start, end)
or
method (big-string, replace-string, #key count, start, end)

The first form is returned if the replace-with: keyword is supplied; otherwise the second form is returned. (There is no efficiency gained by supplying the replace-with string, but it might be convenient.)

translate [Generic Function] (big-string, from-string, to-string, #key delete, start, end)
=> new-string

This is equivalent to Perl's tr/// construct. From-string is a string specification of a character set, and to-string is another character set. Translate converts big-string character by character, according to the sets. For instance,
translate("any string", "a-z", "A-Z")
will convert "any string" to all uppercase: "ANY STRING".
Like Perl, character ranges are not allowed to be "backwards". The following is not legal:
translate("any string", "a-z", "z-a")
(This restriction may be removed in future releases) Unlike Perl's tr///, translate doesn't return the number of characters translated.
If delete: is true, any characters in the from-string that don't have matching characters in the to-string are deleted. The following will remove all vowels from a string and convert periods to commas:
translate("any string", ".aeiou", ",", delete: #t)
Delete: is false by default. If delete: is false and there aren't enough characters in the to-string, the last character in the to-string is reused as many times as necessary. The following converts several punctuation characters into spaces:
translate("any string", ",./:;[]{}()", " ");
Start: and end: indicate which part of the string to translate. They default to the entire string.
Caveats: Translate is always case sensitive.

translate [G.F. Method] (big-byte-string, from-byte-string, to-byte-string, #key delete, start, end)
=> new-string

The only existing method on translate operates on byte strings.

make-translator [Generic Function] (from-string, to-string, #key delete)
=> an anonymous translator
method (big-string, #key start, end) => new-string

Does what you'd expect it to.

make-translator [G.F. Method] (from-byte-string, to-byte-string, #key delete)
=> an anonymous translator
method (big-string, #key start, end) => new-byte-string

Again, the existing method on make-translator only handles byte strings.
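
The translate examples given earlier can be rewritten with make-translator when the same translation is applied many times. The sketch below assumes the enclosing module uses the Dylan and Regular-expressions modules; $upcase, $strip-vowels, and translator-demo are hypothetical names.

```dylan
// Assumes the enclosing module uses the Dylan and Regular-expressions modules.
// $upcase, $strip-vowels, and translator-demo are hypothetical names.
define constant $upcase = make-translator("a-z", "A-Z");
define constant $strip-vowels = make-translator(".aeiou", ",", delete: #t);

define method translator-demo () => ()
  let shouted = $upcase("any string");        // "ANY STRING"
  // Periods become commas; vowels have no counterpart and are deleted.
  let terse = $strip-vowels("any string.");   // "ny strng,"
end method translator-demo;
```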

split [Generic Function] (regexp, big-string, #key count, remove-empty-items, case-sensitive, start, end)
=> a variable number of strings

This is like Perl's split function. It searches big-string for occurrences of regexp, and returns the substrings that were delimited by that regexp. For instance,
split("-", "long-dylan-identifier")
returns values("long", "dylan", "identifier"). Note that what matched the regexp is left out. Remove-empty-items, which defaults to true, magically skips over empty items, so that
split("-", "long--with--multiple-dashes")
returns values("long", "with", "multiple", "dashes"). Count is the maximum number of strings to return. If there are n strings and count is specified, the first count - 1 strings are returned as usual, and the count'th string is the remainder, unsplit. So
split("-", "really-long-dylan-identifier", count: 3)
returns values("really", "long", "dylan-identifier"). If remove-empty-items is true, empty items aren't counted.
Case sensitive determines if the regexp for the delimiter should be considered case sensitive or not; it defaults to case-insensitive. Start: and end: indicate what part of the big string should be looked at for delimiters. They default to the entire string. For instance,
split("-", "really-long-dylan-identifier", start: 8)
returns values("really-long", "dylan", "identifier").
Caveat: Unlike Perl, empty regular expressions are never legal regular expressions, so there is no way to split a string into a #rest sequence-of-characters. Of course, in Dylan this is not a useful thing to do, so this is not really a problem.
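
Since split returns a variable number of values, a caller either names each expected piece or collects them with #rest. Both styles are sketched below, assuming the enclosing module uses the Dylan and Regular-expressions modules; split-demo is a hypothetical name.

```dylan
// Assumes the enclosing module uses the Dylan and Regular-expressions modules.
// split-demo is a hypothetical name used only for this sketch.
define method split-demo () => ()
  // Collect however many pieces there are. The delimiter is a regexp:
  // a comma followed by any number of spaces.
  let (#rest colors) = split(", *", "red, green,blue");
  // colors holds "red", "green", and "blue"

  // With count: 2, the second string is the unsplit remainder.
  let (key, value) = split("=", "name=a=b", count: 2);
  // key is "name" and value is "a=b"
end method split-demo;
```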

make-splitter [Generic Function] (pattern :: <string>, #key case-sensitive)
=> an anonymous splitter
method (big-string, #key count, remove-empty-items, start, end) => buncha-strings

Does what you would expect.

join [Generic Function] (delimiter :: <string>, #rest strings) => big-string

Does the opposite of a split.
join(":", word1, word2, word3)
is equivalent to
concatenate(word1, ":", word2, ":", word3)
(and no more efficient). Note that there is no make-joiner.
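
Because join takes its strings as #rest arguments, a sequence of strings has to be passed with apply. A minimal sketch, assuming the enclosing module uses the Dylan and Regular-expressions modules; join-demo is a hypothetical name.

```dylan
// Assumes the enclosing module uses the Dylan and Regular-expressions modules.
// join-demo is a hypothetical name used only for this sketch.
define method join-demo () => ()
  let path = join(":", "usr", "local", "bin");   // "usr:local:bin"

  // apply spreads the list into join's #rest parameter.
  let words = #("long", "dylan", "identifier");
  let name = apply(join, "-", words);            // "long-dylan-identifier"
end method join-demo;
```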

<illegal-regexp> [Class]

When an illegal regular expression is parsed, an error of this type will be signalled.

6. The Substring-search Module


Substring-search contains methods for searching for fixed substrings rather than general regular expressions. It is as similar to the regular expression module as we could make it. Substring functions work only on byte strings, and are always case sensitive.

substring-position [Generic Function] (big-string, search-for-string, #key start, end)
=> position-or-false

Returns the position of the search-for-string in the big-string (or that portion of the big-string specified by start: and end:). This search is always case sensitive.
This function uses the Boyer-Moore algorithm for long strings, and a simple dumb search for short strings. It should yield good performance under all circumstances.

make-substring-positioner [Generic Function] (search-for-string) => an anonymous positioner
method (big-string, #key start, end) => position-or-false

Does the obvious.
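
The two search functions can be compared side by side as follows. This sketch assumes the enclosing module uses the Dylan and Substring-search modules; $find-string and substring-demo are hypothetical names.

```dylan
// Assumes the enclosing module uses the Dylan and Substring-search modules.
// $find-string and substring-demo are hypothetical names for this sketch.
define constant $find-string = make-substring-positioner("string");

define method substring-demo () => ()
  let at = substring-position("This is a string", "string");  // 10
  let same = $find-string("This is a string");                // also 10
  let missing = substring-position("This is a string", "String");
  // missing is #f, because the search is case sensitive
end method substring-demo;
```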

substring-replace [Generic Function] (big-string, search-for-string, replace-with-string, #key count, start, end)
=> replaced-string

Replaces the substring, or the first count instances of it if count: is specified. Note this function does not support start: or end:.
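
Unlike regexp-replace, the search string here is taken literally, so a character such as `.' has no special meaning. A minimal sketch, assuming the enclosing module uses the Dylan and Substring-search modules; substring-replace-demo is a hypothetical name.

```dylan
// Assumes the enclosing module uses the Dylan and Substring-search modules.
// substring-replace-demo is a hypothetical name used only for this sketch.
define method substring-replace-demo () => ()
  // "." is an ordinary character here, not a regexp wildcard.
  let dashed = substring-replace("a.b.c", ".", "-");          // "a-b-c"
  let once = substring-replace("a.b.c", ".", "-", count: 1);  // "a-b.c"
end method substring-replace-demo;
```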

make-substring-replacer [Generic Function] (search-for :: <byte-string>, #key replace-with)
=> an anonymous function replacer that is either
method (big-string, #key count, start, end) => new-string
or
method (big-string, replace-with-string, #key count, start, end)

Does the obvious.

7. Known bugs


Regular-expressions will do unpredictable things if given bad arguments (i.e., a string that isn't a legal regular expression). Sometimes it'll crash, and sometimes it'll merrily chug away and return crazy answers.