- Header
- Namespace
- Error handling
- Basics
- Construction
- Platform-specific conversions
- Adding Strings
- Comparing Strings
- Iterating over string content
- Substrings
- Accessing C strings
- Accessing storage as a C array
- Building and modifications
- Utility methods
Everything needed to use sys_string
and all related functionality is declared in sys_string/sys_string.h
header.
An #include <sys_string/sys_string.h>
directive is assumed in all the examples.
Everything in this library is under namespace sysstr
.
A using namespace sysstr
directive is assumed in all the examples.
Most methods in this library are noexcept
. Methods that aren't, usually involve memory allocation or creation of system string types that can fail (usually also due to memory allocation failure). In such cases errors are reported via exceptions. All thrown exceptions derive from std::exception
. When it is certain that the failure is due to memory allocation std::bad_alloc
is raised. Otherwise, the library throws a more generic std::runtime_error
The two main classes provided by this library are sys_string
and sys_string_builder
. These are actually aliases (typedefs) to instantiations
of template<class Storage> class sys_string_t
and template<class Storage> class sys_string_builder_t
with a Storage
chosen based on platform
and compilation options. A Storage
is a policy class that defines what kind of system string type is stored inside a sys_string
.
On every supported platform there is one or more storage types available. You can always use other storage types in addition to
the default one. The following table describes the available sys_string_t
instantiations, their availability and when they are made to be
the default sys_string
Type | Interoperates with | Availability | Default sys_string when |
---|---|---|---|
sys_string_generic |
Basic char * |
Always | On non Apple Unix or anywhere if SYS_STRING_USE_GENERIC macro is set to 1 |
sys_string_cfstr |
CFStringRef /NSString * |
Apple platforms | No other macros set |
sys_string_win_generic |
Basic wchar_t * |
Windows | No other macros set |
sys_string_bstr |
BSTR |
Windows | If SYS_STRING_WIN_BSTR macro is set to 1 |
sys_string_hstring |
HSTRING |
Windows | If SYS_STRING_WIN_HSTRING macro is set to 1 |
sys_string_pystr |
Python PyObject * |
If SYS_STRING_ENABLE_PYTHON or SYS_STRING_USE_PYTHON are set to 1 |
If SYS_STRING_USE_PYTHON is set to 1 |
sys_string_android |
jstring |
Android | No other macros set |
sys_string_emscripten |
Emscripten JavaScript strings (via EM_VAL ) |
Emscripten | No other macros set |
Apart from storage specific conversions all of these types have the same interface and are simply described below as being a sys_string
.
In the simplest case a sys_string
literal can be created statically, that is, having its content embedded into executable image.
const sys_string & str = S("Hello!");
S
unfortunately has to be a macro. If such a short macro name offends you or collides with another macro in your program you can disable it via #define SYS_STRING_NO_S_MACRO 1
. In this case you can use verbose:
const sys_string & str = SYS_STRING_STATIC("Hello!");
You can also create a string by converting it from a single character, C string, C string + length, std::basic_tring
, std::basic_string_view
, std::vector
, std::span
or any other std::ranges::contiguous_range
of characters.
The type of character can be one of:
char
,signed char
,unsigned char
(these are always assumed to be UTF-8)char8_t
char16_t
char32_t
.
sys_string str1('a');
assert(str1 == S("a"));
sys_string str2(u"Hello!");
assert(str2 == S("Hello!"));
sys_string str3(U"Hello!", 2);
assert(str3 == S("He"));
std::string cppstr = "World";
sys_string str4(cppstr);
assert(str4 == S("World"));
std::vector<char16_t> vec = {u'a', u'b', u'c'};
sys_string str5(vec);
assert(str5 == S("abc"));
For the constructor taking a character pointer, nullptr
is a valid value and results in an empty sys_string
.
A sys_string_t
internally stores data in one of: UTF-8 (char
), UTF-16 (char16_t
) or UTF-32 (char32_t
) encodings, depending on configuration. The specific type used is represented by sys_string_t::storage_type
typedef. What happens when you create a sys_string_t
from external data that contains invalid Unicode? It depends on whether the external data is in the same encoding as the storage type.
- If the type is the same then the data is copied into
sys_string_t
verbatim. This allows you to store and manipulate invalid Unicode in platform specific way. For example Windows filenames, while ostensibly in UTF-16, can contain unpaired surrogates.sys_string_t
allows you to handle these but you need to make sure you pass them in as the OS gives them - inwchar_t *
orchar16_t *
- If the type is not the same then the result is unspecified but valid Unicode. In practice invalid content will be replaced by one or more
'REPLACEMENT CHARACTER' (U+FFFD) - similar to how a text editor or a browser would deal with it. The exact number of characters and the replacement algorithm in general are deliberately left unspecified. Thus portable code (that is code that is agnostic to the type of
sys_string_t
storage) cannot rely on invalid Unicode having a specific representation - this is by design.
In addition to portable conversions described above, sys_string
supports platform-specific conversions to and from various system string types.
Those are described on the following pages.
You can combine strings using +
operator.
sys_string str1 = S("a");
sys_string str2 = S("b");
sys_string str3 = S("c");
sys_string str = str1 + str2 + str3;
assert(str == S("abc"));
Unlike std::string
addition of sys_string_t
uses expression templates to avoid creation of temporaries. This means that addition is cheap and you have no reason to avoid it.
You can add sys_string_t
, single char32_t
characters and any range of any characters this way.
For example:
using namespace std::literals;
sys_string result = S("a") + U'b' + "cd" + "ef"s + u"gh"sv + U"ij" + std::vector{'k', 'l'};
assert(result == S("abcdefghijkl"));
Warning
You must not use auto
to declare the addition result. The result of an addition is a special temporary that only performs actual concatenation when converted to sys_string_t
. Using auto
declares a variable of that temporary type which will result in dangling pointers.
auto res = sys_string("abc") + sys_string("xyz"); //Bad!
//using res1 here is undefined behavior - it refers to already destroyed temporaries
if (res == S("abcxyz"))
...
sys_string_t
supports the normal inventory of comparison operators: ==
, !=
and <=>
(from which <
, <=
etc. are auto-generated). Comparison of strings is always performed as a binary comparison of internal representations, not Unicode codepoints and not normalizing Unicode. That is:
//these strings are canonically equivalent but not the same
assert(S("Cafe\u0301") != S("Café"));
The comparisons are defined this way for performance. If you do need canonical equivalence see normalize()
method below.
assert(S("Cafe\u0301").normalize(normalization::nfc) == S("Café").normalize(normalization::nfc));
This implies that while identical strings will compare equal and all strings are strongly ordered, the specific order might differ from the one a human would expect. If you do require human-oriented ordering, the correct thing to do is to use a full-fledged collator from another library such as ICU or Windows NLS.
There is also a free function compare
that behaves identically to <=>
operator.
Unlike std::string
, sys_string_t
also supports case insensitive comparisons.
This is done by binary comparing result of Unicode case folding.
assert(compare_no_case(S("maße"), S("MASSE")) == std::ordering::equal);
Both compare
and compare_no_case
return std::strong_ordering
.
You can iterate over sys_string
in 5 possible ways:
- Over a sequence of UTF-32,
char32_t
codepoints. This iteration is unidirectional, either forward or reverse and 'sanitizing'. - Over a sequence of UTF-16,
char16_t
units. This iteration is unidirectional, either forward or reverse and 'sanitizing'. - Over a sequence of UTF-8,
char
units. This iteration is unidirectional, either forward or reverse and 'sanitizing'. - Over a sequence of "storage units". This iteration is random-access and NOT sanitizing.
- Over a sequence of "grapheme clusters" (or graphemes) for short. This iteration is different from the others and is described fully below.
A "storage unit" is a platform dependent type (either char
, char16_t
or char32_t
) that is used to actually store the data in sys_string
.
It is represented by type sys_string::storage_type
. Iterating over storage units is the fastest way to iterate and the only way to get access
to "invalid Unicode" that might be stored in sys_string
.
All other types of iteration are "sanitizing". It is possible for a string to contain invalid Unicode - see Conversions from invalid Unicode. When this happens a 'sanitizing' iteration will replace invalid units with one or more replacement characters U+FFFD
in an unspecified manner. Note that it is not even guaranteed that forward and reverse iteration will produce the same replacements. Thus, when dealing with invalid Unicode, all you can expect is that sanitizing iteration will always produce valid Unicode but you cannot specifically guarantee what it would be. If you do need to predictably handle invalid Unicode content you need to iterate over storage units.
To iterate over string storage units you first need to create a sys_string::char_access
object. This object is cheap to create and acts as a "reference" to the actual sys_string
. It is a "reference object" and is only valid while the original string exists. It is meant to be used on stack, during computation and then discarded rather than stashed away for subsequent use.
sys_string::char_access
is a random_access_range but not a view and not even a viewable_range.
Thus in order to use this range in view pipelines you need to have it as an lvalue and access it via std::ranges::views::all
or std::ranges::ref_view
. For example
using namespace std::ranges;
using namespace std::views;
sys_string str = S("abc");
sys_string::char_access access(str);
assert(equal(all(access) | take(1), std::array{sys_string::storage_type('a')}))
sys_string::char_access
is a read-only range and is the fastest way to go over the sys_string
content.
The size of the range is given by size()
member (it is a sized_range).
There is also an operator[]
that allows direct indexing access to the sequence.
sys_string str("Hello");
for(sys_string::storage_type c: sys_string::char_access(str)) {
...
}
sys_string::char_access access(str);
std::cout << "storage size is " << access.size() << '\n';
std::cout << "unit at [0] is " << access[0] << '\n';
On some platforms sys_string::char_access
has a data()
member that returns a pointer to a contiguous array of storage units. Combination of data()
and size()
allow fast interop with platform specific code.
To perform UTF iteration you need to construct one of: utf32_access
, utf16_access
or utf8_access
from sys_string
.
These are read-only forward_ranges but not views and not even viewable_ranges.
sys_string str("Hello");
for(char32_t c: sys_string::utf32_access(str)) {
...
}
for(char16_t c: sys_string::utf16_access(str)) {
...
}
for(char c: sys_string::utf8_access(str)) {
...
}
Note that all these ranges have sentinel types (results of end()
, cend()
etc.) different from their iterator types. Thus they are not
common_ranges and generally require
new range algorithms from std::ranges
namespace rather than traditional ones.
For example:
sys_string str = S("abc");
sys_string::utf32_access view(str);
std::vector<char32_t> dest;
//This won't compile
//std::copy(view.begin(), view.end(), std::back_inserter(dest));
//But this will
std::ranges::copy(view.begin(), view.end(), std::back_inserter(dest));
//Or simply
std::ranges::copy(view, std::back_inserter(dest));
Similar to sys_string::char_access
(which they actually store inside) or std::string_view
these views are reference types. Their instances are only valid as long as the original string exists.
Despite being forward_range
as far as standard library is concerned these ranges also allow reverse iteration. That is, they
have reverse_iterator type and rbegin
/rend
methods. There is currently no standard concept that describes such a range so this library provides
sysstr::reverse_traversable_range
.
The fact these ranges are reversible rather than bidirectional is due to the two issues.
The first is efficiency. Allowing a decoding iterator to move in the opposite direction will increase its size and won't be as efficient as forward move.
The second is the 'sanitizing' iteration. Code using bidirectional iterators usually assumes that:
iterator it1 = ...;
iterator it2 = it1;
it2 = ++(--it2);
assert(it2 == it1);
However, with sanitizing iteration this might not necessarily be the case. If a substitution is made for illegal characters the amount of movement and returned values may differ for forward and reverse iteration. Thus to avoid subtle issues UTF range iterators disallow --
. If you want to iterate in reverse simply use returned by rbegin()
/rend()
. You can also use convert forward to reverse iterators and vice versa as follows: sysstr::ranges::make_reverse_iterator(view, it)
to obtain a reverse iterator from a forward one and vice-versa.
utf32_access access(...);
utf32_access::iterator it = ...;
utf32_access::reverse_iterator rit = access.reverse(it);
//or
//utf32_access::reverse_iterator rit = ranges::make_reverse_iterator(access, it);
utf32_access::iterator it1 = access.reverse(rit);
//or
//utf32_access::iterator it1 = ranges::make_reverse_iterator(access, rit);
assert(it1 == it2);
//you can also do reverse on sentinels
utf32_access::reverse_iterator rfirst = access.reverse(access.end());
utf32_access::iterator first = access.reverse(access.rend());
Since the internal facility to perform UTF iteration is quite generic this library exposes it to allow you to perform UTF iteration over any C++ input range of compatible characters (char
, char8_t
, char16_t
, char32_t
, and possibly wchar_t
on platforms where it is encoded in UTF-16 or UTF-32). At the time of this writing there is a work on including something similar to C++ standard library but, even if eventually approved, it will be a long time before it will become available.
You access UTF sequences by using view adapters. For example:
std::string str = "😀😜";
for (char16_t c: as_utf16(str)) {
...
}
for (char32_t c: as_utf32(str)) {
...
}
for (char c: as_utf8(u"😀😜")) {
...
}
The as_utf8
, as_utf16
and as_utf32
adapters produce view objects (e.g. fulfilling std::ranges::view concept) that can be used like any standard view. You can iterate over them in a for-each loop, pass their iterators or themselves to range algorithms etc.
If your standard library supports user-supplied range adapter closures these views can also be used in view pipelines:
as_utf8(u"😀😜") | std::views::take(1) | ...
Sometimes even UTF-32 iteration is not what you need. Many user perceived "characters" are actually composed from multiple UTF-32 codepoints. Unicode standard defines grapheme cluster as what corresponds to a user notion of a character. A single grapheme cluster, or grapheme for short, can contain one or more Unicode codepoints.
This library allows you to easily iterate over grapheme clusters in sys_string_t
content as well as in any C++
forward_range of compatible character type.
To iterate over graphemes you need to construct an instance of grapheme_view
directly or use graphemes
view adapter. In either case
you need to supply a view of characters to iterate over. The view can be a reference to sys_string_t::char_access
, sys_string_t::utfX_access
or any other compatible forward view.
The "values" returned from grapheme_view
are std::ranges::subrange
of the underlying view containing graphemes.
To put it all in context here is how you can iterate over all graphemes in a sys_string
.
sys_string str = S("क्त्य"); //6 Unicode codepoints but one grapheme!
sys_string::char_access access(str);
for (auto grapheme_range: graphemes(access)) {
//grapheme_range is a subrange of sys_string::char_access::iterator
sys_string grapheme(grapheme_range);
}
A grapheme_view
is reversible, that is it can be iterated in both directions. Here is how to accomplish a common task -
safely remove the last "character" from a string (see Substrings below for details on how to obtain parts of a string):
sys_string str = S("abक्त्य");
sys_string::char_access access(str);
auto gr_view = graphemes(access);
if (auto rit = gr_view.rbegin(); rit != gr_view.rend()) {
auto grapheme = *rit;
str = sys_string(access.begin(), grapheme.begin());
}
assert(str == S("ab"));
You can easily extend this to removing any number of trailing characters.
You can obtain a substring of a sys_string
in two ways:
- From a pair of
sys_string::utf32_access::iterator
s (or a range that provides them). This will "cut" the string at Unicode codepoint boundary ensuring that the output is well formed. This is the preferred way to slice strings.
const sys_string & str = S("🤢abc");
sys_string::utf32_access view(str);
auto first = view.begin();
auto last = std::advance(first, 1);
sys_string substring(first, last);
assert(substring == S("🤢"));
- From a pair of
sys_string::char_access::iterator
s (or a range that provides them). Sincechar_access
iterates over the actual storage units this allows you to "cut" the string anywhere. This way of slicing is dangerous, since you can end up slicing the string in a middle of a Unicode character. Unless this is exactly what you want to do for some reason, this would likely result in bugs in your code. Remember that sanitizing UTF iteration over such string will produce substitution characters.
const sys_string & str1 = S("abc");
sys_string::char_access access1(str1);
auto first = access1.begin();
auto last = first;
++last;
sys_string substring1(first, last);
//This works because 'a' is a single character in any UTF encoding
assert(substring1 == S("a"));
const sys_string & str2 = S("🤢bc");
sys_string::char_access access2(str2);
first = access2.begin();
last = first;
++last
sys_string substring2(first, last);
//Unless the interal storage is UTF-32 (which is currently not the case with any storage) the
//substring will NOT be equal to the first character
assert(substring2 != S("🤢"));
What if you are iterating over the string using utf16_access
or utf8_access
and want to slice it using those views' iterators? This is also possible since all of the utf_access
iterators provide storage_start()
method that returns a char_access::iterator
that points to the start of the Unicode codepoint in which the current utf8/utf16 iterator is in.
There is also storage_next()
which points to the points to the end of the Unicode codepoint in which the current utf8/utf16 iterator is in. Finally there is storage_last()
which return the sentinel at the end of storage sequence. These can be used to slice the string too (see also the second method below)
const sys_string & str = S("🤢abc");
sys_string::utf8_access view(str);
auto first = view.begin();
//The 🤢 character takes 4 UTF-8 bytes
assert(sys_string(first.storage_start(), std::advance(first, 1).storage_start()) == S(""));
assert(sys_string(first.storage_start(), std::advance(first, 2).storage_start()) == S(""));
assert(sys_string(first.storage_start(), std::advance(first, 3).storage_start()) == S(""));
assert(sys_string(first.storage_start(), std::advance(first, 4).storage_start()) == S("🤢"));
[!TIP] Note that for all the slicing methods you can also use reverse iterators in a similar fashion, if needed.
Ok, all this is great, but what if you do need to get simple C const char *
out of sys_string
in a portable fashion? Perhaps you need to call a C API or one that expects std::string_view
on all platforms. To support this scenario sys_string::char_access
exposes c_str()
method:
const sys_string & str = S("abc");
sys_string::char_access access(str);
const char * cstr = access.c_str();
assert(strcmp(cstr, "abc") == 0);
The returned pointer is valid as long as char_access
object is alive. The returned string is always in UTF-8, not in your locale's "narrow" encoding. All calls to c_str()
on the same char_access
object return the same pointer - that is the result is memoized.
Note that if the underlying string storage is not made of char
s the first call to c_str()
will result in memory allocation and conversion of sys_string
data to UTF-8. If you end up doing the above a lot and it becomes a performance issue this is a sign that sys_string
is not a right class for your problem domain and std::string
might be a better choice.
Sometimes you might want to access underlying storage units of sys_string
as a C array. To do so there are 3 methods that are usually need to be called together in portable code. The first one is data()
. It returns a direct pointer to underlying storage if the underlying storage is, in fact, a contiguous array. Otherwise it returns nullptr
. In such case you will need to allocate contiguous storage yourself - the size is given by storage_size()
- and copy data into it using copy_data()
. For example:
const sys_string & str = S("abc");
const sys_string::size_type size = str.storage_size();
if (const sys_string::storage_type * data = str.data())
{
use_data(data, size);
}
else
{
std::vector<sys_string::storage_type> buffer(str.storage_size());
str.copy_data(0, buffer.data(), str.storage_size());
use_data(buffer.data(), size);
}
Note that the content returned by data()
may be not null-terminated.
The copy_data()
method also allows you to copy starting from any index (first argument) and any valid size (third argument).
Instances of sys_string
are immutable. When you do need to construct a sys_string
from parts you have 3 ways of doing so
- Use
+
operator. This work well for concatenation of existingsys_string
s but is not optimal if the items you need to concatenate aren'tsys_string
orchar32_t
and need conversion tosys_string
. For example consider
std::string str1 = ..., str2 = ...;
sys_string result = sys_string(str1) + sys_string(str2)
This might perform 2 allocation for addends to store the conversion results and then add them and discard the temporaries.
-
Use
format()
orstd_format
methods. The first method works likeprintf
except it returns the formatted result in asys_string
. The second does the same usingstd::format
facility (if available) and uses its syntax. This works well for formatting but it is very slow and inefficient. -
Use
sys_string_builder
class. This is the most general, performant and flexible way.
To use sys_string_builder
you create an instance of it, perform various methods that populate its content and then call the build()
method to produce a sys_string
.
The build()
method is noexcept
and is very low or 0 cost - there are no allocations happening in it. Everything that needed to be allocated is already done by the time you call it.
Generally, sys_string_builder
can be used in two ways. First it exposes the conventional "builder" idiom where you can do things like:
std::string cppstr = ...;
sys_string str = sys_string_builder()
.append(U'🟣')
.append("abc")
.append(u"xyz")
.append(cppstr)
.build();
This is more efficient than concatenating sys_string
s because conversions happen only once and re-allocations follow the amortized cost algorithm pattern used by std::vector
or std::string
.
Second, sys_string_builder
exposes interface similar to the one of std::vector<char32_t>
. You can do things like push_back(char32_t)
, pop_back()
, insert(iterator, char32_t)
, erase(iterator)
, erase(iterator, iterator)
or clear()
.
The iterators used in these methods are returned by the regular begin()
, end()
etc. methods.
For example you can do things like:
sys_string_builder builder;
builder.insert(builder.end(), U'a');
builder.erase(builder..begin());
std::ranges::copy(U"bc"sv, std::back_inserter(builder));
assert(builder.build() == S("bc"));
However, note that sys_string_builder
isn't really a container of char32_t
s. Its internal storage unit is the same as sys_string
and is platform dependent. The interface deliberately forces you to use char32_t
s to prevent constructing invalid Unicode.
The iterators do not return references to the actual underlying storage and you cannot make modifications via iterators' operator*
. That is the following won't work
sys_string_builder builder;
builder.append("abc");
*builder.begin() = 'q'; //This won't work!
Another difference from std::vector
is that there is no size()
, resize()
etc. methods - the internal storage size is not the same as size in char32_t
s. Instead there are storage_size()
, resize_storage()
and reserve_storage()
methods.
In addition, sys_string_builder
supports as_utf8
, as_utf16
and as_utf32
adapters allowing you to iterate over its contents in any of these ways. The iterators exposed by as_utf32
are exactly the same as sys_string_builder
iterators.
auto empty() const noexcept -> bool
sys_string
supports common transformations:
to_lower()
which returns the original string converted to lowercaseto_upper()
which returns the original string converted to uppercasenormalize(normalization)
which normalizes the string to standard NFC or NFD normalization forms.
to_lower()
/to_upper()
implement default Unicode case conversion (see sections 3.13 and 5.18 of the Unicode standard). This
conversion is non-locale specific. (E.g. final Greek Σ
will convert to ς
but uppercase of i
is always I
regardless of locale).
assert(S("βους").to_upper() == S("ΒΟΥΣ"));
assert(S("ΒΟΥΣ").to_lower() == S("βους"));
assert(S("maße").to_upper() == S("MASSE"))
For normalization only NFC and NFD forms are supported being of practical importance to most applications. If you require other normalization forms you are likely to be doing some non-trivial text processing and likely already have a library such as ICU that can help with those.
assert(S("\u00C5").normalize(normalization::nfd) == S("\u0041\u030A"));
assert(S("\u0041\u030A").normalize(normalization::nfc) == S("\u00C5"));
static auto format(const char * spec, ...) -> sys_string;
static auto formatv(const char * spec, va_list vl) -> sys_string;
The allowed format syntax is the exactly the same as for printf
on your platform.
template<class... Args>
static auto std_format(std::format_string<Args...> fmt, Args &&... args) -> sys_string_t;
static auto std_vformat(std::string_view fmt, std::format_args args) -> sys_string_t;
This uses std::format
format string syntax and provides modern and type-safe alternative to format
.
However, these methods are only available if your standard library supports std::format
facility.
friend auto operator<<(std::ostream & str, const sys_string & val) -> std::ostream &;
Prints the string content as UTF-8 into std::ostream
. If the storage type of sys_string
is char
its content is printed verbatim.
Otherwise it is first converted to UTF-8.
Warning
If the output is converted to UTF-8 and goes somewhere that doesn't use UTF-8 encoding (Windows console or Unix terminal with non UTF-8 locale, for example) the output will be garbled.
On Windows or on any platform that defines __STDC_ISO_10646__
macro there is also
friend auto operator<<(std::wostream & str, const sys_string & val) -> std::wostream &;
that behaves similarly for wchar_t
.
Note
The operator<<
currently completely ignores width, precision and fill settings of the stream - the entire string
is printed out as is. This might change in future versions.
You can use sys_string
instances in std::format
. Currently there are no allowed format flags. As with operator<<
the output is produced in UTF-8.
std::string str = std::format("Hello {}", S("abc"));
assert(str = "Hello abc");
The following methods trim the string on both ends, left or right. By default space characters are removed. You can pass a predicate functor to change that.
The default sysstr::isspace
predicate detects as space characters those having Unicode property White_Space.
template<std::predicate<char32_t> Pred = sysstr::isspace>
auto trim(Pred pred = Pred()) const -> sys_string;
template<std::predicate<char32_t> Pred = sysstr::isspace>
auto ltrim(Pred pred = Pred()) const -> sys_string;
template<std::predicate<char32_t> Pred = sysstr::isspace>
auto rtrim(Pred pred = Pred()) const -> sys_string;
Many sys_string_t
method described below can take either a sys_string_t
as an argument or a single char32_t
.
The concept that describes such argument is named sys_string_or_char
in this text.
The following methods split the string into a sequence of strings on a given separator
template<sys_string_or_char StringOrChar, std::output_iterator<sys_string> OutIt>
auto split(OutIt dest, const StringOrChar & sep, size_t max_split = std::numeric_limits<size_t>::max()) const -> OutIt;
template<std::invocable<Search, typename sys_string_t::utf32_access::iterator, std::default_sentinel_t> Search,
std::output_iterator<sys_string> OutIt>
auto split(OutIt dest, Search pred) const -> OutIt;
The dest
argument is an output iterator to the destination. Each split part will be written into it as sys_string
.
The second argument can be
- A
char32_t
or asys_string
in which case it is the separator to split on. The optional 3rd argument specifies maximum number of splits to make. - A searcher functor that is given a pair
(begin, end)
ofutf32_view
iterators. It is responsible to locate the first occurrence of a separator in this sequence and return a pair of iterators (std::pair
,std::tuple
or anything that can be destructured) to beginning and end of the separator. To finish splitting it must return an empty range.
The rules for splitting on char32_t
or a sys_string
are as follows:
- If the string is empty the output is a single empty string
- If the separator is not found the output is a single string equal to source
- An empty separator is never found.
- At most
max_split
splits are made so the output containsmax_split + 1
elements. Ifmax_split
is 0 the output is a single string equal to source. - If the source starts with separator the output will start with an empty string (e.g. empty "prefix"). Similarly, if the source ends with separator the output will end with an empty string (e.g. empty "suffix"). Thus, if the source is equal to separator the output will be 2 empty strings.
This is the reverse of splitting. The following method joins a sequence of sys_strings
using the source
template<std::input_iterator It, std::sentinel_for<It> EndIt>
auto join(It first, EndIt last) const -> sys_string;
template<std::ranges::input_range Range>
auto join(const Range & range) const -> sys_string;
The passed iterators/range should produces something that can be appended (via one of the .append()
methods) to a sys_string_builder
.
You use it like this
std::vector<sys_string> parts = {S("a"), S("b")};
sys_string joined = S(',').join(parts);
assert(joined == S("a,b"));
The rules for joining are as follows
- If the sequence is empty (
first == last
), the output is an empty string. - If the sequence contains one string, the output is that string.
You can check whether a string starts with/ends with/contains another string or a single char32_t
using
template<sys_string_or_char StringOrChar>
auto starts_with(const StringOrChar & prefix) const -> bool;
template<sys_string_or_char StringOrChar>
auto ends_with(const StringOrChar & suffix) const -> bool;
template<sys_string_or_char StringOrChar>
auto contains(const StringOrChar & inner) const -> bool;
If you have multiple prefixes/suffixes/infixes to test you can use the following which is more efficient than calling the above in a loop
template<std::input_iterator It, std::sentinel_for<It> EndIt>
auto find_prefix(It first, EndIt last) const -> It;
template<std::ranges::input_range Range>
auto find_prefix(Range && range) const -> It;
template<std::input_iterator It, std::sentinel_for<It> EndIt>
auto find_suffix(It first, EndIt last) const -> It;
template<std::ranges::input_range Range>
auto find_suffix(Range && range) const -> It;
template<std::input_iterator It, std::sentinel_for<It> EndIt>
auto find_contained(It first, EndIt last) const -> It;
template<std::ranges::input_range Range>
auto find_contained(Range && range) const -> It;
The iterators should return the sequence of sys_string
or char32_t
to attempt to match. The return value is the iterator to the first found item or end of the sequence.
Sometimes, it is convenient to find whether a given substring is present and get the parts before, after or both in one step. The following methods enable this functionality:
template<sys_string_or_char StringOrChar>
auto suffix_after_first(const StringOrChar & divider) const -> std::optional<sys_string_t>;
template<sys_string_or_char StringOrChar>
auto prefix_before_first(const StringOrChar & divider) const -> std::optional<sys_string_t>;
template<sys_string_or_char StringOrChar>
auto suffix_after_last(const StringOrChar & divider) const -> std::optional<sys_string_t>;
template<sys_string_or_char StringOrChar>
auto prefix_before_last(const StringOrChar & divider) const -> std::optional<sys_string_t>;
template<sys_string_or_char StringOrChar>
auto partition_at_first(const StringOrChar & divider) const -> std::optional<std::pair<sys_string_t, sys_string_t>>;
template<sys_string_or_char StringOrChar>
auto partition_at_last(const StringOrChar & divider) const -> std::optional<std::pair<sys_string_t, sys_string_t>>;
All these methods return an std::optional
result which is not set if the divider isn't present. If it is present the
result contains the looked for part: suffix, prefix or an std::pair
of both.
You can find and remove prefix or suffix in one call using
template<sys_string_or_char StringOrChar>
auto remove_prefix(const StringOrChar & prefix) const -> sys_string;
template<sys_string_or_char StringOrChar>
auto remove_suffix(const StringOrChar & prefix) const -> sys_string;
Sometimes it is
For all of these method an empty string argument is always "found".
You can replace content of the string using
template<sys_string_or_char StringOrChar1, sys_string_or_char StringOrChar2>
auto replace(const StringOrChar1 & old, const StringOrChar2 & new_,
size_t max_count = std::numeric_limits<size_t>::max()) const -> sys_string>;
This replaces at most max_count
of occurrences of old
with the new_
and returns the resultant string.
The rules for finding old
are as follows
- An empty string is never found.
- Search is never done in replacement. That is after finding an instance of
old
and replacing i withnew_
the search continues after the replacement never taking it into account.
Note that unlike similar method in other languages replace
doesn't currently support regular expressions.