Strings chapter again (and again) #36

Draft
wants to merge 4 commits into
base: main
@@ -0,0 +1,15 @@
{
"hash": "4876597e0d5a972f10de82702b5692bc",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Character Strings\"\n---\n\n::: {.cell}\n\n:::\n\n\nThe R runtime performs [string interning](https://en.wikipedia.org/wiki/String_interning) to\nall of its string elements. This means, that whenever R encounters a new string,\nit adds it to its internal string intern pool. Therefore, it is unsound to\naccess R strings mutably.\n\n::: {.callout-tip }\nA string intern pool can be thought of as a container that stores all distinct\nstrings, and then provides a lightweight reference counted variable back to it.\nAn example of such a string interner is the [`lasso`](https://crates.io/crates/lasso) crate.\n:::\n\nLet's look at a concrete example:\n\n\n::: {.cell}\n\n```{.rust .cell-code}\n#[extendr]\nfn hello_world() -> &'static str {\n \"Hello world!\"\n}\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n.Internal(inspect(hello_world()))\n#> @11c4bd628 16 STRSXP g0c1 [] (len=1, tl=0)\n#> @119641448 09 CHARSXP g0c2 [REF(2),gp=0x60,ATT] [ASCII] [cached] \"Hello world!\"\n```\n:::\n\n\nThen, any time R encounters `\"Hello world!\"`, it retrieves it from the pool, rather\nthan re-instantiate it\n\n\n::: {.cell}\n\n```{.r .cell-code}\n.Internal(inspect(\"Hello world!\"))\n#> @11b2f0780 16 STRSXP g0c1 [REF(2)] (len=1, tl=0)\n#> @119641448 09 CHARSXP g0c2 [MARK,REF(3),gp=0x60,ATT] [ASCII] [cached] \"Hello world!\"\n```\n:::\n\n\nThe `STRSXP` is different, due to R's clone semantics, but the underlying\nstring `CHARSXP` is the same. Thus, equality is determined if two strings\nhave the same pointer, rather than if they have the same bytes.\n\nTherefore, `extendr` does not provide mutable access to an R string, because it breaks\nthe assumption that all strings are the immutable.",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
1 change: 1 addition & 0 deletions user-guide/.gitignore
@@ -0,0 +1 @@
_drafts/*
174 changes: 174 additions & 0 deletions user-guide/type-mapping/characters.qmd
@@ -1,3 +1,177 @@
---
title: "Character Strings"
---

```{r}
#| echo: false
library(rextendr)
```

In Rust, the standard owned, UTF-8 encoded string type is `String`. An example of
instantiating such a type:

```{extendr, echo=TRUE}
let mut rust_string = String::new();
rust_string.push_str("Hello world!");
rust_string
```

A direct translation of this to R is:

```{r}
r_string <- "Hello world!"
r_string
```

Indeed, the two are the same in the sense that they contain the same UTF-8 bytes:

```{r}
charToRaw(r_string)
```

```{extendr}
let bytes = String::from("Hello world!");
let bytes = bytes.as_bytes().to_owned();
bytes
```
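
To compare the two outputs by eye, the Rust bytes can be rendered in the same two-digit hexadecimal form that R uses when printing a raw vector. The following is a minimal sketch in plain Rust, outside of extendr; `hex_bytes` is a hypothetical helper introduced only for illustration.

```rust
/// Hypothetical helper: render the UTF-8 bytes of `s` as space-separated,
/// two-digit hexadecimal values, similar to how R prints a raw vector.
fn hex_bytes(s: &str) -> String {
    s.bytes()
        .map(|b| format!("{:02x}", b))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // Prints the bytes of the same text that is stored in `r_string`.
    println!("{}", hex_bytes("Hello world!"));
}
```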


Let us investigate the addresses of these two identical pieces of data:

```{extendrsrc}
#[extendr]
fn hello_world() -> &'static str {
    let hello_world = "Hello world!";
    rprintln!("Address of the Rust `hello_world`: {:p}", hello_world.as_ptr());
    hello_world
}
```


```{r}
hello_world()
```

And the address of `hello_world`, once it is part of the R runtime:

```{r}
.Internal(inspect(hello_world()))
```

::: {.callout-note}
The return type of `hello_world` need not be `&'static str`. The lifetime can be made
arbitrary, such as `fn hello_world<'a>() -> &'a str`.
:::
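
The note above can be checked in plain Rust, without extendr. In this sketch, `hello_static` and `hello_any` are hypothetical names; because a string literal lives for the whole program, the caller may pick any lifetime for `'a`, including `'static`.

```rust
/// Return type spelled with the `'static` lifetime.
fn hello_static() -> &'static str {
    "Hello world!"
}

/// Same literal, but the caller chooses the lifetime `'a`.
fn hello_any<'a>() -> &'a str {
    "Hello world!"
}

fn main() {
    // The generic version can still be used where `'static` is required.
    let s: &'static str = hello_any();
    println!("{} / {}", hello_static(), s);
}
```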

A `character` vector in R could be compared to a `Vec<String>` in Rust. However, there is an important distinction, which we'll illustrate with an example.

```{extendr}
let states = ["Idaho", "Texas", "Maine"]; // five-letter US states
// Flatten the bytes of every state into a single `Vec<u8>`.
let b_states = states.into_iter().flat_map(|x| x.bytes()).collect::<Vec<u8>>();
b_states
```

And in R:

```{r}
# charToRaw(c("Idaho", "Texas", "Maine")) would only use the first element,
# so apply charToRaw() to each element instead
vapply(c("Idaho", "Texas", "Maine"), charToRaw, FUN.VALUE = raw(5))
```

But what about identity and permanence? Let us first look at an array of string slices (`&str`) with repeated values:

```{extendr}
let sample_states = ["Texas", "Maine", "Maine", "Idaho", "Maine", "Maine"];
sample_states
    .into_iter()
    .map(|x| format!("{:p}", x.as_ptr()))
    .collect::<Vec<_>>()
```

And in R:

```{r}
sample_states <- c("Texas", "Maine", "Maine", "Idaho", "Maine", "Maine")
.Internal(inspect(sample_states))
```

Thus, `[&str]` and `character` behave similarly: repeated values can share a single address. Let's investigate `[String]`:

<!-- @co-authors: This was the only way I could write this code without rustc optimising it out.. -->

```{extendr}
[
    "Texas".to_string(),
    "Maine".to_string(),
    "Maine".to_string(),
    "Idaho".to_string(),
    "Maine".to_string(),
    "Maine".to_string(),
]
.iter()
.map(|x| format!("{:p}", x.as_ptr()))
.collect::<Vec<_>>()
```

<!-- @co-authors: the snippet below is an alternative to the above snippet -->

```{extendr, echo=FALSE, eval=FALSE}
let sample_states = [
    "Texas",
    "Maine",
    "Maine",
    "Idaho",
    "Maine",
    "Maine",
];
let mut state_ptrs = Vec::with_capacity(sample_states.len());
let mut state_strings = Vec::with_capacity(sample_states.len());
for state in sample_states {
    let mut x_string = String::with_capacity(5);
    x_string.push_str(state);
    state_ptrs.push(format!("{:p}", x_string.as_ptr()));
    state_strings.push(x_string);
}
state_ptrs
```

The memory addresses of all the items are different, even for entries that have the same value, because each `String` owns its own heap allocation.

Thus, R's `character` vector more closely resembles `[&str]` than a container of `String`.
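
The contrast can also be reproduced in plain Rust, outside of extendr. The following is a minimal sketch with hypothetical helper names (`slice_addrs`, `string_addrs`). It reuses one `&str` binding, so the shared address does not rely on the compiler deduplicating separate literals.

```rust
/// Addresses of the bytes behind each string slice.
fn slice_addrs(items: &[&str]) -> Vec<usize> {
    items.iter().map(|s| s.as_ptr() as usize).collect()
}

/// Addresses of the heap buffers owned by each `String`.
fn string_addrs(items: &[String]) -> Vec<usize> {
    items.iter().map(|s| s.as_ptr() as usize).collect()
}

fn main() {
    let maine = "Maine";
    // Three copies of the same slice: all point at the same bytes.
    let borrowed = [maine, maine, maine];
    // Three owned strings with equal contents: each has its own buffer.
    let owned: Vec<String> = borrowed.iter().map(|s| s.to_string()).collect();

    println!("&str addresses:   {:x?}", slice_addrs(&borrowed));
    println!("String addresses: {:x?}", string_addrs(&owned));
}
```

The exact addresses vary between runs; only the pattern (shared versus distinct) matters here.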

<!-- TODO: mention that direct indexing in utf-8 is difficult... -->

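The R runtime performs [string interning](https://en.wikipedia.org/wiki/String_interning) on
all of its string elements. This means that whenever R encounters a new string,
it adds it to its internal string intern pool, and every later occurrence of that
string refers to the same pooled entry. Therefore, it is unsound to access R strings
mutably: changing one entry in place would change every vector that shares it.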

::: {.callout-tip }
A string intern pool can be thought of as a container that stores each distinct
string once, and then hands out a lightweight reference back to it.
An example of such a string interner is the [`lasso`](https://crates.io/crates/lasso) crate.
:::
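
As a rough illustration of the idea, here is a toy interner in plain Rust. It is a sketch only: `Interner`, `intern`, and `len` are hypothetical names, it uses `Rc<str>` handles for simplicity, and it is not how R or `lasso` implement their pools.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Toy string interner: stores each distinct string once.
#[derive(Default)]
struct Interner {
    pool: HashSet<Rc<str>>,
}

impl Interner {
    /// Return a handle to the pooled copy of `s`, adding it if it is new.
    fn intern(&mut self, s: &str) -> Rc<str> {
        if let Some(existing) = self.pool.get(s) {
            // Already pooled: hand out another handle to the same storage.
            return Rc::clone(existing);
        }
        let new: Rc<str> = Rc::from(s);
        self.pool.insert(Rc::clone(&new));
        new
    }

    /// Number of distinct strings stored in the pool.
    fn len(&self) -> usize {
        self.pool.len()
    }
}

fn main() {
    let mut pool = Interner::default();
    let a = pool.intern("Maine");
    let b = pool.intern("Maine");
    let c = pool.intern("Texas");

    println!("a and b share storage: {}", Rc::ptr_eq(&a, &b));
    println!("a and c share storage: {}", Rc::ptr_eq(&a, &c));
    println!("distinct strings stored: {}", pool.len());
}
```

With handles like these, comparing two interned strings reduces to comparing pointers, which is the property the R examples in this section rely on.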

Let's look at a concrete example:

```{extendrsrc}
#[extendr]
fn hello_world() -> &'static str {
    "Hello world!"
}
```

```{r}
.Internal(inspect(hello_world()))
```

Then, any time R encounters `"Hello world!"`, it retrieves it from the pool, rather
than re-instantiating it:

```{r}
.Internal(inspect("Hello world!"))
```

The `STRSXP` is different, due to R's clone semantics, but the underlying
string `CHARSXP` is the same. Thus, equality of two strings can be determined by
checking whether they have the same pointer, rather than by comparing their bytes.

Therefore, `extendr` does not provide mutable access to an R string, because doing so
would break the assumption that all strings are immutable.
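
The hazard of mutating shared strings can be illustrated by analogy in plain Rust. In the sketch below, an `Rc<str>` stands in for a pooled string; this is an assumption made for illustration and not how R or `extendr` represent strings. `try_uppercase_in_place` is a hypothetical helper that only mutates when no other handle shares the data.

```rust
use std::rc::Rc;

/// Hypothetical helper: uppercase the string in place, but only if this
/// handle is the sole owner. Returns whether the mutation happened.
fn try_uppercase_in_place(handle: &mut Rc<str>) -> bool {
    match Rc::get_mut(handle) {
        Some(s) => {
            s.make_ascii_uppercase();
            true
        }
        // Another handle shares the data, so mutating would affect it too.
        None => false,
    }
}

fn main() {
    let mut first: Rc<str> = Rc::from("maine");
    let second = Rc::clone(&first);

    // Shared: the mutation is refused and `second` is left untouched.
    println!("mutated while shared: {}", try_uppercase_in_place(&mut first));
    println!("second = {}", second);

    drop(second);
    // Unique again: the mutation is now allowed.
    println!("mutated when unique: {}", try_uppercase_in_place(&mut first));
    println!("first = {}", first);
}
```

Rust's `Rc` can refuse the mutation because it knows how many handles exist. Code that receives an R string cannot safely assume it holds the only reference, which is why exposing only immutable access is the conservative choice.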