String_200 #32

Open
wants to merge 13 commits into base: main
60 changes: 60 additions & 0 deletions R/split_var_call.R
@@ -0,0 +1,60 @@
R_SPLIT = function(domain_dataset,max_length_out = 200){
Member

  • Please follow the tidyverse style guide and use snake_case: https://style.tidyverse.org/syntax.html
  • Please add roxygen documentation with examples
  • Please add test cases
  • Please add `namespace::` prefixes to all calls to functions from tidyverse packages
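
As a rough illustration of the style, documentation, and namespace points above, the sketch below documents a small snake_case function with roxygen and prefixes its dplyr call. The name `select_long_columns` and the example data are hypothetical and not part of this PR; the function only mirrors the column-filtering step, under the assumption that only character columns should be considered.

```r
#' Select character columns whose longest value reaches a length limit
#'
#' Hypothetical helper for illustration only.
#'
#' @param domain_dataset A data frame.
#' @param max_length_out Length limit; columns whose longest value has at
#'   least this many characters are kept (default 200).
#' @return A data frame containing only the over-length character columns.
#' @examples
#' df <- data.frame(id = 1:2, txt = c("short", strrep("a", 250)))
#' names(select_long_columns(df))
#' @export
select_long_columns <- function(domain_dataset, max_length_out = 200) {
  # dplyr:: prefix instead of relying on an attached package
  dplyr::select_if(
    domain_dataset,
    ~ is.character(.x) && max(nchar(.x), na.rm = TRUE) >= max_length_out
  )
}
```

A test case could then call the function on a small data frame like the one in `@examples` and check the returned column names.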

  out_n = outt = NULL

  #filtering columns > 200
  char_200 = domain_dataset %>% select_if(~ max(nchar(.)) >= max_length_out)

  #string split function
  split_var <- function(string,max_length_out = 200) {
Member

Please do not define a function inside another function.
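
A minimal sketch of the suggested layout: the helper lives at the top level, so it can be documented and tested on its own, and the caller passes the limit explicitly. Both names (`chunk_string`, `split_values`) are hypothetical, and the fixed-width chunking is only a stand-in, not the separator-aware logic in this PR.

```r
# Top-level helper: can be documented and unit-tested independently.
# NOTE: fixed-width chunking is a placeholder for the PR's real split logic.
chunk_string <- function(string, max_length_out = 200) {
  # Start positions of each chunk; at least one chunk even for ""
  starts <- seq(1, max(nchar(string), 1), by = max_length_out)
  substring(string, starts, starts + max_length_out - 1)
}

# Caller: passes max_length_out as an argument instead of relying on
# a function defined in its own body.
split_values <- function(values, max_length_out = 200) {
  lapply(values, chunk_string, max_length_out = max_length_out)
}
```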


    # Pattern spot
    pattern = names(which.max(table(str_extract_all(string, "[:punct:]|[:blank:]")))) %>%
Member

The pattern can be used for splitting, but it will not work as the `sep` in the paste function.
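
To illustrate the mismatch: `stringr::str_split()` treats its pattern as a regular expression by default, while `paste()` treats its separator as literal text. The sketch below assumes the detected separator is a single literal character (here `"."`, chosen only as an example) and wraps it in `stringr::fixed()` so the same value can serve both roles.

```r
sep <- "."          # example separator; a regex metacharacter on purpose
text <- "a.b.c"

# As a regex, "." matches every character, so this does not split on dots
regex_parts <- stringr::str_split(text, sep)[[1]]

# fixed() makes str_split() treat the separator literally
literal_parts <- stringr::str_split(text, stringr::fixed(sep))[[1]]

# The literal separator can then be reused to rebuild the string
rejoined <- paste(literal_parts, collapse = sep)
```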

      ifelse(is.null(.), "",.)

    # Split the input string into a vector
    split_vector <- unlist(stringr::str_split(string, pattern))

    # Function to concatenate strings and split when length exceeds 200
    split_when_needed <- function(result, item, sep, max_length_out) {
      current <- utils::tail(result, 1)
      if (nchar(paste0(current, sep, item)) <= (max_length_out - 1)) {
        result[length(result)] <- paste0(current, sep, item)
      } else {
        if (!identical(sep, " ")) result[length(result)] <- paste0(current, sep)
        result <- c(result, item)
      }
      result
    }

    # Use reduce to apply the function across the vector
    split_vector <- split_vector[-1] |>
      purrr::reduce(split_when_needed, .init = list(split_vector[1]), pattern, max_length_out) |>
      unlist()

    # Fix the case where the sentence does not exceed max_length_out
    last_two <- paste0(utils::tail(split_vector, n = 2), collapse = if (identical(pattern, " ")) " " else "")
    if (nchar(last_two) <= max_length_out) {
      split_vector <- c(
        utils::head(split_vector, n = -1),
        last_two
      )
    }
    return(as.list(split_vector))
  }

  #FUNCTION CALL
  outt <- map(char_200, ~ {
    split_list <- map(.x, ~ {
      cv <- as.data.frame(split_var(.x, max_length_out))
      names(cv) <- seq_along(cv)
      cv
    })
    split_df <- bind_rows(split_list)
    split_df
  }) %>% imap(.,~set_names(.x,.y)) %>% bind_cols()
Member

Can we consider using tidyr here?
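
One possible shape for this, as a sketch only: if each long value is first split into a list column of pieces, `tidyr::unnest_wider()` can spread the pieces into numbered columns and pad shorter rows with `NA`. The data frame and the column name `txt` below are made up for illustration; the split step itself is assumed to have already run.

```r
# Hypothetical input: one list column holding the pieces of each value
df <- data.frame(id = 1:2)
df$txt <- list(c("part one", "part two"), "only part")

# Spread the pieces into txt_1, txt_2, ...; missing pieces become NA
wide <- tidyr::unnest_wider(df, txt, names_sep = "_")
```

This would replace the nested `map()` / `bind_rows()` / `bind_cols()` chain and the manual column renaming with a single call per column.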


  names(outt) = sub("....$","",names(outt)) %>% make.unique(., sep = "_")
  dataset_OUT = bind_cols(domain_dataset %>% select(-names(char_200)),outt)
  return(dataset_OUT)
}