
Don't need map() for summarizing models #33

Open
Aariq opened this issue Apr 7, 2022 · 17 comments

Aariq commented Apr 7, 2022

This works fine:

mtcars %>% 
  group_by(cyl) %>% 
  summarize(r.sq = summary(lm(mpg ~ wt))$r.squared)

The problem here isn't having to learn new paradigms just to do this; it's that you can't easily save intermediate steps, because summarize() wants the right-hand side to be a vector, not a model object.

For example, the following code errors:

mtcars %>% 
  group_by(cyl) %>% 
  summarize(m = lm(mpg ~ wt))

And to get it to work, you have to start dealing with list-columns, which is a whole thing:

#this works, but makes a data frame with a list-column
mtcars %>% 
  group_by(cyl) %>% 
  summarize(m = list(lm(mpg ~ wt)))
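
A minimal sketch of what working with that list-column looks like afterwards (my own follow-up, not part of the original comment): the fitted models are recovered with ordinary list indexing.

library(dplyr)

fits <- mtcars %>%
  group_by(cyl) %>%
  summarize(m = list(lm(mpg ~ wt)))

fits$m[[1]]                      # the lm object for the first cyl group
summary(fits$m[[1]])$r.squared   # e.g., recover that group's R-squared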

sda030 commented Aug 6, 2022

This is, to me, beautiful, simple, and logical: split the dataset by cyl; for each part, fit the model and extract the model fit; and finally bind it all together, preserving the name of the splitting variable in the output. Oh, and all the output one could need.

library(dplyr)
library(broom)
mtcars %>% 
    group_by(cyl) %>%
    group_map(.f=~lm(mpg ~ wt, data=.x) %>% glance()) %>%
    bind_rows(.id = "cyl")
#> # A tibble: 3 × 13
#>   cyl   r.squared adj.r…¹ sigma stati…² p.value    df logLik   AIC   BIC devia…³
#>   <chr>     <dbl>   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
#> 1 1         0.509   0.454  3.33    9.32  0.0137     1 -27.7   61.5  62.7   99.9 
#> 2 2         0.465   0.357  1.17    4.34  0.0918     1  -9.83  25.7  25.5    6.79
#> 3 3         0.423   0.375  2.02    8.80  0.0118     1 -28.7   63.3  65.2   49.2 
#> # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#> #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

Created on 2022-08-06 by the reprex package (v2.0.1)

matloff (Owner) commented Mar 11, 2023

A point I make in the essay is that intermediate steps are GOOD for beginning coders, the group my essay focuses on.

@andresnecochea

I'm a tidyverse advocate. But I have to concede that by() is far easier for grouped statistical tests.

by(
  mtcars,
  mtcars$cyl,
  \(x) summary(lm(mpg ~ wt, data = x))
)

And using an S3 class to make a list with customized print methods (like htest or lm) makes much more sense to me.

If you want the results in a data.frame (tidy style), it can be done, but why?

Anyway, here is the code:

by(
  mtcars,
  mtcars$cyl,
  \(x) {
    res <- summary(lm(mpg ~ wt, data = x))
    c(
      res[c("r.squared", "adj.r.squared")],
      "fstatistic"=res$fstatistic[1],
      "p-value"=res$coefficients[2, "Pr(>|t|)"]
    )
  }
) |> array2DF()

# output 
#   mtcars$cyl r.squared adj.r.squared fstatistic.value    p-value
# 1          4 0.5086326     0.4540362         9.316233 0.01374278
# 2          6 0.4645102     0.3574122         4.337245 0.09175766
# 3          8 0.4229655     0.3748793         8.795985 0.01179281

@dusadrian

The \(x) part is a nice trick I did not know. But I still find this more intuitive and a whole lot easier to teach beginners:

admisc::using(
    warpbreaks,
    coef(lm(breaks ~ wool)),
    split.by = tension
)

#   (Intercept)    woolB   
# L    44.556     -16.333  
# M    24.000       4.778  
# H    24.556      -5.778

matloff (Owner) commented Feb 24, 2025

People keep forgetting the central point of my Tidyverse Skeptic essay: The Tidyverse is an awful environment for R learners who lack coding background. This discussion here, which debates whether one complex Tidyverse solution is better than another, ignores that basic fact.

Aariq (Author) commented Feb 24, 2025

Right, but the evidence you point to in support of your claim in this section of the README is (purposefully?) misleading. Shouldn't you want to correct that? Your "evidence" here is specifically that, in order to use the tidyverse to solve this problem, you must use 3 "different"¹ map() functions.

The R learner here must learn two different FP map functions for this
particular example. This is an excellent example of Tidy's cognitive
overload problem.

That is incorrect. You need not use any. In fact, it is not recommended that you use any.

Footnotes

  1. This is also misleading in two ways. First, map() and map_dbl() have the exact same interface and only differ in what they return, so there is very little to learn here, by design. Second, even if you did have to learn map(), you don't need map_dbl() there! Just use map() and get your output as a list!
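
For what it's worth, a minimal sketch of the footnote's point (the split()-based pipeline below is an assumed stand-in for the README example, not a quote from it): the two calls are written identically and differ only in the container they return.

library(dplyr)
library(purrr)

models <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x))

map(models, ~ summary(.x)$r.squared)      # a list of R-squared values
map_dbl(models, ~ summary(.x)$r.squared)  # same call shape, a named numeric vector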

matloff (Owner) commented Feb 24, 2025 via email

@dusadrian

People keep forgetting the central point of my Tidyverse Skeptic essay: The Tidyverse is an awful environment for R learners who lack coding background. This discussion here, which debates whether one complex Tidyverse solution is better than another, ignores that basic fact.

Not sure I understand: neither the by() function nor the split.by argument of the using() function from package admisc is part of the Tidyverse. They are traditional R-based solutions to common problems that people (including beginners) search for solutions to.

matloff (Owner) commented Feb 25, 2025

My own view -- different people have different views -- is that in teaching R learners who lack prior coding background, one should keep it as simple as possible. Abstractions that are second nature to experienced coders are not easy for such learners to understand, let alone use. So to me, just because a construct is part of base-R does not mean it is appropriate for these learners.

@andresnecochea

If the problem is that some FP approaches are too complex for beginners because overly complicated abstractions demand that you bend your mind into a tesseract, then yes, I totally agree: overly complex abstractions (like the ones in the purrr package) should be kept away from beginners.

If the problem is simply that FP is abstract, then by stretching that argument we could say that everything is abstract.

Let me offer another argument. Take the Chile dataset in the carData package. Assuming the survey is representative at the national and regional level, you now need to perform a t-test for the difference in mean income between men and women, both for the whole country and for every region. How do you do this in Stata?

For the whole country:

ttest income, by(sex)

For every region:

bysort region: ttest income, by(sex)

This is a very basic task that you would learn in any introductory Stata course (and likewise in SPSS and every other statistical package).

How do you do the same task in R?

For the whole country it is pretty straightforward:

t.test(income ~ sex, data = Chile)

But now we need to repeat the same thing for every region. How should you teach this to a beginner?

a) by() approach:

by(Chile, Chile$region, \(x) t.test(income ~ sex, data =  x))

b) for() loop approach:

Chile_split <- split(Chile, Chile$region)
for(Region in Chile_split) {
  print(t.test(income ~ sex, data = Region))
}

c) purrr::map() and broom::glance() approach:

library(tidyverse)
Chile %>% 
  group_by(region) %>%
  nest() %>% 
  mutate(t_test = map(data, ~t.test(income ~ sex, data = .x) |> broom::glance())) %>% 
  unnest(t_test)

I totally agree that c) is a mind-bending mess for any beginner. I would never recommend teaching this to a non-coder.

But, feel free to disagree, I think a) is far easier than b) for a non-coder. Option a) is almost the same as the Stata code. The most complex part is the use of \(x) for anonymous functions, and yes, a lot of beginners get confused about why the data = Chile argument changes to data = x, but I still think that a non-coder can manage this.

I'm a sociologist and a non-coder myself, and I have taught a couple of sociologists, so I can say from experience that a social scientist who is new to R and not a coder can manage the level of complexity of option a).
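
A minimal sketch of that last point (reusing the carData::Chile example above, with a hypothetical helper name): \(x) is just shorthand, available since R 4.1.0, for function(x), and by() calls that function once per region, passing that region's rows in as x, which is why data = Chile becomes data = x.

library(carData)  # provides the Chile dataset

# spelled out with a named function
t_by_region <- function(x) t.test(income ~ sex, data = x)
by(Chile, Chile$region, t_by_region)

# identical, written inline with the \(x) shorthand
by(Chile, Chile$region, \(x) t.test(income ~ sex, data = x))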

matloff (Owner) commented Feb 28, 2025

Someone once said, "Programming is the management and design of abstraction." I totally agree. But of course there are various levels of abstraction.

You may not find your example (b) to be aesthetically pleasing, but I submit that it involves the least amount of abstraction.

@andresnecochea

Maybe you are right. Maybe it is a mistake to try to teach R the same way as Stata or SPSS, and maybe it is better to teach option b) first and option a) later as a time-saving trick. I think that has some benefits, like introducing the concept of loops.

A better approach could be to teach loops first, then named functions, then anonymous functions with the full function(x) syntax, and finally that you can use \(x) to abbreviate an anonymous function, incrementally building up the level of abstraction.

But I still think that anonymous functions, pipes, and dplyr verbs are a kind of necessary evil. My life is a lot easier with them: I can code faster, and I can understand the code when I read it several months later. And I can spend more time doing what a sociologist is supposed to do, analysing data through a sociological frame of reference.

matloff (Owner) commented Feb 28, 2025

As far as I know, Stata and SPSS are not programming languages, so a reasonable comparison is not possible. Instead, they are very much like the way dplyr is taught to beginners -- do a few simple use cases, then do tons of examples using those use cases.

For experienced coders, FP solutions can be more compact and clearer than a loop.

@dusadrian

I still find this line to have the absolute least amount of abstraction:

using(Chile, t.test(income ~ sex), split.by = region)

It is clear, it is straightforward, and easy to use.

sda030 commented Mar 1, 2025 via email

@dusadrian

... Why learn == equivalence and is.na when testing set
membership (albeit slightly slower) that handles missingness without
shooting oneself in the foot. ...

You don't need to. The fact that a language offers dozens of possibilities is a sign of flexibility, not a teaching weakness.
When teaching set membership, I use only one function, is.element(), which handles everything.
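
A small base-R illustration of that contrast (toy vector of my own, not from the thread):

x <- c("a", NA, "b")

x == "a"            # TRUE    NA FALSE  -- the NA propagates
is.element(x, "a")  # TRUE FALSE FALSE  -- NA is simply not a member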

Non-programmers dislike learning more than necessary or multiple ways to
solve a problem. That's why I often end up jumping to tidyverse.

If having a single, simple way to solve things is the issue, I suggest going back to SPSS. But I am totally against locking beginners into a single mindset/framework that would give the false impression that it is the (only) "standard".

matloff (Owner) commented Mar 1, 2025

If one is looking at the long term, in which former beginners now tackle more complex settings, sometimes there is no good way to avoid loops. So for those who think beginners should be equipped with "advanced" tools, one such tool is loops. The tidyverse people, on the other hand, tell learners that loops are Bad Things.

My own view is that we should aim to quickly bring beginners up to a level where they can handle real problems. I believe that in general, FP slows down this process, even though one can point to special cases in which FP might be clearer. It may be a good idea to bring in FP a little bit at a time.
