
Don't need map() for summarizing models #33

Open
Aariq opened this issue Apr 7, 2022 · 17 comments

Aariq commented Apr 7, 2022

This works fine:

mtcars %>% 
  group_by(cyl) %>% 
  summarize(r.sq = summary(lm(mpg ~ wt))$r.squared)

The problem here isn't having to learn new paradigms just to do this; it's that you can't easily save intermediate steps, because summarize() wants the right-hand side to be a vector, not a model object.

For example, the following code errors:

mtcars %>% 
  group_by(cyl) %>% 
  summarize(m = lm(mpg ~ wt))

And to get it to work, you have to start dealing with list-columns, which is a whole thing:

#this works, but makes a data frame with a list-column
mtcars %>% 
  group_by(cyl) %>% 
  summarize(m = list(lm(mpg ~ wt)))
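
A minimal sketch of what working with that list-column looks like afterwards (my own follow-up, not part of the original comment): the fitted models are recovered with ordinary list indexing.

library(dplyr)

fits <- mtcars %>%
  group_by(cyl) %>%
  summarize(m = list(lm(mpg ~ wt)))

fits$m[[1]]                      # the lm object for the first cyl group
summary(fits$m[[1]])$r.squared   # e.g., recover that group's R-squared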

sda030 commented Aug 6, 2022

This is, to me, beautiful, simple, and logical: split the dataset by cyl; for each part, fit the model and extract the model fit; and finally bind it all together, preserving the name of the splitting variable in the output. Oh, and all the output one could need.

library(dplyr)
library(broom)
mtcars %>% 
    group_by(cyl) %>%
    group_map(.f=~lm(mpg ~ wt, data=.x) %>% glance()) %>%
    bind_rows(.id = "cyl")
#> # A tibble: 3 × 13
#>   cyl   r.squared adj.r…¹ sigma stati…² p.value    df logLik   AIC   BIC devia…³
#>   <chr>     <dbl>   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
#> 1 1         0.509   0.454  3.33    9.32  0.0137     1 -27.7   61.5  62.7   99.9 
#> 2 2         0.465   0.357  1.17    4.34  0.0918     1  -9.83  25.7  25.5    6.79
#> 3 3         0.423   0.375  2.02    8.80  0.0118     1 -28.7   63.3  65.2   49.2 
#> # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#> #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

Created on 2022-08-06 by the reprex package (v2.0.1)

matloff (Owner) commented Mar 11, 2023

A point I make in the essay is that intermediate steps are GOOD for beginning coders, the group my essay focuses on.

@andresnecochea

I'm a tidyverse advocate. But I have to concede that by() is far easier for grouped statistical tests.

by(
  mtcars,
  mtcars$cyl,
  \(x) summary(lm(mpg ~ wt, data = x))
)

And using an S3 class to make a list with customized print methods (like htest or lm) makes much more sense to me.

If you want the results in a data.frame (tidy style), it can be done, but why?

Anyway, here is the code:

by(
  mtcars,
  mtcars$cyl,
  \(x) {
    res <- summary(lm(mpg ~ wt, data = x))
    c(
      res[c("r.squared", "adj.r.squared")],
      "fstatistic"=res$fstatistic[1],
      "p-value"=res$coefficients[2, "Pr(>|t|)"]
    )
  }
) |> array2DF()

# output 
#   mtcars$cyl r.squared adj.r.squared fstatistic.value    p-value
# 1          4 0.5086326     0.4540362         9.316233 0.01374278
# 2          6 0.4645102     0.3574122         4.337245 0.09175766
# 3          8 0.4229655     0.3748793         8.795985 0.01179281

@dusadrian

The \(x) part is a nice trick I did not know. But I still find this more intuitive and a whole lot easier to teach beginners:

admisc::using(
    warpbreaks,
    coef(lm(breaks ~ wool)),
    split.by = tension
)

#   (Intercept)    woolB   
# L    44.556     -16.333  
# M    24.000       4.778  
# H    24.556      -5.778

matloff (Owner) commented Feb 24, 2025

People keep forgetting the central point of my Tidyverse Skeptic essay: The Tidyverse is an awful environment for R learners who lack coding background. This discussion here, which debates whether one complex Tidyverse solution is better than another, ignores that basic fact.

Aariq (Author) commented Feb 24, 2025

Right, but the evidence you point to in support of your claim in this section of the README is (purposefully?) misleading. Shouldn't you want to correct that? Your "evidence" here is specifically that, in order to use the tidyverse to solve this problem, you must use 3 "different"¹ map() functions.

The R learner here must learn two different FP map functions for this
particular example. This is an excellent example of Tidy's cognitive
overload problem.

That is incorrect. You need not use any. In fact, it is not recommended that you use any.

Footnotes

  1. This is also misleading in two ways. First, map() and map_dbl() have the exact same interface and only differ in what they return, so there is very little to learn here, by design. Second, even if you did have to learn map(), you don't need map_dbl() there! Just use map() and get your output as a list!
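
For what it's worth, a minimal sketch of the footnote's point (the split()-based pipeline below is an assumed stand-in for the README example, not a quote from it): the two calls are written identically and differ only in the container they return.

library(dplyr)
library(purrr)

models <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x))

map(models, ~ summary(.x)$r.squared)      # a list of R-squared values
map_dbl(models, ~ summary(.x)$r.squared)  # same call shape, a named numeric vector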

matloff (Owner) commented Feb 24, 2025 via email

@dusadrian

People keep forgetting the central point of my Tidyverse Skeptic essay: The Tidyverse is an awful environment for R learners who lack coding background. This discussion here, which debates whether one complex Tidyverse solution is better than another, ignores that basic fact.

Not sure I understand: neither the by() function nor the split.by argument of the using() function from package admisc is part of the Tidyverse. They are traditional R-based solutions to common problems that people (including beginners) search for solutions to.

matloff (Owner) commented Feb 25, 2025

My own view -- different people have different views -- is that in teaching R learners who lack prior coding background, one should keep it as simple as possible. Abstractions that are second nature to experienced coders are not easy for such learners to understand, let alone use. So to me, just because a construct is part of base-R does not mean it is appropriate for these learners.

@andresnecochea

If the problem is that some FP approaches are too complex for beginners because overly complicated abstractions demand that you bend your mind into a tesseract, then yes, I totally agree: overly complex abstractions (like the ones in the purrr package) should be kept away from beginners.

If the problem is simply that FP is abstract, then by stretching that argument we could say that everything is abstract.

Let me offer another argument. Take the Chile dataset in the carData package. Assuming the survey is representative at the national and regional level, you now need to perform a t-test for the difference in mean income between men and women, both for the whole country and for every region. How do you do this in Stata?

For the whole country:

ttest income, by(sex)

For every region:

bysort region: ttest income, by(sex)

This is a very basic task that you would learn in any introductory Stata course (and likewise in SPSS and every other statistical package).

How do you do the same task in R?

For the whole country it is pretty straightforward:

t.test(income ~ sex, data = Chile)

But now we need to repeat the same thing for every region. How should you teach this to a beginner?

a) by() approach:

by(Chile, Chile$region, \(x) t.test(income ~ sex, data =  x))

b) for() loop approach:

Chile_split <- split(Chile, Chile$region)
for(Region in Chile_split) {
  print(t.test(income ~ sex, data = Region))
}

c) purrr::map() and broom::glance() approach:

library(tidyverse)
Chile %>% 
  group_by(region) %>%
  nest() %>% 
  mutate(t_test = map(data, ~t.test(income ~ sex, data = .x) |> broom::glance())) %>% 
  unnest(t_test)

I totally agree that c) is a mind-bending mess for any beginner. I would never recommend teaching this to a non-coder.

But, feel free to disagree, I think a) is far easier than b) for a non-coder. Option a) is almost the same as the Stata code. The most complex part is the use of \(x) for anonymous functions, and yes, a lot of beginners get confused about why the data = Chile argument changes to data = x, but I still think that a non-coder can manage this.

I'm a sociologist and a non-coder myself, and I have taught a couple of sociologists, so I can say from experience that a social scientist who is new to R and not a coder can manage the level of complexity of option a).
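
A minimal sketch of that last point (reusing the carData::Chile example above, with a hypothetical helper name): \(x) is just shorthand, available since R 4.1.0, for function(x), and by() calls that function once per region, passing that region's rows in as x, which is why data = Chile becomes data = x.

library(carData)  # provides the Chile dataset

# spelled out with a named function
t_by_region <- function(x) t.test(income ~ sex, data = x)
by(Chile, Chile$region, t_by_region)

# identical, written inline with the \(x) shorthand
by(Chile, Chile$region, \(x) t.test(income ~ sex, data = x))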

matloff (Owner) commented Feb 28, 2025

Someone once said, "Programming is the management and design of abstraction." I totally agree. But of course there are various levels of abstraction.

You may not find your example (b) to be aesthetically pleasing, but I submit that it involves the least amount of abstraction.

@andresnecochea

Maybe you are right. Maybe it is a mistake to try to teach R the same way as Stata or SPSS, and maybe it is better to teach option b) first and option a) later as a time-saving trick. I think that has some benefits, like introducing the concept of loops.

A better approach could be to teach loops first, then named functions, then anonymous functions with the full function(x) syntax, and finally that you can use \(x) to abbreviate an anonymous function, incrementally building up the level of abstraction.

But I still think that anonymous functions, pipes, and dplyr verbs are a kind of necessary evil. My life is a lot easier with them: I can code faster, and I can understand the code when I read it several months later. And I can spend more time doing what a sociologist is supposed to do, analysing data through a sociological frame of reference.

matloff (Owner) commented Feb 28, 2025

As far as I know, Stata and SPSS are not programming languages, so a reasonable comparison is not possible. Instead, they are very much like the way dplyr is taught to beginners -- do a few simple use cases, then do tons of examples using those use cases.

For experienced coders, FP solutions can be more compact and clearer than a loop.

@dusadrian

I still find this line to have the absolute least amount of abstraction:

using(Chile, t.test(income ~ sex), split.by = region)

It is clear, it is straightforward, and easy to use.

sda030 commented Mar 1, 2025 via email

@dusadrian

... Why learn == equivalence and is.na when testing set
membership (albeit slightly slower) that handles missingness without
shooting oneself in the foot. ...

You don't need to. The fact that a language offers dozens of possibilities is a sign of flexibility, not a teaching weakness.
When teaching set membership, I use only one function, is.element(), which handles everything.
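
A small base-R illustration of that contrast (toy vector of my own, not from the thread):

x <- c("a", NA, "b")

x == "a"            # TRUE    NA FALSE  -- the NA propagates
is.element(x, "a")  # TRUE FALSE FALSE  -- NA is simply not a member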

Non-programmers dislike learning more than necessary or multiple ways to
solve a problem. That's why I often end up jumping to tidyverse.

If having a single, simple way to solve things is the issue, I suggest going back to SPSS. But I am totally against locking beginners into a single mindset/framework that would give the false impression that it is the (only) "standard".

matloff (Owner) commented Mar 1, 2025

If one is looking at the long term, in which former beginners now tackle more complex settings, sometimes there is no good way to avoid loops. So for those who think beginners should be equipped with "advanced" tools, one such tool is loops. The tidyverse people, on the other hand, tell learners that loops are Bad Things.

My own view is that we should aim to quickly bring beginners up to a level where they can handle real problems. I believe that in general, FP slows down this process, even though one can point to special cases in which FP might be clearer. It may be a good idea to bring in FP a little bit at a time.
