---
title: "Response to Pachter’s Review"
author: "Joshua Paik and Igor Rivin"
date: '1/22/2020, Last Update: 1/28/2020 6:00 P.M. G.M.T.'
output:
pdf_document:
keep_tex: yes
toc: yes
html_document:
toc: yes
---
>*"The value of preprints is in their ability to accelerate research via the rapid dissemination of methods and discoveries."*
> - [Lior Pachter](https://liorpachter.wordpress.com/2019/10/21/zero-data-rna-seq/)
```{r, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(fig.width=6, fig.height=4)
```
```{r, echo = FALSE, warning = FALSE, include = FALSE}
library(readr)
library(dplyr)
library(ggplot2)
library(MASS)
library(naniar)
#cleaning
df1 <- read_csv("./withAMS.csv")
df1$age <- (2020- df1$year)
df1$citperyear <- df1$citations/df1$age
df1$amscitperyear <- df1$amscit/df1$age
df1$lettergroup <- as.factor(df1$lettergroup)
df1$gender <- as.factor(df1$gender)
df1$role <- as.factor(df1$role)
df1$highered <- as.factor(df1$highered)
df1$institution <- as.factor(df1$institution)
df1$research <- as.factor(df1$research)
df1$country <- as.factor(df1$country)
df1$security <- as.factor(df1$security)
df1$field <- as.factor(df1$field)
df1$simplefield <- as.factor(df1$simplefield)
```
## Introduction
Lior Pachter published a [review](https://liorpachter.wordpress.com/2020/01/17/diversity-matters/) of our [paper](https://arxiv.org/pdf/2001.00670.pdf) claiming that age was the greatest contributor to the observed difference in citations between signers of Letters A, B, and C. Any characterization of our paper which says we do not account for age is *false*. In this response, we will clarify a few points in our paper, repeat the relevant analyses, and show that citations per year is an age-agnostic metric for comparing mathematicians. We will also go to some length to address potential objections to this new analysis and show that they are categorically false. **To clarify, when comparing mean and median citations per year amongst R1 Math Professors, A < B < C.** This result still stands, as does the rest of our analysis, after some revision to our data. Finally, we will clearly demonstrate Professor Pachter's mistake, namely the parameter tuning (hypertuning) of his arbitrary age cutoff to achieve a result which supports his incorrect interpretation.
We appreciate Professor Pachter's review - it gives us a chance to make our analysis stronger. We would note that a revision was already in the works and that in a normal review process we would have had three months to respond. However, as our character and ability as scientists were attacked, we thought it was appropriate to reply as quickly as possible.
## Corrections and Clarifications
We would like to thank Pachter for finding the bug in our appendix which pushes the mean Google Scholar citations of B further away from A. We agree that the sentence - “while this is not optimal, a quick sample size calculation shows that one needs 303 samples or 21% of the data to produce statistics at a 95% confidence level and a 5% confidence interval.” - is ridiculous.
We should explain exactly how the data collection took place. We used the [scholarly api](https://pypi.org/project/scholarly/) to initially collect our Google Scholar citations data. However, the scraper could not reliably disambiguate signers with generic names, and many older mathematicians (like Cheeger or Gromov) do not have Google Scholar profiles at all. To assure data quality, we manually checked the Google Scholar citations of every single letter signer, comparing publications when necessary.
However, the empirical difference in citations was staggering, and we could predict an objection: more professors from R2 (more teaching-focused) universities signed A, which could have pushed the average down. We had already spent considerable effort collecting Google Scholar citations, so we chose to collect MathSciNet data only on R1 Math Professors, which is why Professor Pachter does not have MathSciNet citations in our dataset. This choice was not stated explicitly enough in our first version. Let us look at the NaN values of those who are full math professors at R1 universities.
```{r, echo = FALSE}
restrictedData1 <- filter(df1, (institution=="domesticr1"&role=="professor"&field=="math"))
vis_miss(restrictedData1)
```
One sees that 17.34% of the MathSciNet citations data is missing. It appears there was some sort of systematic but unintentional error in the data collection from MathSciNet. We report 3 NaNs on A and B, 3 on A only, 0 on B and C, 50 on B only, and 0 on C only. We manually checked the missing data and found that all but [Marta Civil](https://www.math.arizona.edu/~civil/) and [Jeffrey X Watt](https://science.iupui.edu/people/watt-jeffrey) (who are math educators) have MathSciNet entries. The remaining omissions are fixed and we visually check for NaNs again.
```{r, warning=FALSE, echo = FALSE, include = FALSE}
table(restrictedData1$lettergroup[is.na(restrictedData1$amscit)])
```
```{r, echo = FALSE, include = FALSE}
df <- read_csv("./revised_data.csv")
df$age <- (2020- df$year)
df$citperyear <- df$citations/df$age
df$amscitperyear <- df$amscit/df$age
df$lettergroup <- as.factor(df$lettergroup)
df$gender <- as.factor(df$gender)
df$role <- as.factor(df$role)
df$highered <- as.factor(df$highered)
df$institution <- as.factor(df$institution)
df$research <- as.factor(df$research)
df$country <- as.factor(df$country)
df$security <- as.factor(df$security)
df$field <- as.factor(df$field)
df$simplefield <- as.factor(df$simplefield)
```
```{r, echo = FALSE}
restrictedData <- filter(df, (institution=="domesticr1"&role=="professor"&field=="math"))
vis_miss(restrictedData)
```
```{r, echo = FALSE, include = FALSE}
sum(is.na(restrictedData$amscit))
sum(is.na(restrictedData$amscitperyear))
nrow(restrictedData)
```
65 of 323 entries are missing AMS citations per year. While the NaNs visually appear uniform, we will impose a stricter significance level, say 2%, when assessing the difference in AMS citations per year.
Now that we are comparing apples to apples, we repeat the main analyses.
## The Main Result of Paik-Rivin: Citations and Citations per Year for R1 Math Professors
We will compare the mean number of citations and of citations per year (citations divided by the number of years elapsed since completion of the PhD) between signers of Letters A, B, and C. We will validate the significance of the difference between signers using a [permutation test](https://en.wikipedia.org/wiki/Resampling_(statistics)#Permutation_tests).
A permutation test is a non-parametric means of assessing the significance of an observed difference between two populations. Throughout this section, we will be comparing the mean citations of two populations, X and Y. We work under the null hypothesis $H_0: \mu(X) = \mu(Y)$, with the alternative $H_1: \mu(X) < \mu(Y)$.
A permutation test works as follows. Let $X$ and $Y$ be our relevant populations, of sizes $n_X$ and $n_Y$. We would like to know whether the observed difference in means could plausibly be due to chance. We record the observed difference in means as $\delta = \mu(X) - \mu(Y)$. We then take the union of our two populations, $Z = X \cup Y$, and randomly partition $Z$ into two new sets $A$ and $B$ with $|A| = n_X$ and $|B| = n_Y$. We store $\mu(A) - \mu(B)$, and by repeating the process $n = 10{,}000$ times we induce a distribution $D$ of potential differences. The induced p-value, the probability that a difference at least as extreme as the observed one arises by chance, is $p = |\{d \in D : d \leq \delta\}|/n$.
```{r, echo = FALSE}
# input data and the number of permutations
meanPermutation <- function(Data, n,n_a){
output <- matrix(NA, ncol = 1, nrow = n)
p<-n_a/length(Data)
for(i in 1:n){
#sample p% of the data
X_index <- sample(1:length(Data), floor(p * length(Data)))
Y_index <- setdiff(1:length(Data), X_index)
X <- Data[X_index]
Y <- Data[Y_index]
#calculate the difference
diff <- mean(X)-mean(Y)
#store
output[i, ] <- diff
}
return(output)
}
```
```{r, echo = FALSE}
# input data and the number of permutations
medianPermutation <- function(Data, n,n_a){
output <- matrix(NA, ncol = 1, nrow = n)
p<-n_a/length(Data)
for(i in 1:n){
#sample p% of the data
X_index <- sample(1:length(Data), floor(p * length(Data)))
Y_index <- setdiff(1:length(Data), X_index)
X <- Data[X_index]
Y <- Data[Y_index]
#calculate the difference
diff <- median(X)-median(Y)
#store
output[i, ] <- diff
}
return(output)
}
```
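For readers following the source, the helpers above are used roughly as follows (shown with `meanPermutation`). This is a minimal, non-evaluated sketch: `groupX` and `groupY` are placeholder citation vectors standing in for whichever two letter groups are being compared, not columns of our dataset.
```{r, eval=FALSE}
# Minimal usage sketch (not run); groupX and groupY are placeholder vectors.
delta <- mean(groupX) - mean(groupY)              # observed difference in means
perm  <- meanPermutation(c(groupX, groupY),       # permutation distribution under the null
                         10000, n_a = length(groupX))
p_value <- sum(perm <= delta) / length(perm)      # one-sided p-value
```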
### MathSciNet Citations for R1 Math Professors
```{r, warning=FALSE, echo = FALSE}
plot4 <- ggplot(df, aes(x=amscit))+
geom_histogram(data=filter(df, ((lettergroup == "A Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&(role=="professor"))),fill = "red", alpha = 0.2, binwidth = 1000) +
geom_histogram(data=filter(df, ((lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&(role=="professor")&institution=="domesticr1"&field=="math")),fill = "blue", alpha = 0.2, binwidth = 1000) +
geom_histogram(data=filter(df,( (lettergroup == "C Only"|lettergroup == "B and C")&institution=="domesticr1"&field=="math"&(role=="professor"))),fill = "green", alpha = 0.2, binwidth = 1000) +
geom_vline(xintercept = mean(filter(df, ((lettergroup == "A Only"|lettergroup == "A and B")&(role=="professor")&institution=="domesticr1"&field=="math"))$amscit, na.rm = TRUE), linetype="dotted", color = "red", size=1.5) +
geom_vline(xintercept = mean(filter(df,( (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&(role=="professor")&institution=="domesticr1"&field=="math"))$amscit, na.rm = TRUE), linetype="dotted", color = "blue", size=1.5) +
geom_vline(xintercept = mean(filter(df, ((lettergroup == "C Only"|lettergroup == "B and C")&(role=="professor")&institution=="domesticr1"&field=="math"))$amscit, na.rm = TRUE), linetype="dotted", color = "green", size=1.5) +
ggtitle("Math Sci Net Citations Comparison Professors (A=red), (B=blue), (C=green)")
plot4
```
```{r, echo = FALSE,include=FALSE}
mean(filter(df, ((lettergroup == "A Only"|lettergroup == "A and B")&institution=="domesticr1"&(role=="professor")&(field=="math")))$amscit, na.rm = TRUE)
mean(filter(df, ((lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&institution=="domesticr1"&(role=="professor")&(field=="math")))$amscit, na.rm = TRUE)
mean(filter(df, ((lettergroup == "C Only"|lettergroup == "B and C")&institution=="domesticr1"&(role=="professor")&(field=="math")))$amscit, na.rm = TRUE)
median(filter(df, ((lettergroup == "A Only"|lettergroup == "A and B")&(role=="professor")&institution=="domesticr1"&(field=="math")))$amscit, na.rm = TRUE)
median(filter(df, ((lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&(role=="professor")&institution=="domesticr1"&(field=="math")))$amscit, na.rm = TRUE)
median(filter(df, ((lettergroup == "C Only"|lettergroup == "B and C")&(role=="professor")&institution=="domesticr1"&(field=="math")))$amscit, na.rm = TRUE)
```
The mean number of citations for signers of letter A is 397 and the median is 261.
The mean number of citations for signers of letter B is 1435 and the median is 915.
The mean number of citations for signers of letter C is 2177 and the median is 1353.
The three hypotheses we would like to assess are:
1. $H_0: \mu(A) = \mu(B), H_1: \mu(A) < \mu(B)$
2. $H_0: \mu(B) = \mu(C), H_1: \mu(B) < \mu(C)$
3. $H_0: \mu(A) = \mu(C), H_1: \mu(A) < \mu(C)$
```{r, include = FALSE, echo = FALSE}
muA <- mean(filter(df, ((lettergroup == "A Only"|lettergroup == "A and B")&institution=="domesticr1"&(field=="math")&(role=="professor")))$amscit, na.rm = TRUE)
muB <- mean(filter(df, ((lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&institution=="domesticr1"&(field=="math")&(role=="professor")))$amscit, na.rm = TRUE)
muC <- mean(filter(df, ((lettergroup == "C Only"|lettergroup == "B and C")&institution=="domesticr1"&(field=="math")&(role=="professor")))$amscit, na.rm = TRUE)
val1 <- muA - muB
val2 <- muB - muC
val3 <- muA - muC
set.seed(0)
pop <- filter(df, (institution=="domesticr1"&field=="math"&role=="professor"))
n_a <- length(na.omit(filter(df, institution=="domesticr1"&field=="math"&(lettergroup == "A Only"|lettergroup == "A and B")&(role=="professor"))$amscit))
n_b <- length(na.omit(filter(df, institution=="domesticr1"&field=="math"&(lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&(role=="professor"))$amscit))
n_c <- length(na.omit(filter(df, institution=="domesticr1"&field=="math"&(lettergroup == "C Only"|lettergroup == "B and C")&(role=="professor"))$amscit))
# build the permutation distributions within the R1 math professor population (pop)
dist1 <- meanPermutation(na.omit(filter(pop,(lettergroup == "A Only"|lettergroup == "A and B"|lettergroup == "B and C"|lettergroup == "B Only"))$amscit),10000,n_a)
dist2 <- meanPermutation(na.omit(filter(pop,lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B"|lettergroup == "C Only")$amscit),10000,n_b)
dist3 <- meanPermutation(na.omit(filter(pop,lettergroup == "A Only"|lettergroup == "A and B"|lettergroup == "C Only"|lettergroup == "B and C")$amscit),10000,n_c)
sum(dist1 < val1)/length(dist1)
sum(dist2 < val2)/length(dist2)
sum(dist3 < val3)/length(dist3)
```
The induced p-value for hypothesis 1 is 0. The induced p-value for hypothesis 2 is 0.0016. The induced p-value for hypothesis 3 is 0. Hence we reject all three null hypotheses in favor of the alternative and conclude that $\mu(A) < \mu(B) < \mu(C)$.
### MathSciNet Citations per Year for R1 Math Professors
Of course, we considered the fact that citations grow with age, so we calculated citations per year. There may be objections to this - one could hypothesize that citations per year grow with age - but we will soon thoroughly reject this claim.
```{r , warning=FALSE, echo = FALSE}
plot9 <- ggplot(df, aes(x=amscitperyear))+
geom_histogram(data=filter(df, (lettergroup == "A Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&role=="professor"),fill = "red", alpha = 0.2, binwidth = 50) +
geom_histogram(data=filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&role=="professor"),fill = "blue", alpha = 0.2, binwidth = 50) +
geom_histogram(data=filter(df, (lettergroup == "C Only"|lettergroup == "B and C")&institution=="domesticr1"&field=="math"&role=="professor"),fill = "green", alpha = 0.2, binwidth = 50) +
geom_vline(xintercept = mean(filter(df, (lettergroup == "A Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE), linetype="dotted", color = "red", size=1.5) +
geom_vline(xintercept = mean(filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE), linetype="dotted", color = "blue", size=1.5) +
geom_vline(xintercept = mean(filter(df, (lettergroup == "C Only"|lettergroup == "B and C")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE), linetype="dotted", color = "green", size=1.5) +
ggtitle("MathSciNet citperyear Comparison (A=red), (B=blue), (C=green)")
plot9
```
```{r, echo = FALSE,include=FALSE}
mean(filter(df, (lettergroup == "A Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE)
mean(filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE)
mean(filter(df, (lettergroup == "C Only"|lettergroup == "B and C")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE)
median(filter(df, (lettergroup == "A Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE)
median(filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE)
median(filter(df, (lettergroup == "C Only"|lettergroup == "B and C")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE)
```
The mean number of citations per year for signers of letter A is 16 and the median is 11.
The mean number of citations per year for signers of letter B is 42 and the median is 27.
The mean number of citations per year for signers of letter C is 55 and the median is 42.
The three hypotheses we would like to assess are:
1. $H_0: \mu(A_{citperyear}) = \mu(B_{citperyear}), H_1: \mu(A_{citperyear}) < \mu(B_{citperyear})$
2. $H_0: \mu(B_{citperyear}) = \mu(C_{citperyear}), H_1: \mu(B_{citperyear}) < \mu(C_{citperyear})$
3. $H_0: \mu(A_{citperyear}) = \mu(C_{citperyear}), H_1: \mu(A_{citperyear}) < \mu(C_{citperyear})$
```{r, include = FALSE, echo = FALSE}
muA <- mean(filter(df, (lettergroup == "A Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&(role=="professor"))$amscitperyear, na.rm = TRUE)
muB <- mean(filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&(role=="professor"))$amscitperyear, na.rm = TRUE)
muC <- mean(filter(df, (lettergroup == "C Only"|lettergroup == "B and C")&institution=="domesticr1"&field=="math"&(role=="professor"))$amscitperyear, na.rm = TRUE)
val1 <- muA - muB
val2 <- muB - muC
val3 <- muA - muC
set.seed(0)
n_a <- length(na.omit(filter(df, (lettergroup == "A Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&(role=="professor"))$amscitperyear))
n_b <- length(na.omit(filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&(role=="professor"))$amscitperyear))
n_c <- length(na.omit(filter(df, (lettergroup == "C Only"|lettergroup == "B and C")&institution=="domesticr1"&field=="math"&(role=="professor"))$amscitperyear))
pop<- filter(df, ((role=="professor")&field=="math"&institution=="domesticr1"))
dist1 <- meanPermutation(na.omit(filter(pop,(lettergroup == "A Only"|lettergroup == "A and B"|lettergroup == "B and C"|lettergroup == "B Only"))$amscitperyear),10000,n_a)
dist2 <- meanPermutation(na.omit(filter(pop,lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B"|lettergroup == "C Only")$amscitperyear),10000,n_b)
dist3 <- meanPermutation(na.omit(filter(pop,lettergroup == "A Only"|lettergroup == "A and B"|lettergroup == "C Only"|lettergroup == "B and C")$amscitperyear),10000,n_c)
sum(dist1 < val1)/length(dist1)
sum(dist2 < val2)/length(dist2)
sum(dist3 < val3)/length(dist3)
```
The induced p-value for hypothesis 1 is 0. The induced p-value for hypothesis 2 is 0.07788. The induced p-value for hypothesis 3 is 0. Hence we reject the null hypotheses 1 and 3, fail to reject null hypothesis 2 at our 2% significance level, and conclude that $\mu(A_{citperyear}) < \mu(B_{citperyear}) \leq \mu(C_{citperyear})$.
## There is no evidence that Citations per Year grows with age
Let us check whether there is a relationship between age and citations per year in our limited dataset.
```{r, warning = FALSE, echo=FALSE, include=FALSE}
linearmodel1 <- lm(df$amscitperyear ~ df$age, data = df)
summary(linearmodel1)
```
```{r, echo = FALSE, include=FALSE}
par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(linearmodel1)
par(mfrow=c(1,1)) # Change back to 1 x 1
```
```{r, echo = FALSE}
plot(df$age,df$amscitperyear, main = "MSN Citations per Year vs Age Revised Dataset")
abline(linearmodel1, col = "red")
```
```{r, echo=FALSE, include = FALSE}
length(restrictedData$amscitperyear)-sum(is.na(restrictedData$amscitperyear))
```
```{r}
confint(linearmodel1)
```
The slope of the regression line is slightly positive (0.6178, with a 95% confidence interval of (0.115, 1.12)), but the $R^2$ values (adjusted $R^2$ = 0.01662) are tragically low, so there is really no meaningful correlation. One could object that we do not have enough data (n = 258) to conclude that there is no correlation between citations per year and age. We know this, and thought it would be more appropriate to analyze this in a separate paper. However, as noted above, our honor and ability as scientists were attacked, so...
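For completeness, the numbers quoted above can be read off the fitted model directly; this is a short sketch and is not evaluated in this document.
```{r, eval=FALSE}
# Where the quoted quantities come from (sketch, not run):
summary(linearmodel1)$coefficients    # slope estimate for age
summary(linearmodel1)$adj.r.squared   # adjusted R^2
confint(linearmodel1)                 # 95% CI for intercept and slope
nobs(linearmodel1)                    # number of complete observations used
```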
### Presenting citations data on every R1 Math Professor with MathSciNet citations
(plus the Institute for Advanced Study and UC Merced)
```{r, include = FALSE}
allR1 <- read_csv("./FullMathR1Anonymized.csv")
```
```{r, echo = FALSE, include = FALSE}
allR1$Gender <- as.factor(allR1$Gender)
allR1$Role <- as.factor(allR1$Role)
allR1$age <- 2020 - allR1$earliest_pub
allR1<- transform(allR1, citperyear = citations/age)
summary(allR1)
```
We manually collected the citations and year of first publication of every R1 full math professor by consulting [wikipedia](https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States), going to the relevant faculty pages, and then collecting MathSciNet citations. We then anonymized the data. The 2787 professors we collected data on are in line with [data](http://www.ams.org/profession/data/annual-survey/2016dp-tableDF1.pdf?fbclid=IwAR1mgI0qSEs5nCGquqye741_0lZU-ez7dlcJ3wZYhDtJUswhH1SX7yeiiak) collected by the AMS, after taking into account that about half of the universities in the US are classified R2. Great lengths were taken to assess the accuracy of this data, including cross-checking publications, PhD years, etc. Of course errors in data collection, especially manual typing errors, happen, but by no means are these errors systematic.
**Exercise 1: Pick your favorite R1 institution, go to MathSciNet, and check how similar our data is to what you determined.**
**Exercise 2: Determine every university without a female professor. We will note that the University of Colorado - Denver has a very strong female professor, but she does not have MathSciNet citations. There should be at least one surprise (and a few non-surprises).**
Many aspects of this dataset can, should, and will be analyzed. For now, the following will suffice.
### Citations per Year vs Age
We plot citations per year versus age for all R1 math professors, fit a linear regression model, and output a 95% confidence interval for the slope.
```{r, echo=FALSE}
linearmodel2 <- lm(allR1$citperyear ~ allR1$age, data = allR1)
plot(allR1$age, allR1$citperyear, main="Citations per Year vs Age All Math R1")
abline(linearmodel2,col="red")
```
```{r}
confint(linearmodel2)
```
So while visually it appears that there is no correlation between age and citations per year, one may object and say: the slope is positive! This leads to the following question.
**Question: To what power must we raise age so that zero lies within the confidence interval of the slope?**
We object to this question, because it implicitly asks by how much we should discount the accomplishments of those who are older. Nevertheless, we proceed.
```{r, echo = FALSE, include=FALSE}
allR1<- transform(allR1, citperyearPower = citations/age^1.3)
```
```{r, echo = FALSE}
linearmodel3 <- lm(allR1$citperyearPower ~ allR1$age, data = allR1)
plot(allR1$age, allR1$citperyearPower, main="Citations per Year^1.3 vs Age, All Math R1")
abline(linearmodel3,col="red")
```
```{r}
confint(linearmodel3)
```
It seems that raising age to the power 1.3 will do the trick.
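For transparency, an exponent like 1.3 can be found by a simple grid search over powers of age. The following non-evaluated sketch illustrates one way to do this; the search range and step size are arbitrary choices, and the exact exponent depends on the data.
```{r, eval=FALSE}
# Grid search (sketch, not run): find the smallest exponent k for which
# the 95% CI of the slope of citations/age^k ~ age contains zero.
for (k in seq(1.0, 2.0, by = 0.05)) {
  fit <- lm(I(citations / age^k) ~ age, data = allR1)
  ci  <- confint(fit)["age", ]
  if (ci[1] <= 0 && ci[2] >= 0) {
    cat("Zero enters the slope CI at exponent k =", k, "\n")
    break
  }
}
```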
We will repeat the permutation test, now comparing citations per year^1.3.
### Citations per Year adjusting for fitted handicap on age
```{r , warning=FALSE, echo = FALSE}
df$amscitperyearpower <- df$amscit/(df$age^1.3)
plot9 <- ggplot(df, aes(x=amscitperyearpower))+
geom_histogram(data=filter(df, (lettergroup == "A Only"|lettergroup == "A and B")),fill = "red", alpha = 0.2, binwidth = 5) +
geom_histogram(data=filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")),fill = "blue", alpha = 0.2, binwidth = 5) +
geom_histogram(data=filter(df, (lettergroup == "C Only"|lettergroup == "B and C")),fill = "green", alpha = 0.2, binwidth = 5) +
geom_vline(xintercept = mean(filter(df, (lettergroup == "A Only"|lettergroup == "A and B"))$amscitperyearpower, na.rm = TRUE), linetype="dotted", color = "red", size=1.5) +
geom_vline(xintercept = mean(filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B"))$amscitperyearpower, na.rm = TRUE), linetype="dotted", color = "blue", size=1.5) +
geom_vline(xintercept = mean(filter(df, (lettergroup == "C Only"|lettergroup == "B and C"))$amscitperyearpower, na.rm = TRUE), linetype="dotted", color = "green", size=1.5) +
ggtitle("MathSciNet citperyear^1.3 Comparison (A=red), (B=blue), (C=green)")
plot9
```
```{r, echo = FALSE, include = FALSE}
mean(filter(df, (lettergroup == "A Only"|lettergroup == "A and B"))$amscitperyearpower, na.rm = TRUE)
mean(filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B"))$amscitperyearpower, na.rm = TRUE)
mean(filter(df, (lettergroup == "C Only"|lettergroup == "B and C"))$amscitperyearpower, na.rm = TRUE)
median(filter(df, (lettergroup == "A Only"|lettergroup == "A and B"))$amscitperyearpower, na.rm = TRUE)
median(filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B"))$amscitperyearpower, na.rm = TRUE)
median(filter(df, (lettergroup == "C Only"|lettergroup == "B and C"))$amscitperyearpower, na.rm = TRUE)
```
The mean number of citations per year^1.3 for signers of letter A is 6 and the median is 4.
The mean number of citations per year^1.3 for signers of letter B is 15 and the median is 9.
The mean number of citations per year^1.3 for signers of letter C is 19 and the median is 15.
The three hypotheses we would like to assess are:
1. $H_0: \mu(A_{citperyear^{1.3}}) = \mu(B_{citperyear^{1.3}}), H_1: \mu(A_{citperyear^{1.3}}) < \mu(B_{citperyear^{1.3}})$
2. $H_0: \mu(B_{citperyear^{1.3}}) = \mu(C_{citperyear^{1.3}}), H_1: \mu(B_{citperyear^{1.3}}) < \mu(C_{citperyear^{1.3}})$
3. $H_0: \mu(A_{citperyear^{1.3}}) = \mu(C_{citperyear^{1.3}}), H_1: \mu(A_{citperyear^{1.3}}) < \mu(C_{citperyear^{1.3}})$
```{r, include = FALSE, echo = FALSE}
muA <- mean(filter(df, (lettergroup == "A Only"|lettergroup == "A and B"))$amscitperyearpower, na.rm = TRUE)
muB <- mean(filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B"))$amscitperyearpower, na.rm = TRUE)
muC <- mean(filter(df, (lettergroup == "C Only"|lettergroup == "B and C"))$amscitperyearpower, na.rm = TRUE)
val1 <- muA - muB
val2 <- muB - muC
val3 <- muA - muC
set.seed(0)
n_a <- length(na.omit(filter(df, (lettergroup == "A Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&(role=="professor"))$amscitperyearpower))
n_b <- length(na.omit(filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&(role=="professor"))$amscitperyearpower))
n_c <- length(na.omit(filter(df, (lettergroup == "C Only"|lettergroup == "B and C")&institution=="domesticr1"&field=="math"&(role=="professor"))$amscitperyearpower))
pop<- filter(df, ((role=="professor")&field=="math"&institution=="domesticr1"))
dist1 <- meanPermutation(na.omit(filter(pop,(lettergroup == "A Only"|lettergroup == "A and B"|lettergroup == "B and C"|lettergroup == "B Only"))$amscitperyearpower),10000,n_a)
dist2 <- meanPermutation(na.omit(filter(pop,lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B"|lettergroup == "C Only")$amscitperyearpower),10000,n_b)
dist3 <- meanPermutation(na.omit(filter(pop,lettergroup == "A Only"|lettergroup == "A and B"|lettergroup == "C Only"|lettergroup == "B and C")$amscitperyearpower),10000,n_c)
sum(dist1 < val1)/length(dist1)
sum(dist2 < val2)/length(dist2)
sum(dist3 < val3)/length(dist3)
```
The induced p-value for hypothesis 1 is 0. The induced p-value for hypothesis 2 is 0.0974. The induced p-value for hypothesis 3 is 0.0002. Hence we fail to reject the null hypothesis 2 at a 2% significance level, and we reject the null hypotheses 1 and 3 in favor of the alternative. We conclude that, after adjusting for age, $\mu(A) < \mu(B) \leq \mu(C)$.
## One more check that age is irrelevant when comparing citations
This method was suggested by a friend as a final check to eliminate any question that age is the greatest confounder. We want to show that $\mu(A) < \mu(B \cup C)$. We randomly sample a population of 20 from A, called $X$. For each member $x \in X$, we find every person from B and C within a four-year age window around $x$ and randomly sample one, inducing a new population $Y$. We then store the difference in means $\mu(X) - \mu(Y)$. We repeat this 1,000 times and plot a histogram of the induced values. If 0 lies well within this new distribution, then maybe there is a chance, a slim one given the analysis above, that age is in fact a confounder. If the distribution is primarily negative, then $X < Y$; otherwise $X > Y$. We perform this analysis with both AMS citations and Google Scholar citations.
```{r, echo=FALSE}
#with amscit
set.seed(0)
df3 <- filter(df, (age>0)&(amscit>0))
ageMatchedRandom <- function(){
Xp <- filter(df3,lettergroup=="A Only"|lettergroup=="A and B")
X <- Xp[sample(nrow(Xp),20),]
Yp <- filter(df3,lettergroup=="B Only"|lettergroup=="A and B"|lettergroup=="B and C"|lettergroup=="C Only")
#instantiate matrix
Y <- matrix(0,nrow = 1, ncol = nrow(X))
#for each person in subsample
for (i in 1:nrow(X)){
#get their age
a <- X[i,]$age
#create range
low <- a-2
high <- a+2
#find all persons between low and high and then pick one randomly
agematched <- filter(Yp, (age <= high & age >= low))$amscit
random <- sample(agematched,1)
#store
Y[,i] <- random
}
return(mean(X$amscit)-mean(Y))
}
#perform test 1000 times
Dist <- matrix(0,nrow=1,ncol=1000)
for (i in 1:1000){
Dist[,i]<-ageMatchedRandom()
}
hist(Dist, main="Age Matched Test for MSN Citations")
```
```{r, echo=FALSE, include = FALSE}
sum(Dist>=0)
```
When comparing MathSciNet citations with this age-matched randomization test, we see that none of the induced distribution is greater than or equal to zero. So, when comparing similarly aged apples to apples, $A < B \cup C$.
We perform the same analysis with Google Scholar citations.
```{r, echo = FALSE}
#with Google Scholar
set.seed(0)
df3 <- filter(df, (age>0)&(citations>0))
ageMatchedRandom <- function(){
Xp <- filter(df3,lettergroup=="A Only"|lettergroup=="A and B")
X <- Xp[sample(nrow(Xp),20),]
Yp <- filter(df3,lettergroup=="B Only"|lettergroup=="A and B"|lettergroup=="B and C"|lettergroup=="C Only")
#instantiate matrix
Y <- matrix(0,nrow = 1, ncol = nrow(X))
#for each person in subsample
for (i in 1:nrow(X)){
#get their age
a <- X[i,]$age
#create range
low <- a-2
high <- a+2
#find all persons between low and high and then pick one randomly
agematched <- filter(Yp, (age <= high & age >= low))$citations
random <- sample(agematched,1)
#store
Y[,i] <- random
}
return(mean(X$citations)-mean(Y))
}
#perform test 1000 times
Dist <- matrix(0,nrow=1,ncol=1000)
for (i in 1:1000){
Dist[,i]<-ageMatchedRandom()
}
hist(Dist, main="Age Matched Test Google Scholar Citations")
```
```{r, echo=FALSE, include = FALSE}
sum(Dist>=0)
```
When comparing Google Scholar citations with this age-matched randomization test, we see that 18.1% of the induced distribution is greater than or equal to zero. So, when comparing similarly aged apples to apples, it is inconclusive whether $A < B \cup C$. Of course, we wonder whether this is driven by Lior Pachter himself.
```{r, echo = FALSE}
#with Google Scholar without Pachter
set.seed(0)
df4 <- filter(df, (age>0)&(citations>0)&(name!="lior pachter"))
ageMatchedRandom <- function(){
Xp <- filter(df4,lettergroup=="A Only"|lettergroup=="A and B")
X <- Xp[sample(nrow(Xp),20),]
Yp <- filter(df4,lettergroup=="B Only"|lettergroup=="A and B"|lettergroup=="B and C"|lettergroup=="C Only")
#instantiate matrix
Y <- matrix(0,nrow = 1, ncol = nrow(X))
#for each person in subsample
for (i in 1:nrow(X)){
#get their age
a <- X[i,]$age
#create range
low <- a-2
high <- a+2
#find all persons between low and high and then pick one randomly
agematched <- filter(Yp, (age <= high & age >= low))$citations
random <- sample(agematched,1)
#store
Y[,i] <- random
}
return(mean(X$citations)-mean(Y))
}
#perform test 1000 times
Dist <- matrix(0,nrow=1,ncol=1000)
for (i in 1:1000){
Dist[,i]<-ageMatchedRandom()
}
hist(Dist, main="Age Matched Test Google Scholar Citations \n without Lior Pachter ")
```
```{r, echo=FALSE, include=FALSE}
sum(Dist>=0)
```
When comparing Google Scholar citations with this age-matched randomization test, with Pachter removed, we see that 2.7% of the induced distribution is greater than or equal to zero. So, when comparing similarly aged apples to apples, it indeed seems that $A < B \cup C$.
## Pachter's Magic Trick: Hypertuning
A note about Pachter's final, supposedly "damning" (it is not), figure. He chose a cutoff of PhD age 36 and compared the average Google Scholar citations of letter signers below that cutoff. He finds that with this cutoff, the mean citations of A are greater than those of B. We found this choice of 36 curious and somewhat arbitrary. It smelled like parameter tuning, but we wanted to investigate.
We plot the mean Google Scholar citations of professors with PhD age at most a given cutoff, as a function of that cutoff, and mark the 36 (PhD) age cutoff with a vertical line.
```{r, echo = FALSE}
# Initialize vectors to hold the mean citations of professors younger than a certain age.
x.youngest.age <- 15
x.oldest.age <-65
x.age.range <- x.youngest.age:x.oldest.age
year.span <- x.oldest.age - x.youngest.age
x.a <- vector(mode="numeric", length=year.span)
x.b <- vector(mode="numeric", length=year.span)
x.c <- vector(mode="numeric", length=year.span)
for (x.age in x.age.range)
{
x.a[1+x.age - x.youngest.age] <- mean(filter(df, ((lettergroup == "A Only"|lettergroup == "A and B")&(role=="professor")&(age <= x.age)))$citations, na.rm = TRUE)
x.b[1+x.age - x.youngest.age] <- mean(filter(df, ((lettergroup == "B Only"|lettergroup == "A and B"|lettergroup == "B and C")&(role=="professor")&(age <= x.age)))$citations, na.rm = TRUE)
x.c[1+x.age - x.youngest.age] <- mean(filter(df, ((lettergroup == "C Only"|lettergroup == "B and C")&(role=="professor")&(age <= x.age)))$citations, na.rm = TRUE)
}
plot(x.age.range,x.c,
type="l",
col="chartreuse3",
ylim=c(0,7000),
main="Average citations of Professors as a function of PhD Age",
ylab="mean citations",
xlab="PhD age",
lwd=2)
#pachter's hyperparameter
abline(v=36)
lines(x.age.range,x.a,
type="l",
col="red",
lwd=2)
legend("topleft",
c("Letter A","Letter C"),
fill=c("red","chartreuse3"))
```
```{r, echo=FALSE, include = FALSE}
max(na.omit(filter(df, (citations>0)&(lettergroup=="A Only"|lettergroup=="A and B"))$age))
```
The maximum PhD age of a signer of letter A is 49. If he were to cut off his comparison at that point, clearly $C > A$. If he were to cut off his comparison at 38, $C > A$. Any cutoff much further left of 36 would invite accusations of bias.
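To make the cutoff dependence concrete, the comparison at any PhD-age cutoff can be computed directly. The following non-evaluated sketch uses a helper, `compare_at_cutoff`, which we introduce here purely for illustration; it mirrors the filters used in the figure above.
```{r, eval=FALSE}
# Sketch (not run): mean Google Scholar citations of A and C signers
# who are professors with PhD age at most `cutoff`.
compare_at_cutoff <- function(cutoff) {
  cit_a <- filter(df, (lettergroup == "A Only" | lettergroup == "A and B") &
                      role == "professor" & age <= cutoff)$citations
  cit_c <- filter(df, (lettergroup == "C Only" | lettergroup == "B and C") &
                      role == "professor" & age <= cutoff)$citations
  c(cutoff = cutoff,
    mean_A = mean(cit_a, na.rm = TRUE),
    mean_C = mean(cit_c, na.rm = TRUE))
}
compare_at_cutoff(36)  # Pachter's cutoff
compare_at_cutoff(38)
compare_at_cutoff(49)  # PhD age of the oldest A signer
```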
Notice the spike at age 21 in the figure above. This is caused by Lior Pachter. What would happen if we removed Pachter?
```{r, echo = FALSE}
#removing pachter
a.trim <- tail(sort(filter(df, ((lettergroup == "A Only"|lettergroup == "A and B")&(role=="professor")&(age < 36)))$citations),1)
c.trim <- tail(sort(filter(df, ((lettergroup == "C Only"|lettergroup == "B and C")&(role=="professor")&(age < 36)))$citations),1)
df.a.trimmed <- subset(df, ((lettergroup == "A Only"|lettergroup == "A and B")&(role=="professor")&(age < 100)&(!is.na(citations))&!(citations %in% a.trim)))
df.c.trimmed <- subset(df, ((lettergroup == "C Only"|lettergroup == "B and C")&(role=="professor")&(age < 100)&(!is.na(citations))&!(citations %in% c.trim)))
x.a <- 0
x.b <- 0
x.c <- 0
for (x.age in x.age.range)
{
x.a[1+x.age - x.youngest.age] <- mean(filter(df.a.trimmed, (age <= x.age))$citations, na.rm = TRUE)
x.c[1+x.age - x.youngest.age] <- mean(filter(df.c.trimmed, (age <= x.age))$citations, na.rm = TRUE)
}
plot(x.age.range,x.c,
type="l",
col="chartreuse3",
ylim=c(0,7000),
main="Average citations of Professors as a function of PhD Age \n (Lior Pachter removed)",
ylab="mean citations",
xlab="PhD age",
lwd=2)
lines(x.age.range,x.a,
type="l",
col="red",
lwd=2)
legend("topleft",
c("Letter A","Letter C"),
fill=c("red","chartreuse3"))
```
Perhaps removing only Lior could itself be seen as skewing the comparison. So we keep him removed and also remove the five most highly cited mathematicians from C.
```{r, echo = FALSE}
x.youngest.age <- 15
x.oldest.age <-65
x.age.range <- x.youngest.age:x.oldest.age
year.span <- x.oldest.age - x.youngest.age
x.a <- vector(mode="numeric", length=year.span)
x.b <- vector(mode="numeric", length=year.span)
x.c <- vector(mode="numeric", length=year.span)
a.trim <- tail(sort(filter(df, ((lettergroup == "A Only"|lettergroup == "A and B")&(role=="professor")&(age < 100)))$citations),1)
c.trim <- tail(sort(filter(df, ((lettergroup == "C Only"|lettergroup == "B and C")&(role=="professor")&(age < 100)))$citations),5)
df.a.trimmed <- subset(df, ((lettergroup == "A Only"|lettergroup == "A and B")&(role=="professor")&(age < 100)&(!is.na(citations))&!(citations %in% a.trim)))
df.c.trimmed <- subset(df, ((lettergroup == "C Only"|lettergroup == "B and C")&(role=="professor")&(age < 100)&(!is.na(citations))&!(citations %in% c.trim)))
x.a <- 0
x.b <- 0
x.c <- 0
for (x.age in x.age.range)
{
x.a[1+x.age - x.youngest.age] <- mean(filter(df.a.trimmed, (age <= x.age))$citations, na.rm = TRUE)
x.c[1+x.age - x.youngest.age] <- mean(filter(df.c.trimmed, (age <= x.age))$citations, na.rm = TRUE)
}
plot(x.age.range,x.c,
type="l",
col="chartreuse3",
ylim=c(0,7000),
main="Average citations of Professors as a function of age \n (Lior Pachter Removed, \n and 5 highest cited of Letter C removed)",
ylab="mean citations",
xlab="PhD age",
lwd=2)
lines(x.age.range,x.a,
type="l",
col="red",
lwd=2)
legend("topleft",
c("Letter A","Letter C"),
fill=c("red","chartreuse3"))
```
So it is clear that Pachter's analysis was some sort of magic trick, perhaps a thought experiment, and a misleading one. It is highly unlikely that a tenured and respected expert in computation and statistics did not know the above result, especially when a student he suggests should take an introductory statistics course immediately spotted it. One may suspect that he purposefully chose his cutoff of 36 to try to undermine our results.
## Tier Rankings
In our Excel sheet (which we understand is the bane of reproducibility), and through the magic of pivot tables, we rank R1 departments by average departmental citations per year (since first publication); a dplyr sketch of the equivalent computation appears after the list below.
The top 11 departments using this ranking are:
1. Princeton
2. Institute for Advanced Study
3. Harvard
4. Stanford
5. University of Chicago
6. University of California - Los Angeles
7. Massachusetts Institute of Technology
8. Columbia University
9. New York University
10. University of Miami
11. University of California - Berkeley
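As a reproducibility aid, the pivot-table ranking referenced above can be sketched in dplyr as follows. This block is not evaluated: `University` is a hypothetical column name standing in for the institution field, which is not present in the anonymized file we publish.
```{r, eval=FALSE}
# Sketch (not run) of the department ranking; `University` is a
# hypothetical, non-anonymized institution column.
allR1 %>%
  group_by(University) %>%
  summarise(mean_citperyear = mean(citperyear, na.rm = TRUE),
            n_professors = n()) %>%
  arrange(desc(mean_citperyear)) %>%
  head(11)
```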
We calculate the average citations per year since PhD of letters A, B, and C, and compare them to our ranked list.
```{r, echo = FALSE,include=FALSE}
mean(filter(df, (lettergroup == "A Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE)
mean(filter(df, (lettergroup == "B and C"|lettergroup == "B Only"|lettergroup == "A and B")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE)
mean(filter(df, (lettergroup == "C Only"|lettergroup == "B and C")&institution=="domesticr1"&field=="math"&role=="professor")$amscitperyear, na.rm = TRUE)
```
The average MathSciNet citations per year (years since PhD) are:
1. For letter A - 15.98726
2. For letter B - 41.86467
3. For letter C - 55.3615
Temple has an average citations per year of 12.33, so we retract our claim that letter A is comparable to Temple. It is closer to the University of Massachusetts - Amherst, which has an average citations per year of 16.17. By [US News](https://www.usnews.com/best-graduate-schools/top-science-schools/mathematics-rankings), the University of Massachusetts - Amherst's math department is ranked 55. Rutgers has an average citations per year of 35.01, so we retract our claim that letter B is comparable to Rutgers. It is closer to the University of Minnesota, which has an average citations per year of 42.07 and a US News ranking of 19. For letter C, we claimed that it was another tier higher - indeed, it is closer to the University of Chicago, which has an average citations per year of 56.27 and is ranked 6 by US News.
An astute observer would notice that we are not exactly comparing apples to apples: the departmental figures use years since first publication, which may precede the PhD, while the letter figures use years since the PhD, so the letter signers receive a slight boost. Even with this boost, the ordering amongst the letter signers stands.
## Discussion and Conclusion
We have debunked the claim that age is the confounder for the difference in citations and citations per year between signers of Letters A, B, and C. Indeed, the least meritorious of mathematicians as a whole signed letter A, whereas the more meritorious signed letters B and C, with merit judged by citations. If one were not willing to believe that citations impose even a rough ordering of merit, one could replace citations with Fields medals, AMS Fellowships, or many other metrics.
In this analysis, we have addressed most of the criticisms in Pachter's review, acknowledging our errors where they were pointed out, while rejecting his false claim that age was the greatest confounder. The only criticism we have not addressed is his point that "several p-values are computed and reported without any multiple testing correction." After consulting a respected statistician, we do not see what the issue is. We reported every p-value, and he is welcome to change the set.seed in our code, which he himself notes is easily reproducible.
We conclude by reiterating our thanks to Pachter. We truly appreciated the review.
## Data and Code
All code and data used for this report are available at:
https://github.com/joshp112358/Response-to-Pachter
## References
Lior Pachter's Blog Post - Diversity Matters - January 17, 2020 https://liorpachter.wordpress.com/2020/01/17/diversity-matters/
Chad Topaz's Paper - Version 10 - https://osf.io/preprints/socarxiv/fa4zb/
Our original Paper - Version 1 - https://arxiv.org/pdf/2001.00670.pdf
In Preparation - A Citations Analysis of R1 Math Departments by Joshua Paik and Igor Rivin
## Miscellany
There seems to be some squabbling in the comments of Pachter's blog over whether the paper is Paik-Rivin or Rivin-Paik. In mathematics, we follow the Hardy-Littlewood rule: all authors are first authors, and we list authors alphabetically.