03.1-literate-programming.rmd

## Statistical Computing with R and Python Notebooks; Reproducible code

<div class = rmdcaution> _This section is under development_. </div>

When one of us (Graham) was a graduate student, he was tasked with teaching undergraduates how to do a chi-squared test of archaeological data. This was in the late 1990s. He duly opened up Excel, and began to craft a template there. While it was possible to use Excel's automatic chi-square stastical test on a table of data, simply showing the students how to use the function seemed undesirable. If he used the function, would the students really _know_ what the test was doing? In any event, after several hours of working with the function, he determined that he couldn't actually figure out what exactly the function was doing. The dataset was from Mike Fletcher and Gary R. Lock's very clear _Digging Numbers: Elementary Statistics for Archaeologists_ (spearheads and the presence/absence of a loop feature; Fletcher and Lock, 1991: 115-125). The answer Excel was providing was not the answer that Fletcher and Lock demonstrated. Was there some setting somewhere in Excel that was affecting the function? Time was ticking. Graham elected instead to build the spreadsheet so that each step - each calculation - in the statistic was its own cell. He had the students perform these steps on their own spreadsheet. When even that went awry, he saved his sheet onto a floppy disc, and loaded it one computer at a time for the students. This time, working from the original spreadshet, the results agreed with the text. The association between material and period was stronger than the association of material and presence of a loop.

Excel is a black box. When we use it, we have to take on faith that its statistics do what they say they are doing. Most of the time, this is not an issue - but Excel is a very complicated piece of software. Biologists, for instance, have known for years that Excel's automatic date-time conversions (that is, Excel detects the _kind_ of data in a cell and changes its formatting or _the actual data itself_) could affect gene names (Zeeberg et al 2004) [link](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-80).

Marwick discusses additional issues that occur when using Excel to manipulate our archaeological data. Archaeological data can seldom be used in the format in which they are first recorded. Much effort has to be expended to clean the data up into the neat and tidy tables that a spreadsheet requires. What decisions were made in the tidying process? What bits and pieces were left out? What bits and pieces were transformed as they were entered into the spreadsheet? If someone else were to try to confirm the results of the study - to reproduce it - and there is no record of the manipulations of the data (the specific transformations), reproducibility is unlikely. When researchers discuss a new method, without the data being available and the kinds of transformations and analyses not clearly and fully specified, we have created a barrier to moving archaeology forward.

As an exercise right now to understand just how difficult a problem this is, make a table in Excel. Visualize the following data: 12 bronze spear heads, 18 iron spear heads; 6 of the bronze spear heads have a loop, 10 of the iron spear heads have a loop. Write down all of the steps you took to visualize this information. Save your work; close the spreadsheet. Now hand your list over to a peer. Have them follow the steps exactly, but do not clarify or otherwise explain the list.

Compare what they've created, with what you created. How close of a match is there? Did they reproduce your visualization? What elements or steps did you forget to include in your list?

Now you might begin to see some of the issues involved in relying on a point-and-click graphical user interface and a black-box like Excel with your archaeological data.

Marwick (2017) writes with regard to this problem,

> This is a substantial barrier to progress in archaeology, both in establishing the veracity of previous claims and promoting the growth of new interpretations. Furthermore, if we are to contribute to contemporary conversations outside of archaeology (as we are supposedly well-positioned to do, cf. Kintigh et al. (2014)), we need to become more efficient, interoperative, and flexible in our research. We have to be able to invite researchers from other fields into our research pipelines to collaborate in answering interesting and broad questions about past societies.

Marwick lists four principles for reproducible archaeology. Note that 'reproducible' here means being able to re-do the analysis and obtain the same results; 'replicable' means using the same methods on a new data set collected and analyzed with the published methods of an earlier study. The principles that we should follow are:

- make the data and the methods that generated the results openly available
- script the stastical analysis (that is, no more point-and-click statistics). More on this below.
- use version control
- record the computational environment employed

todo:

- how R addresses these issues
- how R works with version control
  - point to Marwick's more complex book on [R for archaeologists](https://benmarwick.github.io/How-To-Do-Archaeological-Science-Using-R/)
- R markdown and jupyter notebooks : literate programming
- what goes in 'discussion' below?

- marwick's compendium

### discussion

### exercises