Codebooks

Miguel Tomas Silva edited this page Nov 15, 2023 · 3 revisions

Last update: 15-11-2023

What is a Codebook? [1]

A codebook describes the contents, structure, and layout of a data collection. A well-documented codebook "contains information intended to be complete and self-explanatory for each variable in a data file."

Codebooks begin with basic front matter, including the study title, name of the principal investigator(s), table of contents, and an introduction describing the purpose and format of the codebook. Some codebooks also include methodological details, such as how weights were computed, and data collection instruments, while others, especially with larger or more complex data collections, leave those details for a separate user guide and/or data collection instrument.

The main body of a codebook contains unambiguous variable-level details. These include the following, with the bracketed values taken from an example variable in the National Longitudinal Survey of Youth, 1979:

  • Variable name: The name or number assigned to each variable in the data collection. Some researchers prefer to use mnemonic abbreviations (e.g., EMPLOY1), while others use alphanumeric patterns (e.g., VAR001). For survey data, try to name variables after the question numbers - e.g., Q1, Q2b, etc. [In the example, H40-SF12-2]
  • Variable label: A brief description to identify the variable for the user. Where possible, use the exact question or research wording. ["SF12 - ASSESSMENT OF R'S GENERAL HEALTH"]
  • Question text: Where applicable, the exact wording from survey questions. ["In general, would you say your health is . . ."]
  • Values: The actual coded values in the data for this variable. [1, 2, 3, 4, 5]
  • Value labels: The textual descriptions of the codes. [Excellent, Very Good, Good, Fair, Poor]
  • Summary statistics: Where appropriate and depending on the type of variable, provide unweighted summary statistics for quick reference. For categorical variables, for instance, frequency counts showing the number of times a value occurs and the percentage of cases that value represents for the variable are appropriate. For continuous variables, minimum, maximum, and median values are relevant.
  • Missing data: Where applicable, the values and labels of missing data. Missing data can bias an analysis and is important to convey in study documentation. Remember to describe all missing codes, including "system missing" and blank. [e.g., Refusal (-1)]
  • Universe skip patterns: Where applicable, information about the population to which the variable refers, as well as the preceding and following variables. [e.g., Default Next Question: H00035.00]
  • Notes: Additional notes, remarks, or comments that contextualize the information conveyed in the variable or relay special instructions. For measures or questions from copyrighted instruments, the notes field is the appropriate location to cite the source.
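
The variable-level fields listed above can be sketched as a small data structure. The Python below is a minimal illustration, not a standard codebook format: the class name CodebookEntry, its field names, and the frequencies helper are all hypothetical, and the response list is invented for the example rather than taken from the survey. It shows how value labels and missing-data codes can sit alongside unweighted frequency counts for a categorical variable.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CodebookEntry:
    """One variable-level record. Field names are illustrative, not a standard."""
    name: str
    label: str
    question_text: str = ""
    value_labels: dict[int, str] = field(default_factory=dict)
    missing_codes: dict[int, str] = field(default_factory=dict)
    universe: str = ""
    notes: str = ""

    def frequencies(self, responses):
        """Return unweighted (code, label, count, percent) rows, sorted by code."""
        counts = Counter(responses)
        total = len(responses)
        # Missing codes are documented next to the substantive codes.
        labels = {**self.value_labels, **self.missing_codes}
        rows = []
        for code in sorted(counts):
            # Flag any code found in the data that the codebook does not describe.
            label = labels.get(code, "UNDOCUMENTED")
            percent = round(100.0 * counts[code] / total, 1)
            rows.append((code, label, counts[code], percent))
        return rows


# Name, label, question text, values, and missing code come from the list above.
entry = CodebookEntry(
    name="H40-SF12-2",
    label="SF12 - ASSESSMENT OF R'S GENERAL HEALTH",
    question_text="In general, would you say your health is . . .",
    value_labels={1: "Excellent", 2: "Very Good", 3: "Good", 4: "Fair", 5: "Poor"},
    missing_codes={-1: "Refusal"},
)

# Invented responses for illustration only; these are not survey data.
responses = [1, 2, 2, 3, 3, 3, 4, 5, -1, 2]

for row in entry.frequencies(responses):
    print(row)
```

Keeping the missing codes in their own mapping makes it easy to report them separately or to exclude them from percentages, depending on the convention the study documentation adopts.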

Source

[1] University of Michigan - Institute for Social Research