Commit b26a2af

source commit: e3d865c
0 parents  commit b26a2af

33 files changed

+6899
-0
lines changed

01-introduction.md

+88
@@ -0,0 +1,88 @@
1+
---
2+
title: Introduction to OpenRefine
3+
teaching: 15
4+
exercises: 0
5+
---
6+
7+
::::::::::::::::::::::::::::::::::::::: objectives
8+
9+
- Explain what the OpenRefine software does
10+
- Explain how the OpenRefine software can help work with data files
11+
12+
::::::::::::::::::::::::::::::::::::::::::::::::::
13+
14+
:::::::::::::::::::::::::::::::::::::::: questions
15+
16+
- What is OpenRefine? What can it do?
17+
18+
::::::::::::::::::::::::::::::::::::::::::::::::::
19+
20+
## What is OpenRefine?
21+
22+
OpenRefine is a desktop application that uses your web browser as a graphical interface. It is described as "a power tool for working with messy data" ([David Huynh](https://web.archive.org/web/20141021040915/http://davidhuynh.net/spaces/nicar2011/tutorial.pdf)) - but what does this mean? It is probably easiest to describe the kinds of data OpenRefine is good at working with and the sorts of problems it can help you or your team solve.
23+
24+
OpenRefine is most useful where you have data in a simple tabular format, such as a spreadsheet, a comma-separated values (CSV) file or a tab-delimited (TSV) file, but with internal inconsistencies in data formats, in where data appears, or in the terminology used. OpenRefine can be used to standardize and clean data across your file. It can help you:
25+
26+
- Get an overview of a data set
27+
- Resolve inconsistencies in a data set, for example standardizing date formatting
28+
- Split data up into more granular parts, for example splitting up cells with multiple authors into separate cells
29+
- Match local data up to other data sets - for example, in matching forms of personal names against name authority records in the Virtual International Authority File (VIAF)
30+
- Enhance a data set with data from other sources
31+
32+
Some common scenarios might be:
33+
34+
- Where you want to know how many times a particular value (name, publisher, subject) appears in a column in your data
35+
- Where you want to know how values are distributed across your whole data set
36+
- Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format. For example:
37+
38+
| Data you have | Desired data |
39+
| ----------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------- |
40+
| 1st January 2014 | 2014-01-01 |
41+
| 01/01/2014 | 2014-01-01 |
42+
| Jan 1 2014 | 2014-01-01 |
43+
| 2014-01-01 | 2014-01-01 |
44+
45+
- Where you have a list of names or terms that differ from each other but refer to the same people, places or concepts. For example:
46+
47+
| Data you have | Desired data |
48+
| ----------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------- |
49+
| London | London |
50+
| London] | London |
51+
| London,] | London |
52+
| london | London |
53+
54+
- Where you have several bits of data combined together in a single column, and you want to separate them out into individual bits of data with one column for each bit of the data. For example going from a single address field (in the first column), to each part of the address in a separate field:
55+
56+
| Address in single field | Institution | Library name | Address 1 | Address 2 | Town/City | Region | Country | Postcode |
57+
| ----------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------- | :------------------------------------------------------------- | :---------------- | :-------- | :---------- | :------------ | :------------- | :------- |
58+
| University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS, United Kingdom | University of Wales | Llyfrgell Thomas Parry Library | Llanbadarn Fawr | | Aberystwyth | Ceredigion | United Kingdom | SY23 3AS |
59+
| University of Aberdeen, Queen Mother Library, Meston Walk, ABERDEEN, AB24 3UE, United Kingdom | University of Aberdeen | Queen Mother Library | Meston Walk | | Aberdeen | | United Kingdom | AB24 3UE |
60+
| University of Birmingham, Barnes Library, Medical School, Edgbaston, BIRMINGHAM, West Midlands, B15 2TT, United Kingdom | University of Birmingham | Barnes Library | Medical School | Edgbaston | Birmingham | West Midlands | United Kingdom | B15 2TT |
61+
| University of Warwick, Library, Gibbett Hill Road, COVENTRY, CV4 7AL, United Kingdom | University of Warwick | Library | Gibbett Hill Road | | Coventry | | United Kingdom | CV4 7AL |
62+
63+
- Where you want to add to your data from an external data source:
64+
65+
| Data you have | Date of Birth from VIAF (Virtual International Authority File) | Date of Death from VIAF (Virtual International Authority File) |
66+
| ----------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------- | :------------------------------------------------------------- |
67+
| Braddon, M. E. (Mary Elizabeth) | 1835 | 1915 |
68+
| Rossetti, William Michael | 1829 | 1919 |
69+
| Prest, Thomas Peckett | 1810 | 1879 |
70+
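To make the counting, date and name scenarios above more concrete, here is a minimal Python sketch of the same kinds of transformation. It is only an illustration of the idea, not how OpenRefine works internally: the function names `normalise_date` and `normalise_place` are made up for this example, the list of date formats covers only the rows shown in the tables, and `01/01/2014` is assumed to be day-first.

```python
from collections import Counter
from datetime import datetime
import re

# Formats seen in the example table; a real data set may need more.
# Assumption: slash-separated dates are day-first (DD/MM/YYYY).
DATE_FORMATS = ["%d %B %Y", "%d/%m/%Y", "%b %d %Y", "%Y-%m-%d"]


def normalise_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    # Drop ordinal suffixes such as "1st" -> "1" before parsing.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalise_place(text):
    """Strip stray spaces, commas and brackets, then fix capitalisation."""
    return text.strip(" ,]").title()


dates = ["1st January 2014", "01/01/2014", "Jan 1 2014", "2014-01-01"]
places = ["London", "London]", "London,]", "london"]

# Every date ends up in the single common format.
print([normalise_date(d) for d in dates])

# Counting values after clean-up shows one place name instead of four.
print(Counter(normalise_place(p) for p in places))
```

In OpenRefine you would reach similar results interactively, without writing a script, and you can inspect the effect on the whole column before committing to a change.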
71+
## What Should I Know When Working With OpenRefine?
72+
73+
- No internet connection is needed, and none of the data or commands you enter in OpenRefine are sent to a remote server.
74+
- You are NOT modifying original/raw data.
75+
- Projects are autosaved every five minutes and when OpenRefine is properly shut down (Ctrl+C). See [History in User Manual](https://docs.openrefine.org/manual/running/#history-undoredo) for details.
76+
- Projects are saved locally, so if you work on two computers you will have to export a project from one and import it on the other.
77+
78+
:::::::::::::::::::::::::::::::::::::::: keypoints
79+
80+
- OpenRefine is 'a power tool for working with messy data'
81+
- OpenRefine works best with data in a simple tabular format
82+
- OpenRefine can help you split data up into more granular parts
83+
- OpenRefine can help you match local data up to other data sets
84+
- OpenRefine can help you enhance a data set with data from other sources
85+
86+
::::::::::::::::::::::::::::::::::::::::::::::::::
87+
88+

02-importing-data.md

+97
@@ -0,0 +1,97 @@
1+
---
2+
title: Importing data into OpenRefine
3+
teaching: 10
4+
exercises: 5
5+
---
6+
7+
::::::::::::::::::::::::::::::::::::::: objectives
8+
9+
- Successfully import data into OpenRefine
10+
11+
::::::::::::::::::::::::::::::::::::::::::::::::::
12+
13+
:::::::::::::::::::::::::::::::::::::::: questions
14+
15+
- How do I get data into OpenRefine?
16+
17+
::::::::::::::::::::::::::::::::::::::::::::::::::
18+
19+
## Importing data
20+
21+
OpenRefine does not manipulate your data directly.
22+
Instead, the data you import and all the changes you make are stored in a project.
23+
You can stop working on a project and continue later if you like.
24+
When you want to 'refine' a new file, you start by creating a new project.
25+
When you want to continue working on a project, you can open it through "Open Project".
26+
It is also possible to export a project on one computer and continue working on it on a different
27+
computer.
28+
To do so, you transfer the exported files to the new computer and use "Import Project" on the new
29+
computer.
30+
31+
::::::::::::::::::::::::::::::::::::::::: callout
32+
33+
## What kinds of data files can I import?
34+
35+
There are several options for getting your data set into OpenRefine. You can upload or import files in a variety of formats including:
36+
37+
- TSV (tab-separated values)
38+
- CSV (comma-separated values)
39+
- TXT
40+
- Excel
41+
- JSON (JavaScript Object Notation)
42+
- XML (Extensible Markup Language)
43+
- Google Spreadsheet
44+
45+
46+
::::::::::::::::::::::::::::::::::::::::::::::::::
47+
48+
::::::::::::::::::::::::::::::::::::::: checklist
49+
50+
## Create your first OpenRefine project (using provided data)
51+
52+
To import the data for the exercise below, follow the instructions in [Setup](https://librarycarpentry.github.io/lc-open-refine/index.html) to download the data and run OpenRefine. *NOTE: If OpenRefine does not open in a browser window, open your browser and type the address [http://127.0.0.1:3333/](http://127.0.0.1:3333/) to take you to the OpenRefine interface.*
53+
54+
1. Once OpenRefine is launched in your browser, click `Create Project` from the left hand menu and select `Get data from This Computer`
55+
2. Click `Choose Files` (or 'Browse', depending on your setup) and locate the file which you have downloaded called `doaj-article-sample.csv`
56+
3. Click `Next»`. The next screen (see below) gives you options to ensure the data is imported into OpenRefine correctly. The options vary depending on the type of data you are importing.
57+
4. Click in the `Character encoding` box and set it to `UTF-8`. This ensures that OpenRefine correctly interprets the imported data as UTF-8 encoded. If you don't select this you may find that some special characters (e.g. smart quotation marks) are not displayed correctly.
58+
5. Ensure the first row is used to create the column headings by checking the box `Parse next 1 line(s) as column headers`
59+
6. OpenRefine will automatically select `Use character " to enclose cells containing column separators`. This makes sure that OpenRefine doesn't misinterpret a comma (or other separator character) that appears inside a cell's data as a delimiter. Keep this option selected.
60+
7. From OpenRefine 3.4 onwards there is an option to `Trim leading & trailing whitespace from strings` when importing separator-based files. Keeping this checked will ensure that values like `English` and `English `, which differ by a single trailing space, are not treated as different values after the import.
61+
8. Make sure the `Attempt to parse cell text into numbers` box is not checked. Automatic number detection could cause errors, such as confusion between date formats (e.g. DD/MM/YYYY vs MM/DD/YYYY).
62+
9. The Project Name box in the upper right corner will default to the title of your imported file. Click in the `Project Name` box to give your project a different name, if desired.
63+
64+
:::::::::::::::::::::::::::::::::::::: instructor
65+
66+
This is a good moment to review the points from [What Should I Know When Working with OpenRefine?](01-introduction.md#what-should-i-know-when-working-with-openrefine)
67+
68+
:::::::::::::::::::::::::::::::::::::::::::::::::
69+
10. Once you have selected the appropriate options for your project, click the `Create project »` button at the top right of the screen. This will create the project and open it for you. Projects are saved as you work on them, so there is no need to save copies as you go along.
70+
71+
![Create Project in OpenRefine](fig/openrefine_ui.png){alt="OpenRefine Create Project screen, with highlights for the address bar, mentioned settings and the Create Project button."}
72+
73+
74+
::::::::::::::::::::::::::::::::::::::::::::::::::
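If it helps to see what the import options in the checklist above mean in practice, the following Python sketch applies comparable settings to a tiny CSV sample using the standard `csv` module. This is an analogy only and says nothing about OpenRefine's own implementation; the sample data and the `read_rows` function are invented for this illustration.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE = (
    'Title,Language,Date\n'
    '"Data, cleaned",English ,01/02/2014\n'
    'Other article,English,2014\n'
).encode("utf-8")


def read_rows(raw_bytes):
    """Read CSV bytes into a list of dicts, keeping every value as text."""
    # Decode explicitly as UTF-8 (compare step 4).
    text = io.StringIO(raw_bytes.decode("utf-8"), newline="")
    # Use the first line as column headers (step 5) and treat " as the
    # character enclosing cells that contain the separator (step 6).
    reader = csv.DictReader(text, delimiter=",", quotechar='"')
    # Trim leading and trailing whitespace (step 7). Nothing is parsed
    # into numbers or dates, so every value stays a string (step 8).
    return [{key: value.strip() for key, value in row.items()} for row in reader]


rows = read_rows(SAMPLE)
print(rows[0]["Title"])                      # the comma stays inside the cell
print({row["Language"] for row in rows})     # one value, not two
print(type(rows[1]["Date"]).__name__)        # still text, not a number
```

Keeping values as text on import leaves the decision about how to interpret dates and numbers to you, which is the same reasoning behind leaving number parsing switched off in step 8.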
75+
76+
To open an existing project in OpenRefine you can click `Open Project` from the main OpenRefine screen (in the left hand menu). When you click this, you will see a list of the existing projects and can click on a project's name to open it.
77+
78+
### Going Further
79+
80+
- Look at the other options on the Import screen - try changing some of these options and see how that changes the Preview and how the data appears after import.
81+
82+
::::::::::::::::::::::::::::::::::::::: instructor
83+
Carefully guide learners on how to revisit OpenRefine's homepage to explore import options when creating new projects or re-opening existing ones: select the large blue diamond in the upper left corner of the browser window.
84+
85+
::::::::::::::::::::::::::::::::::::::::::::::::::
86+
87+
- Do you have access to JSON or XML data? If so, the first stage of the import process will prompt you to select a 'record path', that is, the part of the file whose repeated elements will form the data rows in the OpenRefine project.
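
As a rough illustration of what a 'record path' selects, the Python sketch below walks into a nested JSON document and treats each item of one array as a row. The document, its field names and the helper `rows_at` are all hypothetical, and in OpenRefine you pick the path by clicking in the import preview rather than writing code.

```python
import json

# Hypothetical JSON document; the field names are invented for illustration.
DOCUMENT = """
{
  "source": "example",
  "results": {
    "articles": [
      {"title": "First article", "language": "EN"},
      {"title": "Second article", "language": "FR"}
    ]
  }
}
"""


def rows_at(data, path):
    """Follow a list of keys down to the array whose items become rows."""
    node = data
    for key in path:
        node = node[key]
    return list(node)


# The "record path" here is results -> articles: each article is one row.
rows = rows_at(json.loads(DOCUMENT), ["results", "articles"])
for row in rows:
    print(row["title"], row["language"])
```

Choosing a different path (for example stopping at `results`) would give a different set of rows, which is why the import screen asks you to pick it first.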
88+
89+
:::::::::::::::::::::::::::::::::::::::: keypoints
90+
91+
- Use the `Create Project` option to import data
92+
- You can control how data imports using options on the import screen
93+
- Several file types can be imported into OpenRefine.
94+
95+
::::::::::::::::::::::::::::::::::::::::::::::::::
96+
97+
