Exercise 7: Open Refine
Open Refine
A. Getting started
- Install the software (https://openrefine.org/docs)
- Run it locally (usually accessible at http://127.0.0.1:3333/)
- Have a look at the different pages and functionalities
- Create a new project by importing any supported files
TipHints for Part A
- OpenRefine runs in your browser but is a local application — no data leaves your machine.
- Try importing a small CSV or Excel file you already have to get familiar with the interface before moving on.
B. Basic Cleaning & Faceting with the Powerhouse Museum Dataset
Create a new project by importing the Powerhouse Museum TSV file. Instead of downloading the file first, use the “Get data from Web Address / URL” option in the import screen and paste this link: https://zenodo.org/records/17047254/files/phm_collection_adapted.tsv. Remember to select “Tab-separated values”.
Explore with Facets:
- Click the dropdown arrow in the
Categoriescolumn header and select Facet > Text filter. - Search “Spacesuits” to filter the view
- Click the dropdown arrow in the
Test the Undo Stack:
- Make a temporary change: Go to the
Object Titlecolumn > Edit cells > Transform and typevalue.toUppercase(). Click OK. - Look at the Undo / Redo tab on the left. Click the previous step in the list to instantly revert the change.
- Goal: Confirm that every action is recorded and reversible.
- Make a temporary change: Go to the
Multi-valued cells: The
categoriescolumn contains multiple values separated by a pipe character (|).- Use Edit cells > Split multi-valued cells to separate them (enter
|as the separator). - Count how many objects have more than 3 categories (hint: use Facet > Custom facets > Facet by length on the split column).
- Once analysed, rejoin the cells using Edit cells > Join multi-valued cells.
- Use Edit cells > Split multi-valued cells to separate them (enter
Text Facet & Regular Expressions: The
Provenance (Production)column contains structured strings separated by pipes (e.g.,Maker: ...; ...; ...|).- Create a Text facet on the
Provenance (Production)column to observe the variations. - Notice that many entries start with specific roles like “Maker:” or “Designer:”.
- Use Edit cells > Transform with a GREL regex to extract just the role. Try the expression:
value.match(/^(Maker|Designer|User):/)[0]. - Alternatively, use Edit column > Add column based on this column to create a new “Role” column extracting the first word before the colon.
- Create a Text facet on the
TipHints for Part B
- For Step 5: Importing directly from URL saves disk space.
- For Step 6: Facets and filters are the most powerful tool in OpenRefine. They allow you to filter data and see errors (like typos or inconsistent capitalization) instantly.
- For Step 8: After splitting, OpenRefine creates multiple rows for a single record. Use the “Record” mode (top left) to count categories per original object if needed.
- For Step 9: Regular Expressions (regex) are powerful for extracting patterns.
^matches the start of the string, and(...)captures the group. If the pattern isn’t found, the result will benull.
C. Manuscript Reconciliation with Biblissima
- Import and Inspect:
- Create a new project by importing the CSV from: https://raw.githubusercontent.com/emmamorlock/workshop/refs/heads/main/exercices/handouts/biblissima.csv
- Observe the columns:
----(Name),coord,Type,YearBirth,YearDeath. - Note: There is no explicit “Author” or “Repository” column. The
Typecolumn identifies if a row is a person, a place, a descriptor, or a shelfmark.
- Filter Non-Person Entities:
- The dataset contains mixed types. We want to focus on cleaning and reconciling persons.
- Click the
Typecolumn > Facet > Text facet. - In the left panel, uncheck
descriptor,place, andshelfmark. Onlypersonshould remain checked. - Result: The grid now shows only rows representing people.
- Clean Name Variations:
- Look at the
----(Name) column. Notice inconsistencies like ” Petrarch” (leading spaces) or “Charles I” (multiple spaces). - Apply Edit cells > Common transforms > Trim leading and trailing whitespace.
- Apply Edit cells > Common transforms > To titlecase to standardize capitalization.
- Optional: Use Edit cells > Common transforms > Collapse consecutive whitespace if available, or use Transform with
value.replace(/\s+/, " ").
- Look at the
- Reconcile with Biblissima:
- Now that names are cleaner, connect them to the official Biblissima authority file.
- Click the
----(Name) column > Reconcile > Start reconciling… - If Biblissima is not in the list, click Add standard service… and enter:
https://data.biblissima.fr/reconcile/en/api - Select the Biblissima service, choose the type Human and add relevant properties to help in the reconciliation.
- Click Start Reconciling.
- Extract Reconciled Data:
- Use Reconcile > Add column with URLs of match entites… to extract sthe Biblissima URI
- This creates a persistent link between your local messy data and the authoritative record.
TipHints for Part C
- For Step 11: Faceting allows you to temporarily hide rows you don’t need. If you want to permanently delete non-person rows, you can use Edit rows > Remove all matching rows while the facet is active.
- For Step 12: Always trim whitespace before reconciling. ” Charles” and “Charles” are treated as different strings by the reconciliation service, reducing your match rate.
- For Step 13: Reconciliation is not just matching; it disambiguates. “Charles I” could be the King of Portugal or England. The service provides candidates so you can choose the correct historical figure based on the context (dates).
- For Step 14: The extracted URI (e.g.,
http://data.biblissima.fr/...) is a stable link you can use to fetch more data or link to other datasets (Linked Open Data).
More information and tutorials on OpenRefine can be found on the following links:
- Library Carpentry website: https://librarycarpentry.org/lc-open-refine/
- University of Nevada Las Vegas: https://guides.library.unlv.edu/open-refine/getting-started
- University of Illinois Urbana-Champaign: https://guides.library.illinois.edu/openrefine