Exercise 7: Open Refine

Author

Affiliations

Julien A. Raemy

docuteam SA

University of Bern

Published

February 1, 2026

Modified

March 4, 2026

Open Refine

A. Getting started

Install the software (https://openrefine.org/docs)
Run it locally (usually accessible at http://127.0.0.1:3333/)
Have a look at the different pages and functionalities
Create a new project by importing any supported files

Hints for Part A

OpenRefine runs in your browser but is a local application — no data leaves your machine.
Try importing a small CSV or Excel file you already have to get familiar with the interface before moving on.

B. Basic Cleaning & Faceting with the Powerhouse Museum Dataset

Create a new project by importing the Powerhouse Museum TSV file. Instead of downloading the file first, use the “Get data from Web Address / URL” option in the import screen and paste this link: https://zenodo.org/records/17047254/files/phm_collection_adapted.tsv. Remember to select “Tab-separated values”.
Explore with Facets:
- Click the dropdown arrow in the Categories column header and select Facet > Text filter.
- Search “Spacesuits” to filter the view
Test the Undo Stack:
- Make a temporary change: Go to the Object Title column > Edit cells > Transform and type value.toUppercase(). Click OK.
- Look at the Undo / Redo tab on the left. Click the previous step in the list to instantly revert the change.
- Goal: Confirm that every action is recorded and reversible.
Multi-valued cells: The categories column contains multiple values separated by a pipe character (|).
- Use Edit cells > Split multi-valued cells to separate them (enter | as the separator).
- Count how many objects have more than 3 categories (hint: use Facet > Custom facets > Facet by length on the split column).
- Once analysed, rejoin the cells using Edit cells > Join multi-valued cells.
Text Facet & Regular Expressions: The Provenance (Production) column contains structured strings separated by pipes (e.g., Maker: ...; ...; ...|).
- Create a Text facet on the Provenance (Production) column to observe the variations.
- Notice that many entries start with specific roles like “Maker:” or “Designer:”.
- Use Edit cells > Transform with a GREL regex to extract just the role. Try the expression: value.match(/^(Maker|Designer|User):/)[0].
- Alternatively, use Edit column > Add column based on this column to create a new “Role” column extracting the first word before the colon.

Hints for Part B

For Step 5: Importing directly from URL saves disk space.
For Step 6: Facets and filters are the most powerful tool in OpenRefine. They allow you to filter data and see errors (like typos or inconsistent capitalization) instantly.
For Step 8: After splitting, OpenRefine creates multiple rows for a single record. Use the “Record” mode (top left) to count categories per original object if needed.
For Step 9: Regular Expressions (regex) are powerful for extracting patterns. ^ matches the start of the string, and (...) captures the group. If the pattern isn’t found, the result will be null.

C. Manuscript Reconciliation with Biblissima

Import and Inspect:
- Create a new project by importing the CSV from: https://raw.githubusercontent.com/emmamorlock/workshop/refs/heads/main/exercices/handouts/biblissima.csv
- Observe the columns: ---- (Name), coord, Type, YearBirth, YearDeath.
- Note: There is no explicit “Author” or “Repository” column. The Type column identifies if a row is a person, a place, a descriptor, or a shelfmark.
Filter Non-Person Entities:
- The dataset contains mixed types. We want to focus on cleaning and reconciling persons.
- Click the Type column > Facet > Text facet.
- In the left panel, uncheck descriptor, place, and shelfmark. Only person should remain checked.
- Result: The grid now shows only rows representing people.
Clean Name Variations:
- Look at the ---- (Name) column. Notice inconsistencies like ” Petrarch” (leading spaces) or “Charles I” (multiple spaces).
- Apply Edit cells > Common transforms > Trim leading and trailing whitespace.
- Apply Edit cells > Common transforms > To titlecase to standardize capitalization.
- Optional: Use Edit cells > Common transforms > Collapse consecutive whitespace if available, or use Transform with value.replace(/\s+/, " ").
Reconcile with Biblissima:
- Now that names are cleaner, connect them to the official Biblissima authority file.
- Click the ---- (Name) column > Reconcile > Start reconciling…
- If Biblissima is not in the list, click Add standard service… and enter: https://data.biblissima.fr/reconcile/en/api
- Select the Biblissima service, choose the type Human and add relevant properties to help in the reconciliation.
- Click Start Reconciling.
Extract Reconciled Data:
- Use Reconcile > Add column with URLs of match entites… to extract sthe Biblissima URI
- This creates a persistent link between your local messy data and the authoritative record.

Hints for Part C

For Step 11: Faceting allows you to temporarily hide rows you don’t need. If you want to permanently delete non-person rows, you can use Edit rows > Remove all matching rows while the facet is active.
For Step 12: Always trim whitespace before reconciling. ” Charles” and “Charles” are treated as different strings by the reconciliation service, reducing your match rate.
For Step 13: Reconciliation is not just matching; it disambiguates. “Charles I” could be the King of Portugal or England. The service provides candidates so you can choose the correct historical figure based on the context (dates).
For Step 14: The extracted URI (e.g., http://data.biblissima.fr/...) is a stable link you can use to fetch more data or link to other datasets (Linked Open Data).

More information and tutorials on OpenRefine can be found on the following links:

Library Carpentry website: https://librarycarpentry.org/lc-open-refine/
University of Nevada Las Vegas: https://guides.library.unlv.edu/open-refine/getting-started
University of Illinois Urbana-Champaign: https://guides.library.illinois.edu/openrefine

References

Reuse

CC BY 4.0