Introduction to Open Data
  • Home
  • Syllabus
  • Exercises
    • Exercise 1: OA Deep Dive
    • Exercise 2: Movements and Principles
    • Exercise 3: Up to date with Linked Data
    • Exercise 4: Discovering ORD Platforms
    • Exercise 5: The Reuser’s Perspective (OGD)
    • Exercise 6: Reading Assignment
    • Exercise 7: Open Refine
    • Exercise 8: IIIF & ML
  • Course Sections
    • Characteristics of Open Data
    • Associated Movements
    • Associated Principles
    • Open Data Platforms and Organisations
    • Assessment, Data Quality, and Best Practices
    • Techniques, Software, and Tools
    • Showcases
  • Recap
  • References
  • About

On this page

  • Open Refine
    • A. Getting started
    • B. Basic Cleaning & Faceting with the Powerhouse Museum Dataset
    • C. Manuscript Reconciliation with Biblissima
  • References
  • Edit this page
  • Report an issue

Exercise 7: Open Refine

Author
Affiliations

Julien A. Raemy

docuteam SA

University of Bern

Published

February 1, 2026

Modified

March 4, 2026

Open Refine

A. Getting started

  1. Install the software (https://openrefine.org/docs)
  2. Run it locally (usually accessible at http://127.0.0.1:3333/)
  3. Have a look at the different pages and functionalities
  4. Create a new project by importing any supported files
TipHints for Part A
  • OpenRefine runs in your browser but is a local application — no data leaves your machine.
  • Try importing a small CSV or Excel file you already have to get familiar with the interface before moving on.

B. Basic Cleaning & Faceting with the Powerhouse Museum Dataset

  1. Create a new project by importing the Powerhouse Museum TSV file. Instead of downloading the file first, use the “Get data from Web Address / URL” option in the import screen and paste this link: https://zenodo.org/records/17047254/files/phm_collection_adapted.tsv. Remember to select “Tab-separated values”.

  2. Explore with Facets:

    • Click the dropdown arrow in the Categories column header and select Facet > Text filter.
    • Search “Spacesuits” to filter the view
  3. Test the Undo Stack:

    • Make a temporary change: Go to the Object Title column > Edit cells > Transform and type value.toUppercase(). Click OK.
    • Look at the Undo / Redo tab on the left. Click the previous step in the list to instantly revert the change.
    • Goal: Confirm that every action is recorded and reversible.
  4. Multi-valued cells: The categories column contains multiple values separated by a pipe character (|).

    • Use Edit cells > Split multi-valued cells to separate them (enter | as the separator).
    • Count how many objects have more than 3 categories (hint: use Facet > Custom facets > Facet by length on the split column).
    • Once analysed, rejoin the cells using Edit cells > Join multi-valued cells.
  5. Text Facet & Regular Expressions: The Provenance (Production) column contains structured strings separated by pipes (e.g., Maker: ...; ...; ...|).

    • Create a Text facet on the Provenance (Production) column to observe the variations.
    • Notice that many entries start with specific roles like “Maker:” or “Designer:”.
    • Use Edit cells > Transform with a GREL regex to extract just the role. Try the expression: value.match(/^(Maker|Designer|User):/)[0].
    • Alternatively, use Edit column > Add column based on this column to create a new “Role” column extracting the first word before the colon.
TipHints for Part B
  • For Step 5: Importing directly from URL saves disk space.
  • For Step 6: Facets and filters are the most powerful tool in OpenRefine. They allow you to filter data and see errors (like typos or inconsistent capitalization) instantly.
  • For Step 8: After splitting, OpenRefine creates multiple rows for a single record. Use the “Record” mode (top left) to count categories per original object if needed.
  • For Step 9: Regular Expressions (regex) are powerful for extracting patterns. ^ matches the start of the string, and (...) captures the group. If the pattern isn’t found, the result will be null.

C. Manuscript Reconciliation with Biblissima

  1. Import and Inspect:
    • Create a new project by importing the CSV from: https://raw.githubusercontent.com/emmamorlock/workshop/refs/heads/main/exercices/handouts/biblissima.csv
    • Observe the columns: ---- (Name), coord, Type, YearBirth, YearDeath.
    • Note: There is no explicit “Author” or “Repository” column. The Type column identifies if a row is a person, a place, a descriptor, or a shelfmark.
  2. Filter Non-Person Entities:
    • The dataset contains mixed types. We want to focus on cleaning and reconciling persons.
    • Click the Type column > Facet > Text facet.
    • In the left panel, uncheck descriptor, place, and shelfmark. Only person should remain checked.
    • Result: The grid now shows only rows representing people.
  3. Clean Name Variations:
    • Look at the ---- (Name) column. Notice inconsistencies like ” Petrarch” (leading spaces) or “Charles I” (multiple spaces).
    • Apply Edit cells > Common transforms > Trim leading and trailing whitespace.
    • Apply Edit cells > Common transforms > To titlecase to standardize capitalization.
    • Optional: Use Edit cells > Common transforms > Collapse consecutive whitespace if available, or use Transform with value.replace(/\s+/, " ").
  4. Reconcile with Biblissima:
    • Now that names are cleaner, connect them to the official Biblissima authority file.
    • Click the ---- (Name) column > Reconcile > Start reconciling…
    • If Biblissima is not in the list, click Add standard service… and enter: https://data.biblissima.fr/reconcile/en/api
    • Select the Biblissima service, choose the type Human and add relevant properties to help in the reconciliation.
    • Click Start Reconciling.
  5. Extract Reconciled Data:
    • Use Reconcile > Add column with URLs of match entites… to extract sthe Biblissima URI
    • This creates a persistent link between your local messy data and the authoritative record.
TipHints for Part C
  • For Step 11: Faceting allows you to temporarily hide rows you don’t need. If you want to permanently delete non-person rows, you can use Edit rows > Remove all matching rows while the facet is active.
  • For Step 12: Always trim whitespace before reconciling. ” Charles” and “Charles” are treated as different strings by the reconciliation service, reducing your match rate.
  • For Step 13: Reconciliation is not just matching; it disambiguates. “Charles I” could be the King of Portugal or England. The service provides candidates so you can choose the correct historical figure based on the context (dates).
  • For Step 14: The extracted URI (e.g., http://data.biblissima.fr/...) is a stable link you can use to fetch more data or link to other datasets (Linked Open Data).

More information and tutorials on OpenRefine can be found on the following links:

  • Library Carpentry website: https://librarycarpentry.org/lc-open-refine/
  • University of Nevada Las Vegas: https://guides.library.unlv.edu/open-refine/getting-started
  • University of Illinois Urbana-Champaign: https://guides.library.illinois.edu/openrefine

References

Back to top

Reuse

CC BY 4.0

Julien A. Raemy | Introduction to Open Data

 
  • Edit this page
  • Report an issue

Content is published under a Creative Commons Attribution 4.0 International licence