Introduction to Open Data

Exercise 7: OpenRefine

Author: Julien A. Raemy
Affiliations: docuteam SA; University of Bern
Published: February 1, 2026
Modified: February 17, 2026

OpenRefine

A. Getting started

  1. Install the software (https://openrefine.org/docs)
  2. Run it locally (usually accessible at http://127.0.0.1:3333/)
  3. Have a look at the different pages and functionalities
  4. Create a new project by importing any supported files
Tip: Hints for Part A
  • OpenRefine runs in your browser but is a local application — no data leaves your machine.
  • Try importing a small CSV or Excel file you already have to get familiar with the interface before moving on.

B. Basic Cleaning & Faceting with the Powerhouse Museum Dataset

  1. Create a new project by importing the Powerhouse Museum TSV file (https://zenodo.org/records/17047254/files/phm_collection_adapted.tsv) — remember to select “Tab-separated values”
  2. Whitespace detection: Use Facet by blank on the categories column, then apply Edit cells > Common transforms > Trim leading and trailing whitespace. Observe how “blank” cells are revealed.
  3. Clustering: Create a Text facet on the categories column. Click Cluster and use the “key collision” method to merge case inconsistencies (e.g., “Pottery” vs “pottery”).
  4. Multi-valued cells: Split the categories column by the separator | (pipe character). Count how many objects have more than 3 categories, then rejoin the cells.
Tip: Hints for Part B
  • For step 2, the blank facet is under Facet > Customized facets > Facet by blank.
  • For step 3, try different key collision keying functions (fingerprint, n-gram fingerprint); they catch different types of inconsistencies.
  • For step 4, use Edit cells > Split multi-valued cells and then Join multi-valued cells to rejoin.
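To see why key collision clustering merges “Pottery” and “pottery”, the idea can be sketched outside OpenRefine. The Python below is only an approximation of a fingerprint keying function (trim, lowercase, fold accents, strip punctuation, sort and deduplicate tokens), not OpenRefine's own implementation; the function names and sample values are made up for illustration. A small helper also mirrors the "more than 3 categories" count from the multi-valued cells step.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Approximate a fingerprint key: normalise, tokenise, deduplicate, sort."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII letters.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation so "Ceramics," and "Ceramics" yield the same token.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values whose keys collide; keep groups with several distinct spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


def count_multi_valued(cells, separator="|", threshold=3):
    """Count cells holding more than `threshold` separator-delimited values."""
    return sum(
        1
        for cell in cells
        if len([part for part in cell.split(separator) if part.strip()]) > threshold
    )


if __name__ == "__main__":
    # Hypothetical category values, not taken from the dataset.
    print(cluster(["Pottery", "pottery", " Pottery ", "Glass"]))
    print(count_multi_valued(["a|b|c|d", "a|b", "a|b|c"]))
```

Values that share a key end up in the same cluster, which is what the Cluster dialog then offers to merge into one spelling.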

C. Authority Control & Reconciliation with the Metropolitan Museum of Art

  1. Create a new project by importing the Met Museum CSV (https://github.com/metmuseum/openaccess) — use the first 5,000 rows for better performance
  2. Reconcile artists: Select the Artist Display Name column. Start reconciling against Wikidata (service: https://wikidata.reconci.link/en/api, type: Q5 for humans).
  3. Judgment exercise: Auto-match high confidence scores (>90), manually review medium scores (70-90) to distinguish between homonyms (e.g., different “John Smith” artists), and mark low scores (<70) as “None”.
  4. Extract QIDs: Transform reconciled cells to extract just the QID number for external linking.
Tip: Hints for Part C
  • To add a reconciliation service: Reconcile > Start reconciling > Add Standard Service and paste the URL.
  • For step 4, use Edit cells > Transform with GREL: cell.recon.match.id
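The judgment exercise in Part C amounts to a three-way triage on the candidate score. The Python sketch below restates those thresholds and shows what a batch of reconciliation queries might look like, assuming the service accepts a JSON object keyed q0, q1, ... with query, type and limit fields, as in the Reconciliation Service API. The helper names are hypothetical and nothing is sent over the network.

```python
import json


def build_queries(names, type_id="Q5", limit=3):
    """Build a reconciliation query batch keyed q0, q1, ... for a list of names."""
    return {
        f"q{i}": {"query": name, "type": type_id, "limit": limit}
        for i, name in enumerate(names)
    }


def triage(score, auto=90, review=70):
    """Apply the exercise thresholds to a candidate score.

    Above `auto`: accept the match. Between `review` and `auto` (inclusive):
    review manually. Below `review`: treat as no match.
    """
    if score > auto:
        return "match"
    if score >= review:
        return "review"
    return "none"


if __name__ == "__main__":
    # Serialised batch, as it could be passed in a `queries` parameter (assumption).
    print(json.dumps(build_queries(["John Smith", "Vincent van Gogh"]), indent=2))
    for score in (97.5, 82, 41):
        print(score, triage(score))
```

Scores are only a ranking aid: a homonym can score high, so the "review" band is where dates, nationality and other columns should be checked before accepting a candidate.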

D. Manuscript Reconciliation with Biblissima

  1. Create a new project by importing the manuscript CSV from https://raw.githubusercontent.com/emmamorlock/workshop/refs/heads/main/exercices/handouts/biblissima.csv
  2. Explore manuscript metadata: Review columns such as shelfmark, repository, author, and title. Create text facets to identify variations in repository names or author attributions.
  3. Reconcile shelfmarks: Use the Biblissima reconciliation endpoint (https://data.biblissima.fr/api/reconcile, type: Manuscript) to match records against the Biblissima authority file. This connects your local shelfmarks to persistent URIs.
  4. Enrich with IIIF (optional extension): After reconciliation, use Edit column > Add column by fetching URLs to retrieve JSON data from the reconciled Biblissima URIs. Parse the JSON to extract iiif_manifest_url fields where available.

Cf. Sajdak (2024).

Tip: Hints for Part D
  • The Biblissima endpoint works like Wikidata’s: add it the same way and select the “Manuscript” type.
  • For step 4, the fetched JSON may contain an iiif_manifest_url key; use value.parseJson()['iiif_manifest_url'] to extract it.
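The optional IIIF enrichment boils down to parsing each fetched response and reading one key when it is present. As a rough Python equivalent of the GREL expression, assuming the response is a JSON object with an optional top-level iiif_manifest_url key (the sample strings and URL below are invented for illustration):

```python
import json


def extract_manifest_url(raw):
    """Return the iiif_manifest_url from a fetched JSON string, or None."""
    if not raw:
        return None  # empty cell: the fetch returned nothing
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the response was not valid JSON
    if not isinstance(data, dict):
        return None  # e.g. a JSON array instead of an object
    return data.get("iiif_manifest_url")


if __name__ == "__main__":
    # Hypothetical responses; real Biblissima payloads may be structured differently.
    with_manifest = '{"iiif_manifest_url": "https://example.org/iiif/ms-1/manifest.json"}'
    without_manifest = '{"label": "MS 1"}'
    print(extract_manifest_url(with_manifest))
    print(extract_manifest_url(without_manifest))
```

Returning None for missing or malformed responses mirrors what happens in OpenRefine when a record has no manifest: the new column simply stays blank for that row.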

E. API Integration & JSON with the Smithsonian

  1. Create a new project by importing a Smithsonian JSON file (not CSV) from https://github.com/Smithsonian/OpenAccess
  2. Flatten nested JSON: Convert nested paths to flat columns with a transform expression, e.g. value.parseJson()['title'] for a top-level key, or value.parseJson()['title']['content'] for a nested path such as title.content
  3. Extract media: Parse the JSON to find objects with images by extracting online_media.mediaCount
  4. Reconcile topics: Match topic fields against Library of Congress Subject Headings (LCSH) using their reconciliation service.
Tip: Hints for Part E
  • When importing JSON, OpenRefine lets you select the record path; pick the array that contains the collection objects.
  • For step 2, use value.parseJson() in Edit cells > Transform to navigate nested structures.
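Flattening a nested path such as title.content means walking the parsed JSON one key at a time. The Python sketch below shows that walk, together with a check on online_media.mediaCount. The sample record is a simplified, hypothetical stand-in shaped after the paths named in this exercise; actual Smithsonian records are likely nested more deeply, so paths would need adjusting.

```python
import json


def get_path(record, dotted_path, default=None):
    """Follow a dotted path (e.g. 'title.content') through nested dicts."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default  # path does not exist in this record
        current = current[key]
    return current


def has_images(record):
    """True when the record reports at least one media item."""
    count = get_path(record, "online_media.mediaCount", 0)
    return isinstance(count, int) and count > 0


if __name__ == "__main__":
    # Hypothetical, simplified record for illustration only.
    sample = json.loads(
        '{"title": {"content": "Example object"}, "online_media": {"mediaCount": 2}}'
    )
    print(get_path(sample, "title.content"))
    print(has_images(sample))
```

The same logic in GREL is a chain of bracket lookups on value.parseJson(), one bracket per path segment.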

F. A step further…

  1. Export your cleaned datasets in both CSV and JSON formats
  2. Export the project history (OpenRefine project archive) to document your transformation steps
Tip: Hints for Part F
  • Use Export > Comma-separated value and Export > Templating (for JSON) respectively.
  • The project archive is under Export > OpenRefine project archive to file — this saves all your transformation history.

More information and tutorials on OpenRefine can be found at the following links:

  • Library Carpentry website: https://librarycarpentry.org/lc-open-refine/
  • University of Nevada Las Vegas: https://guides.library.unlv.edu/open-refine/getting-started
  • University of Illinois Urbana-Champaign: https://guides.library.illinois.edu/openrefine

References

Sajdak, C. (2024). Réconcilier avec OpenRefine. Biblissima+. https://doc.biblissima.fr/api/openrefine/

Reuse

CC BY 4.0

Content is published under a Creative Commons Attribution 4.0 International licence