Introduction to Open Data

Exercise 6: OpenRefine

Author: Julien A. Raemy

Affiliations: docuteam SA; University of Bern

Published: February 1, 2026

Modified: February 1, 2026

OpenRefine

A. Getting started

  1. Install the software (https://openrefine.org/docs)
  2. Run it locally (accessible at http://127.0.0.1:3333/)
  3. Have a look at the different pages and functionalities
  4. Create a new project by importing any supported files

B. Basic Cleaning & Faceting with the Powerhouse Museum Dataset

  1. Create a new project by importing the Powerhouse Museum TSV file (https://zenodo.org/records/17047254/files/phm_collection_adapted.tsv) — remember to select “Tab-separated values”
  2. Whitespace detection: Apply Facet > Customized facets > Facet by blank to the categories column, then apply Edit cells > Common transforms > Trim leading and trailing whitespace. Observe how cells that contained only whitespace now show up as blank in the facet.
  3. Clustering: Create a Text facet on the categories column. Click Cluster and use the “key collision” method to merge case inconsistencies (e.g., “Pottery” vs “pottery”).
  4. Multi-valued cells: Split the categories column on the | (pipe) separator using Edit cells > Split multi-valued cells. Count how many objects have more than 3 categories, then rejoin the cells with Edit cells > Join multi-valued cells.
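
To make the logic behind steps 3 and 4 more concrete, here is a minimal Python sketch of what key collision clustering and multi-valued counting amount to. It is only an approximation written for illustration: the fingerprint function imitates the general idea of OpenRefine's fingerprint keying (trim, lowercase, strip accents and punctuation, sort unique tokens) and is not its actual implementation. The function names and sample values are assumptions, not part of the dataset.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Rough approximation of a fingerprint key (not OpenRefine's exact code)."""
    # Trim, lowercase, and decompose accented characters
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop the combining marks left over from the decomposition
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate, sort, and rejoin
    return " ".join(sorted(set(text.split())))


def cluster_values(values):
    """Group values whose keys collide; keep only groups with several variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


def count_categories(cell: str, sep: str = "|") -> int:
    """Number of non-empty values in a multi-valued cell."""
    return len([part for part in cell.split(sep) if part.strip()])


# Illustrative values only (not taken from the Powerhouse Museum file)
categories = ["Pottery", "pottery ", "Glass"]
cells = ["Pottery|Glass", "Ceramics|Pottery|Glass|Tools"]

print(cluster_values(categories))
print(sum(count_categories(cell) > 3 for cell in cells))
```

Variants such as “Pottery” and “pottery ” end up under the same key, which is why the key collision method proposes merging them. Counting the parts of each cell before splitting gives the same figure as counting rows per object after the split.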

C. Authority Control & Reconciliation with the Metropolitan Museum of Art

  1. Create a new project by importing the Met Museum CSV (https://github.com/metmuseum/openaccess) — use the first 5,000 rows for better performance
  2. Reconcile artists: Select the Artist Display Name column. Start reconciling against Wikidata (service: https://wikidata.reconci.link/en/api, type: Q5 for humans).
  3. Judgment exercise: Auto-match high confidence scores (>90), manually review medium scores (70-90) to distinguish between homonyms (e.g., different “John Smith” artists), and mark low scores (<70) as “None”.
  4. Extract QIDs: Transform the reconciled cells (or add a new column based on them) to keep just the QID for external linking, e.g., with the GREL expression cell.recon.match.id.
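
The judgment rule in step 3 and the QID extraction in step 4 can be sketched in Python as follows. This is a hedged illustration, not how OpenRefine works internally: the thresholds come from the exercise text, the candidate dictionaries only assume that each candidate carries an id, a name, and a score, and the scores and the “John Smith” placeholder IDs are invented.

```python
import re

HIGH, LOW = 90, 70  # thresholds taken from the exercise text


def triage(candidates, high=HIGH, low=LOW):
    """Classify the best-scoring candidate as 'match', 'review', or 'none'."""
    if not candidates:
        return "none", None
    best = max(candidates, key=lambda c: c["score"])
    if best["score"] > high:
        return "match", best      # high confidence: auto-match
    if best["score"] >= low:
        return "review", best     # medium confidence: check for homonyms by hand
    return "none", None           # low confidence: leave unmatched


def extract_qid(identifier):
    """Return the trailing QID of a Wikidata ID or entity URI, or None."""
    found = re.search(r"Q\d+$", identifier)
    return found.group(0) if found else None


# Invented scores; the two "John Smith" IDs are placeholders, not real entities
van_gogh = [{"id": "Q5582", "name": "Vincent van Gogh", "score": 100}]
john_smith = [
    {"id": "Q0000001", "name": "John Smith", "score": 82},
    {"id": "Q0000002", "name": "John Smith", "score": 79},
]

print(triage(van_gogh)[0])
print(triage(john_smith)[0])
print(extract_qid("http://www.wikidata.org/entity/Q5582"))
```

A score of exactly 90 falls into the review band here, which is one possible reading of the “>90” and “70-90” ranges given above.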

D. Manuscript Reconciliation with Biblissima

  1. Create a new project by importing the manuscript CSV from https://raw.githubusercontent.com/emmamorlock/workshop/refs/heads/main/exercices/handouts/biblissima.csv
  2. Explore manuscript metadata: Review columns such as shelfmark, repository, author, and title. Create text facets to identify variations in repository names or author attributions.
  3. Reconcile shelfmarks: Use the Biblissima reconciliation endpoint (https://data.biblissima.fr/api/reconcile, type: Manuscript) to match records against the Biblissima authority file. This connects your local shelfmarks to persistent URIs.
  4. Enrich with IIIF (optional extension): After reconciliation, use Edit column > Add column by fetching URLs to retrieve JSON data from the reconciled Biblissima URIs. Parse the JSON to extract iiif_manifest_url fields where available.

Cf. Sajdak (2024).
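
For the optional IIIF step, the parsing part can be sketched in Python. The structure of the JSON returned by Biblissima is not described here, so the sketch does not assume any particular shape: it simply searches a parsed document recursively for every value stored under a given key. The key name iiif_manifest_url comes from the exercise text; the sample payload and its example.org URL are hypothetical.

```python
import json


def find_values(node, key):
    """Recursively collect every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending: the key may also occur deeper in the tree
            found.extend(find_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found


# Hypothetical payload, standing in for the text fetched into a new column
fetched = """
{
  "label": "Sample manuscript",
  "digitisations": [
    {"iiif_manifest_url": "https://example.org/iiif/ms-1/manifest"},
    {"note": "no manifest available"}
  ]
}
"""

manifests = find_values(json.loads(fetched), "iiif_manifest_url")
print(manifests)
```

Records without any manifest yield an empty list, which corresponds to the “where available” caveat in step 4.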

E. API Integration & JSON with the Smithsonian

  1. Create a new project by importing a Smithsonian JSON file (not CSV) from https://github.com/Smithsonian/OpenAccess
  2. Flatten nested JSON: Convert nested paths (e.g., title.content) to flat columns using a transform expression that follows the same path: value.parseJson()['title']['content']
  3. Extract media: Parse the JSON to find objects with images by extracting online_media.mediaCount
  4. Reconcile topics: Match topic fields against Library of Congress Subject Headings (LCSH) using their reconciliation service.
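
Steps 2 and 3 amount to turning a nested JSON record into flat, dotted column names and then reading one of them. The Python sketch below illustrates that idea outside OpenRefine. The sample record is a simplified, hypothetical stand-in that only reuses the title.content and online_media.mediaCount paths mentioned above; real Smithsonian records are larger and may nest these fields differently.

```python
def flatten(node, prefix=""):
    """Flatten nested dicts and lists into a single dict keyed by dotted paths."""
    flat = {}
    if isinstance(node, dict):
        for name, value in node.items():
            flat.update(flatten(value, f"{prefix}{name}."))
    elif isinstance(node, list):
        # List items are addressed by their position, e.g. media.0.type
        for index, item in enumerate(node):
            flat.update(flatten(item, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot from the accumulated path
        flat[prefix.rstrip(".")] = node
    return flat


# Simplified, hypothetical record (not copied from the Smithsonian repository)
record = {
    "title": {"content": "Sample object"},
    "online_media": {"mediaCount": 2, "media": [{"type": "Images"}]},
}

columns = flatten(record)
has_images = columns.get("online_media.mediaCount", 0) > 0

print(columns)
print(has_images)
```

Each key of the resulting dictionary plays the role of a flat column, and records lacking online_media.mediaCount are treated as having no images.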

F. A step further…

  1. Export your cleaned datasets in both CSV and JSON formats
  2. Export each project as an OpenRefine project archive, which includes the operation history, to document your transformation steps

More information and tutorials on OpenRefine can be found at the following links:

  • Library Carpentry website: https://librarycarpentry.org/lc-open-refine/
  • University of Nevada Las Vegas: https://guides.library.unlv.edu/open-refine/getting-started
  • University of Illinois Urbana-Champaign: https://guides.library.illinois.edu/openrefine

References

Sajdak, C. (2024). Réconcilier avec OpenRefine. Biblissima+. https://doc.biblissima.fr/api/openrefine/

Reuse

CC BY 4.0
