Exercise 6: OpenRefine
A. Getting started
- Install the software (https://openrefine.org/docs)
- Run it locally (accessible at http://127.0.0.1:3333/)
- Have a look at the different pages and functionalities
- Create a new project by importing any supported file
B. Basic Cleaning & Faceting with the Powerhouse Museum Dataset
- Create a new project by importing the Powerhouse Museum TSV file (https://zenodo.org/records/17047254/files/phm_collection_adapted.tsv). Remember to select “Tab-separated values” as the import format.
- Whitespace detection: Use Facet by blank on the categories column, then apply Edit cells > Common transforms > Trim leading and trailing whitespace. Observe how “blank” cells are revealed.
- Clustering: Create a Text facet on the categories column. Click Cluster and use the “key collision” method to merge case inconsistencies (e.g., “Pottery” vs “pottery”).
- Multi-valued cells: Split the categories column by the separator | (pipe character). Count how many objects have more than 3 categories, then rejoin the cells.
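The three cleaning steps above can also be sketched outside OpenRefine to see what each one does to the data. The Python below is a minimal illustration, not OpenRefine's own code. The `fingerprint` function is a simplified approximation of the keying behind “key collision” clustering (it skips accent normalization). The blank test is an assumption about the facet's behaviour, and the sample cells are invented rather than taken from the Powerhouse Museum file.

```python
import string
from collections import defaultdict


def trim(value):
    """Mimic 'Trim leading and trailing whitespace'; non-strings pass through."""
    return value.strip() if isinstance(value, str) else value


def is_blank(value):
    """Assumed blank test: only None and the empty string count as blank."""
    return value is None or value == ""


def fingerprint(value):
    """Simplified key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group distinct values sharing a key; keep only groups of two or more."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]


def count_many_categories(cells, separator="|", threshold=3):
    """Count cells holding more than `threshold` separator-delimited values."""
    count = 0
    for cell in cells:
        parts = [p for p in cell.split(separator) if p.strip()]
        if len(parts) > threshold:
            count += 1
    return count


# Invented sample cells (hypothetical, for illustration only).
sample = ["Pottery", "pottery ", "  ", "Glass|Pottery|Toys|Coins", "Glass|Toys"]
trimmed = [trim(v) for v in sample]

print(sum(is_blank(v) for v in trimmed))                 # blanks after trimming
print(cluster([v for v in trimmed if not is_blank(v)]))  # case variants grouped
print(count_many_categories(trimmed))                    # cells with > 3 values
```

In this sketch the whitespace-only cell only counts as blank after trimming, which is the effect the Facet by blank step is meant to make visible.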
D. Manuscript Reconciliation with Biblissima
- Create a new project by importing the manuscript CSV from https://raw.githubusercontent.com/emmamorlock/workshop/refs/heads/main/exercices/handouts/biblissima.csv
- Explore manuscript metadata: Review columns such as shelfmark, repository, author, and title. Create text facets to identify variations in repository names or author attributions.
- Reconcile shelfmarks: Use the Biblissima reconciliation endpoint (https://data.biblissima.fr/api/reconcile, type: Manuscript) to match records against the Biblissima authority file. This connects your local shelfmarks to persistent URIs.
- Enrich with IIIF (optional extension): After reconciliation, use Edit column > Add column by fetching URLs to retrieve JSON data from the reconciled Biblissima URIs. Parse the JSON to extract iiif_manifest_url fields where available.
Cf. Sajdak (2024).
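To make the reconciliation step less of a black box, the sketch below shows roughly what a client such as OpenRefine sends to a reconciliation endpoint and how a response might be read. It assumes the service follows the common Reconciliation Service API batch format (a form field named `queries` holding a JSON object keyed `q0`, `q1`, ...). The type value `Manuscript` and the field name `iiif_manifest_url` are taken from the steps above. The shelfmarks, identifiers, scores, and response shapes are invented, and no network call is made.

```python
import json
from urllib.parse import urlencode, parse_qs

# Endpoint as given in the exercise; it is only shown here, never contacted.
ENDPOINT = "https://data.biblissima.fr/api/reconcile"


def build_queries(shelfmarks, type_id="Manuscript", limit=3):
    """Build a query batch keyed q0, q1, ... (assumed batch format)."""
    return {
        f"q{i}": {"query": mark, "type": type_id, "limit": limit}
        for i, mark in enumerate(shelfmarks)
    }


def encode_request_body(queries):
    """Form-encode the batch as a POST body of the form queries=<JSON>."""
    return urlencode({"queries": json.dumps(queries, ensure_ascii=False)})


def best_candidate(result, min_score=0.0):
    """Return the top-scoring candidate of one query result, or None."""
    candidates = [c for c in result.get("result", []) if c.get("score", 0) >= min_score]
    return max(candidates, key=lambda c: c["score"]) if candidates else None


def find_field(obj, field="iiif_manifest_url"):
    """Recursively collect every value stored under `field` in parsed JSON."""
    found = []
    if isinstance(obj, dict):
        for key, val in obj.items():
            if key == field:
                found.append(val)
            else:
                found.extend(find_field(val, field))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_field(item, field))
    return found


# Hypothetical shelfmarks and a hypothetical response, for illustration only.
queries = build_queries(["Latin 1", "Français 2"])
body = encode_request_body(queries)
print(json.loads(parse_qs(body)["queries"][0]) == queries)  # round trip check

fake_response = {"q0": {"result": [
    {"id": "X1", "name": "Latin 1", "score": 98.0, "match": True},
    {"id": "X2", "name": "Latin 10", "score": 61.5, "match": False},
]}}
print(best_candidate(fake_response["q0"])["id"])

fake_record = {"data": {"iiif_manifest_url": "https://example.org/manifest.json"}}
print(find_field(fake_record))
```

The recursive search in `find_field` is a deliberate choice: since the exact shape of the JSON returned for a reconciled URI is not specified here, it avoids hard-coding a path.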
E. API Integration & JSON with the Smithsonian
- Create a new project by importing a Smithsonian JSON file (not CSV) from https://github.com/Smithsonian/OpenAccess
- Flatten nested JSON: Convert nested paths (e.g., title.content) to flat columns using a transform expression such as value.parseJson()['title']['content']
- Extract media: Parse the JSON to find objects with images by extracting online_media.mediaCount
- Reconcile topics: Match topic fields against Library of Congress Subject Headings (LCSH) using their reconciliation service.
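The flattening idea behind these steps can be illustrated in a few lines of Python. This is a minimal sketch, not the Smithsonian schema: the record below is invented and only borrows the key names mentioned above (title.content, online_media.mediaCount, topic). Real Open Access records may nest these keys more deeply.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into a {dotted.path: value} mapping."""
    flat = {}
    if isinstance(obj, dict):
        for key, val in obj.items():
            flat.update(flatten(val, f"{prefix}.{key}" if prefix else key))
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            flat.update(flatten(val, f"{prefix}.{i}" if prefix else str(i)))
    else:
        flat[prefix] = obj
    return flat


def has_images(flat_record, path="online_media.mediaCount"):
    """True when the media count at `path` is present and greater than zero."""
    return (flat_record.get(path) or 0) > 0


# Hypothetical record, shaped only to match the paths named in the exercise.
raw = """
{
  "title": {"label": "Title", "content": "Teapot"},
  "online_media": {"mediaCount": 2},
  "topic": ["Ceramics", "Tea"]
}
"""
flat = flatten(json.loads(raw))

print(flat["title.content"])   # one flat column per nested path
print(has_images(flat))        # records with at least one media item
print([v for k, v in flat.items() if k.startswith("topic.")])  # values to reconcile
```

Each dotted key corresponds to one column that would be created in OpenRefine, and list items get a numeric suffix so that multi-valued fields such as topic stay distinguishable.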
F. A step further…
- Export your cleaned datasets in both CSV and JSON formats
- Export the project history (OpenRefine project archive) to document your transformation steps
More information and tutorials on OpenRefine can be found at the following links:
- Library Carpentry website: https://librarycarpentry.org/lc-open-refine/
- University of Nevada Las Vegas: https://guides.library.unlv.edu/open-refine/getting-started
- University of Illinois Urbana-Champaign: https://guides.library.illinois.edu/openrefine
References
Sajdak, C. (2024). Réconcilier avec OpenRefine. Biblissima+. https://doc.biblissima.fr/api/openrefine/