Exercise 7: OpenRefine
A. Getting started
- Install the software (https://openrefine.org/docs)
- Run it locally (usually accessible at http://127.0.0.1:3333/)
- Have a look at the different pages and functionalities
- Create a new project by importing a file in any supported format
Hints for Part A
- OpenRefine runs in your browser but is a local application — no data leaves your machine.
- Try importing a small CSV or Excel file you already have to get familiar with the interface before moving on.
B. Basic Cleaning & Faceting with the Powerhouse Museum Dataset
- Create a new project by importing the Powerhouse Museum TSV file (https://zenodo.org/records/17047254/files/phm_collection_adapted.tsv) — remember to select “Tab-separated values”
- Whitespace detection: Use Facet by blank on the categories column, then apply Edit cells > Common transforms > Trim leading and trailing whitespace. Observe how “blank” cells are revealed.
- Clustering: Create a Text facet on the categories column. Click Cluster and use the “key collision” method to merge case inconsistencies (e.g., “Pottery” vs “pottery”).
- Multi-valued cells: Split the categories column by the separator | (pipe character). Count how many objects have more than 3 categories, then rejoin the cells.
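The count asked for in the multi-valued step can be cross-checked outside OpenRefine. Below is a minimal Python sketch, assuming a small hand-made sample rather than the real Powerhouse data; the function name count_categories and the sample rows are hypothetical.

```python
def count_categories(cell, separator="|"):
    """Return the number of non-empty values in a multi-valued cell."""
    if cell is None:
        return 0
    # Trim each part so that stray whitespace does not create phantom values.
    parts = [part.strip() for part in cell.split(separator)]
    return len([part for part in parts if part])


# Hypothetical sample rows, not taken from the Powerhouse dataset.
sample_rows = [
    {"id": "1", "categories": "Pottery|Ceramics|Vases|Glazes"},
    {"id": "2", "categories": "Coins"},
    {"id": "3", "categories": "  "},
    {"id": "4", "categories": "Textiles|Lace|Clothing|Costume|Dress"},
]

# Objects with more than 3 categories, as asked in the exercise.
over_three = [row["id"] for row in sample_rows
              if count_categories(row["categories"]) > 3]
print(over_three)
```

A whitespace-only cell counts as zero categories here, which is why trimming before counting matters.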
Hints for Part B
- For the whitespace detection step, the blank facet is under Facet > Customized facets > Facet by blank.
- For the clustering step, try different clustering methods (fingerprint, ngram): they catch different types of inconsistencies.
- For the multi-valued cells step, use Edit cells > Split multi-valued cells to split, then Edit cells > Join multi-valued cells to rejoin.
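To see why key collision clustering merges variants such as “Pottery” and “pottery”, it can help to compute a fingerprint key by hand. The sketch below is a simplified approximation in Python, not OpenRefine's exact implementation; the function names and sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified fingerprint key: an approximation, not OpenRefine's exact code."""
    text = value.strip().lower()
    # Fold accented characters to their ASCII base letters.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation, then sort and de-duplicate the tokens.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a key; keep only groups with several variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items()
            if len(variants) > 1}


# Hypothetical category values with case, spacing and word-order variants.
sample = ["Pottery", "pottery", " Pottery ", "Glass plates", "plates, glass", "Coins"]
clusters = cluster(sample)
print(clusters)
```

Values that produce the same key land in the same cluster, so case, surrounding whitespace, punctuation and word order stop mattering. A value with no variant, such as “Coins”, is not reported.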
D. Manuscript Reconciliation with Biblissima
- Create a new project by importing the manuscript CSV from https://raw.githubusercontent.com/emmamorlock/workshop/refs/heads/main/exercices/handouts/biblissima.csv
- Explore manuscript metadata: Review columns such as shelfmark, repository, author, and title. Create text facets to identify variations in repository names or author attributions.
- Reconcile shelfmarks: Use the Biblissima reconciliation endpoint (https://data.biblissima.fr/api/reconcile, type: Manuscript) to match records against the Biblissima authority file. This connects your local shelfmarks to persistent URIs.
- Enrich with IIIF (optional extension): After reconciliation, use Edit column > Add column by fetching URLs to retrieve JSON data from the reconciled Biblissima URIs. Parse the JSON to extract iiif_manifest_url fields where available.
Cf. Sajdak (2024).
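Behind the scenes, reconciliation services of this kind generally receive a batch of queries as JSON in a form field named "queries". The Python sketch below only builds such a request body and sends nothing. It assumes the endpoint follows the common Reconciliation Service API conventions and that "Manuscript" is accepted as a type identifier, which may not match Biblissima's actual identifiers. The shelfmarks and the function name are illustrative.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the exercise text above.
ENDPOINT = "https://data.biblissima.fr/api/reconcile"


def build_queries(shelfmarks, type_id="Manuscript", limit=3):
    """Build one query object per shelfmark, keyed q0, q1, ..."""
    return {
        f"q{index}": {"query": shelfmark, "type": type_id, "limit": limit}
        for index, shelfmark in enumerate(shelfmarks)
    }


# Illustrative shelfmarks; replace them with values from your own column.
queries = build_queries(["Latin 1", "Français 2"])

# The whole batch travels as a single form field holding JSON text.
body = urlencode({"queries": json.dumps(queries)})
print(body)
```

OpenRefine builds and sends comparable batches for you. The sketch is only meant to show what the service is asked.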
Hints for Part D
- The Biblissima endpoint works like Wikidata’s — add it the same way and select the “Manuscript” type.
- For the IIIF enrichment step, the fetched JSON may contain an iiif_manifest_url key; use value.parseJson()['iiif_manifest_url'] to extract it.
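The same extraction can be rehearsed in Python to see what happens when the key is missing or the cell is not valid JSON. This is a minimal sketch: the sample responses are hypothetical and the real Biblissima JSON may be shaped differently.

```python
import json


def extract_manifest_url(cell):
    """Return the iiif_manifest_url value, or None if it cannot be read."""
    if not cell:
        return None
    try:
        data = json.loads(cell)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return data.get("iiif_manifest_url")


# Hypothetical responses, written by hand for this example.
with_manifest = '{"label": "Example", "iiif_manifest_url": "https://example.org/iiif/manifest.json"}'
without_manifest = '{"label": "Example"}'

print(extract_manifest_url(with_manifest))
print(extract_manifest_url(without_manifest))
print(extract_manifest_url("not json"))
```

Only the first call returns a URL. The other two return None, which corresponds to the rows where no manifest is available.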
E. API Integration & JSON with the Smithsonian
- Create a new project by importing a Smithsonian JSON file (not CSV) from https://github.com/Smithsonian/OpenAccess
- Flatten nested JSON: Convert nested paths (e.g., title.content) to flat columns using a transform expression such as value.parseJson()['title']['content'].
- Extract media: Parse the JSON to find objects with images by extracting online_media.mediaCount.
- Reconcile topics: Match topic fields against Library of Congress Subject Headings (LCSH) using their reconciliation service.
Hints for Part E
- When importing JSON, OpenRefine lets you select the record path — pick the array that contains the collection objects.
- For the flattening step, use value.parseJson() in Edit cells > Transform to navigate nested structures.
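Flattening means turning each nested path into one column name. The Python sketch below illustrates the idea on a hypothetical record loosely modelled on the paths named in this part; it is not the actual Smithsonian schema.

```python
import json


def flatten(node, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            # Lists and scalar values are kept as they are.
            flat[path] = value
    return flat


# Hypothetical record, written by hand for this example.
record = json.loads("""
{
  "title": {"content": "Example object"},
  "online_media": {"mediaCount": 2},
  "topic": ["Ceramics", "Pottery"]
}
""")

flat = flatten(record)
print(flat["title.content"])

# An object "has images" here when its media count is above zero.
has_images = flat.get("online_media.mediaCount", 0) > 0
print(has_images)
```

Each dotted key, such as title.content, corresponds to one flat column in the project.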
F. A step further…
- Export your cleaned datasets in both CSV and JSON formats
- Export the project history (OpenRefine project archive) to document your transformation steps
Hints for Part F
- Use Export > Comma-separated value and Export > Templating (for JSON) respectively.
- The project archive is under Export > OpenRefine project archive to file — this saves all your transformation history.
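To compare the two export formats side by side, the following Python sketch serialises the same rows as CSV and as JSON. The rows are hypothetical, and the "rows" wrapper in the JSON output is an assumption chosen for illustration, not a guaranteed match for what your templating export produces.

```python
import csv
import io
import json

# Hypothetical cleaned rows standing in for an OpenRefine project.
rows = [
    {"id": "1", "categories": "Pottery|Ceramics"},
    {"id": "2", "categories": "Coins"},
]


def to_csv(records):
    """Serialise records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialise records as JSON under a top-level "rows" array."""
    return json.dumps({"rows": records}, indent=2, ensure_ascii=False)


csv_text = to_csv(rows)
json_text = to_json(rows)
print(csv_text)
print(json_text)
```

CSV keeps one flat line per row, while JSON can carry the same records inside a named structure.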
More information and tutorials on OpenRefine can be found at the following links:
- Library Carpentry website: https://librarycarpentry.org/lc-open-refine/
- University of Nevada Las Vegas: https://guides.library.unlv.edu/open-refine/getting-started
- University of Illinois Urbana-Champaign: https://guides.library.illinois.edu/openrefine
References
Sajdak, C. (2024). Réconcilier avec OpenRefine. Biblissima+. https://doc.biblissima.fr/api/openrefine/