Characteristics of Open Data

Definitions, History, Licences & Technical Means

Julien A. Raemy (docuteam SA; University of Bern)
HES-SO University of Applied Sciences and Arts Western Switzerland
HEG-GE Bachelor Information Science

Definitions: Open(ness)

Open means anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness).

Open Knowledge Foundation: The Open Definition

  • No limitations on access: Accessibility must not depend on cost, authentication, or privileges
  • Free and open: Open knowledge should be shared freely, without cost or barriers
  • Knowledge funded by public mandates must benefit the public without restrictions

Definitions: Data & Metadata

Data

Data, at its most basic level, is the absence of uniformity, whether in the real world or in some symbolic system. Only once such data have some recognisable structure and are given some meaning can they be considered information.

Floridi (2010)

Metadata

Data whose purpose is to describe and give information about other data.

Oxford English Dictionary (2023)

There is no fixed boundary between “data” and “metadata”, and information viewed as data in one discipline may be metadata in another.

Alter & Gonzalez (2018)

Definitions: Open Data

Open data and content can be freely used, modified, and shared by anyone for any purpose.

Open Knowledge Foundation: The Open Definition

Two core requirements:

  1. Legally open: available under an open licence that permits anyone to freely access, reuse, and redistribute the data
  2. Technically open: available for no more than the cost of reproduction, in machine-readable and bulk form
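
As a rough illustration, the two requirements above can be expressed as a simple check on a dataset description. This is a minimal sketch, not part of the Open Definition: the `DatasetInfo` class, its field names, and both allow-lists are hypothetical and chosen only for this example (the licence identifiers follow the SPDX style).

```python
from dataclasses import dataclass

# Assumed allow-lists, for illustration only. A real check would rely on a
# maintained list of conformant licences and accepted formats.
OPEN_LICENCES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "ODbL-1.0", "PDDL-1.0"}
MACHINE_READABLE_FORMATS = {"csv", "json", "json-ld", "xml", "ttl"}


@dataclass
class DatasetInfo:
    """Hypothetical, minimal description of a published dataset."""
    licence: str         # licence identifier, e.g. "CC-BY-4.0"
    file_format: str     # e.g. "csv"
    bulk_download: bool  # True if the whole dataset can be fetched at once


def is_legally_open(ds: DatasetInfo) -> bool:
    """Requirement 1: the licence is in the assumed set of open licences."""
    return ds.licence in OPEN_LICENCES


def is_technically_open(ds: DatasetInfo) -> bool:
    """Requirement 2: machine-readable format and available in bulk."""
    return ds.file_format.lower() in MACHINE_READABLE_FORMATS and ds.bulk_download


def is_open_data(ds: DatasetInfo) -> bool:
    """A dataset counts as open only if both requirements hold."""
    return is_legally_open(ds) and is_technically_open(ds)
```

For instance, `is_open_data(DatasetInfo("CC-BY-NC-4.0", "csv", True))` is false in this sketch because the licence is not in the assumed open set, even though the file itself is machine-readable.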

History

  • 1942: Robert K. Merton — each researcher must contribute to the “common pot” and give up intellectual property rights to allow knowledge to move forward
  • 1995: The term “Open Data” first appeared, related to geophysical and environmental data sharing
  • November 2005: Open Knowledge Foundation creates the Open Definition
  • December 2007: The Sebastopol Meeting — 30 Internet thinkers and activists define open public data and identify 8 principles
  • February 2009: Tim Berners-Lee at TED2009 — “Raw Data Now”

The Sebastopol Meeting (2007)

In December 2007, thirty thinkers and activists met in Sebastopol, CA to define open public data.

Key participants:

  • Tim O’Reilly: populariser of the terms “open source” and “Web 2.0”
  • Lawrence Lessig: Stanford Law, founder of Creative Commons
  • Aaron Swartz: co-author of the RSS 1.0 specification, free knowledge activist

Others:

  • Adrian Holovaty: founder of EveryBlock
  • Tom Steinberg: founder of mySociety, the organisation behind FixMyStreet

Together, they created 8 principles to define and evaluate open public data.

Impact on Disciplines

Catalysing Progress

  • Drives interdisciplinary research through universally accessible datasets
  • Fosters collaboration across diverse fields
  • Facilitates advanced, data-driven research

Discipline-specific Impact

  • Information Science: improved data management, retrieval, and dissemination
  • Digital Humanities: new digital approaches to historical, cultural, and linguistic studies
  • Data Science: predictive modelling, machine learning, and big data analytics across health, finance, and the social sciences

Typology of Open Data

Main Sources

  • Research: Open Research Data (ORD)
  • Government: Open Government Data (OGD)
  • Non-profit Organisations
  • Private Organisations

Disciplines

  • Cultural Heritage
  • Healthcare & Education
  • Transportation & Meteorology
  • Geospatial Information
  • Economics & Finance
  • Legal & Criminal Justice

Open Research Data (ORD)

Research data are the evidence that underpins the answer to the research question, and can be used to validate findings regardless of their form.

DORA (San Francisco Declaration on Research Assessment):

  • Consider the value and impact of all research outputs (including datasets and software)
  • Be open and transparent by providing data and methods used to calculate all metrics
  • Provide data under a licence that allows unrestricted reuse

ORD in Switzerland

SNSF Policy

Funded researchers must:

  • Store research data produced during their work
  • Share data with other researchers (unless bound by legal/ethical clauses)
  • Deposit data and metadata onto public repositories

Swiss National ORD Strategy — Action Areas

  1. Support researchers in adopting ORD practices
  2. Develop sustainable basic infrastructures
  3. Equip researchers for ORD skills development
  4. Build supportive institutional conditions

RDM & DMP

Research Data Management (RDM)

  • Organisation, storage, and preservation of research data
  • Ensures data are well organised and accessible for current and future research
  • Improves reliability, validity, and reproducibility

Data Management Plan (DMP)

  • Formal document outlining data handling during and after a project
  • Now required by most funding agencies
  • A DMP is essentially a blueprint for RDM

University libraries often provide services and resources to assist researchers in creating DMPs.

Open Government Data (OGD)

The work of government involves collecting huge amounts of data, much of which is not confidential. The value of much of this data can be greatly enhanced by releasing it as open data.

OGD in Switzerland:

  • OGD Masterplan by the Federal Statistical Office
  • LMETA Art. 10, para. 4: Data must be published free of charge, in a timely manner, in machine-readable and open formats
  • National data management (NaDB), i14y Interoperability platform, Digital Switzerland Strategy

Purposes of Open Data

  • Transparency and democratic control
  • Participation
  • Self-empowerment
  • Improved or new private products and services
  • Innovation
  • Improved efficiency and effectiveness of government services
  • Impact measurement of policies
  • New knowledge from combined data sources and patterns in large data volumes

OGD Principles (Sebastopol, 2007)

  1. Complete: All public data is made available
  2. Primary: Data at the source, highest granularity
  3. Timely: Available as quickly as necessary
  4. Accessible: Widest range of users and purposes
  5. Machine processable: Structured for automated processing
  6. Non-discriminatory: No registration required
  7. Non-proprietary: Open formats only
  8. License-free: No copyright, patent, or trade secret restrictions

Creative Commons (CC)

  • CC BY: Attribution
  • CC BY-SA: Attribution, Share Alike
  • CC BY-ND: Attribution, No Derivatives
  • CC BY-NC: Attribution, No Commercial Use
  • CC BY-NC-SA: Attribution, NC, Share Alike
  • CC BY-NC-ND: Attribution, NC, No Derivatives
  • CC0: Public Domain Dedication
  • PDM: Public Domain Mark
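
The licence codes above are built from a small set of elements, so they can be decomposed mechanically. The following is a minimal sketch under stated assumptions: the function names `describe_cc` and `allows_any_purpose` are hypothetical, the wording of each element is taken from the list above, and only the CC licence codes are handled (the Public Domain Mark is a label, not a licence, and is rejected here).

```python
from typing import List

# Meaning of each licence element, as worded in the list above.
ELEMENTS = {
    "BY": "Attribution",
    "SA": "Share Alike",
    "NC": "No Commercial Use",
    "ND": "No Derivatives",
}


def describe_cc(code: str) -> List[str]:
    """Return the conditions attached to a code such as 'CC BY-NC-SA'."""
    text = code.strip().upper()
    if not text.startswith("CC"):
        raise ValueError(f"not a Creative Commons licence code: {code!r}")
    rest = text[2:].strip()
    if rest == "0":  # CC0 is a public domain dedication: no conditions
        return []
    conditions = []
    for part in rest.split("-"):
        if part not in ELEMENTS:
            raise ValueError(f"unknown licence element in {code!r}: {part!r}")
        conditions.append(ELEMENTS[part])
    return conditions


def allows_any_purpose(code: str) -> bool:
    """True if the code carries neither the NC nor the ND element.

    Restrictions on commercial use or on derivatives conflict with
    'use, modify, and share for any purpose'.
    """
    conditions = describe_cc(code)
    return ELEMENTS["NC"] not in conditions and ELEMENTS["ND"] not in conditions
```

In this sketch, `allows_any_purpose` holds for CC0, CC BY, and CC BY-SA only, because every other code in the list carries NC or ND.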

CC Spectrum & Other Licences

[Figure: CC Spectrum]

Other Relevant Licences

  • Rights Statements: 12 statements for cultural heritage (In Copyright / No Copyright / Other)
  • ODbL: Copyleft, Attribution & Share-Alike for databases
  • PDDL: Public domain for databases
  • Software: GPL, AGPL, MPL, MIT, Apache
  • RAIL: Responsible AI Licenses (data, apps, models, source code)
  • IHAIL: I Hate AI License (prohibits AI use)

Recommendations for Open Data Licences

Preferred Creative Commons licences for open datasets (Santos, 2020):

  1. CC0 — to the fullest extent allowed by law (complete waiver not feasible under Swiss regulations)
  2. CC BY 4.0
  3. CC BY-SA 4.0

Technical Means: Overview

Key factors in providing structured data for machines:

Data & Formats

  • Text-based formats: TXT, MD, CSV, XML, Turtle, JSON, JSON-LD
  • Binary-encoded formats: images (TIFF, JPEG2000), audio (FLAC, WAV), video (FFV1/MKV), documents (PDF/A)
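
Text-based formats are straightforward to convert into one another with standard tooling. As a minimal sketch, the function below turns CSV text into JSON using only the Python standard library. The function name `csv_to_json` and the sample rows are made up for illustration.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)  # each row becomes a dict keyed by the header names
    return json.dumps(rows, ensure_ascii=False, indent=2)


# Invented sample data, for illustration only.
sample = "station,temperature_c\nBern,12.5\nGeneva,14.1\n"
print(csv_to_json(sample))
```

Note that CSV carries no type information, so every value arrives as a string; turning `"12.5"` into a number is a separate step that depends on documentation or a schema for the dataset.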

Supporting Elements

  • Metadata standards / schemas
  • Documentation
  • Protocols
  • Infrastructure

Infrastructure

Infrastructure is by definition invisible, part of the background for other kinds of work.

Star (1999) identifies nine dimensions:

  • Embeddedness
  • Transparency
  • Reach or scope
  • Learned as part of membership
  • Links with conventions of practice
  • Embodiment of standards
  • Built on an installed base
  • Becomes visible upon breakdown
  • Fixed in modular increments

Metadata Standards & DCAT

Standards vs. Schemas

  • Standards: rules and guidelines (CIDOC-CRM, Dublin Core, MARC, PREMIS)
  • Schemas: specific implementations (EAD, LIDO, MODS)

Standards provide the “what” and “why”; schemas offer the “how”.

DCAT (Data Catalog Vocabulary)

Seven main classes:

  1. dcat:Catalog
  2. dcat:Resource
  3. dcat:Dataset
  4. dcat:Distribution
  5. dcat:DataService
  6. dcat:DatasetSeries
  7. dcat:CatalogRecord

DCAT-AP CH (Swiss profile) is a subprofile of DCAT-AP (European profile)
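
To show how two of these classes relate, the sketch below builds a JSON-LD description of one `dcat:Dataset` with a single `dcat:Distribution`. It is only an illustration: the function name `build_dcat_dataset` and the `example.org` URL are hypothetical, and the result is not claimed to satisfy the mandatory properties of DCAT-AP or DCAT-AP CH.

```python
import json


def build_dcat_dataset(title: str, licence_url: str,
                       download_url: str, media_type_url: str) -> dict:
    """Return a JSON-LD dict describing a dataset with one distribution."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dct:license": {"@id": licence_url},
                "dcat:downloadURL": {"@id": download_url},
                "dcat:mediaType": {"@id": media_type_url},
            }
        ],
    }


# Hypothetical dataset; the download URL is a placeholder.
record = build_dcat_dataset(
    title="Example air temperature measurements",
    licence_url="https://creativecommons.org/publicdomain/zero/1.0/",
    download_url="https://example.org/data/temperature.csv",
    media_type_url="https://www.iana.org/assignments/media-types/text/csv",
)
print(json.dumps(record, indent=2))
```

The dataset is the abstract collection, while each distribution is one concrete, downloadable form of it, which is why the licence and media type sit on the distribution in this sketch.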

Protocols

  • API — mechanism enabling two software components to communicate
    • REST: stateless HTTP requests
    • SOAP: XML-based structured message exchange, often combined with security extensions such as WS-Security
  • FTP — transfer of large data files
  • OAI-PMH — harvesting metadata from digital libraries
  • RSS / Atom Feeds — regular updates for frequently changing data
  • SPARQL — querying and manipulating RDF data
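
As one concrete example from the list above, an OAI-PMH harvest starts with an HTTP request whose query string names a verb and a metadata format. The sketch below only builds such a request URL and does not contact any server. The function name `oai_list_records_url` and the `example.org` endpoint are hypothetical; `ListRecords`, `metadataPrefix`, `set`, and `from` are request arguments defined by OAI-PMH 2.0.

```python
from typing import Optional
from urllib.parse import urlencode


def oai_list_records_url(base_url: str,
                         metadata_prefix: str = "oai_dc",
                         set_spec: Optional[str] = None,
                         from_date: Optional[str] = None) -> str:
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec    # restrict the harvest to one set
    if from_date:
        params["from"] = from_date  # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, for illustration only.
print(oai_list_records_url("https://example.org/oai", from_date="2024-01-01"))
```

The `from` argument is what makes incremental harvesting possible: a harvester can store the date of its last run and request only the records changed since then.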