OAI-PMH Harvester

OAI-PMH Harvester is a lightweight Python CLI tool that extracts metadata from OAI-PMH repositories and exports it to a CSV format suitable for Islandora-style ingest workflows.

The harvester:

  • Connects to an OAI-PMH endpoint
  • Uses the ListSets verb to discover available collections
  • Prompts the user to choose a set
  • Harvests records using ListIdentifiers and GetRecord
  • Maps harvested metadata into predefined CSV fields
  • Scrapes each record page for an image whose filename contains the string “Service File”
  • Automatically populates image-related ingest metadata (custom CollectionBuilder fields)
  • Outputs a CSV file in the current directory

The script uses only Python standard library modules (urllib, xml.etree, csv, etc.), so it requires no third-party dependencies and no virtual environment (venv).

Requirements

  • Python 3.9+

No additional packages are required.

Author's Note: The command that invokes Python differs across platforms. Windows typically uses "python", macOS uses "python3", and Linux uses "python" or "python3" depending on the distribution.

Installation

Extract the script into your working directory.

Usage

Run the harvester with an OAI-PMH endpoint URL:

python harvester.py "https://myOAIEnabledSite.edu/oaiEndpointHere"

The script will:

  1. Ask for a CSV filename (uses artifacts.csv by default)
  2. Fetch available OAI-PMH sets
  3. Prompt you to select one set
  4. Harvest metadata records
  5. Write a CSV file to the current directory
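Step 4 usually involves paging, because an OAI-PMH repository may split a long ListIdentifiers response into several pages linked by a resumptionToken. The following is a minimal sketch of that loop, not the script's actual code: the function names and the fetch_page callback are hypothetical, and the network request is left to the caller so the logic can be run against saved XML.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for the OAI-PMH 2.0 namespace.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def parse_identifier_page(xml_text):
    """Return (identifiers, resumption_token) for one ListIdentifiers page."""
    root = ET.fromstring(xml_text)
    identifiers = []
    for header in root.iter(f"{OAI}header"):
        # Deleted records are flagged on the header and are skipped.
        if header.get("status") == "deleted":
            continue
        identifier = header.findtext(f"{OAI}identifier")
        if identifier:
            identifiers.append(identifier.strip())
    # An absent or empty resumptionToken means this was the last page.
    token = root.findtext(f".//{OAI}resumptionToken")
    token = token.strip() if token else None
    return identifiers, token


def harvest_identifiers(fetch_page):
    """Collect identifiers across all pages.

    fetch_page is a hypothetical callback: it receives None for the first
    request, or the previous resumption token, and returns the response XML.
    """
    identifiers, token = [], None
    while True:
        page_ids, token = parse_identifier_page(fetch_page(token))
        identifiers.extend(page_ids)
        if not token:
            return identifiers
```

Keeping the fetch step as a callback is a design choice for this sketch only; it lets the paging logic be tested with canned responses before pointing it at a live endpoint.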

Workflow

  1. Extract the script to a working directory, or to the CollectionBuilder _data directory
  2. Delete the existing metadata file
  3. Run harvester.py
  4. Replace the existing or deleted metadata file with the harvester's output
  5. Add metadata manually as needed (3D models, etc.)

Examples

python harvester.py "https://digitalcollections.tricolib.brynmawr.edu/oai/request"

Example session:

Enter CSV file name: artifacts.csv
Available sets:

  1. Peace Collection setSpec: peace_collection
  2. Photographs setSpec: photographs

Select a set to harvest [1-2]: 2
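A set menu like the one in this session can be derived from a ListSets response. The sketch below is illustrative only: parse_sets, pick_set, and the sample XML are hypothetical names and hand-written data, not taken from harvester.py.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_sets(xml_text):
    """Return a list of (setName, setSpec) tuples from a ListSets response."""
    root = ET.fromstring(xml_text)
    sets = []
    for node in root.iterfind(".//oai:set", OAI_NS):
        name = node.findtext("oai:setName", default="", namespaces=OAI_NS).strip()
        spec = node.findtext("oai:setSpec", default="", namespaces=OAI_NS).strip()
        sets.append((name, spec))
    return sets


def pick_set(sets, choice):
    """Translate a 1-based menu choice (as typed by the user) into a setSpec."""
    index = int(choice) - 1
    if not 0 <= index < len(sets):
        raise ValueError(f"Choose a number between 1 and {len(sets)}")
    return sets[index][1]


# Hand-written sample response, trimmed to the parts the parser reads.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>peace_collection</setSpec><setName>Peace Collection</setName></set>
    <set><setSpec>photographs</setSpec><setName>Photographs</setName></set>
  </ListSets>
</OAI-PMH>"""

sets = parse_sets(SAMPLE_LIST_SETS)
print(pick_set(sets, "2"))  # photographs
```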

CSV Output

The harvester generates CSV files with these headers by default:

objectid,parentid,title,field_edtf_date,field_description_long,field_linked_agent,field_physical_form,field_genre,field_extent,field_language,field_subject,field_subjects_name,field_geographic_subject,field_temporal_subject,field_use_reproduction,field_rights_statement,field_collection_guide,field_shelf_locator,field_local_identifier,field_note,field_member_of,type,format,object_comments,group,display_template,object_location,image_small,image_thumb,extension,image_alt_text,notes

Headers are configurable by editing the values in the CSV_HEADERS variable.
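As a rough illustration of how a header list can drive the output, the sketch below writes rows with csv.DictWriter. Only the name CSV_HEADERS comes from this README; the shortened header list, the rows_to_csv helper, and the sample row are assumptions made for the example.

```python
import csv
import io

# Shortened for illustration; the real list is the full header row shown above.
CSV_HEADERS = ["objectid", "title", "field_edtf_date", "notes"]


def rows_to_csv(rows, headers=CSV_HEADERS):
    """Serialize a list of dict rows to CSV text using the given headers."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=headers,
        restval="",             # columns missing from a row are left empty
        extrasaction="ignore",  # keys that are not in headers are dropped
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv([{"objectid": "obj1", "title": "Harbor at dusk"}]))
```

With this approach, adding or removing a column only means editing the header list; rows that lack the new column are written with an empty cell.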

Metadata Mapping

The script maps standard OAI/Dublin Core metadata fields into Islandora-style ingest fields.

Example mappings:

OAI/DC Field    CSV Column
title           title
description     field_description_long
date            field_edtf_date
creator         field_linked_agent
subject         field_subject
rights          field_rights_statement
identifier      field_local_identifier
language        field_language
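The example mappings can be expressed as a lookup table applied to an oai_dc record. This is a sketch under assumptions: the DC_TO_CSV name, the map_dc_record helper, and the "|" separator for repeated fields are hypothetical and may not match what harvester.py does.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Dublin Core element name -> CSV column, as listed in the example mappings.
DC_TO_CSV = {
    "title": "title",
    "description": "field_description_long",
    "date": "field_edtf_date",
    "creator": "field_linked_agent",
    "subject": "field_subject",
    "rights": "field_rights_statement",
    "identifier": "field_local_identifier",
    "language": "field_language",
}


def map_dc_record(xml_text, separator="|"):
    """Map the Dublin Core elements in xml_text onto CSV columns.

    Repeated elements (for example several dc:subject values) are joined
    with separator, which is an assumed convention for this sketch.
    """
    root = ET.fromstring(xml_text)
    row = {}
    for dc_name, column in DC_TO_CSV.items():
        values = [
            element.text.strip()
            for element in root.iter(f"{{{DC_NS}}}{dc_name}")
            if element.text and element.text.strip()
        ]
        row[column] = separator.join(values)
    return row


# Hand-written sample record for illustration.
SAMPLE_DC = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Harbor at dusk</dc:title>
  <dc:subject>Harbors</dc:subject>
  <dc:subject>Boats</dc:subject>
</oai_dc:dc>"""

row = map_dc_record(SAMPLE_DC)
print(row["title"], row["field_subject"])
```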

Image Scraping

For each harvested record, the script:

  1. Extracts the OAI identifier
  2. Converts it into a public node URL
  3. Scrapes the page HTML
  4. Searches for image filenames containing:

Service File

If found, the image URL is automatically inserted into:

  • image_small
  • image_thumb
  • object_location

The script also auto-fills:

extension = jpg
display_template = artifact_image
group = artifact
type = Image;StillImage
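The scraping step could be approximated with the standard library's HTML parser, as sketched below. The class and function names are hypothetical, and the sketch assumes the image appears as a src or href attribute in the static HTML, possibly URL-encoded ("Service%20File"); the script's actual matching logic may differ.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urljoin


class ServiceFileFinder(HTMLParser):
    """Collect src/href values whose decoded form contains the marker."""

    def __init__(self, page_url, marker="Service File"):
        super().__init__()
        self.page_url = page_url
        self.marker = marker
        self.matches = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # unquote() lets "Service%20File" match the plain-text marker.
            if name in ("src", "href") and value and self.marker in unquote(value):
                # Resolve relative paths against the page URL.
                self.matches.append(urljoin(self.page_url, value))


def find_service_file(html_text, page_url):
    """Return the first matching image URL, or None if there is none."""
    finder = ServiceFileFinder(page_url)
    finder.feed(html_text)
    return finder.matches[0] if finder.matches else None


def image_fields(image_url):
    """Build the image-related CSV values described above."""
    return {
        "object_location": image_url,
        "image_small": image_url,
        "image_thumb": image_url,
        "extension": "jpg",
        "display_template": "artifact_image",
        "group": "artifact",
        "type": "Image;StillImage",
    }
```

A parser-based match only sees markup present in the downloaded HTML, so images injected later by JavaScript would not be found by this sketch.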

Example OAI Identifier Conversion

Input:

oai:digitalcollections.tricolib.brynmawr.edu:node-525276

Generated scrape URL:

https://digitalcollections.tricolib.brynmawr.edu/node/525276
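A conversion consistent with this example might look like the sketch below. It assumes identifiers always follow the pattern oai:<host>:node-<number> and that the public site is served over https; the function name is hypothetical.

```python
def oai_identifier_to_node_url(identifier):
    """Convert 'oai:<host>:node-<id>' into 'https://<host>/node/<id>'."""
    parts = identifier.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai":
        raise ValueError(f"Not an OAI identifier: {identifier!r}")
    _, host, local_id = parts
    prefix, _, node_id = local_id.partition("-")
    # Only the 'node-<digits>' form shown in the example is handled here.
    if prefix != "node" or not node_id.isdigit():
        raise ValueError(f"Unexpected local identifier: {local_id!r}")
    return f"https://{host}/node/{node_id}"


print(oai_identifier_to_node_url(
    "oai:digitalcollections.tricolib.brynmawr.edu:node-525276"
))
```

Raising an error for identifiers in any other shape keeps unexpected repositories from silently producing wrong scrape URLs.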

Supported OAI-PMH Verbs

The harvester uses:

  • ListSets to discover available sets
  • ListIdentifiers to fetch identifiers within a set
  • GetRecord to fetch full metadata records

These three OAI-PMH verbs are all the script needs to harvest a set's metadata.
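Each verb is sent as a query parameter on the endpoint URL. The sketch below builds the three request URLs with urllib.parse; build_oai_url is a hypothetical helper, and the endpoint, set name, and identifier are placeholder values.

```python
from urllib.parse import urlencode


def build_oai_url(base_url, verb, **params):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    # Skip arguments that were not supplied so they are not sent as "None".
    query.update({key: value for key, value in params.items() if value is not None})
    return f"{base_url}?{urlencode(query)}"


base = "https://example.org/oai"  # placeholder endpoint

list_sets_url = build_oai_url(base, "ListSets")
list_ids_url = build_oai_url(
    base, "ListIdentifiers", metadataPrefix="oai_dc", set="photographs"
)
get_record_url = build_oai_url(
    base, "GetRecord", metadataPrefix="oai_dc", identifier="oai:example.org:node-1"
)

print(list_sets_url)
print(list_ids_url)
print(get_record_url)
```

Note that urlencode percent-encodes the colons in the identifier, which is the safe form for a query string.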

Notes

  • Deleted records are skipped automatically.
  • The harvester respects OAI-PMH resumption tokens.
  • The default metadata format is:

oai_dc

You can change it with:

python harvester.py "https://example.org/oai" --metadata-prefix mods
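A command line of this shape can be parsed with argparse. The sketch below is an assumption about how the interface could be defined, not a copy of the script: only the positional endpoint URL and the --metadata-prefix flag with its oai_dc default are taken from this README.

```python
import argparse


def build_parser():
    """Build a parser for: harvester.py ENDPOINT [--metadata-prefix PREFIX]."""
    parser = argparse.ArgumentParser(
        description="Harvest OAI-PMH metadata into a CSV file."
    )
    parser.add_argument("endpoint", help="Base URL of the OAI-PMH endpoint")
    parser.add_argument(
        "--metadata-prefix",
        default="oai_dc",
        help="metadataPrefix to request (default: oai_dc)",
    )
    return parser


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_parser().parse_args(
    ["https://example.org/oai", "--metadata-prefix", "mods"]
)
print(args.endpoint, args.metadata_prefix)
```

argparse exposes the hyphenated flag as args.metadata_prefix, so the chosen prefix can be passed straight into the metadataPrefix request argument.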

Troubleshooting

No metadata appears in the CSV

Possible causes:

  • Repository metadata fields do not map to the configured CSV fields
  • Repository does not expose oai_dc
  • XML namespaces differ from expected structure

Try inspecting raw XML manually:

https://example.org/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=RECORD_ID
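Once the raw XML has been saved, a quick tally of element names and namespaces can show whether the record uses the structure the mapping expects. The helper below is a hypothetical diagnostic sketch, not part of the harvester.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_tags(xml_text):
    """Count (namespace, local name) pairs for every element in the XML."""
    counts = Counter()
    for element in ET.fromstring(xml_text).iter():
        tag = element.tag
        if tag.startswith("{"):
            # ElementTree stores qualified names as '{namespace}local'.
            namespace, _, local = tag[1:].partition("}")
        else:
            namespace, local = "", tag
        counts[(namespace, local)] += 1
    return counts


# Hand-written sample; in practice, paste the XML returned by the URL above.
SAMPLE_RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                             xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Harbor at dusk</dc:title>
  <dc:subject>Harbors</dc:subject>
  <dc:subject>Boats</dc:subject>
</oai_dc:dc>"""

for (namespace, local), count in sorted(summarize_tags(SAMPLE_RECORD).items()):
    print(f"{count:3d}  {local}  ({namespace})")
```

If the Dublin Core namespace http://purl.org/dc/elements/1.1/ does not appear in the tally, the repository is probably returning a different metadata format or structure than the mapping assumes.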

No image URLs appear

Possible causes:

  • No “Service File” image exists. Note: the script only picks up files whose names contain the substring "Service File"
  • The repository uses different image naming conventions
  • Images are loaded dynamically with JavaScript or embedded in a frame, so they do not appear in the page HTML the script downloads