OAI-PMH Harvester

OAI-PMH Harvester is a lightweight Python command-line tool that extracts metadata from OAI-PMH repositories and exports it to a CSV format suitable for Islandora-style ingest workflows.

The harvester:

Connects to an OAI-PMH endpoint
Uses the ListSets verb to discover available collections
Prompts the user to choose a set
Harvests records using ListIdentifiers and GetRecord
Maps harvested metadata into predefined CSV fields
Scrapes each record page for an image whose filename contains the string “Service File”
Automatically populates image-related ingest metadata (custom CB fields)
Outputs a CSV file in the current directory

The script uses only Python standard library modules (urllib, xml.etree, csv, etc.) and requires no third-party dependencies or virtual environment.

Requirements

Python 3.9+

No additional packages are required.

Author's Note: The command used to invoke Python differs across platforms. Windows typically uses "python", macOS uses "python3", and Linux uses "python" or "python3" depending on the distribution.

Installation

Extract the script into your working directory.

Usage

Run the harvester with an OAI-PMH endpoint URL:

python harvester.py "https://myOAIEnabledSite.edu/oaiEndpointHere"

The script will:

Ask for a CSV filename (uses artifacts.csv by default)
Fetch available OAI-PMH sets
Prompt you to select one set
Harvest metadata records
Write a CSV file to the current directory

Workflow

Extract the script to a working directory, or to the CollectionBuilder /_data directory
Delete the existing metadata file
Run harvester.py
Replace the existing or deleted metadata file with the harvester's output
Add metadata manually as needed (3D models, etc.)

Examples

python harvester.py "https://digitalcollections.tricolib.brynmawr.edu/oai/request"

Example session:

Enter CSV file name: artifacts.csv
Available sets:

Peace Collection setSpec: peace_collection
Photographs setSpec: photographs
Select a set to harvest [1-2]: 2
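
The set menu in the session above could be produced from a ListSets response roughly as follows. This is a minimal sketch, not the script's actual code: the function names parse_sets and pick_set and the sample XML are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by every element in an OAI-PMH 2.0 response.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def parse_sets(xml_text):
    """Return a list of (setName, setSpec) tuples from a ListSets response."""
    root = ET.fromstring(xml_text)
    sets = []
    for set_el in root.iter(f"{OAI}set"):
        spec = set_el.findtext(f"{OAI}setSpec", default="").strip()
        name = set_el.findtext(f"{OAI}setName", default="").strip()
        if spec:
            # Fall back to the setSpec when a set has no display name.
            sets.append((name or spec, spec))
    return sets


def pick_set(sets, answer):
    """Return the setSpec for a 1-based menu answer, or None if it is invalid."""
    try:
        index = int(answer)
    except ValueError:
        return None
    if 1 <= index <= len(sets):
        return sets[index - 1][1]
    return None


# Hypothetical response mirroring the two sets in the example session.
SAMPLE_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>peace_collection</setSpec><setName>Peace Collection</setName></set>
    <set><setSpec>photographs</setSpec><setName>Photographs</setName></set>
  </ListSets>
</OAI-PMH>"""

available = parse_sets(SAMPLE_SETS)
print(pick_set(available, "2"))
```

Returning None for an invalid answer lets the caller simply ask again instead of crashing on a typo.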

CSV Output

The harvester generates CSV files with these headers by default:

objectid,parentid,title,field_edtf_date,field_description_long,field_linked_agent,field_physical_form,field_genre,field_extent,field_language,field_subject,field_subjects_name,field_geographic_subject,field_temporal_subject,field_use_reproduction,field_rights_statement,field_collection_guide,field_shelf_locator,field_local_identifier,field_note,field_member_of,type,format,object_comments,group,display_template,object_location,image_small,image_thumb,extension,image_alt_text,notes

Headers are configurable by modifying the values in the CSV_HEADERS variable.
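
As an illustration of how a header list like CSV_HEADERS can drive the output, here is a minimal sketch using csv.DictWriter. The shortened header list and the write_rows helper are assumptions for the example; the real script may write its rows differently.

```python
import csv
import io

# Shortened, illustrative list; the real CSV_HEADERS holds the full set shown above.
CSV_HEADERS = ["objectid", "title", "field_edtf_date", "field_description_long"]


def write_rows(rows, out_file):
    """Write dict rows to out_file, leaving columns without a value empty."""
    writer = csv.DictWriter(
        out_file,
        fieldnames=CSV_HEADERS,
        restval="",             # value used for columns missing from a row
        extrasaction="ignore",  # silently drop keys that are not in CSV_HEADERS
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Write to an in-memory buffer so the example has no side effects on disk.
buffer = io.StringIO()
write_rows(
    [{"objectid": "node-525276", "title": "Example title", "unknown": "dropped"}],
    buffer,
)
print(buffer.getvalue())
```

With this approach, adding or removing a column only requires editing the header list; rows that lack the new column are padded with empty cells.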

Metadata Mapping

The script maps standard OAI/Dublin Core metadata fields into Islandora-style ingest fields.

Example mappings:

OAI/DC Field    CSV Column
title           title
description     field_description_long
date            field_edtf_date
creator         field_linked_agent
subject         field_subject
rights          field_rights_statement
identifier      field_local_identifier
language        field_language
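
The example mappings above can be expressed as a lookup table applied to each oai_dc record. The sketch below is an assumption about how such a mapping might be coded, not the script's actual implementation: the names DC_TO_CSV and map_record are hypothetical, and joining repeated values with ";" is assumed here only because the script's own type value (Image;StillImage) uses that separator.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical lookup table mirroring the example mappings above.
DC_TO_CSV = {
    "title": "title",
    "description": "field_description_long",
    "date": "field_edtf_date",
    "creator": "field_linked_agent",
    "subject": "field_subject",
    "rights": "field_rights_statement",
    "identifier": "field_local_identifier",
    "language": "field_language",
}


def map_record(xml_text, separator=";"):
    """Return {CSV column: joined values} for one oai_dc metadata block."""
    root = ET.fromstring(xml_text)
    row = {}
    for dc_name, column in DC_TO_CSV.items():
        # Collect every non-empty occurrence of the element, at any depth.
        values = [
            el.text.strip()
            for el in root.iter(f"{{{DC_NS}}}{dc_name}")
            if el.text and el.text.strip()
        ]
        if values:
            row[column] = separator.join(values)
    return row


# Made-up sample record for illustration.
SAMPLE_DC = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample photograph</dc:title>
  <dc:subject>Peace</dc:subject>
  <dc:subject>Photographs</dc:subject>
</oai_dc:dc>"""

print(map_record(SAMPLE_DC))
```

Matching on the fully qualified element name, rather than on a prefix such as dc:, keeps the lookup working when a repository chooses different namespace prefixes.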

Image Scraping

For each harvested record, the script:

Extracts the OAI identifier
Converts it into a public node URL
Scrapes the page HTML
Searches for image filenames containing:

Service File

If found, the image URL is automatically inserted into:

image_small
image_thumb
object_location

The script also auto-fills:

extension = jpg
display_template = artifact_image
group = artifact
type = Image;StillImage
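
The scraping step described above might look roughly like the following. This is a sketch under stated assumptions: the helper names find_service_file and image_fields are hypothetical, the sample HTML is made up, and matching src/href attributes with a regular expression is a simplification that the real script may not share. Filenames are URL-decoded before matching because a space in “Service File” usually appears as %20 in a URL.

```python
import re
from urllib.parse import unquote, urljoin

# Captures the value of src="..." or href="..." attributes.
ATTR_URL = re.compile(r'(?:src|href)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def find_service_file(html, page_url, marker="Service File"):
    """Return the first absolute URL whose decoded filename contains marker, else None."""
    for raw in ATTR_URL.findall(html):
        filename = unquote(raw).rsplit("/", 1)[-1]
        if marker in filename:
            # Resolve relative links against the page that was scraped.
            return urljoin(page_url, raw)
    return None


def image_fields(image_url):
    """Build the image-related CSV values listed above for one record."""
    return {
        "object_location": image_url,
        "image_small": image_url,
        "image_thumb": image_url,
        "extension": "jpg",
        "display_template": "artifact_image",
        "group": "artifact",
        "type": "Image;StillImage",
    }


# Made-up page fragment for illustration.
SAMPLE_HTML = (
    '<a href="/files/thumb.png">x</a>'
    '<img src="/sites/default/files/Item%20001%20Service%20File.jpg">'
)

found = find_service_file(SAMPLE_HTML, "https://example.org/node/1")
print(found)
```

Because the search only sees the HTML returned by the server, images injected later by JavaScript would not be found by a sketch like this one.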

Example OAI Identifier Conversion

Input:

oai:digitalcollections.tricolib.brynmawr.edu:node-525276

Generated scrape URL:

https://digitalcollections.tricolib.brynmawr.edu/node/525276
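
The conversion shown above can be sketched as a small function. The name oai_identifier_to_url is hypothetical, and the sketch assumes every identifier follows the oai:<host>:node-<id> pattern from the example; identifiers in other shapes are rejected rather than guessed at.

```python
def oai_identifier_to_url(identifier):
    """Convert 'oai:<host>:node-<id>' into 'https://<host>/node/<id>'."""
    parts = identifier.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai" or not parts[2].startswith("node-"):
        raise ValueError(f"Unexpected OAI identifier format: {identifier!r}")
    host = parts[1]
    node_id = parts[2][len("node-"):]
    return f"https://{host}/node/{node_id}"


print(oai_identifier_to_url("oai:digitalcollections.tricolib.brynmawr.edu:node-525276"))
```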

Supported OAI-PMH Verbs

The harvester uses:

ListSets to discover available sets

ListIdentifiers to fetch identifiers within a set

GetRecord to fetch full metadata records

These three verbs are all the harvester needs from the OAI-PMH specification to harvest metadata.
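
To illustrate the ListIdentifiers step, the sketch below parses one page of a response, skips headers marked as deleted, and returns the resumption token if the repository sent one. The function name parse_identifiers_page and the sample XML are assumptions for illustration, not the script's actual code.

```python
import xml.etree.ElementTree as ET

# Namespace used by every element in an OAI-PMH 2.0 response.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def parse_identifiers_page(xml_text):
    """Return (identifiers, resumption_token) for one ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    identifiers = []
    for header in root.iter(f"{OAI}header"):
        if header.get("status") == "deleted":
            continue  # deleted records have no metadata to harvest
        ident = header.findtext(f"{OAI}identifier")
        if ident:
            identifiers.append(ident.strip())
    token_el = root.find(f".//{OAI}resumptionToken")
    # A missing or empty resumptionToken means this was the last page.
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return identifiers, token


# Made-up response page for illustration.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:node-1</identifier></header>
    <header status="deleted"><identifier>oai:example.org:node-2</identifier></header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

print(parse_identifiers_page(SAMPLE_PAGE))
```

A caller would repeat the request while a token is returned. Under the OAI-PMH specification, a follow-up request carries only the verb and the resumptionToken argument, not the original set or metadataPrefix.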

Notes

Deleted records are skipped automatically.
The harvester respects OAI-PMH resumption tokens.
The default metadata format is:

oai_dc

You can change it with:

python harvester.py "https://example.org/oai" --metadata-prefix mods
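
A command line of this shape can be handled with argparse from the standard library. The sketch below is an assumption about how the positional endpoint and the --metadata-prefix option might be declared; the build_parser helper and the help strings are illustrative, not taken from the script.

```python
import argparse


def build_parser():
    """Declare the endpoint argument and the optional metadata prefix."""
    parser = argparse.ArgumentParser(
        prog="harvester.py",
        description="Harvest OAI-PMH metadata into a CSV file.",
    )
    parser.add_argument("endpoint", help="Base URL of the OAI-PMH endpoint")
    parser.add_argument(
        "--metadata-prefix",
        default="oai_dc",
        help="OAI-PMH metadataPrefix to request (default: oai_dc)",
    )
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["https://example.org/oai", "--metadata-prefix", "mods"])
print(args.endpoint, args.metadata_prefix)
```

argparse converts the dash in --metadata-prefix to an underscore, so the value is read as args.metadata_prefix.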

Troubleshooting

No metadata appears in the CSV

Possible causes:

Repository metadata fields do not map to the configured CSV fields
Repository does not expose oai_dc
XML namespaces differ from expected structure

Try inspecting raw XML manually:

https://example.org/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=RECORD_ID
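
The same request can be built and fetched from Python when a browser is not convenient. This is a minimal sketch: get_record_url and fetch_xml are hypothetical helper names, and the example identifier is made up. urlencode percent-encodes the colons in the identifier, which repositories should decode as part of normal query-string handling.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build a GetRecord request URL with properly encoded query parameters."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        }
    )
    return f"{base_url}?{query}"


def fetch_xml(url, timeout=30):
    """Download the raw XML for inspection (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


url = get_record_url("https://example.org/oai", "oai:example.org:node-1")
print(url)
# To see the raw record: print(fetch_xml(url))
```

Comparing the element names and namespaces in the raw XML against the configured mappings is usually the quickest way to see why a column stays empty.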

No image URLs appear

Possible causes:

No “Service File” image exists. Note: the script only grabs files whose names contain the substring “Service File”.
The repository uses different image naming conventions
Images are loaded dynamically with JavaScript or embedded in a frame
