OAI-PMH Harvester
A lightweight Python CLI harvester that extracts metadata from OAI-PMH repositories and exports it to a CSV format suitable for Islandora-style ingest workflows.
The harvester:
- Connects to an OAI-PMH endpoint
- Uses the ListSets verb to discover available collections
- Prompts the user to choose a set
- Harvests records using ListIdentifiers and GetRecord
- Maps harvested metadata into predefined CSV fields
- Scrapes each record page for an image whose filename contains the string “Service File”
- Automatically populates image-related ingest metadata (custom CollectionBuilder fields)
- Outputs a CSV file in the current directory
The script uses only Python standard library modules (urllib, xml.etree, csv, etc.), so it requires no third-party dependencies and no virtual environment (venv).
Requirements
- Python 3.9+
No additional packages are required.
Author's Note: The command used to call Python differs across platforms. Windows uses "python", macOS uses "python3", and Linux uses "python" or "python3" depending on the distribution.
Installation
Extract the script into your working directory.
Usage
Run the harvester with an OAI-PMH endpoint URL:
python harvester.py "https://myOAIEnabledSite.edu/oaiEndpointHere"
The script will:
- Ask for a CSV filename (uses artifacts.csv by default)
- Fetch available OAI-PMH sets
- Prompt you to select one set
- Harvest metadata records
- Write a CSV file to the current directory
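The first of these steps could be handled by an entry point along the following lines. This is a minimal sketch, not the script's actual code: the function names `parse_args` and `resolve_csv_name` are hypothetical. Only the positional endpoint URL, the `--metadata-prefix` option, and the `artifacts.csv` default come from this README.

```python
import argparse

DEFAULT_CSV_NAME = "artifacts.csv"  # default file name stated in this README


def parse_args(argv=None):
    """Parse the endpoint URL and the optional metadata prefix (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Harvest OAI-PMH metadata to CSV")
    parser.add_argument("endpoint", help="OAI-PMH endpoint URL")
    parser.add_argument("--metadata-prefix", default="oai_dc",
                        help="metadata format to request (default: oai_dc)")
    return parser.parse_args(argv)


def resolve_csv_name(answer):
    """Fall back to the default name when the user just presses Enter."""
    answer = answer.strip()
    return answer or DEFAULT_CSV_NAME
```

In an interactive run, `resolve_csv_name(input("Enter CSV file name: "))` would give the output file name.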
Workflow
- Extract the script to a working directory, or to the CollectionBuilder _data directory
- Delete the existing metadata file
- Run harvester.py
- Replace the existing or deleted metadata file with the harvester's output
- Add metadata manually as needed (3D models, etc.)
Examples
python harvester.py "https://digitalcollections.tricolib.brynmawr.edu/oai/request"
Example session:
Enter CSV file name: artifacts.csv
Available sets:
- Peace Collection setSpec: peace_collection
- Photographs setSpec: photographs
Select a set to harvest [1-2]: 2
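The set menu in this session could be built from a ListSets response roughly as follows. This is a sketch under assumptions: `parse_sets` and `choose_set` are hypothetical names, and the sample XML is hand-written to mirror the session rather than captured from a real repository.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol for response elements.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_sets(xml_text):
    """Return (setName, setSpec) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    sets = []
    for node in root.findall(".//oai:set", OAI_NS):
        name = node.findtext("oai:setName", default="", namespaces=OAI_NS)
        spec = node.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        sets.append((name.strip(), spec.strip()))
    return sets


def choose_set(sets, answer):
    """Convert a 1-based menu answer into a setSpec, or None if it is invalid."""
    answer = answer.strip()
    if answer.isdecimal() and 1 <= int(answer) <= len(sets):
        return sets[int(answer) - 1][1]
    return None


# Hand-written sample response (illustrative only).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>peace_collection</setSpec><setName>Peace Collection</setName></set>
    <set><setSpec>photographs</setSpec><setName>Photographs</setName></set>
  </ListSets>
</OAI-PMH>"""

sets = parse_sets(SAMPLE)
print(choose_set(sets, "2"))
```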
CSV Output
The harvester generates CSV files with these headers by default:
objectid,parentid,title,field_edtf_date,field_description_long,field_linked_agent,field_physical_form,field_genre,field_extent,field_language,field_subject,field_subjects_name,field_geographic_subject,field_temporal_subject,field_use_reproduction,field_rights_statement,field_collection_guide,field_shelf_locator,field_local_identifier,field_note,field_member_of,type,format,object_comments,group,display_template,object_location,image_small,image_thumb,extension,image_alt_text,notes
Headers are configurable: modify the values in the CSV_HEADERS variable.
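To illustrate how a header list like CSV_HEADERS can drive the output, here is a minimal sketch using `csv.DictWriter`. The shortened header list, the `write_rows` helper, and the sample row are assumptions made for illustration; the script's own writing logic may differ.

```python
import csv
import io

# Shortened for illustration; the real list is the full header row shown above.
CSV_HEADERS = ["objectid", "title", "field_edtf_date", "notes"]


def write_rows(handle, rows, headers=CSV_HEADERS):
    """Write dict rows; missing columns are left empty, unknown keys are ignored."""
    writer = csv.DictWriter(handle, fieldnames=headers, restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical row: "unmapped" is not in CSV_HEADERS, so it is dropped.
buffer = io.StringIO()
write_rows(buffer, [{"objectid": "node-525276", "title": "Example", "unmapped": "dropped"}])
print(buffer.getvalue())
```

With this approach, adding or removing a column is a one-line change to the header list, and rows that lack a column simply get an empty cell.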
Metadata Mapping
The script maps standard OAI/Dublin Core metadata fields into Islandora-style ingest fields.
Example mappings:
| OAI/DC Field | CSV Column |
| --- | --- |
| title | title |
| description | field_description_long |
| date | field_edtf_date |
| creator | field_linked_agent |
| subject | field_subject |
| rights | field_rights_statement |
| identifier | field_local_identifier |
| language | field_language |
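The mappings above could be applied to an oai_dc record roughly like this. It is a sketch, not the script's implementation: `map_record` is a hypothetical name, joining repeated values with "|" is an assumption, and the sample record is hand-written.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Example mappings listed in this README.
DC_TO_CSV = {
    "title": "title",
    "description": "field_description_long",
    "date": "field_edtf_date",
    "creator": "field_linked_agent",
    "subject": "field_subject",
    "rights": "field_rights_statement",
    "identifier": "field_local_identifier",
    "language": "field_language",
}


def map_record(xml_text, separator="|"):
    """Collect Dublin Core values into CSV columns, joining repeated elements."""
    root = ET.fromstring(xml_text)
    row = {}
    for dc_name, column in DC_TO_CSV.items():
        values = [el.text.strip() for el in root.iter(f"{{{DC_NS}}}{dc_name}")
                  if el.text and el.text.strip()]
        if values:
            row[column] = separator.join(values)
    return row


# Hand-written sample record (illustrative only).
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Peace march</dc:title>
  <dc:subject>Pacifism</dc:subject>
  <dc:subject>Demonstrations</dc:subject>
</oai_dc:dc>"""

print(map_record(SAMPLE))
```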
Image Scraping
For each harvested record, the script:
- Extracts the OAI identifier
- Converts it into a public node URL
- Scrapes the page HTML
- Searches for image filenames containing:
Service File
If found, the image URL is automatically inserted into:
- image_small
- image_thumb
- object_location
The script also auto-fills:
- extension = jpg
- display_template = artifact_image
- group = artifact
- type = Image;StillImage
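The scraping and auto-fill steps in this section might look like the sketch below. Several details are assumptions: the class and function names are hypothetical, the search checks `src` and `href` attributes, URL-encoded names such as `Service%20File` are decoded before matching, and relative URLs are resolved against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urljoin

MARKER = "Service File"  # substring named in this README


class ServiceFileFinder(HTMLParser):
    """Collect src/href values whose decoded URL contains the marker."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.matches = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value and MARKER in unquote(value):
                self.matches.append(urljoin(self.base_url, value))


def find_service_file(html_text, base_url):
    """Return the first matching URL as an absolute URL, or None."""
    finder = ServiceFileFinder(base_url)
    finder.feed(html_text)
    return finder.matches[0] if finder.matches else None


def image_fields(image_url):
    """Build the image-related columns and fixed values listed above."""
    return {
        "object_location": image_url,
        "image_small": image_url,
        "image_thumb": image_url,
        "extension": "jpg",
        "display_template": "artifact_image",
        "group": "artifact",
        "type": "Image;StillImage",
    }
```

Because the parser only sees the HTML returned by the server, images injected later by JavaScript would not be found by a sketch like this.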
Example OAI Identifier Conversion
Input:
oai:digitalcollections.tricolib.brynmawr.edu:node-525276
Generated scrape URL:
https://digitalcollections.tricolib.brynmawr.edu/node/525276
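A conversion like the one above can be sketched in a few lines. The function name `node_url_from_identifier` is hypothetical, and the sketch assumes identifiers always follow the `oai:<host>:node-<id>` pattern shown in the example.

```python
def node_url_from_identifier(identifier):
    """Turn 'oai:<host>:node-<id>' into 'https://<host>/node/<id>'."""
    scheme, host, local = identifier.split(":", 2)
    if scheme != "oai" or not local.startswith("node-"):
        raise ValueError(f"Unexpected identifier format: {identifier!r}")
    return f"https://{host}/node/{local[len('node-'):]}"


print(node_url_from_identifier("oai:digitalcollections.tricolib.brynmawr.edu:node-525276"))
# prints https://digitalcollections.tricolib.brynmawr.edu/node/525276
```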
Supported OAI-PMH Verbs
The harvester uses:
- ListSets to discover available sets
- ListIdentifiers to fetch identifiers within a set
- GetRecord to fetch full metadata records
Together, these three verbs are all the harvester needs from the OAI-PMH specification to harvest metadata.
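For reference, requests for these three verbs are plain GET URLs with query parameters defined by the OAI-PMH protocol. The helper below is an illustrative sketch: `build_request_url` is a hypothetical name, `https://example.org/oai` is a placeholder endpoint, and the code assumes the endpoint URL has no query string of its own.

```python
from urllib.parse import urlencode


def build_request_url(endpoint, verb, **params):
    """Append an OAI-PMH verb and its arguments to the endpoint URL."""
    query = {"verb": verb}
    # Skip arguments left as None so optional parameters can be omitted.
    query.update({key: value for key, value in params.items() if value is not None})
    return f"{endpoint}?{urlencode(query)}"


endpoint = "https://example.org/oai"  # placeholder endpoint
print(build_request_url(endpoint, "ListSets"))
print(build_request_url(endpoint, "ListIdentifiers", metadataPrefix="oai_dc", set="photographs"))
print(build_request_url(endpoint, "GetRecord", metadataPrefix="oai_dc",
                        identifier="oai:example.org:node-1"))
```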
Notes
- Deleted records are skipped automatically.
- The harvester respects OAI-PMH resumption tokens.
- The default metadata format is:
oai_dc
You can change it with:
python harvester.py "https://example.org/oai" --metadata-prefix mods
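The notes on deleted records and resumption tokens can be illustrated with the sketch below. It is not the script's code: `parse_identifier_page` and `harvest_identifiers` are hypothetical names, and `fetch_page` stands in for whatever function performs the HTTP request.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol for response elements.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_identifier_page(xml_text):
    """Return (live identifiers, resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    identifiers = []
    for header in root.findall(".//oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue  # deleted records carry no metadata, so skip them
        identifier = header.findtext("oai:identifier", default="", namespaces=OAI_NS).strip()
        if identifier:
            identifiers.append(identifier)
    # A missing or empty resumptionToken element means the list is complete.
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS).strip()
    return identifiers, token or None


def harvest_identifiers(fetch_page):
    """Request pages until the repository stops returning a token.

    fetch_page(token) is a caller-supplied function returning response XML;
    token is None for the first request.
    """
    collected, token = [], None
    while True:
        page, token = parse_identifier_page(fetch_page(token))
        collected.extend(page)
        if token is None:
            return collected
```

Passing the fetch function in as an argument keeps the paging logic separate from the network code, which also makes it easy to exercise with canned XML.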
Troubleshooting
No metadata appears in the CSV
Possible causes:
- Repository metadata fields do not map to the configured CSV fields
- Repository does not expose oai_dc
- XML namespaces differ from expected structure
Try inspecting raw XML manually:
https://example.org/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=RECORD_ID
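Once the raw XML is saved or copied, a short script can show which namespaces and element names the repository actually uses, which helps diagnose the mapping and namespace causes listed above. This is a diagnostic sketch with a hypothetical function name, `summarize_tags`, and a hand-written sample; it is not part of the harvester.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_tags(xml_text):
    """Count every element in a response by (namespace, local name)."""
    counts = Counter()
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith("{"):
            # ElementTree stores qualified names as "{namespace}local".
            namespace, local = element.tag[1:].split("}", 1)
        else:
            namespace, local = "", element.tag
        counts[(namespace, local)] += 1
    return counts


# Hand-written sample record (illustrative only).
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example</dc:title>
  <dc:subject>One</dc:subject>
  <dc:subject>Two</dc:subject>
</oai_dc:dc>"""

for (namespace, local), count in sorted(summarize_tags(SAMPLE).items()):
    print(f"{count:3d}  {local}  ({namespace})")
```

If the Dublin Core elements appear under a namespace other than `http://purl.org/dc/elements/1.1/`, or do not appear at all, the configured mappings will not match.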
No image URLs appear
Possible causes:
- No “Service File” image exists (the script only picks up files whose names contain the substring “Service File”)
- The repository uses different image naming conventions
- Images are loaded dynamically with JavaScript or embedded in a frame, so they do not appear in the page HTML that the script scrapes