This section gives a brief overview of how to use Sickle to query OAI interfaces.
This section gives a basic overview of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). For more detailed information, please refer to the protocol specification.
OAI-PMH features six main API methods (so-called “OAI verbs”) that can be issued by harvesters. Some verbs can be combined with further arguments:
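GetRecord: returns a single record. Arguments: identifier (the unique identifier of the record, required) and metadataPrefix (the prefix identifying the metadata format, required).
ListRecords: returns the records in the repository in batches (possibly filtered by a timestamp or a set). Arguments: metadataPrefix (required), from and until (optional datestamp bounds), set (optional), and resumptionToken (used on its own to request the next batch).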
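The six verbs and the arguments the OAI-PMH 2.0 specification defines for them can be summarized in a small lookup table. The following is only a sketch to make the rules concrete: the names OAI_VERBS, RESUMABLE and check_arguments are made up for this illustration and are not part of Sickle.

```python
# Arguments per verb as (required, optional), following the OAI-PMH 2.0
# specification. OAI_VERBS and check_arguments are illustrative names only;
# they are not part of Sickle.
OAI_VERBS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}

# Verbs whose responses may be split into batches via resumptionToken.
RESUMABLE = {"ListSets", "ListIdentifiers", "ListRecords"}


def check_arguments(verb, **params):
    """Raise ValueError if params are not a valid argument set for verb."""
    if verb not in OAI_VERBS:
        raise ValueError("unknown verb: %r" % verb)
    given = set(params)
    # resumptionToken is an exclusive argument: it must stand alone.
    if "resumptionToken" in given:
        if verb not in RESUMABLE or len(given) > 1:
            raise ValueError("resumptionToken must be the only argument")
        return
    required, optional = OAI_VERBS[verb]
    missing = required - given
    unknown = given - required - optional
    if missing:
        raise ValueError("missing arguments: " + ", ".join(sorted(missing)))
    if unknown:
        raise ValueError("unknown arguments: " + ", ".join(sorted(unknown)))


# Valid calls return silently; invalid ones raise ValueError.
check_arguments("GetRecord", identifier="oai:example:1", metadataPrefix="oai_dc")
```

A real repository reports such problems itself (for example with a badArgument error), so a client-side check like this is optional.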
OAI interfaces may expose metadata records in multiple metadata formats. These formats are identified by so-called “metadata prefixes”. For instance, the prefix oai_dc refers to the OAI-DC format, which by definition has to be exposed by every valid OAI interface. OAI-DC is based on the 15 metadata elements specified in the Dublin Core Metadata Element Set.
Note
Sickle only supports the OAI-DC format out of the box. See section XXX for how to extend Sickle for retrieving metadata in other formats.
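To give an idea of what an OAI-DC payload looks like, here is a minimal sketch that turns one into a dictionary using only the standard library. The sample record and the helper name dc_to_dict are invented for this illustration; this is not how Sickle itself is implemented.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# A made-up OAI-DC metadata block, as it might appear inside a record.
SAMPLE_DC = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example title</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:date>2012-12-12</dc:date>
</oai_dc:dc>"""


def dc_to_dict(xml_text):
    """Map each Dublin Core element name to the list of its values."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # Tags look like '{http://purl.org/dc/elements/1.1/}title'.
        if child.tag.startswith("{%s}" % DC_NS):
            name = child.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


metadata = dc_to_dict(SAMPLE_DC)
```

Values are collected in lists because every Dublin Core element may be repeated, as dc:creator is in the sample.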
To make a connection to an OAI interface, you need to import the Sickle object:
>>> from sickle import Sickle
Next, initialize the connection by passing the base URL of the OAI interface to Sickle. In our example, we use the OAI interface of the ELIS repository:
>>> sickle = Sickle('http://elis.da.ulcc.ac.uk/cgi/oai2')
Now you are set to issue some requests. Sickle provides methods for each of the six OAI verbs (ListRecords, GetRecord, Identify, ListSets, ListMetadataFormats, ListIdentifiers). Start with a ListRecords request:
>>> records = sickle.ListRecords(metadataPrefix='oai_dc')
Note that all keyword arguments you provide to this function are passed to the OAI interface as HTTP parameters. Our example request therefore results in the parameters verb=ListRecords&metadataPrefix=oai_dc. In the same way, we can add further parameters, such as set:
>>> records = sickle.ListRecords(metadataPrefix='oai_dc', set='driver')
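The mapping from keyword arguments to HTTP parameters can be pictured with a few lines of standard-library code. This is a rough sketch of the idea only: build_request_url is a hypothetical helper and does not claim to mirror Sickle's internals.

```python
from urllib.parse import urlencode


def build_request_url(endpoint, verb, **kwargs):
    """Combine the verb and all keyword arguments into one query string."""
    params = {"verb": verb}
    params.update(kwargs)
    return endpoint + "?" + urlencode(params)


url = build_request_url(
    "http://elis.da.ulcc.ac.uk/cgi/oai2",
    "ListRecords",
    metadataPrefix="oai_dc",
    set="driver",
)
print(url)
```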
If you need to perform selective harvesting by date using the from parameter, you will run into problems though, since from is a reserved word in Python:
>>> records = sickle.ListRecords(metadataPrefix='oai_dc', from="2012-12-12")
  File "<stdin>", line 1
    records = sickle.ListRecords(metadataPrefix='oai_dc', from="2012-12-12")
                                                          ^
SyntaxError: invalid syntax
Fortunately, you can circumvent this problem by using a dictionary together with the ** operator:
>>> records = sickle.ListRecords(
...     **{'metadataPrefix': 'oai_dc',
...        'from': '2012-12-12'
...     }
... )
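If you harvest by date regularly, it can be convenient to build the parameter dictionary in one place. The helper below is a sketch under that assumption; date_window is a made-up name, and the final call is left as a comment because it needs a Sickle instance and network access.

```python
from datetime import date


def date_window(start=None, end=None, metadata_prefix="oai_dc"):
    """Build ListRecords parameters for selective harvesting by date."""
    params = {"metadataPrefix": metadata_prefix}
    if start is not None:
        # 'from' is a reserved word, so it can only be used as a dict key.
        params["from"] = start.isoformat()
    if end is not None:
        params["until"] = end.isoformat()
    return params


params = date_window(date(2012, 12, 12), date(2012, 12, 31))
# records = sickle.ListRecords(**params)
```

The dates are formatted as YYYY-MM-DD, the day granularity that the protocol requires every repository to support.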
Sickle lets you conveniently iterate through resumption batches without having to deal with resumptionTokens yourself:
>>> records = sickle.ListRecords(metadataPrefix='oai_dc')
>>> records.next()
<Record oai:eprints.rclis.org:4088>
Note that this works with all requests that return more than one element. These are: ListRecords(), ListIdentifiers(), ListSets(), and ListMetadataFormats().
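The token handling that is hidden from you here can be sketched with a generator. In this illustration the repository is faked with an in-memory dictionary, so PAGES, fetch_batch and iter_all are hypothetical names and no HTTP request is made; Sickle's own iterator may be organized differently.

```python
# Fake repository: maps a resumption token (None for the first request)
# to one batch of identifiers plus the token for the following batch.
PAGES = {
    None: (["oai:example:1", "oai:example:2"], "token-1"),
    "token-1": (["oai:example:3", "oai:example:4"], "token-2"),
    "token-2": (["oai:example:5"], None),
}


def fetch_batch(token):
    """Stand-in for a single HTTP request; returns (items, next_token)."""
    return PAGES[token]


def iter_all(fetch):
    """Yield items one by one, requesting the next batch when needed."""
    token = None
    while True:
        items, token = fetch(token)
        for item in items:
            yield item
        # A missing or empty token marks the last batch.
        if not token:
            break


identifiers = iter_all(fetch_batch)
first = next(identifiers)
rest = list(identifiers)
```

Because the generator is lazy, the second batch is requested only once the first one has been consumed.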
Iterating through the headers returned by ListIdentifiers:
>>> headers = sickle.ListIdentifiers(metadataPrefix='oai_dc')
>>> headers.next()
<Header oai:eprints.rclis.org:4088>
Or through the sets returned by ListSets:
>>> sets = sickle.ListSets()
>>> sets.next()
<Set Status = In Press>
OAI-PMH allows you to retrieve a single record with the GetRecord verb, and Sickle provides a method of the same name:
>>> sickle.GetRecord(identifier='oai:eprints.rclis.org:4088',
...                  metadataPrefix='oai_dc')
<Record oai:eprints.rclis.org:4088>
The ListRecords() and ListIdentifiers() methods take an optional parameter ignore_deleted. If it is set to True, the returned OAIIterator will skip deleted records/headers:
>>> records = sickle.ListRecords(metadataPrefix='oai_dc', ignore_deleted=True)
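In OAI-PMH, a deleted record is marked by the attribute status="deleted" on its header. The following sketch shows what skipping such headers amounts to, using a made-up response fragment; SAMPLE_HEADERS and iter_identifiers are hypothetical names, and the code is not taken from Sickle.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses, in ElementTree's '{uri}' notation.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# A made-up fragment of a ListIdentifiers response.
SAMPLE_HEADERS = """<ListIdentifiers xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example:1</identifier>
    <datestamp>2012-12-12</datestamp>
  </header>
  <header status="deleted">
    <identifier>oai:example:2</identifier>
    <datestamp>2012-12-13</datestamp>
  </header>
  <header>
    <identifier>oai:example:3</identifier>
    <datestamp>2012-12-14</datestamp>
  </header>
</ListIdentifiers>"""


def iter_identifiers(xml_text, ignore_deleted=False):
    """Yield identifiers, optionally skipping headers marked as deleted."""
    root = ET.fromstring(xml_text)
    for header in root.iter(OAI_NS + "header"):
        if ignore_deleted and header.get("status") == "deleted":
            continue
        yield header.findtext(OAI_NS + "identifier")


all_ids = list(iter_identifiers(SAMPLE_HEADERS))
live_ids = list(iter_identifiers(SAMPLE_HEADERS, ignore_deleted=True))
```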