Complete API

Reader objects

High-level classes to open and read Darwin Core Archives.

class dwca.read.DwCAReader

Bases: object

This class is used to represent a Darwin Core Archive as a whole.

It gives read access to the contained data, to the scientific metadata, … It supports archives with or without a Metafile, as described on page 2 of the Reference Guide to the XML Descriptor.

Parameters
  • path (str) – path to the Darwin Core Archive (either a zip/tgz file or a directory) to open.

  • extensions_to_ignore (list) – path (relative to the archive root) of extension data files to ignore. This will improve speed and memory usage for large archives. Missing files are silently ignored.

  • tmp_dir (str) – temporary directory to use to uncompress the archive (if needed). If not provided, Python default will be used.

Raises

dwca.exceptions.InvalidArchive

Raises

dwca.exceptions.InvalidSimpleArchive

Usage:

from dwca.read import DwCAReader

dwca = DwCAReader('my_archive.zip')

# Iterating on core rows is easy:
for core_row in dwca:
    # core_row is an instance of dwca.rows.CoreRow
    print(core_row)

# Scientific metadata (EML) is available as an ElementTree.Element object
print(dwca.metadata)

# Close the archive to free resources
dwca.close()

The archive can also be opened using the with statement. This is recommended, since it ensures resources are properly cleaned up after use:

from dwca.read import DwCAReader

with DwCAReader('my-archive.zip') as dwca:
    pass  # Do what you want

# When leaving the block, resources are automatically freed.

absolute_temporary_path(relative_path: str) → str

Return the absolute path of a file located within the archive.

This method allows raw access to the files contained in the archive. It can be useful to open additional, non-standard files embedded in the archive, or to open a standard file with another library.

Parameters

relative_path (str) – the path (relative to the archive root) of the file.

Returns

the absolute path to the file.

Usage:

dwca.absolute_temporary_path('occurrence.txt')  # => /tmp/afdfsec7/occurrence.txt

Warning

If the archive is contained in a zip or tgz file, the returned path will point to a temporary file that will be removed when closing the dwca.read.DwCAReader instance.

Note

File existence is not tested.
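Since the returned path is not checked, a caller may want to verify it before passing it to another library. The following is a minimal sketch: `existing_archive_path` is a hypothetical helper, not part of the dwca package, and `reader` is assumed to be an open DwCAReader (or any object exposing absolute_temporary_path()).

```python
import os
from typing import Optional


def existing_archive_path(reader, relative_path: str) -> Optional[str]:
    """Return the absolute path of relative_path, or None if it is not a file.

    Hypothetical helper: `reader` only needs an absolute_temporary_path()
    method, as documented for DwCAReader.
    """
    path = reader.absolute_temporary_path(relative_path)
    # absolute_temporary_path() does not test existence, so check it here.
    return path if os.path.isfile(path) else None
```

The path returned for a zip or tgz archive is only valid until the reader is closed, so the check should happen while the reader is still open.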

archive_path = None

The path to the Darwin Core Archive file, as passed to the constructor.

close() → None

Close the Darwin Core Archive and remove temporary/working files.

Note

  • Alternatively, DwCAReader can be instantiated using the with statement (see example above).

core_contains_term(term_url: str) → bool

Return True if the Core file of the archive contains the term_url term.

core_file = None

An instance of dwca.files.CSVDataFile for the core data file.

property core_file_location

The (relative) path to the core data file.

Example: ‘occurrence.txt’

descriptor = None

A dwca.descriptors.ArchiveDescriptor instance giving access to the archive descriptor/metafile (meta.xml).

extension_files = None

A list of dwca.files.CSVDataFile, one entry for each extension data file, sorted by order of appearance in the Metafile (or an empty list if the archive doesn’t use extensions).

get_corerow_by_id(row_id: str) → dwca.rows.CoreRow

Return the (core) row whose ID is row_id.

Parameters

row_id (str) – ID of the core row you want

Returns

dwca.rows.CoreRow – the matching row.

Raises

dwca.exceptions.RowNotFound

Warning

It is rarely a good idea to rely on the row ID, because: 1) Not all Darwin Core Archives specify row IDs. 2) Nothing guarantees that the ID is actually unique within the archive (this depends on the data publisher). If several rows share an ID, this method doesn’t guarantee which one will be returned. get_corerow_by_position() may be more appropriate in this case.
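To find out whether row IDs are safe to rely on for a given archive, one option is to count them while iterating. This is a sketch only: `duplicated_core_ids` is a hypothetical helper, and it assumes that rows without an ID expose `id` as None.

```python
from collections import Counter
from typing import Dict, Iterable


def duplicated_core_ids(core_rows: Iterable) -> Dict[str, int]:
    """Return {row_id: count} for every core row ID seen more than once.

    `core_rows` is any iterable of objects with an `id` attribute, for
    example an open DwCAReader (iterating on it yields CoreRow objects).
    Rows whose id is None are skipped (assumption: None means "no ID").
    """
    counts = Counter(row.id for row in core_rows if row.id is not None)
    return {row_id: n for row_id, n in counts.items() if n > 1}
```

An empty result means get_corerow_by_id() is unambiguous for that archive. The scan reads the whole core file once.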

get_corerow_by_position(position: int) → dwca.rows.CoreRow

Return a core row according to its position/index in core file.

Parameters

position (int) – the position (starting at 0) of the row you want in the core file.

Returns

dwca.rows.CoreRow – the matching row.

Raises

dwca.exceptions.RowNotFound

Note

  • If position is bigger than the length of the archive, dwca.exceptions.RowNotFound is raised.

  • The position is often an appropriate way to unambiguously identify a core row in a DwCA.

get_descriptor_for(relative_path: str) → dwca.descriptors.DataFileDescriptor

Return a descriptor for the data file located at relative_path.

Parameters

relative_path (str) – the path (relative to the archive root) to the data file you want info about.

Returns

dwca.descriptors.DataFileDescriptor

Raises

dwca.exceptions.NotADataFile if relative_path doesn’t reference a valid data file.

Examples:

dwca.get_descriptor_for('occurrence.txt')
dwca.get_descriptor_for('verbatim.txt')

metadata = None

An xml.etree.ElementTree.Element instance containing the (scientific) metadata of the archive, or None if the archive has no metadata.

open_included_file(relative_path: str, *args: Any, **kwargs: Any) → IO

Simple wrapper around Python’s built-in open function.

To be used only for reading.

Warning

Don’t forget to close the files after usage. This is especially important on Windows because temporary (extracted) files won’t be cleanable if not closed.
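One way to make sure the file is always closed is to use it as a context manager. The sketch below assumes that extra arguments are forwarded to open() (as the signature suggests) and that `reader` is an open DwCAReader. `read_included_text` is a hypothetical helper name.

```python
def read_included_text(reader, relative_path: str, encoding: str = "utf-8") -> str:
    """Read a whole text file from the archive and close it right away.

    Hypothetical helper: relies on open_included_file() returning a regular
    file object, which can be used in a with statement.
    """
    # The with statement closes the file even if read() raises.
    with reader.open_included_file(relative_path, "r", encoding=encoding) as handle:
        return handle.read()
```

Closing the file inside the helper means the temporary directory can be removed when the reader is closed, including on Windows.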

orphaned_extension_rows()

Return a dict of the orphaned extension rows.

Orphaned extension rows are extension rows that reference non-existing core rows. This method returns a dict such as:

{'description.txt': {u'5': [3, 4], u'6': [5]},
 'vernacularname.txt': {u'7': [4]}}

Meaning:

  • in description.txt, rows at positions 3 and 4 reference a core row whose ID is ‘5’, but such a core row doesn’t exist. The row at position 5 references an imaginary core row with ID ‘6’

  • in vernacularname.txt, the row at position 4 references an imaginary core row with ID ‘7’
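A dict of this shape can be summarized with plain Python. The sketch below counts orphaned rows per data file. `count_orphaned_rows` is a hypothetical helper, and the sample data simply repeats the example above.

```python
from typing import Dict, List


def count_orphaned_rows(orphans: Dict[str, Dict[str, List[int]]]) -> Dict[str, int]:
    """Return the number of orphaned extension rows for each data file."""
    return {
        filename: sum(len(positions) for positions in by_core_id.values())
        for filename, by_core_id in orphans.items()
    }


# Same structure as the value returned by orphaned_extension_rows()
example = {
    "description.txt": {"5": [3, 4], "6": [5]},
    "vernacularname.txt": {"7": [4]},
}

print(count_orphaned_rows(example))
# {'description.txt': 3, 'vernacularname.txt': 1}
```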

pd_read(relative_path, **kwargs)

Return a Pandas DataFrame for the data file located at relative_path.

This method wraps pandas.read_csv() and accepts the same keyword arguments. The following arguments will be ignored (because they are set appropriately for the data file): delimiter, skiprows, header and names.

Parameters

relative_path (str) – path to the data file (relative to the archive root).

Raises

ImportError if Pandas is not installed.

Raises

dwca.exceptions.NotADataFile if relative_path doesn’t designate a valid data file in the archive.

Warning

You’ll need to install Pandas before using this method.

Note

Default values of Darwin Core Archive are supported: A column will be added to the DataFrame if a term has a default value in the Metafile (but no corresponding column in the CSV Data File).
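Because keyword arguments are passed on to pandas.read_csv(), standard options such as nrows can be used to limit what is loaded. The following sketch previews the first rows of the core file. `preview_core_file` is a hypothetical helper, and `reader` is assumed to be an open DwCAReader with Pandas installed.

```python
def preview_core_file(reader, n_rows: int = 5, **read_csv_kwargs):
    """Return a DataFrame with at most n_rows rows of the core data file.

    Hypothetical helper: `nrows` is a standard pandas.read_csv() keyword;
    other keywords (except delimiter, skiprows, header and names, which
    pd_read() ignores) can be given through read_csv_kwargs.
    """
    return reader.pd_read(reader.core_file_location, nrows=n_rows, **read_csv_kwargs)
```

A typical call could look like `preview_core_file(dwca, 10, dtype=str)`, where dtype is another standard pandas.read_csv() keyword.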

property rows

A list of dwca.rows.CoreRow objects representing the content of the archive.

Warning

All rows will be loaded in memory. In case of a large Darwin Core Archive, you may prefer using a for loop.
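When only part of a large archive is needed, iterating lazily avoids building the full list. A minimal sketch using itertools.islice follows. `take_rows` is a hypothetical helper that accepts any iterable, for instance an open DwCAReader.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_rows(core_rows: Iterable, limit: int) -> Iterator:
    """Yield at most `limit` rows without loading the remaining ones."""
    # islice stops pulling from the underlying iterable once `limit` is reached.
    return islice(core_rows, limit)
```

For example, `for row in take_rows(dwca, 100): ...` would process only the first 100 core rows.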

source_metadata = None

If the archive contains source-level metadata (typically, GBIF downloads), this is a dict such as:

{'dataset1_UUID': <dataset1 EML> (xml.etree.ElementTree.Element object),
 'dataset2_UUID': <dataset2 EML> (xml.etree.ElementTree.Element object), ...}

See The GBIF Occurrence download format for more details.
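Each value in this dict can be queried like any other ElementTree element. The sketch below collects dataset titles, assuming the usual EML layout in which the root element has a `dataset` child containing a `title`. `dataset_titles` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def dataset_titles(
    source_metadata: Optional[Dict[str, ET.Element]],
) -> Dict[str, Optional[str]]:
    """Map each dataset key to its EML title (None if no title is found).

    Assumes each EML document stores its title at dataset/title relative to
    the root element. Accepts None for archives without source metadata.
    """
    titles = {}
    for dataset_key, eml_root in (source_metadata or {}).items():
        # findtext() returns None when the path does not match.
        titles[dataset_key] = eml_root.findtext("dataset/title")
    return titles
```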

property use_extensions

True if the archive makes use of extensions.

Row objects

Objects that represent data rows coming from Darwin Core Archives.

class dwca.rows.CoreRow(csv_line: str, position: int, datafile_descriptor: dwca.descriptors.DataFileDescriptor)

Bases: dwca.rows.Row

This class is used to represent a row/line from a Darwin Core Archive core data file.

You probably won’t instantiate it manually but rather obtain it via dwca.read.DwCAReader.get_corerow_by_position(), dwca.read.DwCAReader.get_corerow_by_id() or simply by looping over a dwca.read.DwCAReader object.

property extensions

A list of ExtensionRow instances that relate to this Core row.
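When a core row has rows from several extensions, grouping them by rowtype can make them easier to process. A minimal sketch follows. `extensions_by_rowtype` is a hypothetical helper that relies only on the documented extensions and rowtype attributes.

```python
from collections import defaultdict
from typing import Dict, List


def extensions_by_rowtype(core_row) -> Dict[str, List]:
    """Group the extension rows of a core row by their rowtype."""
    grouped = defaultdict(list)
    for extension_row in core_row.extensions:
        grouped[extension_row.rowtype].append(extension_row)
    return dict(grouped)
```

The keys are row type URLs such as http://rs.gbif.org/terms/1.0/VernacularName.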

id = None

The row id

class dwca.rows.ExtensionRow(csv_line: str, position: int, datafile_descriptor: dwca.descriptors.DataFileDescriptor)

Bases: dwca.rows.Row

This class is used to represent a row/line from a Darwin Core Archive extension data file.

Most of the time, you won’t instantiate it manually but rather obtain it through the extensions attribute of CoreRow.

core_id = None

The id of the core row this extension row is referring to.

class dwca.rows.Row(csv_line: str, position: int, datafile_descriptor: dwca.descriptors.DataFileDescriptor)

Bases: object

This class is used to represent a row/line in a Darwin Core Archive.

This class is intended to be subclassed rather than used directly.

data = None

A dict containing the Row data, such as:

{'dwc_term_1': 'value',
 'dwc_term_2': 'value',
 ...}

Usage:

myrow.data['http://rs.tdwg.org/dwc/terms/locality']  # => "Brussels"

Note

The dwca.darwincore.utils.qualname() helper is available to make such calls less verbose.
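For display or export, it can also be convenient to re-key the data by short term name. The sketch below takes the last path segment of each term URL. `with_short_keys` is a hypothetical helper, and it assumes no two terms in the row share the same last segment (otherwise the later one wins).

```python
from typing import Dict


def with_short_keys(data: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of a Row.data dict keyed by the last URL segment of each term."""
    return {term.rsplit("/", 1)[-1]: value for term, value in data.items()}


row_data = {
    "http://rs.tdwg.org/dwc/terms/locality": "Brussels",
    "http://rs.tdwg.org/dwc/terms/country": "Belgium",
}

print(with_short_keys(row_data))
# {'locality': 'Brussels', 'country': 'Belgium'}
```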

descriptor = None

An instance of dwca.descriptors.DataFileDescriptor describing the originating data file.

position = None

The row position/index (starting at 0) in the source data file. This can be used, for example with dwca.read.DwCAReader.get_corerow_by_position() or dwca.files.CSVDataFile.get_row_by_position().

raw_fields = None
rowtype = None

The CSV line type as stated in the archive descriptor (or None if the archive has no descriptor). Examples: http://rs.tdwg.org/dwc/terms/Occurrence, http://rs.gbif.org/terms/1.0/VernacularName, …

dwca.rows.csv_line_to_fields(csv_line, line_ending, field_ending, fields_enclosed_by)

Split a line from a CSV file.

Return a list of fields. Content is not trimmed.

Descriptor objects

Classes to represent descriptors of a DwC-A.

  • ArchiveDescriptor represents the full archive descriptor, initialized from the metafile content.

  • DataFileDescriptor describes characteristics of a given data file in the archive. It’s created either from the subsection of the ArchiveDescriptor describing the data file, or by introspecting the CSV data file (useful for archives without a metafile).

class dwca.descriptors.ArchiveDescriptor

Bases: object

Class used to encapsulate the whole Metafile (meta.xml).

extensions = None

A list of dwca.descriptors.DataFileDescriptor instances describing each of the archive’s extension data files.

extensions_type = None

A list of extension (types) in use in the archive.

Example:

["http://rs.gbif.org/terms/1.0/VernacularName",
 "http://rs.gbif.org/terms/1.0/Description"]
metadata_filename = None

The path (relative to archive root) of the (scientific) metadata of the archive.

raw_element = None

An xml.etree.ElementTree.Element instance containing the complete Archive Descriptor.

class dwca.descriptors.DataFileDescriptor

Bases: object

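These objects describe a data file from the archive.

They’re generally not instantiated manually, but rather by calling make_from_metafile_section() (from a metafile section) or make_from_file() (by analyzing the data file).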

coreid_index = None

If the section represents an extension data file, the index/position of the core_id column in that file. The core_id in an extension is the foreign key to the “extended” core row.

created_from_file = None

True if this descriptor was created by analyzing the data file.

fields = None

A list of dicts where each entry represents a data field in use.

Each dict contains:
  • The term identifier

  • (Possibly) a default value

  • The column index/position in the CSV file (except if we use a default value instead)

Example:

[{'term': 'http://rs.tdwg.org/dwc/terms/scientificName',
  'index': '1',
  'default': None},

 {'term': 'http://rs.tdwg.org/dwc/terms/locality',
  'index': '2',
  'default': ''},

 # The data for `country` is the default value 'Belgium' for all rows, so there's
 # no column in CSV file.

 {'term': 'http://rs.tdwg.org/dwc/terms/country',
  'index': None,
  'default': 'Belgium'}]
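
A list of this shape can be split into terms that have a column in the CSV file and terms that only exist as default values. This is a sketch on plain dicts. `split_fields` is a hypothetical helper, and the sample data repeats the example above.

```python
from typing import Dict, List, Optional, Tuple


def split_fields(fields: List[dict]) -> Tuple[List[str], Dict[str, Optional[str]]]:
    """Separate terms stored in a CSV column from terms provided only by default.

    Returns (terms_with_column, {term: default_value}).
    """
    with_column = [f["term"] for f in fields if f["index"] is not None]
    default_only = {f["term"]: f["default"] for f in fields if f["index"] is None}
    return with_column, default_only


example_fields = [
    {"term": "http://rs.tdwg.org/dwc/terms/scientificName", "index": "1", "default": None},
    {"term": "http://rs.tdwg.org/dwc/terms/locality", "index": "2", "default": ""},
    {"term": "http://rs.tdwg.org/dwc/terms/country", "index": None, "default": "Belgium"},
]

columns, defaults = split_fields(example_fields)
print(defaults)
# {'http://rs.tdwg.org/dwc/terms/country': 'Belgium'}
```
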
fields_enclosed_by = None

The string or character used to enclose fields in the data file.

fields_terminated_by = None

The string or character used as a field separator in the data file. Example: “\t”.

file_encoding = None

The encoding of the data file. Example: “utf-8”.

file_location = None

The data file location, relative to the archive root.

property headers

A list of (ordered) column names that can be used to create a header line for the data file.

Example:

['id', 'http://rs.tdwg.org/dwc/terms/scientificName', 'http://rs.tdwg.org/dwc/terms/basisOfRecord',
'http://rs.tdwg.org/dwc/terms/family', 'http://rs.tdwg.org/dwc/terms/locality']

See also short_headers if you prefer less verbose headers.

id_index = None

If the section represents a core data file, the index/position of the id column in that file.

lines_terminated_by = None

The string or character used as a line separator in the data file. Example: “\n”.

property lines_to_ignore

Return the number of header lines/lines to ignore in the data file.

classmethod make_from_file(datafile_path)

Create and return a DataFileDescriptor by analyzing the file at datafile_path.

Parameters

datafile_path (str) – Relative path to a data file to analyze in order to instantiate the descriptor.

classmethod make_from_metafile_section(section_tag)

Create and return a DataFileDescriptor from a metafile <section> tag.

Parameters

section_tag (xml.etree.ElementTree.Element) – The XML Element section containing details about the data file.

raw_element = None

The <section> element describing the data file, from the metafile. None if the archive contains no metafile.

represents_corefile = None

True if this descriptor is used to represent the core file of an archive.

represents_extension = None

True if this descriptor is used to represent an extension file in an archive.

property short_headers

A list of (ordered) column names (short version) that can be used to create a header line for the data file.

Example:

['id', 'scientificName', 'basisOfRecord', 'family', 'locality']

See also headers.
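Combined with the separator attributes of the descriptor, these lists are enough to build a header line. A minimal sketch, assuming fields_terminated_by and lines_terminated_by are set on the descriptor. `build_header_line` is a hypothetical helper, and field enclosure (fields_enclosed_by) is deliberately left out.

```python
def build_header_line(descriptor, short: bool = True) -> str:
    """Build a header line for the data file described by `descriptor`.

    Uses short_headers by default, or headers (full term URLs) if short is False.
    """
    names = descriptor.short_headers if short else descriptor.headers
    return descriptor.fields_terminated_by.join(names) + descriptor.lines_terminated_by
```

This can be useful when exporting a data file that has no header line of its own.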

property terms

Return a Python set containing all the Darwin Core terms appearing in the file.

type = None

File objects

File-related classes and functions.

class dwca.files.CSVDataFile(work_directory: str, file_descriptor: dwca.descriptors.DataFileDescriptor)

Object used to access a DwCA-enclosed CSV data file.

Parameters
  • work_directory – absolute path to the target directory (archive content, previously extracted if necessary).

  • file_descriptor – an instance of dwca.descriptors.DataFileDescriptor describing the data file.

The file content can be accessed with get_row_by_position() or, for an extension data file, with get_all_rows_by_coreid().

On initialization, an index of new lines is built. This may take time, but makes random access faster.

close() → None

Close the file.

The content of the file will not be accessible in any way afterwards.

property coreid_index

An index of the core rows referenced by this data file.

It is a Python dict such as:

{
core_id1: [1],    # Row at position 1 references a Core Row whose ID is core_id1
core_id2: [8, 10] # Rows at positions 8 and 10 reference a Core Row whose ID is core_id2
}

Warning

For performance reasons, dictionary values are array('L') objects (from Python’s array module) instead of regular Python lists.

Warning

Creating this index can be time and memory consuming for large archives, so it’s created on the fly at first access.
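Since the values are arrays rather than lists, converting them is the simplest way to compare or serialize positions. The sketch below uses a hand-built index with the structure described above. `positions_for` is a hypothetical helper.

```python
from array import array
from typing import List


def positions_for(coreid_index, core_id) -> List[int]:
    """Return the row positions referencing core_id as a regular list.

    Returns an empty list when core_id is not present in the index.
    """
    return list(coreid_index.get(core_id, ()))


# Hand-built stand-in with the same structure as CSVDataFile.coreid_index
sample_index = {
    "core_id1": array("L", [1]),
    "core_id2": array("L", [8, 10]),
}

print(positions_for(sample_index, "core_id2"))
# [8, 10]
```

Arrays support iteration, indexing and len() like lists, but comparing an array directly with a list using == evaluates to False, hence the conversion.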

file_descriptor = None

An instance of dwca.descriptors.DataFileDescriptor, as given to the constructor.

get_all_rows_by_coreid()

Return a list of dwca.rows.ExtensionRow whose Core Id field matches core_id.

get_row_by_position(position: int) → Union[dwca.rows.CoreRow, dwca.rows.ExtensionRow]

Return the row at position in the file.

Header lines are ignored.

Raises

IndexError if there’s no line at position.

lines_to_ignore = None

Number of lines to ignore (header lines) in the CSV file.

Helpers

This module contains small helpers to make life easier.

dwca.darwincore.utils.qualname(short_term)

Takes a Darwin Core term (short form) and returns the corresponding qualname.

Note

It is generally used to make data access less verbose (see example below).

Raises

StopIteration if short_term is not found.

Typical real-world example:

from dwca.darwincore.utils import qualname as qn

qn("Occurrence")  # => "http://rs.tdwg.org/dwc/terms/Occurrence"

# To access data row:
myrow.data[qn('scientificName')]  # => u"Tetraodon fluviatilis"

# Instead of the verbose:
myrow.data['http://rs.tdwg.org/dwc/terms/scientificName']  # => u"Tetraodon fluviatilis"

Exceptions

Exceptions for the whole package.

exception dwca.exceptions.InvalidArchive

The archive appears to be invalid.

exception dwca.exceptions.InvalidSimpleArchive

The simple archive appears to be invalid.

exception dwca.exceptions.NotADataFile

The file doesn’t exist or is not a data file.

exception dwca.exceptions.RowNotFound

The DwC-A Row cannot be found.