Complete API¶

Reader objects¶

High-level classes to open and read Darwin Core Archives.

class dwca.read.DwCAReader¶

Bases: object

This class is used to represent a Darwin Core Archive as a whole.

It gives read access to the contained data, to the scientific metadata, … It supports archives with or without a Metafile, as described on page 2 of the Reference Guide to the XML Descriptor.

Parameters

path (str) – path to the Darwin Core Archive (either a zip/tgz file or a directory) to open.
extensions_to_ignore (list) – paths (relative to the archive root) of extension data files to ignore. This will improve speed and memory usage for large archives. Missing files are silently ignored.
tmp_dir (str) – temporary directory to use to uncompress the archive (if needed). If not provided, Python’s default temporary directory is used.

Raises

dwca.exceptions.InvalidArchive
dwca.exceptions.InvalidSimpleArchive

Usage:

from dwca.read import DwCAReader

dwca = DwCAReader('my_archive.zip')

# Iterating on core rows is easy:
for core_row in dwca:
    # core_row is an instance of dwca.rows.CoreRow
    print(core_row)

# Scientific metadata (EML) is available as an ElementTree.Element object
print(dwca.metadata)

# Close the archive to free resources
dwca.close()

The archive can also be opened using the with statement. This is recommended, since it ensures resources are properly cleaned up after use:

from dwca.read import DwCAReader

with DwCAReader('my-archive.zip') as dwca:
    pass  # Do what you want

# When leaving the block, resources are automatically freed.

absolute_temporary_path(relative_path: str) → str¶

Return the absolute path of a file located within the archive.

This method allows raw access to the files contained in the archive. It can be useful to open additional, non-standard files embedded in the archive, or to open a standard file with another library.

Parameters: relative_path (str) – the path (relative to the archive root) of the file.
Returns: the absolute path to the file.

Usage:

dwca.absolute_temporary_path('occurrence.txt')  # => /tmp/afdfsec7/occurrence.txt

Warning

If the archive is contained in a zip or tgz file, the returned path will point to a temporary file that will be removed when closing the dwca.read.DwCAReader instance.

Note

File existence is not tested.
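
Since the method does not check that the file exists, a caller may want to verify the returned path before passing it to another library. The following is a minimal sketch of such a check; read_lines_if_present is a hypothetical helper (not part of the package) and relies only on the documented fact that a path string is returned.

```python
import os
from typing import List, Optional


def read_lines_if_present(absolute_path: str, encoding: str = "utf-8") -> Optional[List[str]]:
    """Return the lines of the file at absolute_path, or None if it does not exist.

    Hypothetical helper: absolute_path would typically be the value returned by
    DwCAReader.absolute_temporary_path(), which does not test file existence.
    """
    if not os.path.isfile(absolute_path):
        return None
    with open(absolute_path, encoding=encoding) as f:
        return f.read().splitlines()
```

It would be called as read_lines_if_present(dwca.absolute_temporary_path('occurrence.txt')), before the reader is closed, because the extracted files are removed at that point.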

archive_path = None¶: The path to the Darwin Core Archive file, as passed to the constructor.

close() → None¶

Close the Darwin Core Archive and remove temporary/working files.

Note

Alternatively, DwCAReader can be instantiated using the with statement (see example above).

core_contains_term(term_url: str) → bool¶: Return True if the Core file of the archive contains the term_url term.

core_file = None¶: An instance of dwca.files.CSVDataFile for the core data file.

property core_file_location¶

The (relative) path to the core data file.

Example: ‘occurrence.txt’

descriptor = None¶: A dwca.descriptors.ArchiveDescriptor instance giving access to the archive descriptor/metafile (meta.xml).

extension_files = None¶: A list of dwca.files.CSVDataFile, one entry for each extension data file, sorted by order of appearance in the Metafile (or an empty list if the archive doesn’t use extensions).

get_corerow_by_id(row_id: str) → dwca.rows.CoreRow¶

Return the (core) row whose ID is row_id.

Parameters: row_id (str) – ID of the core row you want
Returns: dwca.rows.CoreRow – the matching row.
Raises: dwca.exceptions.RowNotFound

Warning

It is rarely a good idea to rely on the row ID, because: 1) Not all Darwin Core Archives specify row IDs. 2) Nothing guarantees that the ID is actually unique within the archive (it depends on the data publisher). If several rows share the same ID, this method doesn’t guarantee which one will be returned. get_corerow_by_position() may be more appropriate in this case.
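
To check whether row IDs are safe to rely on for a given archive, the core rows can be scanned once for repeated IDs. This is a sketch under the assumption that each row exposes the id and position attributes documented for dwca.rows.CoreRow; positions_by_id and duplicated_ids are hypothetical helpers, not part of the package.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def positions_by_id(core_rows: Iterable) -> Dict[str, List[int]]:
    """Group core row positions by row ID.

    core_rows is any iterable of objects exposing `id` and `position`
    attributes, for example a DwCAReader instance being iterated.
    """
    positions = defaultdict(list)
    for row in core_rows:
        positions[row.id].append(row.position)
    return dict(positions)


def duplicated_ids(core_rows: Iterable) -> Dict[str, List[int]]:
    """Return only the IDs that appear on more than one core row."""
    grouped = positions_by_id(core_rows)
    return {row_id: pos for row_id, pos in grouped.items() if len(pos) > 1}
```

An empty result from duplicated_ids(dwca) suggests that IDs are unique in that archive; otherwise the listed positions can be passed to get_corerow_by_position().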

get_corerow_by_position(position: int) → dwca.rows.CoreRow¶

Return a core row according to its position/index in the core file.

Parameters: position (int) – the position (starting at 0) of the row you want in the core file.
Returns: dwca.rows.CoreRow – the matching row.
Raises: dwca.exceptions.RowNotFound

Note

If position is beyond the last row of the core file, dwca.exceptions.RowNotFound is raised.
The position is often an appropriate way to unambiguously identify a core row in a DwCA.

get_descriptor_for(relative_path: str) → dwca.descriptors.DataFileDescriptor¶

Return a descriptor for the data file located at relative_path.

Parameters: relative_path (str) – the path (relative to the archive root) to the data file you want info about.
Returns: dwca.descriptors.DataFileDescriptor
Raises: dwca.exceptions.NotADataFile if relative_path doesn’t reference a valid data file.

Examples:

dwca.get_descriptor_for('occurrence.txt')
dwca.get_descriptor_for('verbatim.txt')

metadata = None¶: An xml.etree.ElementTree.Element instance containing the (scientific) metadata of the archive, or None if the archive has no metadata or if the skip_metadata parameter is True.

open_included_file(relative_path: str, *args: Any, **kwargs: Any) → IO¶

Simple wrapper around Python’s built-in open function.

To be used only for reading.

Warning

Don’t forget to close the files after use. This is especially important on Windows, because temporary (extracted) files cannot be removed while they are still open.

orphaned_extension_rows()¶

Return a dict of the orphaned extension rows.

Orphaned extension rows are extension rows that reference non-existing core rows. This method returns a dict such as:

{'description.txt': {u'5': [3, 4], u'6': [5]},
 'vernacularname.txt': {u'7': [4]}}

Meaning:

in description.txt, rows at positions 3 and 4 reference a core row whose ID is ‘5’, but such a core row doesn’t exist. The row at position 5 references an imaginary core row with ID ‘6’.

in vernacularname.txt, the row at position 4 references an imaginary core row with ID ‘7’
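
The returned structure can be summarised with ordinary dict operations. As a sketch, the hypothetical helper count_orphans below counts orphaned rows per extension file; it only assumes the dict layout shown above.

```python
from typing import Dict, List


def count_orphans(orphaned: Dict[str, Dict[str, List[int]]]) -> Dict[str, int]:
    """Count orphaned extension rows per extension file."""
    return {
        filename: sum(len(positions) for positions in by_core_id.values())
        for filename, by_core_id in orphaned.items()
    }


# The dict from the example above
example = {
    "description.txt": {"5": [3, 4], "6": [5]},
    "vernacularname.txt": {"7": [4]},
}
print(count_orphans(example))  # {'description.txt': 3, 'vernacularname.txt': 1}
```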

pd_read(relative_path, **kwargs)¶

Return a Pandas DataFrame for the data file located at relative_path.

This method wraps pandas.read_csv() and accepts the same keyword arguments. The following arguments will be ignored (because they are set appropriately for the data file): delimiter, skiprows, header and names.

Parameters: relative_path (str) – path to the data file (relative to the archive root).
Raises: ImportError if Pandas is not installed.
Raises: dwca.exceptions.NotADataFile if relative_path doesn’t designate a valid data file in the archive.

Warning

You’ll need to install Pandas before using this method.

Note

Default values of Darwin Core Archive are supported: A column will be added to the DataFrame if a term has a default value in the Metafile (but no corresponding column in the CSV Data File). This is unfortunately not supported in case the value returned by pandas.read_csv() is a TextFileReader (e.g. when using chunksize or iterator=True).

property rows¶: A list of rows.CoreRow objects representing the content of the archive.

Warning

All rows will be loaded in memory. In case of a large Darwin Core Archive, you may prefer using a for loop.
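
As an illustration of the for-loop approach, the sketch below aggregates values one row at a time, so only the counters stay in memory. count_values is a hypothetical helper; it assumes each row exposes the data dict documented for dwca.rows.Row.

```python
from collections import Counter
from typing import Iterable

LOCALITY = "http://rs.tdwg.org/dwc/terms/locality"


def count_values(rows: Iterable, term: str) -> Counter:
    """Count the values of `term` across rows, one row at a time.

    rows can be a DwCAReader instance (iterating it yields core rows) or any
    iterable of objects with a `data` dict keyed by term URL.
    """
    counts = Counter()
    for row in rows:
        # Rows lacking the term are counted under the key None
        counts[row.data.get(term)] += 1
    return counts
```

With an open reader, count_values(dwca, LOCALITY) walks the core file once without building the full list that the rows property would create.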

source_metadata = None¶

If the archive contains source-level metadata (typically, GBIF downloads), this is a dict such as:

{'dataset1_UUID': <dataset1 EML> (xml.etree.ElementTree.Element object),
 'dataset2_UUID': <dataset2 EML> (xml.etree.ElementTree.Element object), ...}

See The GBIF Occurrence download format for more details.
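
A dict of this shape can be processed with the standard ElementTree API. The sketch below collects one title per dataset; dataset_titles is a hypothetical helper, and the dataset/title layout of the EML documents is an assumption that may not hold for every publisher.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def dataset_titles(source_metadata: Dict[str, ET.Element]) -> Dict[str, Optional[str]]:
    """Map each dataset key to the text of its <dataset>/<title> element.

    Assumes each value is the root element of an EML document whose
    unqualified <dataset> child carries a <title>; None is used otherwise.
    """
    return {key: root.findtext("dataset/title") for key, root in source_metadata.items()}
```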

property use_extensions¶: True if the archive makes use of extensions.

Row objects¶

Objects that represent data rows coming from Darwin Core Archives.

class dwca.rows.CoreRow(csv_line: str, position: int, datafile_descriptor: dwca.descriptors.DataFileDescriptor)¶

Bases: dwca.rows.Row

This class is used to represent a row/line from a Darwin Core Archive core data file.

You probably won’t instantiate it manually but rather obtain it via dwca.read.DwCAReader.get_corerow_by_position(), dwca.read.DwCAReader.get_corerow_by_id() or simply by looping over a dwca.read.DwCAReader object.

property extensions¶: A list of ExtensionRow instances that relate to this Core row.

id = None¶: The row id

class dwca.rows.ExtensionRow(csv_line: str, position: int, datafile_descriptor: dwca.descriptors.DataFileDescriptor)¶

Bases: dwca.rows.Row

This class is used to represent a row/line from a Darwin Core Archive extension data file.

Most of the time, you won’t instantiate it manually but rather obtain it through the extensions attribute of CoreRow.

core_id = None¶: The id of the core row this extension row is referring to.

class dwca.rows.Row(csv_line: str, position: int, datafile_descriptor: dwca.descriptors.DataFileDescriptor)¶

Bases: object

This class is used to represent a row/line in a Darwin Core Archive.

This class is intended to be subclassed rather than used directly.

data = None¶

A dict containing the Row data, such as:

{'dwc_term_1': 'value',
 'dwc_term_2': 'value',
 ...}

Usage:

myrow.data['http://rs.tdwg.org/dwc/terms/locality']  # => "Brussels"

Note

The dwca.darwincore.utils.qualname() helper is available to make such calls less verbose.

descriptor = None¶: An instance of dwca.descriptors.DataFileDescriptor describing the originating data file.

position = None¶: The row position/index (starting at 0) in the source data file. This can be used, for example with dwca.read.DwCAReader.get_corerow_by_position() or dwca.files.CSVDataFile.get_row_by_position().

raw_fields = None¶

rowtype = None¶: The CSV line type as stated in the archive descriptor (or None if the archive has no descriptor). Examples: http://rs.tdwg.org/dwc/terms/Occurrence, http://rs.gbif.org/terms/1.0/VernacularName, …

dwca.rows.csv_line_to_fields(csv_line, line_ending, field_ending, fields_enclosed_by)¶

Split a line from a CSV file.

Return a list of fields. Content is not trimmed.
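
The behaviour described here can be approximated with the standard csv module. This is a sketch under stated assumptions, not the package's implementation: split_csv_line is a hypothetical name, the field separator must be a single character (a csv module constraint), and an empty enclosure string is taken to mean that fields are not quoted.

```python
import csv
from typing import List


def split_csv_line(csv_line: str, line_ending: str, field_ending: str,
                   fields_enclosed_by: str) -> List[str]:
    """Split one line into fields without trimming their content."""
    # Drop the line terminator, if present
    if line_ending and csv_line.endswith(line_ending):
        csv_line = csv_line[: -len(line_ending)]

    if fields_enclosed_by:
        reader = csv.reader([csv_line], delimiter=field_ending,
                            quotechar=fields_enclosed_by)
    else:
        # No enclosure character: treat quote characters as ordinary text
        reader = csv.reader([csv_line], delimiter=field_ending,
                            quoting=csv.QUOTE_NONE)
    return next(reader)
```

For example, split_csv_line("1\t Brussels \tBelgium\n", "\n", "\t", "") keeps the spaces around " Brussels ", in line with the "not trimmed" remark.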

Descriptor objects¶

Classes representing the descriptors of a DwC-A.

ArchiveDescriptor represents the full archive descriptor, initialized from the metafile content.
DataFileDescriptor describes the characteristics of a given data file in the archive. It is created either from the subsection of the ArchiveDescriptor describing the data file, or by introspecting the CSV data file (useful for archives without a metafile).

class dwca.descriptors.ArchiveDescriptor¶

Bases: object

Class used to encapsulate the whole Metafile (meta.xml).

extensions = None¶: A list of dwca.descriptors.DataFileDescriptor instances describing each of the archive’s extension data files.

extensions_type = None¶

A list of extension (types) in use in the archive.

Example:

["http://rs.gbif.org/terms/1.0/VernacularName",
 "http://rs.gbif.org/terms/1.0/Description"]

metadata_filename = None¶: The path (relative to archive root) of the (scientific) metadata of the archive.

raw_element = None¶: An xml.etree.ElementTree.Element instance containing the complete Archive Descriptor.

class dwca.descriptors.DataFileDescriptor¶

Bases: object

These objects describe a data file from the archive.

They’re generally not instantiated manually, but rather by calling:

make_from_metafile_section() (if the archive contains a metafile)

make_from_file() (created by analyzing the data file)

coreid_index = None¶: If the section represents an extension data file, the index/position of the core_id column in that file. The core_id in an extension is the foreign key to the “extended” core row.

created_from_file = None¶: True if this descriptor was created by analyzing the data file.

fields = None¶

A list of dicts where each entry represents a data field in use.

Each dict contains:

The term identifier
(Possibly) a default value
The column index/position in the CSV file (except if we use a default value instead)

Example:

[{'term': 'http://rs.tdwg.org/dwc/terms/scientificName',
  'index': '1',
  'default': None},

 {'term': 'http://rs.tdwg.org/dwc/terms/locality',
  'index': '2',
  'default': ''},

 # The data for `country` is the default value 'Belgium' for all rows, so there's
 # no column in the CSV file.

 {'term': 'http://rs.tdwg.org/dwc/terms/country',
  'index': None,
  'default': 'Belgium'}]
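
To make the role of index and default concrete, here is a simplified sketch of how such a list could be combined with the column values of one line. fields_to_data is a hypothetical helper and not the package's own logic: a field without an index takes its default value, and any other field takes whatever its column contains.

```python
from typing import Dict, List, Optional


def fields_to_data(raw_fields: List[str], fields: List[dict]) -> Dict[str, Optional[str]]:
    """Build a {term: value} dict from split column values and field entries."""
    data = {}
    for field in fields:
        if field["index"] is None:
            # No column in the file: every row gets the default value
            data[field["term"]] = field["default"]
        else:
            # int() accepts both '1' (as in the example above) and 1
            data[field["term"]] = raw_fields[int(field["index"])]
    return data
```

Applied to the example list and the columns ['42', 'Tetraodon fluviatilis', 'Brussels'], the result maps country to 'Belgium' even though the line has no such column.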

fields_enclosed_by = None¶: The string or character used to enclose fields in the data file.

fields_terminated_by = None¶: The string or character used as a field separator in the data file. Example: “\t”.

file_encoding = None¶: The encoding of the data file. Example: “utf-8”.

file_location = None¶: The data file location, relative to the archive root.

property headers¶

A list of (ordered) column names that can be used to create a header line for the data file.

Example:

['id', 'http://rs.tdwg.org/dwc/terms/scientificName', 'http://rs.tdwg.org/dwc/terms/basisOfRecord',
'http://rs.tdwg.org/dwc/terms/family', 'http://rs.tdwg.org/dwc/terms/locality']

See also short_headers if you prefer less verbose headers.

id_index = None¶: If the section represents a core data file, the index/position of the id column in that file.

lines_terminated_by = None¶: The string or character used as a line separator in the data file. Example: “\n”.

property lines_to_ignore¶: Return the number of header lines/lines to ignore in the data file.

classmethod make_from_file(datafile_path)¶

Create and return a DataFileDescriptor by analyzing the file at datafile_path.

Parameters: datafile_path (str) – Relative path to a data file to analyze in order to instantiate the descriptor.

classmethod make_from_metafile_section(section_tag)¶

Create and return a DataFileDescriptor from a metafile <section> tag.

Parameters: section_tag (xml.etree.ElementTree.Element) – The XML Element section containing details about the data file.

raw_element = None¶: The <section> element describing the data file, from the metafile. None if the archive contains no metafile.

represents_corefile = None¶: True if this descriptor is used to represent the core file of an archive.

represents_extension = None¶: True if this descriptor is used to represent an extension file in an archive.

property short_headers¶

A list of (ordered) column names (short version) that can be used to create a header line for the data file.

Example:

['id', 'scientificName', 'basisOfRecord', 'family', 'locality']
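
The relationship between the two header forms in the examples can be reproduced by keeping the last path segment of each term URL. This is only a sketch of that relationship; shorten is a hypothetical helper, and terms that use a '#' separator would be left unchanged by it.

```python
from typing import List


def shorten(headers: List[str]) -> List[str]:
    """Keep the last '/'-separated segment of each header; plain names are unchanged."""
    return [h.rsplit("/", 1)[-1] for h in headers]


long_headers = [
    "id",
    "http://rs.tdwg.org/dwc/terms/scientificName",
    "http://rs.tdwg.org/dwc/terms/basisOfRecord",
    "http://rs.tdwg.org/dwc/terms/family",
    "http://rs.tdwg.org/dwc/terms/locality",
]
print(shorten(long_headers))
# ['id', 'scientificName', 'basisOfRecord', 'family', 'locality']
```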

File objects¶

File-related classes and functions.

class dwca.files.CSVDataFile(work_directory: str, file_descriptor: dwca.descriptors.DataFileDescriptor)¶

Object used to access a DwCA-enclosed CSV data file.

Parameters

work_directory – absolute path to the target directory (archive content, previously extracted if necessary).
file_descriptor – an instance of dwca.descriptors.DataFileDescriptor describing the data file.

The file content can be accessed:

By iterating on this object: a str is returned, including separators.
With get_row_by_position() (A dwca.rows.CoreRow or dwca.rows.ExtensionRow object is returned)
For an extension data file, with get_all_rows_by_coreid() (a list of dwca.rows.ExtensionRow objects is returned)

On initialization, an index of new lines is built. This may take time, but makes random access faster.

close() → None¶

Close the file.

The content of the file will not be accessible in any way afterwards.

property coreid_index¶

An index of the core rows referenced by this data file.

It is a Python dict such as:

{
core_id1: [1],    # Row at position 1 references a Core Row whose ID is core_id1
core_id2: [8, 10] # Rows at positions 8 and 10 reference a Core Row whose ID is core_id2
}

Warning

For performance reasons, dictionary values are array(‘L’) objects instead of regular Python lists.

Warning

Creating this index can be time and memory consuming for large archives, so it’s created on the fly at first access.
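
The shape of such an index can be illustrated with the standard array module. The sketch below is not the package's code: build_coreid_index is a hypothetical function that takes the core ID found on each extension row, in file order, and groups row positions by ID.

```python
from array import array
from collections import defaultdict
from typing import Dict, Iterable


def build_coreid_index(core_ids: Iterable[str]) -> Dict[str, array]:
    """Map each core ID to the positions of the rows that reference it."""
    # array('L') stores unsigned integers more compactly than a list of ints
    index = defaultdict(lambda: array("L"))
    for position, core_id in enumerate(core_ids):
        index[core_id].append(position)
    return dict(index)
```

Values of type array('L') support len(), indexing and iteration like lists; list(value) or value.tolist() converts one when a real list is required.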

file_descriptor = None¶: An instance of dwca.descriptors.DataFileDescriptor, as given to the constructor.

get_all_rows_by_coreid()¶: Return a list of dwca.rows.ExtensionRow objects whose core ID field matches core_id.

get_row_by_position(position: int) → Union[dwca.rows.CoreRow, dwca.rows.ExtensionRow]¶

Return the row at position in the file.

Header lines are ignored.

Raises: IndexError if there’s no line at position.

lines_to_ignore = None¶: Number of lines to ignore (header lines) in the CSV file.

Helpers¶

This module contains small helpers to make life easier.

dwca.darwincore.utils.qualname(short_term)¶

Takes a Darwin Core term (short form) and returns the corresponding qualified name.

Note

It is generally used to make data access less verbose (see example below).

Raises: StopIteration if short_term is not found.

Typical real-world example:

from dwca.darwincore.utils import qualname as qn

qn("Occurrence")  # => "http://rs.tdwg.org/dwc/terms/Occurrence"

# To access data row:
myrow.data[qn('scientificName')]  # => u"Tetraodon fluviatilis"

# Instead of the verbose:
myrow.data['http://rs.tdwg.org/dwc/terms/scientificName']  # => u"Tetraodon fluviatilis"
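
Because an unknown term raises StopIteration rather than a more common lookup error, callers may want to handle it explicitly. The sketch below mimics that behaviour with a tiny, illustrative term list; TERMS, lookup and lookup_or_none are hypothetical names, and the real package ships its own vocabulary.

```python
from typing import Iterable, Optional

# Tiny illustrative vocabulary, not the package's term list
TERMS = [
    "http://rs.tdwg.org/dwc/terms/Occurrence",
    "http://rs.tdwg.org/dwc/terms/scientificName",
    "http://rs.tdwg.org/dwc/terms/locality",
]


def lookup(short_term: str, terms: Iterable[str] = TERMS) -> str:
    """Return the first term ending in '/<short_term>'; raise StopIteration if none does."""
    return next(t for t in terms if t.endswith("/" + short_term))


def lookup_or_none(short_term: str, terms: Iterable[str] = TERMS) -> Optional[str]:
    """Same lookup, but return None instead of raising for unknown terms."""
    try:
        return lookup(short_term, terms)
    except StopIteration:
        return None
```

Catching the exception close to the call is a deliberate choice: a StopIteration that escapes inside a generator is converted to RuntimeError by Python (PEP 479), which can hide the original cause.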

Exceptions¶

Exceptions for the whole package.

exception dwca.exceptions.InvalidArchive¶: The archive appears to be invalid.

exception dwca.exceptions.InvalidSimpleArchive¶: The simple archive appears to be invalid.

exception dwca.exceptions.NotADataFile¶: The file doesn’t exist or is not a data file.

exception dwca.exceptions.RowNotFound¶: The DwC-A Row cannot be found.