Changelog¶

v0.16.4 (2024-10-18)¶

Fixed a recently introduced bug that made Pandas required for the library to work.

v0.16.3 (2024-10-18)¶

Added a new skip_metadata parameter to the DwCAReader constructor. When set to True, the metadata file is not read. This can be useful for example if the metadata is corrupt (invalid XML) and you don’t need it.

v0.16.2 (2024-08-23)¶

Fix a packaging issue that prevented the release of v0.16.1 on PyPI.

v0.16.1 (2024-08-23)¶

Added official support for Python 3.12
Proper error message when trying to use an unsupported combination of Pandas option and archives with default values (issue #106).

v0.16.0 (2023-11-13)¶

Dropped Python 3.5 and 3.6 support. Use Python 3.7 to 3.11 instead.
Added a star record iterator to DwCAReader (thanks to @csbrown)

v0.15.1 (2023-01-17)¶

Workaround for errors with buggy GBIF downloads, see https://github.com/gbif/portal-feedback/issues/4533

v0.15.0 (2020-09-09)¶

Lower memory use for archives with extensions
GBIFResultsReader (deprecated since v0.10.0) is now removed
DwCAReader.get_row_by_id() (long deprecated) is now removed, use DwCAReader.get_corerow_by_id() instead
DwCAReader.get_row_by_index() (long deprecated) is now removed, use DwCAReader.get_corerow_by_position() instead

v0.14.0 (2020-04-27)¶

Dropped support for Python 2.7
API new: Temporary directory is now configurable

v0.13.2 (2019-09-20)¶

Better standard support: fields with data column can also have a default value (issue #80)

v0.13.1 (2018-08-30)¶

API new: DwCAReader.core_file_location
API new: String representation (__str__) for CSVDataFile objects.
API change: CSVDataFile.get_line_at_position() raises IndexError in case of line not found (previously: returned None)

v0.13.0 (2017-12-01)¶

Bugfix: better encoding support for Metadata file - see issue #73.
API change: DwCAReader.get_descriptor_for(filename) now raises NotADataFile exception if filename doesn’t exists (previously: None was returned).
API new: DwCAReader.core_file (previously private: _corefile).
API new: DwCAReader.extension_files (previously private: _extensionfiles).

v0.12.0 (2017-11-10)¶

API new: DwCAReader.pd_read() - See Pandas tutorial for usage.
API new: new NotADataFile exception.

v0.11.2 (2017-10-18)¶

API new: DwCAReader.get_descriptor_for()

v0.11.1 (2017-10-11)¶

API new: DataFileDescriptor.short_headers

v0.11.0 (2017-10-10)¶

Bugfix: An enclosed field can now contain the field separator (By Ben Cail)
API change: DwCAReader.get_row_by_id() is renamed to DwCAReader.get_corerow_by_id()
API change: DwCAReader.get_row_by_index() is renamed DwCAReader.get_corerow_by_position()
API new: DwCAReader.orphaned_extension_rows() (thanks to Pieter Provoost).
API new: CSVDataFile.coreid_index (was previously known as CSVDataFile._coreid_index).
API new: Row/CoreRow/ExtensionRow.position

v0.10.2 (2017-04-11)¶

experimental support for Windows.

v0.10.1 (2017-04-04)¶

fixed temporary directory: previously, it was always created under current dir instead of something chosen by Python such as /tmp.

v0.10.0 (2017-03-16)¶

GBIFResultsReader is now deprecated.
API new: DwCAReader now provides source_metadata for GBIF-like archives (previously the main perk of GBIFResultsReader)
API change: dwca.source_metadata is an empty dict (previously: None) when the archive doesn’t has source metadata
API new: dwca.utils._DataFile is now public and renamed to dwca.files.CSVDataFile.

v0.9.2 (2016-04-29)¶

Updated Darwin Core terms for the qualname helper (including Event Core).
Updated dev. script (https://github.com/BelgianBiodiversityPlatform/python-dwca-reader/issues/45).

v0.9.1 (2016-04-28)¶

API new: DwCAReader.open_included_file(relative_path).
InvalidArchive exception is now raised when a file descriptor references a non-existing field.

v0.9.0 (2016-04-05)¶

Support for new types of archives:
- DwCA without metafile (see page 2 of http://www.gbif.org/resource/80639), including or not a Metadata document (EML.xml). Fixes #47 and #7.
- DwCA where data fields are enclosed by some character (using fieldsEnclosedBy when the Archive provides a Metafile, autodetection otherwise). Fixes issue #53.
- Archives without a metadata attribute in the Metafile. See #51.
- Tgz archives.
API change: SectionDescriptor => DataFileDescriptor
API change: DataFileDescriptor.encoding => DataFileDescriptor.file_encoding
API change: the reader previously only supported zip archives, and raised BadZipFile when the file format was not understood. It nows also supports .tgz, and throw InvalidArchive when the archive cannot be opened as .zip nor as .tgz.
API new: DataFileDescriptor.fields_enclosed_by
API new: DwCAReader.use_extensions

v0.8.1 (2016-03-10)¶

Support for archives contained in a single (sub)directory. See https://github.com/BelgianBiodiversityPlatform/python-dwca-reader/issues/49

v0.8.0 (2016-02-11)¶

Experimental support for Python 3.5 (while maintaining compatibility with 2.7)

v0.7.0 (2015-08-20)¶

Python-dwca-reader doesn’t rely anymore on BeautifulSoup/lxml, using ElementTree from the standard library instead fot its XML parsing needs. This has a series of consequences: * It should be easier to install and deploy (also on platforms such as Jython). * All methods and attributes that used to return BeautifulSoup objects will now return xml.etree.ElementTree.Element instances. This includes DwCAReader.metadata. SectionDescriptor.raw_beautifulsoup and ArchiveDescriptor.raw_beautifulsoup have been replaced by SectionDescriptor.raw_element and ArchiveDescriptor.raw_element (Element objects).

v0.6.5 (2015-08-18)¶

New InvalidArchive exception. Currrently, it is only raised when a DwC-A references a metadata file that’s not present in the archive.

v0.6.4 (2015-02-17)¶

Performance: an optional ‘extension_to_ignores’ parameter (List) can be passed to DwCAReader’s constructor. In cases where an archive contains large but unneeded extensions, this can greatly improve memory consumption. A typical use-case for that would be the huge ‘verbatim.txt’ contained in GBIF downloads.

v0.6.3 (2015-02-16)¶

Performance: we now use core_id based indexes for extensions. There’s a memory penalty, but extension file parsing is now only done once.

v0.6.2 (2015-01-26)¶

Better performance with extensions.

v0.6.1 (2015-01-09)¶

It can now open not zipped (directory) Darwin Core Archives
More testing for Descriptor classes.
Better respect of the standard (http://rs.tdwg.org/dwc/terms/guides/text/):
- We now support default value (n) for linesTerminatedBy and fieldsTerminatedBy.
Lower memory use with large archives.

v0.6.0 (2014-08-08)¶

Better performance thanks to a better architecture
API add: brand new _ArchiveDescriptor and _SectionDescriptor
API change: DwCAReader.descriptor is an instance of _ArchiveDescriptor (previously BeautifulSoup)
API remove: DwCAReader.core_rowtype (use DwCAReader.descriptor.core.type instead)
API remove: DwCAReader.extensions_rowtype (use DwCAReader.descriptor.extensions_type instead)
API remove: DwCAReader.core_terms (use DwCAReader.descriptor.core.terms instead)

v0.5.1 (2014-08-05)¶

Performance: dramatically improved performance of get_row_by_index/looping for large files by
building an index of line positions at file opening (so there’s a slight overhead there)

v0.5.0 (2014-01-21)¶

API new: DwCAReader.descriptor
API change: “for core_line in dwca.each_line():” => “for core_row in dwca:”
API change: from_core and from_extension attributes of DwCALine (and sublasses) have been removed.
The isinstance built-in function can be used to test if a line is an instance of DwCACoreLine or of DwCAExtensionLine.
API change: DwCAReader.lines => DwCAReader.rows
API change: DwCACoreLine => CoreRow
API change: DwCAExtensionLine => ExtensionRow
API change: DWCAReader.get_line_by_id => DWCAReader.get_row_by_id
API change: DWCAReader.get_line_by_index => DWCAReader.get_row_by_index
API change: DwCAReader.get_row_by_* methods throw RowNotFound when failure instead of returning None
Cleaner code and better documentation.

v0.4.0 (2013-09-24)¶

API change: dwca.get_line() -> dwca.get_line_by_id()
API new: dwca.get_line_by_index()
(Core File) iteration order is now guaranteed for dwca.each_line()
Refactoring: DwCALine subclassed (as DwCACoreLine and DwCAExtensionLine)

v0.3.3 (2013-09-05)¶

DwCALines are now hashable.

v0.3.2 (2013-08-28)¶

API: added the dwca.absolute_temporary_path() method.

v0.3.1 (2013-08-09)¶

Bugfix: lxml added as a requirement.

v0.3.0 (2013-08-08)¶

XML parsing (metadata, EML, …) now uses BeautifulSoup 4 instead of v3.

v0.2.1 (2013-08-02)¶

Added a property (core_terms) to DwCAReader to get a list the DwC terms in use in the core file.

v0.2.0 (2013-07-31)¶

Specific support for GBIF Data portal (occurrences) export.
Small bug fixes.

v0.1.1 (2013-05-28)¶

Fixes packaging issues.

v0.1.0 (2013-05-28)¶

Initial release.