Changelog

v0.16.0 (2023-11-13)

  • Dropped Python 3.5 and 3.6 support. Use Python 3.7 to 3.11 instead.

  • Added a star record iterator to DwCAReader (thanks to @csbrown)

v0.15.1 (2023-01-17)

v0.15.0 (2020-09-09)

  • Lower memory use for archives with extensions

  • GBIFResultsReader (deprecated since v0.10.0) is now removed

  • DwCAReader.get_row_by_id() (long deprecated) is now removed, use DwCAReader.get_corerow_by_id() instead

  • DwCAReader.get_row_by_index() (long deprecated) is now removed, use DwCAReader.get_corerow_by_position() instead

v0.14.0 (2020-04-27)

  • Dropped support for Python 2.7

  • API new: Temporary directory is now configurable

v0.13.2 (2019-09-20)

  • Better standard support: fields with data column can also have a default value (issue #80)

v0.13.1 (2018-08-30)

  • API new: DwCAReader.core_file_location

  • API new: String representation (__str__) for CSVDataFile objects.

  • API change: CSVDataFile.get_line_at_position() raises IndexError in case of line not found (previously: returned None)

v0.13.0 (2017-12-01)

  • Bugfix: better encoding support for Metadata file - see issue #73.

  • API change: DwCAReader.get_descriptor_for(filename) now raises NotADataFile exception if filename doesn’t exists (previously: None was returned).

  • API new: DwCAReader.core_file (previously private: _corefile).

  • API new: DwCAReader.extension_files (previously private: _extensionfiles).

v0.12.0 (2017-11-10)

  • API new: DwCAReader.pd_read() - See Pandas tutorial for usage.

  • API new: new NotADataFile exception.

v0.11.2 (2017-10-18)

  • API new: DwCAReader.get_descriptor_for()

v0.11.1 (2017-10-11)

  • API new: DataFileDescriptor.short_headers

v0.11.0 (2017-10-10)

  • Bugfix: An enclosed field can now contain the field separator (By Ben Cail)

  • API change: DwCAReader.get_row_by_id() is renamed to DwCAReader.get_corerow_by_id()

  • API change: DwCAReader.get_row_by_index() is renamed DwCAReader.get_corerow_by_position()

  • API new: DwCAReader.orphaned_extension_rows() (thanks to Pieter Provoost).

  • API new: CSVDataFile.coreid_index (was previously known as CSVDataFile._coreid_index).

  • API new: Row/CoreRow/ExtensionRow.position

v0.10.2 (2017-04-11)

  • experimental support for Windows.

v0.10.1 (2017-04-04)

  • fixed temporary directory: previously, it was always created under current dir instead of something chosen by Python such as /tmp.

v0.10.0 (2017-03-16)

  • GBIFResultsReader is now deprecated.

  • API new: DwCAReader now provides source_metadata for GBIF-like archives (previously the main perk of GBIFResultsReader)

  • API change: dwca.source_metadata is an empty dict (previously: None) when the archive doesn’t has source metadata

  • API new: dwca.utils._DataFile is now public and renamed to dwca.files.CSVDataFile.

v0.9.2 (2016-04-29)

v0.9.1 (2016-04-28)

  • API new: DwCAReader.open_included_file(relative_path).

  • InvalidArchive exception is now raised when a file descriptor references a non-existing field.

v0.9.0 (2016-04-05)

  • Support for new types of archives:

    • DwCA without metafile (see page 2 of http://www.gbif.org/resource/80639), including or not a Metadata document (EML.xml). Fixes #47 and #7.

    • DwCA where data fields are enclosed by some character (using fieldsEnclosedBy when the Archive provides a Metafile, autodetection otherwise). Fixes issue #53.

    • Archives without a metadata attribute in the Metafile. See #51.

    • Tgz archives.

  • API change: SectionDescriptor => DataFileDescriptor

  • API change: DataFileDescriptor.encoding => DataFileDescriptor.file_encoding

  • API change: the reader previously only supported zip archives, and raised BadZipFile when the file format was not understood. It nows also supports .tgz, and throw InvalidArchive when the archive cannot be opened as .zip nor as .tgz.

  • API new: DataFileDescriptor.fields_enclosed_by

  • API new: DwCAReader.use_extensions

v0.8.1 (2016-03-10)

v0.8.0 (2016-02-11)

  • Experimental support for Python 3.5 (while maintaining compatibility with 2.7)

v0.7.0 (2015-08-20)

  • Python-dwca-reader doesn’t rely anymore on BeautifulSoup/lxml, using ElementTree from the standard library instead fot its XML parsing needs. This has a series of consequences: * It should be easier to install and deploy (also on platforms such as Jython). * All methods and attributes that used to return BeautifulSoup objects will now return xml.etree.ElementTree.Element instances. This includes DwCAReader.metadata. SectionDescriptor.raw_beautifulsoup and ArchiveDescriptor.raw_beautifulsoup have been replaced by SectionDescriptor.raw_element and ArchiveDescriptor.raw_element (Element objects).

v0.6.5 (2015-08-18)

  • New InvalidArchive exception. Currrently, it is only raised when a DwC-A references a metadata file that’s not present in the archive.

v0.6.4 (2015-02-17)

  • Performance: an optional ‘extension_to_ignores’ parameter (List) can be passed to DwCAReader’s constructor. In cases where an archive contains large but unneeded extensions, this can greatly improve memory consumption. A typical use-case for that would be the huge ‘verbatim.txt’ contained in GBIF downloads.

v0.6.3 (2015-02-16)

  • Performance: we now use core_id based indexes for extensions. There’s a memory penalty, but extension file parsing is now only done once.

v0.6.2 (2015-01-26)

  • Better performance with extensions.

v0.6.1 (2015-01-09)

  • It can now open not zipped (directory) Darwin Core Archives

  • More testing for Descriptor classes.

  • Better respect of the standard (http://rs.tdwg.org/dwc/terms/guides/text/):
    • We now support default value (n) for linesTerminatedBy and fieldsTerminatedBy.

  • Lower memory use with large archives.

v0.6.0 (2014-08-08)

  • Better performance thanks to a better architecture

  • API add: brand new _ArchiveDescriptor and _SectionDescriptor

  • API change: DwCAReader.descriptor is an instance of _ArchiveDescriptor (previously BeautifulSoup)

  • API remove: DwCAReader.core_rowtype (use DwCAReader.descriptor.core.type instead)

  • API remove: DwCAReader.extensions_rowtype (use DwCAReader.descriptor.extensions_type instead)

  • API remove: DwCAReader.core_terms (use DwCAReader.descriptor.core.terms instead)

v0.5.1 (2014-08-05)

  • Performance: dramatically improved performance of get_row_by_index/looping for large files by

    building an index of line positions at file opening (so there’s a slight overhead there)

v0.5.0 (2014-01-21)

  • API new: DwCAReader.descriptor

  • API change: “for core_line in dwca.each_line():” => “for core_row in dwca:”

  • API change: from_core and from_extension attributes of DwCALine (and sublasses) have been removed.

    The isinstance built-in function can be used to test if a line is an instance of DwCACoreLine or of DwCAExtensionLine.

  • API change: DwCAReader.lines => DwCAReader.rows

  • API change: DwCACoreLine => CoreRow

  • API change: DwCAExtensionLine => ExtensionRow

  • API change: DWCAReader.get_line_by_id => DWCAReader.get_row_by_id

  • API change: DWCAReader.get_line_by_index => DWCAReader.get_row_by_index

  • API change: DwCAReader.get_row_by_* methods throw RowNotFound when failure instead of returning None

  • Cleaner code and better documentation.

v0.4.0 (2013-09-24)

  • API change: dwca.get_line() -> dwca.get_line_by_id()

  • API new: dwca.get_line_by_index()

  • (Core File) iteration order is now guaranteed for dwca.each_line()

  • Refactoring: DwCALine subclassed (as DwCACoreLine and DwCAExtensionLine)

v0.3.3 (2013-09-05)

  • DwCALines are now hashable.

v0.3.2 (2013-08-28)

  • API: added the dwca.absolute_temporary_path() method.

v0.3.1 (2013-08-09)

  • Bugfix: lxml added as a requirement.

v0.3.0 (2013-08-08)

  • XML parsing (metadata, EML, …) now uses BeautifulSoup 4 instead of v3.

v0.2.1 (2013-08-02)

  • Added a property (core_terms) to DwCAReader to get a list the DwC terms in use in the core file.

v0.2.0 (2013-07-31)

  • Specific support for GBIF Data portal (occurrences) export.

  • Small bug fixes.

v0.1.1 (2013-05-28)

  • Fixes packaging issues.

v0.1.0 (2013-05-28)

  • Initial release.