Tutorial
========

Example uses
------------

Basic use, access to metadata and data from the Core file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    from dwca.read import DwCAReader
    from dwca.darwincore.utils import qualname as qn

    # Let's open our archive...
    # Using the with statement ensures that resources will be properly freed/cleaned after use.
    with DwCAReader('my-archive.zip') as dwca:
        # We can now interact with the 'dwca' object

        # We can read scientific metadata (EML) through an xml.etree.ElementTree.Element object in the 'metadata'
        # attribute.
        dwca.metadata

        # The 'descriptor' attribute gives access to the Archive Descriptor (meta.xml) and allows
        # inspecting the archive:
        # For example, discover what type the Core file is: (Occurrence, Taxon, ...)
        print("Core type is: %s" % dwca.descriptor.core.type)
        # => Core type is: http://rs.tdwg.org/dwc/terms/Occurrence

        # Check if a Darwin Core term is present in the core file
        if 'http://rs.tdwg.org/dwc/terms/locality' in dwca.descriptor.core.terms:
            print("This archive contains the 'locality' term in its core file.")
        else:
            print("Locality term is not present.")

        # Using full qualnames for Darwin Core terms (such as 'http://rs.tdwg.org/dwc/terms/country') is verbose...
        # The qualname() helper function makes life easier for common terms.
        # (here, it has been imported as 'qn'):
        qn('locality')
        # => u'http://rs.tdwg.org/dwc/terms/locality'

        # Combined with the previous examples, this can be used to make things clearer:
        # For example:
        if qn('locality') in dwca.descriptor.core.terms:
            pass

        # Or:
        if dwca.descriptor.core.type == qn('Occurrence'):
            pass

        # Finally, let's iterate over the archive core rows and get the data:
        for row in dwca:
            # row is an instance of CoreRow
            # iteration respects the rows' order of appearance in the core file

            # print() can be used for debugging purposes...
            print(row)

            # => --
            # => Rowtype: http://rs.tdwg.org/dwc/terms/Occurrence
            # => Source: Core file
            # => Row ID:
            # => Data: {u'http://rs.tdwg.org/dwc/terms/basisOfRecord': u'Observation',
            # => u'http://rs.tdwg.org/dwc/terms/family': u'Tetraodontidae',
            # => u'http://rs.tdwg.org/dwc/terms/locality': u'Borneo',
            # => u'http://rs.tdwg.org/dwc/terms/scientificName': u'tetraodon fluviatilis'}
            # => --

            # You can get the value of a specific Darwin Core term through
            # the "data" dict:
            print("Value of 'locality' for this row: %s" % row.data[qn('locality')])
            # => Value of 'locality' for this row: Borneo

        # Alternatively, we can get a list of core rows instead of iterating:
        # BEWARE: all rows will be loaded in memory!
        rows = dwca.rows

        # Or retrieve a specific row by its id:
        occurrence_number_three = dwca.get_corerow_by_id(3)

        # Caution: ids are generally a fragile way to identify a core row in an archive, since the standard doesn't
        # guarantee uniqueness (nor even that there will be an id). The index (position) of the row (starting at 0) is
        # generally preferable.
        occurrence_on_second_line = dwca.get_row_by_index(1)

        # We can retrieve the absolute path of embedded files
        # NOTE: this path points to a temporary directory that will be removed at the end of the DwCAReader object's
        # life cycle.
        path = dwca.absolute_temporary_path('occurrence.txt')

Access to Darwin Core Archives with extensions (star schema)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    from dwca.read import DwCAReader

    with DwCAReader('archive_with_vernacularnames_extension.zip') as dwca:
        # Let's ask the archive what kind of extensions are in use:
        for e in dwca.descriptor.extensions:
            print(e.type)
        # => http://rs.gbif.org/terms/1.0/VernacularName

        first_core_row = dwca.rows[0]

        # Extension rows are accessible from a core row as a list of ExtensionRow instances:
        for extension_line in first_core_row.extensions:
            # Display all rows from extension files referring to the first Core row
            print(extension_line)

Another example with multiple extensions (no new API here)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    from dwca.read import DwCAReader

    with DwCAReader('multiext_archive.zip') as dwca:
        rows = dwca.rows
        ostrich = rows[0]

        print("You'll find below all extension rows referring to Ostrich")
        print("There should be 3 vernacular names and 2 taxon descriptions")
        for ext in ostrich.extensions:
            print(ext)

        print("We can then simply filter by type...")
        for ext in ostrich.extensions:
            if ext.rowtype == 'http://rs.gbif.org/terms/1.0/VernacularName':
                print(ext)

GBIF Downloads
~~~~~~~~~~~~~~

The GBIF website allows visitors to export occurrences as a Darwin Core Archive. The resulting file contains a few
additions that are not part of the `Darwin Core Archive`_ standard. These additions also work with
python-dwca-reader. See :doc:`gbif_results` for explanations of the file format and how to use it.

.. _Darwin Core Archive: http://en.wikipedia.org/wiki/Darwin_Core_Archive

Data analysis and manipulation with Pandas
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Python-dwca-reader provides specific tools to make working with Pandas easier; see :doc:`pandas_tutorial` for
concrete examples.
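
Building a small archive to experiment with
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The examples on this page assume that an archive such as ``my-archive.zip`` is already available. If you have none at
hand, the sketch below builds a tiny one using only the Python standard library. It then reads the core row type
straight from ``meta.xml``, which is the same information ``dwca.descriptor.core.type`` exposes. The helper names
``build_sample_archive`` and ``core_row_type`` and the sample data are illustrative assumptions, not part of
python-dwca-reader. The descriptor layout follows the usual Darwin Core Archive conventions, but this sketch has not
been validated against every reader.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
import zipfile

# Archive descriptor: one tab-separated core file of type Occurrence.
# Raw string, so that "\t" and "\n" reach the XML as literal escape sequences.
META_XML = r"""<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n"
        fieldsEnclosedBy="" ignoreHeaderLines="1"
        rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/locality"/>
  </core>
</archive>
"""

# Core data file: a header line followed by two made-up occurrence rows.
OCCURRENCES = (
    "id\tscientificName\tlocality\n"
    "1\ttetraodon fluviatilis\tBorneo\n"
    "2\ttetraodon fluviatilis\tMumbai\n"
)


def build_sample_archive(path):
    """Write a minimal archive (hypothetical helper) and return its path."""
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("meta.xml", META_XML)
        zf.writestr("occurrence.txt", OCCURRENCES)
    return path


def core_row_type(path):
    """Read the core rowType from meta.xml without python-dwca-reader."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("meta.xml"))
    core = root.find("{http://rs.tdwg.org/dwc/text/}core")
    return core.get("rowType")


archive_path = build_sample_archive(
    os.path.join(tempfile.mkdtemp(), "my-archive.zip")
)
print(core_row_type(archive_path))
# => http://rs.tdwg.org/dwc/terms/Occurrence
```

The resulting file can then be passed to ``DwCAReader`` in place of ``'my-archive.zip'`` in the first example.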