Interaction with Pandas Package

Warning

You’ll need to install Pandas first.

Pandas is a powerful data analysis package that offers a large set of features, such as easy slicing, filtering, computing summary statistics, and plotting.

Python-dwca-reader exposes a pd_read() method to easily load the content of a data file (core or extension) from the archive into a Pandas DataFrame.

from dwca.read import DwCAReader

with DwCAReader('gbif-results.zip') as dwca:
    print("Core data file is: {}".format(dwca.descriptor.core.file_location)) # => 'occurrence.txt'

    core_df = dwca.pd_read('occurrence.txt', parse_dates=True)

    # All Pandas functionalities are now available on the core_df DataFrame

Note

DwCAReader.pd_read() is a simple wrapper around pandas.read_csv() and accepts the same optional arguments. A few of them (delimiter, skiprows, encoding, …) are ignored because DwCAReader sets them appropriately for the data file.
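
As an illustration of passing read_csv options through, here is a minimal sketch that calls pandas.read_csv() directly on a small in-memory sample, so it runs without an archive. The sample data and column selection are hypothetical. With pd_read() the delimiter would be set for you, so only arguments such as usecols and dtype would be passed along.

```python
import io

import pandas as pd

# Hypothetical tab-delimited sample standing in for an occurrence data file.
SAMPLE = (
    "id\tscientificName\tyear\tinstitutionCode\n"
    "1\tParus major\t1998\tINBO\n"
    "2\tParus major\t\tINBO\n"
    "3\tCyanistes caeruleus\t2004\tRBINS\n"
)

df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep="\t",                                   # set automatically when going through pd_read()
    usecols=["id", "scientificName", "year"],   # load only the columns that are needed
    dtype={"id": str},                          # keep identifiers as text
)

print(df)
```

Restricting the columns with usecols can reduce memory use noticeably for wide occurrence files.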

Note

Alternatively, you can do core_df = dwca.pd_read(dwca.core_file_location, …) which is handy if you don’t know the name of the core data file.

As a small example, here are a few operations on core_df:

Warning

You’ll need to install Seaborn for this example.

import pandas as pd
import seaborn as sns

# Number of records for each institutionCode
core_df["institutionCode"].value_counts()

# Select the coordinate information of the first twenty records
# (.loc slices by label and includes the end label, so .loc[:20] would return 21 rows)
core_df[["decimalLatitude", "decimalLongitude"]].head(20)

# Count the number of records with date information after 1950
(core_df["year"] > 1950).sum()

# Convert eventDate to DateTime python object
core_df['eventDate'] = pd.to_datetime(core_df['eventDate'])

# Select only those records with non-null, non-zero latitude and longitude and an eventDate
core_df[(core_df["decimalLatitude"] != 0.0) &
        (core_df["decimalLatitude"].notnull()) &
        (core_df["decimalLongitude"] != 0.0) &
        (core_df["decimalLongitude"].notnull()) &
        (core_df["eventDate"].notnull())]

# Count the number of records for each species for each month
count_occ = core_df.pivot_table(index="scientificName",
                                columns="month",
                                values="id",
                                aggfunc='count')
# Visualisation of the counts on a heatmap (Seaborn)
sns.heatmap(count_occ)

Figure: Counts per species for each month of the year

For more information about Pandas and Seaborn, see their respective documentation.
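
To make the pivot-table count above concrete, here is a minimal sketch on a hypothetical five-record table. The data are invented for illustration. The fill_value=0 argument is an addition not used in the example above: it replaces missing species/month combinations with 0 instead of NaN.

```python
import pandas as pd

# Hypothetical records: three for one species, two for another.
records = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "scientificName": [
        "Parus major", "Parus major", "Parus major",
        "Cyanistes caeruleus", "Cyanistes caeruleus",
    ],
    "month": [4, 4, 5, 5, 5],
})

# One row per species, one column per month, each cell counting the record ids.
count_occ = records.pivot_table(
    index="scientificName",
    columns="month",
    values="id",
    aggfunc="count",
    fill_value=0,
)

print(count_occ)
```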

When the DwCA contains multiple files, joining the extensions with the core file could be of interest for further analysis.

import pandas as pd
from dwca.read import DwCAReader

with DwCAReader('dwca-2extensions.zip') as dwca:
    # Check the core file of the Archive  (Occurrence, Taxon, ...)
    print("Core type is: {}".format(dwca.descriptor.core.type))

    # Check the available extensions
    print("Available extensions: {}".format([ext.split("/")[-1] for ext in dwca.descriptor.extensions_type]))

    taxon_df = dwca.pd_read('taxon.txt')
    descr_df = dwca.pd_read('description.txt')
    vern_df = dwca.pd_read('vernacularname.txt')

# Join the description and vernacularname extensions to the core taxon information
# (cf. a database JOIN)
taxon_df = pd.merge(taxon_df, descr_df, left_on='id', right_on='coreid', how="left")
taxon_df = pd.merge(taxon_df, vern_df, left_on='id', right_on='coreid', how="left")

The result is the core file joined with the extension files. More information about merging is provided in the Pandas documentation.
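
One thing to keep in mind with a left join: when an extension holds several rows for the same core record, that core record is repeated in the result. The sketch below shows this on two hypothetical, hand-built tables, using the same id/coreid key columns as above.

```python
import pandas as pd

# Hypothetical core table with two taxa.
taxon = pd.DataFrame({
    "id": [1, 2],
    "scientificName": ["Parus major", "Cyanistes caeruleus"],
})

# Hypothetical extension table: two vernacular names for taxon 1, none for taxon 2.
vern = pd.DataFrame({
    "coreid": [1, 1],
    "vernacularName": ["Great tit", "Koolmees"],
})

joined = pd.merge(taxon, vern, left_on="id", right_on="coreid", how="left")

# Taxon 1 appears twice (one row per vernacular name);
# taxon 2 is kept once, with a missing vernacularName.
print(joined)
```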

Note that reading the data into Pandas loads the entire file into memory. For large archives, this won’t be feasible. Pandas supports reading and processing the data in chunks. As an example, consider the selection of those occurrences for which the eventDate was a Sunday:

import pandas as pd
from dwca.read import DwCAReader

chunksize = 10 # Chosen chunksize to process the data (pick a larger value for real world cases)
with DwCAReader('gbif-results.zip') as dwca:
    sunday_occ = []
    for chunk in dwca.pd_read('occurrence.txt', chunksize=chunksize):
        chunk['eventDate'] = pd.to_datetime(chunk['eventDate'])

        # Subselect only the records recorded on a Sunday
        sunday_occ.append(chunk[chunk['eventDate'].dt.weekday == 6]) # Monday = 0, Sunday = 6

sunday_occ = pd.concat(sunday_occ)
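
Real eventDate values are not always single, well-formed dates (they may be empty or hold a date range), in which case pd.to_datetime() can raise an error. The following is a minimal sketch of the same chunked selection that tolerates such values. It reads a hypothetical in-memory sample with pandas.read_csv() so it runs without an archive, and it assumes day-precision ISO dates. Values that do not match the format become NaT and are simply not selected.

```python
import io

import pandas as pd

# Hypothetical tab-delimited sample; 2023-01-01 and 2023-01-08 are Sundays.
SAMPLE = (
    "id\teventDate\n"
    "1\t2023-01-01\n"
    "2\t2023-01-02/2023-01-05\n"   # a date range, not parseable with the format below
    "3\t2023-01-08\n"
    "4\t\n"                        # missing date
    "5\t2023-01-10\n"              # a Tuesday
)

sunday_parts = []
for chunk in pd.read_csv(io.StringIO(SAMPLE), sep="\t", chunksize=2):
    # errors="coerce" turns unparseable or missing values into NaT instead of raising
    chunk["eventDate"] = pd.to_datetime(
        chunk["eventDate"], format="%Y-%m-%d", errors="coerce"
    )
    # NaT never compares equal to 6, so those rows are dropped here
    sunday_parts.append(chunk[chunk["eventDate"].dt.weekday == 6])

sunday_occ = pd.concat(sunday_parts, ignore_index=True)
print(sunday_occ)
```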

More advanced processing is supported by Pandas. However, if you only want to count the occurrences that meet a specific condition, Pandas is not always required. As an example, counting the number of occurrences for each species in the data set is easily done with Python's collections.Counter:

from collections import Counter

from dwca.read import DwCAReader
from dwca.darwincore.utils import qualname as qn

with DwCAReader('gbif-results.zip') as dwca:
    count_species = Counter()

    for row in dwca:
        count_species.update([row.data[qn('scientificName')]])

    print(count_species)
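
Once the Counter is filled, its most_common() method gives a ranked summary. A minimal sketch, using a hypothetical list of names in place of the rows read from an archive:

```python
from collections import Counter

# Hypothetical scientific names, standing in for the values read from each row.
names = [
    "Parus major",
    "Cyanistes caeruleus",
    "Parus major",
    "Erithacus rubecula",
    "Parus major",
    "Cyanistes caeruleus",
]

count_species = Counter(names)

# The two most frequently recorded species, with their counts
top_two = count_species.most_common(2)
print(top_two)
```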

Hence, the added value of Pandas depends on the type of analysis. Some more extensive applications of Pandas to Darwin Core data are provided in this data cleaning tutorial and data analysis tutorial.