Quickstart#

This page demonstrates how to use intake-dataframe-catalog by building and using a very simple dataframe catalog comprising a small number of publicly available data sources:

import intake

Getting set up#

First, we open each of the data sources. Our goal is to create an intake dataframe catalog with these as the cataloged intake sources.

aws_cesm2_lens = intake.open_esm_datastore(
    "https://raw.githubusercontent.com/NCAR/cesm2-le-aws/main/intake-catalogs/aws-cesm2-le.json"
)
google_cmip6 = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)
aws_cmip6 = intake.open_esm_datastore(
    "https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json"
)
noaa_global_temp = intake.open_csv(
    "https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/1/4/1850-2023/data.csv",
    csv_kwargs={"skiprows": 4},
)

All of these data sources point to data that share some key attributes. For example, all sources contain timeseries of climate variables generated by a model at a particular temporal frequency. These shared metadata attributes are what we might consider including as columns in our intake dataframe catalog.

Initialising a dataframe catalog#

We’ll start by initialising an intake-dataframe-catalog object (intake_dataframe_catalog.core.DfFileCatalog). This can be done by instantiating the class directly, or by using the intake.open_df_catalog convenience function:

cat = intake.open_df_catalog(path="./example_catalog.csv", mode="w")

Adding sources#

We can add sources to the dataframe catalog using the .add method. This method takes the source to add and the metadata to associate with it. As a simple demonstration, we’ll add metadata about the model(s) and variable(s). The noaa_global_temp source is the easiest to add, since it contains only one model and one variable:

noaa_global_temp.name = "noaa_global_temp"

cat.add(
    noaa_global_temp,
    metadata={"model": "NOAAGlobalTemp", "variable": ["global_temp_anom"]}
)

cat

Intake dataframe catalog with 1 source(s) across 1 rows:

model variable
name
noaa_global_temp {NOAAGlobalTemp} {global_temp_anom}

For the intake-esm datastores, we’ll parse model and variable metadata from the datastore itself. The aws_cesm2_lens datastore comprises only one model:

aws_cesm2_lens.name = "aws_cesm2_lens"
aws_cesm2_lens_model = "CESM2-LENS"
aws_cesm2_lens_variables = list(
    set(
        aws_cesm2_lens.df.variable.unique().astype(str)
    )
)

cat.add(
    aws_cesm2_lens,
    metadata={"model": aws_cesm2_lens_model, "variable": aws_cesm2_lens_variables}
)

cat

Intake dataframe catalog with 2 source(s) across 2 rows:

model variable
name
aws_cesm2_lens {CESM2-LENS} {FSNSC, aice, O2, TREFMXAV, TREFHTMN, TREFHT, U, WTS, SNOW, SOILLIQ, FLNS, SHFLX, V, SALT, DOC, VNT, UET, TS, FLUT, H2OSNO, RAIN, FLNSC, SOILWATER_10CM, VNS, PD, PSL, T, Z3, NPP, ICEFRAC, WTT, FSN...
noaa_global_temp {NOAAGlobalTemp} {global_temp_anom}

Both CMIP6 datastores comprise multiple models. To keep track of which variables are available for which models, we must add an entry to our dataframe catalog for each model. To do this, we’ll write a simple function that finds the variables available for a given model:

def get_variables_for_model(datastore, model):
    """
    Returns a list of unique variables for a given model in a CMIP6 intake-esm datastore
    """
    return list(
        set(
            datastore.df[datastore.df.source_id == model].variable_id.unique().astype(str)
        )
    )
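
The helper above filters the datastore once per model. As a minimal, library-free sketch of the same idea, the snippet below groups variables by model in a single pass over a handful of made-up rows. The `rows` list and the `variables_by_model` function are illustrative assumptions, not part of intake-esm or intake-dataframe-catalog; the real datastores expose this information through their `.df` attribute.

```python
from collections import defaultdict

# Made-up rows mimicking the source_id and variable_id columns of a CMIP6 datastore
rows = [
    {"source_id": "CanESM5", "variable_id": "thetao"},
    {"source_id": "CanESM5", "variable_id": "tas"},
    {"source_id": "CanESM5", "variable_id": "tas"},  # duplicate on purpose
    {"source_id": "GFDL-ESM4", "variable_id": "fFire"},
]


def variables_by_model(rows):
    """Map each model to a sorted list of its unique variables."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["source_id"]].add(row["variable_id"])
    # Sorting gives a reproducible order, which plain sets do not guarantee
    return {model: sorted(variables) for model, variables in grouped.items()}


model_variables = variables_by_model(rows)
print(model_variables)
```

Sorting the variables is a design choice of this sketch: it keeps the metadata stable between runs, which can make a saved catalog easier to compare across rebuilds.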

Then we can add the google_cmip6 datastore to our dataframe catalog:

google_cmip6.name = "google_cmip6"

for model in google_cmip6.df.source_id.unique():
    variables = get_variables_for_model(google_cmip6, model)
    cat.add(
        google_cmip6,
        metadata={"model": model, "variable": variables}
    )

And the same for the aws_cmip6 datastore:

aws_cmip6.name = "aws_cmip6"

for model in aws_cmip6.df.source_id.unique():
    variables = get_variables_for_model(aws_cmip6, model)
    cat.add(
        aws_cmip6,
        metadata={"model": model, "variable": variables}
    )

Note that even though we added a separate row for each model in each CMIP6 datastore, displaying the dataframe catalog in a Jupyter environment still shows a convenient summary with only one row per source (this is the .df_summary property of cat):

cat

Intake dataframe catalog with 4 source(s) across 178 rows:

model variable
name
aws_cesm2_lens {CESM2-LENS} {FSNSC, aice, O2, TREFMXAV, TREFHTMN, TREFHT, U, WTS, SNOW, SOILLIQ, FLNS, SHFLX, V, SALT, DOC, VNT, UET, TS, FLUT, H2OSNO, RAIN, FLNSC, SOILWATER_10CM, VNS, PD, PSL, T, Z3, NPP, ICEFRAC, WTT, FSN...
aws_cmip6 {EC-Earth3-LR, TaiESM1, NESM3, EC-Earth3P-VHR, MRI-AGCM3-2-H, INM-CM5-0, FGOALS-f3-L, CMCC-CM2-SR5, EC-Earth3-Veg-LR, CESM2-FV2, EC-Earth3-CC, CMCC-ESM2, CNRM-CM6-1-HR, IPSL-CM6A-LR-INCA, CAS-ESM2... {tsl, wmo, psitem, nppLut, ponos, limnpico, phydiat, dryso2, fbddtdisi, emidust, calc, phydiazos, mrros, limndiat, dcalc, cheaqpso4, rsutcsaf, expsi, bsios, co3, raLut, drynh3, fracLut, co3satcalc...
google_cmip6 {EC-Earth3-LR, TaiESM1, NESM3, EC-Earth3P-VHR, MRI-AGCM3-2-H, INM-CM5-0, FGOALS-f3-L, CMCC-CM2-SR5, EC-Earth3-Veg-LR, CESM2-FV2, EC-Earth3-CC, CMCC-ESM2, CNRM-CM6-1-HR, IPSL-CM6A-LR-INCA, CAS-ESM2... {tsl, wmo, psitem, nppLut, ponos, limnpico, phydiat, dryso2, fbddtdisi, emidust, calc, phydiazos, mrros, limndiat, dcalc, cheaqpso4, rsutcsaf, expsi, bsios, co3, raLut, drynh3, fracLut, co3satcalc...
noaa_global_temp {NOAAGlobalTemp} {global_temp_anom}
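
To make the distinction between catalog rows and the summary concrete, here is a small standard-library sketch that collapses several rows per source into one summary row by taking the union of each metadata column. The `rows` data and the `summarise` helper are hypothetical and only illustrate the idea; they are not how `.df_summary` is implemented.

```python
# Hypothetical catalog rows: one row per (source, model) pair
rows = [
    {"name": "google_cmip6", "model": "CanESM5", "variable": ["thetao", "tas"]},
    {"name": "google_cmip6", "model": "GFDL-ESM4", "variable": ["fFire", "tas"]},
    {"name": "noaa_global_temp", "model": "NOAAGlobalTemp", "variable": ["global_temp_anom"]},
]


def summarise(rows):
    """Collapse rows into one entry per source, with a set of values per column."""
    summary = {}
    for row in rows:
        entry = summary.setdefault(row["name"], {"model": set(), "variable": set()})
        entry["model"].add(row["model"])
        entry["variable"].update(row["variable"])
    return summary


summary = summarise(rows)
for name, columns in summary.items():
    print(name, sorted(columns["model"]), sorted(columns["variable"]))
```

Here three rows collapse to two summary entries, mirroring the "4 source(s) across 178 rows" header shown above.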

Passing overwrite=True to .add will overwrite any existing source entries with the same name:

cat.add(
    aws_cesm2_lens,
    metadata={
        "model": aws_cesm2_lens_model,
        "variable": aws_cesm2_lens_variables,
    },
    overwrite=True,
)

cat

Intake dataframe catalog with 4 source(s) across 178 rows:

model variable
name
aws_cesm2_lens {CESM2-LENS} {FSNSC, aice, O2, TREFMXAV, TREFHTMN, TREFHT, U, WTS, SNOW, SOILLIQ, FLNS, SHFLX, V, SALT, DOC, VNT, UET, TS, FLUT, H2OSNO, RAIN, FLNSC, SOILWATER_10CM, VNS, PD, PSL, T, Z3, NPP, ICEFRAC, WTT, FSN...
aws_cmip6 {EC-Earth3-LR, TaiESM1, NESM3, EC-Earth3P-VHR, MRI-AGCM3-2-H, INM-CM5-0, FGOALS-f3-L, CMCC-CM2-SR5, EC-Earth3-Veg-LR, CESM2-FV2, EC-Earth3-CC, CMCC-ESM2, CNRM-CM6-1-HR, IPSL-CM6A-LR-INCA, CAS-ESM2... {tsl, wmo, psitem, nppLut, ponos, limnpico, phydiat, dryso2, fbddtdisi, emidust, calc, phydiazos, mrros, limndiat, dcalc, cheaqpso4, rsutcsaf, expsi, bsios, co3, raLut, drynh3, fracLut, co3satcalc...
google_cmip6 {EC-Earth3-LR, TaiESM1, NESM3, EC-Earth3P-VHR, MRI-AGCM3-2-H, INM-CM5-0, FGOALS-f3-L, CMCC-CM2-SR5, EC-Earth3-Veg-LR, CESM2-FV2, EC-Earth3-CC, CMCC-ESM2, CNRM-CM6-1-HR, IPSL-CM6A-LR-INCA, CAS-ESM2... {tsl, wmo, psitem, nppLut, ponos, limnpico, phydiat, dryso2, fbddtdisi, emidust, calc, phydiazos, mrros, limndiat, dcalc, cheaqpso4, rsutcsaf, expsi, bsios, co3, raLut, drynh3, fracLut, co3satcalc...
noaa_global_temp {NOAAGlobalTemp} {global_temp_anom}

Saving a dataframe catalog#

Once we’re happy with the sources we have in our dataframe catalog, we can save it using the .save method:

cat.save()

Loading a dataframe catalog#

When reading existing catalogs, it’s good practice to use mode="r" (the default) to avoid accidentally overwriting the catalog:

cat = intake.open_df_catalog(
    path="./example_catalog.csv",
    columns_with_iterables="variable",
)
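
The columns_with_iterables argument matters because a CSV file stores every cell as text, so a list of variables is written out as a string. The sketch below uses only the standard library to show that round trip. Parsing with `ast.literal_eval` is an assumption about one reasonable way to recover the list, not a statement about the library's internals.

```python
import ast
import csv
import io

# Write one catalog-like row whose "variable" cell holds a Python list
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "model", "variable"])
writer.writeheader()
writer.writerow(
    {"name": "aws_cmip6", "model": "CanESM5", "variable": ["thetao", "msftmzmpa"]}
)

# Read it back: every cell is now a plain string
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_cell = row["variable"]
print(type(raw_cell).__name__, raw_cell)

# Convert the string representation back into a real list
parsed = ast.literal_eval(raw_cell)
print(type(parsed).__name__, parsed)
```

Without a conversion step like this, a query for a single variable would be compared against the whole string rather than against the individual list items.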

Searching in a dataframe catalog#

We can use the .interactive attribute to open an interactive search interface in a Jupyter environment. This allows us to explore the dataframe catalog interactively, filtering by the metadata columns we added earlier. However, it will not save our searches.

Warning: The interactive search widget relies on your browser’s JavaScript engine, so it may not work in all browsers. If the widget renders poorly, try a different browser. In particular, we have noticed issues in Firefox, but not in Chrome or Safari.

cat.interactive

We can use the .search method to find sources that satisfy metadata queries:

new_cat = cat.search(model="CanESM5")

new_cat

Intake dataframe catalog with 2 source(s) across 2 rows:

model variable
name
aws_cmip6 {CanESM5} {clw, co3, sftlf, od550aer, siconca, deptho, cct, cRoot, tauv, pon, opottemppmdiff, rlds, tasmax, tauvo, rsntds, vo, epfz, psitem, tasmin, sci, agessc, limirrmisc, osaltdiff, utendwtem, osaltpadve...
google_cmip6 {CanESM5} {clw, co3, sftlf, od550aer, siconca, deptho, cct, cRoot, tauv, pon, opottemppmdiff, rlds, tasmax, tauvo, rsntds, vo, epfz, psitem, tasmin, sci, agessc, limirrmisc, osaltdiff, utendwtem, osaltpadve...

We can combine queries for more complex searches:

new_cat = cat.search(model="CanESM5", variable=["thetao", "msftmzmpa"])

new_cat

Intake dataframe catalog with 2 source(s) across 2 rows:

model variable
name
aws_cmip6 {CanESM5} {thetao, msftmzmpa}
google_cmip6 {CanESM5} {thetao}

By default, querying on a list as above returns sources that match on any of the values in the list. The .search method also has an optional require_all argument. If this is set to True, returned sources satisfy all the query criteria:

new_cat = cat.search(model="CanESM5", variable=["thetao", "msftmzmpa"], require_all=True)

new_cat

Intake dataframe catalog with 1 source(s) across 1 rows:

model variable
name
aws_cmip6 {CanESM5} {thetao, msftmzmpa}
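
The difference between the default behaviour and require_all=True can be pictured with sets. The following is a toy sketch, assuming made-up data that mirrors the CanESM5 results above; `toy_search` and `rows` are hypothetical names and do not reflect how the .search method is implemented.

```python
# Hypothetical rows: the variables each source offers for model "CanESM5"
rows = {
    "aws_cmip6": {"thetao", "msftmzmpa", "tas"},
    "google_cmip6": {"thetao", "tas"},
}


def toy_search(rows, wanted, require_all=False):
    """Return the sources matching any (default) or all of the wanted variables."""
    wanted = set(wanted)
    matches = {}
    for name, available in rows.items():
        found = available & wanted
        # require_all demands every wanted value; otherwise one hit is enough
        keep = (found == wanted) if require_all else bool(found)
        if keep:
            matches[name] = found
    return matches


any_match = toy_search(rows, ["thetao", "msftmzmpa"])
all_match = toy_search(rows, ["thetao", "msftmzmpa"], require_all=True)
print(sorted(any_match))
print(sorted(all_match))
```

In this sketch, as in the outputs above, each returned entry lists only the values that matched the query rather than every variable the source holds.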

Regular expressions can also be used in queries. For example, below we search for sources with variables containing the word “Fire”. We can see that only one model (GFDL-ESM4) in each of the CMIP6 datastores contains variables matching this criterion:

new_cat = cat.search(variable=".*Fire.*")

new_cat

Intake dataframe catalog with 2 source(s) across 2 rows:

model variable
name
aws_cmip6 {GFDL-ESM4} {fFire, fFireNat}
google_cmip6 {GFDL-ESM4} {fFire, fFireNat}
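
To check what a pattern will match before running it against a large catalog, it can help to try it on a few names with Python's `re` module. The sketch below does this for a made-up list of variable names. It demonstrates the behaviour of `re` only; how the .search method applies patterns internally (for example, its case sensitivity) is not assumed here.

```python
import re

# Made-up variable names; "fire_area" is invented to show case sensitivity
variables = ["fFire", "fFireNat", "thetao", "tas", "fire_area"]

pattern = re.compile(".*Fire.*")
matched = [name for name in variables if pattern.search(name)]
print(matched)

# Python's re is case-sensitive unless told otherwise
case_insensitive = re.compile(".*Fire.*", flags=re.IGNORECASE)
matched_any_case = [name for name in variables if case_insensitive.search(name)]
print(matched_any_case)
```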

Loading sources#

There are a few options for loading sources. We can load individual sources if we know their name:

cat["aws_cesm2_lens"]  # This is the aws_cesm2_lens intake-esm datastore

aws-cesm2-le catalog with 40 dataset(s) from 322 asset(s):

unique
variable 53
long_name 51
component 4
experiment 2
forcing_variant 2
frequency 3
vertical_levels 3
spatial_domain 3
units 20
start_time 4
end_time 7
path 313
derived_variable 0

Or (if the source name comprises only letters, numbers and underscores):

cat.aws_cesm2_lens

aws-cesm2-le catalog with 40 dataset(s) from 322 asset(s):

unique
variable 53
long_name 51
component 4
experiment 2
forcing_variant 2
frequency 3
vertical_levels 3
spatial_domain 3
units 20
start_time 4
end_time 7
path 313
derived_variable 0

Alternatively, there are .to_source and .to_source_dict methods. The former only works when there is only one source remaining in the dataframe catalog (e.g. after performing .search operations). The latter loads all sources into a dictionary with the corresponding source names as keys:

source = cat.search(variable="TEMP").to_source()

source

aws-cesm2-le catalog with 40 dataset(s) from 322 asset(s):

unique
variable 53
long_name 51
component 4
experiment 2
forcing_variant 2
frequency 3
vertical_levels 3
spatial_domain 3
units 20
start_time 4
end_time 7
path 313
derived_variable 0

source_dict = cat.to_source_dict()

source_dict

{'aws_cesm2_lens': <aws-cesm2-le catalog with 40 dataset(s) from 322 asset(s)>,
 'google_cmip6': <pangeo-cmip6 catalog with 7674 dataset(s) from 514818 asset(s)>,
 'noaa_global_temp': sources:
   csv:
     args:
       csv_kwargs:
         skiprows: 4
       urlpath: https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/1/4/1850-2023/data.csv
     description: ''
     driver: intake.source.csv.CSVSource
     metadata:
       catalog_dir: '',
 'aws_cmip6': <pangeo-cmip6 catalog with 7780 dataset(s) from 522217 asset(s)>}

When the sources in the dataframe catalog are intake-esm datastores, it is common to want to execute the same query on both the dataframe catalog and on the resulting source(s). This can be done with the pass_query argument when loading source(s) with to_source or to_source_dict. Setting pass_query=True will pass the most recent query provided to cat.search on to the .search method of the source(s). An exception will be thrown if the source(s) do not have a .search method, or if the query is not valid for the sources’ .search method.

cat.search(variable="TEMP").to_source(pass_query=True)

aws-cesm2-le catalog with 3 dataset(s) from 3 asset(s):

unique
variable 1
long_name 1
component 1
experiment 2
forcing_variant 2
frequency 1
vertical_levels 1
spatial_domain 1
units 1
start_time 2
end_time 2
path 3
derived_variable 0

You can see that this has returned an intake-esm datastore that has been filtered to only 3 datasets based on the same query applied to cat. With pass_query=False, the full unfiltered intake-esm datastore is returned comprising 40 datasets (see two cells above).
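
The forwarding behaviour of pass_query can be pictured with two toy classes. This is a sketch under stated assumptions: `ToySource` and `ToyCatalog` are hypothetical stand-ins written for illustration, and they only mimic the idea of remembering the most recent query and passing it to the source's own search method.

```python
class ToySource:
    """Hypothetical stand-in for a searchable source such as an intake-esm datastore."""

    def __init__(self, records):
        self.records = records

    def search(self, **query):
        # Keep only records whose fields equal every query value
        kept = [
            record
            for record in self.records
            if all(record.get(key) == value for key, value in query.items())
        ]
        return ToySource(kept)


class ToyCatalog:
    """Hypothetical stand-in for a dataframe catalog holding a single source."""

    def __init__(self, source, variables, last_query=None):
        self.source = source
        self.variables = variables
        self.last_query = last_query or {}

    def search(self, **query):
        # Toy catalog-level check (this sketch only understands "variable")
        if query.get("variable") not in self.variables:
            raise ValueError("no source matches the query")
        return ToyCatalog(self.source, self.variables, last_query=query)

    def to_source(self, pass_query=False):
        # Forward the remembered query to the source only when asked to
        return self.source.search(**self.last_query) if pass_query else self.source


toy_source = ToySource([{"variable": "TEMP"}, {"variable": "SALT"}, {"variable": "TEMP"}])
toy_cat = ToyCatalog(toy_source, variables={"TEMP", "SALT"})

full = toy_cat.search(variable="TEMP").to_source()
filtered = toy_cat.search(variable="TEMP").to_source(pass_query=True)
print(len(full.records), len(filtered.records))
```

The sketch also hints at why a forwarded query can fail: the keys used in the catalog query must be ones the source's own search understands.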

Once sources are loaded, we can access data in the normal way for that intake source type. For example, see the intake-esm documentation for how to use intake-esm datastores like the ones we’ve been using in this demonstration.