Quickstart#
This page demonstrates how to use intake-dataframe-catalog by building and using a very simple dataframe catalog comprising a small number of publicly available data sources:
An intake-esm datastore for the Community Earth System Model Large Ensemble (CESM LENS) data hosted on AWS by NCAR
An intake-esm datastore for the Coupled Model Intercomparison Project 6 data hosted on AWS by Pangeo
An intake-esm datastore for the Coupled Model Intercomparison Project 6 data hosted on Google Cloud by Pangeo
A CSV file of global annual average temperatures provided by NOAA
import intake
Getting set up#
First, we open each of the data sources. Our goal is to create an intake dataframe catalog with these as the cataloged intake sources.
aws_cesm2_lens = intake.open_esm_datastore(
"https://raw.githubusercontent.com/NCAR/cesm2-le-aws/main/intake-catalogs/aws-cesm2-le.json"
)
google_cmip6 = intake.open_esm_datastore(
"https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)
aws_cmip6 = intake.open_esm_datastore(
"https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json"
)
noaa_global_temp = intake.open_csv(
"https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/1/4/1850-2023/data.csv",
csv_kwargs={"skiprows": 4},
)
All of these data sources point to data that share some key attributes. For example, all sources contain timeseries of climate variables generated by a model at a particular temporal frequency. These shared metadata attributes are what we might consider including as columns in our intake dataframe catalog.
Initialising a dataframe catalog#
We’ll start by initialising an intake-dataframe-catalog object (intake_dataframe_catalog.core.DfFileCatalog). This can be done by initialising the class directly, or by using the intake.open_df_catalog convenience function:
cat = intake.open_df_catalog(path="./example_catalog.csv", mode="w")
Adding sources#
We can add sources to the dataframe catalog using the .add method. This method takes as arguments the source to add and the metadata to associate with it. As a simple demonstration, we’ll add metadata about the model(s) and variable(s). The noaa_global_temp source is the easiest to add, since it contains only one model and one variable:
noaa_global_temp.name = "noaa_global_temp"
cat.add(
noaa_global_temp,
metadata={"model": "NOAAGlobalTemp", "variable": ["global_temp_anom"]}
)
cat
Intake dataframe catalog with 1 source(s) across 1 rows:
| name | model | variable |
|---|---|---|
| noaa_global_temp | {NOAAGlobalTemp} | {global_temp_anom} |
For the intake-esm datastores, we’ll parse model and variable metadata from the datastore itself. The aws_cesm2_lens datastore comprises only one model:
aws_cesm2_lens.name = "aws_cesm2_lens"
aws_cesm2_lens_model = "CESM2-LENS"
aws_cesm2_lens_variables = list(
set(
aws_cesm2_lens.df.variable.unique().astype(str)
)
)
cat.add(
aws_cesm2_lens,
metadata={"model": aws_cesm2_lens_model, "variable": aws_cesm2_lens_variables}
)
cat
Intake dataframe catalog with 2 source(s) across 2 rows:
| name | model | variable |
|---|---|---|
| aws_cesm2_lens | {CESM2-LENS} | {FSNSC, aice, O2, TREFMXAV, TREFHTMN, TREFHT, U, WTS, SNOW, SOILLIQ, FLNS, SHFLX, V, SALT, DOC, VNT, UET, TS, FLUT, H2OSNO, RAIN, FLNSC, SOILWATER_10CM, VNS, PD, PSL, T, Z3, NPP, ICEFRAC, WTT, FSN... |
| noaa_global_temp | {NOAAGlobalTemp} | {global_temp_anom} |
Both the CMIP6 datastores comprise multiple models. In order to keep track of which variables are available for which models we must add an entry in our dataframe catalog for each model. To do this, we’ll write a simple function for finding which variables are available for a given model:
def get_variables_for_model(datastore, model):
    """
    Returns a list of unique variables for a given model in a CMIP6 intake-esm datastore
    """
    return list(
        set(
            datastore.df[datastore.df.source_id == model].variable_id.unique().astype(str)
        )
    )
Then we can add the google_cmip6 datastore to our dataframe catalog:
google_cmip6.name = "google_cmip6"
for model in google_cmip6.df.source_id.unique():
    variables = get_variables_for_model(google_cmip6, model)
    cat.add(
        google_cmip6,
        metadata={"model": model, "variable": variables}
    )
And the same for the aws_cmip6 datastore:
aws_cmip6.name = "aws_cmip6"
for model in aws_cmip6.df.source_id.unique():
    variables = get_variables_for_model(aws_cmip6, model)
    cat.add(
        aws_cmip6,
        metadata={"model": model, "variable": variables}
    )
Note that even though we added a separate row for each model in each CMIP6 datastore, we still see a convenient summary with only one row per source when we display the dataframe catalog in a Jupyter environment (this displays the .df_summary property of cat):
cat
Intake dataframe catalog with 4 source(s) across 178 rows:
| name | model | variable |
|---|---|---|
| aws_cesm2_lens | {CESM2-LENS} | {FSNSC, aice, O2, TREFMXAV, TREFHTMN, TREFHT, U, WTS, SNOW, SOILLIQ, FLNS, SHFLX, V, SALT, DOC, VNT, UET, TS, FLUT, H2OSNO, RAIN, FLNSC, SOILWATER_10CM, VNS, PD, PSL, T, Z3, NPP, ICEFRAC, WTT, FSN... |
| aws_cmip6 | {EC-Earth3-LR, TaiESM1, NESM3, EC-Earth3P-VHR, MRI-AGCM3-2-H, INM-CM5-0, FGOALS-f3-L, CMCC-CM2-SR5, EC-Earth3-Veg-LR, CESM2-FV2, EC-Earth3-CC, CMCC-ESM2, CNRM-CM6-1-HR, IPSL-CM6A-LR-INCA, CAS-ESM2... | {tsl, wmo, psitem, nppLut, ponos, limnpico, phydiat, dryso2, fbddtdisi, emidust, calc, phydiazos, mrros, limndiat, dcalc, cheaqpso4, rsutcsaf, expsi, bsios, co3, raLut, drynh3, fracLut, co3satcalc... |
| google_cmip6 | {EC-Earth3-LR, TaiESM1, NESM3, EC-Earth3P-VHR, MRI-AGCM3-2-H, INM-CM5-0, FGOALS-f3-L, CMCC-CM2-SR5, EC-Earth3-Veg-LR, CESM2-FV2, EC-Earth3-CC, CMCC-ESM2, CNRM-CM6-1-HR, IPSL-CM6A-LR-INCA, CAS-ESM2... | {tsl, wmo, psitem, nppLut, ponos, limnpico, phydiat, dryso2, fbddtdisi, emidust, calc, phydiazos, mrros, limndiat, dcalc, cheaqpso4, rsutcsaf, expsi, bsios, co3, raLut, drynh3, fracLut, co3satcalc... |
| noaa_global_temp | {NOAAGlobalTemp} | {global_temp_anom} |
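The relationship between the per-model rows and the one-row-per-source summary can be pictured with a small plain-Python sketch. This is not the library's implementation: the rows below and the `summarise` helper are hypothetical, and only illustrate how many rows can collapse into one entry per source name, with the unique values collected into sets.

```python
from collections import defaultdict

# Hypothetical rows, one per (source, model) pair, mimicking the catalog layout
rows = [
    {"name": "aws_cmip6", "model": "CanESM5", "variable": ["thetao", "msftmzmpa"]},
    {"name": "aws_cmip6", "model": "GFDL-ESM4", "variable": ["fFire", "thetao"]},
    {"name": "noaa_global_temp", "model": "NOAAGlobalTemp", "variable": ["global_temp_anom"]},
]


def summarise(rows):
    """Collapse rows to one entry per source name, collecting unique values as sets."""
    summary = defaultdict(lambda: {"model": set(), "variable": set()})
    for row in rows:
        summary[row["name"]]["model"].add(row["model"])
        summary[row["name"]]["variable"].update(row["variable"])
    return dict(summary)


summary = summarise(rows)
print(f"{len(summary)} source(s) across {len(rows)} rows")
```

The full set of rows stays available for searching; only the displayed view is condensed.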
Passing overwrite=True to .add will overwrite any existing source entries with the same name:
cat.add(
aws_cesm2_lens,
metadata={
"model": aws_cesm2_lens_model,
"variable": aws_cesm2_lens_variables
},
overwrite=True
)
cat
Intake dataframe catalog with 4 source(s) across 178 rows:
| name | model | variable |
|---|---|---|
| aws_cesm2_lens | {CESM2-LENS} | {FSNSC, aice, O2, TREFMXAV, TREFHTMN, TREFHT, U, WTS, SNOW, SOILLIQ, FLNS, SHFLX, V, SALT, DOC, VNT, UET, TS, FLUT, H2OSNO, RAIN, FLNSC, SOILWATER_10CM, VNS, PD, PSL, T, Z3, NPP, ICEFRAC, WTT, FSN... |
| aws_cmip6 | {EC-Earth3-LR, TaiESM1, NESM3, EC-Earth3P-VHR, MRI-AGCM3-2-H, INM-CM5-0, FGOALS-f3-L, CMCC-CM2-SR5, EC-Earth3-Veg-LR, CESM2-FV2, EC-Earth3-CC, CMCC-ESM2, CNRM-CM6-1-HR, IPSL-CM6A-LR-INCA, CAS-ESM2... | {tsl, wmo, psitem, nppLut, ponos, limnpico, phydiat, dryso2, fbddtdisi, emidust, calc, phydiazos, mrros, limndiat, dcalc, cheaqpso4, rsutcsaf, expsi, bsios, co3, raLut, drynh3, fracLut, co3satcalc... |
| google_cmip6 | {EC-Earth3-LR, TaiESM1, NESM3, EC-Earth3P-VHR, MRI-AGCM3-2-H, INM-CM5-0, FGOALS-f3-L, CMCC-CM2-SR5, EC-Earth3-Veg-LR, CESM2-FV2, EC-Earth3-CC, CMCC-ESM2, CNRM-CM6-1-HR, IPSL-CM6A-LR-INCA, CAS-ESM2... | {tsl, wmo, psitem, nppLut, ponos, limnpico, phydiat, dryso2, fbddtdisi, emidust, calc, phydiazos, mrros, limndiat, dcalc, cheaqpso4, rsutcsaf, expsi, bsios, co3, raLut, drynh3, fracLut, co3satcalc... |
| noaa_global_temp | {NOAAGlobalTemp} | {global_temp_anom} |
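To make the effect of overwrite=True concrete, here is a minimal sketch in plain Python. It assumes, as the sentence above suggests, that overwriting removes every existing row that shares the source name before the new row is added. The `add_row` helper and its rows are hypothetical and are not part of intake-dataframe-catalog.

```python
def add_row(rows, name, metadata, overwrite=False):
    """Return a new list of rows with one row added for `name`.

    With overwrite=True, all existing rows for `name` are dropped first
    (assumed behaviour, see the lead-in).
    """
    if overwrite:
        rows = [row for row in rows if row["name"] != name]
    return rows + [{"name": name, **metadata}]


rows = []
rows = add_row(rows, "google_cmip6", {"model": "CanESM5", "variable": ["thetao"]})
rows = add_row(rows, "google_cmip6", {"model": "GFDL-ESM4", "variable": ["fFire"]})
rows_before = len(rows)  # two rows share the same source name

# Overwriting replaces both earlier rows with the single new one
rows = add_row(
    rows,
    "google_cmip6",
    {"model": "CanESM5", "variable": ["thetao", "tas"]},
    overwrite=True,
)
rows_after = len(rows)
```

If the real method behaves this way, overwriting a source that spans several rows (such as the CMIP6 datastores) would need every model row to be added again afterwards.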
Saving a dataframe catalog#
Once we’re happy with the sources we have in our dataframe catalog, we can save it using the .save method:
cat.save()
Loading a dataframe catalog#
When reading existing catalogs, it’s good practice to use mode="r" (the default) to avoid accidentally overwriting the catalog:
cat = intake.open_df_catalog(
path="./example_catalog.csv",
columns_with_iterables="variable",
)
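The columns_with_iterables argument names the columns whose cells hold iterables such as lists. The sketch below shows why a reader needs that hint, using only the standard library: a list written to a CSV cell comes back as a plain string and has to be parsed. Parsing with `ast.literal_eval` is an assumption made for illustration, and the row shown is hypothetical.

```python
import ast
import csv
import io

# Write one hypothetical catalog row to an in-memory CSV file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "model", "variable"])
writer.writeheader()
writer.writerow(
    {"name": "aws_cmip6", "model": "CanESM5", "variable": ["thetao", "msftmzmpa"]}
)

# Reading the row back returns the list as text, not as a list
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["variable"]

# Parsing the text recovers the original list
parsed = ast.literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__)
```

Without this step, a query on an individual variable would be compared against the whole string rather than against each element.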
Searching in a dataframe catalog#
We can use the .interactive attribute to open an interactive search interface in a Jupyter environment. This allows us to explore the dataframe catalog interactively, filtering by the metadata columns we added earlier. However, it will not save our searches.
Warning: The interactive search widget relies on your browser’s JavaScript engine, so it may not work in all browsers. If the interactive search renders poorly, try a different browser. In particular, we have noticed issues in Firefox, but not in Chrome or Safari.
cat.interactive
We can use the .search method to find sources that satisfy metadata queries:
new_cat = cat.search(model="CanESM5")
new_cat
Intake dataframe catalog with 2 source(s) across 2 rows:
| name | model | variable |
|---|---|---|
| aws_cmip6 | {CanESM5} | {clw, co3, sftlf, od550aer, siconca, deptho, cct, cRoot, tauv, pon, opottemppmdiff, rlds, tasmax, tauvo, rsntds, vo, epfz, psitem, tasmin, sci, agessc, limirrmisc, osaltdiff, utendwtem, osaltpadve... |
| google_cmip6 | {CanESM5} | {clw, co3, sftlf, od550aer, siconca, deptho, cct, cRoot, tauv, pon, opottemppmdiff, rlds, tasmax, tauvo, rsntds, vo, epfz, psitem, tasmin, sci, agessc, limirrmisc, osaltdiff, utendwtem, osaltpadve... |
We can combine queries for more complex searches:
new_cat = cat.search(model="CanESM5", variable=["thetao", "msftmzmpa"])
new_cat
Intake dataframe catalog with 2 source(s) across 2 rows:
| name | model | variable |
|---|---|---|
| aws_cmip6 | {CanESM5} | {thetao, msftmzmpa} |
| google_cmip6 | {CanESM5} | {thetao} |
By default, querying on a list as above returns sources that match any of the values in the list. The .search method also has an optional require_all argument. If this is set to True, only sources that satisfy all of the query criteria are returned:
new_cat = cat.search(model="CanESM5", variable=["thetao", "msftmzmpa"], require_all=True)
new_cat
Intake dataframe catalog with 1 source(s) across 1 rows:
| name | model | variable |
|---|---|---|
| aws_cmip6 | {CanESM5} | {thetao, msftmzmpa} |
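The difference between the default behaviour and require_all=True comes down to "any" versus "all". The following sketch is a plain-Python illustration of that logic, not the library's code. The `matches` helper is hypothetical, and the variable sets are reduced to the two values relevant to the query above.

```python
def matches(source_values, query_values, require_all=False):
    """Return True if a source's values satisfy the query values."""
    hits = [value in source_values for value in query_values]
    return all(hits) if require_all else any(hits)


# Simplified, hypothetical variable sets for CanESM5 in each datastore
aws_variables = {"thetao", "msftmzmpa"}
google_variables = {"thetao"}
query = ["thetao", "msftmzmpa"]

any_match = {
    "aws_cmip6": matches(aws_variables, query),
    "google_cmip6": matches(google_variables, query),
}
all_match = {
    "aws_cmip6": matches(aws_variables, query, require_all=True),
    "google_cmip6": matches(google_variables, query, require_all=True),
}
```

Here `google_cmip6` passes the "any" test because it holds `thetao`, but fails the "all" test because it lacks `msftmzmpa`, which mirrors the two search results shown above.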
Regular expressions can also be used in queries. For example, below we search for sources with variables containing the word “Fire”. We can see that only one model (GFDL-ESM4) in each of the CMIP6 datastores contains variables matching this criterion:
new_cat = cat.search(variable=".*Fire.*")
new_cat
Intake dataframe catalog with 2 source(s) across 2 rows:
| name | model | variable |
|---|---|---|
| aws_cmip6 | {GFDL-ESM4} | {fFire, fFireNat} |
| google_cmip6 | {GFDL-ESM4} | {fFire, fFireNat} |
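To see which names a pattern such as ".*Fire.*" selects, it can be tried against a handful of variable names with Python's `re` module. This is only a sketch: the list of names is hypothetical, and the use of `re.search` is an assumption, since how the catalog applies patterns internally is not described here.

```python
import re

# The same pattern as in the query above
pattern = re.compile(".*Fire.*")

# A few hypothetical variable names to test against
variables = ["fFire", "fFireNat", "thetao", "tas", "fire_index"]

# Keep only the names in which the pattern is found
matched = [name for name in variables if pattern.search(name)]
print(matched)
```

The pattern is case sensitive, so a name containing only lowercase "fire" is not selected.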
Loading sources#
There are a few options for loading sources. We can load individual sources if we know their name:
cat["aws_cesm2_lens"] # This is the aws_cesm2_lens intake-esm datastore
aws-cesm2-le catalog with 40 dataset(s) from 322 asset(s):
| | unique |
|---|---|
| variable | 53 |
| long_name | 51 |
| component | 4 |
| experiment | 2 |
| forcing_variant | 2 |
| frequency | 3 |
| vertical_levels | 3 |
| spatial_domain | 3 |
| units | 20 |
| start_time | 4 |
| end_time | 7 |
| path | 313 |
| derived_variable | 0 |
Or (if the source name comprises only letters, numbers and underscores):
cat.aws_cesm2_lens
aws-cesm2-le catalog with 40 dataset(s) from 322 asset(s):
| | unique |
|---|---|
| variable | 53 |
| long_name | 51 |
| component | 4 |
| experiment | 2 |
| forcing_variant | 2 |
| frequency | 3 |
| vertical_levels | 3 |
| spatial_domain | 3 |
| units | 20 |
| start_time | 4 |
| end_time | 7 |
| path | 313 |
| derived_variable | 0 |
Alternatively, there are .to_source and .to_source_dict methods. The former only works when there is only one source remaining in the dataframe catalog (e.g. after performing .search operations). The latter loads all sources into a dictionary with the corresponding source names as keys:
source = cat.search(variable="TEMP").to_source()
source
aws-cesm2-le catalog with 40 dataset(s) from 322 asset(s):
| | unique |
|---|---|
| variable | 53 |
| long_name | 51 |
| component | 4 |
| experiment | 2 |
| forcing_variant | 2 |
| frequency | 3 |
| vertical_levels | 3 |
| spatial_domain | 3 |
| units | 20 |
| start_time | 4 |
| end_time | 7 |
| path | 313 |
| derived_variable | 0 |
source_dict = cat.to_source_dict()
source_dict
{'aws_cesm2_lens': <aws-cesm2-le catalog with 40 dataset(s) from 322 asset(s)>,
'google_cmip6': <pangeo-cmip6 catalog with 7674 dataset(s) from 514818 asset(s)>,
'noaa_global_temp': sources:
csv:
args:
csv_kwargs:
skiprows: 4
urlpath: https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/1/4/1850-2023/data.csv
description: ''
driver: intake.source.csv.CSVSource
metadata:
catalog_dir: '',
'aws_cmip6': <pangeo-cmip6 catalog with 7780 dataset(s) from 522217 asset(s)>}
When the sources in the dataframe catalog are intake-esm datastores, it is common to want to execute the same query on both the dataframe catalog and on the resulting source(s). This can be done with the pass_query argument when loading source(s) with to_source or to_source_dict. Setting pass_query=True will pass the most recent query provided to cat.search on to the .search method of the source(s). An exception will be thrown if the source(s) do not have a .search method, or if the query is not valid for the sources’ .search method.
cat.search(variable="TEMP").to_source(pass_query=True)
aws-cesm2-le catalog with 3 dataset(s) from 3 asset(s):
| | unique |
|---|---|
| variable | 1 |
| long_name | 1 |
| component | 1 |
| experiment | 2 |
| forcing_variant | 2 |
| frequency | 1 |
| vertical_levels | 1 |
| spatial_domain | 1 |
| units | 1 |
| start_time | 2 |
| end_time | 2 |
| path | 3 |
| derived_variable | 0 |
You can see that this has returned an intake-esm datastore that has been filtered to only 3 datasets based on the same query applied to cat. With pass_query=False, the full unfiltered intake-esm datastore is returned comprising 40 datasets (see two cells above).
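The idea behind pass_query can be sketched with a toy source in plain Python. Everything here is hypothetical: `ToySource` stands in for a source with a .search method, `to_source` imitates forwarding the query, and the choice of `AttributeError` is ours rather than the library's.

```python
class ToySource:
    """Hypothetical stand-in for a source that supports .search."""

    def __init__(self, variables):
        self.variables = list(variables)

    def search(self, variable):
        # Return a new source restricted to the requested variable
        return ToySource(v for v in self.variables if v == variable)


def to_source(source, query, pass_query=False):
    """Return the source, optionally applying the same query to it."""
    if not pass_query:
        return source
    if not hasattr(source, "search"):
        raise AttributeError("source has no .search method")
    return source.search(**query)


source = ToySource(["TEMP", "SALT", "O2"])
query = {"variable": "TEMP"}

unfiltered = to_source(source, query)
filtered = to_source(source, query, pass_query=True)
print(len(unfiltered.variables), len(filtered.variables))
```

With pass_query left at False the source comes back untouched. With pass_query=True only the entries matching the query remain.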
Once sources are loaded, we can access data in the normal way for that intake source type. For example, see the intake-esm documentation for how to use intake-esm datastores like the ones we’ve been using in this demonstration.