hydrodata.point_observations.pandas.utils module

Point observations utility functions.

These functions are not intended to be used standalone; they serve as helper steps within the collect_observations.get_pandas_observations method.

hydrodata.point_observations.pandas.utils.check_inputs(data_source, variable, temporal_resolution, aggregation, depth_level, return_metadata, all_attributes)

Validate the inputs to the get_observations function.

Parameters:

data_source (str) – Source from which requested data originated. Currently supported: ‘usgs_nwis’, ‘usda_nrcs’, ‘ameriflux’.
variable (str) – Description of type of data requested. Currently supported: ‘streamflow’, ‘wtd’, ‘swe’, ‘precipitation’, ‘temperature’, ‘soil moisture’, ‘latent heat flux’, ‘sensible heat flux’, ‘shortwave radiation’, ‘longwave radiation’, ‘vapor pressure deficit’, ‘wind speed’.
temporal_resolution (str) – Collection frequency of data requested. Currently supported: ‘daily’, ‘hourly’, and ‘instantaneous’.
aggregation (str) – Additional information specifying the aggregation method for the variable to be returned. Options include descriptors such as ‘average’ and ‘total’. Please see the README documentation for allowable combinations with variable.
depth_level (int) – Depth level in inches at which the measurement is taken. Necessary for variable = ‘soil moisture’.
return_metadata (bool) – Whether the metadata DataFrame is also returned.
all_attributes (bool) – If the metadata DataFrame is returned, an indication of whether the full set of site attributes is included or only a subset.

Return type:

None
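
The kind of checks described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the name check_inputs_sketch, the reduced argument list, and the choice to raise ValueError are assumptions. Only the supported values are taken from the parameter descriptions above. Aggregation is not validated here because the allowable combinations are listed in the README.

```python
SUPPORTED_DATA_SOURCES = {"usgs_nwis", "usda_nrcs", "ameriflux"}
SUPPORTED_VARIABLES = {
    "streamflow", "wtd", "swe", "precipitation", "temperature",
    "soil moisture", "latent heat flux", "sensible heat flux",
    "shortwave radiation", "longwave radiation",
    "vapor pressure deficit", "wind speed",
}
SUPPORTED_RESOLUTIONS = {"daily", "hourly", "instantaneous"}


def check_inputs_sketch(data_source, variable, temporal_resolution,
                        depth_level=None):
    """Hypothetical validator: raises ValueError on the first bad input."""
    if data_source not in SUPPORTED_DATA_SOURCES:
        raise ValueError(f"Unsupported data_source: {data_source!r}")
    if variable not in SUPPORTED_VARIABLES:
        raise ValueError(f"Unsupported variable: {variable!r}")
    if temporal_resolution not in SUPPORTED_RESOLUTIONS:
        raise ValueError(
            f"Unsupported temporal_resolution: {temporal_resolution!r}"
        )
    # Assumption: a missing depth_level is an error for soil moisture.
    if variable == "soil moisture" and depth_level is None:
        raise ValueError(
            "depth_level is required when variable='soil moisture'"
        )


# A valid combination passes silently and returns None.
result = check_inputs_sketch("usgs_nwis", "streamflow", "daily")
```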

hydrodata.point_observations.pandas.utils.convert_to_pandas(ds)

Convert an xarray Dataset to a pandas DataFrame.

Parameters:

ds (Dataset) – xarray Dataset containing stacked observations data for a single variable.
var_id (int) – Integer variable ID associated with combination of variable, temporal_resolution, and aggregation.

Returns:

Stacked observations data for a single variable.

Return type:

DataFrame

hydrodata.point_observations.pandas.utils.filter_min_num_obs(df, min_num_obs)

Filter to only sites which have a minimum number of observations.

This filtering is applied after the observations are subset by time, so a site is dropped when its number of observations within the requested time range falls below min_num_obs, regardless of how many observations it has overall.

Parameters:

df (DataFrame) – Stacked observations data for a single variable.
min_num_obs (int) – Minimum number of observations a site must have to be retained.

Returns:

Stacked observations data for a single variable, filtered to only sites that have the minimum number of observations specified.

Return type:

DataFrame
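
The interaction between time subsetting and the observation-count filter can be illustrated with a short pandas sketch. The long-format layout with columns site_id, date, and value is an assumption made for this example, as is the name filter_min_num_obs_sketch; the actual layout of the stacked DataFrame may differ.

```python
import pandas as pd


def filter_min_num_obs_sketch(df, min_num_obs):
    """Keep rows for sites with at least min_num_obs non-missing values.

    Assumes hypothetical columns 'site_id' and 'value'.
    """
    # Count non-null observations per site and align the count to each row.
    counts = df.groupby("site_id")["value"].transform("count")
    return df[counts >= min_num_obs].reset_index(drop=True)


obs = pd.DataFrame({
    "site_id": ["A", "A", "A", "B", "B"],
    "date": ["2020-01-01", "2020-01-02", "2020-01-03",
             "2020-01-01", "2020-02-15"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Subset by time first (ISO date strings compare correctly as text).
in_range = obs[(obs["date"] >= "2020-01-01") & (obs["date"] <= "2020-01-31")]

# Site B has two observations overall but only one in January,
# so it is dropped when min_num_obs=2.
filtered = filter_min_num_obs_sketch(in_range, min_num_obs=2)
```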

hydrodata.point_observations.pandas.utils.get_data_nc(site_list, var_id, date_start, date_end, min_num_obs)

Get observations for variables whose data is stored in NetCDF files.

Parameters:

site_list (list) – List of site IDs to query observations data for.
var_id (int) – Integer variable ID associated with combination of data_source, variable, temporal_resolution, and aggregation.
date_start (str; default=None) – ‘YYYY-MM-DD’ format date indicating beginning of time range.
date_end (str; default=None) – ‘YYYY-MM-DD’ format date indicating end of time range.
min_num_obs (int) – Minimum number of observations a site must have to be retained.

Returns:

Stacked observations data for a single variable, filtered to only sites that have the minimum number of observations specified.

Return type:

DataFrame

hydrodata.point_observations.pandas.utils.get_data_sql(conn, var_id, date_start, date_end, min_num_obs)

Get observations for variables whose data is stored in a SQL table.

Parameters:

conn (Connection object) – The Connection object associated with the SQLite database to query from.
var_id (int) – Integer variable ID associated with combination of data_source, variable, temporal_resolution, and aggregation.
date_start (str; default=None) – ‘YYYY-MM-DD’ format date indicating beginning of time range.
date_end (str; default=None) – ‘YYYY-MM-DD’ format date indicating end of time range.
min_num_obs (int) – Minimum number of observations a site must have to be retained.

Returns:

Stacked observations data for a single variable, filtered to only sites that have the minimum number of observations specified.

Return type:

DataFrame
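
As a rough illustration of querying by variable and time range while enforcing a minimum observation count, the sketch below uses Python's built-in sqlite3 module. The table name observations and its columns (site_id, var_id, date, value) are hypothetical, as is the function name; the real database schema is not described here. The sketch returns plain rows, which could then be loaded into a DataFrame.

```python
import sqlite3


def get_data_sql_sketch(conn, var_id, date_start=None, date_end=None,
                        min_num_obs=1):
    """Return (site_id, date, value) rows from a hypothetical table."""
    # A NULL bound disables that side of the date filter.
    date_filter = """
        var_id = :var_id
        AND (:date_start IS NULL OR date >= :date_start)
        AND (:date_end IS NULL OR date <= :date_end)
    """
    query = f"""
        SELECT site_id, date, value
        FROM observations
        WHERE {date_filter}
          AND site_id IN (
              SELECT site_id FROM observations
              WHERE {date_filter}
              GROUP BY site_id
              HAVING COUNT(value) >= :min_num_obs
          )
        ORDER BY site_id, date
    """
    params = {"var_id": var_id, "date_start": date_start,
              "date_end": date_end, "min_num_obs": min_num_obs}
    return conn.execute(query, params).fetchall()


# In-memory database with made-up rows for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations "
    "(site_id TEXT, var_id INTEGER, date TEXT, value REAL)"
)
conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?)", [
    ("A", 1, "2020-01-01", 1.0),
    ("A", 1, "2020-01-02", 2.0),
    ("B", 1, "2020-01-01", 4.0),
    ("B", 1, "2020-03-01", 5.0),
])

rows = get_data_sql_sketch(conn, 1, "2020-01-01", "2020-01-31", 2)
```

Counting inside the same date window as the outer query mirrors the behavior described for filter_min_num_obs, where counts are taken after the time subset.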

hydrodata.point_observations.pandas.utils.get_dirpath(var_id)

Map a variable ID to the location of its data on /hydrodata.

Parameters:

var_id (int) – Integer variable ID associated with combination of data_source, variable, temporal_resolution, and aggregation.

Returns:

dirpath – Directory path for observation data location.

Return type:

str

hydrodata.point_observations.pandas.utils.get_observations_metadata(conn, var_id, date_start=None, date_end=None, latitude_range=None, longitude_range=None, site_ids=None, state=None, all_attributes=False)

Build DataFrame with site attribute metadata information.

Parameters:

conn (Connection object) – The Connection object associated with the SQLite database to query from.
var_id (int) – Integer variable ID associated with combination of data_source, variable, temporal_resolution, and aggregation.
date_start (str; default=None) – ‘YYYY-MM-DD’ format date indicating beginning of time range.
date_end (str; default=None) – ‘YYYY-MM-DD’ format date indicating end of time range.
latitude_range (tuple; default=None) – Latitude range bounds for the geographic domain; lesser value is provided first.
longitude_range (tuple; default=None) – Longitude range bounds for the geographic domain; lesser value is provided first.
site_ids (list; default=None) – List of desired (string) site identifiers.
state (str; default=None) – Two-letter postal code state abbreviation.
all_attributes (bool; default=False) – Whether to include all available attributes on returned DataFrame.

Returns:

Site-level DataFrame of attribute metadata information.

Return type:

DataFrame

Notes

The returned ‘record_count’ field is the overall record count for each site. Filtering of metadata applies only at the site level, so only sites within the provided bounds (space and time) are included. The record count does not reflect any filtering at the data/observation level.
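
One way to picture site-level filtering with optional arguments is to build the WHERE clause only from the filters that were supplied. The sketch below is an assumption-laden illustration: the helper build_site_filter, the sites table, and its columns are hypothetical, and the date_start/date_end filters are omitted for brevity.

```python
import sqlite3


def build_site_filter(latitude_range=None, longitude_range=None,
                      site_ids=None, state=None):
    """Build a WHERE clause and parameter list from optional filters."""
    clauses, params = [], []
    if latitude_range is not None:
        clauses.append("latitude BETWEEN ? AND ?")  # lesser value first
        params.extend(latitude_range)
    if longitude_range is not None:
        clauses.append("longitude BETWEEN ? AND ?")  # lesser value first
        params.extend(longitude_range)
    if site_ids:
        placeholders = ", ".join("?" for _ in site_ids)
        clauses.append(f"site_id IN ({placeholders})")
        params.extend(site_ids)
    if state is not None:
        clauses.append("state = ?")
        params.append(state)
    # With no filters, fall back to a clause that matches every site.
    where = " AND ".join(clauses) if clauses else "1 = 1"
    return where, params


# Hypothetical site table; record_count holds each site's overall total.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sites (site_id TEXT, latitude REAL, longitude REAL, "
    "state TEXT, record_count INTEGER)"
)
conn.executemany("INSERT INTO sites VALUES (?, ?, ?, ?, ?)", [
    ("A", 39.7, -105.0, "CO", 5000),
    ("B", 40.8, -111.9, "UT", 1200),
])

where, params = build_site_filter(latitude_range=(39.0, 41.0), state="CO")
sites = conn.execute(
    f"SELECT site_id, record_count FROM sites WHERE {where}", params
).fetchall()
```

Only the site list is narrowed here; the record_count value returned for site A is its stored overall total, unaffected by the filters.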

hydrodata.point_observations.pandas.utils.get_var_id(conn, data_source, variable, temporal_resolution, aggregation, depth_level=None)

Return mapped var_id.

Parameters:

conn (Connection object) – The Connection object associated with the SQLite database to query from.
data_source (str) – Source from which requested data originated. Currently supported: ‘usgs_nwis’, ‘usda_nrcs’, ‘ameriflux’.
variable (str) – Description of type of data requested. Currently supported: ‘streamflow’, ‘wtd’, ‘swe’, ‘precipitation’, ‘temperature’, ‘soil moisture’, ‘latent heat flux’, ‘sensible heat flux’, ‘shortwave radiation’, ‘longwave radiation’, ‘vapor pressure deficit’, ‘wind speed’.
temporal_resolution (str) – Collection frequency of data requested. Currently supported: ‘daily’, ‘hourly’, and ‘instantaneous’.
aggregation (str) – Additional information specifying the aggregation method for the variable to be returned. Options include descriptors such as ‘average’ and ‘total’. Please see the README documentation for allowable combinations with variable.
depth_level (int; default=None) – Depth level in inches at which the measurement is taken. Necessary for variable = ‘soil moisture’.

Returns:

var_id – Integer variable ID associated with combination of data_source, variable, temporal_resolution, and aggregation.

Return type:

int
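
A lookup of this kind might resemble the following sketch, which uses a parameterized query against a hypothetical variables table. The table name, its columns, the sample var_id values, and the ValueError raised when no match is found are all assumptions made for illustration.

```python
import sqlite3


def get_var_id_sketch(conn, data_source, variable, temporal_resolution,
                      aggregation, depth_level=None):
    """Look up var_id in a hypothetical 'variables' table."""
    row = conn.execute(
        """
        SELECT var_id FROM variables
        WHERE data_source = ?
          AND variable = ?
          AND temporal_resolution = ?
          AND aggregation = ?
          AND depth_level IS ?
        """,
        # SQLite's IS operator matches NULL to NULL, so depth_level=None
        # selects rows that have no depth level.
        (data_source, variable, temporal_resolution, aggregation,
         depth_level),
    ).fetchone()
    if row is None:
        raise ValueError("No var_id found for the requested combination")
    return row[0]


# In-memory table with made-up IDs for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variables (var_id INTEGER, data_source TEXT, "
    "variable TEXT, temporal_resolution TEXT, aggregation TEXT, "
    "depth_level INTEGER)"
)
conn.executemany("INSERT INTO variables VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "usgs_nwis", "streamflow", "daily", "average", None),
    (2, "usda_nrcs", "soil moisture", "daily", "average", 2),
])

streamflow_id = get_var_id_sketch(
    conn, "usgs_nwis", "streamflow", "daily", "average"
)
soil_id = get_var_id_sketch(
    conn, "usda_nrcs", "soil moisture", "daily", "average", depth_level=2
)
```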