hydrodata.model_evaluation.utils module

Model evaluation utility functions.

Note that these functions are not intended to be called on their own; they are helper steps used within the model_evaluation.evaluate method.

hydrodata.model_evaluation.utils.aggregate_dataframe(input_df, input_type, aggregate_level, aggregate_method)

Aggregate a daily DataFrame to a different time period (weekly, monthly, yearly).

Parameters:
  • input_df (DataFrame) – Pandas DataFrame with at least site_id, x, y, and columns for daily time series.

  • input_type (str) – Description of what type of input the DataFrame is sourced from: ‘observation’ or ‘simulation’.

  • aggregate_level (str) – Level at which to aggregate time series. Options include ‘day’, ‘week’, ‘month’, etc.

  • aggregate_method (str) – Type of aggregation to conduct. Options include ‘mean’, ‘min’, ‘max’, etc.

Returns:

Pandas DataFrame with site_id, x, y, and columns for time series at the aggregated level.

Return type:

DataFrame
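The aggregation described above can be sketched with pandas as follows. This is a minimal illustration, not the library's implementation: the helper name aggregate_wide_daily, the assumption that date columns are labelled ‘YYYY-MM-DD’, and the mapping from aggregate_level to pandas period codes are all hypothetical.

```python
import pandas as pd


def aggregate_wide_daily(input_df, aggregate_level="month", aggregate_method="mean"):
    """Hypothetical stand-in for aggregate_dataframe (illustration only)."""
    id_cols = ["site_id", "x", "y"]
    # Assumed mapping from level names to pandas period frequency codes.
    freq = {"day": "D", "week": "W", "month": "M", "year": "Y"}[aggregate_level]

    # Keep the identifier columns in the index so only date columns remain.
    daily = input_df.set_index(id_cols)
    daily.columns = pd.to_datetime(daily.columns)

    # Transpose so dates form the index, group by period, then aggregate.
    periods = daily.columns.to_period(freq)
    aggregated = daily.T.groupby(periods).agg(aggregate_method).T

    # Use plain string labels for the aggregated columns.
    aggregated.columns = aggregated.columns.astype(str)
    return aggregated.reset_index()


daily_df = pd.DataFrame(
    {
        "site_id": ["a", "b"],
        "x": [1, 2],
        "y": [3, 4],
        "2020-01-01": [1.0, 10.0],
        "2020-01-02": [3.0, 20.0],
        "2020-02-01": [5.0, 30.0],
    }
)
monthly_df = aggregate_wide_daily(daily_df, "month", "mean")
```

Here monthly_df keeps site_id, x, and y and has one column per calendar month present in the daily data.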

hydrodata.model_evaluation.utils.calculate_stats(observations, simulations, metrics, aggregate_level, aggregate_method, date_start, date_end, statistics_df, debug=False)

Calculate evaluation metric statistics.

Parameters:
  • observations (DataFrame) – Pandas DataFrame that includes site ID, x and y CONUS grid mapping values, and columns for each time point to use for evaluation comparison. This DataFrame should already be aggregated to the aggregate_level defined.

  • simulations (ndarray) – Array of size (t, y, x), where y and x are the same size as CONUS1 or CONUS2 grid, depending on which x, y values are included in the observations DataFrame. The t dimension should match the number of time values included in the ‘daily’ observations DataFrame series. This array should not be aggregated to the aggregate_level defined. That process happens within this function. The spatial extent should be the same size as specified by the grid_bounds parameter if it is not the size of the full CONUS1 or CONUS2 grid.

  • metrics (list) – List of string names of metrics to use for evaluation. Must be present in METRICS_DICT dictionary in the model_evaluation.py module.

  • aggregate_level (str) – Level at which to aggregate time series. Options include ‘day’, ‘week’, ‘month’, etc.

  • aggregate_method (str) – Type of aggregation to conduct. Options include ‘mean’, ‘min’, ‘max’, etc.

  • date_start (str) – ‘YYYY-MM-DD’ date indicating beginning of time range.

  • date_end (str) – ‘YYYY-MM-DD’ date indicating end of time range.

  • statistics_df (DataFrame) – DataFrame containing site ID, x and y CONUS grid mapping values, along with empty columns for each of the evaluation metrics defined in metrics.

  • debug (bool; default=False) – Whether to show debugging print statements.

Returns:

Pandas DataFrame including calculated evaluation metrics for each site.

Return type:

DataFrame
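As a rough sketch of the per-site comparison described above, the example below fills a statistics table using a small, hypothetical metric registry. EXAMPLE_METRICS and fill_statistics are illustrative names only; the actual contents of METRICS_DICT are not shown in this documentation, and the aggregation step is omitted here by assuming observations and simulations already share the same time steps.

```python
import numpy as np
import pandas as pd

# Hypothetical metric registry; the real METRICS_DICT may differ.
EXAMPLE_METRICS = {
    "bias": lambda obs, sim: float(np.nanmean(sim - obs)),
    "rmse": lambda obs, sim: float(np.sqrt(np.nanmean((sim - obs) ** 2))),
}


def fill_statistics(observations, simulations, metrics, statistics_df):
    """Illustrative only: compute each metric for each site."""
    stats = statistics_df.copy()
    date_cols = [c for c in observations.columns if c not in ("site_id", "x", "y")]
    for _, row in observations.iterrows():
        obs = row[date_cols].to_numpy(dtype=float)
        # The simulations array is indexed as (t, y, x).
        sim = simulations[:, int(row["y"]), int(row["x"])]
        for name in metrics:
            value = EXAMPLE_METRICS[name](obs, sim)
            stats.loc[stats["site_id"] == row["site_id"], name] = value
    return stats


observations = pd.DataFrame(
    {"site_id": ["a"], "x": [1], "y": [0], "2020-01-01": [1.0], "2020-01-02": [2.0]}
)
simulations = np.zeros((2, 1, 2))
simulations[:, 0, 1] = [2.0, 4.0]
statistics_df = pd.DataFrame(
    {"site_id": ["a"], "x": [1], "y": [0], "bias": [np.nan], "rmse": [np.nan]}
)
result = fill_statistics(observations, simulations, ["bias", "rmse"], statistics_df)
```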

hydrodata.model_evaluation.utils.collect_simulations(config_dict)

Collect and pre-process simulations data according to file format provided.

Parameters:

config_dict (dict) – Dictionary from YAML configuration file with the simulations information for a given observation type.

Returns:

NumPy array to use as input for evaluation comparison.

Return type:

ndarray

hydrodata.model_evaluation.utils.create_dataframe_from_dates(data, date_start, date_end)

Given a 1D NumPy array, create a DataFrame with columns spanning the given date range.

Parameters:
  • data (1D array) – 1-dimensional array containing the simulation results for a single x, y geo-location.

  • date_start (str) – ‘YYYY-MM-DD’ date indicating beginning of time range.

  • date_end (str) – ‘YYYY-MM-DD’ date indicating end of time range.

Returns:

DataFrame with the columns as dates and values as data from input array.

Return type:

DataFrame
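A minimal sketch of this conversion is shown below. It assumes daily time steps, an inclusive date_end, and ‘YYYY-MM-DD’ column labels; the helper name dataframe_from_dates is hypothetical and the library's own handling may differ.

```python
import numpy as np
import pandas as pd


def dataframe_from_dates(data, date_start, date_end):
    """Illustrative only: one row of values, one column per day."""
    dates = pd.date_range(start=date_start, end=date_end, freq="D")
    values = np.asarray(data).reshape(1, -1)
    if values.shape[1] != len(dates):
        raise ValueError("Length of data does not match the number of days in range.")
    return pd.DataFrame(values, columns=dates.strftime("%Y-%m-%d"))


series_df = dataframe_from_dates(np.array([1.0, 2.0, 3.0]), "2020-01-01", "2020-01-03")
```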

hydrodata.model_evaluation.utils.filter_to_mapped_sites(df)

Filter site mapping data for only non-NaN values in mapping.

Parameters:

df (DataFrame) – Pandas DataFrame consisting of site ID, x, and y values representing the CONUS x and y coordinate mapping for each site.

Returns:

Pandas DataFrame consisting of site ID, x, and y values representing the CONUS x and y coordinate mapping for each site. Only sites with valid CONUS mapping values are included.

Return type:

DataFrame
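The filter can be illustrated with a short pandas sketch. The column names site_id, x, and y follow the descriptions above; the helper name keep_mapped_sites and the site ID values are made up for the example.

```python
import numpy as np
import pandas as pd


def keep_mapped_sites(df):
    """Illustrative only: drop sites whose x or y mapping is NaN."""
    return df.dropna(subset=["x", "y"]).reset_index(drop=True)


mapping = pd.DataFrame(
    {
        "site_id": ["01010000", "01010500", "01011000"],  # made-up IDs
        "x": [10.0, np.nan, 42.0],
        "y": [20.0, 31.0, np.nan],
    }
)
mapped = keep_mapped_sites(mapping)
```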

hydrodata.model_evaluation.utils.filter_to_sites_in_bounds(mapped_df, grid_bounds)

Filter site list and grid-mapping DataFrame to only sites within grid_bounds. Adjust bound values to be relative to indexing on subset domain.

Parameters:
  • mapped_df (DataFrame) – Pandas DataFrame consisting of site ID, x, and y values representing the CONUS x and y coordinate mapping for each site.

  • grid_bounds (list) – List consisting of [xmin, ymin, xmax, ymax] CONUS grid bounds for domain.

Returns:

  • subset_site_list (list) – List of string site ID values restricted to sites within provided grid_bounds.

  • remapped_df (DataFrame) – Pandas DataFrame consisting of site ID, x, and y values representing the CONUS x and y coordinate mapping for each site. DataFrame is subset to only sites within provided grid_bounds.
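One way to picture the filtering and re-indexing is the sketch below. It assumes half-open bounds (xmin and ymin inclusive, xmax and ymax exclusive) and shifts x and y so that they index into the subset array; both choices, and the helper name sites_in_bounds, are assumptions rather than documented behavior.

```python
import pandas as pd


def sites_in_bounds(mapped_df, grid_bounds):
    """Illustrative only: keep sites inside the bounds and re-index x, y."""
    xmin, ymin, xmax, ymax = grid_bounds
    inside = (
        (mapped_df["x"] >= xmin)
        & (mapped_df["x"] < xmax)
        & (mapped_df["y"] >= ymin)
        & (mapped_df["y"] < ymax)
    )
    remapped_df = mapped_df.loc[inside].copy()
    # Shift coordinates so (xmin, ymin) becomes (0, 0) in the subset domain.
    remapped_df["x"] = remapped_df["x"] - xmin
    remapped_df["y"] = remapped_df["y"] - ymin
    subset_site_list = remapped_df["site_id"].tolist()
    return subset_site_list, remapped_df


mapped_df = pd.DataFrame({"site_id": ["a", "b", "c"], "x": [5, 12, 30], "y": [7, 15, 18]})
site_list, remapped = sites_in_bounds(mapped_df, [10, 10, 20, 20])
```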

hydrodata.model_evaluation.utils.get_full_site_list(observation_type)

Return list of site IDs with daily data available for a particular site type.

Parameters:

observation_type (str) – Type of observation. Examples include ‘streamflow’, ‘wtd’, ‘swe’.

Returns:

site_list – List of site ID strings for all sites with data available for the specific site type.

Return type:

list

hydrodata.model_evaluation.utils.get_network_site_list(observation_type, network_names)

Return list of site IDs for desired network of observation sites.

Parameters:
  • observation_type (str) – Type of observation. Examples include ‘streamflow’, ‘wtd’, ‘swe’.

  • network_names (list) – List of names of site networks. Can be a list with a single network name. Each network must have a matching .csv file listing the site ID values that comprise the network. This .csv file must be located in either STREAMFLOW_NETWORK_DIR or GROUNDWATER_NETWORK_DIR (as applicable) and be named ‘network_name’.csv.

Returns:

site_list – List of site ID strings for sites belonging to named network.

Return type:

list

hydrodata.model_evaluation.utils.initialize_output(sites_df, metrics, aggregation_levels)

Initialize DataFrame table to store output.

Parameters:
  • sites_df (DataFrame) – Pandas DataFrame consisting of at least site_id, x, and y CONUS grid mapping values.

  • metrics (list) – List of string names of metrics to use for evaluation. Must be present in METRICS_DICT dictionary in the model_evaluation.py module.

  • aggregation_levels (list) – List of aggregation levels to calculate evaluations for each metric in metrics. Options currently include: ‘day’, ‘week’, ‘month’, ‘calendar year’.

Returns:

DataFrame containing site ID, x and y CONUS grid mapping values, along with empty columns for each of the evaluation metrics defined in metrics.

Return type:

DataFrame
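The shape of such an output table might look like the sketch below. The column naming scheme (metric name joined to aggregation level) is purely an assumption made for illustration, as is the helper name init_output; the documentation above does not specify how the library labels these columns.

```python
import numpy as np
import pandas as pd


def init_output(sites_df, metrics, aggregation_levels):
    """Illustrative only: add one NaN-filled column per metric and level."""
    out = sites_df[["site_id", "x", "y"]].copy()
    for level in aggregation_levels:
        for metric in metrics:
            # Hypothetical naming scheme, e.g. "rmse_month".
            out[f"{metric}_{level}"] = np.nan
    return out


sites_df = pd.DataFrame({"site_id": ["a", "b"], "x": [1, 2], "y": [3, 4]})
empty_stats = init_output(sites_df, ["rmse", "bias"], ["day", "month"])
```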

hydrodata.model_evaluation.utils.map_observations(observation_type, conus_version, site_list)

Get CONUS grid mapping values for sites.

Parameters:
  • observation_type (str) – Type of observation. Examples include ‘streamflow’, ‘wtd’, ‘swe’.

  • conus_version (str) – Reference for whether simulations are on CONUS1 or CONUS2 grid. Options include ‘conus1’ and ‘conus2’.

  • site_list (list) – List of site ID strings to query CONUS x, y coordinates for.

Returns:

Pandas DataFrame consisting of site ID, x, and y values representing the CONUS x and y coordinate mapping for each site. Only sites with valid CONUS mapping values are included.

Return type:

DataFrame

hydrodata.model_evaluation.utils.prepare_emulator_simulations(filepath, emulator_var, trace=1)

Prepare ParFlow emulator output for comparison to observations.

Parameters:
  • filepath (str) – Full path to location of ParFlow emulator output file.

  • emulator_var (str) – Name of variable to slice from ParFlow emulator output file.

  • trace (int; default=1) – Value of trace to slice from ParFlow emulator output file.

Returns:

  • a (NumPy array) – Array containing the extracted simulations data.

  • date_start (str) – ‘YYYY-MM-DD’ date indicating beginning of time range.

  • date_end (str) – ‘YYYY-MM-DD’ date indicating end of time range.

hydrodata.model_evaluation.utils.prepare_netcdf_simulations(filepath, var)

Prepare output in NetCDF file for comparison to observations.

Parameters:
  • filepath (str) – Full path to location of NetCDF output file.

  • var (str) – Name of variable to slice from NetCDF file.

Returns:

  • a (NumPy array) – Array containing the extracted simulations data.

  • date_start (str) – ‘YYYY-MM-DD’ date indicating beginning of time range.

  • date_end (str) – ‘YYYY-MM-DD’ date indicating end of time range.

hydrodata.model_evaluation.utils.prepare_simulations_time_series(simulations, x, y, date_start, date_end, aggregate_level, aggregate_method)

Pre-process simulations data to be in comparable format to aggregated observations.

Parameters:
  • simulations (ndarray) – Array of size (t, y, x), where y and x are the same size as CONUS1 or CONUS2 grid, depending on which x, y values are included in the observations DataFrame. The t dimension should match the number of time values included in the ‘daily’ observations DataFrame series. This array should not be aggregated to the aggregate_level defined. That process happens within this function. The spatial extent should be the same size as specified by the grid_bounds parameter if it is not the size of the full CONUS1 or CONUS2 grid.

  • x (int) – x coordinate used to index the simulation array. The origin (0, 0) is at the lower left of the grid. The value is the number of CONUS grid cells from the origin.

  • y (int) – y coordinate used to index the simulation array. The origin (0, 0) is at the lower left of the grid. The value is the number of CONUS grid cells from the origin.

  • date_start (str) – ‘YYYY-MM-DD’ date indicating beginning of time range.

  • date_end (str) – ‘YYYY-MM-DD’ date indicating end of time range.

  • aggregate_level (str) – Level at which to aggregate time series. Options include ‘day’, ‘week’, ‘month’, etc.

  • aggregate_method (str) – Type of aggregation to conduct. Options include ‘mean’, ‘min’, ‘max’, etc.

Returns:

1-dimensional array containing the aggregated simulation results for the given x, y geo-location.

Return type:

1D array
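The slicing and aggregation described above can be sketched as follows. The example assumes a (t, y, x) array with one time step per day between date_start and date_end inclusive; the helper name site_time_series and the mapping from aggregate_level to pandas period codes are hypothetical.

```python
import numpy as np
import pandas as pd


def site_time_series(simulations, x, y, date_start, date_end,
                     aggregate_level="month", aggregate_method="mean"):
    """Illustrative only: extract one grid cell and aggregate it in time."""
    # Assumed mapping from level names to pandas period frequency codes.
    freq = {"day": "D", "week": "W", "month": "M", "year": "Y"}[aggregate_level]
    dates = pd.date_range(date_start, date_end, freq="D")
    # The array is indexed as (t, y, x), so y comes before x.
    series = pd.Series(simulations[:, y, x], index=dates)
    return series.groupby(dates.to_period(freq)).agg(aggregate_method).to_numpy()


# 60 daily time steps (January and February 2020) on a 2 x 3 grid.
simulations = np.arange(60 * 2 * 3, dtype=float).reshape(60, 2, 3)
monthly = site_time_series(simulations, x=1, y=0,
                           date_start="2020-01-01", date_end="2020-02-29")
```

The result holds one value per month for the selected grid cell.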

hydrodata.model_evaluation.utils.query_attributes_conus_xy(observation_type, conus_version, site_list)

Query site attributes database table for CONUS grid values.

Parameters:
  • observation_type (str) – Type of observation. Examples include ‘streamflow’, ‘wtd’, ‘swe’.

  • conus_version (str) – Reference for whether simulations are on CONUS1 or CONUS2 grid. Options include ‘conus1’ and ‘conus2’.

  • site_list (list) – List of site ID strings to query CONUS x, y coordinates for.

Returns:

Pandas DataFrame consisting of site ID, x, and y values representing the CONUS x and y coordinate mapping for each site.

Return type:

DataFrame