Conversation
Pull request overview
This PR improves data type handling in the waterdata module by replacing the _cleanup_cols function with a more comprehensive _type_cols function. The key motivation is that the previous approach using .dt.date for daily data did not return proper datetime objects.
Key changes:
- Refactored `_cleanup_cols` to `_type_cols` with expanded type handling
- Added support for datetime columns (time, datetime, last_modified) using `pd.to_datetime()`
- Added categorical type conversion for identifier and status columns
- Removed the `service` parameter dependency, making the function more generic
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
dataretrieval/waterdata/utils.py
Outdated
```python
df[col] = pd.to_numeric(df[col], errors="coerce")
cols = set(df.columns)
numerical_cols = ["value", "contributing_drainage_area"]
time_cols = ["time", "datetime", "last_modified"]
```
Is there a "datetime" column in any of the endpoints? I went through all of them (including continuous) and I don't see it. However, I do see some others:
Time series metadata: begin, end, begin_utc, end_utc
Monitoring locations: construction_date
And some additional numerics:
Monitoring locations: altitude, altitude_accuracy, drainage_area, well_constructed_depth, hole_constructed_depth
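If those columns were added, the groups in `_type_cols` might be extended like this (a sketch using the names listed in this comment; whether each actually appears in the endpoints' schemas would need to be verified):

```python
# Sketch: column groups extended with the reviewer-suggested names
numerical_cols = [
    "value",
    "contributing_drainage_area",
    # Monitoring locations
    "altitude",
    "altitude_accuracy",
    "drainage_area",
    "well_constructed_depth",
    "hole_constructed_depth",
]
time_cols = [
    "time",
    "datetime",
    "last_modified",
    # Time series metadata
    "begin",
    "end",
    "begin_utc",
    "end_utc",
    # Monitoring locations
    "construction_date",
]
```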
I'm not aware of a datetime column, but see no harm in including it.
```python
    - The 'value' and 'contributing_drainage_area' columns are coerced to numeric types.
    """
    if "time" in df.columns and service == "daily":
        df["time"] = pd.to_datetime(df["time"]).dt.date
```
This piece of the function originally was just changing the "time" column to simply a date (not a timestamp) for the daily values endpoint only, so that the user wouldn't be confused about whether the value represents a daily aggregated value (min, max, mean, etc.) or a particular measurement. This logic was initially introduced to match what R dataRetrieval was doing: https://github.com/DOI-USGS/dataRetrieval/blob/develop/R/walk_pages.R#L141
Right. My recollection was that dt.date lacks datetime functionality; furthermore, the parsing behavior of pd.to_datetime seems to have changed. By default, it correctly omits the time information from the display. Maybe this was a pandas update, but in the current version it seems correct to leave "time" as a datetime object.
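The dtype difference under discussion can be checked directly: for date-only input, `pd.to_datetime` yields a `datetime64[ns]` column, while `.dt.date` downgrades the column to `object` dtype:

```python
import pandas as pd

s = pd.Series(["2025-12-01", "2025-12-02"])

as_datetime = pd.to_datetime(s)
print(as_datetime.dtype)  # datetime64[ns]
# When every value falls at midnight, the Series repr shows only
# the date portion (e.g. "2025-12-01"), with no "00:00:00" suffix.
print(as_datetime)

as_date = as_datetime.dt.date
print(as_date.dtype)  # object -- plain datetime.date values,
                      # so the .dt accessor no longer works on them
```

This may explain why the two of us see different displays: the Series repr hides midnight times, but an individual Timestamp (or some renderers) will still show "2025-12-01 00:00:00".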
I am not seeing that, or I am misunderstanding. When I run your branch and pull from get_latest_daily, the "time" column shows up as "2025-12-01 00:00:00", whereas in the existing implementation, it shows up as "2025-12-01".
```python
check, md = waterdata.get_latest_daily(
    monitoring_location_id="USGS-05129115",
    parameter_code="00060"
)
```
I like the existing implementation for daily summaries only, because the date cannot be confused with a singular measurement, and it indeed represents a "summary" value. However, if it causes problems by being inconsistent in the daily summary services, I'm open to applying a consistent rule.
Ah, you're correct, dt.date does lack datetime functionality. It changes the column to object dtype. Hm. It does make sense to give it a datetime type. Nevermind. We might then just want to document that the additional "00:00:00" appended to each date doesn't represent a singular measurement.
Did it display 00:00:00? It didn't for me, so this behavior was probably changed in some version of pandas.
Co-authored-by: Elise Hinman <121896266+ehinman@users.noreply.github.com>
ehinman
left a comment
Now all I can think of is "lgtm". Approved.
This is an initial PR to improve data type handling within the `waterdata` module. Simply calling `pd.to_datetime(df['time'])` appears to correctly handle daily data.