I was trying to get the period of record for data in the St. Johns River Water Management District’s Hydstra database, which houses hydrologic data. I found the report type HYREP PERIOD, which produces an output that contains the period of record information I needed. However, it was not in a convenient format to allow for easy joining to other Oracle-based tables.
This project steps through the data cleaning process I used to get the data into a usable format in R. I include brief snapshots (a few rows) of the data frames to illustrate each of the cleaning steps.
HYREP PERIOD data output
The HYREP PERIOD report produces output that is easy to read at a small scale (i.e., when looking at only a few stations at a time).
With hundreds of stations, however, it made more sense to reformat the output and extract all of the relevant data into a table structure. A site identifier appears in the line above each block of data, and it is important to keep that identifier linked to its data. Additionally, there are many rows that are not needed.
Importing the data
With unstructured data like this, the read_lines function from the readr package loads the file line by line. The text file contains many empty rows, so setting the skip_empty_rows argument to TRUE removes them. Wrapping read_lines in data.frame ensures that we start working with table-based data right away. The code below also loads the other packages used throughout: dplyr, stringr, and tidyr.
library(dplyr)
library(readr)
library(stringr)
library(tidyr)

hy <- data.frame(V1 = read_lines("HY_POR_WATERCAT.txt", skip_empty_rows = TRUE),
                 stringsAsFactors = FALSE)
V1 |
---|
HYREP.PERIOD V77 Output 02/07/2019 |
File 20014731.A - State Rd 16 |
Var Description Start End Period Missing |
------ ---------------------------------- ---------------- ---------------- -------------- ------- |
227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 11:00_07/10/2018 12.03 Years 0% |
227.14 Elev. (feet NAVD88) Random Manual 12:01_09/22/1970 11:35_07/10/2018 47.80 Years 0% |
232.10 Elev. (feet NGVD29) Continuous 11:00_06/30/2006 11:00_03/01/2010 3.67 Years 0% |
232.14 Elev. (feet NGVD29) Random Manual 12:01_09/22/1970 11:14_03/01/2010 39.44 Years 0% |
HYREP.PERIOD V77 Output 02/07/2019 |
File 20014731.X - State Rd 16 |
Cleaning the data
Remove unnecessary rows
There are a number of lines that do not contain any data of interest; they can be identified by the strings "--", "HYREP.PERIOD", "Var", "dup/", or "Index". The grepl function searches for any of these strings, and the filter function in the dplyr package lets us drop the matching lines. (In the regular expression, the unescaped dot in "HYREP.PERIOD" matches any character, which is harmless here.) In addition, the str_trim function in the stringr package strips whitespace from the start and end of each line.
hy <- hy %>%
filter(!grepl("--|HYREP.PERIOD|Var|dup/|Index", V1)) %>%
mutate(V1 = str_trim(V1))
V1 |
---|
File 20014731.A - State Rd 16 |
227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 11:00_07/10/2018 12.03 Years 0% |
227.14 Elev. (feet NAVD88) Random Manual 12:01_09/22/1970 11:35_07/10/2018 47.80 Years 0% |
232.10 Elev. (feet NGVD29) Continuous 11:00_06/30/2006 11:00_03/01/2010 3.67 Years 0% |
232.14 Elev. (feet NGVD29) Random Manual 12:01_09/22/1970 11:14_03/01/2010 39.44 Years 0% |
File 20014731.X - State Rd 16 |
227.10 Elev. (feet NAVD88) Continuous 00:00_01/01/2010 06:00_02/07/2019 9.10 Years 0% |
File 20014731.H - State Rd 16 |
227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 02:00_09/13/2016 10.21 Years 0% |
File 20014731.W - State Rd 16 |
Create a Site column
The structure of the data is now a line with information about the site followed by one or more lines of data. The site designator is an eight-digit number included in each line above the data blocks. The str_extract function from stringr can extract those eight consecutive digits with the regular expression \d{8}. The fill function from the tidyr package then fills the Site values downward so that each data row is associated with the correct Site. Once the Site variable is extracted, the rows containing the string "File" can be omitted.
hy <- hy %>%
mutate(Site = str_extract(V1, "\\d{8}")) %>%
select(Site, V1) %>%
fill(Site) %>%
filter(!grepl("File", V1))
Site | V1 |
---|---|
20014731 | 227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 11:00_07/10/2018 12.03 Years 0% |
20014731 | 227.14 Elev. (feet NAVD88) Random Manual 12:01_09/22/1970 11:35_07/10/2018 47.80 Years 0% |
20014731 | 232.10 Elev. (feet NGVD29) Continuous 11:00_06/30/2006 11:00_03/01/2010 3.67 Years 0% |
20014731 | 232.14 Elev. (feet NGVD29) Random Manual 12:01_09/22/1970 11:14_03/01/2010 39.44 Years 0% |
20014731 | 227.10 Elev. (feet NAVD88) Continuous 00:00_01/01/2010 06:00_02/07/2019 9.10 Years 0% |
20014731 | 227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 02:00_09/13/2016 10.21 Years 0% |
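As a quick illustration (not part of the original pipeline), this is how the pattern pulls the eight-digit ID out of a "File" line:

str_extract("File 20014731.A - State Rd 16", "\\d{8}")
#> [1] "20014731"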
Create additional columns
Additional columns can be created by extracting pieces of the V1 column using str_extract and regular expressions. The Variable identifier splits at its decimal point into Var and SubVar; in this output, SubVar .10 marks Continuous records and .14 marks Random Manual records. Only some of the variables and subvariables need to be retained, so the rest can be filtered out.
hy <- hy %>%
  mutate(Variable = str_extract(V1, "^\\d+\\.\\d+"),
         # The first timestamp on each line is the start of the record
         StartDate = as.POSIXct(str_extract(V1, "\\d{2}:\\d{2}_\\d{2}/\\d{2}/\\d{4}"),
                                format = "%H:%M_%m/%d/%Y", tz = "EST"),
         # Match lazily from the first timestamp through the second,
         # then keep the last 16 characters (the width of one timestamp)
         EndDate = as.POSIXct(str_sub(str_extract(V1, "\\d{2}:\\d{2}_\\d{2}/\\d{2}/\\d{4}.*?(\\d{2}:\\d{2}_\\d{2}/\\d{2}/\\d{4})"), -16),
                              format = "%H:%M_%m/%d/%Y", tz = "EST"),
         # Split the variable identifier at the decimal point
         Var = as.numeric(str_extract(Variable, ".*(?=\\.)")),
         SubVar = as.numeric(str_extract(Variable, "(\\.[^\\.]+)$"))) %>%
  filter(SubVar %in% c(0.1, 0.14), !(Var %in% c(104, 229, 252, 300, 551, 2327))) %>%
  select(Site, Variable, Var, SubVar, StartDate, EndDate, V1)
Site | Variable | Var | SubVar | StartDate | EndDate | V1 |
---|---|---|---|---|---|---|
20014731 | 227.10 | 227 | 0.1 | 2006-06-30 11:00:00 | 2018-07-10 11:00:00 | 227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 11:00_07/10/2018 12.03 Years 0% |
20014731 | 227.14 | 227 | 0.14 | 1970-09-22 12:01:00 | 2018-07-10 11:35:00 | 227.14 Elev. (feet NAVD88) Random Manual 12:01_09/22/1970 11:35_07/10/2018 47.80 Years 0% |
20014731 | 232.10 | 232 | 0.1 | 2006-06-30 11:00:00 | 2010-03-01 11:00:00 | 232.10 Elev. (feet NGVD29) Continuous 11:00_06/30/2006 11:00_03/01/2010 3.67 Years 0% |
20014731 | 232.14 | 232 | 0.14 | 1970-09-22 12:01:00 | 2010-03-01 11:14:00 | 232.14 Elev. (feet NGVD29) Random Manual 12:01_09/22/1970 11:14_03/01/2010 39.44 Years 0% |
20014731 | 227.10 | 227 | 0.1 | 2010-01-01 | 2019-02-07 06:00:00 | 227.10 Elev. (feet NAVD88) Continuous 00:00_01/01/2010 06:00_02/07/2019 9.10 Years 0% |
20014731 | 227.10 | 227 | 0.1 | 2006-06-30 11:00:00 | 2016-09-13 02:00:00 | 227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 02:00_09/13/2016 10.21 Years 0% |
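The EndDate extraction is the least obvious step, so here is how it behaves on one sample line (an illustration, not part of the original script). The lazy pattern matches from the first timestamp through the second, and str_sub then keeps the last 16 characters, exactly the width of one timestamp:

x <- "227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 11:00_07/10/2018 12.03 Years 0%"
ts2 <- "\\d{2}:\\d{2}_\\d{2}/\\d{2}/\\d{4}.*?(\\d{2}:\\d{2}_\\d{2}/\\d{2}/\\d{4})"
str_extract(x, ts2)
#> [1] "11:00_06/30/2006 11:00_07/10/2018"
str_sub(str_extract(x, ts2), -16)
#> [1] "11:00_07/10/2018"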
Summarize the data
Group rows
I was interested in the period of record for each variable and site combination, so the group_by and summarize functions from dplyr can be used to extract the earliest start date and the latest end date.
hy_sum <- hy %>%
group_by(Site, Var) %>%
summarize(Start = min(StartDate), End = max(EndDate))
Site | Var | Start | End |
---|---|---|---|
00100109 | 11 | 1990-08-03 16:15:00 | 2019-02-07 06:00:00 |
00171258 | 11 | 1991-10-14 15:31:00 | 2019-02-07 06:00:00 |
00220012 | 227 | 1993-05-13 14:09:00 | 2018-08-03 08:56:00 |
00220012 | 232 | 1993-05-13 14:09:00 | 2010-04-13 12:05:00 |
00220012 | 233 | 1993-05-13 14:09:00 | 2019-02-07 06:00:00 |
00220012 | 262 | 1993-05-13 14:09:00 | 2018-09-26 09:14:00 |
Add Event column
The database I needed to join to these data lists variables as text-based Event Types, as shown in this crosswalk table.
var_xwalk <- read_csv("Event_Var_Xwalk.csv", col_types = "cd")
Event Type | Var |
---|---|
Rain (RN) | 11 |
Vrain | 11 |
Water Level (WL) | 227 |
Water Level (WL) | 232 |
Water Level (WL) - Gauge Height | 233 |
Water Level (WL) | 233 |
Velocity | 237 |
Discharge | 262 |
The crosswalk table can be used to create an Event column that holds the Event Types. Because a single Var can map to more than one Event Type, the first step collects every matching Event Type into one semicolon-separated string per row; rowwise ensures the lookup uses each row's own Var value.
hy_sum <- hy_sum %>%
  rowwise() %>%
  mutate(Event = paste(unique(var_xwalk$`Event Type`[var_xwalk$Var == Var]),
                       collapse = ";")) %>%
  ungroup()
Site | Var | Start | End | Event |
---|---|---|---|---|
00100109 | 11 | 1990-08-03 16:15:00 | 2019-02-07 06:00:00 | Rain (RN);Vrain |
00171258 | 11 | 1991-10-14 15:31:00 | 2019-02-07 06:00:00 | Rain (RN);Vrain |
00220012 | 227 | 1993-05-13 14:09:00 | 2018-08-03 08:56:00 | Water Level (WL);Water Level (WL) - Gauge Height |
00220012 | 232 | 1993-05-13 14:09:00 | 2010-04-13 12:05:00 | Water Level (WL);Water Level (WL) - Gauge Height |
00220012 | 233 | 1993-05-13 14:09:00 | 2019-02-07 06:00:00 | Water Level (WL);Water Level (WL) - Gauge Height |
00220012 | 262 | 1993-05-13 14:09:00 | 2018-09-26 09:14:00 | Water Level (WL);Water Level (WL) - Gauge Height |
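As an aside, the same Event strings could be built with a join instead of the rowwise lookup. A sketch (shown without reassignment) that could stand in for the mutate above:

hy_sum %>%
  left_join(var_xwalk, by = "Var") %>%
  group_by(Site, Var, Start, End) %>%
  summarize(Event = paste(unique(`Event Type`), collapse = ";"))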
When multiple event types are present, they can be split apart with the base strsplit function, and then each event type can be placed in its own row with the unnest function from tidyr.
hy_sum <- hy_sum %>%
mutate(Event = strsplit(as.character(Event), ";")) %>%
unnest(Event)
Site | Var | Start | End | Event |
---|---|---|---|---|
00100109 | 11 | 1990-08-03 16:15:00 | 2019-02-07 06:00:00 | Rain (RN) |
00100109 | 11 | 1990-08-03 16:15:00 | 2019-02-07 06:00:00 | Vrain |
00171258 | 11 | 1991-10-14 15:31:00 | 2019-02-07 06:00:00 | Rain (RN) |
00171258 | 11 | 1991-10-14 15:31:00 | 2019-02-07 06:00:00 | Vrain |
00220012 | 227 | 1993-05-13 14:09:00 | 2018-08-03 08:56:00 | Water Level (WL) |
00220012 | 227 | 1993-05-13 14:09:00 | 2018-08-03 08:56:00 | Water Level (WL) - Gauge Height |
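As an aside, tidyr's separate_rows performs the split and the unnest in a single step; the pair of calls above could be replaced with:

hy_sum %>%
  separate_rows(Event, sep = ";")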
Group rows again
Finally, I can summarize over each Site and Event combination to get the period of record of interest.
hy_sum <- hy_sum %>%
group_by(Site, Event) %>%
summarize(Start = min(Start), End = max(End))
Site | Event | Start | End |
---|---|---|---|
00100109 | Rain (RN) | 1990-08-03 16:15:00 | 2019-02-07 06:00:00 |
00100109 | Vrain | 1990-08-03 16:15:00 | 2019-02-07 06:00:00 |
00171258 | Rain (RN) | 1991-10-14 15:31:00 | 2019-02-07 06:00:00 |
00171258 | Vrain | 1991-10-14 15:31:00 | 2019-02-07 06:00:00 |
00220012 | Water Level (WL) | 1993-05-13 14:09:00 | 2019-02-07 06:00:00 |
00220012 | Water Level (WL) - Gauge Height | 1993-05-13 14:09:00 | 2019-02-07 06:00:00 |
The resulting data frame can be joined with Oracle tables containing matching Site and Event fields.
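For example, if the Oracle table is reached through a DBI connection, the join might look like the following. This is a minimal sketch with placeholder names: con is an existing connection (created with, e.g., ROracle or odbc), and STATION_EVENTS, SITE, and EVENT are hypothetical table and column names.

library(DBI)

# Placeholder query: the table and column names are hypothetical
events <- dbGetQuery(con, "SELECT SITE, EVENT FROM STATION_EVENTS")

# Attach the period of record to the matching Oracle rows
por <- hy_sum %>%
  inner_join(events, by = c("Site" = "SITE", "Event" = "EVENT"))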