I was trying to get the period of record for data in the St. Johns River Water Management District’s Hydstra database, which houses hydrologic data. I found the report type HYREP PERIOD, which produces an output that contains the period of record information I needed. However, it was not in a convenient format to allow for easy joining to other Oracle-based tables.

This project steps through the data cleaning process I used to get the data into a usable format in R. I include brief snapshots (a few rows) of the data frames to illustrate each of the cleaning steps.

HYREP PERIOD data output

The HYREP PERIOD report produces an output that would be very good on a small scale (i.e., looking at only a few stations at a time).

With hundreds of stations, it made more sense to reformat this output to extract all of the relevant data into a table structure. There is a site identifier in the line above each block of data, and it is important that this information is linked to the data. Additionally, there are many rows that are not needed.

Importing the data

With unstructured data like this, the read_lines function from the readr package loads the file line by line. The text file contains many empty rows, so setting the skip_empty_rows argument to TRUE removes these rows. Wrapping read_lines in data.frame ensures that we start working with table-based data right away.

hy <- data.frame(V1 = read_lines("HY_POR_WATERCAT.txt", skip_empty_rows = T))
V1
HYREP.PERIOD V77 Output 02/07/2019
File 20014731.A - State Rd 16
Var Description Start End Period Missing
——- ———————————- —————- —————- ————– ——-
227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 11:00_07/10/2018 12.03 Years 0%
227.14 Elev. (feet NAVD88) Random Manual 12:01_09/22/1970 11:35_07/10/2018 47.80 Years 0%
232.10 Elev. (feet NGVD29) Continuous 11:00_06/30/2006 11:00_03/01/2010 3.67 Years 0%
232.14 Elev. (feet NGVD29) Random Manual 12:01_09/22/1970 11:14_03/01/2010 39.44 Years 0%
HYREP.PERIOD V77 Output 02/07/2019
File 20014731.X - State Rd 16

Cleaning the data

Remove unnecessary rows

There are a number of lines that do not contain any data of interest. These lines include the strings “–”, “HYREP.PERIOD”, “Var”, “dup/”, or “Index”. The grepl function searches for any of these strings, and the filter function in the dplyr package allows us to omit any of these lines. In addition, the str_trim function in the stringr package strips whitespace from the start and end of each line.

hy <- hy %>% 
    filter(!grepl("--|HYREP.PERIOD|Var|dup/|Index", V1)) %>% 
    mutate(V1 = str_trim(V1))
V1
File 20014731.A - State Rd 16
227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 11:00_07/10/2018 12.03 Years 0%
227.14 Elev. (feet NAVD88) Random Manual 12:01_09/22/1970 11:35_07/10/2018 47.80 Years 0%
232.10 Elev. (feet NGVD29) Continuous 11:00_06/30/2006 11:00_03/01/2010 3.67 Years 0%
232.14 Elev. (feet NGVD29) Random Manual 12:01_09/22/1970 11:14_03/01/2010 39.44 Years 0%
File 20014731.X - State Rd 16
227.10 Elev. (feet NAVD88) Continuous 00:00_01/01/2010 06:00_02/07/2019 9.10 Years 0%
File 20014731.H - State Rd 16
227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 02:00_09/13/2016 10.21 Years 0%
File 20014731.W - State Rd 16

Create a Site column

The structure of the data is now a line with information about the site followed by one or more lines of data. The site designator is an eight digit number included in each line above the data blocks. The str_extract function from stringr can be used to extract those eight consecutive digits using a regular expression. The fill function from the tidyr package fills down the values in the Site column so that each data row is associated with the correct Site. Once the Site variable is extracted, the rows containing the string “File” can be omitted.

hy <- hy %>% 
    mutate(Site = str_extract(V1, "\\d{8}")) %>% 
    select(Site, V1) %>% 
    fill(Site) %>% 
    filter(!grepl("File", V1))
Site V1
20014731 227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 11:00_07/10/2018 12.03 Years 0%
20014731 227.14 Elev. (feet NAVD88) Random Manual 12:01_09/22/1970 11:35_07/10/2018 47.80 Years 0%
20014731 232.10 Elev. (feet NGVD29) Continuous 11:00_06/30/2006 11:00_03/01/2010 3.67 Years 0%
20014731 232.14 Elev. (feet NGVD29) Random Manual 12:01_09/22/1970 11:14_03/01/2010 39.44 Years 0%
20014731 227.10 Elev. (feet NAVD88) Continuous 00:00_01/01/2010 06:00_02/07/2019 9.10 Years 0%
20014731 227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 02:00_09/13/2016 10.21 Years 0%

Create additional columns

Additional columns can be created by extracting pieces of the V1 column using str_extract and regular expressions. Only some of the variables and subvariables needed to be retained, so these can be filtered out.

hy <- hy %>% 
    mutate(Variable = str_extract(V1, "^\\d+\\.\\d+"),
           StartDate = as.POSIXct(str_extract(V1, "\\d{2}\\:\\d{2}\\_\\d{2}/\\d{2}/\\d{4}"),
                                  format = "%H:%M_%m/%d/%Y", tz = "EST"),
           EndDate = as.POSIXct(str_sub(str_extract(V1, "\\d{2}\\:\\d{2}\\_\\d{2}/\\d{2}/\\d{4}.*?(\\d{2}\\:\\d{2}\\_\\d{2}/\\d{2}/\\d{4})"), -16),
                                format = "%H:%M_%m/%d/%Y", tz = "EST"),
           Var = as.numeric(str_extract(Variable, ".*(?=\\.)")),
           SubVar = as.numeric(str_extract(Variable, "(\\.[^\\.]+)$"))) %>% 
    filter(SubVar %in% c(0.1, 0.14), !(Var %in% c(104, 229, 252, 300, 551, 2327))) %>% 
    select(Site, Variable, Var, SubVar, StartDate, EndDate, V1)
Site Variable Var SubVar StartDate EndDate V1
20014731 227.10 227 0.1 2006-06-30 11:00:00 2018-07-10 11:00:00 227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 11:00_07/10/2018 12.03 Years 0%
20014731 227.14 227 0.14 1970-09-22 12:01:00 2018-07-10 11:35:00 227.14 Elev. (feet NAVD88) Random Manual 12:01_09/22/1970 11:35_07/10/2018 47.80 Years 0%
20014731 232.10 232 0.1 2006-06-30 11:00:00 2010-03-01 11:00:00 232.10 Elev. (feet NGVD29) Continuous 11:00_06/30/2006 11:00_03/01/2010 3.67 Years 0%
20014731 232.14 232 0.14 1970-09-22 12:01:00 2010-03-01 11:14:00 232.14 Elev. (feet NGVD29) Random Manual 12:01_09/22/1970 11:14_03/01/2010 39.44 Years 0%
20014731 227.10 227 0.1 2010-01-01 2019-02-07 06:00:00 227.10 Elev. (feet NAVD88) Continuous 00:00_01/01/2010 06:00_02/07/2019 9.10 Years 0%
20014731 227.10 227 0.1 2006-06-30 11:00:00 2016-09-13 02:00:00 227.10 Elev. (feet NAVD88) Continuous 11:00_06/30/2006 02:00_09/13/2016 10.21 Years 0%

Summarize the data

Group rows

I was interested in the period of record for each variable and site combination, so the group_by and summarize functions from dplyr can be used to extract the earliest start date and the latest end date.

hy_sum <- hy %>% 
    group_by(Site, Var) %>% 
    summarize(Start = min(StartDate), End = max(EndDate))
Site Var Start End
00100109 11 1990-08-03 16:15:00 2019-02-07 06:00:00
00171258 11 1991-10-14 15:31:00 2019-02-07 06:00:00
00220012 227 1993-05-13 14:09:00 2018-08-03 08:56:00
00220012 232 1993-05-13 14:09:00 2010-04-13 12:05:00
00220012 233 1993-05-13 14:09:00 2019-02-07 06:00:00
00220012 262 1993-05-13 14:09:00 2018-09-26 09:14:00

Add Event column

The database I needed to join to these data lists variables as text-based Event Types, as shown in this crosswalk table.

var_xwalk <- read_csv("Event_Var_Xwalk.csv", col_types = "cd")
Event Type Var
Rain (RN) 11
Vrain 11
Water Level (WL) 227
Water Level (WL) 232
Water Level (WL) - Gauge Height 233
Water Level (WL) 233
Velocity 237
Discharge 262

The crosswalk table can be used to create an Event column that holds the Event Types. Because there are multiple possible event types for each variable, the first step is to include all possible event types in the Event column.

hy_sum <- hy_sum %>% 
    mutate(Event = paste(as.character(unique(var_xwalk$`Event Type`[var_xwalk$Var == Var])),
                         collapse = ";", sep = " "))
Site Var Start End Event
00100109 11 1990-08-03 16:15:00 2019-02-07 06:00:00 Rain (RN);Vrain
00171258 11 1991-10-14 15:31:00 2019-02-07 06:00:00 Rain (RN);Vrain
00220012 227 1993-05-13 14:09:00 2018-08-03 08:56:00 Water Level (WL);Water Level (WL) - Gauge Height
00220012 232 1993-05-13 14:09:00 2010-04-13 12:05:00 Water Level (WL);Water Level (WL) - Gauge Height
00220012 233 1993-05-13 14:09:00 2019-02-07 06:00:00 Water Level (WL);Water Level (WL) - Gauge Height
00220012 262 1993-05-13 14:09:00 2018-09-26 09:14:00 Water Level (WL);Water Level (WL) - Gauge Height

When multiple event types are present, these can be split apart using the base strsplit function, and then each separate event type can be in its own row using the unnest function from tidyr.

hy_sum <- hy_sum %>% 
    mutate(Event = strsplit(as.character(Event), ";")) %>%
    unnest(Event)
Site Var Start End Event
00100109 11 1990-08-03 16:15:00 2019-02-07 06:00:00 Rain (RN)
00100109 11 1990-08-03 16:15:00 2019-02-07 06:00:00 Vrain
00171258 11 1991-10-14 15:31:00 2019-02-07 06:00:00 Rain (RN)
00171258 11 1991-10-14 15:31:00 2019-02-07 06:00:00 Vrain
00220012 227 1993-05-13 14:09:00 2018-08-03 08:56:00 Water Level (WL)
00220012 227 1993-05-13 14:09:00 2018-08-03 08:56:00 Water Level (WL) - Gauge Height

Group rows

Finally, I can summarize over each Site and Event combination to get the period of record of interest.

hy_sum <- hy_sum %>% 
    group_by(Site, Event) %>% 
    summarize(Start = min(Start), End = max(End))
Site Event Start End
00100109 Rain (RN) 1990-08-03 16:15:00 2019-02-07 06:00:00
00100109 Vrain 1990-08-03 16:15:00 2019-02-07 06:00:00
00171258 Rain (RN) 1991-10-14 15:31:00 2019-02-07 06:00:00
00171258 Vrain 1991-10-14 15:31:00 2019-02-07 06:00:00
00220012 Water Level (WL) 1993-05-13 14:09:00 2019-02-07 06:00:00
00220012 Water Level (WL) - Gauge Height 1993-05-13 14:09:00 2019-02-07 06:00:00

The resulting data frame can be joined with Oracle tables containing matching Site and Event fields.