Water Quality Monitoring - Chesapeake Bay and its Tidal Waterways

This is the documentation for the Water Quality Monitoring program for the Chesapeake Bay and its surrounding waterways. This project was done in collaboration with Maryland's Department of Natural Resources (DNR). The product contains both a GUI and an R script. For those interested in simply using the GUI, please read the GUI documentation. For those who wish to understand and edit the code, please see the code documentation.

GUI Documentation

Overview and Installation

The GUI was developed using the R statistical package. However, R does not need to be installed to run the GUI: simply unpack the compressed folder and run the executable file for the desired mode. The GUI has three modes: 2011 Mode, MDB Mode, and Batch Mode.

Due to driver limitations, MDB Mode and Batch Mode are only available on Windows machines. To run the GUI, click the executable file that corresponds to the mode you wish to use. These are named “run2011.exe” (2011 Mode), “run.exe” (MDB Mode), and “runBatch.exe” (Batch Mode). On Mac and Linux distributions you will see only the “MacRun.command” and “LinuxRun.desktop” files, respectively, which provide double-click execution of the 2011 Mode GUI on the corresponding system.

Using The GUI

Once the GUI is opened you will see a standard GUI window. If MDB or Batch Mode is used, you will first be asked to select a Microsoft Access database. After this point, the GUI enters the plotting mode regardless of which mode is used. For those using MDB or Batch Mode, enter the year to which the database corresponds and then click refresh lists. This sets up the dates so that they can be chosen dynamically (dates cannot be chosen dynamically in 2011 Mode). Next, choose the desired station, the desired plot, and (in MDB or Batch Mode) the start and end dates. After these parameters have been set, click generate. The plot will be generated and displayed in the window. In MDB or Batch Mode, the location to which the plot is saved is also shown in the terminal window; these plots are all saved in the “GUIplots” folder in the main directory of the program. In 2011 Mode, all of the plots are saved within the “plots” folder of the main directory of the program.

Types of Plots

The GUI creates various types of plots depending on one's need. All of them can be accessed by using the “Plot Type” dropdown menu to select the plot type one would wish to have. Listed below is a brief description of each type:

Error Handling Behavior

The GUI is designed to be robust to errors. For example, the only dates offered in the drop-down menu in MDB and 2011 Mode are dates for which the program is able to produce data. In Batch Mode, the dates the user can select from represent all of the possible choices across the set of all stations active in that year (in the dataset). The chosen dates therefore act as a range to which each station adjusts. For example, the start date chosen is the minimum start date, so for each station the GUI finds the closest available start date on or after that date. This allows users to select a large range of dates and have each plot automatically narrow to the range in which the station was active. For instance, if one picks the range 2/2 - 10/20 and a station is only active from 5/3 - 9/30, the plots for that station are made for 5/3 - 9/30. All such range adjustments are reported on the command line and are therefore also recorded in the command line's output file at “Output/BatchOutput.txt”.
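The range adjustment described above can be sketched in R as follows. This is only an illustration under stated assumptions, not the program's actual code: the function name adjustRange and its arguments are hypothetical, and the dates are assumed to be R Date objects.

```r
# Hypothetical helper (not from the program's source): narrow a requested
# date range to the dates on which a station actually has data.
adjustRange <- function(requestedStart, requestedEnd, stationDates) {
  # Keep only the station dates that fall inside the requested range
  inRange <- stationDates[stationDates >= requestedStart &
                          stationDates <= requestedEnd]
  if (length(inRange) == 0) {
    return(NULL)  # the station has no data in the requested range
  }
  # Closest available start on or after the requested start, and
  # closest available end on or before the requested end
  list(start = min(inRange), end = max(inRange))
}

# Example from the text: station active 5/3 - 9/30, requested 2/2 - 10/20
stationDates <- seq(as.Date("2011-05-03"), as.Date("2011-09-30"), by = "day")
adjusted <- adjustRange(as.Date("2011-02-02"), as.Date("2011-10-20"),
                        stationDates)
print(adjusted)
```

Returning NULL for an empty range mirrors the troubleshooting advice in this documentation: a time slot that is too narrow may contain no data points at all.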

Plot Not Building In Batch Mode? Common Sources Of Error

If your plot is not building in Batch Mode, some common causes of error are:

  1. The station may not be set to active in the database. Go into the database and check the stations table. Make sure active_####, where #### denotes the year, is set to active.
  2. Make sure that the name for the station is the same as the table name. A notable example of this error was StGeorges(Creek) in the 2009 database.
  3. Make sure that the station has some datapoints in that timeslot. If the timeslot is too narrow, there may simply be no datapoints for that station in that time, for a variety of reasons. Check the dataset to make sure there are actually datapoints in it.
  4. Make sure you chose the right database.

Default Behaviors

The default assumptions for the code are:

  1. A significance level (alpha) of 0.01 is used.
  2. The parameter columns reference the same columns as the 2011 data.
  3. It is not a leap year.
  4. The test values are the values given by the DNR in 2011.

To change any of these assumptions, change the respective values in “src/assets/globals.R”.
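As a rough sketch of how two of these defaults might be expressed and used, consider the fragment below. The variable names ALPHA and IS_LEAP_YEAR and the helper isSignificant are hypothetical; the actual identifiers in “src/assets/globals.R” may differ.

```r
# Hypothetical names; the real identifiers in src/assets/globals.R may differ.
ALPHA <- 0.01          # significance level used by the statistical tests
IS_LEAP_YEAR <- FALSE  # default assumption: the data year is not a leap year

# Number of days in the year implied by the leap-year setting
daysInYear <- if (IS_LEAP_YEAR) 366 else 365

# A test result is treated as significant when its p-value is below ALPHA
isSignificant <- function(pValue, alpha = ALPHA) {
  pValue < alpha
}

print(daysInYear)
print(isSignificant(0.005))
```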


Code Documentation

The code is written using the standard Model View Controller (MVC) pattern with an object-oriented structure for the internal data. The only necessary interfaces between functions are primitives, arrays of primitives, and the objects specified in this documentation. To understand the code, one must understand the internal data structure and the control flow of the program.

Running the Program

To run the program, use one of the drivers located in the “drivers” directory. Each driver sets up the program by giving it a state (GUI means GUI mode, MDB means use the dynamic database, etc.). The program references all sources relative to the main directory. Thus, to run the code, open R in the main directory and use the command source("drivers/pickDriver.R") to run pickDriver.R. For details on how the code works, please read the following material.
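Because every path is resolved relative to the main directory, it can help to check the working directory before sourcing a driver. The wrapper below is a minimal sketch of that idea; runDriver is a hypothetical helper and is not part of the program.

```r
# Hypothetical helper (not part of the program): source a driver only if
# it can be found under the "drivers" folder of the given main directory.
runDriver <- function(driverName, mainDir = getwd()) {
  driverPath <- file.path(mainDir, "drivers", driverName)
  if (!file.exists(driverPath)) {
    stop("Driver not found: ", driverPath,
         ". Is the working directory the program's main directory?")
  }
  # Run from the main directory so that relative paths resolve correctly,
  # then restore the previous working directory on exit
  oldDir <- setwd(mainDir)
  on.exit(setwd(oldDir))
  source(file.path("drivers", driverName))
  invisible(driverPath)
}
```

With R opened in the main directory, a call would look like runDriver("pickDriver.R"), which fails with a clear message if R was started in the wrong folder.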

The Internal Data Structure

Before describing the control flow of the program, it is necessary to explain the data structure. This data structure is a set of objects built to hold the information required for the statistical testing. The objects are the Station, Slice, Area, Bay, Whole, and World. A Station object is the entire analysis of a single physical station for one parameter. It is not to be confused with the actual station in the real world, which measures multiple parameters and is referred to throughout the documentation as the “physical station.” From the Station object, the larger structures are built by containment. A group of Stations all measuring the same parameter is called a Slice, and a group of Slices is called a Whole. A group of Stations that all correspond to the same physical station but measure different parameters (which is akin to the actual physical station) is called an Area, and a group of Areas is called a Bay. The object which contains a Whole and a Bay is called a World, though no World is currently used in the program, as it seems unnecessary to save that much data. Note that due to limitations of R, the methods and fields of the objects are not encapsulated. However, throughout the program we follow the informal rule that no field of an object may be changed outside of its class file. Lastly, a Regime object extends Station and is intended to aggregate a Slice (though it is currently not in use).

Note on Interfacing

As noted above, none of the methods and fields are encapsulated. As a general rule, to keep the program readable, do not write methods that change the values of fields outside of the class declaration. There should be no need to do so, as these values are based directly on the data and should therefore be created when the data is first gathered (or lazily evaluated; no fields are currently lazily evaluated, but this may be implemented if performance becomes a concern).

Also note that none of the object constructors should need to be called directly. They should be reached through the appropriate controller methods; there is a controller for each object except the World.
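The containment relationships can be illustrated with a small sketch. The classes below are simplified stand-ins with only a few fields (the Demo prefix marks them as hypothetical), and the station names and parameter values are made up for the example.

```r
library(methods)

# Simplified, hypothetical stand-ins for the real classes
setClass("DemoStation", representation(Name = "character",
    Parameter = "numeric"))
setClass("DemoSlice", representation(StationsList = "list",
    Parameter = "numeric", Length = "numeric"))
setClass("DemoArea", representation(ParamsStationList = "list",
    Length = "numeric"))

# One Station object per (physical station, parameter) pair
stations <- list(
  new("DemoStation", Name = "StationA", Parameter = 1),
  new("DemoStation", Name = "StationA", Parameter = 2),
  new("DemoStation", Name = "StationB", Parameter = 1)
)

# A Slice: all Stations that measure the same parameter (here, 1)
sameParam <- Filter(function(s) s@Parameter == 1, stations)
slice <- new("DemoSlice", StationsList = sameParam, Parameter = 1,
             Length = length(sameParam))

# An Area: all Stations that belong to the same physical station
samePlace <- Filter(function(s) s@Name == "StationA", stations)
area <- new("DemoArea", ParamsStationList = samePlace,
            Length = length(samePlace))

print(slice@Length)
print(area@Length)
```

In the program itself these objects would be built through the controllers rather than by calling new directly; the direct calls here only keep the sketch short.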

Station

A Station object holds the analysis of a single physical station's readings for one parameter.

setClass("Station", representation(FailVec = "matrix", Output = "vector", 
    Name = "character", RawData = "vector", PercentFail = "numeric", WilcoxonP = "numeric", 
    SE = "numeric", CI = "numeric", N = "numeric", RPD = "numeric", MarkedTS = "ts", 
    MarkedData = "matrix", Delimiter = "numeric", Parameter = "numeric", DataDelimited = "vector", 
    Salinity = "numeric", Regime = "character", StartDate = "character", EndDate = "character", 
    NumDays = "numeric", DaysVec = "matrix", Mean = "numeric", Median = "numeric", 
    Classification = "character", StationID = "character", StartIndex = "numeric", 
    EndIndex = "numeric", MannKendallP = "numeric", SensEstimator = "numeric", 
    Skew = "numeric"))

Most of the fields of the station are self-explanatory.

Slice

A Slice is a group of Stations for the same parameter. It is used to analyze the rankings of physical stations against each other.

setClass("Slice", representation(StationsList = "list", Parameter = "numeric", 
    Length = "numeric"))

Area

An Area is a group of Stations that correspond to the same physical station but measure different parameters. It is used for generating plots such as the scatter plots.

setClass("Area", representation(ParamsStationList = "list", Length = "numeric"))

Whole

A Whole is a group of Slices. It is used in the batch generation of plots.

setClass("Whole", representation(SliceList = "list"))

Bay

A Bay is a group of Areas. It is used in the batch generation of plots.

setClass("Bay", representation(AreaList = "list"))

World

A World is an object that contains a Whole and a Bay. It is currently not in use.

setClass("World", representation(Bay = "Bay", Whole = "Whole"))

Regime

A Regime is like a Slice except that it is built by aggregating the physical stations' data and acts like a pseudo-Station. As such it extends Station and adds only a list of station names to the fields of the Station. It is currently not in use.

setClass("Regime", representation(StationList = "character"), contains = "Station")

Markers

A Marker is used to create the axis for plots and split the plots by months. If the time frame is more than NUM_DAYS_PLOT_MONTHS, the markers will have a labeled tic at each month. Notice that these tics are dynamic and will change according to the start date, end date, and differences in month sizes. If the time frame is less than NUM_DAYS_PLOT_MONTHS but more than NUM_DAYS_PLOT_WEEKS, the markers will be week based. If the number of days is less than NUM_DAYS_PLOT_WEEKS, the markers will denote days. By default, NUM_DAYS_PLOT_MONTHS is 50 days and NUM_DAYS_PLOT_WEEKS is 15 days.

setClass("Markers", representation(MarkAts = "vector", MarkLabels = "character"))
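The choice of marker granularity can be sketched as follows. The helpers markerUnit and makeTicks are hypothetical and only approximate the behavior described above. In particular, counting the time frame inclusively, treating a time frame exactly equal to a threshold as the finer unit, and the label formats are all assumptions.

```r
# Default thresholds stated in the text
NUM_DAYS_PLOT_MONTHS <- 50
NUM_DAYS_PLOT_WEEKS <- 15

# Hypothetical helper: choose the tic unit for a time frame in days.
# A time frame exactly equal to a threshold falls to the finer unit.
markerUnit <- function(numDays) {
  if (numDays > NUM_DAYS_PLOT_MONTHS) {
    "month"
  } else if (numDays > NUM_DAYS_PLOT_WEEKS) {
    "week"
  } else {
    "day"
  }
}

# Hypothetical helper: tic positions (day index from the start date,
# starting at 1) and labels for the chosen unit.
makeTicks <- function(startDate, endDate) {
  numDays <- as.numeric(endDate - startDate) + 1
  unit <- markerUnit(numDays)
  if (unit == "month") {
    # One tic on the first day of each month inside the range
    firstOfMonth <- seq(as.Date(format(startDate, "%Y-%m-01")), endDate,
                        by = "month")
    tickDates <- firstOfMonth[firstOfMonth >= startDate]
    labels <- format(tickDates, "%b")
  } else {
    tickDates <- seq(startDate, endDate, by = unit)
    labels <- format(tickDates, "%m/%d")
  }
  list(MarkAts = as.numeric(tickDates - startDate) + 1, MarkLabels = labels)
}

monthTicks <- makeTicks(as.Date("2011-05-03"), as.Date("2011-09-30"))
weekTicks <- makeTicks(as.Date("2011-06-01"), as.Date("2011-06-30"))
print(monthTicks$MarkAts)
print(weekTicks$MarkAts)
```

Because the month tics are placed on actual calendar dates, their spacing follows the start date and the differing month lengths, which is the dynamic behavior the text describes.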

Long-Term Station

A Long-Term Station is a Station used on the Corsica River station data only. It is redesigned for data covering many years.

setClass("LongTermStation", representation(Name = "character", RawData = "vector", 
    MonthMarkedTS = "ts", N = "numeric", MarkedTS = "ts", MarkedData = "matrix", 
    Parameter = "numeric", MannKendallP = "numeric", SensEstimator = "numeric", 
    RPY = "numeric"))

Most fields match the description of the Station. RPY stands for readings per year.

Long-Term Slice

A Long-Term Slice is a Slice used on the Corsica River physical station data only. It is redesigned for data covering many years.

setClass("LongTermSlice", representation(StationsList = "list", Parameter = "numeric", 
    Length = "numeric"))

Long-Term Area

A Long-Term Area is an Area used on the Corsica River station data only. It is redesigned for data covering many years.

setClass("LongTermArea", representation(ParamsStationList = "list", 
    Length = "numeric"))

Long-Term Markers

A Long-Term Marker is a marker redesigned to be used for multi-year data.

setClass("LongTermMarkers", representation(MarkAts = "vector", MarkLabels = "character"))

Overview of the Architecture

The program, as noted earlier, uses the MVC pattern. Functions interface with the data only through the defined objects (and primitives), and the GUI interacts with the data only indirectly, through the controller. The main control flow of the program is as follows:

  1. Define Program State
  2. Controller Operation
  3. GUI Display Serialized
  4. GUI Wait For Event
  5. Upon Event, Go To Step 2
  6. Validation

Note that if the GUI is not enabled, the program simply completes the controller operations (set by the Program State) and exits. If GUI mode is enabled, the first controller operation is bypassed. Note also that there is a validation mode that can be activated for testing the code.
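The control flow above can be mimicked with a toy simulation. Everything in this sketch is hypothetical: runProgram only records the order of the steps, and the events argument stands in for user interaction with the GUI.

```r
# Hypothetical simulation of the control flow; it records which step runs
# instead of calling the real controller or view.
runProgram <- function(guiEnabled, events = list()) {
  steps <- "define program state"           # step 1
  if (!guiEnabled) {
    # Without the GUI: complete the controller operations and exit
    return(c(steps, "controller operation"))
  }
  # With the GUI: the first controller operation is bypassed
  steps <- c(steps, "GUI display")          # step 3
  for (event in events) {                   # step 4: wait for an event
    # step 5: each event sends the program back through step 2
    steps <- c(steps, paste("controller operation:", event), "GUI display")
  }
  steps
}

print(runProgram(guiEnabled = FALSE))
print(runProgram(guiEnabled = TRUE, events = list("generate")))
```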

Note: All code references the working directory as the main directory of the program. If one opens R with the working directory of the program in the drivers folder (common mistake!) the program will not run correctly.

Directory Tree

Main

The bin folder contains all of the binary files. The docs folder contains the documentation. The drivers folder contains all of the driver files used to run the program (they must be run in R with the working directory set to main). The GUIplots folder contains all of the plots one makes with the GUI. The plots folder contains all of the 2011 plots that come with the program. The scripts folder contains all of the batch scripts for making the binary files. The src folder contains the source code. It has an assets folder for all of the helper functions, a classes folder for all of the classes (constructors are in the same file as the class declaration), a validation folder for all of the validation code, and a views folder for all of the code for the views.

Defining Program State

The program state operation defines all of the state variables for the run of the program. This includes defining states for which mode is being run, all of the functions and objects used within the program, and all of the globals that pertain specifically to DNR's data (such as which column contains the date information).

The state variables are defined in the driver file. These are all within the “drivers” directory in the main directory. All executables are scripts which call a driver function.

Control of the program is then transferred to the main file, “main.R” in the src directory. Everything from this point on is in the src directory. The main file reads in the header file (“assets/header.R”) to receive the declarations for all objects and functions. Then the globals for DNR data are read from “assets/globals.R”.

At this point, the program diverges depending on which mode is selected. If the GUI is enabled, control is passed to the view (“views/mainView”), which builds the GUI. It then waits for an event to call the controller. If the GUI is not enabled, the code that follows calls the controller directly.

Controller Operation

The controller operations are all contained within “assets/control.R”. There is a control for every object but the World object. The control takes in the required arguments to build the given data structure and then uses flags to determine which of the appropriate outputs for that data structure to create. The building of the data structures is entirely contained within the appropriate object's class file within the “classes” directory.

All of the outputs are serialized. If they are generated from the GUI (thus in MDB or Batch Mode), any plots will be saved into the “GUIplots” folder. Otherwise they will be put into the “plots” folder, and some operations will also generate statistical outputs determining rankings that can be used for reports (this is currently disabled for the GUI).

GUI Operation

After the controller operations occur, the GUI selects the file that the controller generated and displays that image. If the GUI is not in MDB or Batch Mode, the controller is never actually called and the plot is instead taken directly from the “plots” folder. Once the image is displayed, the program runs to the end of the operation, where it is caught in a loop at the end of the driver file. Here the program waits for user interaction, which either sends the program to the event handler or terminates it. User interaction will either change the current view or trigger a controller operation, which sends the program back through the steps above and returns it to this waiting loop.

Validation

Included in the program is a set of validations. These are tests that ensure the defined objects all build correctly and create the appropriate output given the controller. They run on a smaller subset of the data. To turn on validation, change the state of VALIDATION in the driver to true. All of the tests are controlled by the “validation/validation.R” file. By changing the states in that file, one can turn validation on and off for specific objects.


Interfacing With the Code

The code is highly structured in a way that makes extending the existing code simple. All one needs to do is give a new controller flag, add a new controller call, and have the function for the controller to call.

Adding New Plots / Output / Multiple-Station Test Statistic

To add a new plot, first examine the existing code as a guide. Determine whether your plot deals with a Slice or an Area object by noticing whether it deals with multiple physical stations of one parameter or multiple parameters on one physical station, respectively. Once decided, write your code taking in as an input the given object. Once this is completed, add a reference to your code in the “src/assets/header.R” file which contains all of the headers. Once the code is being referenced, please run the appropriate validations and use the R interpreter to check to make sure that your function is behaving as wanted. Next, go to the controller, “control.R”, and add a flag for the respective operation. Place your operation within a conditional which encloses the operation (with optional prints to inform the user of the progression of the program). Now any controller that you add the “operation=TRUE” flag to will develop your plot / output / test statistic! Add this flag to the validation code for the higher order structures (such as the Area or Whole) to make sure it is robust for different parameters and different stations.
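The steps above might look roughly like the following sketch. All names here are hypothetical (SketchStation, SketchSlice, plotSliceMeans, sketchSliceControl, and the meansPlot flag); the real controller in “control.R” takes different arguments, and the output file name is made up for the example.

```r
library(methods)

# Simplified, hypothetical stand-ins for the Station and Slice classes
setClass("SketchStation", representation(Name = "character",
    Mean = "numeric"))
setClass("SketchSlice", representation(StationsList = "list",
    Parameter = "numeric", Length = "numeric"))

# The new output: a bar chart of station means, taking a Slice as input
plotSliceMeans <- function(slice, file) {
  means <- vapply(slice@StationsList, function(s) s@Mean, numeric(1))
  names(means) <- vapply(slice@StationsList, function(s) s@Name,
                         character(1))
  pdf(file)                 # write the plot to a file
  on.exit(dev.off())        # always close the graphics device
  barplot(means, main = paste("Parameter", slice@Parameter))
  invisible(file)
}

# Hypothetical controller: the new operation sits inside a conditional
# that is switched on by its own flag
sketchSliceControl <- function(slice, outDir, meansPlot = FALSE) {
  created <- NULL
  if (meansPlot) {
    print("Generating station means plot")
    created <- plotSliceMeans(slice, file.path(outDir, "means.pdf"))
  }
  created
}

demoSlice <- new("SketchSlice", Parameter = 1, Length = 2,
    StationsList = list(new("SketchStation", Name = "A", Mean = 1.5),
                        new("SketchStation", Name = "B", Mean = 2.5)))
plotFile <- sketchSliceControl(demoSlice, tempdir(), meansPlot = TRUE)
print(plotFile)
```

Keeping the flag's default at FALSE means existing controller calls are unaffected until they opt in to the new output.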

Adding New GUI Features

Once the feature has been added to the controller, one must simply set up a new plot type, add a new GUI state, and add the controller call to the event handler for the generate button. To set up the new plot type, add the string for your plot type to the PLOT_TYPES array in “src/assets/globals.R”. For simplicity, add it to the end of the list. Your plot type will now show up in the dropdown box, but it still needs functionality. To add the state change (i.e. make the GUI change which options it shows the user), add a new state to the plotTypeHander function in “src/assets/eventHandlers.R” that turns the visibility of the other components on and off. Use the other states as a template. Now add a function for your new plot type to the event handler. Follow the lead of the other code and have the function call your controller. Then change newImage to the plot file's location and reset the plot viewer.

Adding New Data Features

If you wish to add more features that deal with the data, for example single physical station test statistics such as the Sen's Estimator (which is implemented but not used by default for performance reasons), the best way to do this is to have the station calculate the data and keep the result as one of its fields. To do so, follow the lead of the rest of the code. The objects are defined as S4 objects. To add a new field, add the field name and type to the class declaration. To make the field used, add the initialization statement to the constructor of the class object.
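As a minimal sketch of this procedure, the example below adds a Range field to a simplified stand-in class. FieldDemoStation, makeFieldDemoStation, and the Range field itself are hypothetical; only Name, RawData, and Mean correspond to fields of the real Station class.

```r
library(methods)

# Simplified, hypothetical stand-in for Station with one new field, Range,
# added to the class declaration
setClass("FieldDemoStation", representation(Name = "character",
    RawData = "vector", Mean = "numeric", Range = "numeric"))

# Hypothetical constructor: every field is computed once, directly from
# the data, so nothing outside the class file needs to modify it later
makeFieldDemoStation <- function(name, rawData) {
  new("FieldDemoStation",
      Name = name,
      RawData = rawData,
      Mean = mean(rawData, na.rm = TRUE),
      # Initialization statement for the new field
      Range = diff(range(rawData, na.rm = TRUE)))
}

station <- makeFieldDemoStation("StationA", c(1, 4, NA, 7))
print(station@Range)
```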


Acknowledgments

These results were obtained as part of the REU Site: Interdisciplinary Program in High Performance Computing (www.umbc.edu/hpcreu) in the Department of Mathematics and Statistics at the University of Maryland, Baltimore County (UMBC) in Summer 2012. This program is funded jointly by the National Science Foundation and the National Security Agency (NSF grant no. DMS-1156976), with additional support from UMBC, the Department of Mathematics and Statistics, the Center for Interdisciplinary Research and Consulting (CIRC), and the UMBC High Performance Computing Facility (HPCF).