This is the documentation for the Water Quality Monitoring program for the Chesapeake Bay and its surrounding waterways. This project was done in collaboration with Maryland's Department of Natural Resources (DNR). The product contains both a GUI and an R script. For those interested in simply using the GUI, please read the GUI documentation. For those who wish to understand and edit the code, please see the code documentation.
The GUI was developed using the R statistical package. However, no installation of R is required to run the GUI: simply unpack the compressed folder, and it will contain an executable file for each GUI. The GUI has three modes:
Due to driver limitations, MDB Mode and Batch Mode are only available on Windows machines. To run the specified GUI, simply click the executable file which corresponds to the mode you wish to use. These are named “run2011.exe” (2011 Mode), “run.exe” (MDB Mode), and “runBatch.exe” (Batch Mode). On Mac and Linux distributions you will only see the respective “MacRun.command” and “LinuxRun.desktop” files, which give the appropriate system the double-click execution functionality to run the 2011 Mode GUI.
Once the GUI is opened you will see a standard interface. If MDB or Batch Mode is used, you will first be asked to select a Microsoft Access database. After this point, both GUIs enter the plotting mode. For those using MDB or Batch Mode, at this time enter the year to which the database corresponds, and then click refresh lists. This sets up the dates so that they can be dynamically chosen (dates cannot be dynamically chosen in 2011 Mode). At this point, choose the desired station, the desired plot, and (if in MDB or Batch Mode) the start and end dates. After these parameters have been set, click generate. The plot should generate and be placed in the window. Also, if in MDB or Batch Mode, the location to which the plot is saved will be shown in the terminal window. These plots are all saved in the “GUIplots” folder in the main directory of the program. For those using 2011 Mode, note that all of the plots are saved within the “plots” folder of the main directory of the program.
The GUI creates various types of plots depending on one's need. All of them can be accessed by using the “Plot Type” dropdown menu to select the desired plot type. Listed below is a brief description of each type:
The GUI is designed to be robust to errors. For example, the only dates given in the drop-down menu in MDB and 2011 Mode are dates for which the program is able to produce data. In Batch Mode, the dates the user can select from represent all of the possible choices given the set of all stations active in that year (in the dataset). The chosen dates thus act as a range to which each station adjusts. For example, the chosen start date is treated as a minimum, and for each station the GUI will find the closest available start date at or above that date. This allows users to select a large range of dates and have the plots automatically adjust to the range over which the station was active. Thus one may pick a range of dates such as 2/2 - 10/20, and if the station is only active from 5/3 - 9/30, the program will make the plots for 5/3 - 9/30. All such range adjustments are stated in the command line and thus also replicated in the command line's output file at “Output/BatchOutput.txt”.
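The range adjustment described above can be sketched as follows (the function and variable names here are illustrative, not the program's actual API):

```r
# Clamp a requested date range to a station's available dates
# (illustrative sketch; not the program's real code).
adjustRange <- function(requestedStart, requestedEnd, availableDates) {
  start <- min(availableDates[availableDates >= requestedStart])
  end   <- max(availableDates[availableDates <= requestedEnd])
  c(start, end)
}

active <- as.Date(c("2011-05-03", "2011-06-15", "2011-09-30"))
adjustRange(as.Date("2011-02-02"), as.Date("2011-10-20"), active)
# -> "2011-05-03" "2011-09-30"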
If your plot is not building in Batch Mode, some common causes of error are:
The default assumptions for the code are:
To change any of these assumptions, change the respective values in “src/assets/globals.R”.
The code is written using the standard Model View Controller (MVC) pattern with an object-oriented structure for the internal data. The only necessary interfaces between functions are primitives, arrays of primitives, and the objects specified in this documentation. To understand the code, one must understand the internal data structure and the control flow of the program.
To run the program, use one of the drivers located in the “drivers” directory. Each driver sets up the program by giving it a state (GUI means GUI mode, MDB means use the dynamic database, etc.). The program references all sources from the main directory; thus, to run the code, open R in the main directory. Then use the command source("drivers/pickDriver.R") to run pickDriver.R. For references on how the code works, please read the following material.
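A session might look like the following (the install path is hypothetical; the driver file name is from the text above):

```r
# Open R in the program's main directory, NOT in drivers/ (a common mistake):
# setwd("/path/to/program")    # hypothetical install location
basename(getwd())              # should be the main directory, not "drivers"
source("drivers/pickDriver.R")
```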
Before understanding the control flow of the program, it is necessary to explain the data structure. This data structure is a set of objects built to hold the required information for the needed statistical testing. The objects are the Station, Slice, Area, Bay, Whole, and World. A Station object is the entire analysis of a single given physical station for one parameter. This is not to be confused with the actual real-world station, referred to throughout the documentation as a “physical station,” which measures multiple parameters. From this Station object, the larger structures are built by containment. A group of Stations all measuring the same parameter is called a Slice, and a group of Slices is called a Whole. A group of Stations that all correspond to the same physical station but measure different parameters (which is akin to the actual physical station) is called an Area, and a group of Areas is called a Bay. The object which contains a Whole and a Bay is called a World, though no World is currently in use in the program, as it seems unnecessary to save that much data. Note that due to limitations of R, the methods and fields of these objects are not encapsulated. However, throughout the program we use the informal rule that no field of an object may be changed outside of its class file. Lastly, we have a Regime object, which extends Station and is to be used to aggregate a Slice (though it is currently not in use).
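The containment hierarchy can be illustrated with trimmed-down stand-in classes (the real classes carry many more slots; see their full definitions below):

```r
# Stand-in classes illustrating containment; not the program's real definitions.
setClass("StationSketch", representation(Name = "character", Parameter = "numeric"))
setClass("SliceSketch",   representation(StationsList = "list", Parameter = "numeric"))
setClass("WholeSketch",   representation(SliceList = "list"))

s1 <- new("StationSketch", Name = "ST1", Parameter = 1)
s2 <- new("StationSketch", Name = "ST2", Parameter = 1)  # same parameter
sl <- new("SliceSketch", StationsList = list(s1, s2), Parameter = 1)
wh <- new("WholeSketch", SliceList = list(sl))
length(wh@SliceList[[1]]@StationsList)  # 2
```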
Note that none of the methods and fields are encapsulated. However, as a general rule to increase readability, do not write methods that change the values of an object's fields outside of its class declaration. There should be no need, as these values are based directly on the data and thus should be created when the data is first gathered (or lazy-evaluated; no fields are currently lazy-evaluated, but this may be implemented for performance reasons).
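In S4, slots are readable (and technically writable) anywhere via the @ operator, so the rule above is purely a convention. A sketch, assuming a Station object st already built by its controller:

```r
# Reading a slot is fine anywhere in the program:
p <- st@PercentFail

# Writing a slot outside the class file violates the program's convention,
# even though R permits it:
# st@PercentFail <- 0.25   # avoid: only the class file should set fields
```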
Also notice that none of the object constructors should need to be interfaced with directly. They should be accessed through the appropriate controller methods; there is a controller for each object (except the World).
A Station defines the object of a single physical station's readings on one parameter.
setClass("Station", representation(FailVec = "matrix", Output = "vector",
Name = "character", RawData = "vector", PercentFail = "numeric", WilcoxonP = "numeric",
SE = "numeric", CI = "numeric", N = "numeric", RPD = "numeric", MarkedTS = "ts",
MarkedData = "matrix", Delimiter = "numeric", Parameter = "numeric", DataDelimited = "vector",
Salinity = "numeric", Regime = "character", StartDate = "character", EndDate = "character",
NumDays = "numeric", DaysVec = "matrix", Mean = "numeric", Median = "numeric",
Classification = "character", StationID = "character", StartIndex = "numeric",
EndIndex = "numeric", MannKendallP = "numeric", SensEstimator = "numeric",
Skew = "numeric"))
Most of the fields of the Station are self-explanatory. FailVec is the data vector of booleans corresponding to whether the given read was failing according to the DNR-set thresholds. Output is a vector that can be used to print a simple output. Name contains the character name. RawData contains the raw data only for the parameter corresponding to the station. PercentFail is the percent fail for the physical station on the given parameter. WilcoxonP is its p-value from the Wilcoxon test. SE is the standard error for the percent fail calculation. CI is the normal-approximated confidence interval around the percent failure. N is the total number of data reads. RPD is the number of reads per day. Delimiter defines the number of days to put into a single point for the cleaned graph; the data is aggregated using the AggregateData function (which by default is set to the mean). Parameter gives the number corresponding to the parameter the Station is measured on. DataDelimited is the data delimited by the AggregateData function. Salinity is a measure of the physical station's mean salinity during the reads. Mean is the mean value of the data. Median is the median value. StartDate and EndDate give the start and end date for the Station. NumDays gives the number of days the Station is calculated for. Classification gives the classification of the Station (that is, the classification of the physical station for the single parameter) as “Good”, “Borderline”, or “Bad” in accordance with the Wilcoxon Signed Rank test. DaysVec gives the data delimited to each day. StationID gives the physical station's ID. StartIndex and EndIndex give the appropriate indices for where the data was found in the table for the time being measured. MannKendallP gives the p-value for the Seasonal Mann-Kendall test conducted over the given time period. SensEstimator gives Sen's trend estimation; the calculation of this value is off by default. Skew gives the value of the skew for the Station.
A Slice is a group of Stations for the same parameter. It is used to analyze the rankings of physical stations against each other.
setClass("Slice", representation(StationsList = "list", Parameter = "numeric",
Length = "numeric"))
StationsList is the list of Stations in the Slice. Parameter is the parameter the Stations are all defined on. Length is the length of StationsList.
An Area is a group of Stations that correspond to the same physical station but measure different parameters. It is used for generating plots such as the scatter plots.
setClass("Area", representation(ParamsStationList = "list", Length = "numeric"))
ParamsStationList is the list of Stations in the Area. Length is the length of ParamsStationList.
A Whole is a group of Slices. It is used in the batch generation of plots.
setClass("Whole", representation(SliceList = "list"))
A Bay is a group of Areas. It is used in the batch generation of plots.
setClass("Bay", representation(AreaList = "list"))
A World is an object that contains a Whole and a Bay. It is currently not in use.
setClass("World", representation(Bay = "Bay", Whole = "Whole"))
A Regime is like a Slice, except that it is instead built by aggregating the physical stations' data and acts like a pseudo-Station. As such it extends Station and only adds a list of names (along with the values of the station). It is currently not in use.
setClass("Regime", representation(StationList = "character"), contains = "Station")
A Marker is used to create the axis for plots and split the plots by months. If the time frame is more than NUM_DAYS_PLOT_MONTHS days, then the markers will have a tick at each month with a label. Notice that these ticks are dynamic and will change according to the start date, end date, and differences in month sizes. If the time frame is less than NUM_DAYS_PLOT_MONTHS but more than NUM_DAYS_PLOT_WEEKS days, then the markers will be week-based. If the number of days is less than this, then the markers will denote days. By default, NUM_DAYS_PLOT_MONTHS is 50 days and NUM_DAYS_PLOT_WEEKS is 15 days.
setClass("Markers", representation(MarkAts = "vector", MarkLabels = "character"))
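The granularity rule above can be sketched as a simple function (the two constants are the documented defaults; the function itself is illustrative, not the program's code):

```r
NUM_DAYS_PLOT_MONTHS <- 50
NUM_DAYS_PLOT_WEEKS  <- 15

# Illustrative sketch of the marker-granularity rule described above.
markerUnit <- function(numDays) {
  if (numDays > NUM_DAYS_PLOT_MONTHS) "month"
  else if (numDays > NUM_DAYS_PLOT_WEEKS) "week"
  else "day"
}

markerUnit(90)  # "month"
markerUnit(30)  # "week"
markerUnit(10)  # "day"
```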
A Long-Term Station is used on the Corsica River physical station data only. It is redesigned for data covering many years.
setClass("LongTermStation", representation(Name = "character", RawData = "vector",
MonthMarkedTS = "ts", N = "numeric", MarkedTS = "ts", MarkedData = "matrix",
Parameter = "numeric", MannKendallP = "numeric", SensEstimator = "numeric",
RPY = "numeric"))
Most fields match the description of the Station. RPY stands for readings per year.
A Long-Term Slice is a Slice used on the Corsica River physical station data only. It is redesigned for data covering many years.
setClass("LongTermSlice", representation(StationsList = "list", Parameter = "numeric",
Length = "numeric"))
A Long-Term Area is an Area used on the Corsica River physical station data only. It is redesigned for data covering many years.
setClass("LongTermArea", representation(ParamsStationList = "list",
Length = "numeric"))
A Long-Term Marker is a marker redesigned to be used for multi-year data.
setClass("LongTermMarkers", representation(MarkAts = "vector", MarkLabels = "character"))
The program, as noted earlier, uses the MVC pattern. The only interfaces any of the functions have with the data are through the defined objects (and primitives), and the GUI interacts indirectly through the controller. The main control of the program reads as follows:
Note that if the GUI is not enabled, the program simply completes the controller operations (set by the Program State) and exits. If GUI mode is enabled, the first controller operation is bypassed. Note that there is a validation mode that can be activated for testing the code.
Note: All code references the working directory as the main directory of the program. If one opens R with the working directory of the program in the drivers folder (common mistake!) the program will not run correctly.
Main
The bin folder contains all of the binary files. The docs folder contains the documentation. The drivers folder contains all of the driver files to run the program (they must be run in R with the working directory set to main). The GUIplots folder contains all of the plots one makes with the GUI. Plots contains all of the 2011 plots that come with the program. The scripts folder contains all of the batch scripts for making the binary files. The src folder contains the source code. It has an assets folder for all of the helper functions, a classes folder for all of the classes (constructors are in the same file as the class declaration), validation holds all of the validation code, and views holds all of the code for the views.
The program state operation defines all of the state variables for the run of the program. This includes defining states for which mode is being run, all of the functions and objects used within the program, and all of the globals that pertain specifically to DNR's data (such as which column contains the date information).
The state variables are defined in the driver file. These are all within the “drivers” directory in the main directory. All executables are scripts which call a driver function.
Control of the program is then transferred to the main file, “main.R” in the src directory. Everything from this point on is in the src directory. The main file reads in the header file (“assets/header.R”) to receive the declarations for all objects and functions. Then the globals for DNR data are read from “assets/globals.R”.
At this point, the program diverges depending on which mode is selected. If the GUI is enabled, control is passed to the view (“views/mainView”), which builds the GUI and then waits for an event to call the controller. If the GUI is not enabled, the code that follows calls the controller directly.
The controller operations are all contained within “assets/control.R”. There is a control for every object but the World object. The control takes in the required arguments to build the given data structure and then uses flags to determine which of the appropriate outputs for that data structure to create. The building of the data structures is entirely contained within the appropriate object's class file within the “classes” directory.
All of the outputs are serialized. If they are generated from the GUI (thus in MDB or Batch Mode), any plots will be saved into the “GUIplots” folder. Otherwise they will be put into the “plots” folder, and some will generate statistical outputs determining rankings that can be used for reports (this is currently disabled for the GUI).
After the controller operations occur, the GUI selects the file generated by the controller and displays that image. If it is not in MDB or Batch Mode, the controller is never actually called and instead the plot is taken directly from the “plots” folder. Once displayed, the program runs to the end of the operation, where it is caught in a loop at the end of the driver file. Here the program waits for user interaction to move it to the event handler, or for the program to terminate. User interaction will either change the current view or invoke a controller operation, sending the program back through those steps to this same spot.
Included in the program is a set of validations. These are tests to ensure that the defined objects all build correctly and create the appropriate output given the controller. They run on a smaller subset of the data. To turn on validation, simply change the state of VALIDATION in the driver to true. All of the tests are controlled by the “validation/validation.R” file. By changing the states in that file, one can turn the validation for specific objects on and off.
The code is structured in a way that makes extending it simple. All one needs to do is add a new controller flag, add a new controller call, and provide the function for the controller to call.
To add a new plot, first examine the existing code as a guide. Determine whether your plot deals with a Slice or an Area object by noticing whether it deals with multiple physical stations of one parameter or multiple parameters on one physical station, respectively. Once decided, write your code taking the given object as input. Once this is completed, add a reference to your code in the “src/assets/header.R” file, which contains all of the headers. Once the code is being referenced, run the appropriate validations and use the R interpreter to check that your function is behaving as wanted. Next, go to the controller, “control.R”, and add a flag for the respective operation. Place your operation within a conditional keyed on that flag (with optional prints to inform the user of the progress of the program). Now any controller call that you add the “operation=TRUE” flag to will develop your plot / output / test statistic! Add this flag to the validation code for the higher-order structures (such as the Area or Whole) to make sure it is robust for different parameters and different stations.
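As a sketch, a hypothetical new Slice-based plot might look like this (the function name is invented; the slot names are from the class definitions above):

```r
# Hypothetical Slice-based plot: percent of failing reads per station.
plotFailRates <- function(slice) {
  fails <- sapply(slice@StationsList, function(st) st@PercentFail)
  names(fails) <- sapply(slice@StationsList, function(st) st@Name)
  barplot(fails, ylab = "Percent Fail",
          main = paste("Percent Failing Reads, Parameter", slice@Parameter))
}
```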
Once the feature has been added to the controller, one must simply set up a new plot type, add a new GUI state, and add the controller call to the event handler for the generate button. To set up the new plot type, add the string for your plot type to the PLOT_TYPES array in “src/globals.R”. For simplicity, add it to the end of the list. Now your plot type will show up in the dropdown box, but we need to add the functionality to it. To add in the state changing (i.e., make the GUI change what options it shows the user), add a new state to the plotTypeHander function in “src/assets/eventHandlers.R” that turns the visibility of the other components on and off. Use the other states as a template. Now add a function for your new plot type to the event handler. Follow the lead of the other code and have the function call your controller. Then change newImage to the plot file's location and reset the plot viewer.
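Registering the new plot type then amounts to a one-line change (“Fail Rate Bars” is a hypothetical name):

```r
# In the globals file: append the new type to the end of the existing list.
PLOT_TYPES <- c(PLOT_TYPES, "Fail Rate Bars")
```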
If you wish to add more features that deal with the data, for example single physical station test statistics such as the Sen's Estimator (which is implemented but not used by default for performance reasons), the best way to do this is to have the station calculate the data and keep the result as one of its fields. To do so, follow the lead of the rest of the code. The objects are defined as S4 objects. To add a new field, add the field name and type to the class declaration. To make the field used, add the initialization statement to the constructor of the class object.
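A minimal sketch of adding a field to an S4 class (the class and field here are stand-ins, not the real Station):

```r
# Add the new field to the class declaration...
setClass("StationSketch",
         representation(Name = "character",
                        Kurtosis = "numeric"))  # <- hypothetical new field

# ...and initialize it where the object is constructed:
st <- new("StationSketch", Name = "ST1", Kurtosis = 2.9)
st@Kurtosis  # 2.9
```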
Project page: http://www.umbc.edu/hpcreu/2011/projects/team1.html
Acknowledgments: These results were obtained as part of the REU Site: Interdisciplinary program in High Performance Computing (www.umbc.edu/hpcreu) in the Department of Mathematics and Statistics at the University of Maryland, Baltimore County (UMBC) in Summer 2012. This program is funded jointly by the National Science Foundation and the National Security Agency (NSF grant no. DMS-1156976), with additional support from UMBC, the Department of Mathematics and Statistics, the Center for Interdisciplinary Research and Consulting (CIRC), and the UMBC High Performance Computing Facility (HPCF).