% Darwin Harbour Sediment Sampling Design - Middle Arm
% Murray Logan
% 22-03-2019
Synopsis
==========
The purpose of this work was to inform the spatial component of the
Darwin Harbour Sediment Monitoring sampling design. In particular, the
aim was to determine the 'best' locations for 100 sediment monitoring
sites in the Middle Arm (including West Arm) sections of Darwin Harbour.
To help inform this process, there were two broad sources of data
available:
1. Hydrodynamic modelling of the entire Darwin Harbour. These data
(provided as GeoTIFFs) broadly comprise tidal bed shear and velocity
as well as wave driven forces at 10 m resolution, and will be used to
isolate areas likely to experience deposition (rather than erosion) of
sediments.
2. Munksgaard sediment chemical survey from 2012 provided by Lynda
Radke (as an Excel workbook). These data provided background
information that was used to predict the full spatial distribution
of a range of sediment chemicals. These spatial distributions
then helped tune and evaluate a range of sampling designs.
The basic procedure involved the following steps:
1. Read in and process the data sources.
2. Fit a barrier spatial model to each of the Munksgaard sediment
chemicals and predict/develop spatial layers for the Middle Arm section.
3. Develop masks out of the hydrodynamic model data and use them to
exclude areas of likely erosion from the chemical spatial layers.
4. Use spatial layers representing shipping channels, ports and other
exclusion zones to establish additional masks to apply alongside
hydrodynamic modelling masks to further restrict the sampling domains
and prevent sampling configurations containing sites in the exclusion
zones.
5. Explore five different sample generation routines for a range of
sample sizes to establish an optimal sampling design. The five
routines will be:
a) Using the masked chemical spatial layers to inform Conditioned Latin Hypercube Sampling - this will
generate samples of nominated sizes that are located in a manner that best represents the underlying patterns
in the chemical spatial layers.
b) Completely random sampling - this will select the nominated number of samples from within the masked area
and is completely naive to any underlying spatial patterns (and hence is only likely to be representative of the underlying patterns when
the number of samples is large).
c) A regular sampling grid - this will select approximately the nominated number of samples configured in a
regular grid within the masked area. Like the completely random sampling, the regular sampling grid is completely
naive to the underlying spatial patterns, yet it does guarantee a more even spatial coverage.
d) A spatially balanced design - this will yield a spatially balanced design in which sampling sites are spread out throughout the
spatial domain.
e) A high dimensional spatially balanced design - this will yield a design in which sampling sites are spread in multiple
dimensions (spatial and according to patterns in the underlying chemical distributions).
In addition to the 100 long-term monitoring sites, there are 20
designated sites. These sites are to be sampled more regularly and are
for the purpose of compliance monitoring in specific areas. Although
these sites are additional to the long-term samples, they do form part
of the overall design and thus need to be accounted for when
evaluating candidate configurations.
All code for this project is available on github <https://github.com/AIMS/darwin-harbour-sampling.git>
Data processing
=================
## GIS data
A shapefile of Darwin Harbour (see Figure @fig:Map1) was utilized in
order to define the initial sampling domain(s). This project focused
on the Middle Arm. For the purpose of the sediment
monitoring program, Middle Arm was defined as Middle Arm, West Arm and Blackmore River.
{#fig:Map1 width=70%}
<!-- the above figure is generated in DHS_readData_MA.R -->
## Munksgaard 2012 chemical sediment data
Munksgaard 2012 chemical sediment data were provided by Lynda Radke in
the form of an Excel workbook. These data were consolidated
into a single flat csv text file to support analyses. The spatial
configuration of the Munksgaard sediment sampling sites is
illustrated in Figure @fig:Map2 (circular points). Primarily only
the sites within the Middle Arm will be used to inform
the current exploration of future sampling
designs.
## Designated sampling sites
In addition to the long-term monitoring sites, a number of more
regularly sampled designated (priority) sites were provided by Lynda Radke in the
form of an Excel workbook. These sites are additional, yet they
needed to be considered in the formulation of site configurations
and are illustrated in Figure @fig:Map2 as red points. Note: none of
the designated sites fall in the defined Middle Arm
region and thus they were not considered further.
{#fig:Map2 width=80%}
<!-- the above figure is generated in DHS_readData.R -->
\newpage
## Hydrodynamic and wave modelling
When designing a sediment monitoring program, it is important to
consider the erosive, transportation and deposition forces operating
on the seabed. Ideally all, or at least most, of the sampling sites should
be located in areas that are more likely to experience sediment
deposition than erosion or transportation.
Various hydrodynamic and wave modelling products were made available
by Dr. Richard Brinkman (AIMS) that provide estimates of erosive,
transportation and deposition likelihoods and include:
- current velocity (50th and 75th percentiles) - these were calculated
from a 30 day characteristic spring-neap tidal cycle. Higher
velocity equates to higher likelihood of sediment transport and
erosion and thus lower probability of sediment deposition.
- seabed shear stress (50th and 75th percentiles) - these were derived
from the current velocity and are a measure of the shear forces
likely to be experienced on the sea bed due to tidal movement.
Higher bed shear stress equates to higher likelihood of sediment
transport and erosion and thus lower probability of sediment
deposition.
- wave derived orbital velocity magnitude at seabed - a shallow water
wave model was applied using a 10 m/s wind forcing from a range of
directions (0, 90, 140, 270 and 315 degrees) to simulate the likely
orbital velocity experienced by the seabed. Higher orbital velocity
equates to a higher likelihood of sediment transport and erosion and
thus lower probability of sediment deposition.
- wave derived seabed shear stress - the same shallow water wave model
was expressed in terms of wave bed shear stress. Higher shear stress
equates to higher likelihood of sediment transport and erosion and
thus lower probability of sediment deposition.
Visual representations of each of the above products within the Middle Arm focal area are depicted in Figures
@fig:hydro_middle_arm -- @fig:waves_middle_arm.
{#fig:hydro_middle_arm width=100%}
{#fig:waves_middle_arm width=100%}
<!-- the above figures are generated in DHS_readData_MA.R -->
### Masks
The seabed shear stress products provided spatial modelling of the
expected forces acting on the sea bed during a typical spring-neap
tidal cycle. In so doing, they provided proxies for the likelihood
for sediment erosion, transportation and deposition. These products
were used to create masks that exclude areas of high erosive or
transportation potential.
To establish a mask (focusing only on deposition areas), thresholds
needed to be established for the critical values below which
deposition is likely. Figures @fig:Hjulstroms and @fig:shearStress
provide these for a range of sediment particle sizes, for water
velocity and seabed shear stress respectively.
{#fig:Hjulstroms width=100%}
<!-- the above figure came from Hemerson -->
{#fig:shearStress width=60%}
#### Middle Arm sampling domain
Figure @fig:particle_sizes_MA illustrates the distribution of percentage
abundance of a range of particle size classes from the
Munksgaard sampling program. On average, particles in the size
classes 4-62$\mu$m (silt) and 62-250$\mu$m (fine sand) made up 25.4
and 43.4 percent of the sediment samples respectively. Hence, it was
important that future Middle Arm sampling designs focus on sites that
will ensure deposition of particles of these sizes. According to
Figures @fig:Hjulstroms and @fig:shearStress, particles at or above
29$\mu$m (middle of the silt range) correspond to critical deposition
values of approximately 0.2 m/s velocity and 0.1 seabed shear stress.
The hydrodynamic modelling seabed shear stress products represent the
50th and 75th percentile values. In the case of the 50th percentile,
this means that 50 percent of the time, seabed shear stress was
estimated to be above this value and 50 percent of the time below.
If a mask was based on setting the threshold to correspond to the 50th
percentile, then the masked layer represents areas where sediment
deposition is likely to occur more regularly than erosion and
transport.
Nevertheless, it was also important to establish the distribution of
seabed shear stress values across the seabed. Figures @fig:hist_hydro_middle_arm and
@fig:hist_waves_middle_arm illustrate the frequency of seabed shear stresses
(and velocity) for the Middle Arm area. The 50th percentiles for seabed
shear stress appear to drop off from the peak at around 0.2 m/s, hence this
appears to be a sensible threshold value.
{#fig:particle_sizes_MA width=80%}
<!-- the above figure is generated in DHS_readData.R -->
{#fig:hist_hydro_middle_arm width=80%}
{#fig:hist_waves_middle_arm width=80%}
<!-- the above two figures were generated in DHS_thresholdLayers_MA.R -->
Both current velocity and seabed shear stress were derived from the
same model and were thus closely correlated. Similarly, both wave
derived orbital velocity and wave derived seabed shear stress were
correlated. In each case, the shear stress proxies are intended to be
expressions of the forces that are likely to be acting on the seabed
(from tides and waves respectively). Hence only the seabed shear
stress versions of the hydrodynamic and wave models were used as the
proxy estimates of tidal and wave forces.
Figure @fig:masks_MA illustrates the resulting masks for the Middle
Arm area using a threshold of 0.2 m/s (or 0.3 m/s for the 75th
percentiles). The wave derived seabed shear masks were added
together and then joined with the 50th percentile hydrodynamic seabed
shear mask. The resulting mask is illustrated in Figure @fig:mask_middle_arm.
This mask represents (in blue shading) the areas most
likely to experience more deposition than erosion and thus suitable
areas for sediment monitoring. When we compared this mask to the
spatial extent of the Munksgaard 2012 sediment monitoring site
configuration, it was evident that while they broadly overlap, there
are potentially suitable areas that were not sampled in the Munksgaard
monitoring program. Moreover, there may also have been areas that were
sampled that may not have been net deposition areas.
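The following sketch illustrates how such threshold masks might be
constructed and combined with the `raster` package. The file names
are illustrative assumptions, and the wave layers are combined here
via their cell-wise maximum (equivalent in spirit to adding the
individual masks) rather than via the project's actual code.

```r
library(raster)

## Illustrative file names only - the products were supplied as GeoTIFFs
bedshear_p50 <- raster("bedshear_p50.tif")                        # tidal seabed shear (50th percentile)
wave_shear <- stack(list.files(pattern = "waveshear_.*\\.tif$"))  # one layer per wind direction

## Cells below the threshold are candidate deposition areas (1); others NA
threshold <- 0.2
tidal_mask <- calc(bedshear_p50, function(x) ifelse(x < threshold, 1, NA))

## Worst-case wave shear across wind directions, then threshold
wave_max  <- max(wave_shear)
wave_mask <- calc(wave_max, function(x) ifelse(x < threshold, 1, NA))

## Join the two masks - a cell survives only if it passes both
deposition_mask <- tidal_mask * wave_mask
```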
<!--
Alarmingly,
most of the area is masked out on the basis of hydrodynamically derived seabed shear stress (bedshear* and velocity*), and what remains is virtually all then
masked out by the wave model derived seabed shear stresses. Consultation about the appropriate threshold values will be required.
-->
{#fig:masks_MA width=90%}
{#fig:mask_middle_arm width=80%}
## Exclusion zone masks
In addition to using hydrodynamic modelling masks to exclude areas
that might be considered unsuitable for sediment monitoring, there are
areas throughout Darwin Harbour that are not practical or appropriate
for monitoring. These include areas with a high probability of hard
seabed [@Siwabessy-2019], shipping channel and port exclusion
zones, as well as other exclusion zones associated with major
infrastructure or military bases. Hence additional masks were
developed from spatial
layers provided by Lynda Radke.
Figure @fig:Map3 illustrates the sampling masks
associated with Middle Arm.
{#fig:Map3 width=70%}
\clearpage
Spatial Model fitting
===============
A target or set of targets is required against which the effectiveness
and accuracy of candidate sampling designs can be tuned or gauged.
This target should represent the full underlying conditions and in
essence represent a saturated sampling design - a sampling design in
which all possible locations/sites are sampled. Whilst this is
logistically not possible, given an adequate set of baseline data,
statistical spatial models can be generated to estimate the underlying
patterns. The resulting predicted layers can be used to represent the
targets.
Spatial models are complex statistical models that attempt to recreate
the full feature space from which a limited set of samples were
collected. In so doing, they attempt to incorporate two-dimensional
patterns and correlations to allow prediction to areas in between
samples.
In the simplest cases, simple surfaces can be derived by linear
interpolation between all the sampling points. However, when samples
are distributed unevenly, there are strong spatial dependencies and/or
the bounding domain is not a simple rectangle, more complex
methodologies are required.
Ecological and environmental processes are often correlated through
space. To account for these spatial dependencies within a spatial
model it is useful to incorporate a Gaussian Random Field (GRF) which
specifies a spatially dependent covariance structure in which
locations that are closer to one another in space will in general be
more highly correlated to one another than locations that are further
apart.
Large or complex spatial models soon become intractable within a
traditional frequentist modelling framework. Similarly, equivalent
fully Bayesian models are typically very computationally expensive.
Integrated Nested Laplace Approximation [INLA: @Rue-2009-319] is a
Bayesian approximation framework that offers the philosophical
advantages of a full Bayesian approach, yet with the computational
efficiency of frequentist approaches.
We can consider a GRF to be stationary if the degree of correlation
between two points is dependent only on the distance between the
points, or non-stationary if we allow the correlation function to vary
over the landscape. An extreme form of non-stationary model occurs
when there are physical barriers that disrupt or block the flow of
contagious processes. In such cases, the fact that two locations are
in close proximity does not necessarily mean that they will be highly
correlated. Consider a simple example of the diffusion of a dye in
water throughout a tank. The dye will spread out from the source and
gradually disperse throughout the tank. Consequently, the correlation
between the concentration of dye at any two locations during
dispersion will be dependent on the distance between the two
locations. If however, the tank had a partial barrier that restricted
the flow of dye molecules, then two locations either side of the
barrier might have very different dye concentrations despite being in
close proximity. Barrier models are able to account for these
obstructions.
$$
\begin{align}
y_i &\sim{} \Gamma(\mu_i, \alpha)\\
\log(\mu_i) &= \beta_0 + u(s_i) + \varepsilon_i\\[1em]
\beta_0 &\sim{} N(0,1000)\\
u(s_i) &\sim{} GRF(r, \sigma_u)\\
\varepsilon_i &\sim{} N(0, \sigma_e^2)\\[1em]
\pi(\sigma_e) &\sim{} \lambda_e e^{-\lambda_e \sigma_e}\\
\pi(\sigma_u) &\sim{} \lambda_0 e^{-\lambda_0 \sigma_u}\\
\pi\left(\frac{1}{r}\right) &\sim{} \lambda_1 e^{-\lambda_1 \frac{1}{r}}
\end{align}
$$
where $y_i$ is the $i$th observation of the target chemical variable,
$\varepsilon_i$ is an independent random effect used to model the very
short range erratic dependencies and $u(s_i)$ is the Gaussian Random
Field used to model the long-range structural (autocorrelation)
dependencies. A diffuse prior is applied to the intercept ($\beta_0$)
and $\varepsilon_i$ is a vector of independent Gaussians. The Matern
family spatial random effect ($u(s_i)$) has a covariance defined by
two parameters: the range ($r$: representing the length scale of the
spatial dependencies) and the standard deviation ($\sigma_u$:
representing the ratio of spatial to independent noise). The smaller
the range, the lower the correlation between two proximal locations.
In a barrier model, two different range parameters ($r$) are applied.
One of the range parameters is applied to the barrier area (in this
case land) and the other to the focal area (in this case the
Harbour). By setting the barrier range much smaller than the focal
area range (and close to zero), the dependency structure across
barriers is disrupted and approaches zero.
Hyper-parameter priors for the random effects ($\sigma_e$, $\sigma_u$)
as well as $r$ are defined in terms of a $\lambda$ parameter which is
determined by the scale of the response (on the log scale in this
case). Both $\lambda_e$ and $\lambda_0$ were set to the median value
of the target response (on a log scale), and the focal area $r$ was
set to half the width of the spatial domain, following
@Bakka-2016-1608.
The random field was approximated via a Constrained Refined Delaunay
Triangulation (CRDT) mesh. The mesh comprised an inner region that
surrounded all the Munksgaard sediment monitoring sites as well as an
outer region that provides a buffer to help ensure estimates close to
the inner mesh boundary are robust. The maximum permissible triangle
edge length for the inner and outer mesh regions was set to 0.01 and
0.04 (units of the latlong projection) respectively; the smaller the
values, the finer the mesh. This mesh was then projected onto the
locations of the observed samples.
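To make this specification concrete, the following is a minimal
sketch of how such a barrier model might be fit with the `INLA`
package. The data objects (`coords`, `y`, `barrier.tri`) and the PC
prior values are illustrative assumptions, not the project's actual
code.

```r
library(INLA)

## `coords` (sample locations), `y` (one chemical's concentrations) and
## `barrier.tri` (indices of mesh triangles falling on land) are assumed
## to have been prepared already; names are illustrative only.
mesh <- inla.mesh.2d(loc = coords, max.edge = c(0.01, 0.04))  # fine inner, coarser outer mesh

## PC-prior parameterised barrier (Matern) field after Bakka et al.
## The prior values shown are placeholders.
barrier.model <- inla.barrier.pcmatern(mesh, barrier.triangles = barrier.tri,
                                       prior.range = c(0.5, 0.5),
                                       prior.sigma = c(1, 0.5))

## Project the mesh onto the observed sample locations
A <- inla.spde.make.A(mesh, loc = coords)

stk <- inla.stack(data = list(y = y),
                  A = list(A, 1),
                  effects = list(s = 1:mesh$n, intercept = rep(1, length(y))))

fit <- inla(y ~ 0 + intercept + f(s, model = barrier.model),
            data = inla.stack.data(stk),
            control.predictor = list(A = inla.stack.A(stk)),
            family = "gamma")
```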
The above models were fit for each of the sediment chemicals recorded
in the Munksgaard 2012 sediment sampling program. Figure
@fig:inla_model_middle.arm_Zn provides an example of the major elements
of one of the chemical (Zinc) spatial models. Equivalent figures for
the other chemicals are presented in Appendix A.
Figure {@fig:inla_model_middle.arm_Zn}a depicts the random field mesh
with the Munksgaard 2012 sampling sites and Harbour boundary overlaid.
Figure {@fig:inla_model_middle.arm_Zn}b illustrates the boundary used
for the barrier in the spatial model and Figure
{@fig:inla_model_middle.arm_Zn}c illustrates the final predicted spatial
layer (with original sample data overlaid) for Zinc within the Middle
Arm area. For comparison, both predicted (model output) and observed
(Munksgaard samples) are presented on the same colour scale. Figure
{@fig:inla_model_middle.arm_Zn}d illustrates the final predicted spatial
layer for Zinc within the Middle Arm area, where the colour scale is
based purely on the range of predicted values.
{#fig:inla_model_middle.arm_Zn width=80%}
Although the shrinkage in models is a little high (there is a tendency
for the magnitude of change over space to be dampened), generally the
models do a very good job of depicting the relative patterns of change
in space. For this application, the absolute scale of the changes in
patterns are not important (only the relative patterns), since in
order to gauge the accuracy of any candidate sampling designs, it is
necessary to standardize the patterns anyway. Hence it is more
important that the model depict the general patterns in the observed
data than the exact values of the observed data.
\clearpage
Sampling
=============
Ideally, a good sampling design should comprise a configuration of
sites that collectively represent the broader area as efficiently as
possible. In this context, efficiency is a compromise between
complete representation (resulting from saturating the spatial domain
with sites) and minimizing sampling effort.
There are numerous approaches for generating candidate sampling
configurations. Irrespective of the approach, there must be a metric
by which the suitability of the configuration can be gauged. If we
assume that full saturation must provide maximum representation, then
all other configurations can be gauged relative to the full
saturation. Hence a measure of how well a configuration is likely to
represent the full spatial domain is the difference between some
empirical property calculated from the candidate configuration and
full saturation. For example, we could calculate the difference
between the estimated mean Magnesium from a candidate configuration
and the equivalent mean calculated from the full saturation. The
magnitude of this difference is thus a measure of the inaccuracy, and
hence the suitability, of the candidate configuration.
In the current application, there are numerous sediment chemicals that
can be used to describe the underlying conditions within the spatial
domain. Consequently, the metric needs to incorporate the differences
across each of the chemicals. The following metrics will be adopted.
$$
\text{Mean Error} = \frac{1}{n} \sum_{i=1}^{n}{\frac{\lvert\bar{c}_i - \bar{s}_i\rvert}{r_i}}
$$
$$
\text{Max Error} = \max_{i=1,\ldots,n}{\frac{\lvert\bar{c}_i - \bar{s}_i\rvert}{r_i}}
$$
$$
\text{Min Error} = \min_{i=1,\ldots,n}{\frac{\lvert\bar{c}_i - \bar{s}_i\rvert}{r_i}}
$$
where $\bar{c}_i$ and $\bar{s}_i$ are the estimated domain means of
the $i$th chemical (out of $n$) from the candidate and full saturation
configurations respectively, and the normalizing constant ($r_i$) is
the difference between the maximum and minimum predicted values for
the $i$th chemical.
Three metrics are presented so as to capture three broad
characteristics of the 'accuracy' of the candidate sampling designs:
- MeanError - this is a measure of the average deviation between the estimated zone mean (from the candidate model) and target mean from across all chemical species.
- MaxError - this is a measure of the maximum deviation between the estimated zone mean (from the candidate model) and the target mean from across all chemical species.
As a maximum, it can be used to compare the worst performing aspects of each candidate and thus acts as a worst case scenario.
- MinError - this is a measure of the minimum deviation between the estimated zone mean (from the candidate model) and the target mean from across all chemical species.
As a minimum, it can be used to compare the best performing aspects of each candidate and thus acts as a best case scenario.
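As a concrete illustration, the three metrics might be computed as in
the following sketch, where `candidate` and `saturated` are assumed to
be data frames of predicted chemical values (one column per chemical)
at the candidate sites and across the full domain respectively; the
names are illustrative only.

```r
## A minimal sketch of the three error metrics
error_metrics <- function(candidate, saturated) {
  c_bar <- colMeans(candidate)                              # candidate domain means
  s_bar <- colMeans(saturated)                              # full saturation domain means
  r     <- apply(saturated, 2, function(x) diff(range(x)))  # normalising ranges
  err   <- abs(c_bar - s_bar) / r
  c(MeanError = mean(err), MaxError = max(err), MinError = min(err))
}
```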
#### Random sampling
The simplest approach to generating a spatial sampling design is to
repeatedly simulate sampling from the spatial domain with a range of
sample sizes and use the above metric to help determine the optimum
sampling size and configuration. Given a sufficiently large sample
size, random sampling should provide an unbiased and representative
sampling design. However, it is highly likely that at low sample
sizes this approach will not yield highly representative samples (high
$Error$). Yet increasing the sample size should result (on average)
in lower $Error$ (= greater accuracy). To counter the natural
stochasticity associated with simulation, we repeated each sample size
five times.
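A minimal sketch of one such set of replicate random draws is given
below, assuming the `deposition_mask` raster constructed earlier;
`raster::sampleRandom` draws only from non-NA (unmasked) cells.

```r
library(raster)

## Five replicate random draws of 50 sites from the unmasked cells
random_draws <- lapply(1:5, function(i) {
  sampleRandom(deposition_mask, size = 50, xy = TRUE)
})
```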
#### Sampling on a regular grid
In the simple random sampling approach above, each simulated random
draw was independent of all other draws. As a result, all
configurations are possible - even those in which multiple samples are
aggregated together in close proximity. In the absence of any prior
knowledge about the underlying environmental conditions and
heterogeneity, an even and regular spread of samples can ensure that
the sampling design does offer general representativeness. Grids of
increasing sample size offer progressively finer granularity and thus
the ability to detect smaller scale perturbations in space.
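A corresponding regular grid draw might look like the sketch below
(again assuming the `deposition_mask` raster); grid cells that fall in
masked (NA) areas are dropped, so the realised sample size can be
smaller than requested.

```r
library(raster)

## Approximately 50 sites on a regular grid; masked cells are then removed
grid_sites <- sampleRegular(deposition_mask, size = 50, xy = TRUE)
grid_sites <- grid_sites[!is.na(grid_sites[, 3]), ]  # column 3 holds the mask values
```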
#### Conditioned Latin Hypercube Sampling
In drawing both random samples and regular grid samples, the process
is completely naive to the underlying environmental conditions.
Consequently, they are only likely to be representative once a large
number of samples have been collected. Conditioned Latin Hypercube
Sampling (cLHS) is a robust and efficient statistical sampling
procedure that generates samples based on the variability within a
multidimensional covariate feature space [@Minasny-2006-1378]. Since
the sampling process is supervised by the underlying conditions (or
proxies thereof), for a given sample size, the resulting candidate
sampling configurations are more likely to yield samples that are
representative of the complex underlying conditions.
In introducing the procedure, @Minasny-2006-1378 provided a real world
example of a soil mapping project in which relatively small sample
sizes generated by cLHS closely represented the original distribution
of a set of environmental covariates. Furthermore, @Minasny-2006-153
found cLHS superior (in terms of accuracy) to a range of alternative
sampling procedures including random and stratified random sampling.
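A minimal sketch of generating a cLHS design with the `clhs` package
is given below; `chem_stack` (a RasterStack of the masked predicted
chemical layers) is an illustrative name.

```r
library(raster)
library(clhs)  # conditioned Latin hypercube sampling

## Candidate cells with their coordinates and chemical values
cand <- as.data.frame(rasterToPoints(chem_stack))  # columns: x, y, then one per chemical

## Select 100 sites conditioned on the chemical covariates only
idx <- clhs(cand[, -(1:2)], size = 100, simple = TRUE)
clhs_sites <- cand[idx, c("x", "y")]
```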
#### Spatially balanced design
Whilst regular grid sampling designs do space samples out throughout
the spatial domain, they do have the potential to introduce biases due
to any underlying systematic processes that might align with the
design (albeit unintentionally) and do not necessarily provide good
representation. On the other hand, random sampling designs offer some
degree of protection from those biases. However, the sheer nature of
randomization means that it is possible that some sampling locations
can be clustered in very close proximity. When this is the case, not
only does it waste valuable sampling effort (as there is no need to
provide multiple estimates of the same location), it introduces
another bias. Any resulting estimates from the sampling design will
be biased towards the clustered sample conditions, as those conditions
are effectively weighted higher by virtue of the greater sampling
effort.
The ideal design would be a random configuration that still prevents
the clustering of samples: in effect, a way to generate random
locations whose probability of selection is proportional to the
distance from all already-selected sites. This is the inspiration
behind **Spatially balanced designs**.
There are numerous spatially balanced design techniques. Some, such
as Generalized Random-Tessellation Stratified (GRTS), map the full two
dimensional spatial domain into a single dimension (in a way that
preserves the spatial ordering) before applying a systematic $\pi$ps
sampling technique to ensure a balanced distribution of samples
throughout the spatial domain. @Grafstrom-2012 introduced an
alternative technique in which unequal inclusion probabilities are
generated via a pivotal method.
A further (and perhaps alternative) ideal is to be able to have a
balanced design not only based on spatial proximity, but also on the
basis of dissimilarity of underlying conditions. Spatial areas that
are homogeneous with respect to some underlying sampling conditions
require fewer sampling locations to characterise the underlying
patterns than areas that are relatively heterogeneous. The spatially
balanced design via pivotal method allows any number of dimensions to
determine the inclusion probabilities.
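Both variants can be sketched with the local pivotal method
implemented in the `BalancedSampling` package [after @Grafstrom-2012];
the candidate data frame `cand` is assumed to be as in the cLHS sketch
above.

```r
library(BalancedSampling)  # local pivotal method of Grafstrom et al.

n <- 100
p <- rep(n / nrow(cand), nrow(cand))  # equal inclusion probabilities

## 2D spatially balanced design: spread sites in geographic space only
sb_2d <- lpm2(p, as.matrix(cand[, c("x", "y")]))

## nD spatially balanced design: spread sites in space and in the
## (scaled) chemical covariate space simultaneously
sb_nd <- lpm2(p, scale(as.matrix(cand)))
```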
A key determinant in selecting which of the above techniques to adopt
is the purpose of the sampling design. If
for example, the purpose is to characterise the overall condition
mean, then a 2D spatially balanced design is arguably most appropriate
as it should represent the general underlying conditions. If on the
other hand, the purpose is to be able to model the underlying patterns
and understand where any changes in these patterns occur, then a
design that has been optimised around the underlying conditions (such
as an n-dimensional spatially balanced design or conditioned Latin
hypercube sampling technique) is arguably more appropriate.
## Middle Arm
For a range of sample sizes (5, 10, 20, 30, 40, 50, 60, 70, 80, 90,
100, 120, 200, 1000), for Middle Arm, each of the above sampling
approaches was repeated five times. For each run, the $Error$ metric
was calculated. The results are presented in Figures
@fig:samplingEffort (mean error) and @fig:samplingEffortMax (max
error). As expected, as the sample sizes increase, the error
declines. The regular sampling design performs worst. The
random sampling is better than the regular grid sampling. Whilst
clusters of samples might be appropriate for representing conditions
when the conditions cluster correspondingly, totally random samples
are highly unlikely to resemble the correct cluster configuration.
The regular grid sampling suffered from a lack of available locations
from which to sample (once masking from the hydrodynamic modelling,
shipping channels etc. was taken into account). Even a sample size of
10 was too many for a grid. Hence, although a range of sample sizes
was requested, in reality only a small number were actually realised.
The non-uniform distribution of cLHS, on the other hand, was directly
due to the clustering patterns in the underlying conditions and thus
it was not surprising that this technique had the least error.
Interestingly, the reduction in error after a sample size of 50 is
relatively mild (notwithstanding that the figure is presented on a
log-y scale). For each of sample sizes 50, 60, 70, 80 and 100, the
best (based on lowest error) cLHS configuration is presented in
Figure @fig:SamplingDesign_clhs. Comma delimited text files are also
available with the latitudes and longitudes of these sites.
{#fig:samplingEffort width=80%}
{#fig:samplingEffortMax width=80%}
{#fig:samplingEffortMin width=80%}
On the basis of Figures @fig:samplingEffort, @fig:samplingEffortMax
and @fig:samplingEffortMin we could conclude that a sample size of 100
within Middle Arm is a sound choice, although it is likely that as few
as 50 could still potentially yield similar overall patterns. The
sample size of 100 also accommodates a buffer against sample
loss. Nevertheless, this entire simulation process is contingent on a
number of unverifiable assumptions:
1. that the Munksgaard 2012 sediment sampling data are representative of the typical conditions and spatial patterns.
2. all Munksgaard 2012 sediment chemicals are equally useful and informative.
3. the INLA models are able to fully represent the true underlying conditions.
4. the costs and logistics of sampling are equal irrespective of location.
The conditioned Latin hypercube sampling technique consistently
outperformed the other techniques. Interestingly, there was very
little difference between the 2D and nD spatially balanced designs.
This suggests that either the sediment chemicals were relatively
homogeneous over space or else patterns in one chemical species were
countered by patterns in another chemical species.
The conditioned Latin hypercube sampling technique appeared to be able
to tune its design to the underlying sediment chemical patterns better
than the spatially balanced designs. Thus if the above assumptions
are reasonable and the main intention of the sampling was to be able
to describe the patterns in the sediment chemicals, then the sampling
design derived from this technique seems most appropriate.
If, however, the purpose of the sampling design was to provide an
unbiased representative sample of the general conditions across the
spatial domain, then it could be argued that the 2D spatially balanced
design was most appropriate - particularly if there was any doubt in
the above assumptions.
{#fig:SamplingDesign_clhs width=100%}
{#fig:SamplingDesign_regular width=100%}
{#fig:SamplingDesignspatiallybalanced2 width=100%}
{#fig:SamplingDesignspatiallybalanced width=100%}
{#fig:SamplingDesign_spatiallybalanced2_final width=100%}
Conclusions
==============
Overall, 2D spatially balanced sampling designs for Middle Arm would
seem most appropriate. These designs are robust to any uncertainty in
previous data and spatial modelling and should yield well balanced
spatial configurations. A total of 100 samples collected from Middle
Arm is likely to yield representative samples from which to construct
a variety of spatio-temporal models into the future.
Appendix A
===========
{#fig:inla_model_middle.arm_V width=80%}
{#fig:inla_model_middle.arm_Cr width=80%}
{#fig:inla_model_middle.arm_Co width=80%}
{#fig:inla_model_middle.arm_Ni width=80%}
{#fig:inla_model_middle.arm_Cu width=80%}
{#fig:inla_model_middle.arm_Zn width=80%}
{#fig:inla_model_middle.arm_As width=80%}
{#fig:inla_model_middle.arm_Se width=80%}
{#fig:inla_model_middle.arm_Mo width=80%}
{#fig:inla_model_middle.arm_Cd width=80%}
{#fig:inla_model_middle.arm_Pb width=80%}
This document was produced from markdown using knitr on `r version$version.string` on a `r version$platform` system.
```{r, results='markdown'}
sessionInfo()
```
References
====================