Rivulus-Genetic-Diversity/Rivulus-Genetic-Diversity.Rmd at main · anthonysnead/Rivulus-Genetic-Diversity · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Rivulus-Genetic_Diversity"
author: "Anthony Snead"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Data Import and Formatting

Here we import our genetic data, assign individuals to population groupings, and format the population genetic data for downstream analysis

## Genetic Data Import

Here we import our population genetic data. The file includes columns that are not needed as well as a sister species indicated by the presence of the R22.Dup marker.

```{r, Data Import}
# Read and format the population genetic data for the spatial analysis and the genid object
Population.Data <- read.csv(paste(getwd(), "Data/rivulus_genetic_data.csv", sep = "/")) %>%
  # R22.Dup is a microsatellite marker species to K. hermaphroditus. Therefore, we remove any individuals with that marker as they are not our species of interest.
  dplyr::filter(is.na(R22.Dup) & is.na(R22.Dup.1)) %>%
  #get rid of the columns
  dplyr::select(-R22.Dup, -R22.Dup.1) %>%
  # remove individual with greater than 10% missing data
  dplyr::filter(rowSums((is.na(.[,6:ncol(.)])/32)*100) <= 10) %>%
  # replace the NA values
  replace(is.na(.), "?") %>%
  # gather the loci names into a new column
  tidyr::gather(., Loci, Val, -(1:5), na.rm = FALSE) %>%
  # remove the ".1" from the end of loci names
  dplyr::mutate(Loci = gsub("\\.[0-9]*$","", Loci)) %>%
  # group by all variables except the lovi values
  dplyr::group_by(across(1:6)) %>%
  # join loci values and separate by ":"
  dplyr::summarise(Val = paste(Val[!(is.na(Val)|Val=="")], collapse=":")) %>%
  # make loci names column headings again
  tidyr::spread(Loci, Val)%>%
  # ungroup
  ungroup() %>%
  # rename the ID column
  dplyr::rename(ID = Lines.Loci) %>%
  # get rid of the - male in ID columns
  dplyr::mutate(ID = gsub("- male","", ID)) %>%
  # arrange the data frame
  dplyr::arrange(Long, Lat)
```

## Create Population Groupings

We lump together populations that are within 10 km of each other because the coarsest Ocean General Circulation Model (OGCM) we use is ~9 km resolution. Here we create buffers and merge them together to get our groupsing for the downstream analysis.

```{r, Create Populations}

# make the population groups
Population.Groups <- Population.Data %>%
  # select coordinates
  dplyr::select(Lat, Long) %>%
  # get unique entries for buffers
  unique(.) %>%
  # convert population data to simple feature.
  sf::st_as_sf(x = .,
               coords = c("Long", "Lat"),
               crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0") %>%
  sf::st_transform(., crs = "+proj=laea +lat_0=37.32 +lon_0=-113.04 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs") %>%
  # make 10km buffer
  sf::st_buffer(., dist = 10000) %>%
  # make valid
  sf::st_make_valid(.) %>%
  # combine areas with overlapping buffers
  sf::st_union(.) %>%
  # make the object valid
  sf::st_make_valid(.) %>%
  # cast to a polygon
  sf::st_cast(., "POLYGON") %>%
  # make a spatial
  sf::as_Spatial(.)
```

## Assign Population Numbers

We use our population groupings to assign the population number back to the genetic data for downstream analysis.We then rename the populations base on where they are.

```{r, Population Assignment}

# assign the populations to the genetic data and filter out locations with less than 10 samples
Population.Genetic.Data <- Population.Data %>%
  # convert population data to simple feature.
  sf::st_as_sf(x = .,
               coords = c("Long", "Lat"),
               crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0") %>%
  # spatial join to polygon to assign the population ID
  sf::st_join(x = .,
              y = (sf::st_as_sf(Population.Groups,
                               coords = c("Longitude", "Latitude"),
                               crs = "+proj=laea +lat_0=37.32 +lon_0=-113.04 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs") %>%
                     sf::st_transform(., crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")  %>%
                     cbind(., Population_ID = 1:20)),
              join = st_intersects,
              left = FALSE) %>%
  # convert to data frame with geometry column
  as.data.frame(.) %>%
  # split geometry to lat and long
  tidyr::separate(., col = geometry, into = c("Longitude", "Latitude"), sep = ",") %>%
  # extract number
  dplyr::mutate(Longitude = tidyr::extract_numeric(Longitude),
                Latitude = tidyr::extract_numeric(Latitude)) %>%
  # move the population Id column.
  dplyr::relocate(c(Population_ID), .after = ID) %>%
  # move columns around
  dplyr::relocate(c(Longitude, Latitude, REGION, Year), .before = ID) %>%
  # change numbers to acronyms
  dplyr::mutate(Population_ID = dplyr::case_when(Population_ID == 1 ~ "UH",
                                                 Population_ID == 2 ~ "RH",
                                                 Population_ID == 3 ~ "TC",
                                                 Population_ID == 4 ~ "LC",
                                                 Population_ID == 5 ~ "TA",
                                                 Population_ID == 6 ~ "NC",
                                                 Population_ID == 7 ~ "LK",
                                                 Population_ID == 8 ~ "UK",
                                                 Population_ID == 9 ~ "EI",
                                                 Population_ID == 10 ~ "SS",
                                                 Population_ID == 11 ~ "EG",
                                                 Population_ID == 12 ~ "SL",
                                                 Population_ID == 13 ~ "FL",
                                                 Population_ID == 14 ~ "LB",
                                                 Population_ID == 15 ~ "VR",
                                                 Population_ID == 16 ~ "BB",
                                                 Population_ID == 17 ~ "CC",
                                                 Population_ID == 18 ~ "TB",
                                                 Population_ID == 19 ~ "IR",
                                                 Population_ID == 20 ~ "NS")) %>%
  # group
  dplyr::group_by(Population_ID) %>%
  # remove populations with less than 10 individuals
  dplyr::filter(n() >= 10) %>%
  # ungroup
  dplyr::ungroup(.)

# The acroynmes represent where the samples are from. NS = New Symerna, EG = Everglades, LB = Lower Bogue, EI = Exuma Island, FL = Fort Lauderdale, IR = Indian River, UK = Upper Keys, SL = Saint Lucie, VR = Vero Beach, LC = Long Caye, TC = Twin Cayes, TA = Turneffe Atoll, SS = San Salvador, CC = Charlotte County, BB = Bunche Beach, LK = Lower Keys, UH = Utila Honduras, RH = Roatan Honduras, NC = Northern Caye, TB = Tampa Bay


# the base world data for maps
world <- rnaturalearth::ne_countries(scale = "large", returnclass = "sf")

# map of samples total
Sample.Map <- ggplot2::ggplot(data = world) +
   geom_sf(fill= "antiquewhite",
            color = "black") +
  geom_point(data = Population.Genetic.Data, aes(x = Longitude, y = Latitude, fill = as.factor(Population_ID)),
             shape=21, size = 2, color = "black") +
  geom_text(data = Population.Genetic.Data, aes(x = Longitude, y = Latitude), label = Population.Genetic.Data$Population_ID, nudge_x = 0.2, nudge_y = 0.2, check_overlap = TRUE) +
    coord_sf(xlim = c(-90,-72),
             ylim = c(15,30))
```

## Convert Genetic Data

Here we convert our genetic data from into a genid object.
```{r, Convert Genetic Data}

# make genind object for population genetic analysis used in the Pairwise Comparison
Population.Genetic.Data.genid <- Population.Genetic.Data %>%
  dplyr::select(-(1:6)) %>% # get the columns we need
  adegenet::df2genind(X = .,
                      sep = ":",
                      ncod = NULL,
                      ind.names = Population.Genetic.Data$ID,
                      loc.names = NULL,
                      pop = Population.Genetic.Data$Population_ID,
                      NA.char= "?",
                      ploidy = 2,
                      type = "codom",
                      strata = Population.Genetic.Data[,3:6],
                      hierarchy = NULL)
```


## Genetic Diversity

```{r, unbiased shannon - zahl}

# unbiased shannon estimator from Konopiski 2020
ShannonGen <- function(gInd, estimator = NULL) {
  MLE = function(X) {
    X = X[X > 0]
    n = sum(X)
    - sum(X / n * log(X / n))
  }
  Z = function(X){
    X=X[X>0]
    Y=X[X>1]
    n=sum(X)
    -n*sum(X/n*log(X/n))-(n-1)/n*sum((n-X)*(-X/(n-1)*log(X/(n-1))) )-(n-1)/n*sum(-Y*(Y-1)/(n-1)*log((Y-1)/(n-1)))
  }
  CS = function(X) {
    x = X
    x = x[x > 0]
    n = sum(x)
    f1 = sum(x == 1)
    C_head = 1 - f1 / n
    a = -sum(C_head * (x / n) * log(C_head * (x / n)) / (1 - (1 - C_head *
                                                                (x / n)) ^ n))
    a
  }
  Ch = function(X) {
    x = X
    x = x[x > 0]
    n = sum(x)
    UE <- sum(x / n * (digamma(n) - digamma(x)))
    f1 <- sum(x == 1)
    f2 <- sum(x == 2)
    if (f1 > 0)
    {
      A <-
        1 - ifelse(f2 > 0, (n - 1) * f1 / ((n - 1) * f1 + 2 * f2), (n - 1) * f1 /
                     ((n - 1) * f1 + 2))
      B = sum(x == 1) / n * (1 - A) ^ (-n + 1) * (-log(A) - sum(sapply(1:(n - 1), function(k) {
        1 / k * (1 - A) ^ k})))
    }
    if (f1 == 0) {
      B = 0
    }
    if (f1 == 1 & f2 == 0) {
      B = 0
    }
    UE + B
  }
  inputs <- lapply(adegenet::seppop(gInd),
                   function (i)
                     sapply(adegenet::seploc(i),
                            function(j) colSums(j@tab,na.rm = TRUE)))
  out <- list()
  if (is.null(estimator))
    estimator <- "z"

  if ("sh" %in% estimator) {
    out[["Shannon_1949"]] <-
      as.data.frame(lapply(inputs, function (i)
        sapply(i, MLE)))
  }
  if ("z" %in% estimator) {
    out[["Zahl_1977"]] <-
      as.data.frame(lapply(inputs, function (i)
        sapply(i, Z)))
  }
  if ("cs" %in% estimator) {
    out[["Chao_Shen_2003"]] <-
      as.data.frame(lapply(inputs, function (i)
        sapply(i, CS)))
  }
  if ("ch" %in% estimator) {
    out[["Chao_et_al_2013"]] <-
      as.data.frame(lapply(inputs, function (i)
        sapply(i, Ch)))
  }
  return(out)
}

Zahl <- ShannonGen(Population.Genetic.Data.genid)$Zahl_1977 %>%
  # get mean
  dplyr::summarise_all(., mean) %>%
  # pivot
  tidyr::pivot_longer(., cols = 1:ncol(.), names_to = "Pop", values_to = "Z")

```
```{r, Genetic Diversity}
#calculate genotypic diversity metrics
Population.Genetic.Diversity <- poppr::poppr(Population.Genetic.Data.genid, total = TRUE, sample = 999, cutoff = 0.1)

# get rarefied genetic diversity
Population.Genetic.Diversity.Rare <- poppr::diversity_ci(Population.Genetic.Data.genid,
                                                         total = TRUE, rarefy = TRUE,
                                                         n.rare = 10)

# combine genotype rarefied measures with standard genetic diversity measures
Population.Genetic.Diversity.Table <- data.frame(
  Pop = names(apply((t(sapply(adegenet::seppop(Population.Genetic.Data.genid), function(ls) adegenet::summary(ls)$Hobs))),
                    MARGIN = 1, FUN = mean)),
  Hobs = apply((t(sapply(adegenet::seppop(Population.Genetic.Data.genid), function(ls) adegenet::summary(ls)$Hobs))),
               MARGIN = 1, FUN = mean),
  Hexp = apply((t(sapply(adegenet::seppop(Population.Genetic.Data.genid), function(ls) adegenet::summary(ls)$Hexp))),
               MARGIN = 1, FUN = mean),
  Ar = PopGenReport::allel.rich(Population.Genetic.Data.genid, min.alleles = NULL)$mean.richness) %>%
  dplyr::left_join(., (as.data.frame(Population.Genetic.Diversity.Rare$est) %>%
                     tibble::rownames_to_column(., var = "Pop"))) %>%
  dplyr::left_join(., (Population.Genetic.Diversity %>%
                         dplyr::select(Pop, eMLG))) %>%
  dplyr::left_join(., Zahl) %>%
  dplyr::relocate(., "eMLG", .before = "Hobs") %>%
  dplyr::select(-H, -lambda, -E.5)
```

```{r, sample size versus diversity}
# create table for correlation test
Diversity_N <- Population.Genetic.Diversity.Table %>%
  dplyr::left_join(., (Population.Genetic.Diversity %>%
                         dplyr::select(Pop, N))) %>%
  dplyr::select(N, eMLG, Z, G, Hexp, Hobs, Ar)

#eMLG
cor.test(Diversity_N$N, Diversity_N$eMLG)

#Z
cor.test(Diversity_N$N, Diversity_N$Z)

#G
cor.test(Diversity_N$N, Diversity_N$G)

#Hexp
cor.test(Diversity_N$N, Diversity_N$Hexp)

#Hobs
cor.test(Diversity_N$N, Diversity_N$Hobs)

#Ar
cor.test(Diversity_N$N, Diversity_N$Ar)

```

# Abiotic Variables
## CMS

We use the connectivity-modeling-system (CMS) to estimate ocean current mediated connectivity by simulating an ocean current environment and releasing particles that can represent larva, eggs ,or other propagules. CMS requires ocean current data (nest files), two parameter files, and a release file. It generates either netCDF or ASCII files with the location of each particle at the designated time step. Here we generate the release files and process the output.

### Generate Release Files

The release file is a tab delimited text file with 9 columns (polygon, longitude, latitude, depth, number, year, month, day, and second). Polygon is used with the seascape module if particles are allowed to settle. Even if the particles are not allowed to settle, the polygon column is still required and must be an integer. Longitude, latitude, and depth columns represent the location of the particle release location both geographically and vertically. The number is the number of particles released at that specific combination of location and time. If there is no horizontal diffusivity coefficient, all the particles released at the same location and time will follow the same trajectory. The year, month, day, and second columns dictate the time of release. Particles cannot be released over land, and if they are, CMS will not notify you of the issue. CMS dictates a particle is on land if two of the closest eight points are dictated as land my the nest files.

To get an accurate representation of ocean current connectivity, 1000 locations were selected and 5 particles released form each location at a random point every day for 10 years. Because locations could not be over land, we had to troubleshoot release locations using the example HYCOM files from the specific experiments used in our simulations and additional buffer zones. Therefore, the following code requires importing the HYCOM data, sampling the points, and creating the release files.
#### Release Points

##### HYCOM Data

CMS enables users to use different nest files that are different in resolution. We used two HYCOM experiments (GOM 50.1 & GLBu 0.08). The GOM 50.1 is only the Gulf of Mexico but is available at a 1/25° resolution, while the GLBu 0.08 is global but only at a 1/12° resolution. Here we import sample netCDF files invert them so that the land mask is not equal to one, convert them to polygons, and add a 5 km buffer to them. We then use them later to filter out locations that CMS would dictate as land.


```{r, HYCOM data}

# get netcdf, convert to polygon, and calculate the buffer
GOM <- raster::raster(paste(getwd(), "Data/Sample_HYCOM/hycom_gomu_501_2012010100_t000.nc", sep = "/"), varname = "water_v") %>%
  # crop the the study extent to make sure that cells outside the domain are not retained inappropriately
  raster::crop(., raster::extent(c(xmin = -98.1 , xmax = -76.5 , ymin = 18 , ymax = 32))) %>%
  # invert the raster so that the land mask is the only cells with data
  raster::calc(., fun = function(x){
    if(is.na(x[1])){
      return(1)}
    else{ return(NA)}}) %>%
  # convert to polygon
  raster::rasterToPolygons(., dissolve = TRUE) %>%
  # convert to sf object
  sf::st_as_sf(x = .,
               coords = c("Long", "Lat"),
               crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0") %>%
  # project for the buffer
  sf::st_transform(., crs = "+proj=laea +lat_0=37.32 +lon_0=-113.04 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs") %>%
  # calculate the 5 km buffer
  sf::st_buffer(., dist = 5000) %>%
  # unproject because CMS take lat/long
  sf::st_transform(., crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")

# get netcdf, convert to polygon, and calculate the buffer
GBLu <- raster::raster(paste(getwd(), "Data/Sample_HYCOM/hycom_GLBu0.08_191_1995080100_t000.nc", sep = "/"), varname = "water_v") %>%
  # crop the the study extent to make sure that cells outside the domain are not retained inappropriately
  raster::crop(., raster::extent(c(xmin = -91 , xmax = -76 , ymin = 18 , ymax = 32))) %>%
  # invert the raster so that the land mask is the only cells with data
  raster::calc(., fun = function(x){
    if(is.na(x[1])){
      return(1)}
    else{ return(NA)}}) %>%
  # convert to polygon
  raster::rasterToPolygons(., dissolve = TRUE) %>%
  # convert to sf object
  sf::st_as_sf(x = .,
               coords = c("Long", "Lat"),
               crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0") %>%
  # project for the buffer
  sf::st_transform(., crs = "+proj=laea +lat_0=37.32 +lon_0=-113.04 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs") %>%
  # calculate the 5 km buffer
  sf::st_buffer(., dist = 5000) %>%
  # unproject because CMS take lat/long
  sf::st_transform(., crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
```

##### Sample Locations

Here we use the population polygons to sample release locations for CMS. This step required troubleshooting to determine the appropriate buffer to add to the population group polygons to ensure that the particles were not considered over land and not overlapping with an adjacent population.

```{r, Release Locations}

# Get a data frame of points along the edge each polygon group for release points
Population.Points <- Population.Groups %>%
  # convert to sf
  sf::st_as_sf(x = .,
               coords = c("X", "Y"),
               crs = "+proj=laea +lat_0=37.32 +lon_0=-113.04 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs") %>%
  # make 2km buffer to push the points farther from the coast
  sf::st_buffer(., dist = 2000) %>%
  # make valid
  sf::st_make_valid(.) %>%
  # combine areas with overlapping buffers
  sf::st_union(.) %>%
  # make the object valid
  sf::st_make_valid(.) %>%
  # cast to a polygon
  sf::st_cast(., "POLYGON") %>%
  # transform to lat/long cms takes lat/long
  sf::st_transform(., crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0") %>%
  # get points along the edge
  sf::st_segmentize(., dfMaxLength = 0.1) %>%
  # convert to coordinates
  sf::st_coordinates(.) %>%
  # make a data frame
  as.data.frame(.) %>%
  # select important columns
  dplyr::select(X, Y, L2) %>%
  # rename the population ID column
  dplyr::rename(Population_ID = L2) %>%
  # reduce to third decimal
  dplyr::mutate(X = format(round(X, 3), nsmall = 3),
                Y = format(round(Y, 3), nsmall = 3)) %>%
  # get unique points
  dplyr::distinct(.) %>%
  # make sf object
  sf::st_as_sf(.,
               coords = c("X", "Y"),
               crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0") %>%
  # add location columns and columns that indicate if a point is within the land mask for each HYCOM experiment
  dplyr::mutate(GOM = lengths(sf::st_within(., GOM)),
                GBLu = lengths(sf::st_within(., GBLu)),
                Longitude = sf::st_coordinates(.)[,1],
                Latitude = sf::st_coordinates(.)[,2]) %>%
  # convert to data frame
  as.data.frame(.) %>%
  # filter out locations by the finest resolution nest that over over land
  dplyr::filter((Latitude > 18 & Latitude < 32 & Longitude > -98.1 & Longitude < -76.5 & GOM == 0) |
                  (Latitude < 18 & GBLu == 0) | (Longitude >= -76.5 & GBLu == 0)) %>%
  # select important columns
  dplyr::select(Longitude, Latitude, Population_ID)
```

##### Create Release files

Here we use the locations to create the release files. We first create a master release file data frame that includes all the years.

```{r, Release Data Frame}

# set the seed for reproducability
set.seed(42)

# make file of release points that are over water
Release.Points <- Population.Points %>%
  # convert the longitude to same format as output for join
  dplyr::mutate(Longitude = 360+Longitude) %>%
  # get rid of population that did not have enough samples
  dplyr::filter(Population_ID !=2 & Population_ID != 15 & Population_ID != 16) %>%
  # group
  dplyr::group_by(Population_ID) %>%
  # sample 1000 locations randomly
  dplyr::sample_n(1000, replace = TRUE) %>%
  # replicate each row by the total number of days from the start of the simulation to the end.
  dplyr::slice(rep(row_number(), (10*12*31))) %>%
  # add columns for CMS and fill Year column
  dplyr::mutate(Depth = 0,
                Number = 10,
                Year = rep(2000:2009, each = (1000*12*31))) %>%
  # group
  dplyr::group_by(Population_ID, Year) %>%
  # fill Month column
  dplyr::mutate(Month = rep(1:12, each = (1000*31))) %>%
  # group
  dplyr::group_by(Population_ID, Year, Month) %>%
  # fill day column
  dplyr::mutate(Day = rep(1:31, each = 1000)) %>%
  # ungroup
  dplyr::ungroup() %>%
  # fill second with a random time
  dplyr::mutate(Second = round(runif(nrow(.), 0, 86400))) %>%
  # filter out all the days that do not exist and Feb 29th because the CMS software ignores it anyways
  dplyr::filter((Month %in% c(1,3,5,7,8,10,12) & Day <= 31) | (Month %in% c(4,6,9,11) & Day <= 30) | (Month == 2 & Day <= 28)) %>%
  # make sure columns are in the correct order for CMS
  dplyr::relocate(Population_ID, .before = Longitude)
```

Even with an HPC, it was more effective to run each year seperatley. Therefore, we write a release file for each year.

```{r, write files}

# write release file for 2000
write.table(Release.Points %>%
               dplyr::filter(Year == 2000),file = paste(getwd(), "CMS/Input/Release_Points_2000.txt", sep = "/"),quote = FALSE, sep = '\t', col.names = FALSE, row.names = FALSE)

# write release file for 2001
write.table(Release.Points %>%
               dplyr::filter(Year == 2001),file = paste(getwd(), "CMS/Input/Release_Points_2001.txt", sep = "/"),quote = FALSE, sep = '\t', col.names = FALSE, row.names = FALSE)

# write release file for 2002
write.table(Release.Points %>%
               dplyr::filter(Year == 2002),file = paste(getwd(), "CMS/Input/Release_Points_2002.txt", sep = "/"),quote = FALSE, sep = '\t', col.names = FALSE, row.names = FALSE)

# write release file for 2003
write.table(Release.Points %>%
               dplyr::filter(Year == 2003),file = paste(getwd(), "CMS/Input/Release_Points_2003.txt", sep = "/"),quote = FALSE, sep = '\t', col.names = FALSE, row.names = FALSE)

# write release file for 2004
write.table(Release.Points %>%
               dplyr::filter(Year == 2004),file = paste(getwd(), "CMS/Input/Release_Points_2004.txt", sep = "/"),quote = FALSE, sep = '\t', col.names = FALSE, row.names = FALSE)

# write release file for 2005
write.table(Release.Points %>%
               dplyr::filter(Year == 2005),file = paste(getwd(), "CMS/Input/Release_Points_2005.txt", sep = "/"),quote = FALSE, sep = '\t', col.names = FALSE, row.names = FALSE)

# write release file for 2006
write.table(Release.Points %>%
               dplyr::filter(Year == 2006),file = paste(getwd(), "CMS/Input/Release_Points_2006.txt", sep = "/"),quote = FALSE, sep = '\t', col.names = FALSE, row.names = FALSE)

# write release file for 2007
write.table(Release.Points %>%
               dplyr::filter(Year == 2007),file = paste(getwd(), "CMS/Input/Release_Points_2007.txt", sep = "/"),quote = FALSE, sep = '\t', col.names = FALSE, row.names = FALSE)

# write release file for 2008
write.table(Release.Points %>%
               dplyr::filter(Year == 2008),file = paste(getwd(), "CMS/Input/Release_Points_2008.txt", sep = "/"),quote = FALSE, sep = '\t', col.names = FALSE, row.names = FALSE)

# write release file for 2009
write.table(Release.Points %>%
               dplyr::filter(Year == 2009),file = paste(getwd(), "CMS/Input/Release_Points_2009.txt", sep = "/"),quote = FALSE, sep = '\t', col.names = FALSE, row.names = FALSE)
```

### Post Processesing
```{r, CMS Post-Processing Functions}

# the function processes the CMS output files
CMS.Processing.Function <- function(pathlist, polygons, timestep, releasepoints){

  library(dplyr)
  library(sf)
  library(sfheaders)
  library(tidyverse)

  # read in the population polygon and convert to sf
  Populations <- sf::st_as_sf(polygons,
                               coords = c("X", "Y"),
                               crs = "+proj=laea +lat_0=37.32 +lon_0=-113.04 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs") %>%
    # transform
    sf::st_transform(., crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0") %>%
    # add population ID
    cbind(., Population_ID = (1:20))

  releasepoints <- releasepoints %>%
    # get only important columns
    dplyr::select(Population_ID, Longitude, Latitude) %>%
    # get only distinct values
    distinct(.) %>%
    # convert lat and long to character for join
    dplyr::mutate( Longitude = as.character(Longitude),
                   Latitude = as.character(Latitude))

  #a function to assign the sources
  Assignment.Fun <- function(path, polygon, points){

  # read in the trajectory file
  CMS.Output <- readr::read_table(path,
                                  col_names = c("Location", "Particle", "Time", "Longitude", "Latitude", "Depth", "Distance", "Exit_Code", "Release_Date"))  %>%
    # remove the columns we do not need
    dplyr::select(-Depth, -Exit_Code, -Release_Date)

  # assign the sources
  Assigned.Release <- CMS.Output %>%
    # filter to retain only the first release point
    dplyr::filter(Time == 0) %>%
    # convert the Latitude and longitude to character vectors for the left join
    dplyr::mutate(Longitude = as.character(Longitude),
                  Latitude = as.character(Latitude)) %>%
    # join with release locations
    dplyr::left_join(., points) %>%
    # select the variables we need
    dplyr::select(Location, Population_ID, Particle) %>%
    # relabel the source column
    dplyr::rename(Source = Population_ID) %>%
    # convert location and particle to characters for joining
    dplyr::mutate(Location = as.character(Location),
                  Particle = as.character(Particle))

  # Assign the sink and join to the source
  Source.Sink <- CMS.Output %>%
    # get particle that have moved
    dplyr::filter(Time != 0) %>%
    # convert to sf object
    sf::st_as_sf(.,
               coords = c("Longitude", "Latitude"),
               crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0") %>%
    # join it with polygons
    sf::st_join(x = .,
                y = polygon,
              join = st_intersects,
              left = TRUE) %>%
    # convert to data frame
    sfheaders::sf_to_df(., fill = TRUE) %>%
    # select important columns
    dplyr::select(-sfg_id, -point_id) %>%
    # rename columns
    dplyr::rename(Longitude = x,
                 Latitude = y,
                 Sink = Population_ID) %>%
    # convert location and particle to character for left join
    dplyr::mutate(Location = as.character(Location),
                  Particle = as.character(Particle)) %>%
    # join with the sources
    dplyr::left_join(., Assigned.Release) %>%
    # get rid of rows that do not have a sink
    dplyr::filter(!is.na(Sink))

  return(Source.Sink)
  }

  # calculate connectivity values
  Connectivity.Values <- lapply(pathlist, FUN = Assignment.Fun, polygon = Populations, points = releasepoints) %>% # apply function to all files
    # bind the lists together
    bind_rows(.) %>%
    # group by unique particles for each source sink relationship
    dplyr::group_by(Source, Sink, Location, Particle) %>%
    # arrange by time
    dplyr::arrange(Time, by.group = TRUE) %>%
    # calculate the difference between times
    dplyr::mutate(Time_diff = Time - lag(Time)) %>%
    # mutate to get connectivity values
    dplyr::mutate(Time_diff = replace(Time_diff, Time_diff != timestep, timestep), # retain only time steps that are consecutive and if they are not give it a value of the timestep
                  Time = (sum(Time_diff, na.rm = TRUE)), # get total amount of time in the polygon
                  Distance = min(Distance)) %>% # get min distance traveled
    # get distinction combinations to not count particles twice
    dplyr::distinct(Source, Sink, Location, Particle, Distance, Time, .keep_all = TRUE) %>%
    # group by source sink
    dplyr::group_by(Source, Sink) %>%
    # summarise
    dplyr::summarise(n = dplyr::n_distinct(Location, Particle), # number of particles
                     Distance = mean(Distance), # average distance
                     Time = mean(Time, na.rm = TRUE)) %>% # mean time within polygon
    # ungroup
    dplyr::ungroup(.)

return(Connectivity.Values)}

```

```{r, CMS Post-Processing}
# processes the ocean current data
Ocean.Connectivity <- CMS.Processing.Function(pathlist = list.files(paste(getwd(), "CMS/Output/", sep ="/"), recursive  = TRUE, full.names = TRUE), polygons = Population.Groups, releasepoints = Release.Points, timestep = 1800) %>%
  # convert to characters
  dplyr::mutate(across(1:2, ~ as.character(.))) %>%
  # rename populations
  dplyr::mutate(across(1:2, ~ dplyr::case_when(. == "1" ~ "UH",
                                               . == "2" ~ "RH",
                                               . == "3" ~ "TC",
                                               . == "4" ~ "LC",
                                               . == "5" ~ "TA",
                                               . == "6" ~ "NC",
                                               . == "7" ~ "LK",
                                               . == "8" ~ "UK",
                                               . == "9" ~ "EI",
                                               . == "10" ~ "SS",
                                               . == "11" ~ "EG",
                                               . == "12" ~ "SL",
                                               . == "13" ~ "FL",
                                               . == "14" ~ "LB",
                                               . == "15" ~ "VR",
                                               . == "16" ~ "BB",
                                               . == "17" ~ "CC",
                                               . == "18" ~ "TB",
                                               . == "19" ~ "IR",
                                               . == "20" ~ "NS"))) %>%
  # filter out populations not analyzed
  dplyr::filter(Source != "RH" & Source != "VR" & Source != "BB" & Sink != "RH" & Sink != "VR" & Sink != "BB")

Ocean.Connectivity.NS <- Ocean.Connectivity %>%
   dplyr::left_join(.,
                    Population.Genetic.Diversity,
                    by = c("Source" = "Pop")) %>%
  dplyr::mutate(Time = (n*Time)) %>%
  dplyr::mutate(Time = Time/sum(Time))

Ocean.Connectivity.Standard <- Ocean.Connectivity %>%
   dplyr::left_join(.,
                    Population.Genetic.Diversity,
                    by = c("Source" = "Pop")) %>%
  dplyr::mutate(Time = (n*Time)*(eMLG/12)) %>%
  dplyr::mutate(Time = Time/sum(Time))

```

### Centrality measures

```{r, Centrality Function}
# function makes graphs and estimates centrality measures from the graph
Centrality_Function <- function(Data, Variable, Distance){

  # load libraries
  library(tidyverse)
  library(igraph)

  # create graph
  Graph.Close <- Data %>%
    # rename to what igraph expects
    dplyr::rename(From = Source,
                  To = Sink,
                  weight = !!Variable) %>%
    # select the important columns
    dplyr::select(From, To, weight) %>%
    # filter out weights less than 0
    dplyr::filter(weight > 0) %>%
    # make directed graph
    igraph::graph_from_data_frame(., directed = TRUE, vertices = NULL)

  # create graph
  Graph.Strength <- Data %>%
    # rename to what igraph expects
    dplyr::rename(From = Source,
                  To = Sink,
                  weight = !!Variable) %>%
    # select the important columns
    dplyr::select(From, To, weight) %>%
    # filter out weights less than 0
    dplyr::filter(weight > 0) %>%
    # make directed graph
    igraph::graph_from_data_frame(., directed = TRUE, vertices = NULL)

  # if it is not a distance
  if(Distance == FALSE){

    # log the weights if not 0 and multiply by negative 1
    E(Graph.Close)$weight <- -1*ifelse(E(Graph.Close)$weight != 0, log(E(Graph.Close)$weight), 0)
  }

  # make results data frame fore each centrality measure
  Results <- tibble::enframe((igraph::closeness(Graph.Close, vids = igraph::V(Graph.Close),
                                                 mode = "in", normalized = TRUE)),
                                      name = "Population_ID", value = paste(Variable,"Close.In", sep ="_")) %>%
  dplyr::left_join(., tibble::enframe((igraph::strength(Graph.Strength, vids = igraph::V(Graph.Strength), mode = "in", loops = FALSE)),
                                      name = "Population_ID", value = paste(Variable,"Strength.In", sep ="_")))

  #return results
  return(Results)
}
```

```{r, estimate centrality}

# get standardized OC centrality measures
OC.Centrality.NS<- Ocean.Connectivity.NS %>%
  Centrality_Function(.,
                                     Variable = "Time", Distance = FALSE)

# get standardized OC centrality measures
OC.Centrality.Standard <- Ocean.Connectivity.Standard %>%
  Centrality_Function(.,
                                     Variable = "Time", Distance = FALSE)
```
## Distance

Rivulus are unlikely to travel over land. Therefore, we use distance over water as our distance measure.To do so, we first get centroids of all the populations, a land mask, and create a cost transition matrix. We then wrote a function that iterates across all points to get the shortest distance avoiding the land mask.

### Centroids

We get the centroid of each population.

```{r, Centroids}
# get centroids for calculating the distance over water
Centroids <- sf::st_as_sf(Population.Groups,
                                   coords = c("X", "Y"),
                                   crs = "+proj=laea +lat_0=37.32 +lon_0=-113.04 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs") %>%
  # get centroids
  sf::st_centroid(.) %>%
  # transform
  sf::st_transform(., crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0") %>%
  # get coordinates
  sf::st_coordinates(.)
```

### Land.mask

We get a land mask from the world data we imported for maps.

```{r, Land mask}
# make land mask
Land.mask <- world %>%
  # crop
  sf::st_crop(., xmin = -100, xmax = -65, ymin = 0, ymax = 32) %>%
  # convert to raster
  raster::rasterize(., (raster::raster(nrow = 1000, ncol = 1000, ext=extent(-100, -65, 0, 32))))
```

### Transition object

We create a transition object from the land mask.

```{r, transition object}
# make transition object
Water.Cost <- Land.mask %>%
  # make water 1 and land 999
  raster::calc(., fun = function(x){
    if(is.na(x[1])){
      return(1)}
    else{ return(999)}}) %>%
  # make into transition
  gdistance::transition(., function(x) 1/mean(x), 16) %>%
  # correct
  gdistance::geoCorrection(., type = "c", scl = FALSE)

```

### Ocean distance function

We write a function to iterate across all points and get the distance over water.

```{r, Ocean distance function}
# function to get distances over water
Ocean.Distance.Fun <- function(points, cost_raster, landmask){

  library(gdistance)
  library(gissr)
  library(tidyverse)
  library(raster)

  # empty list
  Distance <- list()

  # loop to get distances
  for(i in 1:nrow(points)){
    # get cost raster for the point
    Cost <- gdistance::accCost(cost_raster, points[i,]) %>%
      # mask land
      raster::mask(., landmask, inverse = TRUE)

    # get shortest path and calcualte distance
    Distance[[i]] <- gdistance::shortestPath(cost_raster, points[i,], points[-i,], output = "SpatialLines") %>%
      # get distance
      gissr::sp_length(.) %>%
      # make into data frame
      data.frame(Pop1 = i,
               Pop2 = (1:nrow(points))[-i],
               Distance = .)
  }

  return(dplyr::bind_rows(Distance))
}

```

### Ocean Distance

We get the distance over marine habitat for all populations.

```{r, Ocean distances}
Ocean.Distance.df <- Ocean.Distance.Fun(Centroids, Water.Cost, Land.mask) %>%
  # make population acronyms to match the genetic data
  dplyr::mutate(across(1:2, ~ dplyr::case_when(. == 1 ~ "UH",
                                               . == 2 ~ "RH",
                                               . == 3 ~ "TC",
                                               . == 4 ~ "LC",
                                               . == 5 ~ "TA",
                                               . == 6 ~ "NC",
                                               . == 7 ~ "LK",
                                               . == 8 ~ "UK",
                                               . == 9 ~ "EI",
                                               . == 10 ~ "SS",
                                               . == 11 ~ "EG",
                                               . == 12 ~ "SL",
                                               . == 13 ~ "FL",
                                               . == 14 ~ "LB",
                                               . == 15 ~ "VR",
                                               . == 16 ~ "BB",
                                               . == 17 ~ "CC",
                                               . == 18 ~ "TB",
                                               . == 19 ~ "IR",
                                               . == 20 ~ "NS")))

```
## Landscape Metrics

Here we calculate landscape metrics for each population. These metrics are related to landscape configuration including metrics like habitat area and connectedness. We first start by importing and formatting the rasters before calculating the metrics.

### Raster Formatting

We start by creating a function that will clip, format, and project our rasters.

```{r, Raster Formatting Function}
# function to crop and reproject the land cover data for each population where our habitat is 1 and all other habitat is 2
Raster.Processing.Fun <- function(path, Clipping.DF, min.code, max.code){

  # get libraries we need
  library(raster)
  library(tidyverse)
  library(sf)

  # get list of raster files
  File.List <- list.files(path = path, full.names = TRUE)

  # make empty list to hold the rasters
  Cropped.Raster.List <- list()

  # iterate through all the populations
  for(i in 1:nrow(Clipping.DF)) {

    # iterate throug all the raster
    for( z in 1:length(File.List)){

      # import raster
      Raster <- raster::raster(File.List[[z]])

      # test if the extent of the population is within the extent of the raster
      if(sf::st_bbox(Clipping.DF$geometry[i])[[1]] >= sf::st_bbox(Raster)[[1]] &
         sf::st_bbox(Clipping.DF$geometry[i])[[2]] >= sf::st_bbox(Raster)[[2]] &
         sf::st_bbox(Clipping.DF$geometry[i])[[3]] <= sf::st_bbox(Raster)[[3]] &
         sf::st_bbox(Clipping.DF$geometry[i])[[4]] <= sf::st_bbox(Raster)[[4]]) {

        # get the cropped raster
        Cropped.Raster.List[i] <- raster::mask(raster::crop(Raster,
                                                            raster::extent(Clipping.DF[i,])),
                                               Clipping.DF[i,]) %>%
          # project the raster to meters
          raster::projectRaster(., crs = Clipping.DF$Projection[[i]]) %>%
          # recategorize all the cells if they are between two values
          raster::calc(., fun = function(x){
            ifelse(x > min.code & x < max.code, 1, 2)})
      }
    }
    }

  # name the rasters by the populations used to crop them
  names(Cropped.Raster.List) <- Clipping.DF$Population_ID

  # return the rasters
  return(Cropped.Raster.List)
}
```


We create a data frame that has the population buffers, their names, and the projection that should be used to convert them to UTM.

```{r, Clipping Data frame}
# make sf object for easier graphing and to combine with population labels
Labeled.Pop.Groups <- sf::st_as_sf(Population.Groups, group = TRUE) %>%
  # unproject
  sf::st_transform(., crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0") %>%
  #combine with population labels
  cbind(., Population_ID = c("UH", "RH", "TC", "LC", "TA", "NC", "LK", "UK", "EI", "SS", "EG", "SL", "FL", "LB", "VR", "BB", "CC", "TB", "IR", "NS")) %>%
  # combine with the projection for each population
  cbind(., Projection = c( "+proj=utm +zone=16 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=16 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=16 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=16 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=16 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=16 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=17 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=17 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=18 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=18 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=17 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=17 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=17 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=18 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=17 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=17 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=17 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=17 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=17 +ellps=clrk66 +units=m +no_defs",
                           "+proj=utm +zone=17 +ellps=clrk66 +units=m +no_defs"))

```

We actually format the rasters with the data frame and path to the rasters
```{r, Format Rasters}
# crop and project the rasters
Cropped.Rasters <- Raster.Processing.Fun(path = paste(getwd(),"Data/GIS/", sep = "/"),
                                         Clipping.DF = Labeled.Pop.Groups,
                                         min.code = 9,
                                         max.code = 15)
```

### Calcualte Landscape Metrics

Here we calculate the landscape metrics.

```{r, Landscape Metrics}
# calculate the landscape metrics for each population
Landscape.Metrics <- lapply(Cropped.Rasters, FUN = function(x) {
  x %>%
    # calculate the landscape metrics
    landscapemetrics::calculate_lsm(., level = "class", directions = 8, count_boundary = FALSE,
                                    consider_boundary = FALSE, edge_depth = 1, cell_center = FALSE,
                                    neighbourhood = 8, what = c("lsm_c_ca",
                                                                "lsm_c_cohesion",
                                                                "lsm_c_ed",
                                                                "lsm_c_np")) %>%
    # only keep it for our class
    dplyr::filter(class == 1)
}) %>%