methane-cycling-soil-microbiomes/summary_figures.Rmd at main · delaney-beals/methane-cycling-soil-microbiomes · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "RBC 16S analysis: microbial community summary"
output: html_document
---

Resource: <https://astrobiomike.github.io/amplicon/dada2_workflow_ex#taxonomic-summaries>

# Summary figures for RBC 16S amplicon sequencing data

The objective is to create summary figures of the microbial taxa that are present in my data that show the general trends of what microbes are in each sample. These will then feed into more specific pairwise comparisons to where I see if there is any significant difference in the microbial communities when we look at June vs. October, riparian vs. upland soil, gDNA vs. cDNA, etc. These figures will be a continuation of the summary figures that we make towards the end of the official "June_Oct_16S_MERGE_analysis" pipeline.

Here we will use the phyloseq package to do most of the work. My understanding of how this works is that from our processing and analysis pipeline we have generated three things:

-   raw count table: samples x ASVs = \# of times each ASV was counted in each sample

-   taxa table: ASVs x taxonomic classification = what each ASV corresponds to in terms of Kingdom, Phylum, etc.

-   metadata table: any additional information about our samples such as month, soil type, etc.

Then with the phyloseq package we will combine all three of these tables into one phyloseq object from which we can pull out specific information and perform transformations and whatnot while keeping all the information linked in the correct way.

### A note on relative abundance

For most, if not all, of the following figures we are showing the relative abundance of different ASVs (or groups of ASVs) for each sample. Relative abundance is a method of normalization, so we will be starting with raw counts and then calculating relative abundance from there. To do that, we tally up all the counts from all ASVs found in a single sample, then calculate the fraction that each ASV makes up of that summed total. Basically, we are calculating what percent each ASV makes up each sample. The **key point** is that in order to keep things reproducible, we need to make sure that we are using ALL of the ASVs available for each sample when we calculate the total. Further down we will get into filtering/subsetting data into categories and then calculating relative abundance, but that runs the risk of us accidentally leaving out some ASV counts that would affect the total ASV number. There are two ways I am going to use to make sure that doesn't happen:

1.  when I am filtering first and then calculating relative abundance, I will compare the total counts of my filtered table to my original raw counts table and only perform the relative abundance calculation when both those tables are identical in sums. The way to do this is to create a category for "unclassified" or "other" taxa each time I perform a filtering step so that those ASVs are still counted in my final numbers, even though I don't necessarily care what taxa those ASVs belong to.

2.  alternatively, I can create a new phyloseq object from the original that transforms the count table to relative abundance values before doing anything else. This way, I can filter and subset by taxa of interest as much as I want and the actual values for each ASV won't change, as those were calculated when I had all possible ASVs included. However, I might run into issues down the line because some functions require integers (i.e. non-transformed counts) as input, but I have not yet run into this issue.

The first part of this notebook will use the strategy outline in point 1, and then I will see how using a pre-transformed relative abundance count table in my phyloseq object looks in part 2.

```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(phyloseq)
library(vegan)
library(DESeq2)
library(dendextend)
library(viridis)
library(ggplot2)
library(gridExtra)
library(cowplot)

# double check what packages are installed & loaded
(.packages())
```

## Raw count phyloseq

```{r}
# Load in data
# Need to specify row.names = 1 so that sample names do not appear as their own column. Instead they are the row headings (names).
bact_count_table <- read.table("data/June_Oct_MERGE_ASV_counts.tsv", header = T, row.names = 1, check.names = F, sep = "\t")

bact_taxa_table <- as.matrix(read.table("data/June_Oct_MERGE_ASV_taxa_table.tsv", header = T, row.names = 1, check.names = F, sep = "\t"))

bact_sample_info_table <- read.table("data/June_Oct_16S_MERGE_metadata.csv", header = T, row.names = 1, sep = ",")

# using the phyloseq package, we will combine our three tables into one phyloseq object
otu_table <- phyloseq::otu_table(bact_count_table, taxa_are_rows=T)
tax_table <- phyloseq::tax_table(bact_taxa_table)
sample_data <- phyloseq::sample_data(bact_sample_info_table)

# now generate the phyloseq object
ASV_physeq <- phyloseq(otu_table, tax_table, sample_data)
```

### Kingdom subset

We are primarily interested in the Bacteria kingdom of taxa, which the majority of our ASVs likely are. We are going to first subset all our counts table into Bacteria and non-Bacteria. When I look at the taxa table, I can see that some ASVs do not have any taxonomy assigned, not even to the first level of Kingdom, so these will need to go into their own "unclassified" group to keep our ASV totals consistent.

The **tax_glom** function from phyloseq merges species that have the same taxonomy at a certain taxonomic rank. Column/ranks to the right of the rank chose to use for agglomeration will be replace with NA because they are effectively meaningless following agglomeration.

```{r}
# using phyloseq to make a count table that has summed all ASVs that were in the same kingdom
king_counts_tab <- otu_table(tax_glom(ASV_physeq, taxrank="Kingdom"))

# making a vector of kingdom names to set as row names
rownames(king_counts_tab) <- as.vector(as.vector(tax_table(tax_glom(ASV_physeq, taxrank="Kingdom"))[,1]))
```

We also have to account for sequences that weren't assigned any taxonomy even at the kingdom level these came into R as 'NAs' in the taxonomy table, but their counts are still in the count table so we can get that value for each sample by subtracting the column sums of this new table (that has everything that had a kingdom assigned to it) from the column sums of the starting count table (that has all representative sequences).

```{r}
unclassified_tax_counts_king <- colSums(bact_count_table) - colSums(king_counts_tab)

# and we'll add this row to our kingdom count table:
king_and_unidentified_counts_tab <- rbind(king_counts_tab, "Unclassified"=unclassified_tax_counts_king)

# calculate proportions
major_king_proportions_tab <- apply(king_and_unidentified_counts_tab, 2, function(x) x/sum(x)*100)

major_king_proportions_tab_t <- t(major_king_proportions_tab)

write.csv(major_king_proportions_tab_t, "king_prop.csv", sep = "\t", quote = F, col.names = NA)
```

Looking at the table we created, it seems that the Bacteria kingdom is where most of the ASVs we found in all of our samples is coming from, followed by the Unclassified category.

This is still too broad of information and can be hard to identify trends just based on kingdom-level taxonomic classification. Therefore, I don't think it's worth it to calculate relative abundances just yet. We can subset our counts further based on lower taxonomic groups, including looking specifically at the phyla that are within the bacteria kingdom, for example.


### Phyla in Bacteria subset

First we need to group all the counts that were assigned to the same phylum and then update our table to reflect the name of the phylum rather than the ASV identifier.

```{r}
# making count table broken down by phylum
phyla_counts_tab <- otu_table(tax_glom(ASV_physeq, taxrank="Phylum"))

# making a table that holds the kingdom and phylum level info for each ASV
phy_tax_king_tab <- tax_table(tax_glom(ASV_physeq, taxrank="Phylum"))
phy_tax_tab <- data.frame("Kingdom"=phy_tax_king_tab[,1], "Phylum"=phy_tax_king_tab[,2], row.names = row.names(phy_tax_king_tab))

# changing the row names like above so that they correspond to the taxonomy, rather than an ASV identifier
rownames(phyla_counts_tab) <- as.vector(phy_tax_tab$Phylum)
```

```{r}
# Accounting for ASVs that were in Bacteria but not assigned a phylum level.
unclassified_tax_counts_phyla <- colSums(bact_count_table) - colSums(phyla_counts_tab)

# and we'll add this row to our phyla count table:
phyla_and_unidentified_counts_tab <- rbind(phyla_counts_tab, "Unclassified_Phylum"=unclassified_tax_counts_phyla)
```

Now we want to look at just the phyla that are within the Bacteria kingdom.

```{r}
# making a vector of just the Bacteria phyla
bact_phyla_vec <- as.vector(phy_tax_tab[phy_tax_tab$Kingdom == "Bacteria", "Phylum"])

# making a table of the counts of the Bacterial phyla
bact_phyla_counts_tab <- phyla_counts_tab[row.names(phyla_counts_tab) %in% bact_phyla_vec, ]
```

There are also possibly some some sequences that were resolved to the level of Bacteria, but not any further, and therefore would be missing from our phylum table we can find the sum of them by subtracting the Bacterial phyla count table from just the Bacteria row from the original kingdom-level count table

```{r}
bact_no_phy_counts <- king_and_unidentified_counts_tab[row.names(king_and_unidentified_counts_tab) %in% "Bacteria", ] - colSums(bact_phyla_counts_tab) # These values will be added to our counts table as "Unresolved Bacteria"

# since the counts assigned "Bacteria" are all accounted for (as they are all further subsetted within the phylum table, including the unidentified Bacteria), let's make a table of our kingdom counts for everything other than bacteria
non_bact <- king_and_unidentified_counts_tab[-1,]

# now combining the tables:
phyla_taxa_counts_tab <- rbind(non_bact, bact_phyla_counts_tab, "Unresolved_Bacteria"= bact_no_phy_counts)
```

To check we didn't miss any other sequences, we can compare the column sums of our new tables to the column sums of the original raw count table to see if they are the same. If "TRUE", we know nothing fell through the cracks and we can confidently calculate relative abundance.

```{r}
# Starting with just the kingdom classification table
identical(colSums(bact_count_table), colSums(king_and_unidentified_counts_tab)) # TRUE

# This is our combined table of non-bacteria kingdoms and the bacteria kingdom broken into different phyla (including Bacteria that didn't have a phylum assignment)
identical(colSums(bact_count_table), colSums(phyla_taxa_counts_tab)) # TRUE

# calculate relative abundance in PERCENT
major_phyla_proportions_tab <- apply(phyla_taxa_counts_tab, 2, function(x) x/sum(x)*100)
```

We have calculated the relative abundance of our table that has the phyla within the Bacteria kingdom as well as categories for any ASVs that are not assigned phyla within the Bacteria kingdom. We can use this table to generate some summary figures. If we check the dimensions of this table at this point,

```{r}
dim(major_phyla_proportions_tab)
```

we see there are currently 40 rows, which might be a little busy for a summary figure. Many of these taxa make up a very small percentage, so we're going to filter some out. This is a completely arbitrary decision solely to ease visualization and interpretation, entirely up to your data and you. Here, we'll only keep rows (taxa) that make up greater than 1% in any individual sample. Those that got filtered out in this step will be kept in their own category.

```{r}
temp_filt_major_phyla_proportions_tab <- as.data.frame(major_phyla_proportions_tab[apply(major_phyla_proportions_tab, 1, max) > 1, ])
# how many are above 1%
dim(temp_filt_major_phyla_proportions_tab)
# now we have 15, much more manageable for an overview figure

# though each of the filtered taxa made up less than 1% alone, together they may add up and should still be included in the overall summary so we're going to add a row called "Other" that keeps track of how much we filtered out (which will also keep our totals at 100%)
filtered_phyla <- colSums(major_phyla_proportions_tab) - colSums(temp_filt_major_phyla_proportions_tab)

filt_major_phyla_proportions_tab <- rbind(temp_filt_major_phyla_proportions_tab, "Other (low %)"=filtered_phyla)
```

Now that we have a nice proportions table ready to go, we can make some figures with it. First we need to get the information ready for plotting.

```{r}
# first let's make a copy of our table that's safe for manipulating
phyla_tax <- filt_major_phyla_proportions_tab

# and add a column of the taxa names so that it is within the table, rather than just as row names (this makes working with ggplot easier)
phyla_tax$Major_Taxa <- row.names(phyla_tax)

# now we'll transform the table into narrow, or long, format (also makes plotting easier)
phyla_tax_long <- pivot_longer(phyla_tax, !Major_Taxa, names_to = "Sample", values_to = "Proportion") %>% data.frame()

# now we want a table metadata of each sample to merge into our plotting table so we can use that more easily in our plotting function. Here we're making a new table by pulling what we want from the sample information table
sample_info_for_merge<-data.frame("Sample"=row.names(bact_sample_info_table), "month"=bact_sample_info_table$month, "soil"=bact_sample_info_table$soil, "template"=bact_sample_info_table$template, "replicate"=bact_sample_info_table$replicate, "monthsoil"=bact_sample_info_table$monthsoil, stringsAsFactors=F)

# and here we are merging this table with the plotting table we just made
phyla_tax_info <- merge(phyla_tax_long, sample_info_for_merge)
```

#### Figures

One common way to look at this is with stacked bar charts for each taxon per sample:

```{r}
ggplot(phyla_tax_info, aes(x=Sample, y=Proportion, fill=Major_Taxa)) +
  geom_bar(width=0.6, stat="identity", color = "black") +
  theme_bw() +
  theme(axis.text.x=element_text(angle=90, vjust=0.4, hjust=1), legend.title=element_blank()) +
  labs(x="Sample", y="% of 16S rRNA gene copies recovered", title="All samples")
```

We can also simply the bar plot slightly by pooling three technical replicates, but this just sums the ratios for each taxa, giving us a total of 300% instead of 100%.

```{r}
ggplot(phyla_tax_info, aes(x=replicate, y=Proportion, fill=Major_Taxa)) +
  geom_bar(width=0.6, stat="identity", color = "black") +
  theme_bw() +
  theme(axis.text.x=element_text(angle=90, vjust=0.4, hjust=1), legend.title=element_blank()) +
  labs(x="Sample", y="% of 16S rRNA gene copies recovered", title="All samples")
```

Looks like Proteobacteria is making up a lot of the Bacterial phyla.

We can also represent this same information in a box plot. This time, only the colors are the same for the replicates, we aren't pooling or summing the relative abundances for technical replicates.

```{r}
ggplot(phyla_tax_info, aes(Major_Taxa, Proportion)) +
  geom_jitter(aes(color=factor(replicate)), size=2.5, width=0.15, height=0) +
  geom_boxplot(fill=NA, outlier.color=NA) +
  theme_bw() +
  theme(axis.text.x=element_text(size=12, angle=45, hjust=1), legend.title=element_blank()) +
  labs(x="Major Taxa", y="% of 16S rRNA gene copies recovered", title="All samples")


# reordering the labels
ex <- phyla_tax_info %>%
  mutate(Major_Taxa = fct_relevel(Major_Taxa,  "Unclassified", "Other (low %)","Verrucomicrobia","Planctomycetes",   "Nitrospirae", "Latescibacteria", "Gemmatimonadetes", "Firmicutes", "candidate_division_WPS-1", "Chloroflexi", "Bacteroidetes",   "Actinobacteria", "Acidobacteria", "Proteobacteria","Unresolved_Bacteria", "Archaea"))

phyla_tax_info_reorder <- ex %>%
  mutate(template = fct_relevel(template,  "gDNA","cDNA"))

my_theme_grid <- theme_bw() +
        theme(text=element_text(size = 12),
        axis.text.x=element_text(size=10, color = "black"),
        axis.text.y = element_text(size=10, color = "black"),
        title = element_text(size = 10, color = "black"),
        legend.title=element_blank()
        )

# same plot but horizontal and with new label order
aba <- ggplot(phyla_tax_info_reorder, aes(Proportion, Major_Taxa)) +
  geom_jitter(aes(color=factor(monthsoil), shape=factor(template)),
              size=2.5, width=0.15, height=0) +
  my_theme_grid +
  scale_x_continuous(breaks = seq(0,100, by = 10)) +
  labs(x="% of 16S rRNA gene copies recovered", y="",title="Major taxa and Bacterial phyla")

print(aba)

# change color scheme of plot
# custom color plot
fill_col = c("#b2182b", "#fb9a99", "#2166ac", "#a6cee3")


ab <- ggplot(phyla_tax_info_reorder, aes(Proportion, Major_Taxa)) +
  geom_jitter(aes(fill = monthsoil, shape = template, color = "black"),
              size = 2.5, width = 0.15, height = 0) +
  scale_shape_manual(values = c(21, 24)) +
  scale_fill_manual(values = fill_col) +
  scale_color_manual(values = "black") +  # This sets the outline to black
  my_theme_grid +
  scale_x_continuous(breaks = seq(0, 100, by = 10)) +
  labs(x = "% of 16S rRNA gene copies recovered", y = "", title = "Major taxa and Bacterial phyla") +
    guides(color = "none", fill = guide_legend(override.aes = list(color = fill_col)))

ab

```

Again, we can see that most of the ASVs are assigned as Proteobacteria, so let's look further into this phylum by doing similar filtering to what we did earlier.

### Classes in Proteobacteria

First will agglomerate the count and taxa table to the Class level.

```{r}
# making count table broken down by class
class_counts_tab <- otu_table(tax_glom(ASV_physeq, taxrank="Class"))

# making a table that holds the phylum and class level info for each ASV
class_tax_phy_tab <- tax_table(tax_glom(ASV_physeq, taxrank="Class"))
class_tax_tab <- data.frame("Phylum"=class_tax_phy_tab[,2], "Class"=class_tax_phy_tab[,3], row.names = row.names(class_tax_phy_tab))

# changing the row names like above so that they correspond to the taxonomy, rather than an ASV identifier
rownames(class_counts_tab) <- as.vector(class_tax_tab$Class)
```

Because we are getting down even further in taxonomy, we are going to drop the higher-level-only classified ASVs so that our figures only show classes. This is in contrast to the previous figures where we showed a mix of taxonomic levels in the figures.

We will look specifically at the classes that are within the Proteobacteria phylum, since all of the reported methanotrophs and methylotrophs are members of Proteobacteria.

```{r}
# making a vector of just the Proteobacteria classes
proteo_classes_vec <- as.vector(class_tax_tab[class_tax_tab$Phylum == "Proteobacteria", "Class"])

# making a table of the counts of the Proteobacterial classes
proteo_class_counts_tab <- class_counts_tab[row.names(class_counts_tab) %in% proteo_classes_vec, ]

# taking into account ASVs that were assigned Proteobacteria but not any further
proteo_no_class_counts <-
  phyla_taxa_counts_tab[row.names(phyla_taxa_counts_tab) %in% "Proteobacteria", ] - colSums(proteo_class_counts_tab)
```

```{r}
# since the counts assigned "Proteobacteria" are all accounted for (as they are all further subsetted within the class table), let's make a table of our  counts for everything other than proteobacteria and just call that "non-proteobacteria" since we don't really care what it is.
non_proteo <- colSums(phyla_and_unidentified_counts_tab[-1,])

# now combining the tables:
major_proteo_counts_tab <- rbind("Non_Proteobacteria" = non_proteo,
                               proteo_class_counts_tab,
                               "Unresolved_Proteobacteria"=
                               proteo_no_class_counts)

# check to make sure we aren't missing any sequences from our original raw count table.
identical(colSums(major_proteo_counts_tab), colSums(bact_count_table)) # TRUE

# Calculate relative abundance in PERCENT
major_proteo_proportions_tab <- apply(major_proteo_counts_tab, 2, function(x) x/sum(x)*100)

# if we check the dimensions of this table at this point
dim(major_proteo_proportions_tab)
# we see there are currently 8 rows, which is fine for plotting.
```

Reformat our proportions/relative abundance table for plotting.

```{r}
# first let's make a copy of our table that's safe for manipulating
proteo_tax <- as.data.frame(major_proteo_proportions_tab)

# and add a column of the taxa names so that it is within the table, rather than just as row names (this makes working with ggplot easier)
proteo_tax$Classes <- row.names(proteo_tax)

# now we'll transform the table into narrow, or long, format (also makes plotting easier)
proteo_tax_long <- pivot_longer(proteo_tax, !Classes, names_to = "Sample", values_to = "Proportion") %>% data.frame()

# now we want a table metadata of each sample to merge into our plotting table so we can use that more easily in our plotting function. Here we're making a new table by pulling what we want from the sample information table
sample_info_for_merge <- data.frame(
  "Sample"=row.names(bact_sample_info_table),
  "month"=bact_sample_info_table$month,
  "soil"=bact_sample_info_table$soil,
  "template"=bact_sample_info_table$template,
  "replicate"=bact_sample_info_table$replicate,
  "monthsoil"=bact_sample_info_table$monthsoil,
  stringsAsFactors=F)

# and here we are merging this table with the plotting table we just made
proteo_tax_info <- merge(proteo_tax_long, sample_info_for_merge)
```

#### Figures

Starting with a stacked bar chart.

```{r}
ggplot(proteo_tax_info,
       aes(x=Sample, y=Proportion, fill=Classes)) +
  geom_bar(width=0.6, stat="identity", color = "black") +
  theme_bw() +
  theme(axis.text.x=element_text(angle=90, vjust=0.4, hjust=1), legend.title=element_blank()) +
  labs(x="Sample", y="% of 16S rRNA gene copies recovered", title="All samples")

# with pooled technical replicates
ggplot(proteo_tax_info,
       aes(x=replicate, y=Proportion, fill=Classes)) +
  geom_bar(width=0.6, stat="identity", color = "black") +
  theme_bw() +
  theme(axis.text.x=element_text(angle=90, vjust=0.4, hjust=1), legend.title=element_blank()) +
  labs(x="Sample", y="% of 16S rRNA gene copies recovered", title="All samples")
```

Box plot of the same information.

```{r}
ggplot(proteo_tax_info, aes(Classes, Proportion)) +
  geom_jitter(aes(color=factor(replicate)), size=2.5, width=0.15, height=0) +
  geom_boxplot(fill=NA, outlier.color=NA) +
  theme_bw() +
  theme(axis.text.x=element_text(size=12, angle=45, hjust=1), legend.title=element_blank()) +
  labs(x="Major Taxa", y="% of 16S rRNA gene copies recovered", title="All samples")

# reordering the labels
proteo_tax_info_reorder <- proteo_tax_info %>%
  mutate(Classes = fct_relevel(Classes,  "Non_Proteobacteria","Unresolved_Proteobacteria", "Oligoflexia","Hydrogenophilalia","Gammaproteobacteria",   "Deltaproteobacteria", "Betaproteobacteria", "Alphaproteobacteria"))

proteo_tax_info_reorder1 <- proteo_tax_info_reorder %>%
  mutate(template = fct_relevel(template,  "gDNA","cDNA"))

# same plot but horizontal and with new label order
bab <- ggplot(proteo_tax_info_reorder1, aes(Proportion, Classes)) +
  geom_jitter(aes(color=factor(monthsoil), shape=factor(template)),
              size=2.5, width=0.15, height=0) +
  my_theme_grid +
  scale_x_continuous(breaks = seq(0,100, by = 10)) +
  labs(x="% of 16S rRNA gene copies recovered", y = "", title = "Major classes within Proteobacteria")

print(bab)

# new color scheme
ba <- ggplot(proteo_tax_info_reorder1, aes(Proportion, Classes)) +
  geom_jitter(aes(fill = monthsoil, shape = template, color = "black"),
              size = 2.5, width = 0.15, height = 0) +
  scale_shape_manual(values = c(21, 24)) +
  scale_fill_manual(values = fill_col) +
  scale_color_manual(values = "black") +
  my_theme_grid +
  scale_x_continuous(breaks = seq(0, 100, by = 10)) +
  labs(x = "% of 16S rRNA gene copies recovered", y = "", title = "Major classes within Proteobacteria") +
    guides(color = "none", fill = guide_legend(override.aes = list(color = fill_col)))

ba
```

```{r}
# stack two grids
grid.arrange(ab, ba, ncol = 1)

# share a legend
legend3 <- get_legend(ab)
taxa_combo <- plot_grid(ab + theme(legend.position = "none"),
          ba + theme(legend.position = "none"),
          nrow=2,  rel_heights = c(1, 0.8))


plot_grid(taxa_combo, legend3,
          ncol=2, rel_widths = c(1, 0.3), rel_heights = c(1, 0.3))
```

Since the non-proteobacteria category accounts for a lot of the ASV counts, we can create a table that leaves that out, so we can just focus on the relative abundances of the Proteobacteria.

```{r}
# making a copy of our proportions table that doesn't include the "non-proteobacteria" row
just_proteo <- as.data.frame(major_proteo_proportions_tab[-1,])

# add a column of the taxa names so that it is within the table, rather than just as row names (this makes working with ggplot easier)
just_proteo$Classes <- row.names(just_proteo)

# transform the table into narrow, or long, format (also makes plotting easier)
just_proteo_long <- pivot_longer(just_proteo, !Classes, names_to = "Sample", values_to = "Proportion") %>% data.frame()

# merge this table with out metadata table
just_proteo_info <- merge(just_proteo_long, sample_info_for_merge)
```

Remake figures with Proteobacteria-only table

```{r}
ggplot(just_proteo_info,
       aes(x=Sample, y=Proportion, fill=Classes)) +
  geom_bar(width=0.6, stat="identity", color = "black") +
  theme_bw() +
  theme(axis.text.x=element_text(angle=90, vjust=0.4, hjust=1), legend.title=element_blank()) +
  labs(x="Sample", y="% of 16S rRNA gene copies recovered", title="All samples")

ggplot(just_proteo_info, aes(Classes, Proportion)) +
  geom_jitter(aes(color=factor(replicate)), size=2.5, width=0.15, height=0) +
  geom_boxplot(fill=NA, outlier.color=NA) +
  theme_bw() +
  theme(axis.text.x=element_text(size=12, angle=45, hjust=1), legend.title=element_blank()) +
  labs(x="Major Taxa", y="% of 16S rRNA gene copies recovered", title="All samples")
```

Let's also look at a bubble plot to see a further breakdown among our sample characteristics.

```{r}
colours = c("#21ADFF","#4FFF00")

ggplot(just_proteo_info, aes(x = Sample, y = Classes)) +
  geom_point(aes(size = Proportion, fill = soil),
             alpha = 5, shape = 21) +
  scale_size_continuous(limits = c(0.00001, 10), range = c(1,10),
                        breaks = c(0.1, 1, 10)) +
  facet_grid(template ~ month, margins = T) +
  labs( x= "", y = "", size = "Relative Abundance (%)", fill = "") +
  scale_fill_manual(values = colours) +
  theme_bw() +
  theme(text = element_text(size = 10),
        axis.text.x = element_text(size = 10,
                                   color = "black",
                                   angle = -90,
                                   hjust = 0),
        axis.text.y = element_text(size = 10,
                                   color = "black")
        )
```

### Gammaproteobacteria

I want to look more into what's going on with October upland cDNA replicate 3, so I am going to look at the orders within Gammaproteobacteria.

First we need to group all the counts that were assigned to the same class and then update our table to reflect the name of the class rather than the ASV identifier.

```{r}
# making count table broken down by class
class_counts_tab <- otu_table(tax_glom(ASV_physeq, taxrank="Class"))

# making a table that holds the phylum and class level info for each ASV
class_tax_phy_tab <- tax_table(tax_glom(ASV_physeq, taxrank="Class"))
class_tax_tab <- data.frame("Phylum"=class_tax_phy_tab[,2], "Class"=class_tax_phy_tab[,3], row.names = row.names(class_tax_phy_tab))

# changing the row names like above so that they correspond to the taxonomy, rather than an ASV identifier
rownames(class_counts_tab) <- as.vector(class_tax_tab$Class)
```

```{r}
# Accounting for ASVs that were not assigned a class level.
unclassified_tax_counts_class <- colSums(bact_count_table) - colSums(class_counts_tab)

# and we'll add this row to our class count table:
class_and_unidentified_counts_tab <- rbind(class_counts_tab, "No_Class_assigned"=unclassified_tax_counts_class)

# check that we are on the right track
identical(colSums(bact_count_table), colSums(class_and_unidentified_counts_tab))
```

Now we want to look at just the classes that are within the gammaproteobacterial class.

```{r}
# making count table broken down by order
order_counts_tab <- otu_table(tax_glom(ASV_physeq, taxrank="Order"))

# making a table that holds the class and order level info for each ASV
order_tax_class_tab <- tax_table(tax_glom(ASV_physeq, taxrank="Order"))

order_tax_tab <- data.frame("Class"=order_tax_class_tab[,3], "Order"=order_tax_class_tab[,4], row.names = row.names(order_tax_class_tab))

# changing the row names like above so that they correspond to the taxonomy, rather than an ASV identifier
rownames(order_counts_tab) <- as.vector(order_tax_tab$Order)
```

Because we are getting down even further in taxonomy, we are going to drop the higher-level-only classified ASVs so that our figures only show orders.

We will look specifically at the classes that are within the Gammaproteobacteria class

```{r}
# making a vector of just the Gammaproteobacteria orders
gamma_orders_vec <- as.vector(order_tax_tab[order_tax_tab$Class == "Gammaproteobacteria", "Order"])

# making a table of the counts of the Gammaproteobacteria orders
gamma_order_counts_tab <- order_counts_tab[row.names(order_counts_tab) %in% gamma_orders_vec, ]

# taking into account ASVs that were assigned Gammaproteobacteria but not any further
gamma_no_order_counts <- class_and_unidentified_counts_tab[row.names(class_and_unidentified_counts_tab) %in% "Gammaproteobacteria", ] - colSums(gamma_order_counts_tab)

# since the counts assigned "Gammaproteobacteria" are all accounted for (as they are all further subsetted within the phylum table, including the unidentified Gammaproteobacteria), let's make a table of our class counts for everything other than gamma
non_gamma <- class_and_unidentified_counts_tab[-2,]
non_gamma_count <- colSums(non_gamma)

# now combining the tables:
order_taxa_counts_tab <- rbind("Other" = non_gamma_count, gamma_order_counts_tab, "Unresolved_Gammaproteobacteria"= gamma_no_order_counts)

identical(colSums(bact_count_table), colSums(order_taxa_counts_tab))

major_order_proportions_tab <- apply(order_taxa_counts_tab, 2, function(x) x/sum(x)*100)
```

```{r}
# first let's make a copy of our table that's safe for manipulating
gamma_tax <- as.data.frame(major_order_proportions_tab)

# and add a column of the taxa names so that it is within the table, rather than just as row names (this makes working with ggplot easier)
gamma_tax$Order <- row.names(gamma_tax)

# now we'll transform the table into narrow, or long, format (also makes plotting easier)
gamma_tax_long <- pivot_longer(gamma_tax, !Order, names_to = "Sample", values_to = "Proportion") %>% data.frame()

# now we want a table metadata of each sample to merge into our plotting table so we can use that more easily in our plotting function. Here we're making a new table by pulling what we want from the sample information table
sample_info_for_merge <- data.frame("Sample"=row.names(bact_sample_info_table),
                                  "month"=bact_sample_info_table$month,
                                  "soil"=bact_sample_info_table$soil,
                                  "template"=bact_sample_info_table$template,
                                  "replicate"=bact_sample_info_table$replicate,
                                  stringsAsFactors=F)

# and here we are merging this table with the plotting table we just made
gamma_tax_info <- merge(gamma_tax_long, sample_info_for_merge)
```

```{r}
ggplot(gamma_tax_info, aes(Order, Proportion)) +
  geom_jitter(aes(color=factor(replicate)), size=2.5, width=0.15, height=0) +
  geom_boxplot(fill=NA, outlier.color=NA) +
  theme_bw() +
  theme(axis.text.x=element_text(size=12, angle=45, hjust=1), legend.title=element_blank()) +
  labs(x="Gammaproteobacteria orders", y="% of 16S rRNA gene copies recovered", title="All samples")

# switching axes
ggplot(gamma_tax_info, aes(Proportion, Order)) +
  geom_jitter(aes(color=factor(replicate)), size=2.5, width=0.15, height=0) +
  geom_boxplot(fill=NA, outlier.color=NA) +
  theme_bw() +  my_theme +
  theme(axis.text.x=element_text(size=12), legend.title=element_blank()) +
  labs(x="% of 16S rRNA gene copies recovered" , y= "Gammaproteobacteria orders")
```

### Methylococcales

```{r}
# Accounting for ASVs that were not assigned a order level.
unclassified_tax_counts_order <- colSums(bact_count_table) - colSums(order_counts_tab)

# and we'll add this row to our class count table:
order_and_unidentified_counts_tab <- rbind(order_counts_tab, "No_Order_assigned"=unclassified_tax_counts_order)

# check that we are on the right track
identical(colSums(bact_count_table), colSums(order_and_unidentified_counts_tab))
```

```{r}
# making count table broken down by family
fam_counts_tab <- otu_table(tax_glom(ASV_physeq, taxrank="Family"))

# making a table that holds the order and family level info for each ASV
fam_tax_order_tab <- tax_table(tax_glom(ASV_physeq, taxrank="Family"))

fam_tax_tab <- data.frame("Order"=fam_tax_order_tab[,4], "Family"=fam_tax_order_tab[,5], row.names = row.names(fam_tax_order_tab))

# changing the row names like above so that they correspond to the taxonomy, rather than an ASV identifier
rownames(fam_counts_tab) <- as.vector(fam_tax_tab$Family)
```

```{r}
# making a vector of just the Methylococcales families
meth_fam_vec <- as.vector(fam_tax_tab[fam_tax_tab$Order == "Methylococcales", "Family"])

# making a table of the counts of the Gammaproteobacteria orders
meth_fam_counts_tab <- fam_counts_tab[row.names(fam_counts_tab) %in% meth_fam_vec, ]

# taking into account ASVs that were assigned Gammaproteobacteria but not any further
meth_no_fam_counts <- order_and_unidentified_counts_tab[row.names(order_and_unidentified_counts_tab) %in% "Methylococcales", ] - colSums(meth_fam_counts_tab)

# since the counts assigned "Gammaproteobacteria" are all accounted for (as they are all further subsetted within the phylum table, including the unidentified Gammaproteobacteria), let's make a table of our class counts for everything other than gamma
non_meth <- order_and_unidentified_counts_tab[-7,]
non_meth_count <- colSums(non_meth)

# now combining the tables:
fam_taxa_counts_tab <- rbind("Other" = non_meth_count, meth_fam_counts_tab, "Unresolved_Methylococcales"= meth_no_fam_counts)

identical(colSums(bact_count_table), colSums(fam_taxa_counts_tab))

major_family_proportions_tab <- apply(fam_taxa_counts_tab, 2, function(x) x/sum(x)*100)

```

```{r}
# first let's make a copy of our table that's safe for manipulating
meth_tax <- as.data.frame(major_family_proportions_tab)

# and add a column of the taxa names so that it is within the table, rather than just as row names (this makes working with ggplot easier)
meth_tax$Family <- row.names(meth_tax)

# now we'll transform the table into narrow, or long, format (also makes plotting easier)
meth_tax_long <- pivot_longer(meth_tax, !Family, names_to = "Sample", values_to = "Proportion") %>% data.frame()

# and here we are merging this table with the plotting table we just made
meth_tax_info <- merge(meth_tax_long, sample_info_for_merge)
```

```{r}
ggplot(meth_tax_info, aes(Proportion,Family )) +
  geom_jitter(aes(color=factor(replicate)), size=2.5, width=0.15, height=0) +
  geom_boxplot(fill=NA, outlier.color=NA) +
  theme_bw() +  my_theme +
  theme(axis.text.x=element_text(size=12), legend.title=element_blank()) +
  labs(x="% of 16S rRNA gene copies recovered", y="Methylococcales families", title="All samples")
```

### Methylococcaceae

```{r}
# Accounting for ASVs that were not assigned a family level.
unclassified_tax_counts_fam <- colSums(bact_count_table) - colSums(fam_counts_tab)

# and we'll add this row to our class count table:
fam_and_unidentified_counts_tab <- rbind(fam_counts_tab, "No_Family_assigned"=unclassified_tax_counts_fam)

# check that we are on the right track
identical(colSums(bact_count_table), colSums(fam_and_unidentified_counts_tab))
```

```{r}
# making count table broken down by genus
gen_counts_tab <- otu_table(tax_glom(ASV_physeq, taxrank="Genus"))

# making a table that holds the family and genus level info for each ASV
gen_tax_fam_tab <- tax_table(tax_glom(ASV_physeq, taxrank="Genus"))

gen_tax_tab <- data.frame("Family"=gen_tax_fam_tab[,5], "Genus"=gen_tax_fam_tab[,6], row.names = row.names(gen_tax_fam_tab))

# changing the row names like above so that they correspond to the taxonomy, rather than an ASV identifier
rownames(gen_counts_tab) <- as.vector(gen_tax_tab$Genus)
```

```{r}
# making a vector of just the Methylococcaceae genera
meth_gen_vec <- as.vector(gen_tax_tab[gen_tax_tab$Family == "Methylococcaceae", "Genus"])

# making a table of the counts of the Gammaproteobacteria orders
meth_gen_counts_tab <- gen_counts_tab[row.names(gen_counts_tab) %in% meth_gen_vec, ]

# taking into account ASVs that were assigned Gammaproteobacteria but not any further
meth_no_gen_counts <- fam_and_unidentified_counts_tab[row.names(fam_and_unidentified_counts_tab) %in% "Methylococcaceae", ] - colSums(meth_gen_counts_tab)

# since the counts assigned "Gammaproteobacteria" are all accounted for (as they are all further subsetted within the phylum table, including the unidentified Gammaproteobacteria), let's make a table of our class counts for everything other than gamma
non_methy <- fam_and_unidentified_counts_tab[-5,]
non_methy_count <- colSums(non_methy)

# now combining the tables:
gen_taxa_counts_tab <- rbind("Other" = non_methy_count,
                             meth_gen_counts_tab,
                             "Unresolved_Methylococcaceae"= meth_no_gen_counts)

identical(colSums(bact_count_table), colSums(gen_taxa_counts_tab))

major_gen_proportions_tab <- apply(gen_taxa_counts_tab, 2, function(x) x/sum(x)*100)
```

```{r}
# first let's make a copy of our table that's safe for manipulating
methy_tax <- as.data.frame(major_gen_proportions_tab)

# and add a column of the taxa names so that it is within the table, rather than just as row names (this makes working with ggplot easier)
methy_tax$Genus <- row.names(methy_tax)

# now we'll transform the table into narrow, or long, format (also makes plotting easier)
methy_tax_long <- pivot_longer(methy_tax, !Genus, names_to = "Sample", values_to = "Proportion") %>% data.frame()

# and here we are merging this table with the plotting table we just made
methy_tax_info <- merge(methy_tax_long, sample_info_for_merge)
```

```{r}
ggplot(methy_tax_info, aes(Genus, Proportion)) +
  geom_jitter(aes(color=factor(replicate)), size=2.5, width=0.15,
              height=0) +
  geom_boxplot(fill=NA, outlier.color=NA) +
  theme_bw() +
  theme(axis.text.x=element_text(size=12, angle=45, hjust=1),
        legend.title=element_blank()) +
  labs(x="Methylococcaceae genera",
       y="% of 16S rRNA gene copies recovered", title="All samples")
```

### Phyla in Archaea subset
```{r}
# making a vector of just the Archaea phyla
arch_phyla_vec <- as.vector(phy_tax_tab[phy_tax_tab$Kingdom == "Archaea", "Phylum"])

# making a table of the counts of the Archaeal phyla
arch_phyla_counts_tab <- phyla_counts_tab[row.names(phyla_counts_tab) %in% arch_phyla_vec, ]
```

```{r}
arch_no_phy_counts <- king_and_unidentified_counts_tab[row.names(king_and_unidentified_counts_tab) %in% "Archaea", ] - colSums(arch_phyla_counts_tab) # These values will be added to our counts table as "Unresolved Archaea"

# since the counts assigned "Archaea" are all accounted for (as they are all further subsetted within the phylum table, including the unidentified Archaea), let's make a table of our kingdom counts for everything other than Archaea
non_arch <- king_and_unidentified_counts_tab[-2,]

# now combining the tables:
phyla_taxa_counts_tab_arch <- rbind(non_arch, arch_phyla_counts_tab,
                                    "Unresolved_Archaea"= arch_no_phy_counts)
```

To check we didn't miss any other sequences, we can compare the column sums of our new tables to the column sums of the original raw count table to see if they are the same. If "TRUE", we know nothing fell through the cracks and we can confidently calculate relative abundance.

```{r}
# Starting with just the kingdom classification table
identical(colSums(bact_count_table), colSums(king_and_unidentified_counts_tab)) # TRUE

# This is our combined table of non-archaea kingdoms and the archaea kingdom broken into different phyla (including Archaea that didn't have a phylum assignment)
identical(colSums(bact_count_table), colSums(phyla_taxa_counts_tab_arch)) # TRUE

# calculate relative abundance in PERCENT
major_phyla_proportions_tab_arch <- apply(phyla_taxa_counts_tab_arch, 2, function(x) x/sum(x)*100)
```

We have calculated the relative abundance of our table that has the phyla within the Bacteria kingdom as well as categories for any ASVs that are not assigned phyla within the Bacteria kingdom. We can use this table to generate some summary figures. If we check the dimensions of this table at this point,

```{r}
dim(major_phyla_proportions_tab_arch)
```

```{r}
temp_filt_major_phyla_proportions_tab <- as.data.frame(major_phyla_proportions_tab[apply(major_phyla_proportions_tab, 1, max) > 1, ])
# how many are above 1%
dim(temp_filt_major_phyla_proportions_tab)
# now we have 15, much more manageable for an overview figure

# though each of the filtered taxa made up less than 1% alone, together they may add up and should still be included in the overall summary so we're going to add a row called "Other" that keeps track of how much we filtered out (which will also keep our totals at 100%)
filtered_phyla <- colSums(major_phyla_proportions_tab) - colSums(temp_filt_major_phyla_proportions_tab)

filt_major_phyla_proportions_tab <- rbind(temp_filt_major_phyla_proportions_tab, "Other (low %)"=filtered_phyla)
```

Now that we have a nice proportions table ready to go, we can make some figures with it. First we need to get the information ready for plotting.

```{r}
# first let's make a copy of our table that's safe for manipulating
phyla_tax_arch <- as.data.frame(major_phyla_proportions_tab_arch)

# and add a column of the taxa names so that it is within the table, rather than just as row names (this makes working with ggplot easier)
phyla_tax_arch$Major_Taxa <- row.names(phyla_tax_arch)

# now we'll transform the table into narrow, or long, format (also makes plotting easier)
phyla_tax_long_arch <- pivot_longer(phyla_tax_arch, !Major_Taxa, names_to = "Sample", values_to = "Proportion") %>% data.frame()

# now we want a table metadata of each sample to merge into our plotting table so we can use that more easily in our plotting function. Here we're making a new table by pulling what we want from the sample information table
sample_info_for_merge_arch<-data.frame("Sample"=row.names(bact_sample_info_table), "month"=bact_sample_info_table$month, "soil"=bact_sample_info_table$soil, "template"=bact_sample_info_table$template, "replicate"=bact_sample_info_table$replicate, "monthsoil"=bact_sample_info_table$monthsoil, stringsAsFactors=F)

# and here we are merging this table with the plotting table we just made
phyla_tax_info_arch <- merge(phyla_tax_long_arch, sample_info_for_merge)

# since there is so much bacteria, let's remove it from the table so it doesn't crowd out the figures
phyla_tax_arch_no_bact <- phyla_tax_arch[-1,]
phyla_tax_arch_no_bact$Major_Taxa <- row.names(phyla_tax_arch_no_bact)
phyla_tax_long_arch_no_bact <- pivot_longer(phyla_tax_arch_no_bact, !Major_Taxa, names_to = "Sample", values_to = "Proportion") %>% data.frame()
phyla_tax_info_arch_no_bact <- merge(phyla_tax_long_arch_no_bact, sample_info_for_merge)
```

#### Figures

One common way to look at this is with stacked bar charts for each taxon per sample:

```{r}
ggplot(phyla_tax_info_arch_no_bact, aes(x=Sample, y=Proportion, fill=Major_Taxa)) +
  geom_bar(width=0.6, stat="identity", color = "black") +
  theme_bw() +
  theme(axis.text.x=element_text(angle=90, vjust=0.4, hjust=1), legend.title=element_blank()) +
  labs(x="Sample", y="% of 16S rRNA gene copies recovered", title="All samples")

# removing Eukaryota from the plot
phyla_tax_info_arch_no_bact_euk <- phyla_tax_info_arch_no_bact %>%
  filter(Major_Taxa != "Eukaryota")

ggplot(phyla_tax_info_arch_no_bact_euk, aes(x=Sample, y=Proportion, fill=Major_Taxa)) +
  geom_bar(width=0.6, stat="identity", color = "white") +
  theme_bw() +
  theme(axis.text.x=element_text(angle=90, vjust=0.4, hjust=1), legend.title=element_blank()) +
  labs(x="Sample", y="% of 16S rRNA gene copies recovered", title="All samples") +
   scale_fill_manual(values = colors)

# Set specific colors
colors["Crenarchaeota"] <-"#af4b91"
colors["Diapherotrites"] <-"#541352ff"
colors["Pacearchaeota"] <-"#FFC100"
colors["Thaumarchaeota"] <-"#00a0e1"
colors["Euryarchaeota"] <-"red"

# reorder taxa
taxaLevels_arch <- c(sort(setdiff(unique(phyla_tax_info_arch_no_bact_euk$Major_Taxa),
                                  c("Euryarchaeota","Crenarchaeota","Diapherotrites", "Pacearchaeota", "Thaumarchaeota", "Woesearchaeota", "Unclassified", "Unresolved_Archaea"))),"Euryarchaeota","Crenarchaeota","Diapherotrites", "Pacearchaeota", "Thaumarchaeota", "Woesearchaeota", "Unclassified", "Unresolved_Archaea" )
phyla_tax_info_arch_no_bact_euk$Major_Taxa <- factor(phyla_tax_info_arch_no_bact_euk$Major_Taxa, levels = taxaLevels_arch)


ggplot(phyla_tax_info_arch_no_bact_euk, aes(x=Sample, y=Proportion, fill=Major_Taxa)) +
  geom_bar(width=0.6, stat="identity", color = "white") +
  theme_bw() +
  theme(axis.text.x=element_text(angle=90, vjust=0.4, hjust=1), legend.title=element_blank()) +
  labs(x="Sample", y="% of 16S rRNA gene copies recovered", title="All samples") +
  scale_fill_manual(values = colors) +
  ylim(0,3)
```
Box plot of the same information.

```{r}
# custom color plot
fill_col = c("#b2182b", "#fb9a99", "#2166ac", "#a6cee3")

# reordering the labels
phyla_tax_info_arch_reorder  <- phyla_tax_info_arch %>%
  mutate(Phyla = fct_relevel(Major_Taxa,  "Unresolved_Archaea", "Unclassified",  "Eukaryota","Bacteria","Woesearchaeota","Thaumarchaeota",   "Pacearchaeota", "Euryarchaeota", "Diapherotrites", "Crenarchaeota"))

phyla_tax_info_arch_reorder <- phyla_tax_info_arch_reorder %>%
  mutate(template = fct_relevel(template,  "gDNA","cDNA"))

# new color scheme
major_taxa_arch_phyla <- ggplot(phyla_tax_info_arch_reorder, aes(Proportion, Phyla)) +
  geom_jitter(aes(fill = monthsoil, shape = template, color = "black"),
              size = 2.5, width = 0.15, height = 0) +
  scale_shape_manual(values = c(21, 24)) +
  scale_fill_manual(values = fill_col) +
  scale_color_manual(values = "black") +
  my_theme_grid +
    labs(x = "% of 16S rRNA gene copies recovered", y = "", title = "Major taxa and Archaeal phyla") +
    guides(color = "none", fill = guide_legend(override.aes = list(color = fill_col)))

major_taxa_arch_phyla
```


### Classes within Archaea
First will agglomerate the count and taxa table to the Class level.

```{r}
# FINDING ALL "UNCLASSIFIED" ASVS FOR ARCHAEAL CLASS PLOT
# Extract the taxonomy table
taxonomy_table <- phyloseq::tax_table(ASV_physeq)

# Extract the OTU count table
otu_table1 <- otu_table(ASV_physeq)

# Identify ASVs assigned to "Archaea" kingdom but not a phylum
archaea_no_phylum <- rownames(taxonomy_table)[taxonomy_table[, "Kingdom"] == "Archaea" & is.na(taxonomy_table[, "Phylum"])]

# Identify ASVs assigned to a phylum but not a class
archaea_phylum_no_class <- rownames(taxonomy_table)[taxonomy_table[, "Kingdom"] == "Archaea" &
                                                     !is.na(taxonomy_table[, "Phylum"]) &
                                                     is.na(taxonomy_table[, "Class"])]

# Identify ASVs assigned to "Archaea" kingdom with a class
archaea_with_class <- rownames(taxonomy_table)[taxonomy_table[, "Kingdom"] == "Archaea" &
                                                !is.na(taxonomy_table[, "Class"])]

# Sum the OTU counts for these ASVs
otu_sum_archaea_no_phylum <- colSums(otu_table1[rownames(otu_table1) %in% archaea_no_phylum, ])
otu_sum_archaea_phylum_no_class <- colSums(otu_table1[rownames(otu_table1) %in% archaea_phylum_no_class, ])
otu_sum_archaea_with_class <- colSums(otu_table1[rownames(otu_table1) %in% archaea_with_class, ])

# check to make sure I didn't lose any Archaea ASVs
archaea_all <- rbind(otu_sum_archaea_no_phylum,otu_sum_archaea_phylum_no_class,otu_sum_archaea_with_class)
archaea_all_sums <- colSums(archaea_all)


archaea_total <- rownames(taxonomy_table)[taxonomy_table[, "Kingdom"] == "Archaea"]
otu_sum_archaea_total <- colSums(otu_table1[rownames(otu_table1) %in% archaea_total, ])

identical(archaea_all_sums, otu_sum_archaea_total) # true

# find relative abundance
bact_count_sums <- colSums(bact_count_table)

otu_sum_archaea_phylum_no_class_prop <- (otu_sum_archaea_phylum_no_class/bact_count_sums)*100
otu_sum_archaea_no_phylum_prop <- (otu_sum_archaea_no_phylum/bact_count_sums)*100

combined_df <- t(data.frame(
  Unresolved_Archaeal_phyla = otu_sum_archaea_phylum_no_class_prop,
  Unresolved_Archaea = otu_sum_archaea_no_phylum_prop,
  row.names = colnames(bact_count_table)
))
```

```{r}
# making count table broken down by class
class_counts_tab_arch <- otu_table(tax_glom(ASV_physeq, taxrank="Class"))

# making a table that holds the kingdom, phylum, and class level info for each ASV
class_taxa_tab_arch <- phyloseq::tax_table(tax_glom(ASV_physeq, taxrank="Class"))

class_tax_tab_arch <- data.frame("Kingdom"=class_taxa_tab_arch[,1], "Phylum"=class_taxa_tab_arch[,2], "Class"=class_taxa_tab_arch[,3], row.names = row.names(class_taxa_tab_arch))

# changing the row names like above so that they correspond to the taxonomy, rather than an ASV identifier
rownames(class_counts_tab_arch) <- as.vector(class_tax_tab_arch$Class)

# making a vector of just the Archaea classes
arch_classes_vec <- as.vector(class_tax_tab_arch[class_tax_tab_arch$Kingdom == "Archaea", "Class"])

# making a table of the counts of the Archaea classes
arch_class_counts_tab <- class_counts_tab_arch[row.names(class_counts_tab_arch) %in% arch_classes_vec, ]

colSums(arch_class_counts_tab)

identical(otu_sum_archaea_with_class, colSums(arch_class_counts_tab))

# find relative abundance
bact_count_sums <- colSums(bact_count_table)
archaea_all/bact_count_sums

arch_class <- as.data.frame(arch_class_counts_tab)

arch_class_prop <- (arch_class/bact_count_sums)*100

#combine proportions table with unclassified
all_arch_class_prop <- rbind(combined_df, arch_class_prop)

```

Reformat our proportions/relative abundance table for plotting.
```{r}
# first let's make a copy of our table that's safe for manipulating
arch_classes <- as.data.frame(all_arch_class_prop)

# and add a column of the taxa names so that it is within the table, rather than just as row names (this makes working with ggplot easier)
arch_classes$Classes <- row.names(arch_classes)

# now we'll transform the table into narrow, or long, format (also makes plotting easier)
arch_classes_long <- pivot_longer(arch_classes, !Classes, names_to = "Sample", values_to = "Proportion") %>% data.frame()