-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathmake-a-tree.qmd
More file actions
354 lines (235 loc) · 16.8 KB
/
Copy pathmake-a-tree.qmd
File metadata and controls
354 lines (235 loc) · 16.8 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
# Making a Phylogenetic Tree {#sec-making-a-tree}
## Introduction
In this section, you will start with a set of bacterial protein sequences and use these to construct and visualise a phylogenetic tree, using online bioinformatics software tools. To do this you will move through the following stages:
1. Acquire sequence data
2. Align sequences
3. Construct a tree from the alignment
4. Visualise the tree
::: { .callout-note }
Although the principles and the order of actions taken here are the same as we would use at a research level, most research-level phylogenetic reconstruction you will see is likely to use a different set of tools than those employed here.
:::
### Scenario
You are a biomedical researcher interested in evolution of the pathogen _Burkholderia pseudomallei_, which causes [melioidosis](https://en.wikipedia.org/wiki/Melioidosis) in humans (@Chewapreecha2017-lx).
.](assets/images/hepatic-melioidosis.jpg){#fig-hepatic-melioidosis width=80%}
Depending on whether you live in a medically well-resourced, or poorly-resourced, area the risk of death from infection varies from 10%-40%. The pathogen itself has the capability to infect a range of human cells and evade the immune system, in part because it produces proteins called _effectors_: these are proteins adapted to interfere with host biology and make the environment more favourable for the bacterium.
The protein you are studying, BipC, is such an _effector_ protein. Specifically, it is a _translocator_ protein that makes holes in the host's cells, allowing its own effector proteins to enter and manipulate host biochemistry, and also binds to [actin microfilaments](https://en.wikipedia.org/wiki/Actin), disrupting host cell structure.
You would like to understand the evolution of this protein, to see whether the knowledge can help you design drug compounds that interfere with the action of BipC as a treatment for melioidosis. To do this, you will produce a phylogenetic tree for the protein.
## Obtain sequences from UniProt
The [UniProt database](https://www.uniprot.org) is the most comprehensive protein sequence database, combining protein sequence, structure, and annotation data. UniProt has an entry for a relative of the protein you are interested in, which has accession `Q62B08`.
::: { .callout-important title="Task" }
**Go to the UniProt database [link](https://www.uniprot.org) and find the record associated with accession number `Q62B08`.**
:::
::: { .callout-tip collapse="true" }
## I need a hint!
On the main UniProt landing page, enter the accession of the sequence you are searching for in the field under `Find your protein` and click `Search`
{#fig-uniprot-query width=80%}
:::
::: { .callout-caution collapse="true" }
## Check your result
You should find yourself at the [page](https://www.uniprot.org/uniprotkb/Q62B08/entry) for `Q62B08`, `BIPC_BURMA`.
{#fig-uniprot-bipc-header width=80%}
:::
### Exploring the UniProt record
The UniProt database can provide a great deal of information about the proteins it contains. This can be a great help in understanding more about your protein, and finding useful links to other sources of information. To help you get greater knowledge about BipC, **please answer the formative questions below in the [MyPlace quiz]({{< var myplace.quiz2 >}})**
::: { .callout-important title="Question 1"}
What organism does this sequence come from?
:::
::: { .callout-tip collapse="true" }
## I need a hint!
Look at the header for the record page.
:::
::: { .callout-important title="Question 2"}
How many amino acids does this protein contain (i.e. how long is the sequence)?
:::
::: { .callout-tip collapse="true" }
## I need a hint!
Look at the header for the record page, or click on `Sequence` in the sidebar.
:::
::: { .callout-important title="Question 3"}
According to the Gene Ontology (GO), what is the subcellular location of this protein?
:::
::: { .callout-tip collapse="true" }
## I need a hint!
Click on `Subcellular Location` in the sidebar, and then look at the `GO Annotation` tab.
:::
::: { .callout-important title="Question 4"}
How many identical (100% identity) sequences are there in the UniProt database?
:::
::: { .callout-tip collapse="true" }
## I need a hint!
Click on `Similar Proteins` and look at the table.
:::
::: { .callout-important title="Question 5"}
What are the accessions of the identical sequences in _Burkholderia pseudomallei_
:::
::: { .callout-tip collapse="true" }
## I need a hint!
Can you find this organism in the `Similar Proteins` table?
:::
### Downloading sequence homologues
We only construct phylogenetic trees from protein sequences that are _related_ to each other - it wouldn't make much sense to try to infer relationships between unrelated sequences!. When we say that two sequences are related we mean that they _share a recognisable common ancestor_.
::: { .callout-note }
Two sequences that share a common ancestor are said to be **_homologous_**.
:::
::: { .callout-caution title="On homology"}
**Homology** - the state of two sequences sharing a common ancestor - is an all-or-nothing state. Two sequences are either homologous to each other, or they are not. It is incorrect to talk about gradation or percentage of homology (though you will encounter people doing so).
:::
UniProt provides a precomputed cluster of all the proteins it knows about that are potentially homologous to any other protein sequence in its database, and you can find this by clicking on the `Similar Proteins` link in the sidebar of the record page. This will take you to a table of similar protein sequences (@fig-uniprot-similar-proteins).
{#fig-uniprot-similar-proteins width=80%}
This table has three tabs: `100% identity`, `90% identity`, and `50% identity`. Clicking on these will show you the clusters of sequences in UniProt sharing at least that level of _sequence identity_ with the current record. In general, the number of sequences in the cluster increases as the percentage identity used to gather the cluster goes down.
::: { .callout-important title="Task" }
**View all sequences in the 50% identity cluster for `Q62B08`.**
:::
::: { .callout-tip collapse="true" }
## I need a hint!
- On the UniProt record page for `Q62B08` click on the `Similar Proteins` link in the sidebar to get the similar proteins table.
- Click on the `50% identity` tab to see the first few proteins in the cluster.
- Click on `View All` to see the full list of sequences in the cluster.
:::
::: { .callout-caution collapse="true" }
## Check your result
You should find yourself at the [UniRef cluster page](https://www.uniprot.org/uniprotkb?query=uniref_cluster_50:UniRef50_A3P6Z6) with 50% identity to `Q62B08`.
{#fig-uniprot-cluster-header width=80%}
:::
This set of sequences are all closely-enough related to each other for drawing a phylogenetic tree to be worthwhile. Several different species are represented, which should make it an informative tree in regards to the evolution of this protein effector. You will use this set of sequences to construct your tree.
::: { .callout-important title="Task" }
**Download all sequences (_not compressed_) in the 50% identity cluster for `Q62B08` to your computer.**
:::
::: { .callout-tip collapse="true" }
## I need a hint!
- Click on the `Download` link at the top of the table. This will bring up a set of options (@fig-uniprot-download-options).
- Make sure `Download all` and `Compressed - No` are selected.
- Click on `Download` and save the file somewhere you can find it again (e.g. on your `Desktop`).
{#fig-uniprot-download-options width=80%}
:::
::: { .callout-caution collapse="true" }
## Check your result
Your sequence file should be the same as that which you can download from [this link](https://raw.githubusercontent.com/sipbs-compbiol/BM211-Workshop-5/main/assets/data/sequences/Q62B08_50pc_cluster.fasta).
:::
**Please answer the formative questions below in the [MyPlace quiz]({{< var myplace.quiz3 >}})**
::: { .callout-important title="Question 6"}
How many sequences are there in this cluster?
:::
::: { .callout-tip collapse="true" }
## I need a hint!
Look at the header for the cluster page.
:::
::: { .callout-important title="Question 7"}
How long is the shortest sequence in this cluster?
:::
::: { .callout-tip collapse="true" }
## I need a hint!
Look in the `Length` column of the cluster table.
:::
::: { .callout-important title="Question 8"}
How many species are represented in the dataset?
:::
::: { .callout-tip collapse="true" }
## I need a hint!
Look in the `Organism` column of the cluster table.
:::
## Make a multiple sequence alignment
The sequences in the cluster you have downloaded from UniProt are different lengths, and are not identical to each other.
If we were to compare animal body parts to understand their evolution, we would naturally try to compare only _equivalent_ elements, such as forelimbs, with each other. It wouldn't be reasonable to compare organ structure with skull shape, say.In the same way, for biological sequence data, we want to compare genetic changes at _equivalent sites_, which means we have to find which sites are _equivalent_ before we build the tree.
We find equivalent sites by aligning sequences against each other in a **Multiple Sequence Alignment (MSA)**. Detailed accounts of the methods involved in multiple sequence alignment are beyond the scope of the course, but they can be similar in some cases to the pairwise sequence alignment process that [you have already seen in BM214](https://sipbs-compbiol.github.io/BM214-Workshop-2/pairwise.html).
You are going to employ a widely-used research tool called `MUSCLE` to make a multiple sequence alignment from the sequences in the cluster you downloaded. This program is availble to install on your own computer, but can also be used through an online service (@fig-muscle-landing).
- `MUSCLE` online service: [https://www.ebi.ac.uk/jdispatcher/msa/muscle](https://www.ebi.ac.uk/jdispatcher/msa/muscle)
{#fig-muscle-landing width=80%}
::: { .callout-important title="Task" }
**Use the online `MUSCLE` service to generate a multiple sequence alignment from your cluster sequences.**
:::
::: { .callout-tip collapse="true" }
## I need a hint!
- On the `MUSCLE` service landing page, click on `Browse…` and use the dialogue box that appears to navigate to your downloaded cluster sequences.
- Scroll down the page and click `Submit`
:::
::: { .callout-caution collapse="true" }
## Check your result
Your alignment should look similar to the one in @fig-muscle-alignment.
{#fig-muscle-alignment}
:::
::: { .callout-note }
The output from the online tool can be very helpful. You can use your mouse to scroll around the alignment visualisation to see which regions are similar, and which differ. There are also options for colouring the residues in the alignment by a range of schemes, to highlight potentially conserved biochemical properties of the residues (@fig-muscle-colours).
{#fig-muscle-colours}
:::
::: { .callout-important title="Task" }
**Download the `CLUSTAL` format `MUSCLE` sequence alignment file to your computer.**
:::
::: { .callout-tip collapse="true" }
## I need a hint!
- On the sequence alignment page, click on the `Result Files` tab.
- Find the `Alignment in CLUSTAL format` line in the table, click on the `Download button next to it, and use the dialogue box to save the file somewhere you can find it.
:::
::: { .callout-caution collapse="true" }
## Check your result
Your alignment file should be the same as the one you can download from [this link](https://raw.githubusercontent.com/sipbs-compbiol/BM211-Workshop-5/refs/heads/main/assets/data/sequences/muscle_alignment.clustalw)
:::
## Construct a phylogenetic tree
Now you have a sequence alignment, you are ready to build a phylogenetic tree.
::: { .callout-warning }
For this workshop we are using a relatively simple tool that will give us a phylogenetic tree quickly. The approach we are using here is known to have systematic issues and would not be used in a modern research setting. However, it is a method that was very widely-used until the early 2000s and is still sometimes seen in current research.
:::
You are going to use a tool called `Simple Phylogeny` to make a phylogenetic tree from your multiple sequence alignment (@fig-simple-phylogeny-loading). This tool implements tree generation methods from the `ClustalW` software tool, and is restricted to two relatively basic clustering approaches:
- `UPGMA` (Unweighted Pair Group Method with Arithmetic Mean) [link](https://sipbs-compbiol.github.io/BM329_Block_B_Workshop/upgma.html)
- `Neighbour Joining` [link](https://en.wikipedia.org/wiki/Neighbor_joining)
We will use this program _via_ its online service (@fig-simple-phylogeny-loading)
- `Simple Phylogeny` online service: [https://www.ebi.ac.uk/jdispatcher/phylogeny/simple_phylogeny](https://www.ebi.ac.uk/jdispatcher/phylogeny/simple_phylogeny))
{#fig-simple-phylogeny-loading width=80%}
::: { .callout-important title="Task" }
**Use the online `Simple Phylogeny` service to generate a Neighbour-Joining phylogenetic tree from your alignment.**
:::
::: { .callout-tip collapse="true" }
## I need a hint!
- Open the sequence alignment file you downloaded in a text editor (**Use Notepad or similar - DO NOT USE MICROSOFT WORD**).
- Copy the data from the file and paste it into the `Input Sequence` field.
- Set Tree Format to `Clustal`
- You can leave all the other options as their default settings (@fig-simpphy-settings).
- Scroll down the page and click `Submit`
{#fig-simpphy-settings}
:::
::: { .callout-caution collapse="true" }
## Check your result
The output should look similar to the tree data in @fig-simpphy-tree. Note that this file is in `.newick` format, but puts each parenthesis (`(`) on a new line.
{#fig-simpphy-tree width=80%}
:::
::: { .callout-important title="Task" }
**Download the `.newick` format tree file to your computer.**
:::
::: { .callout-tip collapse="true" }
## I need a hint!
- On the `Simple Phylogeny` output page, click on the `Result Files` tab.
- Find the `Phylogenetic Tree` line in the table, click on the `Download button next to it, and use the dialogue box to save the file somewhere you can find it.
:::
::: { .callout-caution collapse="true" }
## Check your result
Your tree file should be the same as that which you can download from [this link](https://raw.githubusercontent.com/sipbs-compbiol/BM211-Workshop-5/main/assets/data/sequences/simple-phylogeny.newick)
:::
## Visualise the tree
You will now use `iTol` to visualise the tree you produced. Use what you have learned from @sec-itol-load to view the tree.
::: { .callout-important title="Task" }
**Use `iToL` to visualise the `.newick` tree file you generated with `Simple Phylogeny`.**
:::
::: { .callout-tip collapse="true" }
## I need a hint!
- Click on the `Upload` button in `iTol`
- Click on the `Browse…` button and use the dialogue box to navigate to the location of the tree file you downloaded.
- Click the `Upload` button
:::
::: { .callout-caution collapse="true" }
## Check your result
The tree file shown in `iTol` should resemble @fig-itol-simpphy-tree.
{#fig-itol-simpphy-tree width=80%}
:::
## Summary
::: { .callout-note title="Well Done!"}
After successfully working through this section you should be able to:
- use UniProt to obtain annotation and functional information about a protein sequence
- use UniProt to identify and download homologues of a protein sequence
- generate a protein multiple sequence alignment using `MUSCLE`
- produce a Neighbour-Joining phylogenetic tree using `Simple Phylogeny`
- visualise a phylogenetic tree using `iToL`
:::
::: { .callout-important }
**Please complete the workshop by answering the questions below in the formative quiz on MyPlace**
- [MyPlace formative quiz]({{< var myplace.quiz3 >}})
:::