Entropia-MSA: Shannon Entropy Analysis for Multiple Sequence Alignments

A comprehensive toolkit for calculating and visualizing Shannon entropy in protein multiple sequence alignments (MSAs). Designed for comparative genomics and phylogenomics studies.

See Example Outputs below for visualization examples from the included sample data.

Features

Column-wise Analysis (Position-level)

Shannon Entropy Calculation: Calculate normalized Shannon entropy for amino acid alignments
Gap Trimming: Automatically remove poorly aligned regions (>80% gaps)
Positional Entropy Profiles: Generate detailed entropy plots along alignment positions

Row-wise Analysis (Taxon-level) 🆕

Sequence Divergence: Calculate mean pairwise distance for each sequence
Divergent Taxon Detection: Identify outlier sequences and conserved lineages
Distance to Consensus: Measure how far each sequence is from the consensus

Integrated Analysis

Complete MSA Overview: Combined column + row analysis in single visualization
Copy Number Integration: Combine entropy metrics with gene copy number heatmaps
Publication-Ready Figures: Export high-quality PDF, PNG, and SVG visualizations

Installation

Requirements

# Create conda environment
conda create -n entropia python=3.8
conda activate entropia

# Install dependencies
pip install numpy pandas matplotlib seaborn

Dependencies

Python ≥ 3.8
NumPy
Pandas
Matplotlib
Seaborn

Quick Start

1. Calculate Shannon Entropy

cd your_alignment_directory
python /path/to/entropia-msa/src/calculate_shannon_entropy.py

Input: Multiple .msa files in FASTA format

Output:

shannon_entropy_results.csv: Detailed statistics for all genes
entropy_plot.png: Bar plot of genes ranked by entropy

2. Generate Positional Entropy Profiles

python /path/to/entropia-msa/src/plot_positional_entropy.py

Split-group positional entropy (overlay group curves and output per-group summaries):

python /path/to/entropia-msa/src/plot_positional_entropy.py \
  --input-glob "*.msa" \
  --output positional_entropy_all_genes.pdf \
  --gap-threshold 0.8 \
  --split-groups groups.tsv \
  --split-output-dir split_entropy_csv

Where groups.tsv has at least two columns: seq_id and group.

Example groups.tsv:

seq_id	group
sp1_geneA	Satellite
sp2_geneA	Satellite
sp3_geneA	Transposon
sp4_geneA	Transposon
sp5_geneA	Mixed
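
A minimal sketch of how a file like this could be read into a group-to-sequences mapping with the standard library. The helper name `load_groups` is hypothetical and not part of Entropia-MSA; only the `seq_id` and `group` column names come from the description above.

```python
import csv
import io
from collections import defaultdict


def load_groups(handle):
    """Map each group label to the list of seq_id values assigned to it."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["group"]].append(row["seq_id"])
    return dict(groups)


# The example table, inlined so the sketch runs without any files on disk.
example_tsv = (
    "seq_id\tgroup\n"
    "sp1_geneA\tSatellite\n"
    "sp2_geneA\tSatellite\n"
    "sp3_geneA\tTransposon\n"
    "sp4_geneA\tTransposon\n"
    "sp5_geneA\tMixed\n"
)

groups = load_groups(io.StringIO(example_tsv))
for label, members in groups.items():
    print(label, members)
```

For a real file, pass `open("groups.tsv", newline="")` instead of the in-memory string.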

What you get:

Overlaid entropy curves per group on each positional entropy plot.
*_split_summary.tsv next to the PDF with mean/median entropy per group.
Optional per-position CSVs in --split-output-dir (one per gene), e.g.:
- split_entropy_csv/CENPA_split_entropy.csv with columns position, entropy_all, and entropy_<Group>.
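
As an illustration of working with the per-position CSVs, the sketch below recomputes a per-group mean and median from the `entropy_<Group>` columns. The helper name `summarize_groups` and the toy numbers are assumptions made for this example; only the column layout (`position`, `entropy_all`, `entropy_<Group>`) is taken from the description above.

```python
import csv
import io
import statistics

# Toy stand-in for a per-position CSV; the values are made up for illustration.
toy_csv = """position,entropy_all,entropy_Satellite,entropy_Transposon
1,0.10,0.00,0.20
2,0.30,0.25,0.35
3,0.50,0.45,0.65
"""


def summarize_groups(handle):
    """Return {group: (mean, median)} over the entropy_<Group> columns."""
    rows = list(csv.DictReader(handle))
    if not rows:
        return {}
    summary = {}
    for column in rows[0]:
        if column.startswith("entropy_") and column != "entropy_all":
            # Skip empty cells so a missing value does not break float().
            values = [float(r[column]) for r in rows if r[column] != ""]
            group = column[len("entropy_"):]
            summary[group] = (statistics.fmean(values), statistics.median(values))
    return summary


summary = summarize_groups(io.StringIO(toy_csv))
for group, (mean, median) in summary.items():
    print(f"{group}: mean={mean:.4f} median={median:.4f}")
```

To run it on real output, open one of the files in the `--split-output-dir` directory and pass the file handle instead.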

Minimal demo (toy data):

cd examples
python ../src/plot_positional_entropy.py \
  --input-glob "toy.msa" \
  --output toy_positional_entropy.pdf \
  --gap-threshold 0.8 \
  --split-groups groups.tsv \
  --split-output-dir split_entropy_csv

Demo inputs:

examples/toy.msa
examples/groups.tsv

Demo output:

examples/toy_positional_entropy.png

Output:

positional_entropy_all_genes.pdf: Multi-page PDF with one plot per gene

3. Calculate Sequence Divergence (Row-wise Analysis) 🆕

python /path/to/entropia-msa/src/calculate_sequence_divergence.py

Output:

sequence_divergence_results.csv: Per-sequence divergence metrics
divergence_plot.png: Visualization of taxon-level divergence

4. Generate Complete MSA Overview 🆕

For a comprehensive analysis combining both column and row-wise metrics:

python /path/to/entropia-msa/src/plot_msa_overview.py <alignment_file.msa>

Output:

<gene_id>_msa_overview.{png,pdf}: Integrated visualization

5. Create Integrated Heatmap (Optional)

For combining entropy with gene copy number data:

cd your_analysis_directory
python /path/to/entropia-msa/src/plot_heatmap_with_entropy.py

Requirements:

shannon_entropy_results.csv from step 1
hits_arabidopsis.kinetochore_label_summary_with_complex.tsv (copy number data)
Phylogenetic tree in Newick format

Output:

kinetochore_heatmap_with_entropy.{png,pdf,svg}

Example Outputs

Entropy Summary Plot

Overview of entropy distribution across genes:

Bar plot showing genes ranked by mean normalized Shannon entropy, with distribution histogram.

Positional Entropy Profiles

Detailed entropy along alignment positions for individual genes:

Highly Conserved: Histone H3

Mean entropy: 0.0095 - Extremely conserved protein with minimal variation.

Moderately Variable: Nuf2

Mean entropy: 0.2927 - Moderate sequence variation with conserved and variable regions.

Highly Variable: Skp1

Mean entropy: 0.3932 - High sequence variability across most positions.

Sequence Divergence Analysis (Row-wise) 🆕

Taxon-level divergence metrics:

Distribution of sequence divergence across all genes, showing mean pairwise distances, distance to consensus, and per-gene summaries.

Complete MSA Overview 🆕

Integrated column + row analysis for individual genes:

Highly Conserved: Histone H3

Mean entropy: 0.0095 | Mean divergence: 0.0164 - Extremely conserved protein with minimal variation at both position and sequence levels. No highly variable positions or divergent sequences.

Highly Variable: Skp1

Mean entropy: 0.3932 | Mean divergence: 0.5073 - Highly variable protein showing 206/616 variable positions and 318/324 divergent sequences. Top: Positional entropy profile. Middle: Alignment heatmap with most/least divergent sequences and divergence barplot. Bottom: Distribution histograms for both metrics.

Integrated Heatmap with Entropy

Combination of Shannon entropy and gene copy number data:

Top: Entropy barplot (left) and copy number variance (right). Bottom: Gene copy number heatmap across species. X-axis labels show protein names color-coded by complex.

Methodology

Shannon Entropy Formula

For each alignment position i:

H(i) = -Σ(p_a × log₂(p_a))

Where:

p_a = frequency of amino acid a at position i
Gaps are excluded from calculations

Normalization

Entropy values are normalized by the maximum possible entropy:

H_normalized = H(i) / log₂(20)

Where log₂(20) ≈ 4.32 for the 20 standard amino acids.

Range: 0 (completely conserved) to 1.0 (maximum variability)
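
The two formulas above can be sketched for a single alignment column as follows. This is an illustration rather than the code in `calculate_shannon_entropy.py`: the function name `normalized_entropy`, the set of gap characters, and the choice to score an all-gap column as 0 are assumptions.

```python
import math
from collections import Counter

GAP_CHARS = {"-", "."}  # assumption: these characters denote gaps


def normalized_entropy(column):
    """Return H(i) / log2(20) for one alignment column, ignoring gaps."""
    residues = [c.upper() for c in column if c not in GAP_CHARS]
    if not residues:
        return 0.0  # assumption: an all-gap column is scored as 0
    total = len(residues)
    counts = Counter(residues)
    # Sum of -p_a * log2(p_a) over the amino acids observed in the column.
    h = sum(-(n / total) * math.log2(n / total) for n in counts.values())
    return h / math.log2(20)


# Four toy sequences; zip(*sequences) iterates over alignment columns.
sequences = ["MARTK", "MARSK", "MGRTK", "MARTR"]
profile = [normalized_entropy(col) for col in zip(*sequences)]
print([round(v, 4) for v in profile])
```

A fully conserved column scores 0, and a column in which all 20 amino acids are equally frequent scores 1.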

Gap Trimming

Alignment columns with ≥80% gaps are removed before entropy calculation to focus on reliably aligned regions.
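
A minimal sketch of this trimming rule, assuming gaps are written as `-` or `.`. The function name `trim_gappy_columns` is hypothetical and does not reproduce the tool's own trimming code.

```python
def trim_gappy_columns(sequences, gap_threshold=0.8, gap_chars="-."):
    """Drop every column whose gap fraction is >= gap_threshold."""
    n = len(sequences)
    keep = [
        i
        for i, col in enumerate(zip(*sequences))
        if sum(c in gap_chars for c in col) / n < gap_threshold
    ]
    return ["".join(seq[i] for i in keep) for seq in sequences]


# Columns 3 and 5 are 4/5 = 80% gaps and are removed; column 2 (40% gaps) stays.
aligned = ["MA-K-", "MA-K-", "M--K-", "MG-KR", "M-TK-"]
trimmed = trim_gappy_columns(aligned)
print(trimmed)
```

Lowering `gap_threshold` trims more aggressively; raising it keeps more sparsely populated columns.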

Sequence Divergence (Row-wise Analysis) 🆕

For each sequence s, calculate divergence metrics:

Mean Pairwise Distance:

D(s) = (1/N) × Σ d(s, s_i)

Where:

d(s, s_i) = pairwise distance (proportion of differing amino acids)
N = number of other sequences

Distance to Consensus:

D_cons(s) = d(s, consensus)

Where consensus = most common amino acid at each position.
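
Both metrics can be sketched in a few lines. The gap handling here is an assumption, since it is not specified above: columns where either sequence has a gap are skipped, and gaps do not vote in the consensus. The function names are hypothetical, and the sketch expects at least two sequences.

```python
from collections import Counter

GAP = "-"


def p_distance(a, b):
    """Proportion of differing residues over columns where neither has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def consensus(sequences):
    """Most common non-gap residue per column (ties: first one encountered)."""
    out = []
    for col in zip(*sequences):
        residues = [c for c in col if c != GAP]
        out.append(Counter(residues).most_common(1)[0][0] if residues else GAP)
    return "".join(out)


def mean_pairwise_distance(index, sequences):
    """D(s): mean distance from sequences[index] to every other sequence."""
    s = sequences[index]
    others = [t for j, t in enumerate(sequences) if j != index]
    return sum(p_distance(s, t) for t in others) / len(others)


seqs = ["MARTK", "MARTK", "MGRTR"]
cons = consensus(seqs)
for i, s in enumerate(seqs):
    print(s, round(mean_pairwise_distance(i, seqs), 3), round(p_distance(s, cons), 3))
```

In this toy alignment the third sequence differs from the others at two of five positions, so it has the highest value for both metrics.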

Interpretation:

High divergence (>0.5): Outlier/divergent lineage
Medium divergence (0.2-0.5): Moderate variation
Low divergence (<0.2): Conserved/typical sequence

Copy Number Variance

The Coefficient of Variation (CV) measures how unevenly gene copy number is distributed across species, relative to the mean:

CV = σ / μ

Where:

σ = standard deviation of copy numbers
μ = mean copy number

Interpretation:

High CV: Uneven distribution (some species have many copies, others few)
Low CV: Even distribution (similar copy numbers across species)
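
A short sketch of the CV calculation using the standard library. Whether the tool uses the population or the sample standard deviation is not stated here, so the population form (`statistics.pstdev`) is an assumption, as is returning NaN when the mean is zero.

```python
import math
import statistics


def coefficient_of_variation(copy_numbers):
    """CV = sigma / mu, using the population standard deviation (assumption)."""
    mu = statistics.fmean(copy_numbers)
    if mu == 0:
        return float("nan")  # CV is undefined when the mean is zero
    return statistics.pstdev(copy_numbers) / mu


even = coefficient_of_variation([2, 2, 2, 2])    # same copy number everywhere
uneven = coefficient_of_variation([1, 1, 1, 9])  # one species with many copies
print(round(even, 4), round(uneven, 4))
```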

Example Workflow

# 1. Navigate to alignment directory
cd /path/to/alignments/

# 2. Calculate entropy for all alignments
python ~/entropia-msa/src/calculate_shannon_entropy.py

# 3. Generate positional entropy profiles
python ~/entropia-msa/src/plot_positional_entropy.py

# 4. Review results
# - shannon_entropy_results.csv
# - entropy_plot.png
# - positional_entropy_all_genes.pdf

Output Interpretation

Shannon Entropy Zones

Entropy Range	Conservation Level	Color Code
0.0 - 0.2	Highly conserved	Green
0.2 - 0.5	Moderately variable	Yellow
0.5 - 1.0	Highly variable	Red
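
The zones can be applied programmatically along these lines. The table's ranges share their endpoints, so this sketch assumes that values of exactly 0.2 and 0.5 fall into the higher zone; the function name `entropy_zone` is hypothetical.

```python
def entropy_zone(value):
    """Classify a normalized entropy value into one of the three zones."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"normalized entropy must be in [0, 1], got {value}")
    if value < 0.2:
        return "Highly conserved"
    if value < 0.5:  # assumption: 0.2 itself counts as moderately variable
        return "Moderately variable"
    return "Highly variable"  # assumption: 0.5 itself counts as highly variable


zones = [entropy_zone(v) for v in (0.05, 0.35, 0.8)]
print(zones)
```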

Example Results

Top Variable Proteins (from Rhynchospora kinetochore analysis):

Skp1 / OG0000061 - Entropy: 0.2628
Skp1 / OG0001284 - Entropy: 0.2038
Skp1 / OG0001282 - Entropy: 0.1919

File Formats

Input: MSA Files

FASTA format with aligned sequences:

>Species1_gene
MARTKQTARKSTGGKAPR---KQLAT
>Species2_gene
MARTKQTARKSTGGKAPR---KQLAT
>Species3_gene
MARTK-TARKST--KAPR---KQLAT
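
Before running the scripts, it can help to confirm that a file parses as aligned FASTA, that is, that all sequences have the same length. The sketch below is a stand-alone check using only the standard library; `read_fasta` and `alignment_length` are hypothetical helper names, not functions shipped with Entropia-MSA.

```python
import io


def read_fasta(lines):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def alignment_length(records):
    """Return the shared sequence length, or raise if lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences are not aligned, lengths found: {sorted(lengths)}")
    return lengths.pop()


example = """>Species1_gene
MARTKQTARKSTGGKAPR---KQLAT
>Species2_gene
MARTKQTARKSTGGKAPR---KQLAT
>Species3_gene
MARTK-TARKST--KAPR---KQLAT
"""

records = read_fasta(io.StringIO(example))
print(len(records), "sequences, alignment length", alignment_length(records))
```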

Output: CSV Results

Column	Description
gene_id	Gene/OG identifier
num_sequences	Number of sequences in alignment
original_length	Alignment length before trimming
trimmed_length	Alignment length after gap removal
mean_normalized_entropy	Average normalized Shannon entropy
median_normalized_entropy	Median entropy value
std_normalized_entropy	Standard deviation
max_normalized_entropy	Maximum normalized entropy across positions
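
Once the results file exists, genes can be ranked by the `mean_normalized_entropy` column. The sketch below uses an inlined toy table so it runs on its own: the entropy values echo the means quoted under Example Outputs, the `gene_id` strings are illustrative, and the remaining columns are left out for brevity.

```python
import csv
import io

# Toy stand-in for shannon_entropy_results.csv (subset of columns).
toy_results = """gene_id,mean_normalized_entropy
H3,0.0095
Skp1,0.3932
Nuf2,0.2927
"""


def rank_by_entropy(handle):
    """Return (gene_id, mean entropy) pairs, most variable first."""
    rows = csv.DictReader(handle)
    pairs = [(r["gene_id"], float(r["mean_normalized_entropy"])) for r in rows]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_entropy(io.StringIO(toy_results))
for gene, entropy in ranking:
    print(f"{gene}\t{entropy:.4f}")
```

For real output, pass `open("shannon_entropy_results.csv", newline="")` instead of the in-memory string.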

Advanced Usage

Custom Gap Threshold

Edit calculate_shannon_entropy.py:

# Line ~40
trimmed_seqs = trim_alignment(sequences, gap_threshold=0.8)  # Change to 0.7, 0.9, etc.

Filter by Protein Complex

In plot_heatmap_with_entropy.py, modify filtering:

# Line ~70
df = df[df["Complex"] == "CCAN"]  # Focus on specific complex

Examples

Example data is provided in data/example_alignments/:

OG0000103_H3.msa - Highly conserved histone H3
OG0000122_Skp1.msa - Variable Skp1 protein
OG0000272_Nuf2.msa - Moderately conserved kinetochore protein

Run on examples:

cd data/example_alignments
python ../../src/calculate_shannon_entropy.py

Citation

If you use this tool in your research, please cite:

Gonzalez, J. (2025). Entropia-MSA: Shannon Entropy Analysis for Multiple Sequence Alignments.
GitHub: https://github.com/jacgonisa/entropia-msa

Method Citation

Shannon Entropy:

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.

Troubleshooting

Issue: "ModuleNotFoundError"

Solution: Ensure all dependencies are installed:

pip install numpy pandas matplotlib seaborn

Issue: "No .msa files found"

Solution: Ensure you're in the directory containing alignment files, or modify the glob pattern in the script.

Issue: "Memory Error"

Solution: Process alignments in batches for large datasets (>500 alignments).
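
One way to form such batches is to split the list of alignment files into fixed-size chunks, as sketched below. The helper name `batches`, the batch size of 500, and the file names are assumptions for illustration; how each batch is then passed to the scripts depends on your setup and is not covered here.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical file names standing in for the result of a "*.msa" glob.
paths = [f"gene_{i:04d}.msa" for i in range(1200)]
chunks = list(batches(paths, 500))
print([len(chunk) for chunk in chunks])
```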

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Submit a pull request

License

MIT License - see LICENSE file for details

Contact

Author: Jacob Gonzalez
GitHub: @jacgonisa
Repository: https://github.com/jacgonisa/entropia-msa

Acknowledgments

Developed for the Rhynchospora phylogenomics project
Shannon entropy implementation based on standard information theory principles
Visualization inspired by ComplexHeatmap and seaborn libraries

Version History

v1.0.0 (2025-11-20): Initial release
- Shannon entropy calculation
- Positional entropy profiles
- Integrated heatmap visualization
- Copy number variance metrics
