A comprehensive toolkit for calculating and visualizing Shannon entropy in protein multiple sequence alignments (MSAs). Designed for comparative genomics and phylogenomics studies.
See Example Outputs below for visualization examples from the included sample data.
- Shannon Entropy Calculation: Calculate normalized Shannon entropy for amino acid alignments
- Gap Trimming: Automatically remove poorly aligned regions (>80% gaps)
- Positional Entropy Profiles: Generate detailed entropy plots along alignment positions
- Sequence Divergence: Calculate mean pairwise distance for each sequence
- Divergent Taxon Detection: Identify outlier sequences and conserved lineages
- Distance to Consensus: Measure how far each sequence is from the consensus
- Complete MSA Overview: Combined column + row analysis in single visualization
- Copy Number Integration: Combine entropy metrics with gene copy number heatmaps
- Publication-Ready Figures: Export high-quality PDF, PNG, and SVG visualizations
# Create conda environment
conda create -n entropia python=3.8
conda activate entropia
# Install dependencies
pip install numpy pandas matplotlib seaborn
- Python ≥ 3.8
- NumPy
- Pandas
- Matplotlib
- Seaborn
cd your_alignment_directory
python /path/to/entropia-msa/src/calculate_shannon_entropy.py
Input: Multiple .msa files in FASTA format
Output:
- shannon_entropy_results.csv: Detailed statistics for all genes
- entropy_plot.png: Bar plot of genes ranked by entropy
python /path/to/entropia-msa/src/plot_positional_entropy.py
Split-group positional entropy (overlay group curves and output per-group summaries):
python /path/to/entropia-msa/src/plot_positional_entropy.py \
--input-glob "*.msa" \
--output positional_entropy_all_genes.pdf \
--gap-threshold 0.8 \
--split-groups groups.tsv \
--split-output-dir split_entropy_csv
Where groups.tsv has at least two columns: seq_id and group.
Example groups.tsv:
seq_id group
sp1_geneA Satellite
sp2_geneA Satellite
sp3_geneA Transposon
sp4_geneA Transposon
sp5_geneA Mixed
What you get:
- Overlaid entropy curves per group on each positional entropy plot.
- *_split_summary.tsv next to the PDF with mean/median entropy per group.
- Optional per-position CSVs in --split-output-dir (one per gene), e.g. split_entropy_csv/CENPA_split_entropy.csv with columns position, entropy_all, and entropy_<Group>.
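To illustrate the groups.tsv layout described above, here is a minimal sketch of reading such a file into a mapping. It assumes the file is tab-delimited with a header row naming `seq_id` and `group`; the helper names `read_groups` and `members_by_group` are hypothetical and are not taken from the entropia-msa scripts.

```python
import csv
import io
from collections import defaultdict


def read_groups(handle):
    """Map each seq_id to its group label (assumes a tab-delimited file with a header)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["seq_id"]: row["group"] for row in reader}


def members_by_group(groups):
    """Invert the mapping: group label -> list of seq_ids, in file order."""
    by_group = defaultdict(list)
    for seq_id, group in groups.items():
        by_group[group].append(seq_id)
    return dict(by_group)


# The example table from above, written out with explicit tabs.
example = io.StringIO(
    "seq_id\tgroup\n"
    "sp1_geneA\tSatellite\n"
    "sp2_geneA\tSatellite\n"
    "sp3_geneA\tTransposon\n"
    "sp4_geneA\tTransposon\n"
    "sp5_geneA\tMixed\n"
)
groups = read_groups(example)
print(members_by_group(groups))
```

Extra columns beyond `seq_id` and `group` are simply ignored by this sketch, which matches the "at least two columns" wording.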
Minimal demo (toy data):
cd examples
python ../src/plot_positional_entropy.py \
--input-glob "toy.msa" \
--output toy_positional_entropy.pdf \
--gap-threshold 0.8 \
--split-groups groups.tsv \
--split-output-dir split_entropy_csv
Demo inputs:
- examples/toy.msa
- examples/groups.tsv
Demo output:
- examples/toy_positional_entropy.png
Output:
positional_entropy_all_genes.pdf: Multi-page PDF with one plot per gene
python /path/to/entropia-msa/src/calculate_sequence_divergence.py
Output:
- sequence_divergence_results.csv: Per-sequence divergence metrics
- divergence_plot.png: Visualization of taxon-level divergence
For a comprehensive analysis combining both column and row-wise metrics:
python /path/to/entropia-msa/src/plot_msa_overview.py <alignment_file.msa>
Output:
<gene_id>_msa_overview.{png,pdf}: Integrated visualization
For combining entropy with gene copy number data:
cd your_analysis_directory
python /path/to/entropia-msa/src/plot_heatmap_with_entropy.py
Requirements:
- shannon_entropy_results.csv from step 1
- hits_arabidopsis.kinetochore_label_summary_with_complex.tsv (copy number data)
- Phylogenetic tree in Newick format
Output:
kinetochore_heatmap_with_entropy.{png,pdf,svg}
Overview of entropy distribution across genes:
Bar plot showing genes ranked by mean normalized Shannon entropy, with distribution histogram.
Detailed entropy along alignment positions for individual genes:
Mean entropy: 0.0095 - Extremely conserved protein with minimal variation.
Mean entropy: 0.2927 - Moderate sequence variation with conserved and variable regions.
Mean entropy: 0.3932 - High sequence variability across most positions.
Taxon-level divergence metrics:
Distribution of sequence divergence across all genes, showing mean pairwise distances, distance to consensus, and per-gene summaries.
Integrated column + row analysis for individual genes:
Mean entropy: 0.0095 | Mean divergence: 0.0164 - Extremely conserved protein with minimal variation at both position and sequence levels. No highly variable positions or divergent sequences.
Mean entropy: 0.3932 | Mean divergence: 0.5073 - Highly variable protein showing 206/616 variable positions and 318/324 divergent sequences. Top: Positional entropy profile. Middle: Alignment heatmap with most/least divergent sequences and divergence barplot. Bottom: Distribution histograms for both metrics.
Combination of Shannon entropy and gene copy number data:
Top: Entropy barplot (left) and copy number variance (right). Bottom: Gene copy number heatmap across species. X-axis labels show protein names color-coded by complex.
For each alignment position i:
H(i) = -Σ(p_a × log₂(p_a))
Where:
- p_a = frequency of amino acid a at position i
- Gaps are excluded from calculations
Entropy values are normalized by the maximum possible entropy:
H_normalized = H(i) / log₂(20)
Where log₂(20) ≈ 4.32 for the 20 standard amino acids.
Range: 0 (completely conserved) to 1.0 (maximum variability)
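The two formulas above can be sketched for a single alignment column as follows. This is an illustrative sketch, not the code in calculate_shannon_entropy.py: the function name `normalized_entropy`, the treatment of `.` as a gap character, and the choice to report an all-gap column as 0 are all assumptions.

```python
import math
from collections import Counter

GAP_CHARS = {"-", "."}  # assumption: both characters denote gaps


def normalized_entropy(column: str) -> float:
    """Normalized Shannon entropy of one alignment column, with gaps excluded."""
    residues = [c.upper() for c in column if c not in GAP_CHARS]
    if not residues:
        return 0.0  # assumption: an all-gap column is reported as 0
    total = len(residues)
    counts = Counter(residues)
    # sum of p * log2(1/p) is the same quantity as -sum of p * log2(p)
    h = sum((n / total) * math.log2(total / n) for n in counts.values())
    return h / math.log2(20)


print(normalized_entropy("AAAA"))  # fully conserved column
print(normalized_entropy("AC"))    # two residues at equal frequency
print(normalized_entropy("ACDEFGHIKLMNPQRSTVWY"))  # all 20 amino acids once
```

A column split evenly between two residues has H = 1 bit, so its normalized value is 1 / log₂(20), roughly 0.23; only a column with all 20 amino acids at equal frequency reaches 1.0.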
Alignment columns with ≥80% gaps are removed before entropy calculation to focus on reliably aligned regions.
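A minimal sketch of this trimming step is shown below, assuming the alignment is held as a list of equal-length strings and that `-` is the only gap character. The name `trim_gappy_columns` is hypothetical, and the sketch follows the "≥80%" wording of the sentence above, so a column with exactly 80% gaps is dropped.

```python
from typing import List


def trim_gappy_columns(sequences: List[str], gap_threshold: float = 0.8) -> List[str]:
    """Return the alignment without columns whose gap fraction is >= gap_threshold."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(sequences)
    # Indices of the columns that stay below the gap threshold.
    keep = [
        i
        for i in range(length)
        if sum(s[i] == "-" for s in sequences) / n_seqs < gap_threshold
    ]
    return ["".join(s[i] for i in keep) for s in sequences]


alignment = ["MA-K", "MA-K", "M--K", "MA-K", "MAAK"]
print(trim_gappy_columns(alignment))  # third column is 4/5 gaps and is removed
```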
For each sequence s, calculate divergence metrics:
Mean Pairwise Distance:
D(s) = (1/N) × Σ d(s, s_i)
Where:
- d(s, s_i) = pairwise distance (proportion of differing amino acids)
- N = number of other sequences
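The mean pairwise distance defined above can be sketched as follows. How gaps enter the distance is not specified here, so this sketch assumes that positions where either sequence has a gap are skipped; `p_distance` and `mean_pairwise_distance` are hypothetical names.

```python
from typing import List


def p_distance(a: str, b: str) -> float:
    """Proportion of differing residues between two aligned sequences."""
    # Assumption: positions where either sequence has a gap are not compared.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def mean_pairwise_distance(sequences: List[str]) -> List[float]:
    """D(s) for every sequence: mean distance to the N other sequences."""
    result = []
    for i, s in enumerate(sequences):
        others = [p_distance(s, t) for j, t in enumerate(sequences) if j != i]
        result.append(sum(others) / len(others) if others else 0.0)
    return result


print(mean_pairwise_distance(["AAAA", "AAAT", "AATT"]))
```

This compares every pair of sequences, so the cost grows quadratically with the number of sequences in the alignment.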
Distance to Consensus:
D_cons(s) = d(s, consensus)
Where consensus = most common amino acid at each position.
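As a sketch of the consensus definition above: the code builds the consensus column by column and then measures each sequence against it. The names are hypothetical, gaps are assumed to be ignored both when voting and when comparing, and ties are resolved by whichever residue appears first in the column, which is an arbitrary choice.

```python
from collections import Counter
from typing import List

GAP = "-"


def consensus_sequence(sequences: List[str]) -> str:
    """Most common non-gap residue at each position (ties: first one seen)."""
    consensus = []
    for column in zip(*sequences):
        residues = [c for c in column if c != GAP]
        if residues:
            consensus.append(Counter(residues).most_common(1)[0][0])
        else:
            consensus.append(GAP)  # all-gap column: no residue to vote for
    return "".join(consensus)


def distance_to_consensus(sequence: str, consensus: str) -> float:
    """Proportion of compared positions where the sequence differs from the consensus."""
    compared = [(a, b) for a, b in zip(sequence, consensus) if a != GAP and b != GAP]
    if not compared:
        return 0.0
    return sum(a != b for a, b in compared) / len(compared)


seqs = ["MARTK", "MARTK", "MAKTR"]
cons = consensus_sequence(seqs)
print(cons, [distance_to_consensus(s, cons) for s in seqs])
```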
Interpretation:
- High divergence (>0.5): Outlier/divergent lineage
- Medium divergence (0.2-0.5): Moderate variation
- Low divergence (<0.2): Conserved/typical sequence
The Coefficient of Variation (CV) measures copy number distribution across species:
CV = σ / μ
Where:
- σ = standard deviation of copy numbers
- μ = mean copy number
Interpretation:
- High CV: Uneven distribution (some species have many copies, others few)
- Low CV: Even distribution (similar copy numbers across species)
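A small sketch of the CV calculation for one gene, given its copy numbers across species. It uses the population standard deviation; whether the tool uses the population or the sample form is not stated here, so that choice is an assumption, as is the function name.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(copy_numbers: Sequence[float]) -> float:
    """CV = sigma / mu, using the population standard deviation (assumption)."""
    mean = statistics.fmean(copy_numbers)
    if mean == 0:
        return float("nan")  # CV is undefined when the mean copy number is zero
    return statistics.pstdev(copy_numbers) / mean


print(coefficient_of_variation([2, 2, 2, 2]))  # identical copy numbers in every species
print(coefficient_of_variation([1, 3]))        # sigma = 1, mu = 2
```

Because CV divides by the mean, it allows genes with very different typical copy numbers to be compared on the same scale.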
# 1. Navigate to alignment directory
cd /path/to/alignments/
# 2. Calculate entropy for all alignments
python ~/entropia-msa/src/calculate_shannon_entropy.py
# 3. Generate positional entropy profiles
python ~/entropia-msa/src/plot_positional_entropy.py
# 4. Review results
# - shannon_entropy_results.csv
# - entropy_plot.png
# - positional_entropy_all_genes.pdf
| Entropy Range | Conservation Level | Color Code |
|---|---|---|
| 0.0 - 0.2 | Highly conserved | Green |
| 0.2 - 0.5 | Moderately variable | Yellow |
| 0.5 - 1.0 | Highly variable | Red |
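The table can be applied programmatically with a small lookup like the sketch below. The table's ranges share their endpoints, so this sketch assumes half-open intervals (0.2 counts as moderately variable, 0.5 as highly variable); the function name is hypothetical.

```python
def conservation_level(entropy: float) -> str:
    """Map a normalized entropy value in [0, 1] to the table's conservation level."""
    if not 0.0 <= entropy <= 1.0:
        raise ValueError("normalized entropy must lie between 0 and 1")
    # Assumption: boundary values fall into the higher-entropy class.
    if entropy < 0.2:
        return "Highly conserved"
    if entropy < 0.5:
        return "Moderately variable"
    return "Highly variable"


for value in (0.0095, 0.2927, 0.75):
    print(value, conservation_level(value))
```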
Top Variable Proteins (from Rhynchospora kinetochore analysis):
- Skp1 / OG0000061 - Entropy: 0.2628
- Skp1 / OG0001284 - Entropy: 0.2038
- Skp1 / OG0001282 - Entropy: 0.1919
FASTA format with aligned sequences:
>Species1_gene
MARTKQTARKSTGGKAPR---KQLAT
>Species2_gene
MARTKQTARKSTGGKAPR---KQLAT
>Species3_gene
MARTK-TARKST--KAPR---KQLAT
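For reference, an input in this format can be read with a few lines of Python. The sketch below is a generic FASTA reader, not the parser used by the entropia-msa scripts; it assumes record names are unique and that sequences may wrap over several lines.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA records into {header: sequence}, joining wrapped sequence lines."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


example = """>Species1_gene
MARTKQTARKSTGGKAPR---KQLAT
>Species2_gene
MARTKQTARKSTGGKAPR---KQLAT
>Species3_gene
MARTK-TARKST--KAPR---KQLAT
"""
alignment = parse_fasta(example.splitlines())
# In a valid alignment every sequence has the same length.
print({name: len(seq) for name, seq in alignment.items()})
```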
| Column | Description |
|---|---|
| gene_id | Gene/OG identifier |
| num_sequences | Number of sequences in alignment |
| original_length | Alignment length before trimming |
| trimmed_length | Alignment length after gap removal |
| mean_normalized_entropy | Average normalized Shannon entropy |
| median_normalized_entropy | Median entropy value |
| std_normalized_entropy | Standard deviation |
| max_normalized_entropy | Maximum per-position entropy value |
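To make the column definitions concrete, the sketch below builds one such row from a list of per-position normalized entropies. It is an assumption-laden illustration: the function name is hypothetical, the standard deviation is taken as the population form, and `max_normalized_entropy` is read as the highest entropy value rather than a position index.

```python
import statistics
from typing import Dict, List


def summarize_gene(
    gene_id: str,
    num_sequences: int,
    original_length: int,
    per_position_entropy: List[float],
) -> Dict[str, object]:
    """Build one results row from the entropies of the columns kept after trimming."""
    return {
        "gene_id": gene_id,
        "num_sequences": num_sequences,
        "original_length": original_length,
        "trimmed_length": len(per_position_entropy),
        "mean_normalized_entropy": statistics.fmean(per_position_entropy),
        "median_normalized_entropy": statistics.median(per_position_entropy),
        # Assumption: population standard deviation.
        "std_normalized_entropy": statistics.pstdev(per_position_entropy),
        "max_normalized_entropy": max(per_position_entropy),
    }


# Toy values for illustration only, not results from real data.
row = summarize_gene("toy_gene", num_sequences=3, original_length=5,
                     per_position_entropy=[0.0, 0.5, 1.0])
print(row)
```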
Edit calculate_shannon_entropy.py:
# Line ~40
trimmed_seqs = trim_alignment(sequences, gap_threshold=0.8)  # Change to 0.7, 0.9, etc.
In plot_heatmap_with_entropy.py, modify filtering:
# Line ~70
df = df[df["Complex"] == "CCAN"]  # Focus on specific complex
Example data is provided in data/example_alignments/:
- OG0000103_H3.msa - Highly conserved histone H3
- OG0000122_Skp1.msa - Variable Skp1 protein
- OG0000272_Nuf2.msa - Moderately conserved kinetochore protein
Run on examples:
cd data/example_alignments
python ../../src/calculate_shannon_entropy.py
If you use this tool in your research, please cite:
Gonzalez, J. (2025). Entropia-MSA: Shannon Entropy Analysis for Multiple Sequence Alignments.
GitHub: https://github.com/jacgonisa/entropia-msa
Shannon Entropy:
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
Solution: Ensure all dependencies are installed:
pip install numpy pandas matplotlib seaborn
Solution: Ensure you're in the directory containing alignment files, or modify the glob pattern in the script.
Solution: Process alignments in batches for large datasets (>500 alignments).
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
MIT License - see LICENSE file for details
- Author: Jacob Gonzalez
- GitHub: @jacgonisa
- Repository: https://github.com/jacgonisa/entropia-msa
- Developed for the Rhynchospora phylogenomics project
- Shannon entropy implementation based on standard information theory principles
- Visualization inspired by ComplexHeatmap and seaborn libraries
- v1.0.0 (2025-11-20): Initial release
  - Shannon entropy calculation
  - Positional entropy profiles
  - Integrated heatmap visualization
  - Copy number variance metrics







