A template repo for curating Kaptive databases
Just click the green "Use this template" button in the top-right corner, create your repo and upload your database Genbank files!
All Kaptive databases now must be accompanied with a metadata TOML file with the same name as the corresponding Genbank file, with a '.toml' extension in place of the '.gbk' extension.
Below is an example of the Klebsiella pneumoniae Species Complex K-locus database metadata:
name = "Klebsiella_pneumoniae_Species_Complex_K"
keyword = "kpsc_k"
genbank = "Klebsiella_pneumoniae_Species_Complex_K.gbk"
organism = "Klebsiella pneumoniae Species Complex"
taxon = 3390273
antigen = "Capsular polysaccharide"
pathway = "Wzx/Wzy-dependent"
prefix = "K"
version = "3.2.1"
id_threshold = 82.5
doi = ["TBD"]
owner = "klebgenomics"
repo = "KpSC_surface_antigen_loci"
branch = "main"
contact = { "Kelly Wyres" = "kaptive.typing@gmail.com" }
[phenotype_logic]
"Capsule null" = { loci = ["KL*"], inactive_genes = ['wza','wzb','wzc','wzx','wzy', 'wcaJ*', 'wbaP*'], priority = 100 }
"K37" = { loci = ["KL22"], inactive_genes = ["atr12"] }The TOML format is a simple, human-readable, easily-parsable format, which makes it perfect for metadata. For these reasons, it also made sense to define the phenotype logic here too! Whilst this is still a work-in-progress, here is how we're currently defining it:
- Each line represents a unique phenotype that can be applied to a serotyping call.
- All fields accept a wildcard (
*) for selecting multiple items. - Loci are defined by the "loci" field - here you can choose the specific loci the logic applies to.
We have created a simple Streamlit app to help you generate the metadata any database in your repo!
This repository uses a fully automated Continuous Integration / Continuous Deployment (CI/CD) pipeline to manage database versions.
You do not need to manually edit version numbers or create Git tags. The pipeline relies on Semantic Versioning (SemVer) and reads your commit messages to automatically calculate the correct version bump, update the corresponding .toml files, and generate database-specific release tags.
The automation script decides how to version a database based on the language used in your commit messages. We follow the Conventional Commits standard.
When you commit changes to a database Genbank file, prefix your commit message with one of the following:
fix: - Use this for correcting typos, fixing broken logic rules, or minor backwards-compatible bug fixes.
- Example:
fix: correct wcaJ truncation rule in Klebsiella - Result:
v3.2.1 ➡️ v3.2.2
feat: - Use this when adding new features, such as adding a new locus, a new glycosidic linkage, or expanding the phenotype logic in a backwards-compatible way.
- Example:
feat: add KL102 locus to Klebsiella_pneumoniae_K - Result:
v3.2.1 ➡️ v3.3.0
feat!: or [major] - Use this for breaking changes, such as overhauling the TOML schema, changing existing core
nomenclature, or deleting previously supported loci.
- Example:
feat!: restructure TOML schema for phenotype logic - Result:
v3.2.1 ➡️ v4.0.0
chore:, docs:, style: - Changes to READMEs, generic repository maintenance, or formatting will not trigger a version bump.
To update an existing database, simply make your changes to the .gbk files and commit them using the appropriate prefix.
# 1. Make changes to your files
git add Klebsiella_pneumoniae_K.gbk Klebsiella_pneumoniae_K.toml
# 2. Commit using a Conventional Commit message
git commit -m "feat: add new Wzy-dependent linkage rules"
# 3. Push to main
git push origin mainWhat happens next? The GitHub Action will detect the changes to the Klebsiella files, parse the feat: prefix,
bump the minor version in Klebsiella_pneumoniae_K.toml, commit that TOML update back to the repository,
and create a scoped tag (e.g., Klebsiella_pneumoniae_K-v3.3.0).
The pipeline is database-agnostic. To add a new database, you just need to drop the required files into the repository.
- Add your new GenBank file (e.g.,
seudomonas_aeruginosa_O.gbk). - Add a starting TOML file (e.g.,
Pseudomonas_aeruginosa_O.toml) and manually set the initial version (e.g.,version = "1.0.0"). - Commit and push:
git add Pseudomonas_aeruginosa_O.*
git commit -m "feat: initial release of Pseudomonas O-locus database"
git push origin mainThe pipeline will automatically discover the new .toml file, register the feat: bump (e.g., v1.1.0), and tag it.
If you make a broad change that affects multiple databases (for example, fixing a shared logic rule across both Klebsiella_pneumoniae_K and Klebsiella_pneumoniae_O), simply commit them together:
git add *.logic
git commit -m "fix: standardize capsule null logic across all databases"
git push origin main**The workflow will detect every database that was modified, bump their .toml versions independently, and generate a separate release tag for each one.
- Never manually edit the version = "..." string in the .toml files. The Python automation (tomlkit) handles this to ensure strict alignment between the file contents and the Git tags.
- Ensure file names match exactly. The base name of the TOML file must match the base name of the GenBank files
(e.g.,
Database_Name.tomlpairs withDatabase_Name.gbk). - Pull before you work. Because the GitHub Action makes automated commits to update the TOML files, always
run
git pullbefore starting new work to ensure your local branch has the latest version strings.
