When using the Conser-vision dataset, I ran Zamba's site-based splitting and no "zebra" examples ended up in the test split, even though seven sites have zebras.
I think site-based splitting needs to be label-aware so that the actual labels are better balanced across splits. I suspect there's a near-optimal greedy approach here: go through the sites round-robin style, but assign each site to the split where its label is least represented relative to that split's target proportion. I don't remember the theory behind that, though. Relevant Zamba code is here
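
To make the idea concrete, here is a minimal sketch of a greedy, label-aware site assignment. It is not Zamba's implementation, and every name in it (`label_aware_site_split`, `site_labels`, `proportions`) is hypothetical. It assumes each site has at least one labeled example and every split proportion is positive. Sites holding the rarest labels are placed first, and each site goes to the split whose count of that site's rarest label is lowest relative to the split's target.

```python
from collections import Counter


def label_aware_site_split(site_labels, proportions):
    """Greedily assign whole sites to splits while balancing labels.

    site_labels: dict mapping site -> Counter of label -> example count
    proportions: dict mapping split name -> target fraction (all > 0)
    Returns (assignment, split_counts).
    """
    # Total number of examples per label across all sites.
    totals = Counter()
    for counts in site_labels.values():
        totals.update(counts)

    def rarest_label(site):
        # The label at this site with the fewest examples overall.
        return min(site_labels[site], key=lambda label: (totals[label], label))

    split_counts = {split: Counter() for split in proportions}
    assignment = {}

    # Place sites with the rarest labels first, while every split is still open.
    ordered = sorted(site_labels, key=lambda s: (totals[rarest_label(s)], s))
    for site in ordered:
        label = rarest_label(site)

        def fill(split):
            # Fraction of this split's target for `label` that is already met.
            target = proportions[split] * totals[label]
            return split_counts[split][label] / target

        # Least-filled split wins; ties go to the larger split.
        best = min(proportions, key=lambda s: (fill(s), -proportions[s]))
        assignment[site] = best
        split_counts[best].update(site_labels[site])

    return assignment, split_counts


# Toy data (made up): 7 sites with zebras, 13 sites without.
site_labels = {}
for i in range(20):
    if i < 7:
        site_labels[f"site_{i:02d}"] = Counter(zebra=5, antelope=20)
    else:
        site_labels[f"site_{i:02d}"] = Counter(antelope=30)

proportions = {"train": 0.7, "val": 0.15, "test": 0.15}
assignment, split_counts = label_aware_site_split(site_labels, proportions)
for split in proportions:
    print(split, dict(split_counts[split]))
```

This only balances on each site's single rarest label, so it is a heuristic rather than an optimal solution. With fewer sites for a label than there are splits, some split will still end up without that label.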