Skip to content

Use weighted min_bin_size and max_bin_size when sample_weight provided#359

Merged
guillermo-navas-palencia merged 4 commits into
guillermo-navas-palencia:developfrom
YC-1412:323-fix_bin_size_with_sample_weight
Jun 4, 2025
Merged

Use weighted min_bin_size and max_bin_size when sample_weight provided#359
guillermo-navas-palencia merged 4 commits into
guillermo-navas-palencia:developfrom
YC-1412:323-fix_bin_size_with_sample_weight

Conversation

@YC-1412

@YC-1412 YC-1412 commented May 28, 2025

Copy link
Copy Markdown
Contributor

Fix #323

Hi Guillermo,

Thanks for developing this amazing tool. I found it extremely useful and easy to use. I hope this PR resolves an existing issue with sample_weight.

  1. Added a property self._n_samples_weighted to calculate weighted sample size - if sample_weight is provided it will be the weighted sample size (sum(sample_weight), if not, if will be the same as self._n_samples.

    • self._n_samples_weighted is used when calculating the min_bin_size and max_bin_size in _fit_optimizer().
  2. When sample_weight is provided, use min_weight_fraction_leaf instead of min_samples_leaf for DecisionTreeClassifier. It was suggested in the sklearn documentation:

    Note that min_samples_split considers samples directly and independent of sample_weight, if provided (e.g. a node with m weighted samples is still treated as having exactly m samples). Consider min_weight_fraction_leaf or min_impurity_decrease if accounting for sample weights is required at splits.

    Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value. Also note that weight-based pre-pruning criteria, such as min_weight_fraction_leaf, will then be less biased toward dominant classes than criteria that are not aware of the sample weights, like min_samples_leaf.

    If the samples are weighted, it will be easier to optimize the tree structure using weight-based pre-pruning criterion such as min_weight_fraction_leaf, which ensure that leaf nodes contain at least a fraction of the overall sum of the sample weights.

I used the test code in #323 to test the fix:

Before:
image

After:
image

Todo

  • I think it is also worth mentioning in the documentation that min_bin_size and max_bin_size include the mix of missing and special groups. I can add that if you think the change makes sense.

Thanks again for creating and maintaining this tool!

@guillermo-navas-palencia guillermo-navas-palencia added the enhancement New feature or request label Jun 4, 2025
@guillermo-navas-palencia guillermo-navas-palencia added this to the v0.21.0 milestone Jun 4, 2025
@guillermo-navas-palencia

guillermo-navas-palencia commented Jun 4, 2025

Copy link
Copy Markdown
Owner

Great PR. Thanks a lot for your contribution. Regarding the clarification about the min/max_bin_size, if think it is a good idea.

@guillermo-navas-palencia guillermo-navas-palencia merged commit 10dd47a into guillermo-navas-palencia:develop Jun 4, 2025
12 checks passed
@guillermo-navas-palencia guillermo-navas-palencia mentioned this pull request Oct 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants