Implement FastDD algorithm by MichaelS239 · Pull Request #767 · Desbordante/desbordante-core

MichaelS239 · 2026-05-24T12:28:17Z

This PR adds an optimized implementation of the FastDD algorithm for discovering Differential Dependencies (DDs) to Desbordante.

The FastDD algorithm was introduced in the following article: Zijing Tan et al. "Efficient Differential Dependency Discovery". Proc. VLDB Endow. 2024. Vol. 17, no. 7. P. 1552–1564.

The algorithm uses the novel technique of set cover enumeration (or hitting set enumeration) and consists of three main stages:

1. Preprocessing
2. Diff-set construction
3. Set cover enumeration and minimization

During the first stage, the algorithm reads the input table and constructs the search space. During the second stage, the algorithm builds so-called diff-sets that form a hypergraph. During the third stage, the algorithm performs set cover enumeration on the obtained hypergraph and does additional minimization to output only minimal DDs.
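To make the third stage concrete, the sketch below enumerates the minimal hitting sets of a tiny hypergraph by brute force. It only illustrates the problem being solved and is not the enumeration strategy used in this PR. The `Mask` alias and the `HitsAll` and `MinimalHittingSets` names are hypothetical, each diff-set is assumed to be encoded as a bitmask over fewer than 32 search-space elements, and the exhaustive loop is exponential in the number of elements.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical encoding: bit i is set if element i belongs to the set.
using Mask = std::uint32_t;

// True if `candidate` shares at least one element with every hyperedge.
bool HitsAll(Mask candidate, std::vector<Mask> const& edges) {
    for (Mask edge : edges) {
        if ((candidate & edge) == 0) return false;
    }
    return true;
}

// Brute-force enumeration of all minimal hitting sets.
// Assumes num_elements < 32 so that the shifts below are well defined.
std::vector<Mask> MinimalHittingSets(std::vector<Mask> const& edges, unsigned num_elements) {
    std::vector<Mask> result;
    Mask const limit = Mask{1} << num_elements;
    for (Mask candidate = 0; candidate < limit; ++candidate) {
        if (!HitsAll(candidate, edges)) continue;
        // The hitting property is monotone, so a hitting set is minimal
        // exactly when dropping any single element breaks it.
        bool minimal = true;
        for (unsigned i = 0; i < num_elements && minimal; ++i) {
            Mask const bit = Mask{1} << i;
            if ((candidate & bit) != 0 && HitsAll(candidate & ~bit, edges)) minimal = false;
        }
        if (minimal) result.push_back(candidate);
    }
    return result;
}
```

For the two hyperedges {0, 1} and {1, 2}, this returns the sets {1} and {0, 2}. A dedicated enumeration algorithm avoids visiting every subset, which is what makes the approach practical on real search spaces.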

This PR adds the following optimizations to the original algorithm:

- Extension of the algorithm to support the more general definition of DD that is used in the SPLIT algorithm
- Use of the MMCS algorithm for hitting set enumeration
- Additional post-processing to eliminate uninteresting or trivial DDs
- Technical optimizations (fixed-size bitsets and other efficient data structures)

The optimized version of FastDD performs significantly better than the version provided by the authors, with speedups ranging from 2.75x to 13.75x (6.19x on average).

The final stage of the FastDD algorithm is (mostly) hitting set enumeration. The MMCS algorithm (Murakami, Uno, 2014) is considered to be the fastest hitting set enumeration algorithm. A modification of MMCS is used in the HPIValid algorithm. This implementation is based on the one inside HPIValid, with slight modifications specific to DD discovery. The new strategy for hitting set enumeration outperforms the original one proposed in the article presenting FastDD.

The first version of the dynamic bitset stores the first 64 bits in model::Bitset<64> and the remaining bits in boost::dynamic_bitset. The second version stores all bits in either a static or a dynamic bitset using std::variant. The third version stores the static and dynamic bitsets separately. The fourth version stores npos as a field to simplify the find_first() and find_next() methods. The fifth version imitates a static bitset and gives a performance boost.

Remove excessive distance calculations for min_max_dif and combine them with the ISN calculation. This leaves DiffSet with a suboptimal search space, so the search space is refined at a later step and the bitsets from DiffSet are then translated to the smaller, optimal search space.

MichaelS239 added 28 commits May 24, 2026 15:33

Implement FastDD

1fbc6dc

Add main target

5a51287

Refactor algorithm for more general definition

cec626b

Add children bitset

97e92bc

Add a bitset that shows, for each bit position, whether a child node is present in the map. This potentially helps to avoid costly lookups in the map.

Use std::vector instead of std::unordered_map

a9b4cc1

Add basic path compression

e126b3a

Use std::vector instead of std::unordered_set

cfdf476

Use ForEach instead of iterator

92c4a75

Move MinimizeDifferentialSet() to HybridEvidenceInverter

d626480

Small optimizations (sort + reserve)

c224fd1

Use std::unordered_set for building DiffSet

a8be4cd

Optimize SetNumMask in SingleISNBuilder

86e03e0

Use boost::unordered::unordered_flat_set for clues

88cdc27

Choose bitset type in the algorithm

ed8df42

Choose whether to use a static or a dynamic bitset in the algorithm. The choice depends on the bitset size: if size <= 128, a static bitset is used; otherwise, boost::dynamic_bitset is used. The static bitset is based on model::Bitset<N>, where N = 32, 64, 128.
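The size-based choice described above could look roughly like the following dispatch. This is a sketch under assumptions: `std::bitset<N>` stands in for model::Bitset<N>, `std::vector<bool>` stands in for boost::dynamic_bitset so that the example needs no external dependencies, and `BitsetKind`, `ChooseBitsetKind` and `DispatchOnSize` are hypothetical names that do not come from the PR.

```cpp
#include <bitset>
#include <cstddef>
#include <utility>
#include <vector>

enum class BitsetKind { kStatic32, kStatic64, kStatic128, kDynamic };

// Picks the smallest static width that fits, or the dynamic fallback.
constexpr BitsetKind ChooseBitsetKind(std::size_t size) {
    if (size <= 32) return BitsetKind::kStatic32;
    if (size <= 64) return BitsetKind::kStatic64;
    if (size <= 128) return BitsetKind::kStatic128;
    return BitsetKind::kDynamic;
}

// Calls `visit` with an empty bitset of the chosen type, so the body of the
// algorithm can be written once as a generic callable.
template <typename Visitor>
std::size_t DispatchOnSize(std::size_t size, Visitor&& visit) {
    switch (ChooseBitsetKind(size)) {
        case BitsetKind::kStatic32:
            return std::forward<Visitor>(visit)(std::bitset<32>{});
        case BitsetKind::kStatic64:
            return std::forward<Visitor>(visit)(std::bitset<64>{});
        case BitsetKind::kStatic128:
            return std::forward<Visitor>(visit)(std::bitset<128>{});
        default:
            // Stand-in for boost::dynamic_bitset sized at run time.
            return std::forward<Visitor>(visit)(std::vector<bool>(size));
    }
}
```

Calling `DispatchOnSize(100, [](auto const& bits) { return bits.size(); })` yields 128, because a 100-bit search space is rounded up to the 128-bit static bitset, while a size of 200 falls through to the dynamic stand-in.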

Optimize RemoveTransitive

d89b017

Fallback to bitsets when ISNs overflow

f86bb36

For larger search spaces, the ISNs that are used to encode diff-sets may overflow std::size_t. In this case, bitsets are used during DiffSet construction without the encoding/decoding step.
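One way to detect such an overflow up front is to check whether the number of distinct codes fits in std::size_t. The sketch below assumes, purely for illustration, that a code is a mixed-radix number with one digit per column. The actual ISN encoding in the PR may differ, and `CodeSpaceSize` and `NeedsBitsetFallback` are hypothetical names.

```cpp
#include <cstddef>
#include <limits>
#include <optional>
#include <vector>

// Returns the product of all radices, or std::nullopt if the product
// does not fit in std::size_t.
std::optional<std::size_t> CodeSpaceSize(std::vector<std::size_t> const& radices) {
    std::size_t total = 1;
    for (std::size_t radix : radices) {
        // Check before multiplying, because unsigned overflow wraps silently.
        if (radix != 0 && total > std::numeric_limits<std::size_t>::max() / radix) {
            return std::nullopt;
        }
        total *= radix;
    }
    return total;
}

// True if integer codes cannot represent the search space and plain
// bitsets should be used instead.
bool NeedsBitsetFallback(std::vector<std::size_t> const& radices) {
    return !CodeSpaceSize(radices).has_value();
}
```

With radices 3, 4 and 5 the code space has 60 entries and integer codes are safe. With 70 columns of radix 4 the product is 2^140, so the check reports that the fallback is needed.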

Optimize MinMaxDifCalculator

bea0e0d

Set default shard length to 800

02863bf

800 seems to be an optimal value, at least for the datasets used in the experiments.

Treat mixed-type value as strings

da6b928

Move optimized Levenshtein distance to util

2cc0ca0

Use optimized version of Levenshtein distance

c2747d7

Optimize is_subset_of in StaticBitset

fb6eae1

Optimize MinimizeTree

27e006d

- Use StaticBitset<N> for smaller search spaces
- Use bool instead of std::optional (the actual value was not used)
- Replace std::unordered_map with std::vector for children
- Add children_bitset that shows which children are present
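The node layout that these changes describe might look like the following sketch. It is an assumption-laden illustration rather than the code from the PR: `TreeNode`, `GetOrAddChild`, `FindChild`, `Insert` and `Contains` are hypothetical names, `std::bitset<64>` stands in for StaticBitset<N>, and children are assumed to be indexed directly by bit position.

```cpp
#include <bitset>
#include <cstddef>
#include <memory>
#include <vector>

constexpr std::size_t kMaxBits = 64;  // hypothetical fixed width

struct TreeNode {
    std::bitset<kMaxBits> children_bitset;            // bit i set <=> child i exists
    std::vector<std::unique_ptr<TreeNode>> children;  // slot i holds child i or nullptr
    bool is_end = false;                              // plain flag instead of std::optional

    // num_bits must not exceed kMaxBits.
    explicit TreeNode(std::size_t num_bits) : children(num_bits) {}

    TreeNode& GetOrAddChild(std::size_t bit) {
        if (!children_bitset.test(bit)) {
            children[bit] = std::make_unique<TreeNode>(children.size());
            children_bitset.set(bit);
        }
        return *children[bit];
    }

    TreeNode const* FindChild(std::size_t bit) const {
        // The bitset check answers "is there a child?" without hashing.
        return children_bitset.test(bit) ? children[bit].get() : nullptr;
    }
};

// Stores a set given as bit positions in ascending order.
void Insert(TreeNode& root, std::vector<std::size_t> const& sorted_bits) {
    TreeNode* node = &root;
    for (std::size_t bit : sorted_bits) node = &node->GetOrAddChild(bit);
    node->is_end = true;
}

// True only if exactly this set was inserted before.
bool Contains(TreeNode const& root, std::vector<std::size_t> const& sorted_bits) {
    TreeNode const* node = &root;
    for (std::size_t bit : sorted_bits) {
        node = node->FindChild(bit);
        if (node == nullptr) return false;
    }
    return node->is_end;
}
```

Indexing a vector by bit position trades memory for constant-time child access, and the presence bitset can also be intersected with a query bitset to skip absent children in bulk.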

Optimize removal of trivial DDs

a095dfe

Search only in those match_dfs that satisfy the RHS. This set is minimized before checking using MinimizeDifferentialSet(). To optimize this method, an additional bitset comparison was removed from the std::sort call because it does not affect the results.

Cleanup

0999607

MichaelS239 force-pushed the fastdd-optimized3 branch from 2ea934b to 0999607 May 24, 2026 12:45

MichaelS239 marked this pull request as draft May 24, 2026 13:45

Implement FastDD algorithm #767
MichaelS239 wants to merge 28 commits into Desbordante:main from MichaelS239:fastdd-optimized3
