Implement FastDD algorithm #767

Draft
MichaelS239 wants to merge 28 commits into Desbordante:main from MichaelS239:fastdd-optimized3

MichaelS239 (Collaborator) commented May 24, 2026

This PR adds the optimized implementation of the FastDD algorithm for discovering Differential Dependencies (DDs) in Desbordante.

The FastDD algorithm was introduced in the following article: Zijing Tan et al. "Efficient Differential Dependency Discovery". Proc. VLDB Endow. 2024. Vol. 17, no. 7. P. 1552–1564.

The algorithm uses the novel technique of set cover enumeration (or hitting set enumeration) and consists of three main stages:

  • Preprocessing
  • Diff-set construction
  • Set cover enumeration and minimization

During the first stage, the algorithm reads the input table and constructs the search space. During the second stage, the algorithm builds so-called diff-sets that form a hypergraph. During the third stage, the algorithm performs set cover enumeration on the obtained hypergraph and does additional minimization to output only minimal DDs.

This PR adds the following optimizations to the original algorithm:

  • Extension of the algorithm to support the more general definition of DD used in the SPLIT algorithm
  • Use of the MMCS algorithm for hitting set enumeration
  • Additional post-processing to eliminate uninteresting or trivial DDs
  • Technical optimizations (fixed-size bitsets and other efficient data structures)

The optimized version of FastDD performs significantly better than the version provided by the authors, with speedups ranging from 2.75x to 13.75x (6.19x on average).

Add a bitset that records, for each bit position, whether the corresponding
child node is present in the map. This can help to avoid costly lookups in the map.
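
As a rough illustration of this idea (not the code from this PR), the sketch below keeps a presence bitset next to the child map, so a lookup for an absent child is answered by a single bit test. `Node`, `GetChild`, `AddChild` and `kMaxBits` are hypothetical names, and `std::bitset` is only a stand-in for the project's own bitset type.

```cpp
#include <bitset>
#include <cstddef>
#include <memory>
#include <unordered_map>

// Hypothetical upper bound on bit positions, chosen for the example only.
constexpr std::size_t kMaxBits = 64;

// Hypothetical node of a tree whose children are indexed by bit position.
struct Node {
    std::bitset<kMaxBits> children_bitset;  // bit i is set <=> child i exists
    std::unordered_map<std::size_t, std::unique_ptr<Node>> children;

    // Returns the child at `pos`, or nullptr if there is none.
    Node* GetChild(std::size_t pos) const {
        // Cheap bit test first: no hashing when the child is absent.
        if (!children_bitset.test(pos)) return nullptr;
        return children.find(pos)->second.get();
    }

    // Returns the child at `pos`, creating it if necessary.
    Node* AddChild(std::size_t pos) {
        auto& slot = children[pos];
        if (!slot) {
            slot = std::make_unique<Node>();
            children_bitset.set(pos);  // keep the bitset in sync with the map
        }
        return slot.get();
    }
};
```

The bitset only pays off when most lookups miss, which is presumably why the commit message says "can help" rather than promising a speedup.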

The final stage of the FastDD algorithm is (mostly) hitting set enumeration.
The MMCS algorithm (Murakami, Uno, 2014) is considered to be the fastest
hitting set enumeration algorithm. A modification of MMCS
is used in the HPIValid algorithm.

This implementation is based on the one inside HPIValid,
with slight modifications specific to DD discovery. The new strategy
for hitting set enumeration outperforms the original one proposed
in the article presenting FastDD.
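
To make the term concrete, here is a deliberately naive sketch of what hitting set enumeration computes. It is a brute-force enumeration over all vertex subsets, not MMCS and not the code from this PR. Hyperedges are encoded as 32-bit masks, and `Mask`, `HitsAll` and `MinimalHittingSets` are names invented for the example.

```cpp
#include <cstdint>
#include <vector>

// One bit per vertex; the example assumes fewer than 32 vertices.
using Mask = std::uint32_t;

// True if `candidate` shares at least one vertex with every hyperedge.
bool HitsAll(Mask candidate, std::vector<Mask> const& edges) {
    for (Mask e : edges) {
        if ((candidate & e) == 0) return false;
    }
    return true;
}

// Brute force: test every subset of vertices. MMCS avoids this exhaustive
// enumeration; this function only shows what the output looks like.
std::vector<Mask> MinimalHittingSets(std::vector<Mask> const& edges,
                                     unsigned num_vertices) {
    std::vector<Mask> result;
    for (Mask cand = 0; cand < (Mask{1} << num_vertices); ++cand) {
        if (!HitsAll(cand, edges)) continue;
        // Supersets of hitting sets are hitting sets, so it is enough to
        // check that removing any single vertex breaks the property.
        bool minimal = true;
        for (unsigned v = 0; v < num_vertices && minimal; ++v) {
            Mask bit = Mask{1} << v;
            if ((cand & bit) != 0 && HitsAll(cand & ~bit, edges)) minimal = false;
        }
        if (minimal) result.push_back(cand);
    }
    return result;
}
```

For the hypergraph with edges {0, 1} and {1, 2} (masks 0x3 and 0x6), the minimal hitting sets are {1} and {0, 2}. The brute-force cost is exponential in the number of vertices, which is why a dedicated algorithm matters here.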

The first version of the dynamic bitset stores the first 64 bits
in model::Bitset<64> and the remaining bits in boost::dynamic_bitset.

The second version stores all bits in either a static or a dynamic
bitset using std::variant.

The third version stores the static and dynamic bitsets separately.

The fourth version stores npos as a field to simplify the find_first()
and find_next() methods.

The fifth version imitates a static bitset and gives a performance boost.
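
For readers unfamiliar with the std::variant approach mentioned for the second version, a minimal sketch might look as follows. It is not the class from this PR: `VariantBitset` and its methods are hypothetical, `std::bitset<64>` stands in for model::Bitset<64>, and `std::vector<bool>` stands in for boost::dynamic_bitset so that the example needs only the standard library.

```cpp
#include <bitset>
#include <cstddef>
#include <variant>
#include <vector>

// Hypothetical wrapper: small sizes use a fixed-size bitset, larger sizes
// fall back to a heap-allocated one. Both live in a single std::variant.
class VariantBitset {
public:
    explicit VariantBitset(std::size_t size) : size_(size) {
        if (size <= 64) {
            bits_ = std::bitset<64>{};
        } else {
            bits_ = std::vector<bool>(size, false);
        }
    }

    std::size_t Size() const { return size_; }

    // Every operation goes through std::visit, which dispatches on the
    // alternative that is currently stored.
    void Set(std::size_t pos) {
        std::visit([pos](auto& b) { b[pos] = true; }, bits_);
    }

    bool Test(std::size_t pos) const {
        return std::visit([pos](auto const& b) -> bool { return b[pos]; }, bits_);
    }

    bool IsStatic() const {
        return std::holds_alternative<std::bitset<64>>(bits_);
    }

private:
    std::size_t size_;
    std::variant<std::bitset<64>, std::vector<bool>> bits_;
};
```

The per-call dispatch inside std::visit is one plausible reason why later versions moved away from this layout, although the commit message does not state the motivation.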

Choose whether to use a static or a dynamic bitset in the algorithm.
The choice depends on the bitset size: if size <= 128, a static
bitset is used; otherwise, boost::dynamic_bitset is used.

The static bitset is based on model::Bitset<N>, where N = 32, 64, or 128.
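
The size-based choice can be sketched as a run-time dispatch to code templated on the bitset capacity. This is an illustration under assumptions, not the PR's code: `BitsetKind`, `ChooseBitsetKind`, `CountWithStatic`, `CountWithDynamic` and `Run` are invented names, `std::bitset<N>` stands in for model::Bitset<N>, and `std::vector<bool>` stands in for boost::dynamic_bitset.

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <vector>

enum class BitsetKind { kStatic32, kStatic64, kStatic128, kDynamic };

// Threshold rule from the text: sizes up to 128 get a fixed-size bitset.
BitsetKind ChooseBitsetKind(std::size_t size) {
    if (size <= 32) return BitsetKind::kStatic32;
    if (size <= 64) return BitsetKind::kStatic64;
    if (size <= 128) return BitsetKind::kStatic128;
    return BitsetKind::kDynamic;
}

// Toy "algorithm" templated on capacity: sets the first `size` bits
// and counts them.
template <std::size_t N>
std::size_t CountWithStatic(std::size_t size) {
    std::bitset<N> bits;  // capacity fixed at compile time
    for (std::size_t i = 0; i < size; ++i) bits.set(i);
    return bits.count();
}

std::size_t CountWithDynamic(std::size_t size) {
    std::vector<bool> bits(size, true);  // capacity chosen at run time
    return static_cast<std::size_t>(std::count(bits.begin(), bits.end(), true));
}

// The run-time size is inspected once; everything after that runs on a
// concrete bitset type.
std::size_t Run(std::size_t size) {
    switch (ChooseBitsetKind(size)) {
        case BitsetKind::kStatic32: return CountWithStatic<32>(size);
        case BitsetKind::kStatic64: return CountWithStatic<64>(size);
        case BitsetKind::kStatic128: return CountWithStatic<128>(size);
        case BitsetKind::kDynamic: return CountWithDynamic(size);
    }
    return 0;  // unreachable; keeps compilers from warning
}
```

Dispatching once at the top keeps the hot loops free of branches on the bitset representation, at the cost of instantiating the algorithm once per capacity.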

For larger search spaces, the ISNs used to encode diff-sets
may overflow std::size_t. In this case, bitsets are used directly
during DiffSet construction, without the encoding/decoding process.
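
The commit message does not spell out how ISNs are computed. Purely as an assumption for illustration, the sketch below treats an ISN as a mixed-radix code that packs one small index per column into a single std::size_t, and shows how one could detect up front that the code space does not fit. `CodeSpaceSize` and `Encode` are hypothetical helpers, not functions from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <optional>
#include <vector>

// Returns the number of distinct codes (the product of all radices), or
// std::nullopt if that product does not fit in std::size_t.
// Each radix is assumed to be at least 1.
std::optional<std::size_t> CodeSpaceSize(std::vector<std::size_t> const& radices) {
    std::size_t total = 1;
    for (std::size_t r : radices) {
        assert(r >= 1);
        // Check before multiplying: total * r would exceed the maximum.
        if (total > std::numeric_limits<std::size_t>::max() / r) return std::nullopt;
        total *= r;
    }
    return total;
}

// Mixed-radix encoding: digits[i] must be smaller than radices[i].
// Only safe to call when CodeSpaceSize(radices) returned a value.
std::size_t Encode(std::vector<std::size_t> const& digits,
                   std::vector<std::size_t> const& radices) {
    assert(digits.size() == radices.size());
    std::size_t code = 0;
    for (std::size_t i = 0; i < digits.size(); ++i) {
        assert(digits[i] < radices[i]);
        code = code * radices[i] + digits[i];
    }
    return code;
}
```

Under this assumption, a caller would encode diff-sets as integers while `CodeSpaceSize` returns a value and switch to plain bitsets when it returns std::nullopt.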

800 seems to be an optimal value, at least for the datasets
used in the experiments.

Remove redundant distance calculations for min_max_dif by
combining them with ISN calculation. This results in a suboptimal
search space in DiffSet, so the search space is refined
at a later step and the bitsets from DiffSet are then translated
to a smaller, optimal search space.

- Use StaticBitset<N> for smaller search spaces
- Use bool instead of std::optional (the actual value was not used)
- Replace std::unordered_map with std::vector for children
- Add children_bitset that shows which children are present

Search only in those match_dfs that satisfy the RHS.
This set is minimized before checking using
MinimizeDifferentialSet(). To optimize
this method, an additional bitset comparison was
removed from the std::sort comparator because it does not
affect the results.
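
One possible reading of why a tie-breaking comparison is unnecessary is sketched below. It assumes, without confirmation from the source, that minimization means dropping every set that contains another set of the family. `Set` and `KeepMinimalSets` are hypothetical names, and this is not the body of MinimizeDifferentialSet().

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <vector>

using Set = std::bitset<64>;

// Keeps only the sets that have no subset elsewhere in the family
// (duplicates collapse to a single copy).
std::vector<Set> KeepMinimalSets(std::vector<Set> sets) {
    // Sorting by cardinality alone is enough: a proper subset always has
    // strictly fewer elements, so it is visited before its supersets no
    // matter how ties between equally sized sets are ordered.
    std::sort(sets.begin(), sets.end(),
              [](Set const& a, Set const& b) { return a.count() < b.count(); });

    std::vector<Set> minimal;
    for (Set const& s : sets) {
        bool has_subset = std::any_of(minimal.begin(), minimal.end(),
                                      [&s](Set const& m) { return (m & s) == m; });
        if (!has_subset) minimal.push_back(s);
    }
    return minimal;
}
```

With this scheme an extra comparison of the bitset contents inside the comparator would change only the order of equally sized sets, not which sets survive.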

MichaelS239 marked this pull request as draft May 24, 2026 13:45