Implement FastDD algorithm#767
Draft
MichaelS239 wants to merge 28 commits into
Add bitset that shows for each bit-position whether a child node is present in the map. Potentially helps to avoid costly lookups in the map.
The second stage of the FastDD algorithm is (mostly) hitting set enumeration. The MMCS algorithm (Murakami, Uno, 2014) is considered to be the fastest hitting set enumeration algorithm, and a modification of it is used in the HPIValid algorithm. This implementation is based on the one inside HPIValid, with slight modifications specific to DD discovery. The new hitting set enumeration strategy outperforms the original one proposed in the article presenting FastDD.
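The branching scheme of MMCS can be illustrated with a small, simplified sketch. It is not the code from this PR or from HPIValid: vertex sets are assumed to fit into a single 64-bit mask, the names `Mask`, `Enumerate` and `MinimalHittingSets` are hypothetical, and the "critical edge" check is recomputed from scratch at every step instead of being maintained incrementally as real MMCS implementations do.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: at most 64 vertices, so a vertex set is one 64-bit mask.
using Mask = std::uint64_t;

static std::size_t PopCount(Mask m) { return std::bitset<64>(m).count(); }

// A vertex v of s is "critical" if some edge is hit by s only through v.
// A partial solution is worth extending only if all of its vertices are critical.
static bool EveryVertexIsCritical(std::vector<Mask> const& edges, Mask s) {
    Mask critical = 0;
    for (Mask e : edges) {
        Mask hit = e & s;
        bool single = hit != 0 && (hit & (hit - 1)) == 0;
        if (single) critical |= hit;
    }
    return critical == s;
}

static void Enumerate(std::vector<Mask> const& edges, Mask s, Mask cand,
                      std::vector<Mask>& out) {
    // Pick an uncovered edge with the fewest remaining candidate vertices.
    Mask const* best = nullptr;
    for (Mask const& e : edges) {
        if (e & s) continue;  // already hit by s
        if (!best || PopCount(e & cand) < PopCount(*best & cand)) best = &e;
    }
    if (!best) {  // every edge is hit, and s is minimal by the invariant
        out.push_back(s);
        return;
    }
    Mask c = *best & cand;
    cand &= ~c;  // vertices of c are re-added one by one after being tried
    while (c != 0) {
        Mask v = c & (~c + 1);  // lowest set bit
        c &= c - 1;
        if (EveryVertexIsCritical(edges, s | v)) Enumerate(edges, s | v, cand, out);
        cand |= v;
    }
}

// Returns all minimal hitting sets of the hypergraph given by `edges`.
std::vector<Mask> MinimalHittingSets(std::vector<Mask> const& edges, unsigned num_vertices) {
    Mask all = num_vertices >= 64 ? ~Mask{0} : ((Mask{1} << num_vertices) - 1);
    std::vector<Mask> out;
    Enumerate(edges, 0, all, out);
    return out;
}
```

For the edges `{0, 1}` and `{1, 2}` over three vertices, the sketch reports the two minimal hitting sets `{0, 2}` and `{1}`.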
The first version of the dynamic bitset stores the first 64 bits in model::Bitset<64> and the remaining bits in boost::dynamic_bitset. The second version stores all bits either in a static or in a dynamic bitset using std::variant. The third version stores static and dynamic bitsets separately. The fourth version stores npos as a field to simplify the find_first() and find_next() methods. The fifth version imitates a static bitset and gives a performance boost.
Choose whether to use a static or a dynamic bitset in the algorithm. The choice depends on the bitset size: if size <= 128, a static bitset is used; otherwise, boost::dynamic_bitset is used. The static bitset is based on model::Bitset<N>, where N = 32, 64, 128.
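One way to picture this size-based choice is a dispatcher that instantiates a templated routine with the smallest fitting bitset type. The sketch below is an illustration only: `std::bitset<N>` stands in for model::Bitset<N>, the `DynamicBits` struct stands in for boost::dynamic_bitset, and `RunWith` and `Dispatch` are hypothetical names that do not come from the PR.

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

// Stand-in for boost::dynamic_bitset (assumption, to keep the sketch dependency-free).
struct DynamicBits {
    std::vector<bool> bits;
    explicit DynamicBits(std::size_t n) : bits(n) {}
    std::size_t size() const { return bits.size(); }
};

// Placeholder for the algorithm body, templated on the bitset type.
// Here it only reports the capacity of the bitset it was given.
template <typename BitsetT>
std::size_t RunWith(BitsetT const& search_space) {
    return search_space.size();
}

// Pick the smallest static bitset that fits; fall back to the dynamic one above 128 bits.
std::size_t Dispatch(std::size_t search_space_size) {
    if (search_space_size <= 32) return RunWith(std::bitset<32>{});
    if (search_space_size <= 64) return RunWith(std::bitset<64>{});
    if (search_space_size <= 128) return RunWith(std::bitset<128>{});
    return RunWith(DynamicBits(search_space_size));
}
```

Because the choice is made once at the entry point, the hot loops inside the templated routine are compiled against a fixed-width type and avoid per-operation branching.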
For larger search spaces, ISNs that are used to encode diff-sets may overflow std::size_t. In this case, bitsets are used directly during DiffSet construction, without the encoding/decoding process.
800 seems to be an optimal value at least for datasets used in the experiments.
Remove excessive distance calculations for min_max_dif and combine them with ISN calculation. This results in a suboptimal search space in DiffSet, so the search space is refined at a later step, and bitsets from DiffSet are then translated to a smaller, optimal search space.
- Use StaticBitset<N> for smaller search spaces
- Use bool instead of std::optional (actual value was not used)
- Replace std::unordered_map with std::vector for children
- Add children_bitset that shows which children are present
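The effect of a vector of children combined with a presence bitset can be sketched on a generic set-trie. This is an assumption-laden illustration rather than the tree from the PR: `Node`, `Insert`, `ContainsSubsetOf` and the fixed width `kBits` are hypothetical, and `std::bitset` stands in for StaticBitset<N>. The point is that `children_bitset & query` yields the only children worth visiting without any map lookup.

```cpp
#include <bitset>
#include <cstddef>
#include <memory>
#include <vector>

constexpr std::size_t kBits = 64;  // assumed search space size for the sketch

struct Node {
    std::vector<std::unique_ptr<Node>> children;  // indexed by bit position
    std::bitset<kBits> children_bitset;           // bit i set <=> children[i] exists
    bool is_end = false;                          // a stored set ends at this node
    Node() : children(kBits) {}
};

// Store a set by walking its bit positions in increasing order.
void Insert(Node& root, std::bitset<kBits> const& set) {
    Node* cur = &root;
    for (std::size_t i = 0; i < kBits; ++i) {
        if (!set[i]) continue;
        if (!cur->children_bitset[i]) {
            cur->children[i] = std::make_unique<Node>();
            cur->children_bitset.set(i);
        }
        cur = cur->children[i].get();
    }
    cur->is_end = true;
}

// True if some stored set is a subset of `query`.
bool ContainsSubsetOf(Node const& node, std::bitset<kBits> const& query,
                      std::size_t from = 0) {
    if (node.is_end) return true;
    // Only children that exist AND are part of the query can lead to a subset.
    std::bitset<kBits> useful = node.children_bitset & query;
    if (useful.none()) return false;
    for (std::size_t i = from; i < kBits; ++i) {
        if (useful[i] && ContainsSubsetOf(*node.children[i], query, i + 1)) return true;
    }
    return false;
}
```

In this sketch a single bitwise AND prunes a whole node, which is the kind of lookup a hash map would otherwise have to answer key by key.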
Search only in those match_dfs that satisfy the RHS. This set is minimized before checking using MinimizeDifferentialSet(). To optimize this method, an additional bitset comparison was removed from std::sort because it does not affect the results.
This PR adds the optimized implementation of the FastDD algorithm for discovering Differential Dependencies (DDs) in Desbordante.
The FastDD algorithm was introduced in the following article: Zijing Tan et al. "Efficient Differential Dependency Discovery". Proc. VLDB Endow. 2024. Vol. 17, no. 7. P. 1552–1564.
The algorithm uses the novel technique of set cover enumeration (or hitting set enumeration) and consists of three main stages:
1. During the first stage, the algorithm reads the input table and constructs the search space.
2. During the second stage, the algorithm builds so-called diff-sets that form a hypergraph.
3. During the third stage, the algorithm performs set cover enumeration on the obtained hypergraph and does additional minimization to output only minimal DDs.
This PR adds the following optimizations to the original algorithm:
The optimized version of FastDD performs significantly better than the version provided by the authors, with speedup from 2.75x to 13.75x (6.19x on average).