Add CIND Cure Algorithm with Example and Tests #729
| struct VecIntHash { | ||
| std::size_t operator()(std::vector<int> const& vec) const noexcept { | ||
| return boost::hash_value(vec); | ||
| } | ||
| }; | ||
|
|
||
| struct PairIntHash { | ||
| std::size_t operator()(std::pair<int, int> const& p) const noexcept { | ||
| std::size_t seed = 0; | ||
| boost::hash_combine(seed, p.first); | ||
| boost::hash_combine(seed, p.second); | ||
| return seed; | ||
| } | ||
| }; | ||
|
|
||
| std::vector<int> MakeKey(std::size_t row, AttrsType const& cols) { | ||
| std::vector<int> key; | ||
| key.reserve(cols.size()); | ||
| for (auto const* c : cols) { | ||
| key.push_back(c->GetValue(row)); | ||
| } | ||
| return key; | ||
| } |
I think VecIntHash and MakeKey() should be moved to a separate namespace cind::utils. This will avoid code duplication in the verifier and all mining algorithms. Also, the MakeKey() logic is often implicitly used in the pli_cind and cinderella algorithms and can be replaced by an explicit call to this function.
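A minimal sketch of what such a shared header could look like. The namespace name cind::utils comes from the comment above; everything else is an assumption for illustration: MakeKey is templated because the real AttrsType is not shown here, FakeColumn is a hypothetical stand-in for the encoded column type, and the hash uses a plain standard-library combine instead of boost::hash_value so the snippet compiles without Boost.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

namespace cind::utils {

// Hash for encoded row keys. Assumption: a std-only hash-combine loop
// stands in for boost::hash_value used in the PR.
struct VecIntHash {
    std::size_t operator()(std::vector<int> const& vec) const noexcept {
        std::size_t seed = vec.size();
        for (int v : vec) {
            seed ^= std::hash<int>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};

// Builds the key of one row from the given columns.
// Columns is any container of pointers to objects exposing GetValue(row).
template <typename Columns>
std::vector<int> MakeKey(std::size_t row, Columns const& cols) {
    std::vector<int> key;
    key.reserve(cols.size());
    for (auto const* c : cols) {
        key.push_back(c->GetValue(row));
    }
    return key;
}

}  // namespace cind::utils

// Hypothetical stand-in for the real encoded column class.
struct FakeColumn {
    std::vector<int> values;
    int GetValue(std::size_t row) const { return values[row]; }
};
```

With a helper like this, the verifier and each miner would call cind::utils::MakeKey(row, cols) instead of repeating the loop inline.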
| entry.values[lhs_cond_size + p.rhs_attr_idx] = | ||
| attrs.rhs_conditional[p.rhs_attr_idx]->DecodeValue(p.rhs_value); | ||
| entry.support = p.support; | ||
| cover.emplace(key, std::move(entry)); |
| cover.emplace(key, std::move(entry)); | |
| cover.emplace(std::move(key), std::move(entry)); |
| std::vector<Condition> conditions; | ||
| conditions.reserve(cover.size()); | ||
|
|
||
| for (auto& [key, entry] : cover) { |
| for (auto& [key, entry] : cover) { | |
| for (auto& [_, entry] : cover) { |
| std::size_t total_joined = 0; | ||
| for (auto const& p : patterns) { | ||
| total_joined += p.support; | ||
| } |
IMHO, it is better to move this calculation into the MinimalCover() method, since the result is used only there.
| public: | ||
| explicit CureCind(config::InputTables& input_tables); | ||
|
|
Our convention for ordering class sections is private/protected/public.
| .def("get_condition_attributes", | ||
| [](CIND const& cind) { return VectorToTuple(cind.conditional_attributes); }); | ||
| [](CIND const& cind) { return VectorToTuple(cind.conditional_attributes); }) | ||
| .def("get_ind_string", [](CIND const& cind) { return cind.ind.ToLongString(); }); |
From the point of view of the binding requirements (see Wiki/Development‐Features), it would be more correct to let the user get the full IND object embedded in the CIND. You can simply add a get_ind method so that the user can do whatever they want with the embedded IND.
Please note that because the embedded IND is stored by reference in the CIND class, accessing it after the algorithm object is destroyed will result in an error. It seems worth rewriting the code so that the IND is copied or moved into the CIND object.
A couple of days ago, a common function for this kind of code was added to py_util.h. Please use it.
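To illustrate the lifetime concern raised in this thread, here is a small sketch in which the result object owns its IND by value, so it stays usable after the algorithm object is gone. Ind, Cind and Algo are hypothetical stand-ins, not the actual Desbordante classes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the real IND class.
struct Ind {
    std::string text;
    std::string ToLongString() const { return text; }
};

// Hypothetical CIND that stores its IND by value instead of by reference,
// so it does not depend on storage owned by the algorithm.
struct Cind {
    Ind ind;
    std::vector<std::string> conditional_attributes;
};

// Hypothetical algorithm that owns the discovered INDs.
class Algo {
    std::vector<Ind> inds_{Ind{"R.a -> S.b"}};

public:
    std::vector<Cind> Mine() const {
        std::vector<Cind> result;
        for (Ind const& ind : inds_) {
            result.push_back(Cind{ind, {"c"}});  // copies the IND into the CIND
        }
        return result;
    }
};

// Returns results that outlive the algorithm object that produced them.
std::vector<Cind> MineAndDestroyAlgorithm() {
    auto algo = std::make_unique<Algo>();
    std::vector<Cind> cinds = algo->Mine();
    algo.reset();  // the algorithm is destroyed here
    return cinds;  // still valid: each Cind holds its own Ind
}
```

Had Cind held an Ind const& into inds_, reading it after algo.reset() would be undefined behavior, which is the situation the Python bindings can run into.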
| TEST_P(TestCureCind, TotalConditions) { | ||
| auto const& p = GetParam(); | ||
| auto mp = MakeCureParams(p.support); | ||
| auto cind_algo = algos::CreateAndLoadAlgorithm<algos::cind::CindAlgorithm>(mp); | ||
| cind_algo->Execute(); | ||
| ASSERT_FALSE(cind_algo->CINDList().empty()); | ||
| size_t total_conditions = 0; | ||
| for (auto const& cind : cind_algo->CINDList()) { | ||
| total_conditions += cind.ConditionsNumber(); | ||
| } | ||
| ASSERT_EQ(total_conditions, p.expected_total_conditions); | ||
| } |
It's my mistake that I missed this last time, but it's worth rewriting the tests so that the conditions themselves are checked too, not just their number. You can see how this is done in, for example, test_apriori.cpp or test_cfd_algos.cpp.
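One possible shape for such a check, sketched without GoogleTest so it stays self-contained: compare the mined conditions with the expected ones regardless of order. The Condition alias and both helper names are assumptions for illustration; in the real tests the comparison would presumably be wrapped in an ASSERT_EQ or EXPECT_EQ.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical representation: one condition is a row of attribute values.
using Condition = std::vector<std::string>;

// Returns a sorted copy so that comparison ignores the mining order.
std::vector<Condition> Sorted(std::vector<Condition> conditions) {
    std::sort(conditions.begin(), conditions.end());
    return conditions;
}

// True when both lists contain exactly the same conditions.
bool SameConditions(std::vector<Condition> const& actual,
                    std::vector<Condition> const& expected) {
    return Sorted(actual) == Sorted(expected);
}
```

A test built on this fails when a condition has the wrong values even if the total count happens to match.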
| int const lv = lhs_attr->GetValue(lhs_row); | ||
| for (std::size_t rhs_row : it->second) { | ||
| int const rv = rhs_attr->GetValue(rhs_row); | ||
| pair_counts[{lv, rv}]++; |
| pair_counts[{lv, rv}]++; | |
| ++pair_counts[{lv, rv}]; |
| auto key = MakeKey(lhs_row, attrs.lhs_inclusion); | ||
| auto it = rhs_index.find(key); |
IMHO, it's better this way because we don't need the key anymore.
| auto key = MakeKey(lhs_row, attrs.lhs_inclusion); | |
| auto it = rhs_index.find(key); | |
| auto it = rhs_index.find(MakeKey(lhs_row, attrs.lhs_inclusion)); |
| @@ -1,4 +1,6 @@ | |||
| set(NAME cind.miners) | |||
This wasn't correct: we have at least one target per algorithm. Make yours a separate target and rename this one accordingly.
| auto* cure = static_cast<CureCind*>(cind_miner_.get()); | ||
| config::Option<unsigned int> support_opt{&cure->min_support_, | ||
| config::names::kCindMinSupport, | ||
| config::descriptions::kDCindMinSupport, 2u}; | ||
| support_opt.SetValueCheck([](unsigned int support) { | ||
| if (support < 1) { | ||
| throw config::ConfigurationError("ERROR: support must be >= 1."); | ||
| } | ||
| }); | ||
| RegisterOption(std::move(support_opt)); |
For consistency with the rest of the code, it is better to write it like this.
| auto* cure = static_cast<CureCind*>(cind_miner_.get()); | |
| auto support_check = [](unsigned int support) { | |
| if (support < 1) { | |
| throw config::ConfigurationError("Support must be >= 1."); | |
| } | |
| }; | |
| RegisterOption(config::Option<unsigned int>{&cure->min_support_, config::names::kCindMinSupport, config::descriptions::kDCindMinSupport, 2u}.SetValueCheck(std::move(support_check))); |
| private: | ||
| friend class CindAlgorithm; | ||
|
|
||
| uint min_support_{2}; |
uint from sys/types.h is not a standard C++ type, and using it may lead to compilation errors on other platforms. Just replace it with the standard unsigned int.
| struct CoverEntry { | ||
| std::vector<std::string> values; | ||
| std::size_t support{0}; | ||
| }; | ||
|
|
||
| using CoverKey = std::pair<std::size_t, int>; | ||
| std::unordered_map<CoverKey, CoverEntry, PairIntHash> cover; | ||
|
|
||
| std::size_t total_joined = 0; | ||
| for (PatternPair const& p : patterns) { | ||
| total_joined += p.support; | ||
| } | ||
|
|
||
| for (PatternPair const& p : patterns) { | ||
| CoverKey key{p.lhs_attr_idx, p.lhs_value}; | ||
| auto it = cover.find(key); | ||
|
|
||
| if (it == cover.end()) { | ||
| CoverEntry entry; | ||
| entry.values.resize(total_attrs, kAnyValue); | ||
| entry.values[p.lhs_attr_idx] = | ||
| attrs.lhs_conditional[p.lhs_attr_idx]->DecodeValue(p.lhs_value); | ||
| entry.values[lhs_cond_size + p.rhs_attr_idx] = | ||
| attrs.rhs_conditional[p.rhs_attr_idx]->DecodeValue(p.rhs_value); | ||
| entry.support = p.support; | ||
| cover.emplace(std::move(key), std::move(entry)); | ||
| } else { | ||
| CoverEntry& entry = it->second; | ||
| std::size_t const rhs_pos = lhs_cond_size + p.rhs_attr_idx; | ||
| std::string const rhs_decoded = | ||
| attrs.rhs_conditional[p.rhs_attr_idx]->DecodeValue(p.rhs_value); | ||
|
|
||
| if (entry.values[rhs_pos] == kAnyValue) { | ||
| entry.values[rhs_pos] = rhs_decoded; | ||
| } else if (entry.values[rhs_pos].find(rhs_decoded) == std::string::npos) { | ||
| // Disjunction: append with comma | ||
| entry.values[rhs_pos] += ", " + rhs_decoded; | ||
| } | ||
| entry.support += p.support; | ||
| } | ||
| } |
I think this version makes it clearer what is going on.
| std::size_t total_joined = 0; | |
| for (PatternPair const& p : patterns) { | |
| total_joined += p.support; | |
| } | |
| struct CoverEntry { | |
| std::vector<std::string> values; | |
| std::size_t support{0}; | |
| explicit CoverEntry(std::size_t total_attrs): values(total_attrs, kAnyValue) {} | |
| }; | |
| using CoverKey = std::pair<std::size_t, int>; | |
| std::unordered_map<CoverKey, CoverEntry, PairIntHash> cover; | |
| for (PatternPair const& p : patterns) { | |
| CoverKey key{p.lhs_attr_idx, p.lhs_value}; | |
| auto [it, inserted] = cover.try_emplace(std::move(key), total_attrs); | |
| CoverEntry& entry = it->second; | |
| if (inserted) { | |
| entry.values[p.lhs_attr_idx] = | |
| attrs.lhs_conditional[p.lhs_attr_idx]->DecodeValue(p.lhs_value); | |
| } | |
| std::size_t const rhs_pos = lhs_cond_size + p.rhs_attr_idx; | |
| std::string const rhs_decoded = | |
| attrs.rhs_conditional[p.rhs_attr_idx]->DecodeValue(p.rhs_value); | |
| if (entry.values[rhs_pos] == kAnyValue) { | |
| entry.values[rhs_pos] = rhs_decoded; | |
| } else if (entry.values[rhs_pos].find(rhs_decoded) == std::string::npos) { | |
| // Disjunction: append with comma | |
| entry.values[rhs_pos] += ", " + rhs_decoded; | |
| } | |
| entry.support += p.support; | |
| } |
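The suggestion above relies on std::unordered_map::try_emplace constructing the entry only when the key is absent and reporting that through the returned flag. A reduced sketch of that behavior, with a hypothetical Entry type and AddSupport helper (the slot count of 3 and the placeholder strings are arbitrary):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical reduced version of CoverEntry: the constructor pre-fills
// all slots with a wildcard, so no separate resize() call is needed.
struct Entry {
    std::vector<std::string> values;
    std::size_t support{0};
    explicit Entry(std::size_t total_attrs) : values(total_attrs, "_") {}
};

// Adds support to the entry for key, creating the entry on first use.
// Returns true when the key was not present before the call.
bool AddSupport(std::unordered_map<int, Entry>& cover, int key, std::size_t support) {
    // The Entry is constructed from the trailing argument only if key is absent.
    auto [it, inserted] = cover.try_emplace(key, std::size_t{3});
    if (inserted) {
        it->second.values[0] = "lhs";  // one-time initialization for new keys
    }
    it->second.support += support;  // shared path for new and existing keys
    return inserted;
}
```

This is why the find/emplace/else structure collapses into a single code path: only the LHS slot initialization stays conditional.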
Summary
This PR adds the Cure condition mining algorithm to the CIND framework, complementing the existing CINDERELLA and PLI-CIND miners.
Background
A Conditional Inclusion Dependency (CIND) is an IND that holds only on the subset of rows matching a pattern condition on non-inclusion attributes (see #664 for the formal definition). Desbordante already provides CINDERELLA (itemset-based) and PLI-CIND (PLI-based) condition miners, both controlled by validity/completeness thresholds. Cure offers a different trade-off: it is controlled by a single support threshold (minimum number of joined tuples) and produces compact patterns with disjunctive RHS values.
The Cure algorithm
The algorithm of O. Curé (2012) discovers conditional patterns in two phases:
1. Discovery: for each approximate IND, enumerate pairs of (LHS conditional attribute, RHS conditional attribute) and hash-join rows on the inclusion key. For each joined LHS/RHS value pair, count co-occurrences and keep those with count ≥ support.
2. Minimal cover: merge patterns sharing the same (LHS conditional attribute, LHS value) key into a single tableau row. When multiple RHS values occur for the same LHS key, they are folded into a comma-separated disjunction in the corresponding RHS slot.
Per-pattern validity and completeness are derived from the pattern support and the total number of joined tuples.
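The two phases can be sketched on a toy input as follows. This is a simplified illustration under stated assumptions, not the code from this PR: Row, Discover, MinimalCover and CoverRow are hypothetical names, each side has a single inclusion attribute and a single conditional attribute, values are plain strings rather than encoded integers, and the RHS disjunction is kept as a std::set instead of a comma-separated string. Validity and completeness are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical flat row: inclusion key plus one conditional attribute value.
struct Row {
    int key;
    std::string cond;
};

using Patterns = std::map<std::pair<std::string, std::string>, std::size_t>;

// Phase 1: hash-join on the inclusion key, count (LHS value, RHS value)
// co-occurrences, and keep the pairs whose count reaches min_support.
Patterns Discover(std::vector<Row> const& lhs, std::vector<Row> const& rhs,
                  std::size_t min_support) {
    std::unordered_map<int, std::vector<std::size_t>> rhs_index;
    for (std::size_t i = 0; i < rhs.size(); ++i) {
        rhs_index[rhs[i].key].push_back(i);
    }
    Patterns counts;
    for (Row const& l : lhs) {
        auto it = rhs_index.find(l.key);
        if (it == rhs_index.end()) continue;  // no join partner
        for (std::size_t r : it->second) {
            ++counts[{l.cond, rhs[r].cond}];
        }
    }
    for (auto it = counts.begin(); it != counts.end();) {
        it = it->second < min_support ? counts.erase(it) : std::next(it);
    }
    return counts;
}

// One tableau row of the cover: all RHS values seen for one LHS value.
struct CoverRow {
    std::set<std::string> rhs_values;
    std::size_t support = 0;
};

// Phase 2: merge patterns with the same LHS value into a single row.
std::map<std::string, CoverRow> MinimalCover(Patterns const& patterns) {
    std::map<std::string, CoverRow> cover;
    for (auto const& [pattern, support] : patterns) {
        CoverRow& row = cover[pattern.first];
        row.rhs_values.insert(pattern.second);
        row.support += support;
    }
    return cover;
}
```

For example, with LHS rows (1, de), (2, de), (3, en) and RHS rows (1, Berlin), (2, Berlin), (2, Bonn), (3, London), a support of 1 yields the cover rows de -> {Berlin, Bonn} with support 3 and en -> {London} with support 1, while a support of 2 keeps only the pattern (de, Berlin).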
Reference
Changes
- CureCind (core/algorithms/cind/condition_miners/cure_cind.{h,cpp}): new condition miner implementing the two-phase Cure algorithm.
- AlgoType::cure_cind: added alongside the existing cinderella and pli_cind in core/algorithms/cind/types.h.
- CindAlgorithm dispatch: instantiates CureCind when algo_type = cure_cind (core/algorithms/cind/cind_algorithm.cpp).
- min_support_ member added to the CindMiner base class (used only by Cure).
- support config option (core/config/conditions/support/option.{h,cpp}): new minimum-tuples threshold, default 2.
- Tests (support in {1, 2, 3}) added to src/tests/unit/test_cind_algorithms.cpp; a cind.algos test target is also wired up in the unit test CMakeLists.
- examples/basic/mining_cind_cure.py runs CINDERELLA and Cure on the same DBpedia German/English persons dataset and prints a side-by-side comparison.