Add CIND Cure Algorithm with Example and Tests #729

Open
ALanovaya wants to merge 9 commits into Desbordante:main from ALanovaya:cind/cure-algorithm

@ALanovaya commented Apr 21, 2026
Contributor

Summary

This PR adds the Cure condition mining algorithm to the CIND framework, complementing the existing CINDERELLA and PLI-CIND miners.

Background

A Conditional Inclusion Dependency (CIND) is an IND that holds only on the subset of rows matching a pattern condition on non-inclusion attributes (see #664 for the formal definition). Desbordante already provides CINDERELLA (itemset-based) and PLI-CIND (PLI-based) condition miners, both controlled by validity/completeness thresholds. Cure offers a different trade-off: it is controlled by a single support threshold (minimum number of joined tuples) and produces compact patterns with disjunctive RHS values.

The Cure algorithm

The algorithm of O. Curé (2012) discovers conditional patterns in two phases:

  1. Discovery — for each approximate IND, enumerate pairs of (LHS conditional attribute, RHS conditional attribute) and hash-join rows on the inclusion key. For each joined LHS/RHS value pair, count co-occurrences and keep those with count ≥ support.

  2. Minimal cover — merge patterns sharing the same (LHS conditional attribute, LHS value) key into a single tableau row. When multiple RHS values occur for the same LHS key, they are folded into a comma-separated disjunction in the corresponding RHS slot.

Per-pattern validity and completeness are derived from the pattern support and the total number of joined tuples.
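
For readers unfamiliar with the discovery phase, the following is a minimal sketch of the idea, not the code from this PR. It assumes a single integer inclusion key and a single conditional attribute per side; `Row`, `Pattern` and `DiscoverPatterns` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row layout for this sketch: one inclusion-key value and one
// conditional-attribute value per row, both already dictionary-encoded.
struct Row {
    int key;
    int cond;
};

// A frequent (LHS value, RHS value) pair together with its join support.
struct Pattern {
    int lhs_value;
    int rhs_value;
    std::size_t support;
};

// Phase 1 (discovery): hash-join LHS and RHS rows on the inclusion key, count
// co-occurring conditional values, and keep pairs with count >= min_support.
std::vector<Pattern> DiscoverPatterns(std::vector<Row> const& lhs, std::vector<Row> const& rhs,
                                      std::size_t min_support) {
    assert(min_support >= 1);
    // Build side: inclusion key -> conditional values of the matching RHS rows.
    std::unordered_map<int, std::vector<int>> rhs_index;
    for (Row const& r : rhs) {
        rhs_index[r.key].push_back(r.cond);
    }

    // Probe side. std::map is used only to keep the output order deterministic.
    std::map<std::pair<int, int>, std::size_t> pair_counts;
    for (Row const& l : lhs) {
        auto it = rhs_index.find(l.key);
        if (it == rhs_index.end()) continue;  // no join partner for this LHS row
        for (int rv : it->second) {
            ++pair_counts[{l.cond, rv}];
        }
    }

    std::vector<Pattern> patterns;
    for (auto const& [values, count] : pair_counts) {
        if (count >= min_support) {
            patterns.push_back({values.first, values.second, count});
        }
    }
    return patterns;
}
```

With a support threshold of 1 every joined value pair survives; raising the threshold prunes pairs that co-occur in fewer joined tuples.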

Reference

  • O. Curé. "Improving the Data Quality of Drug Databases using Conditional Dependencies and Ontologies". ACM JDIQ 4(1):20, 2012.

Changes

  • CureCind (core/algorithms/cind/condition_miners/cure_cind.{h,cpp}) — new condition miner implementing the two-phase Cure algorithm.
  • AlgoType::cure_cind — added alongside existing cinderella and pli_cind in core/algorithms/cind/types.h.
  • CindAlgorithm dispatch — instantiates CureCind when algo_type = cure_cind (core/algorithms/cind/cind_algorithm.cpp).
  • min_support_ member added to CindMiner base class (used only by Cure).
  • support config option (core/config/conditions/support/option.{h,cpp}) — new minimum-tuples threshold, default 2.
  • Tests — three parameterized cases (support in {1, 2, 3}) added to src/tests/unit/test_cind_algorithms.cpp; a cind.algos test target is also wired up in the unit test CMakeLists.
  • Example (examples/basic/mining_cind_cure.py): runs CINDERELLA and Cure on the same DBpedia German/English persons dataset and prints a side-by-side comparison.

@ALanovaya force-pushed the cind/cure-algorithm branch from 2b042d1 to c70be05 on April 21, 2026 20:37
Comment thread src/core/algorithms/cind/condition_miners/cind_miner.cpp
Comment thread src/core/config/CMakeLists.txt Outdated
Comment thread src/core/model/table/encoded_tables.h
@ALanovaya force-pushed the cind/cure-algorithm branch 4 times, most recently from b5c081f to 9a57e71 on May 2, 2026 18:21
@ALanovaya force-pushed the cind/cure-algorithm branch from 84044db to 24261a0 on May 2, 2026 19:40
@ALanovaya requested a review from wildsor on May 2, 2026 21:09
Comment on lines +12 to +34
struct VecIntHash {
    std::size_t operator()(std::vector<int> const& vec) const noexcept {
        return boost::hash_value(vec);
    }
};

struct PairIntHash {
    std::size_t operator()(std::pair<int, int> const& p) const noexcept {
        std::size_t seed = 0;
        boost::hash_combine(seed, p.first);
        boost::hash_combine(seed, p.second);
        return seed;
    }
};

std::vector<int> MakeKey(std::size_t row, AttrsType const& cols) {
    std::vector<int> key;
    key.reserve(cols.size());
    for (auto const* c : cols) {
        key.push_back(c->GetValue(row));
    }
    return key;
}

Collaborator

I think VecIntHash and MakeKey() should be moved to a separate namespace cind::utils. This will avoid code duplication in the verifier and all mining algorithms. Also, the MakeKey() logic is often implicitly used in the pli_cind and cinderella algorithms and can be replaced by an explicit call to this function.
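
A rough sketch of what such a shared header could look like is given below. It is an illustration under assumptions, not the project's code: the hash uses a hand-rolled combine instead of `boost::hash_value` only to keep the example dependency-free, `MakeKey` is templated because the real `AttrsType` is not reproduced here, and `ToyColumn` is a hypothetical stand-in for an encoded column.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

namespace cind::utils {

// Hash for an encoded multi-column key. The PR uses boost::hash_value; this
// sketch uses a hand-rolled combine only to stay free of dependencies.
struct VecIntHash {
    std::size_t operator()(std::vector<int> const& vec) const noexcept {
        std::size_t seed = vec.size();
        for (int v : vec) {
            seed ^= std::hash<int>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};

// Builds the key of `row` over `cols`. Works for any column type that
// provides GetValue(std::size_t).
template <typename Column>
std::vector<int> MakeKey(std::size_t row, std::vector<Column const*> const& cols) {
    std::vector<int> key;
    key.reserve(cols.size());
    for (Column const* c : cols) {
        key.push_back(c->GetValue(row));
    }
    return key;
}

}  // namespace cind::utils

// Hypothetical stand-in for an encoded column, used only to exercise the helpers.
struct ToyColumn {
    std::vector<int> values;
    int GetValue(std::size_t row) const { return values[row]; }
};
```

Each miner and the verifier could then build its row index as `std::unordered_map<std::vector<int>, std::vector<std::size_t>, cind::utils::VecIntHash>` and fill it through explicit `MakeKey` calls.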

entry.values[lhs_cond_size + p.rhs_attr_idx] =
        attrs.rhs_conditional[p.rhs_attr_idx]->DecodeValue(p.rhs_value);
entry.support = p.support;
cover.emplace(key, std::move(entry));

Collaborator

Suggested change
- cover.emplace(key, std::move(entry));
+ cover.emplace(std::move(key), std::move(entry));

std::vector<Condition> conditions;
conditions.reserve(cover.size());

for (auto& [key, entry] : cover) {

Collaborator

Suggested change
- for (auto& [key, entry] : cover) {
+ for (auto& [_, entry] : cover) {

Comment on lines +190 to +193
std::size_t total_joined = 0;
for (auto const& p : patterns) {
    total_joined += p.support;
}

Collaborator

IMHO, it is better to move this calculation into the MinimalCover() method, since the result is used only there.

Comment on lines +8 to +10
public:
    explicit CureCind(config::InputTables& input_tables);

Collaborator

Our convention for ordering the sections of a class is private/protected/public.

  .def("get_condition_attributes",
-      [](CIND const& cind) { return VectorToTuple(cind.conditional_attributes); });
+      [](CIND const& cind) { return VectorToTuple(cind.conditional_attributes); })
+ .def("get_ind_string", [](CIND const& cind) { return cind.ind.ToLongString(); });

Collaborator

From the point of view of the binding requirements (see Wiki/Development‐Features), it would be more correct to let the user get the full IND object embedded in the CIND. You can simply add a get_ind method so that users can do whatever they want with the embedded IND.

Collaborator

Please note that because the embedded IND is stored by reference in the CIND class, accessing it after the algorithm object has been destroyed will result in an error. It seems worth rewriting the code so that the IND is copied or moved into the CIND object.
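
To make the lifetime concern concrete, here is a small sketch of the owning variant. All types (`Ind`, `OwningCind`, `ToyAlgorithm`) are hypothetical, heavily simplified stand-ins and do not mirror the real classes; the point is only that a result holding its IND by value survives the destruction of the algorithm that produced it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an inclusion dependency.
struct Ind {
    std::string description;
};

// Owning variant: the IND is copied (or moved) into the CIND, so the CIND
// stays valid after the algorithm object that produced it is gone.
struct OwningCind {
    Ind ind;
    std::vector<std::string> conditions;
};

struct ToyAlgorithm {
    std::vector<Ind> inds{Ind{"R.a -> S.b"}};

    // Returns results that do not refer back into `inds`.
    std::vector<OwningCind> Results() const {
        std::vector<OwningCind> out;
        for (Ind const& ind : inds) {
            out.push_back(OwningCind{ind, {"some condition"}});
        }
        return out;
    }
};

// Simulates the binding scenario: the results outlive the algorithm.
std::vector<OwningCind> RunAndDiscardAlgorithm() {
    auto algo = std::make_unique<ToyAlgorithm>();
    std::vector<OwningCind> results = algo->Results();
    algo.reset();  // the algorithm and its IND storage are destroyed here
    return results;
}
```

Had `OwningCind` stored `Ind const&` instead, reading `results[0].ind` after `algo.reset()` would be undefined behavior, which is the situation a Python user can reach by letting the algorithm object be garbage-collected while keeping the result list.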

Comment thread src/python_bindings/cind/bind_cind.cpp Outdated
Comment on lines 19 to 29

Collaborator

A common helper for this kind of code was added to py_util.h a couple of days ago. Please use it.

Comment on lines +129 to +140
TEST_P(TestCureCind, TotalConditions) {
    auto const& p = GetParam();
    auto mp = MakeCureParams(p.support);
    auto cind_algo = algos::CreateAndLoadAlgorithm<algos::cind::CindAlgorithm>(mp);
    cind_algo->Execute();
    ASSERT_FALSE(cind_algo->CINDList().empty());
    size_t total_conditions = 0;
    for (auto const& cind : cind_algo->CINDList()) {
        total_conditions += cind.ConditionsNumber();
    }
    ASSERT_EQ(total_conditions, p.expected_total_conditions);
}

Collaborator

It's my mistake that I missed this last time, but it's worth rewriting the tests so that the conditions themselves are checked, not just their number. You can see how it's done, for example, in test_apriori.cpp or test_cfd_algos.cpp.
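
One possible shape for such a check is sketched below. It is not taken from test_apriori.cpp or test_cfd_algos.cpp; `ExpectedCondition`, `Normalized` and `SameConditions` are hypothetical helpers, and the sketch assumes each mined condition can be flattened to the string form of its IND plus its pattern values.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical flattened form of one mined condition: the IND it belongs to
// plus the pattern values, all rendered as strings for comparison.
struct ExpectedCondition {
    std::string ind;
    std::vector<std::string> values;

    bool operator==(ExpectedCondition const& other) const {
        return ind == other.ind && values == other.values;
    }
    bool operator<(ExpectedCondition const& other) const {
        return ind != other.ind ? ind < other.ind : values < other.values;
    }
};

// Sorts a copy so the comparison does not depend on the order in which an
// unordered container happened to emit the conditions.
std::vector<ExpectedCondition> Normalized(std::vector<ExpectedCondition> conditions) {
    std::sort(conditions.begin(), conditions.end());
    return conditions;
}

// True when both lists contain exactly the same conditions, in any order.
bool SameConditions(std::vector<ExpectedCondition> const& actual,
                    std::vector<ExpectedCondition> const& expected) {
    return Normalized(actual) == Normalized(expected);
}
```

In a GoogleTest case the parameter struct could carry the expected list instead of `expected_total_conditions`, and the final assertion would become `ASSERT_TRUE(SameConditions(actual, p.expected))`. Sorting first matters here because the cover is built in an unordered map, so the emission order is not stable.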

int const lv = lhs_attr->GetValue(lhs_row);
for (std::size_t rhs_row : it->second) {
    int const rv = rhs_attr->GetValue(rhs_row);
    pair_counts[{lv, rv}]++;

Collaborator

Suggested change
- pair_counts[{lv, rv}]++;
+ ++pair_counts[{lv, rv}];

Comment on lines +104 to +105
auto key = MakeKey(lhs_row, attrs.lhs_inclusion);
auto it = rhs_index.find(key);

Collaborator

IMHO, it's better this way because the key is not needed afterwards.

Suggested change
- auto key = MakeKey(lhs_row, attrs.lhs_inclusion);
- auto it = rhs_index.find(key);
+ auto it = rhs_index.find(MakeKey(lhs_row, attrs.lhs_inclusion));

@@ -1,4 +1,6 @@
set(NAME cind.miners)

Collaborator

This wasn't correct: we have at least one target per algorithm. Make yours a separate target and rename this one accordingly.

Comment thread src/core/algorithms/cind/condition_miners/cure_cind.cpp Outdated
Comment thread src/core/algorithms/cind/cind_algorithm.cpp Outdated
Comment thread src/core/algorithms/cind/cind_algorithm.cpp
Comment thread src/core/config/conditions/support/option.cpp Outdated
Comment thread src/tests/unit/test_cind_algorithms.cpp Outdated
@ALanovaya requested review from Oddin60F and wildsor on May 5, 2026 14:21
Comment thread src/core/algorithms/cind/cind_algorithm.cpp
@ALanovaya requested a review from wildsor on May 7, 2026 14:48
Comment on lines +69 to +78
auto* cure = static_cast<CureCind*>(cind_miner_.get());
config::Option<unsigned int> support_opt{&cure->min_support_,
                                         config::names::kCindMinSupport,
                                         config::descriptions::kDCindMinSupport, 2u};
support_opt.SetValueCheck([](unsigned int support) {
    if (support < 1) {
        throw config::ConfigurationError("ERROR: support must be >= 1.");
    }
});
RegisterOption(std::move(support_opt));

Collaborator

For consistency with the rest of the code, it is better to write it like this.

Suggested change (replaces the lines quoted above)
auto* cure = static_cast<CureCind*>(cind_miner_.get());
auto support_check = [](unsigned int support) {
    if (support < 1) {
        throw config::ConfigurationError("Support must be >= 1.");
    }
};
RegisterOption(config::Option<unsigned int>{&cure->min_support_, config::names::kCindMinSupport,
                                            config::descriptions::kDCindMinSupport, 2u}
                       .SetValueCheck(std::move(support_check)));

private:
    friend class CindAlgorithm;

    uint min_support_{2};

Collaborator

uint from sys/types.h is not a standard C++ type, and its use may lead to compilation errors on other platforms. Just replace it with the standard unsigned int.

Comment on lines +118 to +158
struct CoverEntry {
    std::vector<std::string> values;
    std::size_t support{0};
};

using CoverKey = std::pair<std::size_t, int>;
std::unordered_map<CoverKey, CoverEntry, PairIntHash> cover;

std::size_t total_joined = 0;
for (PatternPair const& p : patterns) {
    total_joined += p.support;
}

for (PatternPair const& p : patterns) {
    CoverKey key{p.lhs_attr_idx, p.lhs_value};
    auto it = cover.find(key);

    if (it == cover.end()) {
        CoverEntry entry;
        entry.values.resize(total_attrs, kAnyValue);
        entry.values[p.lhs_attr_idx] =
                attrs.lhs_conditional[p.lhs_attr_idx]->DecodeValue(p.lhs_value);
        entry.values[lhs_cond_size + p.rhs_attr_idx] =
                attrs.rhs_conditional[p.rhs_attr_idx]->DecodeValue(p.rhs_value);
        entry.support = p.support;
        cover.emplace(std::move(key), std::move(entry));
    } else {
        CoverEntry& entry = it->second;
        std::size_t const rhs_pos = lhs_cond_size + p.rhs_attr_idx;
        std::string const rhs_decoded =
                attrs.rhs_conditional[p.rhs_attr_idx]->DecodeValue(p.rhs_value);

        if (entry.values[rhs_pos] == kAnyValue) {
            entry.values[rhs_pos] = rhs_decoded;
        } else if (entry.values[rhs_pos].find(rhs_decoded) == std::string::npos) {
            // Disjunction: append with comma
            entry.values[rhs_pos] += ", " + rhs_decoded;
        }
        entry.support += p.support;
    }
}

Collaborator

This version seems to make it clearer what is going on.

Suggested change (replaces the lines quoted above)
std::size_t total_joined = 0;
for (PatternPair const& p : patterns) {
    total_joined += p.support;
}

struct CoverEntry {
    std::vector<std::string> values;
    std::size_t support{0};

    explicit CoverEntry(std::size_t total_attrs) : values(total_attrs, kAnyValue) {}
};

using CoverKey = std::pair<std::size_t, int>;
std::unordered_map<CoverKey, CoverEntry, PairIntHash> cover;

for (PatternPair const& p : patterns) {
    CoverKey key{p.lhs_attr_idx, p.lhs_value};
    auto [it, inserted] = cover.try_emplace(std::move(key), total_attrs);
    CoverEntry& entry = it->second;
    if (inserted) {
        entry.values[p.lhs_attr_idx] =
                attrs.lhs_conditional[p.lhs_attr_idx]->DecodeValue(p.lhs_value);
    }

    std::size_t const rhs_pos = lhs_cond_size + p.rhs_attr_idx;
    std::string const rhs_decoded =
            attrs.rhs_conditional[p.rhs_attr_idx]->DecodeValue(p.rhs_value);
    if (entry.values[rhs_pos] == kAnyValue) {
        entry.values[rhs_pos] = rhs_decoded;
    } else if (entry.values[rhs_pos].find(rhs_decoded) == std::string::npos) {
        // Disjunction: append with comma
        entry.values[rhs_pos] += ", " + rhs_decoded;
    }
    entry.support += p.support;
}
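
The `try_emplace` pattern can be tried in isolation with the standalone sketch below. It is a simplification under assumptions, not the PR's `MinimalCover()`: values are already decoded strings, there is a single RHS slot, and `Pattern`, `CoverRow` and `JoinDisjunction` are hypothetical names. It also keeps the RHS values in a vector and compares whole values, which differs from the substring search on the joined string shown in the review thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified pattern: one LHS attribute/value and one RHS value.
struct Pattern {
    std::size_t lhs_attr_idx;
    std::string lhs_value;
    std::string rhs_value;
    std::size_t support;
};

struct CoverRow {
    std::vector<std::string> rhs_values;  // distinct RHS values, in arrival order
    std::size_t support{0};
};

using CoverKey = std::pair<std::size_t, std::string>;

// Phase 2 (minimal cover): one row per (LHS attribute, LHS value) key.
// try_emplace inserts a default row on first sight and otherwise returns the
// existing one, so a single code path handles both cases.
std::map<CoverKey, CoverRow> MinimalCover(std::vector<Pattern> const& patterns) {
    std::map<CoverKey, CoverRow> cover;
    for (Pattern const& p : patterns) {
        auto it = cover.try_emplace(CoverKey{p.lhs_attr_idx, p.lhs_value}).first;
        CoverRow& row = it->second;
        // Whole-value comparison: "ab" and "a" count as different values.
        if (std::find(row.rhs_values.begin(), row.rhs_values.end(), p.rhs_value) ==
            row.rhs_values.end()) {
            row.rhs_values.push_back(p.rhs_value);
        }
        row.support += p.support;
    }
    return cover;
}

// Renders the disjunction as a comma-separated string, as the PR describes.
std::string JoinDisjunction(std::vector<std::string> const& values) {
    std::string out;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i != 0) out += ", ";
        out += values[i];
    }
    return out;
}
```

Deferring the join to the end keeps the deduplication independent of how the values happen to be spelled; whether that trade-off suits the real tableau representation is a separate decision for the PR.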
