In critical fields such as healthcare, finance, and industrial operations, the reliability of artificial intelligence (AI) models depends not only on their predictive accuracy but also on their ability to explain their decisions. This has led to growing attention toward Explainable AI (XAI), which aims to provide transparency and accountability in model predictions. Among the various XAI approaches, feature attribution methods (AMs) are widely employed. They assign importance scores to input features, highlighting the factors that influence a model’s output. However, the usefulness of these methods is closely tied to their faithfulness, that is, whether the identified features truly reflect what the model relies upon. Faithfulness is typically evaluated through perturbation-based techniques, in which input features are modified according to their attributed importance and the resulting changes in model performance are analyzed.
Despite their popularity, existing evaluation practices, particularly those relying on the Area Under Perturbation Curve (AUPC), show notable shortcomings when applied to time-series data. Our earlier investigations revealed that such metrics can lead to misleading conclusions, especially when the choice of perturbation method or region size distorts the evaluation. To address these issues, this work introduces a new metric, the Consistency-Magnitude Index (CMI), which integrates two complementary measures: the Decaying Degradation Score (DDS) and the Perturbation Effect Size (PES). Together, they provide a more reliable and consistent assessment of attribution faithfulness.
Furthermore, we propose an adapted evaluation methodology that leverages a diverse set of perturbation strategies rather than depending on a single one, ensuring robustness across varying datasets and model architectures. Our experimental study, conducted on multiple time-series datasets and deep learning models, highlights the significant role of perturbation methods and region size in faithfulness evaluation. Based on these insights, we offer practical guidelines for selecting suitable attribution and perturbation methods.
To address the limitations of existing evaluation strategies for attribution methods (AMs), this project proposes a robust perturbation-based evaluation framework combined with a novel metric, the Consistency-Magnitude Index (CMI). The technique is designed to reliably measure the faithfulness of AMs in the context of neural time-series classification models, where inconsistencies between explanation methods are particularly problematic.
The framework begins by applying multiple Perturbation Methods (PMs) to input features based on the importance scores provided by different AMs. Instead of relying on a single perturbation strategy, the methodology systematically explores a diverse set of PMs, reducing the risk of biased or misleading evaluations. Both highly relevant and low-relevance features are perturbed in order to observe their respective influence on model predictions.
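The perturbation step described above can be sketched as follows. This is a minimal illustration and not the implementation used in this work: the function names, the region-based ranking by mean relevance, and the concrete definitions of the two fill strategies (loosely modeled on the Zero and SampleMean PMs) are all assumptions made for the example.

```python
import numpy as np

def perturb_regions(x, relevance, region_size, fraction, order, fill):
    """Perturb a fraction of fixed-size regions of a 1-D series.

    Regions are ranked by their mean attributed relevance.
    order="morf" perturbs the most relevant regions first,
    order="lerf" perturbs the least relevant regions first.
    fill is a callable (series, start, stop) -> replacement values.
    All names here are hypothetical and chosen for illustration.
    """
    x = np.asarray(x, dtype=float)
    n_regions = len(x) // region_size
    # Mean relevance per region; a trailing remainder is left untouched.
    scores = np.asarray(relevance, dtype=float)[: n_regions * region_size]
    scores = scores.reshape(n_regions, region_size).mean(axis=1)
    ranked = np.argsort(scores, kind="stable")  # ascending: least relevant first
    if order == "morf":
        ranked = ranked[::-1]
    k = int(round(fraction * n_regions))
    out = x.copy()  # the input series is never modified in place
    for r in ranked[:k]:
        start, stop = r * region_size, (r + 1) * region_size
        out[start:stop] = fill(x, start, stop)
    return out

def zero_fill(x, start, stop):
    """Assumed definition: replace the region with zeros."""
    return np.zeros(stop - start)

def sample_mean_fill(x, start, stop):
    """Assumed definition: replace the region with the mean of the whole series."""
    return np.full(stop - start, x.mean())
```

Calling `perturb_regions` with both orders and with several fill functions yields the perturbed inputs whose effect on the model's predictions is then compared.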
To quantify AM performance, the proposed CMI metric combines two complementary measures: the Decaying Degradation Score (DDS), which captures the degree of separation between relevant and irrelevant features, and the Perturbation Effect Size (PES), which evaluates how consistently an AM distinguishes important from unimportant features. Integrating these measures ensures that evaluations reflect both the magnitude and the reliability of the feature attributions.
Through this methodology, the framework gives practitioners actionable insights for selecting the most faithful attribution methods for a given dataset and architecture, thereby improving the interpretability and trustworthiness of deep learning systems in high-stakes domains.
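The idea of pairing a consistency measure with a magnitude measure can be illustrated with a small sketch. The text above does not state the formulas for DDS, PES, or CMI, so the functions below are hypothetical stand-ins and not the definitions used in this work: consistency is taken as the fraction of comparisons in which perturbing the most relevant features first (MoRF) degrades the prediction more than perturbing the least relevant first (LeRF), magnitude as a geometrically decaying weighted mean of the MoRF/LeRF gap, and the combination as their geometric mean.

```python
import numpy as np

def consistency_score(drop_morf, drop_lerf):
    """Fraction of (sample, step) pairs where the MoRF drop exceeds the LeRF drop."""
    drop_morf = np.asarray(drop_morf, dtype=float)
    drop_lerf = np.asarray(drop_lerf, dtype=float)
    return float(np.mean(drop_morf > drop_lerf))

def magnitude_score(drop_morf, drop_lerf, decay=0.5):
    """Decay-weighted mean gap between MoRF and LeRF drops, clipped at zero.

    Inputs have shape (samples, steps). Earlier perturbation steps receive
    larger weights; the geometric decay is an assumption of this sketch.
    """
    gap = np.asarray(drop_morf, dtype=float) - np.asarray(drop_lerf, dtype=float)
    weights = decay ** np.arange(gap.shape[1])
    weights = weights / weights.sum()
    return float(max(0.0, (gap.mean(axis=0) * weights).sum()))

def combined_index(consistency, magnitude):
    """Geometric mean of the two scores (an assumed way of combining them)."""
    return float(np.sqrt(consistency * magnitude))
```

With this kind of combination, an attribution method scores well only if the separation is both large and consistently in the expected direction; a high value on one component cannot compensate for a near-zero value on the other.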
Ranking of PMs by mean CMI for each binary classification dataset, across all models and region sizes.
Ranking of PMs by mean CMI for each multiclass classification dataset, across all models and region sizes.
Ranking of PMs by mean CMI for each anomaly detection dataset, across all models and region sizes.
Mean AM rankings across all datasets, models, and region sizes.
Mean AM rankings across all datasets and region sizes for the ResNet and Inception model architectures.
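The rankings referred to in these captions are obtained by averaging a score over models and region sizes and then sorting. A minimal sketch of that aggregation is given below; the record layout and the function name are assumptions made for illustration, and any values passed in would be placeholders and not results from this study.

```python
from collections import defaultdict

def rank_by_mean_score(records):
    """Rank methods per dataset by their mean score, best first.

    records is an iterable of (dataset, method, score) tuples, with one
    entry per model and region size. Ties are broken alphabetically.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for dataset, method, score in records:
        scores[dataset][method].append(score)
    ranking = {}
    for dataset, per_method in scores.items():
        means = {m: sum(v) / len(v) for m, v in per_method.items()}
        ranking[dataset] = sorted(means, key=lambda m: (-means[m], m))
    return ranking
```

The same function applies to AM rankings by passing attribution method names in place of PM names.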
This study highlights the critical influence of perturbation method (PM) selection on the faithfulness evaluation of attribution methods (AMs) in neural time-series classification. Through a comprehensive set of experiments, we demonstrated that the interplay between dataset characteristics and model architectures substantially affects the suitability of different PMs. Importantly, our results emphasize that evaluation metrics must account for both highly relevant and low-relevance features to assess AM performance fairly.
To address the limitations of existing approaches, we introduced the Consistency-Magnitude Index (CMI), a novel metric that reliably estimates the suitability of PMs for specific datasets and models. Combined with our proposed methodology, CMI enables robust and consistent evaluation of AM faithfulness. Our large-scale experimental setup—covering five datasets, five deep learning architectures, twelve attribution methods, twenty-three perturbation methods, two region sizes, and two perturbation orders—represents, to the best of our knowledge, the most extensive investigation of its kind in the time-series domain.
The results show that careful PM selection is indispensable, as no single perturbation strategy is universally optimal. While SampleMean, Zero, and the newly introduced Laplace PM serve as strong default choices across most scenarios, other PMs may outperform them in specific cases. Notably, perturbations based on neighboring regions showed promising results, whereas the commonly used UniformNoise PM exhibited inconsistent behavior, sometimes leading to misleading outcomes.
Regarding attribution methods, FeatureAblation generally provides the most faithful explanations across different datasets and models. GradCAM is a suitable alternative for convolutional architectures, while Integrated Gradients offers a practical balance between faithfulness and execution time. Conversely, methods such as GuidedBackprop, KernelSHAP, and LIME were unsuitable for raw time-series data.
Overall, our proposed methodology and the CMI metric establish a reliable framework for evaluating the faithfulness of attribution methods in time-series classification. Beyond providing practical guidelines for practitioners seeking trustworthy explanations, this work also equips AM developers with a rigorous tool for validating new methods. By ensuring more faithful explanations, our contributions enhance transparency and trust in deep learning systems, paving the way for their safer deployment in high-stakes domains.

