A reproducible research prototype for memory-assisted SOC triage and CIC-IDS2017 validation
Huỳnh Chí Trung
ORCID: 0009-0004-6155-3634
Email: huynhchitrung@dtu.edu.vn, huynhtrung.csc@gmail.com
MAAT-SOC: Memory-Augmented Alert Triage for Security Operations
The title foregrounds the primary scientific object of the repository: memory-assisted alert triage in Security Operations Centers. Leakage control and anti-memorization checks remain essential to scientific integrity, but they are treated as methodological requirements rather than the main title. This scope is more accurate than a broad "agentic AI SOC" claim because the repository demonstrates persistent entity-memory assisted alert triage and a leakage-aware CIC-IDS2017 validation pipeline; it does not yet prove a fully autonomous SOC agent, production incident response, or state-of-the-art intrusion detection across multiple datasets [1], [2], [5].
Security Operations Centers (SOCs) face persistent alert fatigue: analysts must process large volumes of low-priority, duplicate, and false-positive alerts while still avoiding missed attacks. Recent SOC research emphasizes that alert validity depends not only on raw detector output but also on analyst tacit knowledge, system context, and repeated local behavior [1]. This repository investigates whether a lightweight, auditable memory layer can encode entity-level alert history and reduce repeated false-positive pressure without suppressing attack alerts.
MAAT-SOC implements a four-layer memory architecture for alert triage: episodic memory stores prior alert records, semantic memory summarizes stable entity behavior, procedural memory tracks previous decisions, and working memory assembles the model-facing context. The architecture is motivated by memory-augmented learning systems, but is deliberately implemented as auditable security engineering rather than an opaque neural memory model [3]. The central design constraint is safety: benign history may reduce repeated false-positive pressure only when attack recall is preserved.
The repository also evaluates an intrusion-detection baseline on CIC-IDS2017 using leakage-aware and anti-memorization preprocessing. The pipeline removes exact duplicate feature vectors before splitting, excludes label and source-file fields from features, fits imputation only on training data, uses train-only class weights, and explicitly reports train-test feature overlap. This protocol responds to known reproducibility and leakage risks in machine-learning-based science and computer-security ML evaluation [5], [6], [8].
Results are intentionally reported with both positive and negative findings. In a deterministic 75-scenario alert-triage benchmark, combined memory preserves 100.0% attack recall and reduces false-positive rate from 48.9% to 46.7%, but high-anomaly false positives remain unresolved. On CIC-IDS2017, the leakage-aware stratified holdout reaches 99.946% balanced accuracy and 99.978% recall; however, a source-file holdout stress test on the PortScan file collapses fixed-threshold recall to 0.604% despite ROC AUC of 92.415%. This contrast is the main scientific lesson: high random-split performance is not sufficient evidence of deployment readiness [4], [6], [7].
Security Operations Center; SOC alert triage; alert fatigue; memory-augmented AI; intrusion detection; CIC-IDS2017; leakage-aware validation; anti-memorization controls; reproducible machine learning; source-file holdout; false-positive reduction.
The main purpose of MAAT-SOC is to test whether a SOC alert triage system can use persistent entity memory to reduce repeated false positives while preserving attack recall. This problem matters because false positives and alert overload are repeatedly identified as practical barriers to effective SOC operation, and analyst context is difficult to capture in conventional stateless scoring pipelines [1], [2].
The project also tests a second methodological claim: intrusion-detection evaluation should report leakage controls and stress tests, not only headline random-split accuracy. Prior work has warned that ML systems for network intrusion detection often perform well in controlled closed-world benchmarks but fail to generalize under realistic shifts, and broader ML literature shows that leakage can produce overoptimistic scientific conclusions [4], [5], [8].
| ID | Research question | Why it matters | Evaluation artifact |
|---|---|---|---|
| RQ1 | Can persistent entity memory reduce repeated false-positive pressure without suppressing attacks? | SOCs need noise reduction, but missed attacks are more costly than small FP reductions [1]. | scripts/memory_benchmark.py |
| RQ2 | How much prior history is required before memory becomes useful? | A memory system may be unsafe or unhelpful during cold start. | reports/results/history_depth.csv |
| RQ3 | Can a CIC-IDS2017 classifier exceed 90% performance under leakage-aware preprocessing? | Public IDS benchmarks are often used to support ML claims, but evaluation hygiene matters [5], [6]. | scripts/evaluate_cicids2017.py |
| RQ4 | Does high stratified-holdout performance survive a source-file holdout stress test? | Source/time shift better reflects deployment risk than a random split alone [4], [7]. | reports/public_datasets/cicids2017_portscan_holdout/ |
- Memory-augmented SOC triage architecture. The repository implements an auditable four-layer memory design for entity-level alert context rather than treating every alert as stateless. This differs from classical IDS classification work by focusing on alert triage state and repeated entity behavior, not only packet-flow classification [1], [3].
- Safety-oriented memory evaluation. The deterministic benchmark does not only ask whether false positives decrease; it also asks whether benign history suppresses attacks. This safety framing is important because alert suppression can become harmful if it hides true incidents [1], [4].
- Leakage-aware CIC-IDS2017 validation. The CIC pipeline reports duplicate removal, missing and infinite values, train-test exact feature overlap, class imbalance handling, and fixed-threshold evaluation. These controls directly address methodological pitfalls documented in ML security and reproducibility literature [5], [6], [8].
- Negative source-holdout result as a reproducibility contribution. The project reports that random-split metrics are excellent but fixed-threshold source-file generalization is weak. In a rigorous scientific paper, this negative result is not a failure to hide; it is evidence that the work is scientifically honest and identifies a concrete future research target [4], [7], [8].
SOC alert fatigue is a human-machine decision problem, not merely a classifier problem. Recent ACM Computing Surveys work frames alert fatigue as a research area involving excessive alert volume, false positives, analyst tacit knowledge, and operational context [1]. MAAT-SOC follows this framing by representing entity history explicitly and reporting false-positive effects separately from attack recall.
Analyst-in-the-loop security learning has been explored before. AI2 combined unsupervised anomaly detection, analyst feedback, and supervised learning, reporting improved detection and reduced false positives on large-scale operational logs [2]. MAAT-SOC is related but narrower: it does not claim active analyst feedback learning; instead, it studies persistent entity memory and deterministic safety constraints that can be inspected and reproduced.
Memory-augmented models such as Memory Networks introduced the idea of systems that read and write long-term memory for prediction [3]. MAAT-SOC adopts the general principle of persistent memory, but avoids black-box neural memory in the current prototype. The memory layers are stored in SQLite and are inspectable through API routes, which is important for auditability in security operations.
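As an illustration of what an inspectable SQLite-backed episodic layer could look like, the following is a minimal sketch using Python's standard `sqlite3` module. The table name, column names, and the `open_store`, `remember`, and `recall` helpers are hypothetical and are not taken from the repository; the actual schema may differ.

```python
import json
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS episodic_memory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_id TEXT NOT NULL,
    event_type TEXT NOT NULL,
    observed_at TEXT NOT NULL,   -- ISO 8601 timestamp, sortable as text
    payload TEXT NOT NULL        -- normalized alert record as JSON
);
CREATE INDEX IF NOT EXISTS idx_episodic_entity
    ON episodic_memory (entity_id, observed_at);
"""


def open_store(path=":memory:"):
    """Open (or create) the store and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def remember(conn, entity_id, event_type, observed_at, record):
    """Append one normalized alert record to an entity's history."""
    conn.execute(
        "INSERT INTO episodic_memory (entity_id, event_type, observed_at, payload) "
        "VALUES (?, ?, ?, ?)",
        (entity_id, event_type, observed_at, json.dumps(record, sort_keys=True)),
    )
    conn.commit()


def recall(conn, entity_id, limit=20):
    """Return the most recent records for one entity, newest first."""
    rows = conn.execute(
        "SELECT event_type, observed_at, payload FROM episodic_memory "
        "WHERE entity_id = ? ORDER BY observed_at DESC, id DESC LIMIT ?",
        (entity_id, limit),
    ).fetchall()
    return [
        {"event_type": t, "observed_at": ts, "record": json.loads(p)}
        for t, ts, p in rows
    ]
```

Because every record is a plain row, an analyst can audit an entity's history with ordinary SQL, which is the property the paragraph above emphasizes.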
Network intrusion detection with machine learning has long faced a closed-world problem. Sommer and Paxson argued that models trained in controlled settings may not capture the open-ended nature of operational network security [4]. MAAT-SOC therefore reports both a stratified holdout and a source-file holdout stress test, because the latter exposes threshold fragility hidden by the former.
Computer-security ML literature warns that subtle evaluation errors can undermine apparently strong results. Arp et al. identify common pitfalls in learning-based security systems and recommend careful design, evaluation, and interpretation [5]. Kapoor and Narayanan similarly argue that leakage has caused reproducibility failures across ML-based science [8]. MAAT-SOC responds by making leakage controls first-class artifacts rather than burying them in prose.
CIC-IDS2017 is a widely used public IDS dataset built to include benign traffic and common attacks over a five-day capture period [6]. However, later analysis found issues in traffic generation, flow construction, feature extraction, and labeling, with more than 20% of original traces reconstructed or relabeled in their improved processing methodology [7]. MAAT-SOC therefore treats CIC-IDS2017 as a useful benchmark, not as production ground truth.
MAAT-SOC contains two linked experimental layers. The first layer is a FastAPI SOC triage prototype with persistent entity memory and hybrid risk scoring. The second layer is a reproducible CIC-IDS2017 evaluation pipeline that demonstrates leakage-aware IDS modeling and stress testing. The two layers are connected by the research theme of reliable alert triage, but they evaluate different questions: memory behavior in alert scoring, and dataset hygiene in flow-based intrusion detection [1], [5], [6].
    flowchart LR
        A["Normalized security log"] --> B["Ingest endpoint"]
        B --> C["Anomaly detector"]
        B --> D["Entity memory store"]
        D --> E["Episodic memory"]
        D --> F["Semantic profile"]
        D --> G["Procedural state"]
        E --> H["Working context"]
        F --> H
        G --> H
        H --> I["Model interface or fallback"]
        I --> J["Hybrid score"]
        J --> K["Decision policy"]
        K --> L["SOC triage action"]
| Memory layer | Stored state | Purpose |
|---|---|---|
| Episodic | Prior normalized alert records per entity | Preserve event-level history and repeated patterns |
| Semantic | Entity profile, dominant event types, known-good hours, FP confidence | Encode stable behavioral priors |
| Procedural | Last decision, cooldown, downgrade hysteresis | Avoid oscillating decisions |
| Working | Per-request context summary | Provide bounded context to the scoring/model interface |
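
The procedural layer's role of avoiding oscillating decisions can be illustrated with a small hysteresis rule. This is a sketch under stated assumptions, not the repository's policy: the `ProceduralState` fields, the `decide` function, the decision labels, and the `alert_threshold`, `margin`, and `required_streak` values are all hypothetical. Escalation is immediate, while a downgrade requires several consecutive scores clearly below the threshold.

```python
from dataclasses import dataclass


@dataclass
class ProceduralState:
    """Hypothetical per-entity decision state."""
    last_decision: str = "monitor"
    below_streak: int = 0  # consecutive scores clearly below the threshold


def decide(state, score, alert_threshold=50.0, margin=5.0, required_streak=3):
    """Update the state for one score and return the resulting decision."""
    if score >= alert_threshold:
        # Escalation is never delayed: safety takes priority over stability.
        state.last_decision = "alert"
        state.below_streak = 0
    elif state.last_decision == "alert":
        # Downgrade only after a sustained run below (threshold - margin).
        if score < alert_threshold - margin:
            state.below_streak += 1
        else:
            state.below_streak = 0
        if state.below_streak >= required_streak:
            state.last_decision = "monitor"
            state.below_streak = 0
    else:
        state.last_decision = "monitor"
    return state.last_decision


state = ProceduralState()
decisions = [decide(state, s) for s in [60, 48, 44, 43, 42, 41]]
print(decisions)
```

With these illustrative parameters, the score of 48 does not trigger a downgrade because it sits inside the margin, and the entity returns to `monitor` only after three consecutive scores below 45.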
The memory benchmark uses 75 deterministic SOC scenarios across benign, borderline false-positive, high-anomaly false-positive, clear attack, and stealth attack groups. Four memory conditions are compared: cold start, same-type history, different-type history, and combined memory. Ground-truth labels are never used as model input; they are used only for evaluation [1], [4].
The scoring policy combines anomaly score, model risk score, entity history score, severity, false-positive pattern confidence, and semantic profile discount. This is not presented as a universal risk formula; it is an auditable experimental scoring policy used to test whether memory can reduce repeated false positives while preserving attack recall.
    adjusted_anomaly  = anomaly_score
                        - trust_discount * category_factor
                        - semantic_discount

    effective_history = history_score * (1 - fp_pattern_score * 0.80)

    base_score        = 0.25 * adjusted_anomaly
                        + 0.45 * model_risk_score
                        + 0.20 * effective_history
                        + 0.10 * severity_score

    calibrated_score  = composite_score * confidence + 50 * (1 - confidence)
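
The scoring policy above can be transcribed into a small Python function for experimentation. This is a sketch, not the repository's implementation. Two assumptions are made: `composite_score` is taken to be equal to `base_score`, because the formulas do not define it separately, and all scores are assumed to lie on a 0 to 100 scale with `confidence` in [0, 1]. The function name `hybrid_score` and the example inputs are hypothetical.

```python
def hybrid_score(anomaly_score, model_risk_score, history_score, severity_score,
                 fp_pattern_score, trust_discount, category_factor,
                 semantic_discount, confidence):
    """Return every intermediate term so the score stays auditable."""
    adjusted_anomaly = (anomaly_score
                        - trust_discount * category_factor
                        - semantic_discount)
    # A strong false-positive pattern shrinks the history contribution by up to 80%.
    effective_history = history_score * (1 - fp_pattern_score * 0.80)
    base_score = (0.25 * adjusted_anomaly
                  + 0.45 * model_risk_score
                  + 0.20 * effective_history
                  + 0.10 * severity_score)
    # Assumption: composite_score is the same quantity as base_score.
    composite_score = base_score
    # Low confidence pulls the score toward the neutral midpoint of 50.
    calibrated_score = composite_score * confidence + 50 * (1 - confidence)
    return {
        "adjusted_anomaly": adjusted_anomaly,
        "effective_history": effective_history,
        "base_score": base_score,
        "calibrated_score": calibrated_score,
    }


# Illustrative inputs only; they are not benchmark scenarios.
example = hybrid_score(
    anomaly_score=80, model_risk_score=60, history_score=40, severity_score=50,
    fp_pattern_score=0.5, trust_discount=10, category_factor=1.0,
    semantic_discount=5, confidence=0.8,
)
print(example)
```

For these inputs the base score is 53.05 and the calibrated score is 52.44. With `confidence=0` the calibrated score is exactly 50 regardless of the other inputs, which shows how the calibration step bounds the influence of an uncertain model.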
The CIC-IDS2017 experiment uses all eight local MachineLearningCSV files and is implemented with scikit-learn components for reproducibility [9]. The pipeline converts features to numeric values, replaces infinite values with missing values, removes exact duplicate feature vectors before splitting, fits median imputation only on training data, applies train-only balanced sample weights, and reports exact feature overlap between train and test. This is aligned with reproducibility guidance that warns against preprocessing and split leakage [5], [8].
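The ordering of these steps is what prevents leakage, and it can be sketched without any third-party dependency. The code below is a standard-library illustration only; the repository uses scikit-learn components, and the function names `to_number` and `prepare` are hypothetical. It assumes that the first occurrence of a duplicate feature vector is kept, which the text above does not specify. The sample weights follow the usual "balanced" formula, `n_samples / (n_classes * class_count)`.

```python
import math
import random
from collections import Counter
from statistics import median


def to_number(value):
    """Parse one cell; non-numeric, NaN, and infinite values become None."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def prepare(rows, labels, test_fraction=0.2, seed=0):
    # 1. Numeric conversion, then exact-duplicate removal BEFORE splitting.
    #    Assumption: the first occurrence of each feature vector is kept.
    seen, features, targets = set(), [], []
    for row, label in zip(rows, labels):
        vector = tuple(to_number(v) for v in row)
        if vector not in seen:
            seen.add(vector)
            features.append(vector)
            targets.append(label)

    # 2. Stratified split: shuffle indices within each class.
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(targets)):
        idx = [i for i, t in enumerate(targets) if t == cls]
        rng.shuffle(idx)
        cut = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:cut])
        train_idx.extend(idx[cut:])

    # 3. Median imputation fitted on training rows only.
    medians = []
    for col in range(len(features[0])):
        observed = [features[i][col] for i in train_idx
                    if features[i][col] is not None]
        medians.append(median(observed) if observed else 0.0)

    def impute(vector):
        return tuple(m if v is None else v for v, m in zip(vector, medians))

    # 4. Balanced sample weights computed from training labels only.
    counts = Counter(targets[i] for i in train_idx)
    weights = [len(train_idx) / (len(counts) * counts[targets[i]])
               for i in train_idx]

    # 5. Explicit train-test overlap report on the pre-imputation vectors.
    overlap = len({features[i] for i in train_idx}
                  & {features[i] for i in test_idx})
    return {
        "n_after_dedup": len(features),
        "X_train": [impute(features[i]) for i in train_idx],
        "y_train": [targets[i] for i in train_idx],
        "X_test": [impute(features[i]) for i in test_idx],
        "y_test": [targets[i] for i in test_idx],
        "train_weights": weights,
        "train_test_overlap": overlap,
    }
```

Deduplicating before the split guarantees zero exact overlap by construction, and fitting the medians after the split means no test-set statistic influences training.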
| Condition | N | TP | FP | TN | FN | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|---|---|---|---|
| C0 Cold Start | 75 | 30 | 22 | 23 | 0 | 57.7% | 100.0% | 73.2% | 48.9% |
| C1a Match History | 75 | 30 | 21 | 24 | 0 | 58.8% | 100.0% | 74.1% | 46.7% |
| C1b Mismatch History | 75 | 30 | 23 | 22 | 0 | 56.6% | 100.0% | 72.3% | 51.1% |
| C3 Combined Memory | 75 | 30 | 21 | 24 | 0 | 58.8% | 100.0% | 74.1% | 46.7% |
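The strongest positive result is safety: all 30 attack scenarios remain detected under every memory condition, including combined memory. The false-positive improvement is modest and should not be overclaimed: combined memory removes one false positive among the 45 non-attack scenarios (22 to 21), lowering FPR from 48.9% to 46.7%, while mismatched history (C1b) adds one (23, FPR 51.1%). High-anomaly false positives remain unresolved, which implies that memory should support analyst triage rather than autonomously suppress alerts [1], [4].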
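The percentages in the benchmark table follow directly from the confusion counts, which makes them easy to re-derive. The sketch below recomputes precision, recall, F1, and FPR from the TP/FP/TN/FN columns; the function name `triage_metrics` is hypothetical, and the counts are copied from the table above.

```python
def triage_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# (TP, FP, TN, FN) per memory condition, copied from the benchmark table.
CONDITIONS = {
    "C0 Cold Start": (30, 22, 23, 0),
    "C1a Match History": (30, 21, 24, 0),
    "C1b Mismatch History": (30, 23, 22, 0),
    "C3 Combined Memory": (30, 21, 24, 0),
}

for name, counts in CONDITIONS.items():
    metrics = triage_metrics(*counts)
    summary = ", ".join(f"{key}={value:.1%}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```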
| Prior entity events | FP pattern | Semantic confidence | Score | Below alert threshold? |
|---|---|---|---|---|
| 0 | 0.000 | 0.000 | 51 | No |
| 3 | 0.404 | 0.404 | 56 | No |
| 5 | 0.522 | 0.522 | 53 | No |
| 8 | 0.640 | 0.640 | 52 | No |
| 10 | 0.698 | 0.698 | 51 | No |
| 15 | 0.807 | 0.807 | 50 | No |
| 20 | 0.887 | 0.887 | 49 | Yes |
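The history-depth experiment shows a practical limitation: shallow history can increase the score because the history contribution outweighs the false-positive discount. With 3 prior entity events the score rises from 51 to 56, and it returns to the cold-start value of 51 only at 10 events. The score first drops below the alert threshold at 20 prior entity events (score 49), so in this benchmark memory becomes useful only at roughly that depth. This is a valuable negative result for scientific reporting because it defines a boundary condition rather than hiding it.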
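The boundary condition can be read off the table programmatically. The sketch below uses the (prior events, score) pairs from the history-depth table; the alert threshold of 50 is an assumption inferred from the last column (a score of 50 is not below the threshold, 49 is), and the function names are hypothetical.

```python
# (prior entity events, score) pairs copied from the history-depth table.
HISTORY_DEPTH = [(0, 51), (3, 56), (5, 53), (8, 52), (10, 51), (15, 50), (20, 49)]

# Assumption: inferred from the table, where 50 is "No" and 49 is "Yes".
ALERT_THRESHOLD = 50


def first_useful_depth(rows, threshold):
    """Smallest history depth whose score falls below the alert threshold."""
    for events, score in rows:
        if score < threshold:
            return events
    return None


def worst_depth(rows):
    """History depth at which memory inflates the score the most."""
    return max(rows, key=lambda row: row[1])


print("first depth below threshold:", first_useful_depth(HISTORY_DEPTH, ALERT_THRESHOLD))
print("highest score (events, score):", worst_depth(HISTORY_DEPTH))
```

A check of this kind could run against `reports/results/history_depth.csv` in a regression test so that a change to the scoring weights that shifts the cold-start boundary becomes visible.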
| Validation item | Result |
|---|---|
| Raw rows loaded | 2,830,743 |
| Numeric feature columns | 78 |
| Infinite values replaced with missing values | 4,376 |
| Missing values after conversion | 5,734 |
| Exact duplicate feature rows removed before split | 331,919 |
| Duplicate feature groups with conflicting labels | 719 |
| Rows after duplicate removal | 2,498,824 |
| Train/test exact feature overlap | 0 |
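
The duplicate-related rows of this audit can be produced by grouping rows on their exact feature vector. The following is a minimal standard-library sketch with a hypothetical `duplicate_audit` function, not the repository's code; it counts removed duplicates and the number of duplicate groups whose members disagree on the label. The final assertion checks that the reported row counts are arithmetically consistent.

```python
from collections import defaultdict


def duplicate_audit(vectors, labels):
    """Group rows by exact feature vector and summarize duplicates."""
    groups = defaultdict(set)
    n_rows = 0
    for vector, label in zip(vectors, labels):
        groups[tuple(vector)].add(label)
        n_rows += 1
    return {
        "rows_loaded": n_rows,
        "rows_after_dedup": len(groups),
        "duplicates_removed": n_rows - len(groups),
        # Same features, different labels: these rows cannot all be right.
        "conflicting_label_groups": sum(
            1 for seen_labels in groups.values() if len(seen_labels) > 1
        ),
    }


# Tiny illustrative input, not CIC-IDS2017 data.
audit = duplicate_audit(
    vectors=[(1, 2), (1, 2), (3, 4), (3, 4), (5, 6)],
    labels=[0, 0, 0, 1, 1],
)
print(audit)

# Consistency check on the figures reported in the table above.
assert 2_830_743 - 331_919 == 2_498_824
```

Conflicting-label groups deserve separate reporting because deduplication alone hides them: whichever copy is kept silently decides the label.
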
| Metric | Value |
|---|---|
| Accuracy | 99.924% |
| Balanced accuracy | 99.946% |
| Precision | 99.580% |
| Recall | 99.978% |
| F1 | 99.778% |
| ROC AUC | 99.999% |
| Average precision | 99.993% |
| False-positive rate | 0.087% |
| False-negative rate | 0.022% |
| Confusion matrix | TN 414,258 / FP 359 / FN 19 / TP 85,129 |
The stratified holdout exceeds the 90% target by a wide margin while reporting duplicate removal and zero train-test exact feature overlap. This result supports the claim that the implemented pipeline can produce a strong leakage-aware CIC-IDS2017 binary IDS baseline. It does not prove deployment readiness because random splits can remain easier than operational shifts [4], [6], [7].
| Metric | PortScan source holdout |
|---|---|
| Accuracy | 57.209% |
| Balanced accuracy | 50.016% |
| Precision | 44.026% |
| Recall | 0.604% |
| F1 | 1.193% |
| ROC AUC | 92.415% |
| Average precision | 82.788% |
| False-positive rate | 0.573% |
| False-negative rate | 99.396% |
| Confusion matrix | TN 121,072 / FP 698 / FN 90,270 / TP 549 |
The source-file holdout reveals a serious generalization problem: the ranking signal remains useful (ROC AUC 92.415%), but the fixed 0.5 threshold detects only 549 of the 90,819 attack flows in the held-out PortScan file (recall 0.604%). This result directly supports the methodological contribution: rigorous IDS evaluation should pair high random-split scores with shift-aware stress tests and threshold-calibration analysis [4], [5], [7], [8].
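The gap between a high ROC AUC and near-zero recall is a thresholding effect: AUC depends only on how scores are ordered, while recall depends on where the cut is placed. The sketch below reproduces the effect on a small synthetic score set (not repository data) and shows one possible form of validation-based threshold selection. The function names and the recall target are hypothetical, and a threshold chosen this way is only as good as the validation data's resemblance to the deployment source.

```python
def roc_auc(scores, labels):
    """Probability that a random attack outranks a random benign sample."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rate_at(scores, labels, threshold, target_label):
    """Fraction of samples with target_label scored at or above threshold."""
    selected = [s for s, y in zip(scores, labels) if y == target_label]
    return sum(s >= threshold for s in selected) / len(selected)


def pick_threshold(val_scores, val_labels, target_recall):
    """Highest validation threshold whose recall meets the target."""
    candidates = sorted(set(val_scores), reverse=True)
    for threshold in candidates:
        if rate_at(val_scores, val_labels, threshold, 1) >= target_recall:
            return threshold
    return candidates[-1]


# Synthetic scores: attacks rank above most benign flows but sit below 0.5.
scores = [0.30, 0.35, 0.40, 0.45, 0.55, 0.05, 0.10, 0.15, 0.20, 0.32]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

print("ROC AUC:", roc_auc(scores, labels))
print("recall at 0.5:", rate_at(scores, labels, 0.5, 1))
chosen = pick_threshold(scores, labels, target_recall=1.0)
print("chosen threshold:", chosen)
print("recall at chosen:", rate_at(scores, labels, chosen, 1))
print("FPR at chosen:", rate_at(scores, labels, chosen, 0))
```

In this toy case the AUC is 0.96 while recall at 0.5 is only 0.2; lowering the threshold to 0.30 recovers full recall at the cost of a 0.2 false-positive rate. In practice the threshold would be selected on a held-out validation split and then reported on an untouched test source.
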
| Claim | Status | Evidence |
|---|---|---|
| Persistent entity memory can preserve attack recall in the deterministic benchmark | Supported | 30/30 attack scenarios detected under combined memory |
| Persistent entity memory strongly eliminates false positives | Not supported | FPR improves only from 48.9% to 46.7% |
| Memory is immediately useful for cold-start entities | Not supported | Useful suppression appears near 20 prior events |
| CIC-IDS2017 binary classification exceeds 90% under leakage-aware stratified holdout | Supported | Balanced accuracy 99.946%, recall 99.978%, overlap 0 |
| High random-split CIC-IDS2017 performance proves real-world deployment readiness | Not supported | PortScan source holdout recall 0.604% at threshold 0.5 |
| The repository is a fully autonomous SOC agent | Not supported | It is a research prototype with API, memory, scoring, and evaluation scripts |
| Artifact | Purpose |
|---|---|
| scripts/memory_benchmark.py | Deterministic memory-condition benchmark |
| scripts/generate_research_figures.py | Rebuilds benchmark figures and result tables |
| scripts/evaluate_cicids2017.py | Leakage-audited CIC-IDS2017 evaluation |
| reports/results/benchmark_condition_metrics.csv | Benchmark condition metrics |
| reports/results/history_depth.csv | History-depth results |
| reports/public_datasets/cicids2017/cicids2017_metrics.json | Full stratified-holdout audit and metrics |
| reports/public_datasets/cicids2017_portscan_holdout/cicids2017_metrics.json | Full source-holdout audit and metrics |
Reproduce the reported results from an activated environment:

    source .venv/bin/activate

    # Unit and regression suite
    pytest -q

    # Deterministic memory benchmark
    python scripts/memory_benchmark.py

    # Publication-style benchmark figures
    python scripts/generate_research_figures.py

    # CIC-IDS2017 stratified holdout
    python scripts/evaluate_cicids2017.py \
      --data-dir /mnt/d/IDS_Hybrid_Project_v20/02_data/MachineLearningCSV/MachineLearningCVE \
      --max-iter 160 \
      --permutation-sample 10000

    # CIC-IDS2017 PortScan source-file holdout
    python scripts/evaluate_cicids2017.py \
      --data-dir /mnt/d/IDS_Hybrid_Project_v20/02_data/MachineLearningCSV/MachineLearningCVE \
      --split-mode source-holdout \
      --holdout-source Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv \
      --out-dir reports/public_datasets/cicids2017_portscan_holdout \
      --model-path reports/models/cicids2017_portscan_holdout.joblib \
      --max-iter 120 \
      --permutation-sample 0

Set up the environment and start the API:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    uvicorn main:app --host 0.0.0.0 --port 5000 --reload

- API docs: http://localhost:5000/docs
- Health check: http://localhost:5000/health

Supported log-source labels are suricata, zeek, wazuh, splunk, and generic. These are normalized API source labels, not bundled research datasets.
.
├── app/ # FastAPI SOC memory and triage prototype
├── reports/
│ ├── figures/ # Deterministic benchmark figures
│ ├── public_datasets/ # CIC-IDS2017 figures, JSON, CSV outputs
│ └── results/ # Benchmark CSV/JSON outputs
├── scripts/
│ ├── evaluate_cicids2017.py
│ ├── generate_research_figures.py
│ └── memory_benchmark.py
├── tests/ # Unit and regression tests
├── main.py
└── requirements.txt
The current memory benchmark is deterministic and scenario-based. It validates the decision policy under controlled conditions, but it does not replace analyst-confirmed production SOC labels. Future work should evaluate memory behavior on time-ordered, analyst-labeled SOC alert streams [1], [2].
The CIC-IDS2017 experiment validates a leakage-aware IDS pipeline, but the source-file holdout failure shows that threshold calibration and temporal or domain-shift validation are necessary before deployment. Future work should add validation-only threshold selection, time-based splits, additional public datasets, and confidence-based abstention [4], [5], [7].
The prototype uses SQLite and a deterministic fallback model path for local reproducibility. A production implementation would need authentication, audit logging, rate controls, analyst feedback capture, privacy review, and infrastructure hardening before use in live SOC decision-making [1], [5].
This repository is a research prototype. It should not be used to automate blocking actions in production without analyst review, calibrated thresholds, monitored drift, authenticated ingestion, and local validation. The reported results are intentionally mixed: the project reports both successful stratified performance and source-holdout failure to avoid overstating scientific claims [5], [8].
[1] P. Kearney, M. Abdelsamea, X. Schmoor, F. Shah, and I. Vickers, "Alert Fatigue in Security Operations Centres: Research Challenges and Opportunities," ACM Computing Surveys, vol. 57, no. 9, 2025. DOI: 10.1145/3723158.
[2] K. Veeramachaneni, I. Arnaldo, A. Cuesta-Infante, V. Korrapati, C. Bassias, and K. Li, "AI2: Training a Big Data Machine to Defend," IEEE BigDataSecurity/HPSC/IDS, 2016. DOI: 10.1109/BIGDATASECURITY-HPSC-IDS.2016.79.
[3] J. Weston, S. Chopra, and A. Bordes, "Memory Networks," arXiv:1410.3916, 2014. DOI: 10.48550/arXiv.1410.3916.
[4] R. Sommer and V. Paxson, "Outside the Closed World: On Using Machine Learning for Network Intrusion Detection," IEEE Symposium on Security and Privacy, 2010. DOI: 10.1109/SP.2010.25.
[5] D. Arp, E. Quiring, F. Pendlebury, A. Warnecke, F. Pierazzi, C. Wressnegger, L. Cavallaro, and K. Rieck, "Dos and Don'ts of Machine Learning in Computer Security," USENIX Security Symposium, 2022.
[6] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, "Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization," ICISSP, 2018. Dataset page: CIC-IDS2017, Canadian Institute for Cybersecurity.
[7] G. Engelen, V. Rimmer, and W. Joosen, "Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study," IEEE Security and Privacy Workshops, 2021. DOI: 10.1109/SPW53761.2021.00009.
[8] S. Kapoor and A. Narayanan, "Leakage and the Reproducibility Crisis in ML-based Science," arXiv:2207.07048, 2022. DOI: 10.48550/arXiv.2207.07048.
[9] F. Pedregosa et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
CC BY-NC 4.0 - free for research and non-commercial use.










