MAAT-SOC

Memory-Augmented Alert Triage for Security Operations

A reproducible research prototype for memory-assisted SOC triage and CIC-IDS2017 validation

Author and Correspondence

Huỳnh Chí Trung
ORCID: 0009-0004-6155-3634
Email: huynhchitrung@dtu.edu.vn, huynhtrung.csc@gmail.com

Scientific Scope and Title

MAAT-SOC: Memory-Augmented Alert Triage for Security Operations

The title foregrounds the primary scientific object of the repository: memory-assisted alert triage in Security Operations Centers. Leakage control and anti-memorization checks remain essential to scientific integrity, but they are treated as methodological requirements rather than as the headline of the work. This scope is more accurate than a broad "agentic AI SOC" claim. The repository demonstrates persistent entity-memory assisted alert triage and a leakage-aware CIC-IDS2017 validation pipeline; it does not yet prove a fully autonomous SOC agent, production incident response, or state-of-the-art intrusion detection across multiple datasets [1], [2], [5].

Abstract

Security Operations Centers (SOCs) face persistent alert fatigue: analysts must process large volumes of low-priority, duplicate, and false-positive alerts while still avoiding missed attacks. Recent SOC research emphasizes that alert validity depends not only on raw detector output but also on analyst tacit knowledge, system context, and repeated local behavior [1]. This repository investigates whether a lightweight, auditable memory layer can encode entity-level alert history and reduce repeated false-positive pressure without suppressing attack alerts.

MAAT-SOC implements a four-layer memory architecture for alert triage: episodic memory stores prior alert records, semantic memory summarizes stable entity behavior, procedural memory tracks previous decisions, and working memory assembles the model-facing context. The architecture is motivated by memory-augmented learning systems, but is deliberately implemented as auditable security engineering rather than an opaque neural memory model [3]. The central design constraint is safety: benign history may reduce repeated false-positive pressure only when attack recall is preserved.

The repository also evaluates an intrusion-detection baseline on CIC-IDS2017 using leakage-aware and anti-memorization preprocessing. The pipeline removes exact duplicate feature vectors before splitting, excludes label and source-file fields from features, fits imputation only on training data, uses train-only class weights, and explicitly reports train-test feature overlap. This protocol responds to known reproducibility and leakage risks in machine-learning-based science and computer-security ML evaluation [5], [6], [8].

Results are intentionally reported with both positive and negative findings. In a deterministic 75-scenario alert-triage benchmark, combined memory preserves 100.0% attack recall and reduces the false-positive rate from 48.9% to 46.7%, but high-anomaly false positives remain unresolved. On CIC-IDS2017, the leakage-aware stratified holdout reaches 99.946% balanced accuracy and 99.978% recall; however, a source-file holdout stress test on the PortScan file collapses fixed-threshold recall to 0.604% despite a ROC AUC of 92.415%. This contrast is the main scientific lesson: high random-split performance is not sufficient evidence of deployment readiness [4], [6], [7].

Keywords

Security Operations Center; SOC alert triage; alert fatigue; memory-augmented AI; intrusion detection; CIC-IDS2017; leakage-aware validation; anti-memorization controls; reproducible machine learning; source-file holdout; false-positive reduction.

Research Problem and Purpose

The main purpose of MAAT-SOC is to test whether a SOC alert triage system can use persistent entity memory to reduce repeated false positives while preserving attack recall. This problem matters because false positives and alert overload are repeatedly identified as practical barriers to effective SOC operation, and analyst context is difficult to capture in conventional stateless scoring pipelines [1], [2].

The project also tests a second methodological claim: intrusion-detection evaluation should report leakage controls and stress tests, not only headline random-split accuracy. Prior work has warned that ML systems for network intrusion detection often perform well in controlled closed-world benchmarks but fail to generalize under realistic shifts, and broader ML literature shows that leakage can produce overoptimistic scientific conclusions [4], [5], [8].

Research Questions

ID	Research question	Why it matters	Evaluation artifact
RQ1	Can persistent entity memory reduce repeated false-positive pressure without suppressing attacks?	SOCs need noise reduction, but a missed attack costs more than a small reduction in false positives saves [1].	`scripts/memory_benchmark.py`
RQ2	How much prior history is required before memory becomes useful?	A memory system may be unsafe or unhelpful during cold start.	`reports/results/history_depth.csv`
RQ3	Can a CIC-IDS2017 classifier exceed 90% performance under leakage-aware preprocessing?	Public IDS benchmarks are often used to support ML claims, but evaluation hygiene matters [5], [6].	`scripts/evaluate_cicids2017.py`
RQ4	Does high stratified-holdout performance survive a source-file holdout stress test?	Source/time shift better reflects deployment risk than a random split alone [4], [7].	`reports/public_datasets/cicids2017_portscan_holdout/`

Claimed Contributions

Memory-augmented SOC triage architecture. The repository implements an auditable four-layer memory design for entity-level alert context rather than treating every alert as stateless. This differs from classical IDS classification work by focusing on alert triage state and repeated entity behavior, not only packet-flow classification [1], [3].
Safety-oriented memory evaluation. The deterministic benchmark does not only ask whether false positives decrease; it also asks whether benign history suppresses attacks. This safety framing is important because alert suppression can become harmful if it hides true incidents [1], [4].
Leakage-aware CIC-IDS2017 validation. The CIC pipeline reports duplicate removal, missing and infinite values, train-test exact feature overlap, class imbalance handling, and fixed-threshold evaluation. These controls directly address methodological pitfalls documented in ML security and reproducibility literature [5], [6], [8].
Negative source-holdout result as a reproducibility contribution. The project reports that random-split metrics are excellent but fixed-threshold source-file generalization is weak. This negative result is reported rather than hidden because it marks the limit of what the stratified metrics can support and identifies a concrete future research target [4], [7], [8].

Related Work

SOC alert fatigue is a human-machine decision problem, not merely a classifier problem. Recent ACM Computing Surveys work frames alert fatigue as a research area involving excessive alert volume, false positives, analyst tacit knowledge, and operational context [1]. MAAT-SOC follows this framing by representing entity history explicitly and reporting false-positive effects separately from attack recall.

Analyst-in-the-loop security learning has been explored before. AI2 combined unsupervised anomaly detection, analyst feedback, and supervised learning, reporting improved detection and reduced false positives on large-scale operational logs [2]. MAAT-SOC is related but narrower: it does not claim active analyst feedback learning; instead, it studies persistent entity memory and deterministic safety constraints that can be inspected and reproduced.

Memory-augmented models such as Memory Networks introduced the idea of systems that read and write long-term memory for prediction [3]. MAAT-SOC adopts the general principle of persistent memory, but avoids black-box neural memory in the current prototype. The memory layers are stored in SQLite and are inspectable through API routes, which is important for auditability in security operations.

Network intrusion detection with machine learning has long faced a closed-world problem. Sommer and Paxson argued that models trained in controlled settings may not capture the open-ended nature of operational network security [4]. MAAT-SOC therefore reports both a stratified holdout and a source-file holdout stress test, because the latter exposes threshold fragility hidden by the former.

Computer-security ML literature warns that subtle evaluation errors can undermine apparently strong results. Arp et al. identify common pitfalls in learning-based security systems and recommend careful design, evaluation, and interpretation [5]. Kapoor and Narayanan similarly argue that leakage has caused reproducibility failures across ML-based science [8]. MAAT-SOC responds by making leakage controls first-class artifacts rather than burying them in prose.

CIC-IDS2017 is a widely used public IDS dataset built to include benign traffic and common attacks over a five-day capture period [6]. However, later analysis found issues in traffic generation, flow construction, feature extraction, and labeling, with more than 20% of original traces reconstructed or relabeled in their improved processing methodology [7]. MAAT-SOC therefore treats CIC-IDS2017 as a useful benchmark, not as production ground truth.

System Overview

MAAT-SOC contains two linked experimental layers. The first layer is a FastAPI SOC triage prototype with persistent entity memory and hybrid risk scoring. The second layer is a reproducible CIC-IDS2017 evaluation pipeline that demonstrates leakage-aware IDS modeling and stress testing. The two layers are connected by the research theme of reliable alert triage, but they evaluate different questions: memory behavior in alert scoring, and dataset hygiene in flow-based intrusion detection [1], [5], [6].

flowchart LR
    A["Normalized security log"] --> B["Ingest endpoint"]
    B --> C["Anomaly detector"]
    B --> D["Entity memory store"]
    D --> E["Episodic memory"]
    D --> F["Semantic profile"]
    D --> G["Procedural state"]
    E --> H["Working context"]
    F --> H
    G --> H
    H --> I["Model interface or fallback"]
    I --> J["Hybrid score"]
    J --> K["Decision policy"]
    K --> L["SOC triage action"]

Memory layer	Stored state	Purpose
Episodic	Prior normalized alert records per entity	Preserve event-level history and repeated patterns
Semantic	Entity profile, dominant event types, known-good hours, FP confidence	Encode stable behavioral priors
Procedural	Last decision, cooldown, downgrade hysteresis	Avoid oscillating decisions
Working	Per-request context summary	Provide bounded context to the scoring/model interface
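
The four layers in the table can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the repository's schema: the table and column names, the `fp_confidence` ratio, and the rule that treats hours of past false positives as known-good hours are all assumptions made for the example.

```python
import sqlite3
from collections import Counter

def open_store() -> sqlite3.Connection:
    """Create an in-memory store (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE episodic ("
        "entity TEXT, event_type TEXT, hour INTEGER, was_false_positive INTEGER)"
    )
    conn.execute("CREATE TABLE procedural (entity TEXT PRIMARY KEY, last_decision TEXT)")
    return conn

def record_alert(conn, entity, event_type, hour, was_false_positive):
    """Episodic layer: append one normalized alert record."""
    conn.execute(
        "INSERT INTO episodic VALUES (?, ?, ?, ?)",
        (entity, event_type, hour, int(was_false_positive)),
    )

def record_decision(conn, entity, decision):
    """Procedural layer: remember the most recent decision per entity."""
    conn.execute("INSERT OR REPLACE INTO procedural VALUES (?, ?)", (entity, decision))

def semantic_profile(conn, entity):
    """Semantic layer: summarize stable behavior from the episodic rows."""
    rows = conn.execute(
        "SELECT event_type, hour, was_false_positive FROM episodic WHERE entity = ?",
        (entity,),
    ).fetchall()
    if not rows:
        return {"dominant_event_type": None, "known_good_hours": [], "fp_confidence": 0.0}
    fp_rows = [r for r in rows if r[2]]
    return {
        "dominant_event_type": Counter(r[0] for r in rows).most_common(1)[0][0],
        # Assumption: hours in which past false positives occurred count as known-good.
        "known_good_hours": sorted({r[1] for r in fp_rows}),
        # Assumption: FP confidence is the plain fraction of past false positives.
        "fp_confidence": len(fp_rows) / len(rows),
    }

def working_context(conn, entity, max_events=5):
    """Working layer: bounded per-request summary handed to the scorer."""
    recent = conn.execute(
        "SELECT event_type, hour FROM episodic WHERE entity = ? "
        "ORDER BY rowid DESC LIMIT ?",
        (entity, max_events),
    ).fetchall()
    last = conn.execute(
        "SELECT last_decision FROM procedural WHERE entity = ?", (entity,)
    ).fetchone()
    return {
        "entity": entity,
        "recent_events": recent,
        "profile": semantic_profile(conn, entity),
        "last_decision": last[0] if last else None,
    }

if __name__ == "__main__":
    conn = open_store()
    for hour in (9, 9, 10):
        record_alert(conn, "host-01", "dns_burst", hour, True)
    record_alert(conn, "host-01", "ssh_login", 2, False)
    record_decision(conn, "host-01", "monitor")
    print(working_context(conn, "host-01", max_events=3))
```

Keeping every layer in plain tables is what makes the state inspectable: each field in the working context can be traced back to specific stored rows.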

Methodology

The memory benchmark uses 75 deterministic SOC scenarios across benign, borderline false-positive, high-anomaly false-positive, clear attack, and stealth attack groups. Four memory conditions are compared: cold start (C0), same-type history (C1a), different-type history (C1b), and combined memory (C3). Ground-truth labels are never used as model input; they are used only for evaluation [1], [4].

The scoring policy combines anomaly score, model risk score, entity history score, severity, false-positive pattern confidence, and semantic profile discount. This is not presented as a universal risk formula; it is an auditable experimental scoring policy used to test whether memory can reduce repeated false positives while preserving attack recall.

adjusted_anomaly =
  anomaly_score
  - trust_discount * category_factor
  - semantic_discount

effective_history =
  history_score * (1 - fp_pattern_score * 0.80)

base_score =
  0.25 * adjusted_anomaly
  + 0.45 * model_risk_score
  + 0.20 * effective_history
  + 0.10 * severity_score

calibrated_score =
  composite_score * confidence + 50 * (1 - confidence)
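
The policy above translates into a few lines of Python. The sketch below is illustrative only: the `ScoreInputs` container and the 0-100 score scale are assumptions, and because the text does not define `composite_score` separately, the sketch assumes it equals `base_score`.

```python
from dataclasses import dataclass

@dataclass
class ScoreInputs:
    """Hypothetical container for the terms named in the scoring policy."""
    anomaly_score: float      # assumed 0-100 scale
    model_risk_score: float   # assumed 0-100 scale
    history_score: float      # assumed 0-100 scale
    severity_score: float     # assumed 0-100 scale
    trust_discount: float
    category_factor: float
    semantic_discount: float
    fp_pattern_score: float   # 0-1
    confidence: float         # 0-1

def calibrated_score(x: ScoreInputs) -> float:
    adjusted_anomaly = (
        x.anomaly_score - x.trust_discount * x.category_factor - x.semantic_discount
    )
    # A strong false-positive pattern removes up to 80% of the history term.
    effective_history = x.history_score * (1 - x.fp_pattern_score * 0.80)
    base_score = (
        0.25 * adjusted_anomaly
        + 0.45 * x.model_risk_score
        + 0.20 * effective_history
        + 0.10 * x.severity_score
    )
    # Assumption: composite_score is the base score with no further adjustment.
    composite_score = base_score
    # Low confidence pulls the result toward the midpoint value of 50.
    return composite_score * x.confidence + 50 * (1 - x.confidence)

if __name__ == "__main__":
    example = ScoreInputs(
        anomaly_score=80, model_risk_score=40, history_score=60, severity_score=50,
        trust_discount=10, category_factor=0.5, semantic_discount=5,
        fp_pattern_score=0.5, confidence=0.8,
    )
    print(round(calibrated_score(example), 2))  # 48.16
```

With `confidence = 0` the result is 50 regardless of the other inputs, which shows how the calibration step moves low-confidence scores toward the midpoint instead of letting them drive a decision.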

The CIC-IDS2017 experiment uses all eight local MachineLearningCSV files and is implemented with scikit-learn components for reproducibility [9]. The pipeline converts features to numeric values, replaces infinite values with missing values, removes exact duplicate feature vectors before splitting, fits median imputation only on training data, applies train-only balanced sample weights, and reports exact feature overlap between train and test. This is aligned with reproducibility guidance that warns against preprocessing and split leakage [5], [8].
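
The order of these steps is what prevents leakage, so it helps to see them in sequence. The sketch below uses only the Python standard library on toy rows; it is not the code in `scripts/evaluate_cicids2017.py`. All function names are hypothetical, keeping the first label of a conflicting duplicate group is an assumption, and the weight formula is the usual balanced heuristic `n_samples / (n_classes * class_count)`.

```python
import math
import random
from collections import Counter, defaultdict
from statistics import median

def clean(value):
    """Convert to float; non-numeric, NaN, and infinite values become None (missing)."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None

def dedupe_before_split(rows):
    """Drop exact duplicate feature vectors and count groups with conflicting labels."""
    labels_seen = defaultdict(set)
    first_label = {}
    for features, label in rows:
        labels_seen[features].add(label)
        first_label.setdefault(features, label)  # assumption: keep the first label
    conflicts = sum(1 for labels in labels_seen.values() if len(labels) > 1)
    return list(first_label.items()), conflicts

def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split each label group separately so class ratios are preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

def exact_overlap(train, test):
    """Number of test rows whose feature vector also appears in train."""
    train_features = {features for features, _ in train}
    return sum(1 for features, _ in test if features in train_features)

def fit_medians(train):
    """Column medians computed from training rows only."""
    medians = []
    for col in range(len(train[0][0])):
        observed = [f[col] for f, _ in train if f[col] is not None]
        medians.append(median(observed) if observed else 0.0)
    return medians

def impute(rows, medians):
    """Fill missing values with the train-derived medians."""
    return [(tuple(m if v is None else v for v, m in zip(f, medians)), y) for f, y in rows]

def balanced_sample_weights(train):
    """Per-row weights from training-set class counts only."""
    counts = Counter(y for _, y in train)
    return [len(train) / (len(counts) * counts[y]) for _, y in train]

if __name__ == "__main__":
    raw = [(("1", "2"), 0), (("1", "2"), 1), (("3", "inf"), 1),
           (("4", "5"), 0), (("6", "7"), 0), (("8", "9"), 1)]
    rows = [(tuple(clean(v) for v in f), y) for f, y in raw]
    unique, conflicts = dedupe_before_split(rows)
    train, test = stratified_split(unique)
    medians = fit_medians(train)
    print(len(unique), conflicts, exact_overlap(train, test))
    print(impute(test, medians), balanced_sample_weights(train))
```

Deduplicating before the split is the step that makes a zero train-test overlap achievable; deduplicating afterwards would leave identical vectors on both sides.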

Main Results

RQ1: Memory Safety and False-Positive Reduction

Condition	N	TP	FP	TN	Precision	Recall	F1	FPR
C0 Cold Start	75	30	22	23	57.7%	100.0%	73.2%	48.9%
C1a Match History	75	30	21	24	58.8%	100.0%	74.1%	46.7%
C1b Mismatch History	75	30	23	22	56.6%	100.0%	72.3%	51.1%
C3 Combined Memory	75	30	21	24	58.8%	100.0%	74.1%	46.7%

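The strongest positive result is safety: all 30 attack scenarios remain detected under every memory condition, including combined memory. The false-positive improvement is modest, from 48.9% to 46.7% (one fewer false positive among the 45 non-attack scenarios), and should not be overclaimed. Mismatched history (C1b) raises the false-positive rate to 51.1%, so history of a different event type adds pressure rather than removing it. High-anomaly false positives remain unresolved, which implies that memory should support analyst triage rather than autonomously suppress alerts [1], [4].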
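
The percentages in the table follow directly from the confusion counts. The short sketch below recomputes them; the function name is hypothetical, and the false-negative count of 0 is derived from the table (TP + FP + TN already sums to 75 in every condition).

```python
def triage_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, F1, and false-positive rate from confusion counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
    }

# (TP, FP, TN, FN) per condition, taken from the RQ1 table.
CONDITIONS = {
    "C0 Cold Start": (30, 22, 23, 0),
    "C1a Match History": (30, 21, 24, 0),
    "C1b Mismatch History": (30, 23, 22, 0),
    "C3 Combined Memory": (30, 21, 24, 0),
}

if __name__ == "__main__":
    for name, counts in CONDITIONS.items():
        m = triage_metrics(*counts)
        print(f"{name}: precision {m['precision']:.1%}, recall {m['recall']:.1%}, "
              f"F1 {m['f1']:.1%}, FPR {m['fpr']:.1%}")
```

Because the benchmark has only 45 non-attack scenarios, each false positive moves the FPR by about 2.2 percentage points, which is why the gap between conditions is a single alert.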

RQ2: Required History Depth

Prior entity events	FP pattern	Semantic confidence	Score	Below alert threshold?
0	0.000	0.000	51	No
3	0.404	0.404	56	No
5	0.522	0.522	53	No
8	0.640	0.640	52	No
10	0.698	0.698	51	No
15	0.807	0.807	50	No
20	0.887	0.887	49	Yes

The history-depth experiment shows a practical limitation: shallow history can increase the score because the history contribution outweighs the false-positive discount. With 3 prior events the score rises from 51 to 56; it returns to the cold-start value only at 10 events and first falls below the alert threshold at 20 events. In this benchmark, memory therefore becomes useful only after roughly 20 prior entity events. This negative result is worth reporting because it defines a boundary condition for the memory layer.
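
The table can be read together with the `effective_history` term from the Methodology section. The sketch below computes the multiplier `1 - fp_pattern_score * 0.80` at each depth and locates the two boundary points. The alert threshold of 50 is an assumption inferred from the table (a score of 50 still alerts, 49 does not); the underlying `history_score` values are not reported, so they are not reconstructed here.

```python
# (prior entity events, FP pattern score, final score), copied from the RQ2 table.
HISTORY_DEPTH = [
    (0, 0.000, 51), (3, 0.404, 56), (5, 0.522, 53), (8, 0.640, 52),
    (10, 0.698, 51), (15, 0.807, 50), (20, 0.887, 49),
]
ALERT_THRESHOLD = 50  # assumption: scores at or above this value raise an alert

def history_multiplier(fp_pattern_score: float) -> float:
    """Fraction of history_score that survives the false-positive discount."""
    return 1 - fp_pattern_score * 0.80

cold_start_score = HISTORY_DEPTH[0][2]
# Depths where memory makes the score worse than having no history at all.
worse_than_cold_start = [d for d, _, s in HISTORY_DEPTH if s > cold_start_score]
# First depth at which the alert would no longer fire.
first_below_threshold = next(d for d, _, s in HISTORY_DEPTH if s < ALERT_THRESHOLD)

if __name__ == "__main__":
    for depth, fp, score in HISTORY_DEPTH:
        print(f"{depth:>2} events: multiplier {history_multiplier(fp):.3f}, score {score}")
    print("worse than cold start at:", worse_than_cold_start)
    print("first below threshold at:", first_below_threshold)
```

Even at 20 events roughly 29% of the history term still survives the discount, which is consistent with the slow decline in score across the table.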

RQ3: CIC-IDS2017 Stratified Holdout

Validation item	Result
Raw rows loaded	2,830,743
Numeric feature columns	78
Infinite values replaced with missing values	4,376
Missing values after conversion	5,734
Exact duplicate feature rows removed before split	331,919
Duplicate feature groups with conflicting labels	719
Rows after duplicate removal	2,498,824
Train/test exact feature overlap	0

Metric	Value
Accuracy	99.924%
Balanced accuracy	99.946%
Precision	99.580%
Recall	99.978%
F1	99.778%
ROC AUC	99.999%
Average precision	99.993%
False-positive rate	0.087%
False-negative rate	0.022%
Confusion matrix	TN 414,258 / FP 359 / FN 19 / TP 85,129

The stratified holdout exceeds the 90% target by a wide margin while reporting duplicate removal and zero train-test exact feature overlap. This result supports the claim that the implemented pipeline can produce a strong leakage-aware CIC-IDS2017 binary IDS baseline. It does not prove deployment readiness because random splits can remain easier than operational shifts [4], [6], [7].
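
The difference between a random split and a source-file holdout is easiest to see on toy data. In the sketch below every row carries the name of the file it came from; the row layout, the function names, and the placeholder file `other-days.csv` are assumptions for illustration, and the source field is used only for splitting, never as a feature.

```python
def source_holdout_split(rows, holdout_source):
    """Train on every source file except one; test on the held-out file."""
    train = [r for r in rows if r["source"] != holdout_source]
    test = [r for r in rows if r["source"] == holdout_source]
    if not test:
        raise ValueError(f"no rows found for source {holdout_source!r}")
    return train, test

def unseen_attack_types(train, test):
    """Attack types that appear in the test file but never in training."""
    seen = {r["attack_type"] for r in train}
    return sorted({r["attack_type"] for r in test} - seen - {"BENIGN"})

if __name__ == "__main__":
    portscan_file = "Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv"
    toy_rows = [
        {"source": "other-days.csv", "attack_type": "BENIGN"},
        {"source": "other-days.csv", "attack_type": "DoS"},
        {"source": "other-days.csv", "attack_type": "BENIGN"},
        {"source": portscan_file, "attack_type": "BENIGN"},
        {"source": portscan_file, "attack_type": "PortScan"},
    ]
    train, test = source_holdout_split(toy_rows, portscan_file)
    print(len(train), len(test), unseen_attack_types(train, test))
```

A stratified random split would place some rows of every attack type on both sides, so the model is never asked about an attack family it has not seen. Holding out a whole file removes that guarantee.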

RQ4: Source-File Holdout Stress Test

Metric	PortScan source holdout
Accuracy	57.209%
Balanced accuracy	50.016%
Precision	44.026%
Recall	0.604%
F1	1.193%
ROC AUC	92.415%
Average precision	82.788%
False-positive rate	0.573%
False-negative rate	99.396%
Confusion matrix	TN 121,072 / FP 698 / FN 90,270 / TP 549

The source-file holdout reveals a serious generalization problem: the ranking signal remains useful (ROC AUC 92.415%), but at the fixed 0.5 threshold only 549 of the 90,819 attack flows are flagged, so recall collapses to 0.604%. This result directly supports the methodological contribution: rigorous IDS evaluation should pair high random-split scores with shift-aware stress tests and threshold-calibration analysis [4], [5], [7], [8].
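
A high ROC AUC and near-zero recall can coexist because AUC depends only on how scores are ordered, while recall depends on where the threshold sits. The sketch below shows this on invented toy scores (not the CIC-IDS2017 outputs); the helper names are hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def rate_at(scores, threshold):
    """Fraction of scores at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Toy scores: attacks rank above almost all benign flows but mostly stay below 0.5.
ATTACK_SCORES = [0.30, 0.35, 0.40, 0.45, 0.55]
BENIGN_SCORES = [0.05, 0.10, 0.15, 0.20, 0.32]

if __name__ == "__main__":
    print("ROC AUC:", roc_auc(ATTACK_SCORES, BENIGN_SCORES))       # 0.96
    print("recall at 0.50:", rate_at(ATTACK_SCORES, 0.50))         # 0.2
    print("recall at 0.25:", rate_at(ATTACK_SCORES, 0.25))         # 1.0
    print("FPR at 0.25:", rate_at(BENIGN_SCORES, 0.25))            # 0.2
```

In this toy case the same scores give 20% recall at a threshold of 0.5 and 100% recall at 0.25, which is why the threshold itself has to be evaluated under shift and not only the ranking.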

What the Project Proves and Does Not Prove

Claim	Status	Evidence
Persistent entity memory can preserve attack recall in the deterministic benchmark	Supported	30/30 attack scenarios detected under combined memory
Persistent entity memory strongly eliminates false positives	Not supported	FPR improves only from 48.9% to 46.7%
Memory is immediately useful for cold-start entities	Not supported	Useful suppression appears near 20 prior events
CIC-IDS2017 binary classification exceeds 90% under leakage-aware stratified holdout	Supported	Balanced accuracy 99.946%, recall 99.978%, overlap 0
High random-split CIC-IDS2017 performance proves real-world deployment readiness	Not supported	PortScan source holdout recall 0.604% at threshold 0.5
The repository is a fully autonomous SOC agent	Not supported	It is a research prototype with API, memory, scoring, and evaluation scripts

Reproducibility Package

Artifact	Purpose
`scripts/memory_benchmark.py`	Deterministic memory-condition benchmark
`scripts/generate_research_figures.py`	Rebuilds benchmark figures and result tables
`scripts/evaluate_cicids2017.py`	Leakage-audited CIC-IDS2017 evaluation
`reports/results/benchmark_condition_metrics.csv`	Benchmark condition metrics
`reports/results/history_depth.csv`	History-depth results
`reports/public_datasets/cicids2017/cicids2017_metrics.json`	Full stratified-holdout audit and metrics
`reports/public_datasets/cicids2017_portscan_holdout/cicids2017_metrics.json`	Full source-holdout audit and metrics

# Activate the virtual environment (created with the commands under API Prototype)
source .venv/bin/activate

# Unit and regression suite
pytest -q

# Deterministic memory benchmark
python scripts/memory_benchmark.py

# Publication-style benchmark figures
python scripts/generate_research_figures.py

# CIC-IDS2017 stratified holdout
# (replace the --data-dir value with the local path to your CIC-IDS2017 MachineLearningCSV files)
python scripts/evaluate_cicids2017.py \
  --data-dir /mnt/d/IDS_Hybrid_Project_v20/02_data/MachineLearningCSV/MachineLearningCVE \
  --max-iter 160 \
  --permutation-sample 10000

# CIC-IDS2017 PortScan source-file holdout
python scripts/evaluate_cicids2017.py \
  --data-dir /mnt/d/IDS_Hybrid_Project_v20/02_data/MachineLearningCSV/MachineLearningCVE \
  --split-mode source-holdout \
  --holdout-source Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv \
  --out-dir reports/public_datasets/cicids2017_portscan_holdout \
  --model-path reports/models/cicids2017_portscan_holdout.joblib \
  --max-iter 120 \
  --permutation-sample 0

API Prototype

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 5000 --reload

API docs: http://localhost:5000/docs
Health check: http://localhost:5000/health

Supported log-source labels are suricata, zeek, wazuh, splunk, and generic. These are normalized API source labels, not bundled research datasets.
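
Once the server is running, the health route can be probed from Python with the standard library alone. This is a minimal sketch that assumes the `/health` route answers with HTTP 200 when the service is up; the response body is not documented here, so it is not parsed. The `check_health` helper and its `opener` parameter are hypothetical and exist only to make the function testable without a live server.

```python
import urllib.request

def check_health(base_url="http://localhost:5000", opener=urllib.request.urlopen, timeout=5):
    """Return True if GET {base_url}/health answers with HTTP 200, else False."""
    try:
        with opener(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses: covers refused
        # connections, timeouts, and non-2xx responses.
        return False

if __name__ == "__main__":
    print("service healthy:", check_health())
```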

Project Structure

.
├── app/                         # FastAPI SOC memory and triage prototype
├── reports/
│   ├── figures/                 # Deterministic benchmark figures
│   ├── public_datasets/         # CIC-IDS2017 figures, JSON, CSV outputs
│   └── results/                 # Benchmark CSV/JSON outputs
├── scripts/
│   ├── evaluate_cicids2017.py
│   ├── generate_research_figures.py
│   └── memory_benchmark.py
├── tests/                       # Unit and regression tests
├── main.py
└── requirements.txt

Limitations and Future Work

The current memory benchmark is deterministic and scenario-based. It validates the decision policy under controlled conditions, but it does not replace analyst-confirmed production SOC labels. Future work should evaluate memory behavior on time-ordered, analyst-labeled SOC alert streams [1], [2].

The CIC-IDS2017 experiment validates a leakage-aware IDS pipeline, but the source-file holdout failure shows that threshold calibration and temporal or domain-shift validation are necessary before deployment. Future work should add validation-only threshold selection, time-based splits, additional public datasets, and confidence-based abstention [4], [5], [7].
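
Validation-only threshold selection, the first item on that list, could look like the sketch below. It is a proposal for future work, not something the repository currently implements: the function names, the target-recall criterion, and the toy scores are all assumptions. The key property is that the test set never influences the chosen threshold.

```python
import math

def pick_threshold(val_scores, val_labels, target_recall=0.95):
    """Highest threshold whose recall on the validation set meets the target."""
    positives = sorted(
        (s for s, y in zip(val_scores, val_labels) if y == 1), reverse=True
    )
    if not positives:
        raise ValueError("validation set contains no positive examples")
    # Number of positives that must score at or above the threshold.
    needed = max(1, math.ceil(target_recall * len(positives) - 1e-9))
    return positives[needed - 1]

def evaluate(scores, labels, threshold):
    """Recall and false-positive rate at a fixed threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    recall = sum(s >= threshold for s in pos) / len(pos) if pos else 0.0
    fpr = sum(s >= threshold for s in neg) / len(neg) if neg else 0.0
    return recall, fpr

if __name__ == "__main__":
    val_scores = [0.9, 0.7, 0.4, 0.35, 0.2, 0.3, 0.1, 0.05]
    val_labels = [1, 1, 1, 1, 1, 0, 0, 0]
    threshold = pick_threshold(val_scores, val_labels, target_recall=0.8)

    # The threshold is frozen before the test set is scored.
    test_scores = [0.6, 0.36, 0.1, 0.5, 0.2]
    test_labels = [1, 1, 1, 0, 0]
    print(threshold, evaluate(test_scores, test_labels, threshold))
```

If the validation data come from the same sources as training, this procedure can still fail on a held-out source, so it would need to be combined with the time-based or source-based splits mentioned above.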

The prototype uses SQLite and a deterministic fallback model path for local reproducibility. A production implementation would need authentication, audit logging, rate controls, analyst feedback capture, privacy review, and infrastructure hardening before use in live SOC decision-making [1], [5].

Ethical and Scientific Integrity Statement

This repository is a research prototype. It should not be used to automate blocking actions in production without analyst review, calibrated thresholds, monitored drift, authenticated ingestion, and local validation. The reported results are intentionally mixed: the project reports both successful stratified performance and source-holdout failure to avoid overstating scientific claims [5], [8].

References

[1] P. Kearney, M. Abdelsamea, X. Schmoor, F. Shah, and I. Vickers, "Alert Fatigue in Security Operations Centres: Research Challenges and Opportunities," ACM Computing Surveys, vol. 57, no. 9, 2025. DOI: 10.1145/3723158.

[2] K. Veeramachaneni, I. Arnaldo, A. Cuesta-Infante, V. Korrapati, C. Bassias, and K. Li, "AI2: Training a Big Data Machine to Defend," IEEE BigDataSecurity/HPSC/IDS, 2016. DOI: 10.1109/BIGDATASECURITY-HPSC-IDS.2016.79.

[3] J. Weston, S. Chopra, and A. Bordes, "Memory Networks," arXiv:1410.3916, 2014. DOI: 10.48550/arXiv.1410.3916.

[4] R. Sommer and V. Paxson, "Outside the Closed World: On Using Machine Learning for Network Intrusion Detection," IEEE Symposium on Security and Privacy, 2010. DOI: 10.1109/SP.2010.25.

[5] D. Arp, E. Quiring, F. Pendlebury, A. Warnecke, F. Pierazzi, C. Wressnegger, L. Cavallaro, and K. Rieck, "Dos and Don'ts of Machine Learning in Computer Security," USENIX Security Symposium, 2022.

[6] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, "Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization," ICISSP, 2018. Dataset page: CIC-IDS2017, Canadian Institute for Cybersecurity.

[7] G. Engelen, V. Rimmer, and W. Joosen, "Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study," IEEE Security and Privacy Workshops, 2021. DOI: 10.1109/SPW53761.2021.00009.

[8] S. Kapoor and A. Narayanan, "Leakage and the Reproducibility Crisis in ML-based Science," arXiv:2207.07048, 2022. DOI: 10.48550/arXiv.2207.07048.

[9] F. Pedregosa et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.

License

CC BY-NC 4.0 - free for research and non-commercial use.
