huynhtrungcsc/memory-augmented-agentic-ai-soc

MAAT-SOC

Memory-Augmented Alert Triage for Security Operations

A reproducible research prototype for memory-assisted SOC triage and CIC-IDS2017 validation

Author and Correspondence

Huỳnh Chí Trung
ORCID: 0009-0004-6155-3634
Email: huynhchitrung@dtu.edu.vn, huynhtrung.csc@gmail.com

Scientific Scope and Title

MAAT-SOC: Memory-Augmented Alert Triage for Security Operations

The title foregrounds the primary scientific object of the repository: memory-assisted alert triage in Security Operations Centers. Leakage control and anti-memorization checks remain essential to scientific integrity, but they are treated as methodological requirements rather than as the headline claim. This scope is more accurate than a broad "agentic AI SOC" claim: the repository demonstrates alert triage assisted by persistent entity memory and a leakage-aware CIC-IDS2017 validation pipeline. It does not yet demonstrate a fully autonomous SOC agent, production incident response, or state-of-the-art intrusion detection across multiple datasets [1], [2], [5].

Abstract

Security Operations Centers (SOCs) face persistent alert fatigue: analysts must process large volumes of low-priority, duplicate, and false-positive alerts while still avoiding missed attacks. Recent SOC research emphasizes that alert validity depends not only on raw detector output but also on analyst tacit knowledge, system context, and repeated local behavior [1]. This repository investigates whether a lightweight, auditable memory layer can encode entity-level alert history and reduce repeated false-positive pressure without suppressing attack alerts.

MAAT-SOC implements a four-layer memory architecture for alert triage: episodic memory stores prior alert records, semantic memory summarizes stable entity behavior, procedural memory tracks previous decisions, and working memory assembles the model-facing context. The architecture is motivated by memory-augmented learning systems, but is deliberately implemented as auditable security engineering rather than an opaque neural memory model [3]. The central design constraint is safety: benign history may reduce repeated false-positive pressure only when attack recall is preserved.

The repository also evaluates an intrusion-detection baseline on CIC-IDS2017 using leakage-aware and anti-memorization preprocessing. The pipeline removes exact duplicate feature vectors before splitting, excludes label and source-file fields from features, fits imputation only on training data, uses train-only class weights, and explicitly reports train-test feature overlap. This protocol responds to known reproducibility and leakage risks in machine-learning-based science and computer-security ML evaluation [5], [6], [8].

Results are intentionally reported with both positive and negative findings. In a deterministic 75-scenario alert-triage benchmark, combined memory preserves 100.0% attack recall and reduces false-positive rate from 48.9% to 46.7%, but high-anomaly false positives remain unresolved. On CIC-IDS2017, the leakage-aware stratified holdout reaches 99.946% balanced accuracy and 99.978% recall; however, a source-file holdout stress test on the PortScan file collapses fixed-threshold recall to 0.604% despite ROC AUC of 92.415%. This contrast is the main scientific lesson: high random-split performance is not sufficient evidence of deployment readiness [4], [6], [7].

Keywords

Security Operations Center; SOC alert triage; alert fatigue; memory-augmented AI; intrusion detection; CIC-IDS2017; leakage-aware validation; anti-memorization controls; reproducible machine learning; source-file holdout; false-positive reduction.

Research Problem and Purpose

The main purpose of MAAT-SOC is to test whether a SOC alert triage system can use persistent entity memory to reduce repeated false positives while preserving attack recall. This problem matters because false positives and alert overload are repeatedly identified as practical barriers to effective SOC operation, and analyst context is difficult to capture in conventional stateless scoring pipelines [1], [2].

The project also tests a second methodological claim: intrusion-detection evaluation should report leakage controls and stress tests, not only headline random-split accuracy. Prior work has warned that ML systems for network intrusion detection often perform well in controlled closed-world benchmarks but fail to generalize under realistic shifts, and broader ML literature shows that leakage can produce overoptimistic scientific conclusions [4], [5], [8].

Research Questions

| ID | Research question | Why it matters | Evaluation artifact |
|---|---|---|---|
| RQ1 | Can persistent entity memory reduce repeated false-positive pressure without suppressing attacks? | SOCs need noise reduction, but missed attacks are more costly than small FP reductions [1]. | scripts/memory_benchmark.py |
| RQ2 | How much prior history is required before memory becomes useful? | A memory system may be unsafe or unhelpful during cold start. | reports/results/history_depth.csv |
| RQ3 | Can a CIC-IDS2017 classifier exceed 90% performance under leakage-aware preprocessing? | Public IDS benchmarks are often used to support ML claims, but evaluation hygiene matters [5], [6]. | scripts/evaluate_cicids2017.py |
| RQ4 | Does high stratified-holdout performance survive a source-file holdout stress test? | Source/time shift better reflects deployment risk than a random split alone [4], [7]. | reports/public_datasets/cicids2017_portscan_holdout/ |

Claimed Contributions

  1. Memory-augmented SOC triage architecture. The repository implements an auditable four-layer memory design for entity-level alert context rather than treating every alert as stateless. This differs from classical IDS classification work by focusing on alert triage state and repeated entity behavior, not only packet-flow classification [1], [3].

  2. Safety-oriented memory evaluation. The deterministic benchmark does not only ask whether false positives decrease; it also asks whether benign history suppresses attacks. This safety framing is important because alert suppression can become harmful if it hides true incidents [1], [4].

  3. Leakage-aware CIC-IDS2017 validation. The CIC pipeline reports duplicate removal, missing and infinite values, train-test exact feature overlap, class imbalance handling, and fixed-threshold evaluation. These controls directly address methodological pitfalls documented in ML security and reproducibility literature [5], [6], [8].

  4. Negative source-holdout result as a reproducibility contribution. The project reports that random-split metrics are excellent but fixed-threshold source-file generalization is weak. The negative result is reported rather than hidden because it bounds what the stratified-holdout numbers can support and identifies a concrete future research target [4], [7], [8].

Related Work

SOC alert fatigue is a human-machine decision problem, not merely a classifier problem. Recent ACM Computing Surveys work frames alert fatigue as a research area involving excessive alert volume, false positives, analyst tacit knowledge, and operational context [1]. MAAT-SOC follows this framing by representing entity history explicitly and reporting false-positive effects separately from attack recall.

Analyst-in-the-loop security learning has been explored before. AI2 combined unsupervised anomaly detection, analyst feedback, and supervised learning, reporting improved detection and reduced false positives on large-scale operational logs [2]. MAAT-SOC is related but narrower: it does not claim active analyst feedback learning; instead, it studies persistent entity memory and deterministic safety constraints that can be inspected and reproduced.

Memory-augmented models such as Memory Networks introduced the idea of systems that read and write long-term memory for prediction [3]. MAAT-SOC adopts the general principle of persistent memory, but avoids black-box neural memory in the current prototype. The memory layers are stored in SQLite and are inspectable through API routes, which is important for auditability in security operations.

Network intrusion detection with machine learning has long faced a closed-world problem. Sommer and Paxson argued that models trained in controlled settings may not capture the open-ended nature of operational network security [4]. MAAT-SOC therefore reports both a stratified holdout and a source-file holdout stress test, because the latter exposes threshold fragility hidden by the former.

Computer-security ML literature warns that subtle evaluation errors can undermine apparently strong results. Arp et al. identify common pitfalls in learning-based security systems and recommend careful design, evaluation, and interpretation [5]. Kapoor and Narayanan similarly argue that leakage has caused reproducibility failures across ML-based science [8]. MAAT-SOC responds by making leakage controls first-class artifacts rather than burying them in prose.

CIC-IDS2017 is a widely used public IDS dataset built to include benign traffic and common attacks over a five-day capture period [6]. However, later analysis found issues in traffic generation, flow construction, feature extraction, and labeling, with more than 20% of original traces reconstructed or relabeled in their improved processing methodology [7]. MAAT-SOC therefore treats CIC-IDS2017 as a useful benchmark, not as production ground truth.

System Overview

MAAT-SOC contains two linked experimental layers. The first layer is a FastAPI SOC triage prototype with persistent entity memory and hybrid risk scoring. The second layer is a reproducible CIC-IDS2017 evaluation pipeline that demonstrates leakage-aware IDS modeling and stress testing. The two layers are connected by the research theme of reliable alert triage, but they evaluate different questions: memory behavior in alert scoring, and dataset hygiene in flow-based intrusion detection [1], [5], [6].

flowchart LR
    A["Normalized security log"] --> B["Ingest endpoint"]
    B --> C["Anomaly detector"]
    B --> D["Entity memory store"]
    D --> E["Episodic memory"]
    D --> F["Semantic profile"]
    D --> G["Procedural state"]
    E --> H["Working context"]
    F --> H
    G --> H
    H --> I["Model interface or fallback"]
    I --> J["Hybrid score"]
    J --> K["Decision policy"]
    K --> L["SOC triage action"]
| Memory layer | Stored state | Purpose |
|---|---|---|
| Episodic | Prior normalized alert records per entity | Preserve event-level history and repeated patterns |
| Semantic | Entity profile, dominant event types, known-good hours, FP confidence | Encode stable behavioral priors |
| Procedural | Last decision, cooldown, downgrade hysteresis | Avoid oscillating decisions |
| Working | Per-request context summary | Provide bounded context to the scoring/model interface |
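The relationship between the four layers can be illustrated with a small, self-contained sketch. It is not the repository's storage code: the table name `episodic_alerts`, its columns, and the helper functions are hypothetical, chosen only to show how an episodic log in SQLite could feed a semantic summary, a procedural "last decision", and a bounded working context.

```python
import sqlite3
from collections import Counter

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS episodic_alerts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity TEXT NOT NULL,
    event_type TEXT NOT NULL,
    hour INTEGER NOT NULL,
    decision TEXT NOT NULL
);
"""

def record_alert(conn, entity, event_type, hour, decision):
    """Episodic layer: append one normalized alert record."""
    conn.execute(
        "INSERT INTO episodic_alerts (entity, event_type, hour, decision) "
        "VALUES (?, ?, ?, ?)",
        (entity, event_type, hour, decision),
    )

def build_working_context(conn, entity, max_events=5):
    """Working layer: assemble a bounded per-request summary for one entity."""
    rows = conn.execute(
        "SELECT event_type, hour, decision FROM episodic_alerts "
        "WHERE entity = ? ORDER BY id DESC",
        (entity,),
    ).fetchall()
    type_counts = Counter(row[0] for row in rows)
    return {
        "entity": entity,
        "prior_events": len(rows),
        # Semantic layer: a stable summary derived from the full history.
        "dominant_event_type": type_counts.most_common(1)[0][0] if rows else None,
        # Procedural layer: the most recent decision for this entity.
        "last_decision": rows[0][2] if rows else None,
        # Only the newest few records are exposed, keeping the context bounded.
        "recent_events": [row[0] for row in rows[:max_events]],
    }

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
for _ in range(7):
    record_alert(conn, "host-01", "dns_query", 9, "dismiss")
record_alert(conn, "host-01", "ssh_login", 23, "escalate")

context = build_working_context(conn, "host-01")
print(context)
```

Because every layer is derived from plain rows, an analyst can inspect exactly which records produced a given context, which is the auditability property the design aims for.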

Methodology

The memory benchmark uses 75 deterministic SOC scenarios across benign, borderline false-positive, high-anomaly false-positive, clear attack, and stealth attack groups. Four memory conditions are compared: cold start, same-type history, different-type history, and combined memory. Ground-truth labels are never used as model input; they are used only for evaluation [1], [4].

The scoring policy combines anomaly score, model risk score, entity history score, severity, false-positive pattern confidence, and semantic profile discount. This is not presented as a universal risk formula; it is an auditable experimental scoring policy used to test whether memory can reduce repeated false positives while preserving attack recall.

adjusted_anomaly =
  anomaly_score
  - trust_discount * category_factor
  - semantic_discount

effective_history =
  history_score * (1 - fp_pattern_score * 0.80)

base_score =
  0.25 * adjusted_anomaly
  + 0.45 * model_risk_score
  + 0.20 * effective_history
  + 0.10 * severity_score

calibrated_score =
  base_score * confidence + 50 * (1 - confidence)
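The policy above can be written as a short function. This is a sketch under stated assumptions rather than the repository's implementation: it assumes all score inputs are on a 0 to 100 scale (consistent with the neutral midpoint of 50), that `fp_pattern_score` and `confidence` lie in [0, 1], and that the score being calibrated is the weighted base score. The `clamp` helper and the default argument values are additions for illustration.

```python
def clamp(value, low=0.0, high=100.0):
    """Keep a score inside the assumed 0-100 range (an added safeguard)."""
    return max(low, min(high, value))

def triage_score(anomaly_score, model_risk_score, history_score, severity_score,
                 fp_pattern_score, trust_discount=0.0, category_factor=1.0,
                 semantic_discount=0.0, confidence=1.0):
    """Hybrid triage score following the weights listed in the README."""
    # Benign trust and the semantic profile can only lower the anomaly term.
    adjusted_anomaly = clamp(
        anomaly_score - trust_discount * category_factor - semantic_discount
    )
    # A strong false-positive pattern discounts up to 80% of the history term.
    effective_history = history_score * (1 - fp_pattern_score * 0.80)
    base_score = (
        0.25 * adjusted_anomaly
        + 0.45 * model_risk_score
        + 0.20 * effective_history
        + 0.10 * severity_score
    )
    # Low confidence pulls the result toward the neutral midpoint of 50.
    return clamp(base_score * confidence + 50 * (1 - confidence))

no_pattern = triage_score(80, 60, 40, 50, fp_pattern_score=0.0)
with_pattern = triage_score(80, 60, 40, 50, fp_pattern_score=0.5)
uncertain = triage_score(80, 60, 40, 50, fp_pattern_score=0.5, confidence=0.5)
print(no_pattern, with_pattern, uncertain)
```

With these illustrative inputs the false-positive pattern lowers the score only through the history term, which carries a weight of 0.20. That limited leverage is one way to read why the benchmark's false-positive reduction is modest.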

The CIC-IDS2017 experiment uses all eight local MachineLearningCSV files and is implemented with scikit-learn components for reproducibility [9]. The pipeline converts features to numeric values, replaces infinite values with missing values, removes exact duplicate feature vectors before splitting, fits median imputation only on training data, applies train-only balanced sample weights, and reports exact feature overlap between train and test. This is aligned with reproducibility guidance that warns against preprocessing and split leakage [5], [8].
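The ordering of these steps matters more than the tooling, so the following sketch reproduces the order in plain Python on a toy dataset. It is not `scripts/evaluate_cicids2017.py`: the function names and the toy rows are hypothetical, the split is a simple shuffle rather than the stratified split used in the experiment, and the real pipeline uses scikit-learn components.

```python
import math
import random
from statistics import median

def clean(value):
    """Convert to float; unparsable, NaN, and infinite values become missing (None)."""
    try:
        x = float(value)
    except (TypeError, ValueError):
        return None
    return x if math.isfinite(x) else None

def deduplicate(rows):
    """Keep the first row per exact feature vector; the label is not part of the key."""
    seen, unique = set(), []
    for features, label in rows:
        key = tuple(features)
        if key not in seen:
            seen.add(key)
            unique.append((features, label))
    return unique

def split(rows, test_fraction=0.2, seed=0):
    """Plain shuffled split (the README's experiment stratifies by label instead)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_medians(train):
    """Per-column medians computed from training rows only."""
    medians = []
    for j in range(len(train[0][0])):
        observed = [f[j] for f, _ in train if f[j] is not None]
        medians.append(median(observed) if observed else 0.0)
    return medians

def impute(rows, medians):
    return [([m if v is None else v for v, m in zip(f, medians)], y) for f, y in rows]

def balanced_weights(train):
    """Class weight n / (k * n_c), derived from training labels only."""
    counts = {}
    for _, y in train:
        counts[y] = counts.get(y, 0) + 1
    return {y: len(train) / (len(counts) * c) for y, c in counts.items()}

def exact_overlap(train, test):
    """Number of test rows whose feature vector also appears in train."""
    train_keys = {tuple(f) for f, _ in train}
    return sum(tuple(f) in train_keys for f, _ in test)

raw = [(["1", "2.0"], 0), (["1", "2.0"], 0), (["inf", "3"], 1),
       (["4", "x"], 0), (["5", "6"], 1), (["7", "8"], 0)]
rows = [([clean(v) for v in f], y) for f, y in raw]
unique = deduplicate(rows)               # duplicates removed BEFORE splitting
train, test = split(unique)
overlap = exact_overlap(train, test)     # reported explicitly, expected to be 0
medians = fit_medians(train)             # statistics fitted on train only
train_imp, test_imp = impute(train, medians), impute(test, medians)
weights = balanced_weights(train)
print(len(unique), len(train), len(test), overlap, weights)
```

Deduplicating before the split is what guarantees zero exact overlap; deduplicating each side separately would leave identical vectors on both sides of the boundary.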

Main Results

RQ1: Memory Safety and False-Positive Reduction

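| Condition | N | TP | FP | TN | FN | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|---|---|---|---|
| C0 Cold Start | 75 | 30 | 22 | 23 | 0 | 57.7% | 100.0% | 73.2% | 48.9% |
| C1a Match History | 75 | 30 | 21 | 24 | 0 | 58.8% | 100.0% | 74.1% | 46.7% |
| C1b Mismatch History | 75 | 30 | 23 | 22 | 0 | 56.6% | 100.0% | 72.3% | 51.1% |
| C3 Combined Memory | 75 | 30 | 21 | 24 | 0 | 58.8% | 100.0% | 74.1% | 46.7% |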

The strongest positive result is safety: all 30 attack scenarios remain detected under combined memory. The false-positive improvement is modest, from 48.9% to 46.7%, and should not be overclaimed. High-anomaly false positives remain unresolved, which implies that memory should support analyst triage rather than autonomously suppress alerts [1], [4].
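The reported percentages follow directly from the confusion counts. The sketch below recomputes them for the cold-start and combined-memory conditions; the function name `triage_metrics` is illustrative and not taken from the repository.

```python
def triage_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, and false-positive rate from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# (TP, FP, TN, FN) taken from the benchmark results.
conditions = {
    "C0 Cold Start": (30, 22, 23, 0),
    "C3 Combined Memory": (30, 21, 24, 0),
}
results = {name: triage_metrics(*counts) for name, counts in conditions.items()}
for name, metrics in results.items():
    print(name, {key: f"{value:.1%}" for key, value in metrics.items()})
```

The entire improvement corresponds to one scenario moving from FP to TN out of 45 non-attack scenarios, which is why the text treats the reduction as modest.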

Figure: Benchmark condition metrics

Figure: C3 combined-memory confusion matrix

RQ2: Required History Depth

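| Prior entity events | FP pattern | Semantic confidence | Score | Below alert threshold? |
|---|---|---|---|---|
| 0 | 0.000 | 0.000 | 51 | No |
| 3 | 0.404 | 0.404 | 56 | No |
| 5 | 0.522 | 0.522 | 53 | No |
| 8 | 0.640 | 0.640 | 52 | No |
| 10 | 0.698 | 0.698 | 51 | No |
| 15 | 0.807 | 0.807 | 50 | No |
| 20 | 0.887 | 0.887 | 49 | Yes |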

The history-depth experiment shows a practical limitation: shallow history can increase the score because the history contribution outweighs the false-positive discount. With 3 prior events the score rises from 51 to 56, and it returns to the cold-start value of 51 only at 10 events. In this benchmark the score first falls below the alert threshold at 20 prior entity events (score 49). This negative result is worth reporting because it defines a boundary condition for when memory helps.

Figure: History depth curve

RQ3: CIC-IDS2017 Stratified Holdout

| Validation item | Result |
|---|---|
| Raw rows loaded | 2,830,743 |
| Numeric feature columns | 78 |
| Infinite values replaced with missing values | 4,376 |
| Missing values after conversion | 5,734 |
| Exact duplicate feature rows removed before split | 331,919 |
| Duplicate feature groups with conflicting labels | 719 |
| Rows after duplicate removal | 2,498,824 |
| Train/test exact feature overlap | 0 |

| Metric | Value |
|---|---|
| Accuracy | 99.924% |
| Balanced accuracy | 99.946% |
| Precision | 99.580% |
| Recall | 99.978% |
| F1 | 99.778% |
| ROC AUC | 99.999% |
| Average precision | 99.993% |
| False-positive rate | 0.087% |
| False-negative rate | 0.022% |
| Confusion matrix | TN 414,258 / FP 359 / FN 19 / TP 85,129 |

The stratified holdout exceeds the 90% target by a wide margin while reporting duplicate removal and zero train-test exact feature overlap. This result supports the claim that the implemented pipeline can produce a strong leakage-aware CIC-IDS2017 binary IDS baseline. It does not prove deployment readiness because random splits can remain easier than operational shifts [4], [6], [7].

Figure: CIC-IDS2017 label distribution

Figure: CIC-IDS2017 binary confusion matrix

Figure: CIC-IDS2017 ROC curve

Figure: CIC-IDS2017 precision-recall curve

Figure: CIC-IDS2017 attack-family recall

Figure: CIC-IDS2017 permutation importance

RQ4: Source-File Holdout Stress Test

| Metric | PortScan source holdout |
|---|---|
| Accuracy | 57.209% |
| Balanced accuracy | 50.016% |
| Precision | 44.026% |
| Recall | 0.604% |
| F1 | 1.193% |
| ROC AUC | 92.415% |
| Average precision | 82.788% |
| False-positive rate | 0.573% |
| False-negative rate | 99.396% |
| Confusion matrix | TN 121,072 / FP 698 / FN 90,270 / TP 549 |

The source-file holdout reveals a serious generalization problem. The ranking signal remains useful (ROC AUC 92.415%), but at the fixed 0.5 threshold only 549 of the 90,819 attack flows in the held-out PortScan file are flagged. This result directly supports the methodological contribution: rigorous IDS evaluation should pair high random-split scores with shift-aware stress tests and threshold-calibration analysis [4], [5], [7], [8].
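A high ROC AUC alongside near-zero recall is possible because AUC depends only on how scores are ordered, while recall depends on where the threshold sits. The toy scores below are invented for illustration and are not CIC-IDS2017 outputs; they show attack scores that outrank most benign scores yet mostly fall below 0.5.

```python
def roc_auc(scores, labels):
    """Probability that a random attack outranks a random benign sample (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at(scores, labels, threshold):
    """Fraction of attacks scored at or above the threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in pos) / len(pos)

# Hypothetical shifted-domain scores: attacks are ranked above most benign
# samples, but the model's probabilities are compressed below 0.5.
benign = [0.02, 0.05, 0.08, 0.10, 0.15, 0.60]
attack = [0.20, 0.25, 0.30, 0.35, 0.40, 0.55]
scores = benign + attack
labels = [0] * len(benign) + [1] * len(attack)

auc = roc_auc(scores, labels)
recall_fixed = recall_at(scores, labels, 0.5)
recall_lower = recall_at(scores, labels, 0.18)
print(f"AUC={auc:.3f} recall@0.5={recall_fixed:.3f} recall@0.18={recall_lower:.3f}")
```

In this toy case the same scores give very different recall at the two thresholds, which mirrors the pattern reported for the PortScan holdout: the ordering carries signal that a fixed 0.5 cut-off discards.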

Figure: CIC-IDS2017 PortScan holdout ROC curve

Figure: CIC-IDS2017 PortScan holdout confusion matrix

What the Project Proves and Does Not Prove

| Claim | Status | Evidence |
|---|---|---|
| Persistent entity memory can preserve attack recall in the deterministic benchmark | Supported | 30/30 attack scenarios detected under combined memory |
| Persistent entity memory strongly eliminates false positives | Not supported | FPR improves only from 48.9% to 46.7% |
| Memory is immediately useful for cold-start entities | Not supported | Useful suppression appears near 20 prior events |
| CIC-IDS2017 binary classification exceeds 90% under leakage-aware stratified holdout | Supported | Balanced accuracy 99.946%, recall 99.978%, overlap 0 |
| High random-split CIC-IDS2017 performance proves real-world deployment readiness | Not supported | PortScan source holdout recall 0.604% at threshold 0.5 |
| The repository is a fully autonomous SOC agent | Not supported | It is a research prototype with API, memory, scoring, and evaluation scripts |

Reproducibility Package

| Artifact | Purpose |
|---|---|
| scripts/memory_benchmark.py | Deterministic memory-condition benchmark |
| scripts/generate_research_figures.py | Rebuilds benchmark figures and result tables |
| scripts/evaluate_cicids2017.py | Leakage-audited CIC-IDS2017 evaluation |
| reports/results/benchmark_condition_metrics.csv | Benchmark condition metrics |
| reports/results/history_depth.csv | History-depth results |
| reports/public_datasets/cicids2017/cicids2017_metrics.json | Full stratified-holdout audit and metrics |
| reports/public_datasets/cicids2017_portscan_holdout/cicids2017_metrics.json | Full source-holdout audit and metrics |

source .venv/bin/activate

# Unit and regression suite
pytest -q

# Deterministic memory benchmark
python scripts/memory_benchmark.py

# Publication-style benchmark figures
python scripts/generate_research_figures.py

# CIC-IDS2017 stratified holdout
python scripts/evaluate_cicids2017.py \
  --data-dir /mnt/d/IDS_Hybrid_Project_v20/02_data/MachineLearningCSV/MachineLearningCVE \
  --max-iter 160 \
  --permutation-sample 10000

# CIC-IDS2017 PortScan source-file holdout
python scripts/evaluate_cicids2017.py \
  --data-dir /mnt/d/IDS_Hybrid_Project_v20/02_data/MachineLearningCSV/MachineLearningCVE \
  --split-mode source-holdout \
  --holdout-source Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv \
  --out-dir reports/public_datasets/cicids2017_portscan_holdout \
  --model-path reports/models/cicids2017_portscan_holdout.joblib \
  --max-iter 120 \
  --permutation-sample 0

API Prototype

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 5000 --reload

API docs: http://localhost:5000/docs
Health check: http://localhost:5000/health

Supported log-source labels are suricata, zeek, wazuh, splunk, and generic. These are normalized API source labels, not bundled research datasets.

Project Structure

.
├── app/                         # FastAPI SOC memory and triage prototype
├── reports/
│   ├── figures/                 # Deterministic benchmark figures
│   ├── public_datasets/         # CIC-IDS2017 figures, JSON, CSV outputs
│   └── results/                 # Benchmark CSV/JSON outputs
├── scripts/
│   ├── evaluate_cicids2017.py
│   ├── generate_research_figures.py
│   └── memory_benchmark.py
├── tests/                       # Unit and regression tests
├── main.py
└── requirements.txt

Limitations and Future Work

The current memory benchmark is deterministic and scenario-based. It validates the decision policy under controlled conditions, but it does not replace analyst-confirmed production SOC labels. Future work should evaluate memory behavior on time-ordered, analyst-labeled SOC alert streams [1], [2].

The CIC-IDS2017 experiment validates a leakage-aware IDS pipeline, but the source-file holdout failure shows that threshold calibration and temporal or domain-shift validation are necessary before deployment. Future work should add validation-only threshold selection, time-based splits, additional public datasets, and confidence-based abstention [4], [5], [7].
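Validation-only threshold selection, the first item in that list, can be sketched as follows. This is a hypothetical illustration of the proposed future work, not code from the repository: the function names, the target-recall rule, and the toy scores are all assumptions. The key property is that the threshold is chosen on a validation split and then applied to the test split unchanged.

```python
import math

def select_threshold(val_scores, val_labels, target_recall=0.95):
    """Highest threshold whose validation recall still meets the target."""
    attack_scores = sorted(
        (s for s, y in zip(val_scores, val_labels) if y == 1), reverse=True
    )
    if not attack_scores:
        raise ValueError("validation split contains no attack samples")
    # Number of attacks that must score at or above the threshold.
    needed = math.ceil(target_recall * len(attack_scores))
    needed = min(max(needed, 1), len(attack_scores))
    return attack_scores[needed - 1]

def recall_and_fpr(scores, labels, threshold):
    """Recall and false-positive rate when flagging scores >= threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    positives = sum(y == 1 for y in labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

# Toy validation and test splits (invented values, for illustration only).
val_scores = [0.90, 0.40, 0.35, 0.30, 0.10, 0.05, 0.20, 0.02]
val_labels = [1, 1, 1, 1, 1, 0, 0, 0]
test_scores = [0.45, 0.33, 0.31, 0.20, 0.10, 0.32]
test_labels = [1, 1, 1, 1, 0, 0]

threshold = select_threshold(val_scores, val_labels, target_recall=0.8)
val_recall, val_fpr = recall_and_fpr(val_scores, val_labels, threshold)
test_recall, test_fpr = recall_and_fpr(test_scores, test_labels, threshold)
fixed_recall, _ = recall_and_fpr(test_scores, test_labels, 0.5)
print(threshold, val_recall, test_recall, test_fpr, fixed_recall)
```

Selecting the threshold on the test split itself would reintroduce the kind of leakage the pipeline is designed to avoid, so the test labels are used only for the final report.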

The prototype uses SQLite and a deterministic fallback model path for local reproducibility. A production implementation would need authentication, audit logging, rate controls, analyst feedback capture, privacy review, and infrastructure hardening before use in live SOC decision-making [1], [5].

Ethical and Scientific Integrity Statement

This repository is a research prototype. It should not be used to automate blocking actions in production without analyst review, calibrated thresholds, monitored drift, authenticated ingestion, and local validation. The reported results are intentionally mixed: the project reports both successful stratified performance and source-holdout failure to avoid overstating scientific claims [5], [8].

References

[1] P. Kearney, M. Abdelsamea, X. Schmoor, F. Shah, and I. Vickers, "Alert Fatigue in Security Operations Centres: Research Challenges and Opportunities," ACM Computing Surveys, vol. 57, no. 9, 2025. DOI: 10.1145/3723158.

[2] K. Veeramachaneni, I. Arnaldo, A. Cuesta-Infante, V. Korrapati, C. Bassias, and K. Li, "AI2: Training a Big Data Machine to Defend," IEEE BigDataSecurity/HPSC/IDS, 2016. DOI: 10.1109/BIGDATASECURITY-HPSC-IDS.2016.79.

[3] J. Weston, S. Chopra, and A. Bordes, "Memory Networks," arXiv:1410.3916, 2014. DOI: 10.48550/arXiv.1410.3916.

[4] R. Sommer and V. Paxson, "Outside the Closed World: On Using Machine Learning for Network Intrusion Detection," IEEE Symposium on Security and Privacy, 2010. DOI: 10.1109/SP.2010.25.

[5] D. Arp, E. Quiring, F. Pendlebury, A. Warnecke, F. Pierazzi, C. Wressnegger, L. Cavallaro, and K. Rieck, "Dos and Don'ts of Machine Learning in Computer Security," USENIX Security Symposium, 2022.

[6] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, "Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization," ICISSP, 2018. Dataset page: CIC-IDS2017, Canadian Institute for Cybersecurity.

[7] G. Engelen, V. Rimmer, and W. Joosen, "Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study," IEEE Security and Privacy Workshops, 2021. DOI: 10.1109/SPW53761.2021.00009.

[8] S. Kapoor and A. Narayanan, "Leakage and the Reproducibility Crisis in ML-based Science," arXiv:2207.07048, 2022. DOI: 10.48550/arXiv.2207.07048.

[9] F. Pedregosa et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.

License

CC BY-NC 4.0 - free for research and non-commercial use.
