TL;DR
Autonomous/agentic systems need safety layers that can flag abnormal behavior early, even when we don’t have perfect labels for attacks. In this project, I built an anomaly detector over system-event logs and learned a key lesson that transfers directly to agentic AI in retail investing:
- A detector calibrated for trust (low false positives) in normal operations can fail under incident conditions where the base rate shifts.
- The fix isn’t “a better model” alone — it’s a resilience-aware alerting policy (adaptive alert-rate / incident mode).
- Parsing structured arguments (args) boosts detection of high-severity behavior, but can increase noise in ops-like settings without proper calibration.
- Error analysis showed false positives cluster in a small set of benign system services (e.g., systemd-*, management agents), motivating policy layers (allowlists) and context-aware baselines.
What you’ll get by the end
By the end of this post you’ll have a template for building a practical safety layer:
- A reproducible pipeline for anomaly detection on event logs
- Two operating modes:
  - Trust mode: conservative alerting for day-to-day operations
  - Incident mode: adaptive alerting to maintain recall under attack-heavy conditions
- A concrete methodology for evaluating:
  - ranking quality (PR-AUC / ROC-AUC)
  - operational performance (precision/recall at a chosen alert budget)
  - severity behavior (does the detector rank “evil” events higher without seeing them in training?)
- A clear mapping of these ideas to agentic AI for retail investing (tool calls, order placement, account actions)

Problem + why the data forces a “trust vs resilience” design
In security monitoring, the hardest part isn’t building a model — it’s deploying something people can trust. If a detector fires constantly during normal operations, analysts and users stop listening. But if you tune it to be quiet, it can fail when the environment shifts into an incident.
This dataset makes that tension visible. In the training and validation splits, suspicious events (sus=1) are rare (≈0.17% in train and ≈0.4% in validation), and there are no evil=1 examples at all. In contrast, the test split is attack-heavy: most events are suspicious and a large portion are labeled evil=1. That means a classic supervised “evil detector” can’t be trained from train/val — the realistic approach is one-class / unsupervised anomaly detection trained on mostly normal behavior, with an operational policy that adapts when the base rate shifts.
This is exactly the pattern we’ll see in agentic systems (including agentic retail investing):
- Trust mode (normal): keep alerts low and interpretable.
- Incident mode (elevated risk): increase monitoring budget and adapt alert thresholds/rates to maintain recall.
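To make the two modes concrete, here is a minimal sketch of how a single anomaly score can drive either behavior. The function name and the default quantile targets are illustrative assumptions, not the exact policy code used later in the post.

import numpy as np

def alert_threshold(normal_val_scores, mode="trust",
                    trust_fpr=0.01, incident_alert_rate=0.10, live_scores=None):
    # Trust mode: cap the false-positive rate on known-normal validation scores.
    if mode == "trust":
        return float(np.quantile(normal_val_scores, 1 - trust_fpr))
    # Incident mode: spend a larger alert budget on the live score distribution.
    return float(np.quantile(live_scores, 1 - incident_alert_rate))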
Dataset overview
The dataset consists of system event records captured from hosts over time. Each row describes a single event (e.g., close, socket, clone) along with the process/user context and a structured argument list. This structure is similar to what an agentic system produces when it interacts with tools: each tool call has a timestamp, an identity, an action type, and parameters.
Key columns
- timestamp: when the event occurred (relative time)
- processId / threadId / parentProcessId: process lineage
- userId: user context (useful for separating system vs user activity)
- mountNamespace: namespace context (useful in containerized environments)
- processName / hostName: human-readable process identity and host
- eventId / eventName: event type
- returnValue: outcome signal (e.g., error/non-error)
- argsNum: number of arguments associated with the event
- args: a structured list of arguments (stored as a string representation of a list of dicts)
- sus: label indicating suspicious activity (broad anomaly)
- evil: label indicating malicious activity (high severity)
Label behavior across splits (important)
A key property of this dataset is that evil=1 is absent from training and validation, but present heavily in the test split. That means we cannot train a supervised “evil classifier” using the provided training data. Instead, we treat this as a realistic security setting: train on mostly normal data, tune on rare suspicious events, and then check whether the anomaly score generalizes to high-severity behavior.
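A quick check of the label base rates per split makes the shift explicit. This snippet assumes the three splits are already loaded into the dfs dictionary used throughout the post.

# Label base rates per split (fraction of sus=1 and evil=1 events)
for name in ["train", "val", "test"]:
    df = dfs[name]
    print(f"{name}: n={len(df):,}  sus rate={df['sus'].mean():.4f}  evil rate={df['evil'].mean():.4f}")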

Figure 1 — The dataset exhibits strong base-rate shift: validation contains rare anomalies and no evil events, while test is attack-heavy. This motivates one-class anomaly detection plus adaptive alerting policies (trust mode vs incident mode).
import pandas as pd
import ast

# Load splits (already loaded earlier in our notebook as dfs)
train_df = dfs["train"]
val_df = dfs["val"]
test_df = dfs["test"]

print("Columns:", list(train_df.columns))
print("\nTrain head (selected columns):")
display(train_df[["timestamp","processName","eventName","hostName","argsNum","returnValue","sus","evil"]].head(5))

def parse_args_cell(s):
    """
    args is stored as a string that looks like a Python list of dicts.
    We parse it safely using ast.literal_eval.
    """
    if pd.isna(s):
        return []
    s = str(s).strip()
    if s == "" or s == "[]":
        return []
    try:
        v = ast.literal_eval(s)
        return v if isinstance(v, list) else []
    except Exception:
        return []

print("\nParsed args examples:")
for i in range(3):
    raw = train_df["args"].iloc[i]
    parsed = parse_args_cell(raw)
    print(f"\nRow {i} raw args:", str(raw)[:120], "...")
    print("Row", i, "parsed args:", parsed)
What the args field looks like
The most information-dense column is args, which stores event parameters as a list of objects with fields like:
- name (e.g., fd, option, flags)
- type (e.g., int, unsigned long)
- value (string or numeric)
Here are real examples from the dataset after parsing:
Example A — multi-argument event with a high-signal token
[
{'name': 'option', 'type': 'int', 'value': 'PR_SET_NAME'},
{'name': 'arg2', 'type': 'unsigned long', 'value': 94819493392601},
{'name': 'arg3', 'type': 'unsigned long', 'value': 94819493392601},
{'name': 'arg4', 'type': 'unsigned long', 'value': 140662171848350},
{'name': 'arg5', 'type': 'unsigned long', 'value': 140662156379904}
]
Example B — minimal single-argument event
[{'name': 'fd', 'type': 'int', 'value': 19}]
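As a small aside, once args has been parsed (e.g., with parse_args_cell above), pulling out a named argument is straightforward. The helper name below is mine and is only meant for illustration.

def get_arg_value(parsed_args, name, default=None):
    # Return the 'value' of the first argument whose 'name' matches.
    for arg in parsed_args:
        if isinstance(arg, dict) and arg.get("name") == name:
            return arg.get("value")
    return default

# With the Example A list above stored in example_a_args:
# get_arg_value(example_a_args, "option")  -> 'PR_SET_NAME'
# get_arg_value(example_a_args, "fd", -1)  -> -1 (no 'fd' argument present)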
4. Threat model: what are we trying to detect?
This project treats anomaly detection as a safety layer over system activity. The goal is not to classify every possible attack type, but to surface events that deviate from normal behavior in ways that warrant investigation.
In this dataset, most activity is file and process related. The top event types in training include:
-
- close (218,080), openat (209,730), security_file_open (148,611)
- fstat (80,071), stat (41,931), access (14,383)
- plus smaller counts of socket, directory listing (getdents64), and capability checks (cap_capable)
These are exactly the kinds of “everyday” system calls that dominate normal operations—meaning an effective safety layer must detect anomalies within a sea of routine behavior.
On the process side, training is dominated by a few processes, especially:
- ps (406,313) and systemd-udevd (189,292)
- sshd (91,762)
- plus systemd-* services and agents like amazon-ssm-agen
This matters because anomaly detectors often end up learning “common service behavior,” and will tend to flag rare process behaviors as suspicious—sometimes correctly, sometimes as false positives.
4.1 What counts as an anomaly?
We treat an event as anomalous if it deviates from expected patterns in any of these ways:
A) Identity/context anomalies
- unexpected processes performing certain event types
- unusual activity concentrated on a particular host
- unusual user contexts (userId) for a given process
B) Action anomalies
Even common events can be suspicious depending on context. For example, the top suspicious event types in validation include:
- openat (198), lstat (195), close (149)
- stat, fstat, directory listing (getdents64)
- plus destructive operations such as unlink / security_inode_unlink
So “anomalous” here is not “rare eventName”—it’s unusual combinations of (process, host, event type, args) and the surrounding context.
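One simple way to operationalize “unusual combination” is a frequency baseline over context pairs. The sketch below scores events by how rare their (processName, eventName) pair was in training; it is an illustrative baseline for building intuition, not the Isolation Forest model used in the experiments.

import numpy as np

# Frequency of each (processName, eventName) pair in training
pair_counts = dfs["train"].groupby(["processName", "eventName"]).size()
pair_freq = pair_counts / pair_counts.sum()

def pair_rarity(df, eps=1e-9):
    # Higher value = rarer combination relative to the training baseline.
    idx = list(zip(df["processName"], df["eventName"]))
    freq = pair_freq.reindex(idx).fillna(0.0).to_numpy()
    return -np.log(freq + eps)

# e.g. rarity = pair_rarity(dfs["val"])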
C) Argument anomalies (args)
The args field is semantically rich. It can contain:
- high-signal tokens (e.g., PR_SET_NAME)
- flags/options and numeric argument structure
- patterns that can distinguish benign from suspicious actions
A safety layer that ignores args will often miss the strongest indicators of unusual behavior.
D) Outcome anomalies (returnValue)
Non-zero return codes (or unusual return patterns) can indicate failed access attempts, probing, or misconfiguration. In isolation, errors aren’t malicious—but in combination with other signals, they can raise risk.
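A quick exploratory aggregation shows how this signal looks in practice. It assumes, per the column description above, that a non-zero returnValue behaves like an error-style outcome; this is exploratory code, not part of the detector.

# Share of events with a non-zero return value, per process (train split)
err_rate = (
    dfs["train"]
    .assign(is_err=lambda d: d["returnValue"] != 0)
    .groupby("processName")["is_err"]
    .mean()
    .sort_values(ascending=False)
)
display(err_rate.head(10))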
4.2 Attacker and operational assumptions
This dataset reflects two realistic constraints common in cybersecurity:
- Incomplete attack labels: training/validation do not include evil=1, so we cannot rely on supervised malicious labels.
- Distribution shift / incident conditions: what “normal” looks like can change rapidly. A threshold tuned for rare anomalies during normal operations can fail when the environment becomes attack-heavy. Resilience requires adapting the alerting policy (e.g., by increasing the alert budget or switching to incident-mode behavior).
4.3 Why this is a “safety layer” for agentic systems
The exact same threat model applies to agentic AI—especially in retail investing—if we map:
- processName / userId ↔ agent identity + user account
- eventName ↔ tool/function invoked (place order, cancel, fetch news, rebalance)
- args ↔ tool arguments / payload (ticker, quantity, price limits, endpoint responses)
- returnValue ↔ tool response / error codes
In both cases, the safety layer must:
- detect abnormal action-argument combinations
- handle label scarcity
- remain robust under base-rate shift
- and support trust via manageable alert volume and interpretable triage
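To make this mapping concrete, here is a minimal sketch that converts a hypothetical agent tool call into the same event-log schema the detector consumes. The field names (place_order, ticker, qty, limit_price) and the helper itself are invented for illustration.

import time

def tool_call_to_event(agent_id, user_id, tool_name, tool_args, return_code):
    # Map an agent tool call onto the event-log schema used in this post.
    return {
        "timestamp": time.time(),
        "processName": agent_id,     # ~ agent identity
        "userId": user_id,           # ~ user account
        "eventName": tool_name,      # ~ tool/function invoked
        "args": [{"name": k, "type": type(v).__name__, "value": v} for k, v in tool_args.items()],
        "argsNum": len(tool_args),
        "returnValue": return_code,  # ~ tool response / error code
    }

# Hypothetical example:
event = tool_call_to_event(
    agent_id="portfolio-agent", user_id=1001, tool_name="place_order",
    tool_args={"ticker": "ABC", "qty": 10, "limit_price": 52.3},
    return_code=0,
)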
# Top event and process types (train)
display(dfs["train"]["eventName"].value_counts().head(10))
display(dfs["train"]["processName"].value_counts().head(10))
# Suspicious events distribution (validation)
val_sus = dfs["val"][dfs["val"]["sus"]==1]
display(val_sus["processName"].value_counts().head(10))
display(val_sus["eventName"].value_counts().head(10))
5. Feature engineering: baseline vs args-aware features
This section explains why the model behaves differently in “trust mode” vs “incident mode.” The short version: the args field contains the richest signal, but leveraging it can change the alerting behavior depending on how you calibrate thresholds.
5.1 Design goal: features that support a safety layer
For a safety layer, we care about two things:
- Operational reliability: fast, stable features that work at scale
- Semantic signal: features that capture what the event actually did (often in args)
We therefore build two feature sets:
- Baseline (lightweight): stable, cheap features suitable for trust-mode calibration
- Args-aware (enhanced): adds structure extracted from args to better detect high-severity behavior during incidents
5.2 Baseline features (lightweight, trust-mode friendly)
Baseline features are mostly numeric/contextual:
- timestamp
- processId, threadId, parentProcessId, userId, mountNamespace
- eventId, argsNum, returnValue
- simple proxies:
  - args_len (length of the args string)
  - stack_len (approximate length of the stackAddresses list)
Why this helps: It provides a stable signal without overfitting to rare argument patterns. In practice, this can produce a detector that’s easier to calibrate for low false positives during normal operations.
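A minimal sketch of how the two proxy features can be added, assuming the raw frames in dfs also carry a stackAddresses column stored as a string; the helper name and the exact column list are mine, chosen to line up with the training code later in the post.

def add_baseline_features(df):
    # Lightweight proxies computed directly on the raw frame (no args parsing).
    df = df.copy()
    df["args_len"] = df["args"].astype(str).str.len()
    # Rough proxy for the number of entries in the stackAddresses list.
    df["stack_len"] = df["stackAddresses"].astype(str).str.count(",") + 1
    return df

# Hypothetical usage: add the proxies to each split, then select the baseline columns.
# for name in dfs: dfs[name] = add_baseline_features(dfs[name])
baseline_cols = ["timestamp", "processId", "threadId", "parentProcessId", "userId",
                 "mountNamespace", "eventId", "argsNum", "returnValue",
                 "args_len", "stack_len"]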
5.3 Args-aware features (enhanced, incident-mode strength)
The enhanced feature set extracts structured signals from args:
- args_count: number of arguments
- args_unique_names, args_unique_types: diversity measures
- counts by type/value structure:
  - args_int_cnt, args_ulong_cnt, args_str_cnt, args_num_cnt
- robust numeric summaries (after fixing NaN/Inf issues)
- high-signal flags:
  - args_has_pathlike, args_has_ip, args_has_url
  - token presence like args_has_pr_set_name
Why this helps: During attack-heavy conditions, malicious events often differ strongly in argument structure and tokens (e.g., unusual flags, parameters, payload-like strings). These features can sharply increase detection power for severe behavior (including evil)—at the cost of potentially increasing noise in normal operations if not carefully calibrated.
This is the exact feature construction approach used in our experiments.
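Here is a condensed sketch of how features like these can be derived from one parsed args list (reusing parse_args_cell from earlier). The real feature construction in the experiments is richer (more token flags and numeric summaries), so treat this as illustrative.

import re

def args_features(parsed_args):
    # Structured signals from a single parsed args list (condensed version).
    names = [a.get("name", "") for a in parsed_args if isinstance(a, dict)]
    types = [a.get("type", "") for a in parsed_args if isinstance(a, dict)]
    values = [str(a.get("value", "")) for a in parsed_args if isinstance(a, dict)]
    joined = " ".join(values)
    return {
        "args_count": len(parsed_args),
        "args_unique_names": len(set(names)),
        "args_unique_types": len(set(types)),
        "args_int_cnt": sum(t == "int" for t in types),
        "args_ulong_cnt": sum(t == "unsigned long" for t in types),
        "args_has_pathlike": int(any(v.startswith("/") for v in values)),
        "args_has_ip": int(bool(re.search(r"\b\d{1,3}(\.\d{1,3}){3}\b", joined))),
        "args_has_url": int("http://" in joined or "https://" in joined),
        "args_has_pr_set_name": int("PR_SET_NAME" in values),
    }

# Hypothetical usage: expand per-row dicts into columns and append to the baseline set.
# feats = pd.DataFrame(list(dfs["train"]["args"].map(parse_args_cell).map(args_features)))
# enhanced_cols = baseline_cols + list(feats.columns)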
6 — Code: training + evaluation (trust mode + incident mode)
This block is blog-ready and mirrors the workflow used in our experiments.
(You can run it for baseline_cols or enhanced_cols.)
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, average_precision_score, roc_auc_score, precision_recall_fscore_support

def fit_iforest_and_score(feature_cols):
    train = dfs["train"]
    val = dfs["val"]
    test = dfs["test"]
    # train on normals only
    X_train_norm = train.loc[train["sus"]==0, feature_cols].replace([np.inf,-np.inf], np.nan).fillna(0.0)
    X_val = val[feature_cols].replace([np.inf,-np.inf], np.nan).fillna(0.0)
    X_test = test[feature_cols].replace([np.inf,-np.inf], np.nan).fillna(0.0)
    model = IsolationForest(n_estimators=300, random_state=42, n_jobs=-1)
    model.fit(X_train_norm)
    val_score = -model.score_samples(X_val)
    test_score = -model.score_samples(X_test)
    return model, val_score, test_score

def eval_ranking(scores, y, name):
    print(f"\n=== {name} (ranking) ===")
    print("PR-AUC:", average_precision_score(y, scores))
    print("ROC-AUC:", roc_auc_score(y, scores))

def eval_threshold(scores, y, thr, name):
    pred = (scores >= thr).astype(int)
    print(f"\n=== {name} (thresholded) ===")
    print(classification_report(y, pred, digits=4, zero_division=0))

def trust_mode_threshold(val_score, y_val, fpr_target=0.01):
    # threshold from normal validation scores
    thr = np.quantile(val_score[y_val==0], 1 - fpr_target)
    return float(thr)

def eval_alert_rate(scores, y, alert_rate, name):
    thr = np.quantile(scores, 1 - alert_rate)
    pred = (scores >= thr).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(y, pred, average="binary", zero_division=0)
    print(f"{name} alert_rate={alert_rate:.3f} thr={thr:.4f} precision={p:.3f} recall={r:.3f} f1={f1:.3f} flagged={pred.mean():.3f}")

# Example usage:
# model, val_score, test_score = fit_iforest_and_score(baseline_cols)  # or enhanced_cols

# y labels
y_val_sus = dfs["val"]["sus"].values.astype(int)
y_test_sus = dfs["test"]["sus"].values.astype(int)
y_test_evil = dfs["test"]["evil"].values.astype(int)

# Ranking quality
# eval_ranking(val_score, y_val_sus, "VAL sus")
# eval_ranking(test_score, y_test_sus, "TEST sus")
# eval_ranking(test_score, y_test_evil, "TEST evil (severity check)")

# Trust-mode thresholding
# thr = trust_mode_threshold(val_score, y_val_sus, fpr_target=0.01)
# eval_threshold(val_score, y_val_sus, thr, "VAL sus (trust mode)")
# eval_threshold(test_score, y_test_sus, thr, "TEST sus (trust mode)")

# Incident-mode alert-rate policy
# for ar in [0.01, 0.05, 0.10, 0.20]:
#     eval_alert_rate(test_score, y_test_sus, ar, "TEST sus (incident mode)")
7 — Results (baseline for trust mode, enhanced for incident mode)
7.1 Trust mode (validation, rare anomalies)
In validation, suspicious events are rare, so we calibrate conservatively to avoid alert fatigue. Using the baseline feature set with Isolation Forest trained on normal-only events, we set the threshold to cap false positives on normal validation events (≈1% FPR on validation normals).
Validation performance (sus):
- PR-AUC: 0.181
- ROC-AUC: 0.990
- At the trust-mode threshold:
  - Precision: 0.194
  - Recall: 0.634
  - F1: 0.297
- Interpretation: the anomaly score separates suspicious from normal events extremely well (ROC-AUC ≈ 0.99). Under an operationally conservative threshold, the detector catches a majority of suspicious events while keeping alerts manageable.
Practical takeaway: in normal operations, the limiting factor is not model separability—it’s trust calibration (keeping false positives low enough that alerts remain actionable).
7.2 Incident mode (test, attack-heavy distribution shift)
The test split behaves like an incident: suspicious events are the majority, and evil=1 appears heavily. Under this shift, a fixed trust-mode threshold becomes too strict (recall collapses). Instead, we use the enhanced args-aware feature set and evaluate under an incident policy.
First, the anomaly score itself remains highly informative:
Test ranking quality (incident regime):
- sus PR-AUC: 0.990
- sus ROC-AUC: 0.938
- evil PR-AUC: 0.989
- evil ROC-AUC: 0.970
Interpretation: even though evil does not appear in training/validation, the model’s anomaly score still ranks high-severity malicious events above normal activity. That’s exactly the behavior you want from a safety layer trained under incomplete labels.
Practical takeaway: during incidents, the score is useful—but the alerting policy must adapt.
7.3 Incident policy: adaptive alert rate (resilience knob)
Instead of a fixed threshold, incident response uses an alert-rate policy: flag the top X% most anomalous events. This is operationally realistic because it maps directly to monitoring capacity and risk level.
Using the enhanced model on the test split:
- Alert ~5% (flagged ~7.4% due to score ties; see the note at the end of this section): precision 0.985, recall 0.081, F1 0.149
- Alert ~10% (flagged ~11.8%): precision 0.990, recall 0.129, F1 0.229
- Alert ~20% (flagged ~30.3%): precision 0.996, recall 0.333, F1 0.499
This gives a clean resilience control:
- when risk is low → keep alert rate low (trust)
- when risk is high → increase alert rate to regain recall (resilience)
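A note on the “flagged ~7.4% at a ~5% budget” line above: np.quantile picks a threshold value, and every event tied at exactly that score gets flagged, so the realized alert fraction can exceed the target. The sketch below measures the realized fraction and shows one possible mitigation (random tie-breaking) if an exact budget matters; this is an illustration, not how the reported numbers were produced.

import numpy as np

def flagged_fraction(scores, alert_rate):
    # Realized share of flagged events at this budget (ties can inflate it).
    thr = np.quantile(scores, 1 - alert_rate)
    return float((scores >= thr).mean())

def flag_top_fraction_exact(scores, alert_rate, seed=42):
    # Enforce the budget exactly by breaking score ties at random.
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    jitter = rng.uniform(0.0, 1e-9, size=len(scores))
    k = int(round(alert_rate * len(scores)))
    order = np.argsort(-(scores + jitter))
    pred = np.zeros(len(scores), dtype=int)
    pred[order[:k]] = 1
    return pred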
7.4 Triage quality: top-K alerts
Another incident-friendly policy is top-K triage: review the highest scoring K events first. This is useful when a human responder needs immediate focus.
Using the enhanced model on test:
- Top-100 alerts: precision 0.93
- Top-500 alerts: precision 0.91
Recall is low at small K (because the incident contains many anomalous events), but top-K precision demonstrates that the model can produce a high-quality ranked triage list under pressure.
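Top-K precision is cheap to compute directly from the scores; a minimal helper (the name and interface are mine) looks like this.

import numpy as np

def precision_at_k(scores, y_true, k):
    # Fraction of true positives among the K highest-scoring events.
    top_idx = np.argsort(-np.asarray(scores))[:k]
    return float(np.asarray(y_true)[top_idx].mean())

# e.g. precision_at_k(test_score, y_test_sus, 100)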
7.5 Summary: what we learn from the results
- Trust mode works best when calibrated on rare-anomaly validation data (baseline features + conservative threshold).
- Incident mode requires policy adaptation (alert rate / triage), because base rates shift dramatically.
- Args-aware features improve detection of severe behavior and generalize well to evil even without seeing evil in training.