[bug]: Silent error handling bug in AppliedToGroup/AddressGroup replaceAll causes stale cache on restart

**Describe the bug**

Lines 452 and 523 in `pkg/agent/controller/networkpolicy/networkpolicy_controller.go` have a typo in the error handling code. They're missing `:=` in the if statement, so errors from `replaceAll()` get silently dropped. This means if writing the policy cache to disk fails (like when disk is full), the agent doesn't log anything but the on-disk cache becomes stale. When the agent restarts later, it loads this stale cache and installs wrong OVS flows, which can break network policy enforcement.

---

**To Reproduce**

1. Deploy antrea on a cluster with some network policies applied
2. Fill up the disk on a node to around 98% (this happens in production with log buildup)
3. Restart the antrea controller pod - this makes the agent reconnect and try to sync all policies
4. The `replaceAll()` call fails because of disk space but no error is logged
5. Free up some disk space, agent keeps running fine (in-memory state is ok)
6. Now restart the agent pod or reboot the node
7. Agent loads stale data from disk cache and installs incorrect flows

---

**Expected**

When `replaceAll()` fails, it should log an error message like it does for networkPolicyStore on line 381. This way we know the disk cache is broken and can fix it before restarting.

---

**Actual behavior**

No error is logged. The bug is in these two places:

Line 452:
```go
if c.appliedToGroupStore.replaceAll(objs); err != nil {
    klog.ErrorS(err, "Failed to store the AppliedToGroups to files")
}
```

Line 523:
```go
if c.addressGroupStore.replaceAll(objs); err != nil {
    klog.ErrorS(err, "Failed to store the AddressGroups to files")
}
```

Should be (add `:=`):
```go
if err := c.appliedToGroupStore.replaceAll(objs); err != nil {
```
```go
if err := c.addressGroupStore.replaceAll(objs); err != nil {
```

The way it's written now, it's checking the outer `err` variable which is always nil, not the return value from `replaceAll()`. Line 381 already does it correctly for networkPolicyStore.

---

**Versions:**

This affects all versions - the bug has been there since the file cache feature was added. I'm looking at current main branch.

---

**Additional context**

I wrote a test that proves this using afero readonly filesystem to force the disk write to fail. With the bug, old stale data stays on disk and no error appears. With the fix, error gets caught properly.

This is pretty serious because it's completely silent - no logs, no events, nothing. And it happens during normal cluster operations like controller upgrades + node restarts. The network policy enforcement just silently breaks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[bug]: Silent error handling bug in AppliedToGroup/AddressGroup replaceAll causes stale cache on restart #8012

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[bug]: Silent error handling bug in AppliedToGroup/AddressGroup replaceAll causes stale cache on restart #8012

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions