Skip to content

[bug]: Silent error handling bug in AppliedToGroup/AddressGroup replaceAll causes stale cache on restart #8012

Description

@Ady0333

Describe the bug

Lines 452 and 523 in pkg/agent/controller/networkpolicy/networkpolicy_controller.go have a typo in the error handling code. They're missing := in the if statement, so errors from replaceAll() get silently dropped. This means if writing the policy cache to disk fails (like when disk is full), the agent doesn't log anything but the on-disk cache becomes stale. When the agent restarts later, it loads this stale cache and installs wrong OVS flows, which can break network policy enforcement.


To Reproduce

  1. Deploy antrea on a cluster with some network policies applied
  2. Fill up the disk on a node to around 98% (this happens in production with log buildup)
  3. Restart the antrea controller pod - this makes the agent reconnect and try to sync all policies
  4. The replaceAll() call fails because of disk space but no error is logged
  5. Free up some disk space, agent keeps running fine (in-memory state is ok)
  6. Now restart the agent pod or reboot the node
  7. Agent loads stale data from disk cache and installs incorrect flows

Expected

When replaceAll() fails, it should log an error message like it does for networkPolicyStore on line 381. This way we know the disk cache is broken and can fix it before restarting.


Actual behavior

No error is logged. The bug is in these two places:

Line 452:

if c.appliedToGroupStore.replaceAll(objs); err != nil {
    klog.ErrorS(err, "Failed to store the AppliedToGroups to files")
}

Line 523:

if c.addressGroupStore.replaceAll(objs); err != nil {
    klog.ErrorS(err, "Failed to store the AddressGroups to files")
}

Should be (add :=):

if err := c.appliedToGroupStore.replaceAll(objs); err != nil {
if err := c.addressGroupStore.replaceAll(objs); err != nil {

The way it's written now, it's checking the outer err variable which is always nil, not the return value from replaceAll(). Line 381 already does it correctly for networkPolicyStore.


Versions:

This affects all versions - the bug has been there since the file cache feature was added. I'm looking at current main branch.


Additional context

I wrote a test that proves this using afero readonly filesystem to force the disk write to fail. With the bug, old stale data stays on disk and no error appears. With the fix, error gets caught properly.

This is pretty serious because it's completely silent - no logs, no events, nothing. And it happens during normal cluster operations like controller upgrades + node restarts. The network policy enforcement just silently breaks.

Metadata

Metadata

Assignees

No one assigned

    Labels

    kind/bugCategorizes issue or PR as related to a bug.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions