Skip to content

Improve BGP controller unit tests to remove dependency on informer goroutine timing #8058

Description

@antoninbas

Background

The BGP controller unit tests in pkg/agent/controller/bgp/controller_test.go use a pattern where syncBGPPolicy is driven manually while live informer goroutines run concurrently. The central synchronization primitive is waitAndGetDummyEvent:

func waitAndGetDummyEvent(t *testing.T, c *fakeController) {
    require.Eventually(t, func() bool {
        return c.queue.Len() == 1
    }, 5*time.Second, 10*time.Millisecond)
    c.queue.Get()
}

This polls the work queue length and dequeues one item. The tests assume that when c.queue.Len() == 1, the item was enqueued by a specific triggering action (e.g., a BGPPolicy DELETE). This assumption has never been sound.

The problem

startInformers calls WaitForCacheSync on both informer factories. WaitForCacheSync only guarantees that the informer store is populated from the initial list — it does not guarantee that event handlers have been called for those objects. The event-handler processor runs in separate goroutines and may still be delivering initial-list ADD notifications when startInformers returns.

In practice this means: after startInformers and the first waitAndGetDummyEvent/doneDummyEvent cycle, pending ADD notifications for the remaining initial BGPPolicy objects may still be in the processor's ring buffer. When doneDummyEvent marks the in-flight item as done, the rate-limiting queue re-queues dummyKey immediately. The subsequent waitAndGetDummyEvent (intended to wait for e.g. a DELETE event) fires on one of these stale re-queued items instead, while the lister may still have the deleted policy present. syncBGPPolicy then returns nil instead of the expected error.

This was a pre-existing latent race that became consistently observable after the upgrade to client-go v0.36, which introduced the watchSynced() goroutine per processor listener (an additional goroutine hop in the event delivery chain), making the window between WaitForCacheSync returning and event handlers being called larger and reliably visible on loaded CI runners.

A short-term fix was applied in #8057 (register a no-op handler on bgpPolicyInformer in startInformers and wait for its HasSynced, which only returns true once all initial-list ADD notifications have been delivered). This reduces the race window but does not eliminate the fundamental fragility.

The fundamental issue

The tests mix two concerns that are hard to compose correctly:

  1. Live informer goroutines delivering events asynchronously
  2. Manual invocation of syncBGPPolicy as if the controller were running synchronously

waitAndGetDummyEvent is an inherently fragile bridge: it checks a queue length snapshot with no guarantee about which event caused the enqueue, and no mechanism to drain stale events.

Proposed improvements

Option A: Don't run live informers at all (preferred)

Pre-populate the fake lister with the desired objects and call syncBGPPolicy directly, without starting any informers. This is the standard approach for controller unit tests:

  • Set up the initial world state in the fake indexer
  • Call syncBGPPolicy, assert on mock expectations and state
  • Mutate the fake indexer to simulate ADD/UPDATE/DELETE
  • Call syncBGPPolicy again, assert

No goroutines, no timing dependencies, no waitAndGetDummyEvent. This is how most well-written controller tests in the Kubernetes ecosystem work.

Option B: Use testing/synctest (Go 1.24)

Wrap the tests in synctest.Test so all goroutines run on a deterministic fake scheduler. waitAndGetDummyEvent could remain but all goroutine scheduling would be deterministic. This is less invasive but requires understanding the synctest model.

Option C: Separate event handler tests from sync tests

Unit-test addBGPPolicy/deleteBGPPolicy/updateBGPPolicy in isolation to verify they enqueue correctly. Test syncBGPPolicy with a fake lister and no live informers. This is the right long-term architecture and avoids the integration-style setup.

Affected tests

All tests that call startInformers followed by waitAndGetDummyEvent:

  • TestSyncBGPPolicy
  • TestSyncBGPPolicyFailures
  • TestBGPPolicyAdd
  • TestBGPPolicyUpdate
  • TestBGPPolicyDelete
  • Several sub-tests in table-driven test functions

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions