Background
The BGP controller unit tests in pkg/agent/controller/bgp/controller_test.go use a pattern where syncBGPPolicy is driven manually while live informer goroutines run concurrently. The central synchronization primitive is waitAndGetDummyEvent:
func waitAndGetDummyEvent(t *testing.T, c *fakeController) {
require.Eventually(t, func() bool {
return c.queue.Len() == 1
}, 5*time.Second, 10*time.Millisecond)
c.queue.Get()
}
This polls the work queue length and dequeues one item. The tests assume that when c.queue.Len() == 1, the item was enqueued by a specific triggering action (e.g., a BGPPolicy DELETE). This assumption has never been sound.
The problem
startInformers calls WaitForCacheSync on both informer factories. WaitForCacheSync only guarantees that the informer store is populated from the initial list — it does not guarantee that event handlers have been called for those objects. The event-handler processor runs in separate goroutines and may still be delivering initial-list ADD notifications when startInformers returns.
In practice this means: after startInformers and the first waitAndGetDummyEvent/doneDummyEvent cycle, pending ADD notifications for the remaining initial BGPPolicy objects may still be in the processor's ring buffer. When doneDummyEvent marks the in-flight item as done, the rate-limiting queue re-queues dummyKey immediately. The subsequent waitAndGetDummyEvent (intended to wait for e.g. a DELETE event) fires on one of these stale re-queued items instead, while the lister may still have the deleted policy present. syncBGPPolicy then returns nil instead of the expected error.
This was a pre-existing latent race that became consistently observable after the upgrade to client-go v0.36, which introduced the watchSynced() goroutine per processor listener (an additional goroutine hop in the event delivery chain), making the window between WaitForCacheSync returning and event handlers being called larger and reliably visible on loaded CI runners.
A short-term fix was applied in #8057 (register a no-op handler on bgpPolicyInformer in startInformers and wait for its HasSynced, which only returns true once all initial-list ADD notifications have been delivered). This reduces the race window but does not eliminate the fundamental fragility.
The fundamental issue
The tests mix two concerns that are hard to compose correctly:
- Live informer goroutines delivering events asynchronously
- Manual invocation of
syncBGPPolicy as if the controller were running synchronously
waitAndGetDummyEvent is an inherently fragile bridge: it checks a queue length snapshot with no guarantee about which event caused the enqueue, and no mechanism to drain stale events.
Proposed improvements
Option A: Don't run live informers at all (preferred)
Pre-populate the fake lister with the desired objects and call syncBGPPolicy directly, without starting any informers. This is the standard approach for controller unit tests:
- Set up the initial world state in the fake indexer
- Call
syncBGPPolicy, assert on mock expectations and state
- Mutate the fake indexer to simulate ADD/UPDATE/DELETE
- Call
syncBGPPolicy again, assert
No goroutines, no timing dependencies, no waitAndGetDummyEvent. This is how most well-written controller tests in the Kubernetes ecosystem work.
Option B: Use testing/synctest (Go 1.24)
Wrap the tests in synctest.Test so all goroutines run on a deterministic fake scheduler. waitAndGetDummyEvent could remain but all goroutine scheduling would be deterministic. This is less invasive but requires understanding the synctest model.
Option C: Separate event handler tests from sync tests
Unit-test addBGPPolicy/deleteBGPPolicy/updateBGPPolicy in isolation to verify they enqueue correctly. Test syncBGPPolicy with a fake lister and no live informers. This is the right long-term architecture and avoids the integration-style setup.
Affected tests
All tests that call startInformers followed by waitAndGetDummyEvent:
TestSyncBGPPolicy
TestSyncBGPPolicyFailures
TestBGPPolicyAdd
TestBGPPolicyUpdate
TestBGPPolicyDelete
- Several sub-tests in table-driven test functions
References
Background
The BGP controller unit tests in
pkg/agent/controller/bgp/controller_test.gouse a pattern wheresyncBGPPolicyis driven manually while live informer goroutines run concurrently. The central synchronization primitive iswaitAndGetDummyEvent:This polls the work queue length and dequeues one item. The tests assume that when
c.queue.Len() == 1, the item was enqueued by a specific triggering action (e.g., a BGPPolicy DELETE). This assumption has never been sound.The problem
startInformerscallsWaitForCacheSyncon both informer factories.WaitForCacheSynconly guarantees that the informer store is populated from the initial list — it does not guarantee that event handlers have been called for those objects. The event-handler processor runs in separate goroutines and may still be delivering initial-list ADD notifications whenstartInformersreturns.In practice this means: after
startInformersand the firstwaitAndGetDummyEvent/doneDummyEventcycle, pending ADD notifications for the remaining initial BGPPolicy objects may still be in the processor's ring buffer. WhendoneDummyEventmarks the in-flight item as done, the rate-limiting queue re-queuesdummyKeyimmediately. The subsequentwaitAndGetDummyEvent(intended to wait for e.g. a DELETE event) fires on one of these stale re-queued items instead, while the lister may still have the deleted policy present.syncBGPPolicythen returns nil instead of the expected error.This was a pre-existing latent race that became consistently observable after the upgrade to client-go v0.36, which introduced the
watchSynced()goroutine per processor listener (an additional goroutine hop in the event delivery chain), making the window betweenWaitForCacheSyncreturning and event handlers being called larger and reliably visible on loaded CI runners.A short-term fix was applied in #8057 (register a no-op handler on
bgpPolicyInformerinstartInformersand wait for itsHasSynced, which only returns true once all initial-list ADD notifications have been delivered). This reduces the race window but does not eliminate the fundamental fragility.The fundamental issue
The tests mix two concerns that are hard to compose correctly:
syncBGPPolicyas if the controller were running synchronouslywaitAndGetDummyEventis an inherently fragile bridge: it checks a queue length snapshot with no guarantee about which event caused the enqueue, and no mechanism to drain stale events.Proposed improvements
Option A: Don't run live informers at all (preferred)
Pre-populate the fake lister with the desired objects and call
syncBGPPolicydirectly, without starting any informers. This is the standard approach for controller unit tests:syncBGPPolicy, assert on mock expectations and statesyncBGPPolicyagain, assertNo goroutines, no timing dependencies, no
waitAndGetDummyEvent. This is how most well-written controller tests in the Kubernetes ecosystem work.Option B: Use
testing/synctest(Go 1.24)Wrap the tests in
synctest.Testso all goroutines run on a deterministic fake scheduler.waitAndGetDummyEventcould remain but all goroutine scheduling would be deterministic. This is less invasive but requires understanding the synctest model.Option C: Separate event handler tests from sync tests
Unit-test
addBGPPolicy/deleteBGPPolicy/updateBGPPolicyin isolation to verify they enqueue correctly. TestsyncBGPPolicywith a fake lister and no live informers. This is the right long-term architecture and avoids the integration-style setup.Affected tests
All tests that call
startInformersfollowed bywaitAndGetDummyEvent:TestSyncBGPPolicyTestSyncBGPPolicyFailuresTestBGPPolicyAddTestBGPPolicyUpdateTestBGPPolicyDeleteReferences
watchSynced()goroutine per processor listener insharedProcessor.run