[receiver/oracledb] Add redo log metrics#49062
| value_type: double | ||
| unit: s | ||
| attributes: [oracledb.redo.kind] | ||
| oracledb.redo.writes: |
What about the below with an attribute to indicate read/write
| oracledb.redo.writes: | |
| oracledb.redo.operations: |
Or
| oracledb.redo.writes: | |
| oracledb.redo.actions: |
Renamed to oracledb.redo.operations. Kept it a plain counter — redo writes is write-only in v$sysstat (the only redo-read stat is redo blocks read for recovery, a separate recovery concept), so a read/write attribute would be constant-valued.
I like having the attribute, as it makes clear that the metric describes writes. It also leaves room for a read metric to be added at a later date.
Done — added disk.io.direction attribute (write)
| gauge: | ||
| value_type: double | ||
| unit: By | ||
| oracledb.redo.blocks_written: |
This doesn't flow like the others, what about
| oracledb.redo.blocks_written: | |
| oracledb.redo.blocks: |
With an attribute to indicate read/write.
As above, having the attribute would be advantageous.
Done — added disk.io.direction (write)
Update after integration testing: I deployed a build against an Oracle 19c
This PR is now for 6 metrics; the remaining 8 source statistics all populate (verified on Oracle 19c).
…ks per review feedback
Removes oracledb.redo.log_switch.interrupts and the latching oracledb.redo.kind value; their v$sysstat sources are not present on Oracle 12c+ (verified on Oracle 19c).
0ffb34f to be4bac1
Description
The Oracle DB receiver scrapes v$sysstat with SELECT * FROM v$sysstat, which returns all system-statistics rows. However, the scraper only processes a subset of those rows in its switch block and silently discards the rest -- including every redo-log statistic. This PR surfaces 8 discarded redo rows as 6 new opt-in metrics under the oracledb.redo.* namespace, giving operators visibility into redo write latency, redo throughput, and redo-log sizing pressure -- the foundation of Oracle's durability and recovery story (the equivalent of PostgreSQL's WAL).
Because the data is already being fetched, the implementation is purely additive Go code:
Related statistics are consolidated under a single metric name with an OTel attribute rather than mapping each flat statistic to its own metric name, following the receiver's existing attributed-metric convention. The redo-pipeline timing statistics share one metric (oracledb.redo.time) differentiated by the oracledb.redo.kind attribute, and the write-direction counters reuse the existing disk.io.direction attribute (introduced for oracledb.physical_io.*).
This PR adds 6 new opt-in metrics (disabled by default, stability: development):
oracledb.redo.time -- Cumulative time, in seconds, spent in each phase of the redo pipeline. Attribute: oracledb.redo.kind (write|log_space_wait|synch). Covers v$sysstat: redo write time, redo log space wait time, redo synch time (the raw centisecond value is divided by 100 in the scraper). High write/synch time directly raises commit latency.
oracledb.redo.size -- Total amount of redo generated, in bytes. No attributes. Covers v$sysstat: redo size. The canonical redo write-throughput baseline.
oracledb.redo.operations -- Number of redo I/O operations by the log writer (LGWR). Attribute: disk.io.direction (write). Covers v$sysstat: redo writes.
oracledb.redo.blocks -- Number of redo blocks moved between the redo log and storage. Attribute: disk.io.direction (write). Covers v$sysstat: redo blocks written.
oracledb.redo.buffer_allocation.retries -- Number of times a process waited and retried to allocate space in the redo buffer. No attributes. Covers v$sysstat: redo buffer allocation retries. A rising value indicates redo buffer or log writer contention.
oracledb.redo.log_space.requests -- Number of times a process requested space in the redo log buffer and had to wait. No attributes. Covers v$sysstat: redo log space requests.
oracledb.redo.time is emitted as a Sum with aggregation_temporality: cumulative, monotonic: true, value type double, unit s. The other five are emitted as Sum with aggregation_temporality: cumulative, monotonic: true, value type int. Per-metric units: s (oracledb.redo.time); By (oracledb.redo.size); {operations} (oracledb.redo.operations); {blocks} (oracledb.redo.blocks); {retries} (oracledb.redo.buffer_allocation.retries); {requests} (oracledb.redo.log_space.requests).
oracledb.redo.time reports seconds (s) rather than the raw cs (centisecond) units that Oracle exposes via v$sysstat. The scraper divides the raw value by 100 before recording, matching the conversion already in place for the existing oracledb.cpu_time metric. This keeps the receiver's time-unit story consistent and avoids forcing downstream consumers to convert.
The new attribute oracledb.redo.kind uses a dotted-namespaced receiver-scoped key, as no OTel semantic-convention attribute covers Oracle's redo-pipeline phases. oracledb.redo.operations and oracledb.redo.blocks reuse the existing disk.io.direction attribute (currently write), which also leaves room for a future read counterpart without a schema change.
The metric set covers the redo statistics present on currently-supported Oracle (12c+); statistics that were removed in 12c (e.g. redo writer latching time) or that are not exposed by v$sysstat are intentionally excluded.
These metrics can be enabled in the collector configuration:
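As a minimal sketch, assuming the receiver follows the usual mdatagen-generated layout in which each metric has an enabled flag under metrics, turning on all six might look like the fragment below. The receiver instance name oracledb and the omitted connection settings are assumptions and are not taken from this PR.

```yaml
receivers:
  oracledb:
    # connection settings (datasource or endpoint/credentials) omitted in this sketch
    metrics:
      oracledb.redo.time:
        enabled: true
      oracledb.redo.size:
        enabled: true
      oracledb.redo.operations:
        enabled: true
      oracledb.redo.blocks:
        enabled: true
      oracledb.redo.buffer_allocation.retries:
        enabled: true
      oracledb.redo.log_space.requests:
        enabled: true
```

Because the metrics are opt-in, any entry left out of the fragment would stay disabled.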
Link to tracking issue
Fixes #49060
Testing
Unit tests added in scraper_test.go:
TestScraper_ScrapeRedoMetrics exercises all 6 new metrics end-to-end through the scraper using the existing fake DB client, asserting one expected value per data point (3 oracledb.redo.time data points -- one per oracledb.redo.kind -- plus 5 standalone data points = 8 data points per scrape across the 6 metrics).
The shared queryResponses[statsSQL] fixture is extended with 8 new fake v$sysstat rows (one per covered NAME), so TestScraper_Scrape, TestScraper_ScrapeOperationalMetrics, and TestScraper_ScrapeIOPerformanceMetrics continue to pass unchanged.
The test explicitly verifies the centiseconds -> seconds conversion on oracledb.redo.time: raw values of 1500/250/900 cs from redo write time / redo log space wait time / redo synch time produce 15.0/2.5/9.0 s on the write/log_space_wait/synch data points respectively. It also asserts disk.io.direction=write on oracledb.redo.operations and oracledb.redo.blocks.
Auto-generated tests in internal/metadata/generated_metrics_test.go and generated_config_test.go are regenerated by make mdatagen and cover the new metric configs / metric builders.
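The expectations described in the unit tests can be illustrated with a small standalone sketch. It is not the receiver's scraper or test code: the dataPoint and statRow types, the redoDataPoint helper, and every raw value other than 1500/250/900 are hypothetical, and integer counters are held as float64 only to keep the struct small.

```go
package main

import "fmt"

// dataPoint is a hypothetical, simplified stand-in for one emitted data point.
type dataPoint struct {
	Metric    string
	AttrKey   string // empty when the metric has no attribute
	AttrValue string
	Value     float64
}

// statRow is a hypothetical stand-in for one v$sysstat row (NAME, VALUE).
type statRow struct {
	Name string
	Raw  int64
}

// centisecondsToSeconds converts a raw v$sysstat timing value to seconds.
func centisecondsToSeconds(cs int64) float64 {
	return float64(cs) / 100
}

// redoDataPoint maps one covered v$sysstat NAME to its metric and attribute.
// It returns false for names that this sketch does not cover.
func redoDataPoint(name string, raw int64) (dataPoint, bool) {
	switch name {
	case "redo write time":
		return dataPoint{"oracledb.redo.time", "oracledb.redo.kind", "write", centisecondsToSeconds(raw)}, true
	case "redo log space wait time":
		return dataPoint{"oracledb.redo.time", "oracledb.redo.kind", "log_space_wait", centisecondsToSeconds(raw)}, true
	case "redo synch time":
		return dataPoint{"oracledb.redo.time", "oracledb.redo.kind", "synch", centisecondsToSeconds(raw)}, true
	case "redo size":
		return dataPoint{"oracledb.redo.size", "", "", float64(raw)}, true
	case "redo writes":
		return dataPoint{"oracledb.redo.operations", "disk.io.direction", "write", float64(raw)}, true
	case "redo blocks written":
		return dataPoint{"oracledb.redo.blocks", "disk.io.direction", "write", float64(raw)}, true
	case "redo buffer allocation retries":
		return dataPoint{"oracledb.redo.buffer_allocation.retries", "", "", float64(raw)}, true
	case "redo log space requests":
		return dataPoint{"oracledb.redo.log_space.requests", "", "", float64(raw)}, true
	}
	return dataPoint{}, false
}

func main() {
	// The three timing values come from the test description; the rest are made up.
	rows := []statRow{
		{"redo write time", 1500},
		{"redo log space wait time", 250},
		{"redo synch time", 900},
		{"redo size", 4096},
		{"redo writes", 10},
		{"redo blocks written", 20},
		{"redo buffer allocation retries", 3},
		{"redo log space requests", 1},
		{"user commits", 7}, // not a redo statistic, so it is skipped here
	}

	metrics := map[string]bool{}
	points := 0
	for _, r := range rows {
		dp, ok := redoDataPoint(r.Name, r.Raw)
		if !ok {
			continue
		}
		metrics[dp.Metric] = true
		points++
		fmt.Printf("%s %s=%s value=%.1f\n", dp.Metric, dp.AttrKey, dp.AttrValue, dp.Value)
	}
	fmt.Printf("%d data points across %d metrics\n", points, len(metrics))
}
```

With these inputs the final line reports 8 data points across 6 metrics, and the three oracledb.redo.time points carry 15.0, 2.5, and 9.0 seconds, matching the values the test description lists.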
Documentation
Auto-generated documentation.md updated with descriptions and metadata for the 6 new metrics and the new oracledb.redo.kind attribute (oracledb.redo.operations and oracledb.redo.blocks reuse the existing disk.io.direction attribute). internal/metadata/generated_*.go and internal/metadata/testdata/config.yaml regenerated via mdatagen. internal/metadata/config.schema.yaml was manually updated for the 6 new metric stanzas and the new oracledb.redo.kind enum, as that file is not regenerated by mdatagen.
Authorship