[receiver/oracledb] Add redo log metrics #49062

Open
rluidash wants to merge 5 commits into open-telemetry:main from rluidash:feat/oracledb-redo-log

rluidash commented Jun 15, 2026

Contributor

Description

The Oracle DB receiver scrapes v$sysstat with SELECT * FROM v$sysstat, which returns all system-statistics rows. However, the scraper only processes a subset of those rows in its switch block and silently discards the rest -- including every redo-log statistic. This PR surfaces 8 discarded redo rows as 6 new opt-in metrics under the oracledb.redo.* namespace, giving operators visibility into redo write latency, redo throughput, and redo-log sizing pressure -- the foundation of Oracle's durability and recovery story (the equivalent of PostgreSQL's WAL).

Because the data is already being fetched, the implementation is purely additive Go code:

  • No new SQL queries
  • No new database round-trips
  • Zero additional load on the Oracle instance

Related statistics are consolidated under a single metric name with an OTel attribute rather than mapping each flat statistic to its own metric name, following the receiver's existing attributed-metric convention. The redo-pipeline timing statistics share one metric (oracledb.redo.time) differentiated by the oracledb.redo.kind attribute, and the write-direction counters reuse the existing disk.io.direction attribute (introduced for oracledb.physical_io.*).
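The consolidation described above can be sketched as a lookup table. This is only an illustration of the mapping listed in this description, not the scraper's actual code: the `redoTarget` type and the `redoStats` map are hypothetical names, while the statistic names, metric names, and attribute values are the ones given in this PR.

```go
package main

import "fmt"

// redoTarget is a hypothetical description of where one v$sysstat row lands.
type redoTarget struct {
	Metric    string // metric name from this PR
	AttrKey   string // attribute key, empty when the metric has no attributes
	AttrValue string // attribute value, empty when the metric has no attributes
}

// redoStats maps each covered v$sysstat NAME to its metric and attribute.
// Three timing statistics share one metric and differ only by attribute value.
var redoStats = map[string]redoTarget{
	"redo write time":                {"oracledb.redo.time", "oracledb.redo.kind", "write"},
	"redo log space wait time":       {"oracledb.redo.time", "oracledb.redo.kind", "log_space_wait"},
	"redo synch time":                {"oracledb.redo.time", "oracledb.redo.kind", "synch"},
	"redo size":                      {"oracledb.redo.size", "", ""},
	"redo writes":                    {"oracledb.redo.operations", "disk.io.direction", "write"},
	"redo blocks written":            {"oracledb.redo.blocks", "disk.io.direction", "write"},
	"redo buffer allocation retries": {"oracledb.redo.buffer_allocation.retries", "", ""},
	"redo log space requests":        {"oracledb.redo.log_space.requests", "", ""},
}

func main() {
	// Count the distinct metric names the statistics collapse into.
	metrics := map[string]bool{}
	for _, t := range redoStats {
		metrics[t.Metric] = true
	}
	fmt.Printf("%d statistics -> %d metrics\n", len(redoStats), len(metrics))
	// prints "8 statistics -> 6 metrics"
}
```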

This PR adds 6 new opt-in metrics (disabled by default, stability: development):

  • oracledb.redo.time -- Cumulative time, in seconds, spent in each phase of the redo pipeline. Attribute: oracledb.redo.kind (write|log_space_wait|synch). Covers v$sysstat: redo write time, redo log space wait time, redo synch time (the raw centisecond value is divided by 100 in the scraper). High write/synch time directly raises commit latency.
  • oracledb.redo.size -- Total amount of redo generated, in bytes. No attributes. Covers v$sysstat: redo size. The canonical redo write-throughput baseline.
  • oracledb.redo.operations -- Number of redo I/O operations by the log writer (LGWR). Attribute: disk.io.direction (write). Covers v$sysstat: redo writes.
  • oracledb.redo.blocks -- Number of redo blocks moved between the redo log and storage. Attribute: disk.io.direction (write). Covers v$sysstat: redo blocks written.
  • oracledb.redo.buffer_allocation.retries -- Number of times a process waited and retried to allocate space in the redo buffer. No attributes. Covers v$sysstat: redo buffer allocation retries. A rising value indicates redo buffer or log writer contention.
  • oracledb.redo.log_space.requests -- Number of times a process requested space in the redo log buffer and had to wait. No attributes. Covers v$sysstat: redo log space requests.

oracledb.redo.time is emitted as a Sum with aggregation_temporality: cumulative, monotonic: true, value type double, unit s. The other five are emitted as Sum with aggregation_temporality: cumulative, monotonic: true, value type int. Per-metric units: s (oracledb.redo.time); By (oracledb.redo.size); {operations} (oracledb.redo.operations); {blocks} (oracledb.redo.blocks); {retries} (oracledb.redo.buffer_allocation.retries); {requests} (oracledb.redo.log_space.requests).

oracledb.redo.time reports seconds (s) rather than the raw cs (centisecond) units that Oracle exposes via v$sysstat. The scraper divides the raw value by 100 before recording, matching the conversion already in place for the existing oracledb.cpu_time metric. This keeps the receiver's time-unit story consistent and avoids forcing downstream consumers to convert.
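As a minimal sketch of that conversion, assuming the raw v$sysstat value arrives as a decimal string (an assumption made for illustration): `centisecondsToSeconds` is a hypothetical helper name, not a function from the receiver.

```go
package main

import (
	"fmt"
	"strconv"
)

// centisecondsToSeconds parses a raw v$sysstat VALUE reported in
// centiseconds and returns it in seconds.
// Hypothetical helper; the real scraper may structure this differently.
func centisecondsToSeconds(raw string) (float64, error) {
	cs, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, fmt.Errorf("parse %q: %w", raw, err)
	}
	return cs / 100, nil
}

func main() {
	for _, raw := range []string{"1500", "250", "900"} {
		s, err := centisecondsToSeconds(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s cs = %.1f s\n", raw, s)
	}
}
```

Returning a float64 is what makes oracledb.redo.time a double-valued metric, while the other five counters can stay integers.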

The new attribute oracledb.redo.kind uses a dot-namespaced, receiver-scoped key, as no OTel semantic-convention attribute covers Oracle's redo-pipeline phases. oracledb.redo.operations and oracledb.redo.blocks reuse the existing disk.io.direction attribute (currently only write), which also leaves room for a future read counterpart without a schema change.

The metric set covers the redo statistics present on currently-supported Oracle (12c+); statistics that were removed in 12c (e.g. redo writer latching time) or that are not exposed by v$sysstat are intentionally excluded.

These metrics can be enabled in the collector configuration:

receivers:
  oracledb:
    datasource: "oracle://user:password@host:1521/service"
    metrics:
      oracledb.redo.time:
        enabled: true
      oracledb.redo.size:
        enabled: true
      oracledb.redo.operations:
        enabled: true
      oracledb.redo.blocks:
        enabled: true
      oracledb.redo.buffer_allocation.retries:
        enabled: true
      oracledb.redo.log_space.requests:
        enabled: true

Link to tracking issue

Fixes #49060

Testing

Unit tests added in scraper_test.go:

TestScraper_ScrapeRedoMetrics exercises all 6 new metrics end-to-end through the scraper using the existing fake DB client, asserting one expected value per data point (3 oracledb.redo.time data points -- one per oracledb.redo.kind -- plus 5 standalone data points = 8 data points per scrape across the 6 metrics).
The shared queryResponses[statsSQL] fixture is extended with 8 new fake v$sysstat rows (one per covered NAME), so TestScraper_Scrape, TestScraper_ScrapeOperationalMetrics, and TestScraper_ScrapeIOPerformanceMetrics continue to pass unchanged.
The test explicitly verifies the centiseconds -> seconds conversion on oracledb.redo.time: raw 1500/250/900 cs from redo write time / redo log space wait time / redo synch time produce 15.0/2.5/9.0 s on the write/log_space_wait/synch data points respectively, and asserts disk.io.direction=write on oracledb.redo.operations and oracledb.redo.blocks.
Auto-generated tests in internal/metadata/generated_metrics_test.go and generated_config_test.go are regenerated by make mdatagen and cover the new metric configs / metric builders.
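The data-point arithmetic asserted above (3 + 5 = 8) can be sketched with a simplified stand-in for the scraper's switch block. Everything here is illustrative: `dataPoint` and `scrapeRedo` are hypothetical names, the rows are modeled as a plain map rather than the receiver's fake DB client, and only the three timing values (1500/250/900 cs) come from the test described above; the other fixture values are made up.

```go
package main

import "fmt"

// dataPoint is a hypothetical, simplified stand-in for an emitted data point.
type dataPoint struct {
	Metric string
	Attr   string // attribute value, empty when the metric has none
	Value  float64
}

// scrapeRedo turns fake v$sysstat rows (NAME -> raw VALUE) into data points.
// Rows whose NAME has no case are discarded, as in a plain switch block.
func scrapeRedo(rows map[string]float64) []dataPoint {
	var out []dataPoint
	for name, v := range rows {
		switch name {
		case "redo write time":
			out = append(out, dataPoint{"oracledb.redo.time", "write", v / 100})
		case "redo log space wait time":
			out = append(out, dataPoint{"oracledb.redo.time", "log_space_wait", v / 100})
		case "redo synch time":
			out = append(out, dataPoint{"oracledb.redo.time", "synch", v / 100})
		case "redo size":
			out = append(out, dataPoint{"oracledb.redo.size", "", v})
		case "redo writes":
			out = append(out, dataPoint{"oracledb.redo.operations", "write", v})
		case "redo blocks written":
			out = append(out, dataPoint{"oracledb.redo.blocks", "write", v})
		case "redo buffer allocation retries":
			out = append(out, dataPoint{"oracledb.redo.buffer_allocation.retries", "", v})
		case "redo log space requests":
			out = append(out, dataPoint{"oracledb.redo.log_space.requests", "", v})
		}
	}
	return out
}

func main() {
	rows := map[string]float64{
		"redo write time":                1500, // centiseconds, from the test above
		"redo log space wait time":       250,  // centiseconds, from the test above
		"redo synch time":                900,  // centiseconds, from the test above
		"redo size":                      4096, // made-up fixture value
		"redo writes":                    10,   // made-up fixture value
		"redo blocks written":            20,   // made-up fixture value
		"redo buffer allocation retries": 1,    // made-up fixture value
		"redo log space requests":        2,    // made-up fixture value
		"some other statistic":           7,    // unhandled row, discarded
	}
	points := scrapeRedo(rows)
	fmt.Println(len(points), "data points")
	// prints "8 data points"
}
```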

Documentation

Auto-generated documentation.md updated with descriptions and metadata for the 6 new metrics and the new oracledb.redo.kind attribute (oracledb.redo.operations and oracledb.redo.blocks reuse the existing disk.io.direction attribute). internal/metadata/generated_*.go and internal/metadata/testdata/config.yaml regenerated via mdatagen. internal/metadata/config.schema.yaml was manually updated for the 6 new metric stanzas and the new oracledb.redo.kind enum, as that file is not regenerated by mdatagen.

Authorship

  • I, a human, wrote this pull request description myself.

Comment thread receiver/oracledbreceiver/metadata.yaml Outdated
value_type: double
unit: s
attributes: [oracledb.redo.kind]
oracledb.redo.writes:

Contributor

What about one of the options below, with an attribute to indicate read/write?

Suggested change
- oracledb.redo.writes:
+ oracledb.redo.operations:

Or

Suggested change
- oracledb.redo.writes:
+ oracledb.redo.actions:

Contributor Author

Renamed to oracledb.redo.operations. Kept it a plain counter — redo writes is write-only in v$sysstat (the only redo-read stat is redo blocks read for recovery, a separate recovery concept), so a read/write attribute would be constant-valued.

Contributor

I like having the attribute, as it makes clear that the metric describes writes. It also allows a read metric to be added at a later date.

Contributor Author

Done — added disk.io.direction attribute (write)

Comment thread receiver/oracledbreceiver/metadata.yaml Outdated
gauge:
value_type: double
unit: By
oracledb.redo.blocks_written:

Contributor

This doesn't flow like the others. What about:

Suggested change
- oracledb.redo.blocks_written:
+ oracledb.redo.blocks:

With an attribute to indicate read/write.

Contributor Author

done

Contributor

As above, having the attribute would be advantageous.

Contributor Author

Done — added disk.io.direction (write)

rluidash requested a review from thompson-tomo June 16, 2026 00:21
rluidash commented Jun 19, 2026

Contributor Author

Update after integration testing: I deployed a build against an Oracle 19c
instance and found two of the proposed v$sysstat statistics are not present on
Oracle 12c+, so they would never emit data. I've removed them:

  • Dropped oracledb.redo.log_switch.interrupts: redo log switch interrupts
    is not a v$sysstat statistic.
  • Dropped the latching value from oracledb.redo.kind: its source,
    redo writer latching time, was removed in 12c (replaced by the granular
    redo write * timing stats). oracledb.redo.kind is now write/log_space_wait/synch.

This PR now covers 6 metrics; the remaining 8 source statistics all populate (verified on Oracle 19c).
I have updated the PR description accordingly.

rluidash added 5 commits June 21, 2026 23:58
Removes oracledb.redo.log_switch.interrupts and the latching oracledb.redo.kind
value; their v$sysstat sources are not present on Oracle 12c+ (verified on Oracle 19c).
rluidash force-pushed the feat/oracledb-redo-log branch from 0ffb34f to be4bac1 June 22, 2026 07:29
rluidash marked this pull request as ready for review June 22, 2026 08:20
rluidash requested review from a team, atoulme, crobert-1 and dmitryax as code owners June 22, 2026 08:20

Development

Successfully merging this pull request may close these issues.

Add redo log metrics: write/sync timing, throughput, log space, and buffer allocation

3 participants