
source-s3: CSV newlines_in_values option silently dropped in v4 migration #76853

@devin-ai-integration


Summary

The newlines_in_values CSV option from source-s3 v3 was silently dropped in the v4 migration. Users who relied on this setting to handle embedded newlines in CSV values now experience sync failures when upgrading to v4, with no migration path or documentation about the change.

What changed (v3 → v4)

v3 (≤ 3.x) used PyArrow to parse CSVs and exposed a user-facing toggle newlines_in_values (default: False). When set to True, PyArrow's C-level parser allowed newline characters inside quoted CSV values.

  • Defined in: airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/formats/csv_spec.py (lines 54-59)

v4 (≥ 4.0.0) switched to the file-based CDK, which uses Python's csv.DictReader instead of PyArrow. The new CsvFormat model in the CDK (airbyte_cdk/sources/file_based/config/csv_format.py) has no newlines_in_values field.
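To make the difference concrete, here is a minimal standard-library sketch of how csv.DictReader treats a newline inside a value. The sample data and the read_rows helper are made up for illustration and are not taken from the CDK; the point is only that a quoted multi-line value parses as one record, while an unquoted one splits into a short extra row.

```python
import csv
import io

# Quoted multi-line value: RFC 4180-compliant, parsed as a single record.
quoted = 'id,note\r\n1,"line one\nline two"\r\n2,ok\r\n'
# Same data with the newline left unquoted: the reader sees an extra, short row.
unquoted = 'id,note\r\n1,line one\nline two\r\n2,ok\r\n'


def read_rows(text):
    # newline="" leaves line endings untouched so the csv module interprets them itself.
    return list(csv.DictReader(io.StringIO(text, newline="")))


quoted_rows = read_rows(quoted)
unquoted_rows = read_rows(unquoted)

print(len(quoted_rows), repr(quoted_rows[0]["note"]))
print(len(unquoted_rows), unquoted_rows[1])
```

In the unquoted case DictReader fills the missing column with None, which is the kind of row a field-count check would reject.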

The legacy config migration drops the field

The LegacyConfigTransformer (airbyte-integrations/connectors/source-s3/source_s3/v4/legacy_config_transformer.py, lines 70-143) converts v3 configs to v4 but does not map newlines_in_values at all. The unit tests even assert this: the test input has newlines_in_values: True, and the expected output omits it (unit_tests/v4/test_legacy_config_transformer.py, lines 115-148).

The v4 migration guide (docs/integrations/sources/s3-migrations.md) does not mention this change.

Practical effect

  1. The user-facing knob is gone. A v3 user who had newlines_in_values: True has no equivalent v4 setting. Behavior is now hard-coded to whatever Python's csv.DictReader does.

  2. v4's parser is stricter about malformed rows. When a multi-line value isn't fully RFC-4180-quoted, or when field counts don't match the header due to a stray unquoted newline, the CDK raises ERROR_PARSING_RECORD_MISMATCHED_COLUMNS / ERROR_PARSING_RECORD_MISMATCHED_ROWS (see airbyte_cdk/sources/file_based/file_types/csv_parser.py, lines 84-113). In v3 with newlines_in_values: True, PyArrow was more permissive about these edge cases.

  3. Potential newline="" issue. v4's stream_reader.open_file opens files via smart_open.open(..., mode="r", encoding=...) without newline="". Python's csv docs state that without newline="", newlines embedded inside quoted fields will not be interpreted correctly, because default text mode translates \r\n and lone \r to \n before the csv module sees them. CRLF, LF, and lone-CR files can therefore behave differently than they did under the PyArrow implementation.
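The newline="" concern in item 3 can be reproduced with the standard library alone. The sketch below uses io.TextIOWrapper over in-memory bytes as a stand-in for a text-mode file handle; it does not call smart_open, and the sample bytes and the parse helper are assumptions made for the demonstration.

```python
import csv
import io

# A CRLF file whose quoted value contains an embedded CRLF.
raw = b'id,note\r\n1,"first\r\nsecond"\r\n'


def parse(newline):
    # `newline` is forwarded the same way the built-in open() would receive it.
    handle = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=newline)
    return list(csv.DictReader(handle))


translated = parse(None)  # default text mode: \r\n is rewritten to \n before parsing
preserved = parse("")     # what the csv docs recommend: line endings reach the parser intact

print(repr(translated[0]["note"]))
print(repr(preserved[0]["note"]))
```

Both calls yield one record here, but the value differs: the default mode silently rewrites the embedded line ending, while newline="" keeps the bytes the producer wrote.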

Current workaround

  • Use ignore_errors_on_fields_mismatch: true in the v4 stream's CSV format config to skip offending rows instead of failing the sync.
  • Ensure fields with embedded newlines are fully double-quoted per RFC 4180, and confirm the file uses LF or standard CRLF (not lone CR).
  • If the file is not RFC 4180-compliant, pre-process before S3 ingest.
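As a rough aid for the pre-processing route, the following sketch scans a CSV for rows whose field count differs from the header, which is the condition that surfaces as a mismatched-columns error. The find_mismatched_rows function is hypothetical and not part of the connector or the CDK; it assumes the default comma delimiter and double-quote character.

```python
import csv
import io


def find_mismatched_rows(text_stream):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(text_stream)
    header = next(reader, None)
    if header is None:
        return []
    expected = len(header)
    mismatches = []
    for row in reader:
        # Blank lines come back as empty lists; skip them rather than flag them.
        if row and len(row) != expected:
            # line_num counts physical lines read so far, so it marks where the record ends.
            mismatches.append((reader.line_num, len(row)))
    return mismatches


sample = 'id,note\n1,"a\nb"\n2,broken\nvalue\n3,ok\n'
problems = find_mismatched_rows(io.StringIO(sample, newline=""))
print(problems)
```

Running a check like this before upload shows which lines need quoting fixes, instead of discovering them as failed or skipped rows during a sync.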

Suggested fix

One or more of:

  1. Re-add a newlines_in_values option (or equivalent) in the file-based CDK's CsvFormat that opens the file with newline="" to let csv.DictReader handle embedded newlines correctly per Python docs.
  2. Update the legacy config migration (legacy_config_transformer.py) to translate newlines_in_values into whatever the v4 equivalent is.
  3. Document the behavioral change in docs/integrations/sources/s3-migrations.md and mention the ignore_errors_on_fields_mismatch workaround.
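Until a v4 equivalent exists, one interim shape for item 2 would be to surface the dropped option instead of discarding it silently. The sketch below is purely illustrative: strip_unsupported_csv_options, UNSUPPORTED_IN_V4, and the notice text are hypothetical names and do not reflect the actual LegacyConfigTransformer code.

```python
# Hypothetical list of v3 CSV options that have no v4 counterpart.
UNSUPPORTED_IN_V4 = ("newlines_in_values",)


def strip_unsupported_csv_options(legacy_format):
    """Return a copy of the v3 format dict without unsupported keys, plus user-facing notices."""
    remaining = dict(legacy_format)
    notices = []
    for key in UNSUPPORTED_IN_V4:
        # The key is always removed; a notice is produced only if the user had enabled it.
        if remaining.pop(key, False):
            notices.append(
                f"'{key}' was enabled in the v3 config but has no v4 equivalent; "
                "multi-line CSV values must be fully quoted."
            )
    return remaining, notices


converted, notices = strip_unsupported_csv_options(
    {"filetype": "csv", "delimiter": ",", "newlines_in_values": True}
)
for notice in notices:
    print(notice)
```

How such notices would be delivered (log line, migration message, or docs link) is left open here; the sketch only shows that the information need not be lost during conversion.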

Reported by

Community user via Slack.


Internal Tracking: https://github.com/airbytehq/oncall/issues/12046
