source-s3: CSV newlines_in_values option silently dropped in v4 migration #76853

## Summary

The `newlines_in_values` CSV option from source-s3 v3 was silently dropped in the v4 migration. Users who relied on this setting to handle embedded newlines in CSV values can now hit sync failures after upgrading to v4, and neither the config migration nor the migration guide mentions the change.

## What changed (v3 → v4)

**v3 (≤ 3.x)** used PyArrow to parse CSVs and exposed a user-facing toggle `newlines_in_values` (default: `False`). When set to `True`, PyArrow's C-level parser allowed newline characters inside quoted CSV values.

- Defined in: `airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/formats/csv_spec.py` (lines 54-59)

**v4 (≥ 4.0.0)** switched to the file-based CDK, which uses Python's `csv.DictReader` instead of PyArrow. The new `CsvFormat` model in the CDK (`airbyte_cdk/sources/file_based/config/csv_format.py`) has no `newlines_in_values` field.
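For reference, here is a minimal sketch of how `csv.DictReader` treats an embedded newline when the input is well-formed. It uses only the standard library and an in-memory string, not the CDK's actual reader, so it illustrates the parser's baseline behavior rather than the connector's code path. The sample data and the `parse` helper are made up for illustration.

```python
import csv
import io

# Well-formed sample: the "comment" field of the first data row contains a
# newline and is wrapped in double quotes, as RFC 4180 requires.
RAW = 'id,comment\r\n1,"line one\nline two"\r\n2,plain\r\n'


def parse(text: str) -> list[dict[str, str]]:
    # newline="" keeps StringIO from translating line endings before the
    # csv module sees them.
    return list(csv.DictReader(io.StringIO(text, newline="")))


rows = parse(RAW)
for row in rows:
    print(repr(row["comment"]))
```

When the field is quoted, the embedded newline stays inside one record. The failures described in this issue concern inputs that are not quoted this cleanly, or whose line endings are altered before parsing.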

## The legacy config migration drops the field

The `LegacyConfigTransformer` (`airbyte-integrations/connectors/source-s3/source_s3/v4/legacy_config_transformer.py`, lines 70-143) converts v3 configs to v4 but does not map `newlines_in_values` at all. The unit tests assert this behavior: the test input sets `newlines_in_values: True`, and the expected output omits it (`unit_tests/v4/test_legacy_config_transformer.py`, lines 115-148).

The v4 migration guide (`docs/integrations/sources/s3-migrations.md`) does not mention this change.

## Practical effect

1. **The user-facing knob is gone.** A v3 user who had `newlines_in_values: True` has no equivalent v4 setting. Behavior is now hard-coded to whatever Python's `csv.DictReader` does.

2. **v4's parser is stricter about malformed rows.** When a multi-line value isn't fully RFC-4180-quoted, or when field counts don't match the header due to a stray unquoted newline, the CDK raises `ERROR_PARSING_RECORD_MISMATCHED_COLUMNS` / `ERROR_PARSING_RECORD_MISMATCHED_ROWS` (see `airbyte_cdk/sources/file_based/file_types/csv_parser.py`, lines 84-113). In v3 with `newlines_in_values: True`, PyArrow was more permissive about these edge cases.

3. **Potential `newline=""` issue.** v4's `stream_reader.open_file` opens files via `smart_open.open(..., mode="r", encoding=...)` without `newline=""`. Python's csv docs recommend opening files with `newline=""`; without it, newlines embedded in quoted fields may not be interpreted correctly because line endings can be translated before the parser sees them. As a result, CRLF, LF, and lone-CR files can behave differently than they did under PyArrow.
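The `newline=""` point in item 3 can be reproduced with the standard library alone. The sketch below uses Python's built-in `open` as a stand-in. Whether `smart_open.open` translates line endings the same way for S3 objects is an assumption that should be verified against the connector. The file contents and the `read_comment` helper are illustrative.

```python
import csv
import os
import tempfile

# A CRLF-terminated file whose quoted "comment" field also contains a CRLF.
DATA = b'id,comment\r\n1,"a\r\nb"\r\n'


def read_comment(path: str, **open_kwargs) -> str:
    # Open in text mode and return the "comment" value of the first data row.
    with open(path, mode="r", encoding="utf-8", **open_kwargs) as fh:
        return next(csv.DictReader(fh))["comment"]


with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(DATA)
path = tmp.name

try:
    # Default text mode: universal newlines rewrites "\r\n" to "\n".
    translated = read_comment(path)
    # newline="": line endings reach the csv module untouched.
    preserved = read_comment(path, newline="")
finally:
    os.remove(path)

print(repr(translated), repr(preserved))
```

In this small case both reads succeed, but the field value differs between the two modes. That shows the reader sees different bytes from the ones stored in the file.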

## Current workaround

- Use `ignore_errors_on_fields_mismatch: true` in the v4 stream's CSV format config to skip offending rows instead of failing the sync.
- Ensure fields with embedded newlines are fully double-quoted per RFC 4180, and confirm the file uses LF or standard CRLF (not lone CR).
- If the file is not RFC 4180-compliant, pre-process it before uploading to S3.
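Before uploading, a quick local check can show which rows would trip a field-count mismatch. The following is a minimal sketch using only the standard library. `find_mismatched_rows` is a hypothetical helper, not part of the CDK, and it approximates the mismatch condition rather than reproducing the CDK's exact checks.

```python
import csv
import io


def find_mismatched_rows(text: str) -> list[tuple[int, list[str]]]:
    """Return (line number, row) pairs whose field count differs from the header."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    bad = []
    for row in reader:
        if not row:
            # Blank lines yield empty rows; csv.DictReader skips these as well.
            continue
        if len(row) != len(header):
            # line_num is the number of source lines consumed so far.
            bad.append((reader.line_num, row))
    return bad


# The second record is split by a stray unquoted newline; the last is quoted correctly.
SAMPLE = 'id,comment\n1,first half\nsecond half\n2,"ok\nquoted"\n'
bad = find_mismatched_rows(SAMPLE)
print(bad)
```

Rows reported here are candidates for quoting fixes at the source. The correctly quoted multi-line value is not flagged.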

## Suggested fix

One or more of:
1. **Re-add a `newlines_in_values` option** (or equivalent) in the file-based CDK's `CsvFormat` that opens the file with `newline=""` to let `csv.DictReader` handle embedded newlines correctly per Python docs.
2. **Update the legacy config migration** (`legacy_config_transformer.py`) to translate `newlines_in_values` into whatever the v4 equivalent is.
3. **Document the behavioral change** in `docs/integrations/sources/s3-migrations.md` and mention the `ignore_errors_on_fields_mismatch` workaround.
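As a rough illustration of options 2 and 3, a migration step could at least surface the dropped setting instead of discarding it silently. Everything in this sketch is hypothetical: `transform_csv_format`, `DROPPED_OPTIONS`, the logger name, and the dict shapes are invented for illustration and do not reflect the actual `LegacyConfigTransformer` implementation.

```python
import logging

logger = logging.getLogger("legacy_config_sketch")  # hypothetical logger name

# Hypothetical: legacy options this sketch treats as having no v4 equivalent.
DROPPED_OPTIONS = {"newlines_in_values"}


def transform_csv_format(legacy_format: dict) -> tuple[dict, list[str]]:
    """Copy legacy CSV options into a v4-style dict and report dropped ones."""
    v4_format = {"filetype": "csv"}
    dropped = []
    for key, value in legacy_format.items():
        if key == "filetype":
            continue
        if key in DROPPED_OPTIONS:
            if value:
                # Only warn when the user had actually enabled the option.
                dropped.append(key)
                logger.warning("Legacy CSV option %r has no v4 equivalent", key)
            continue
        v4_format[key] = value
    return v4_format, dropped


fmt, dropped = transform_csv_format(
    {"filetype": "csv", "delimiter": ",", "newlines_in_values": True}
)
print(fmt, dropped)
```

Returning the list of dropped options alongside the converted format would let the caller decide whether to log, fail, or attach a notice to the sync.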

## Reported by

Community user via Slack.



---
**Internal Tracking:** https://github.com/airbytehq/oncall/issues/12046

