Summary
The newlines_in_values CSV option from source-s3 v3 was silently dropped in the v4 migration. Users who relied on this setting to handle embedded newlines in CSV values now experience sync failures when upgrading to v4, with no migration path or documentation about the change.
What changed (v3 → v4)
v3 (≤ 3.x) used PyArrow to parse CSVs and exposed a user-facing toggle newlines_in_values (default: False). When set to True, PyArrow's C-level parser allowed newline characters inside quoted CSV values.
- Defined in: airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/formats/csv_spec.py (lines 54-59)
v4 (≥ 4.0.0) switched to the file-based CDK, which uses Python's csv.DictReader instead of PyArrow. The new CsvFormat model in the CDK (airbyte_cdk/sources/file_based/config/csv_format.py) has no newlines_in_values field.
The legacy config migration drops the field
The LegacyConfigTransformer (airbyte-integrations/connectors/source-s3/source_s3/v4/legacy_config_transformer.py, lines 70-143) converts v3 configs to v4 but does not map newlines_in_values at all. This is even asserted in unit tests — the test input has newlines_in_values: True, and the expected output strips it (unit_tests/v4/test_legacy_config_transformer.py, lines 115-148).
The v4 migration guide (docs/integrations/sources/s3-migrations.md) does not mention this change.
Practical effect
- The user-facing knob is gone. A v3 user who had newlines_in_values: True has no equivalent v4 setting. Behavior is now hard-coded to whatever Python's csv.DictReader does.
- v4's parser is stricter about malformed rows. When a multi-line value isn't fully RFC-4180-quoted, or when field counts don't match the header due to a stray unquoted newline, the CDK raises ERROR_PARSING_RECORD_MISMATCHED_COLUMNS / ERROR_PARSING_RECORD_MISMATCHED_ROWS (see airbyte_cdk/sources/file_based/file_types/csv_parser.py, lines 84-113). In v3 with newlines_in_values: True, PyArrow was more permissive about these edge cases.
- Potential newline="" issue. v4's stream_reader.open_file opens files via smart_open.open(..., mode="r", encoding=...) without newline="". Per Python's csv docs, this can mis-handle embedded newlines when line endings are translated, so CRLF vs LF vs lone-CR files can behave differently than the PyArrow implementation did.
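The newline="" point can be reproduced with the standard library alone. The sketch below is not the connector's code: it uses io.TextIOWrapper over an in-memory buffer as a stand-in for the smart_open call, and the sample data and the read_rows helper are made up for illustration. It shows that without newline="", a CRLF inside a quoted value is rewritten to LF before csv.DictReader ever sees it.

```python
import csv
import io

# Hypothetical sample: CRLF line endings, with a CRLF embedded in a quoted value.
RAW = b'id,note\r\n1,"line one\r\nline two"\r\n'


def read_rows(raw: bytes, newline):
    """Decode raw bytes with the given newline mode and parse them as CSV."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=newline)
    return list(csv.DictReader(text))


# newline=None is the default text-mode behavior: every line ending,
# including the one inside the quoted field, is translated to "\n".
translated = read_rows(RAW, newline=None)

# newline="" disables translation, as the csv docs recommend.
preserved = read_rows(RAW, newline="")

print(repr(translated[0]["note"]))  # 'line one\nline two'
print(repr(preserved[0]["note"]))   # 'line one\r\nline two'
```

In this small case both modes still yield a single record; the difference is that the value itself is altered. Whether this contributes to the reported sync failures has not been confirmed here.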
Current workaround
- Use ignore_errors_on_fields_mismatch: true in the v4 stream's CSV format config to skip offending rows instead of failing the sync.
- Ensure fields with embedded newlines are fully double-quoted per RFC 4180, and confirm the file uses LF or standard CRLF (not lone CR).
- If the file is not RFC 4180-compliant, pre-process before S3 ingest.
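For the pre-processing route, a small pre-flight check can flag rows that would likely trip the field-count checks before the file reaches S3. This is a minimal sketch using only Python's csv module; find_mismatched_rows is a hypothetical helper, not part of the CDK, and it only approximates the CDK's own validation.

```python
import csv
import io


def find_mismatched_rows(text: str) -> list[tuple[int, int]]:
    """Return (source line number, field count) for rows that don't match the header."""
    # newline="" keeps embedded line endings untouched, as the csv docs recommend.
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if header is None:
        return []
    problems = []
    for row in reader:
        if not row:
            # Blank lines are skipped here, mirroring csv.DictReader.
            continue
        if len(row) != len(header):
            # line_num is the number of physical lines consumed so far.
            problems.append((reader.line_num, len(row)))
    return problems


# A quoted multi-line value parses as one record; an unquoted one splits in two.
quoted = 'id,note\n1,"first half\nsecond half"\n'
unquoted = 'id,note\n1,first half\nsecond half\n'

print(find_mismatched_rows(quoted))    # []
print(find_mismatched_rows(unquoted))  # [(3, 1)]
```

Note that the unquoted case only reports the second fragment: the first fragment happens to have the right number of fields, so a clean result from this check does not prove the file is RFC 4180-compliant.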
Suggested fix
One or more of:
- Re-add a newlines_in_values option (or equivalent) in the file-based CDK's CsvFormat that opens the file with newline="" to let csv.DictReader handle embedded newlines correctly per Python docs.
- Update the legacy config migration (legacy_config_transformer.py) to translate newlines_in_values into whatever the v4 equivalent is.
- Document the behavioral change in docs/integrations/sources/s3-migrations.md and mention the ignore_errors_on_fields_mismatch workaround.
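As a rough illustration of the migration bullet, the sketch below shows one way a transformer could carry the flag forward instead of dropping it silently. Everything here is hypothetical: transform_csv_format is not the real LegacyConfigTransformer, the list of copied keys is illustrative only, and a v4 newlines_in_values field does not exist today, so the target name is an assumption.

```python
import logging
from typing import Any, Mapping

logger = logging.getLogger(__name__)

# Illustrative subset of options copied as-is; not the real transformer's list.
PASSTHROUGH_KEYS = ("delimiter", "encoding")


def transform_csv_format(legacy_format: Mapping[str, Any]) -> dict[str, Any]:
    """Hypothetical v3 -> v4 CSV format mapping that preserves newlines_in_values."""
    v4_format: dict[str, Any] = {"filetype": "csv"}
    for key in PASSTHROUGH_KEYS:
        if key in legacy_format:
            v4_format[key] = legacy_format[key]

    if legacy_format.get("newlines_in_values"):
        # Assumption: v4 would grow an equivalent field with the same name.
        v4_format["newlines_in_values"] = True
        logger.warning(
            "newlines_in_values was enabled in the v3 config; "
            "v4 parses CSVs with csv.DictReader, so behavior may differ."
        )
    return v4_format


legacy = {"filetype": "csv", "delimiter": ";", "newlines_in_values": True}
print(transform_csv_format(legacy))
```

Even if no v4 field is added, emitting a warning at this point would at least make the dropped setting visible to users during the upgrade.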
Reported by
Community user via Slack.
Internal Tracking: https://github.com/airbytehq/oncall/issues/12046