Problem
The MCP server's convert_pdf tool (python/opendataloader-pdf-mcp/) still exposes the legacy --format values markdown-with-html and markdown-with-images in its format parameter. These were split out of --format in the core CLI by #508 (PDFDLOSP-6) in favor of explicit flags:
- HTML-in-Markdown → --markdown-with-html
- image extraction → --image-output off|embedded|external
The core CLI keeps the old values for one major release with a deprecation warning, for backward compatibility. But MCP is a newer surface (introduced in #351) with no installed client base depending on the old tokens, so there's no reason to carry them — they just confuse agents about the correct interface.
There's also a latent bug: for format="markdown" with no image_output, images default to external and get written into the temp directory the MCP server uses internally. That directory is then discarded, so the returned Markdown references images that no longer exist.
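The failure mode can be reproduced in isolation. The following is a minimal sketch, not the server's actual code: convert_with_external_images and figure1.png are hypothetical stand-ins, and it assumes the conversion runs inside a tempfile.TemporaryDirectory and returns Markdown that points at a file in that directory.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def convert_with_external_images() -> Tuple[str, Path]:
    """Hypothetical stand-in for a conversion that writes images to disk."""
    with tempfile.TemporaryDirectory() as tmp:
        # Assumption: "external" image output places files in the work dir.
        image = Path(tmp) / "figure1.png"
        image.write_bytes(b"placeholder image bytes")
        markdown = f"![figure]({image})"
        # The image exists only while the temp directory is alive.
        assert image.exists()
    # Leaving the with-block deletes the directory and everything in it.
    return markdown, image


markdown, image = convert_with_external_images()
# The Markdown still names the image, but the file is gone.
dangling = (str(image) in markdown) and not image.exists()
print(dangling)
```

Embedding the image data in the returned Markdown avoids this, because nothing the caller receives depends on a file inside the discarded directory.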
Proposed change (MCP server only)
- Remove markdown-with-html / markdown-with-images from the accepted format values. Valid values: json, text, html, markdown.
- Add an explicit markdown_with_html: bool parameter (maps to --markdown-with-html).
- For markdown/html output, default image_output to embedded when the caller doesn't specify it, so images survive the temp-dir round-trip. Explicit values still win.
- Update README and tests.
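The parameter handling described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the MCP server's implementation: build_cli_args, VALID_FORMATS, and IMAGE_OUTPUT_VALUES are hypothetical names, and only the flags --format, --markdown-with-html, and --image-output come from this issue.

```python
from typing import List, Optional

# Accepted values after the change; legacy tokens are no longer listed.
VALID_FORMATS = ("json", "text", "html", "markdown")
IMAGE_OUTPUT_VALUES = ("off", "embedded", "external")


def build_cli_args(
    format: str = "markdown",
    markdown_with_html: bool = False,
    image_output: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: map tool parameters to core CLI flags."""
    if format not in VALID_FORMATS:
        # Rejects markdown-with-html and markdown-with-images, among others.
        raise ValueError(
            f"Unsupported format {format!r}; expected one of {VALID_FORMATS}"
        )
    if image_output is not None and image_output not in IMAGE_OUTPUT_VALUES:
        raise ValueError(
            f"Unsupported image_output {image_output!r}; "
            f"expected one of {IMAGE_OUTPUT_VALUES}"
        )

    # Default to embedded only for formats that can carry images,
    # and only when the caller did not choose a value.
    if image_output is None and format in ("markdown", "html"):
        image_output = "embedded"

    args = ["--format", format]
    if markdown_with_html:
        args.append("--markdown-with-html")
    if image_output is not None:
        args += ["--image-output", image_output]
    return args
```

For example, build_cli_args("markdown") yields ["--format", "markdown", "--image-output", "embedded"], while build_cli_args("json") yields only ["--format", "json"]. Using None as the "not specified" sentinel is what lets an explicit image_output="external" override the default.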
Out of scope
Core java/, node/, and python/opendataloader-pdf keep the deprecated tokens per the one-major-release compatibility policy established in #508. The node / python legacy code paths that still emit the old tokens only do so from already-deprecated run()-style helpers, so they aren't user-facing; they'll be cleaned up when those helpers are removed.
Acceptance criteria
- convert_pdf(format="markdown-with-html") raises ValueError
- convert_pdf(format="markdown-with-images") raises ValueError
- markdown_with_html=True forwards --markdown-with-html to the JAR
- format="markdown" defaults to image_output="embedded"; explicit value overrides
- format="json" does not set a default image_output
References
- fix(cli): separate markdown modifiers from --format values (PDFDLOSP-6) (core split; MCP not touched)
- feat: add MCP server for AI agent integration (original MCP surface)