
Drop deprecated markdown-with-html/markdown-with-images format values #539

@hnc-jglee

Problem

The MCP server's convert_pdf tool (python/opendataloader-pdf-mcp/) still exposes the legacy --format values markdown-with-html and markdown-with-images in its format parameter. These were split out of --format in the core CLI by #508 (PDFDLOSP-6) in favor of explicit flags:

  • HTML-in-Markdown → --markdown-with-html
  • image extraction → --image-output off|embedded|external

The core CLI keeps the old values for one major release, with a deprecation warning, for backward compatibility. MCP, however, is a newer surface (introduced in #351) with no installed client base depending on the old tokens, so there is no reason to carry them; they only confuse agents about the correct interface.

There's also a latent bug: with format="markdown" and no image_output, images default to external and are written into the temp directory the MCP server uses internally. That directory is then discarded, so the returned Markdown references images that no longer exist.
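
The failure mode can be reproduced in isolation. The sketch below does not call the real server or JAR: convert_with_external_images is a hypothetical stand-in that fakes the converter output, and the file names are made up for illustration. It only shows why a link to an externally written image dangles once the temp directory is cleaned up.

```python
import tempfile
from pathlib import Path


def convert_with_external_images() -> tuple[str, Path]:
    """Hypothetical stand-in for a conversion run with external images.

    It fakes the converter output inside a temporary directory and
    returns the Markdown text plus the path where the image was written.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp)

        # "external" image output: the image is a separate file on disk.
        image_path = out_dir / "images" / "figure1.png"
        image_path.parent.mkdir()
        image_path.write_bytes(b"fake image bytes")

        # The Markdown only carries a relative link to that file.
        md_path = out_dir / "doc.md"
        md_path.write_text("![figure](images/figure1.png)\n")

        markdown = md_path.read_text()
    # Leaving the with-block deletes the directory and the image with it.
    return markdown, image_path


markdown, image_path = convert_with_external_images()
link_is_dangling = not image_path.exists()
print(markdown.strip(), "| image still on disk:", not link_is_dangling)
```

With embedded output the image data travels inside the returned text, so nothing depends on files that outlive the temp directory.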

Proposed change (MCP server only)

  • Remove markdown-with-html / markdown-with-images from the accepted format values. Valid values: json, text, html, markdown.
  • Add an explicit markdown_with_html: bool parameter (maps to --markdown-with-html).
  • For markdown/html output, default image_output to embedded when the caller doesn't specify it, so images survive the temp-dir round-trip. Explicit values still win.
  • Update README and tests.
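
A minimal sketch of the validation and defaulting rules above, not the actual server code. build_cli_args and the constant names are hypothetical, and the argv layout ("--format <value>", "--image-output <value>") is an assumption about how the flags are passed to the JAR; only the flag names and accepted values come from this issue.

```python
from typing import Optional

VALID_FORMATS = ("json", "text", "html", "markdown")
VALID_IMAGE_OUTPUTS = ("off", "embedded", "external")
# Formats whose output can reference images and so need them to survive.
EMBED_BY_DEFAULT = ("markdown", "html")


def build_cli_args(
    format: str,
    markdown_with_html: bool = False,
    image_output: Optional[str] = None,
) -> list[str]:
    """Translate convert_pdf parameters into CLI arguments (hypothetical helper)."""
    # Legacy tokens such as "markdown-with-html" are rejected here.
    if format not in VALID_FORMATS:
        raise ValueError(
            f"Unsupported format {format!r}; expected one of {VALID_FORMATS}"
        )
    if image_output is not None and image_output not in VALID_IMAGE_OUTPUTS:
        raise ValueError(
            f"Unsupported image_output {image_output!r}; "
            f"expected one of {VALID_IMAGE_OUTPUTS}"
        )

    # Default to embedded images only when the caller did not choose.
    if image_output is None and format in EMBED_BY_DEFAULT:
        image_output = "embedded"

    args = ["--format", format]
    if markdown_with_html:
        args.append("--markdown-with-html")
    if image_output is not None:
        args += ["--image-output", image_output]
    return args


print(build_cli_args("markdown"))
print(build_cli_args("json"))
```

Under these assumptions, the first call yields the embedded default and the second passes no image flag at all, which leaves the core CLI's own default in effect for formats that don't need it.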

Out of scope

Core java/, node/, and python/opendataloader-pdf keep the deprecated tokens per the one-major-release compatibility policy established in #508. The node / python legacy code paths that still emit the old tokens only do so from already-deprecated run()-style helpers, so they aren't user-facing; they'll be cleaned up when those helpers are removed.

Acceptance criteria

  • convert_pdf(format="markdown-with-html") raises ValueError
  • convert_pdf(format="markdown-with-images") raises ValueError
  • markdown_with_html=True forwards --markdown-with-html to the JAR
  • format="markdown" defaults to image_output="embedded"; explicit value overrides
  • format="json" does not set a default image_output
  • README reflects the new interface
