Skip to content

feat(prepro): basic validation of raw read submissions#6773

Open
maverbiest wants to merge 8 commits into
file-sharing-testfrom
simple-fq-validation
Open

feat(prepro): basic validation of raw read submissions#6773
maverbiest wants to merge 8 commits into
file-sharing-testfrom
simple-fq-validation

Conversation

@maverbiest

@maverbiest maverbiest commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Partially resolves #6758

This PR adds general functionality to validate user-submitted files in the nextclade preprocessing pipeline. How files are processed is determined based on the FileCategory and a simple switch statement.

The PR introduces a function to validate a submission of raw sequencing reads as a first use case:

def validate_raw_reads_submission(
    files: list[FileIdAndName],
) -> tuple[list[ProcessingAnnotation], list[ProcessingAnnotation]]:

The validation is very basic for the moment, and only considers the number of submitted files and their files extensions. This could be built out in the future to include more robust checks.

Alternative approaches

Rather than having a switch statement on FileCategory in prepro to determine how different files are processed, we could define this in the values.yaml, similar to how we do for metadata fields. E.g.:

    submissionDataTypes:
      files:
        enabled: true
        categories:
          - name: raw_reads
            displayName: Raw reads
            preprocessing:
              function: validate_raw_reads_submission
          ...

I decided against this for now to get a working version with a minimal diff, but I would be happy to implement this if people feel it's better.

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://simple-fq-validation.loculus.org

@claude claude Bot added the preprocessing Issues related to the preprocessing component label Jun 26, 2026
@maverbiest maverbiest marked this pull request as ready for review June 26, 2026 16:21
@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Claude finished @maverbiest's task in 2m 46s —— View job


Code Review

  • Gather context
  • Read changed files
  • Review code quality, security, performance, testing, and documentation
  • Post inline comments
  • Summarize findings

Overall this is a clean, well-scoped PR. The architecture choices (typed enum instead of plain string, dispatching via match, integrating into both aligned/unaligned paths) are solid. One actual bug and a few minor issues noted below.

One bug to fix before merge:

file.name.endswith(extension) without a leading dot gives false positives. A file named reads.notfastq passes because "reads.notfastq".endswith("fastq") is True. Fix: use "." + extension in the check (see inline comment on line 1969).

Other observations (inline comments posted):

  • The "too many files" annotation uses a semicolon-joined string as the source field name via from_single. Using from_fields with individual file names listed separately would be more consistent with the per-file annotation style used below it.
  • The error message displays the extension list without leading dots (e.g. ['fastq', ...]), which won't match what users see in filenames — should show .fastq, etc. once the dot fix is applied.
  • FileCategory.ANNOTATIONS is used for preprocessing output, not user input — the case _ branch will silently log a warning if it ever appears in submitted files, which may be confusing. Worth a short comment to document the distinction.
  • No test covers the single-valid-file case (only the two-file case is tested as valid input).

Comment thread preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py Outdated
Comment thread preprocessing/nextclade/tests/test_nextclade_preprocessing.py
rr_errors, rr_warnings = validate_raw_reads_submission(files)
errors.extend(rr_errors)
warnings.extend(rr_warnings)
case _:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileCategory.ANNOTATIONS is currently in the enum (added for internal use by upload_flatfiles), so if the backend ever sends annotations as an input category, this branch would log a warning. Worth noting for future maintainers that ANNOTATIONS is preprocessing-generated output, not a user-submittable input category — so it should never appear in unprocessed.files.

@maverbiest maverbiest Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes currently when we add a new input file category, this match statement will need to be updated to handle it. That is as intended (see discussion of alternative in the PR description)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this should emit a hard exception, as (if my understanding is correct) a file category could be added to the values.yaml, and so will be accepted in the backend, but then if its missing in this match statement, it would pass through without validation?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess a hard exception would prevent other categories being allowed for other users of loculus which might not be a good thing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file caught some stray updates from running the formatter

@maverbiest maverbiest requested review from anna-parker and tombch June 29, 2026 13:49
@tombch tombch added the preview Triggers a deployment to argocd label Jun 29, 2026

@tombch tombch left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested it out and this looks good to me!

return output_metadata, errors, warnings


def process_submitted_files(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth moving the file related functions (this one and validate raw reads) to a new file to not grow this 2k line monstrosity even further

class AnnotationSourceType(StrEnum):
METADATA = "Metadata"
NUCLEOTIDE_SEQUENCE = "NucleotideSequence"
SUBMITTED_FILE = "SubmittedFile"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this name is a bit confusing as it implies this is a file submitted by the user but annotations are created by the prepro pipeline, I think just "File" is ok

warnings: list[ProcessingAnnotation] = []

if len(files) > 2: # noqa: PLR2004
message = f"Raw reads must be submitted as one or two files, got {len(files)}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a message for submitters so I would actually state we want to have paired-end or single-end raw reads, and thus accept a max of 2 files.

@anna-parker anna-parker left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some small improvements but also looks good overall!

@anna-parker

Copy link
Copy Markdown
Contributor

Regarding the alternative suggestion, I like it but I would hold off for now - its not required for the beta and we can make the config a bit nicer in a later step, you can create an issue for this though to track it :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

preprocessing Issues related to the preprocessing component preview Triggers a deployment to argocd

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants