feat(prepro): basic validation of raw read submissions#6773
Conversation
|
Claude finished @maverbiest's task in 2m 46s —— View job Code Review
Overall this is a clean, well-scoped PR. The architecture choices (typed enum instead of plain string, dispatching via One bug to fix before merge:
Other observations (inline comments posted):
|
| rr_errors, rr_warnings = validate_raw_reads_submission(files) | ||
| errors.extend(rr_errors) | ||
| warnings.extend(rr_warnings) | ||
| case _: |
There was a problem hiding this comment.
FileCategory.ANNOTATIONS is currently in the enum (added for internal use by upload_flatfiles), so if the backend ever sends annotations as an input category, this branch would log a warning. Worth noting for future maintainers that ANNOTATIONS is preprocessing-generated output, not a user-submittable input category — so it should never appear in unprocessed.files.
There was a problem hiding this comment.
Yes currently when we add a new input file category, this match statement will need to be updated to handle it. That is as intended (see discussion of alternative in the PR description)
There was a problem hiding this comment.
I'm wondering if this should emit a hard exception, as (if my understanding is correct) a file category could be added to the values.yaml, and so will be accepted in the backend, but then if its missing in this match statement, it would pass through without validation?
There was a problem hiding this comment.
I guess a hard exception would prevent other categories being allowed for other users of loculus which might not be a good thing
There was a problem hiding this comment.
This file caught some stray updates from running the formatter
tombch
left a comment
There was a problem hiding this comment.
Tested it out and this looks good to me!
| return output_metadata, errors, warnings | ||
|
|
||
|
|
||
| def process_submitted_files( |
There was a problem hiding this comment.
Maybe worth moving the file related functions (this one and validate raw reads) to a new file to not grow this 2k line monstrosity even further
| class AnnotationSourceType(StrEnum): | ||
| METADATA = "Metadata" | ||
| NUCLEOTIDE_SEQUENCE = "NucleotideSequence" | ||
| SUBMITTED_FILE = "SubmittedFile" |
There was a problem hiding this comment.
I think this name is a bit confusing as it implies this is a file submitted by the user but annotations are created by the prepro pipeline, I think just "File" is ok
| warnings: list[ProcessingAnnotation] = [] | ||
|
|
||
| if len(files) > 2: # noqa: PLR2004 | ||
| message = f"Raw reads must be submitted as one or two files, got {len(files)}" |
There was a problem hiding this comment.
This is a message for submitters so I would actually state we want to have paired-end or single-end raw reads, and thus accept a max of 2 files.
anna-parker
left a comment
There was a problem hiding this comment.
some small improvements but also looks good overall!
|
Regarding the alternative suggestion, I like it but I would hold off for now - its not required for the beta and we can make the config a bit nicer in a later step, you can create an issue for this though to track it :-) |
Partially resolves #6758
This PR adds general functionality to validate user-submitted files in the nextclade preprocessing pipeline. How files are processed is determined based on the
FileCategoryand a simple switch statement.The PR introduces a function to validate a submission of raw sequencing reads as a first use case:
The validation is very basic for the moment, and only considers the number of submitted files and their files extensions. This could be built out in the future to include more robust checks.
Alternative approaches
Rather than having a switch statement on
FileCategoryin prepro to determine how different files are processed, we could define this in thevalues.yaml, similar to how we do for metadata fields. E.g.:I decided against this for now to get a working version with a minimal diff, but I would be happy to implement this if people feel it's better.
PR Checklist
🚀 Preview: https://simple-fq-validation.loculus.org