Genomic sem runs#720
Draft
trafalmadorian97 wants to merge 59 commits into
Draft
Conversation
Runs the basic GenomicSEM workflow (munge → multivariable LDSC → usermodel) via rpy2, taking a lavaan-syntax factor model plus tasks producing reference LD scores, HapMap3 SNPs, and gwaslab-format summary statistics. Returns a DirectoryAsset containing munged sumstats, LDSC outputs, and factor model results. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extends the existing GenomicSEMTask with two new tasks that add SNP effects to GenomicSEM models and produce per-factor summary statistics: - GenomicSEMCommonFactorGWASTask wraps commonfactorGWAS, writing one parquet for the common factor. - GenomicSEMUserGWASTask wraps userGWAS with a user-supplied lavaan model and required sub_components selector, writing one parquet per requested component. Both run munge -> ldsc -> sumstats -> GWAS, reusing the helpers from genomic_sem_task.py. Per-trait OLS / logistic / linear-prob estimation must be declared explicitly via GWASMethod on each source. The sumstats reference panel is treated as an opaque Task[FileAsset] dependency. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces lavaan's per-SNP fit with a vectorized Gauss-Newton DWLS optimization plus closed-form sandwich SE. Removes ~20 ms of lavaan dispatch overhead per SNP, which on a 7M-SNP genome-wide run is the difference between days of compute and minutes. v1 limitations (asserted at runtime): k = 2 traits, GC = "standard", DWLS only, no Q heterogeneity, no nearPD smoothing of non-PD V_Full / S_Fullrun (such SNPs are flagged warning=1). Validation: a headline test runs R commonfactorGWAS via rpy2 on 30 synthetic SNPs and asserts our kernel matches lavaan to ~1e-7 absolute / ~1e-5 relative on `est`, and ~1e-5 / ~1e-3 on `se_c`. Empirical agreement on the same fixture: ~7e-10 absolute on `est` and ~1e-7 on `se_c` -- 7+ significant figures. GenomicSEMCommonFactorGWASPythonTask sits alongside the existing R task and shares all setup helpers (munge, ldsc, sumstats); only the per-SNP loop differs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add explicit linear equations (F1 = beta*SNP + zeta, trait_j = lambda_j*F1 + epsilon_j) alongside the lavaan notation so readers unfamiliar with SEM software can follow the model structure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
For a pair (composite, reference) of GWAS sumstats, decomposes per-SNP effects into a reference-aligned factor C and an orthogonal residual factor NC (the "subtraction"). Canonical application: educational attainment (composite) vs cognitive performance (reference), yielding a "non-cognitive" GWAS (Demange et al. 2021, Nat Genet). The model is exactly identified at k=2 (5 informative moments, 5 free parameters), so the per-SNP fit is closed-form -- no Newton iteration needed. SEs come from the delta method applied to the parameter-solving map. The entire pipeline is a single vectorised matrix multiply. Validation: a headline test runs GenomicSEM::userGWAS with the Cholesky model on 30 synthetic SNPs and asserts agreement with our kernel. Both betas and sandwich SEs match within ~0.1-0.8% (the discrepancy comes from userGWAS(fix_measurement=TRUE) obtaining loadings via lavaan's DWLS optimizer vs our closed-form -- both correct at exact identification). Also includes an effective-sample-size column (N_eff = 1/(varSNP*se^2)) for downstream tools like MAGMA. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace finite-difference d(beta_C, beta_NC)/d(s) with hand-derived analytic partials. The closed-form subtraction estimates are elementary functions of the moment vector, so the Jacobian is ~10 one-line array assignments — no runtime sympy dependency needed. Motivation: - Robustness: the FD path evaluated _solve_betas_from_s, whose max(s4 - a_C^2, 1e-30) clamp could be straddled by a centred step when a_NC is small, silently corrupting the derivative. The analytic form never touches the clamp. - Removes the scale-aware step-size heuristic and its roundoff loss. - Brings the kernel in line with the common-factor kernel, which already uses an analytic model_jacobian. The old FD routine is kept as _jacobian_betas_wrt_s_fd (test oracle only). Two new tests cross-check analytic vs FD at an interior point and at a deliberately small-a_NC point where FD would degrade. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…on-PD Reader-facing cleanups to the subtraction kernel and task wrapper: - Rename factors C/NC -> F/R. F = the genetic factor common to both traits (defined by the reference trait); R = the remainder unique to the composite trait. Use-case-neutral, replacing the cognitive/non-cognitive naming from the original educational-attainment application. - Replace bare tuple returns with @Frozen attrs objects: SubtractionLoadings (b, a_F, a_R), FactorBetas (beta_F, beta_R), and _CovStruct / _SubtractionFrames in the task wrapper. Call sites are now self-documenting. - Add shape asserts at the top of the kernel helpers (_solve_loadings, _solve_betas, _solve_betas_from_s, compute_v_snp_batch, the Jacobians). - Add a key-matrices table (shapes + meaning) to the module docstring. - Fail loudly (ValueError) when V_s is non-PD instead of emitting an ignorable per-SNP warning flag; PD-ness is set by the shared LDSC inputs, so it is a malformed-input condition, not a per-SNP one. Drop the `warning` result field. - `fail` is now a plain bool array (was int8). Compute SE/Z/p/N_eff over the good mask without redundant per-element `> 0` guards. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The subtraction moment vector s = (s2, s3, s4, s5, s6) does not include the SNP-SNP variance entry, so varSNPSE2 was dead in fit_gwas_by_subtraction. Drop the parameter, the _resolve_snp_se helper, and the wrapper pass-through. (The common-factor kernel still uses varSNPSE2 — its augmented vech does include the SNP-SNP moment — and is unchanged.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`_genomic_sem_ldsc.run_ldsc` is a faithful port of GenomicSEM::ldsc, producing the same covstruc (S, V, I, N, m, S_Stand, V_Stand) for k>=2 traits. This is the make-or-break piece for a fully-Python GWAS-by-subtraction pipeline (the numpy kernel is already done). Validation: - test_genomic_sem_ldsc.py: numpy unit tests for the block-jackknife regression, block boundaries, and liability conversion, plus an R comparison (gsem.ldsc on synthetic LD scores + munged sumstats) covering S/V/I/S_Stand/V_Stand. - experiments/.../validate_python_ldsc.py: dev reproducer that runs both R ldsc and run_ldsc on the real ME/CFS + multisite-pain GWASes. Agreement is machine precision (S rel err ~6e-16, V ~2e-13, I ~4e-15; r_g = 0.447). Reproduced subtleties: covariance-branch regression weights use weights_cov (R relies on a $weights -> $weights_cov partial match); V.hold stores raw-slope pseudo-values; v_out = cov(V.hold)/(N_a N_b n_blocks/m^2); vech order is row-major upper-triangle = (S00, S01, S11), matching the kernel's V_LD convention. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…raction) `_genomic_sem_munge.munge_sumstats` and `_genomic_sem_sumstats.run_sumstats` are polars ports of GenomicSEM::munge / ::sumstats, operating on in-memory DataFrames so the pipeline avoids R's multi-minute text reads. munge: QC vs HapMap3, allele-align, Z = sign(effect)*Phi^-1(1-P/2). sumstats: align to the 1000G reference, rescale per method (OLS / se.logit / linprob / default OR), listwise-merge to the common SNP set, returning SNP, CHR, BP, MAF, A1, A2 + beta.<trait>/se.<trait>. Validation (synthetic R-comparison unit tests + real ME/CFS + pain data): - munge: identical SNP sets, Z to ~1e-13, ~108x faster (325s -> 3s). - sumstats: identical SNP sets, beta/se to ~1e-13, ~20x faster (422s -> ~11s). Notable fix: R's na.omit drops NaN but polars drop_nulls does not. An OLS SNP with P==1 gives Z==0 -> standardized effect 0 -> se=|0/0|=NaN, which R discards; the output now filters is_not_nan too (regression-tested with a P==1 case). Also: varSNP uses the file MAF when present (R partial-matches MAF.y), else the reference MAF; the reported MAF column is the reference MAF. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GenomicSEMGWASBySubtractionFullPythonTask chains the validated Python ports (munge_sumstats -> run_ldsc -> run_sumstats -> fit_gwas_by_subtraction) so the whole pipeline runs with no R at execution time. Reuses the R-backed task's _make_result_df/_SubtractionFrames and output filenames, so the F/R parquet outputs are schema-identical and interchangeable. Also drop a now-stale ty:ignore[not-subscriptable] in genomic_sem_task.py that a stub update made unused. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Refactor toward merging the Python tasks separately from the R tasks: - No _task module imports from a peer _task module any more. - genomic_sem_gwas_by_subtraction_full_python_task no longer transitively imports rpy2. Extract shared code into new modules: - _genomic_sem_config.py (rpy2-free): constants + source/config dataclasses - _genomic_sem_inputs.py (rpy2-free): munge-input prep, path resolution, source validation, gwas-method flag mapping - _subtraction_result.py (rpy2-free): F/R per-factor result-frame builders - _genomic_sem_r_bridge.py (rpy2): the single home for shared R calls Each *_task.py now holds only its Task class and imports shared symbols from those modules. Importers (asset, tests, experiment scripts) point at the new canonical locations. Behaviour is unchanged; green + all 64 genomic_sem tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…periments genomic_sem follow-ups from code review: - GenomicSEMGWASBySubtractionFullPythonTask now takes composite_trait_source + reference_trait_source (named) instead of an order-sensitive sources list. - Shared/imported helpers in the util modules drop their underscore prefix (private-by-convention names should not cross files); the R-call wrappers are named run_r_* to avoid colliding with the Python ports run_ldsc/run_sumstats. - resolve_*_path / ld_dir_with_genomic_sem_naming return/take Path; bridge wrappers accept Path and str() only at the R-call boundary. - Read tabular inputs via scan_dataframe_asset (added read_dataframe_from_task and build_munge_input_df; added read_specs to the hapmap3 / 1000G reference tasks) instead of pl.read_csv with a hardcoded separator. The full-Python task no longer round-trips the munge input through a temp TSV. - Move my validation scripts to experiments/claude/runs/. green passes (lint, link check, import contracts, ty, tests). full-Python task import stays rpy2-free; no _task module imports a peer _task module. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace hardcoded column-name string literals ("P", "effect", "MAF", "INFO",
"N", "SNP", "A1", "A2", "Z") with the MUNGE_* constants from
_genomic_sem_config. Added MUNGE_INFO_COL and MUNGE_Z_COL so the module is fully
literal-free; the internal reference-allele join names are derived from the
constants.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace hardcoded canonical column-name literals (SNP, A1, A2, effect, SE, P, N, MAF, INFO, Z) with the MUNGE_* constants from _genomic_sem_config, and factor the internal alignment column names (A1_ref/A2_ref/A1_file/A2_file, maf_ref/ maf_file, varSNP) and the beta/se output columns into local module constants. Behaviour unchanged; output columns remain beta.<trait>/se.<trait>. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sult Remove the redundant stored fields z_F, p_F, z_R, p_R from GWASBySubtractionResult; they are pure functions of the betas and sandwich SEs. Expose them as functools.cached_property (z = beta/se, p = 2*norm.sf(|z|)), memoized on first access. The class becomes @Frozen(slots=False) so cached_property has an instance __dict__ to write into. Drops the now-redundant local z/p computation and constructor args in fit_gwas_by_subtraction. Call sites are unchanged (attribute access is identical). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
build_munge_input_df renamed GWASLAB_EFFECT_ALLELE_FREQ_COL to MUNGE_MAF_COL verbatim, but the effect-allele frequency need not be the minor-allele frequency, so the MAF column's name did not match its contents. Fold the column to min(p, 1-p) where it is created. There is no functional change: downstream munge/sumstats already fold defensively (idempotently) and only use the minor frequency (MAF filter, varSNP = 2p(1-p)); nothing consumes the directional EAF. This just makes the column name honest at the source. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
get_sample_size returns NaN when PhenotypeInfo carries no sample size (the R-idiomatic "not provided", which the R bridge passes straight to gsem.munge as an NA-valued FloatVector). GenomicSEM guards its N override with !is.na(N), so an NA N means "keep the file's N column". The Python ports mistranslated this as `if n is not None`, and since `nan is not None` is True, a source that supplies N as a per-SNP column (with no sample size in PhenotypeInfo) had its N clobbered with NaN -- corrupting LDSC (munge) and OLS/linprob standardization (sumstats). Guard both overrides with `is not None and not isnan(...)` to mirror R's is.na. Real scalar N still overrides as before, so validated behavior is unchanged. Add rpy2-free regression tests pinning both branches. Also remove a dead sample_sizes list in the full-Python task and document the N flow (build_munge_input_df resolves the N column; the PhenotypeInfo scalar overrides it only when present), addressing the "done twice" readability concern. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The "drop rows with missing P or effect" step filtered with is_not_null() only. R's is.na() is true for NA *and* NaN, but polars is_not_null keeps NaN, so a float-NaN P or effect survived where R would drop it -- and a surviving NaN also poisons the odds-ratio median() (R uses na.rm=TRUE) a few lines down, which can flip the log-transform decision for the whole trait. Add is_not_nan() to both filters to mirror R. No effect on clean inputs, so validated machine-precision agreement is preserved. Add a regression test asserting NaN-P and NaN-effect rows are dropped. Audited the other NA-adjacent sites in the ports; the rest already mirror R: ldsc's prevalence guard checks None-or-NaN, ldsc's munged read uses pandas dropna (== na.omit), and sumstats' final output filter already excludes both null and NaN. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GenomicSEM's munge/sumstats guess that an effect column is an odds ratio when round(median(effect)) == 1 and silently log-transform it. That heuristic exists because R's munge ingests arbitrary user files with no schema. In this pipeline the schema is known: gwaslab keeps odds ratios in a separate OR column and build_munge_input_df wires only BETA into `effect`, so the effect column is always a log-scale beta. A median near 1 therefore indicates a mis-specified input, not an OR to convert. Replace the silent log() in both ports with a loud ValueError that tells the caller to supply odds ratios via the OR column / convert upstream, rather than re-interpreting their data. Convert the former R-comparison odds-ratio test into a non-gated test asserting the raise. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_genomic_sem_ldsc.py referred to columns by string literal. Replace the munged columns (SNP/N/Z/A1) with the existing MUNGE_* constants, and add LDSC_CHR_COL/LDSC_BP_COL/LDSC_L2_COL to _genomic_sem_config for the LD-score reference file columns. Derived names that are local to this module (the weight LD score wLD and the .x/.y cross-trait merge suffixes) become module-level constants built from the canonical names, mirroring the _VARSNP_COL / _A1_REF_COL pattern in the munge/sumstats ports. No behavior change; the .l2.ldscore.gz / .l2.M_5_50 filename patterns are left as-is since they are not column names. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…exclude_ambig The full-Python task reused the R-facing configs (GenomicSEMConfig, GenomicSEMSumstatsConfig, GenomicSEMGWASRunConfig), most of whose fields it silently ignored -- run_config entirely, plus cores/parallel/keep_indel/ estimation/std_lv/etc. Replace those three with a single GWASBySubtractionFullPythonConfig whose every field is actually consumed (ld_file_filename_pattern, munge/sumstats info+maf filters, exclude_ambig). munge and sumstats thresholds stay separate because GenomicSEM defaults them differently (INFO 0.9 vs 0.6). Also rename the ambiguous-SNP flag ambig -> exclude_ambig in the sumstats port (run_sumstats / _filter_reference) and in GenomicSEMSumstatsConfig, so True clearly means "exclude". Same polarity as before and as R's ambig=TRUE; the R bridge still passes it to gsem.sumstats(ambig=...). Update the validate experiment and add config-defaults tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… bools SumstatsTrait mirrored GenomicSEM's R sumstats() signature with three mutually-exclusive booleans (ols/se_logit/linprob), duplicating the GWASMethod enum already carried by GenomicSEMGWASSumstatsSource. Replace them with a single `gwas_method: GWASMethod`, removing the invalid-state surface (multiple-True / all-False) and the logistic<->se_logit naming split. _standardize_trait now branches on `gwas_method ==` the OLS/LINEAR_PROB/ LOGISTIC constants. GenomicSEM's no-flag "default" OR-with-OR-SE transform is dropped (this pipeline never feeds raw odds ratios -- the odds-ratio guard already rejects them); an unrecognised method raises instead. sumstats_trait collapses to a straight pass-through of the source's method. Update the sumstats/full-python tests and the validate experiment; the R bridge and R-comparison calls still pass gsem.sumstats its native se_logit/OLS/linprob BoolVectors. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
P is already filtered to finite, non-null values upstream, so ~isfinite(p) never selected a row; and even a stray NaN would yield NaN z in either branch (norm.isf(nan) is also nan), so the term changed nothing. Reduce the tiny-P mask to the load-bearing `p < _TINY_P` underflow case and note in a comment why no NaN/inf guard is needed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
make_result_df referred to its output columns by string literal. Reuse the
existing MUNGE_*/LDSC_* constants for SNP/CHR/BP/MAF/A1/A2, and add
SUBTRACTION_*_COL constants to _genomic_sem_config for the subtraction-
specific estimate columns (lhs/op/rhs/est/se_c/Z_Estimate/Pval_Estimate/
N_eff/fail). The model-term values placed in lhs/op/rhs ("~", "SNP") stay as
local constants since they are data values, not column references. No schema
change -- the emitted column names are identical.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Both subtraction tasks read frames.*_df["fail"] by literal when logging the fail count. Switch to the SUBTRACTION_FAIL_COL constant now that the result schema is defined with constants. The common-factor python task is left as-is -- it emits a different (richer commonfactorGWAS) schema, so the SUBTRACTION_* names would not fit there. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Subtract Johnston et al. multisite chronic pain from DecodeME GWAS-1 using GenomicSEMGWASBySubtractionFullPythonTask. DecodeME is the composite trait (T1, logistic case/control), Johnston pain the reference trait (T2, OLS on the 0-7 BOLT-LMM count), so the remainder factor R is the ME/CFS-specific signal after the pain-shared common factor F is removed. Reuses the same HapMap3 / 1000G / EUR LD references and DecodeME rsid+P pipe as the existing common-factor asset, with the "LDscore." LD-prefix stripping. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.