Commit 3773c98

committed
notes on what linkml.py and SchemaView() do vs what schema_induction.js does
1 parent d787e25 commit 3773c98

1 file changed

Lines changed: 365 additions & 0 deletions

docs/README_schema_yaml_loading.md

@@ -0,0 +1,365 @@
1+
# Loading schema.yaml Directly in the Browser
2+
3+
This document analyses the feasibility of replacing DataHarmonizer's pre-built `schema.json` with a runtime load-and-process pipeline that reads `schema.yaml` directly in the browser, and proposes the implementation steps required to do so.
4+
5+
---
6+
7+
## Background: The Current Pipeline
8+
9+
DataHarmonizer currently requires a build step before a schema can be used in the browser.
10+
11+
```
12+
schema.yaml ──► script/linkml.py ──► schema.json ──► webpack bundle / HTTP serve
13+
```
14+
15+
1. `script/linkml.py` opens `schema.yaml`, passes it through the Python `linkml-runtime` library, and writes `schema.json`.
16+
2. Webpack's dynamic-import picks up `schema.json` from `web/templates/<folder>/schema.json` (localhost path) or the file is fetched via HTTP from the `dist/` directory (hosted path).
17+
3. `lib/utils/templates.js → TemplateProxy.create()` calls `fetchSchema()`, which returns the parsed JSON object.
18+
4. The schema object is stored in `template.default.schema` and accessed throughout `AppContext.js`, `DataHarmonizer.js`, and `Validator.js`.
19+
20+
The `schema.yaml` files are **already copied to dist** by `webpack.schemas.js` via CopyPlugin, so they are already available at `/templates/<folder>/schema.yaml` at runtime. The missing piece is processing.
21+
22+
---
23+
24+
## What `linkml.py` Does
25+
26+
`script/linkml.py` performs three key transformations, in order:
27+
28+
### 1. YAML parsing
29+
30+
```python
31+
SCHEMA = yaml.safe_load(schema_handle)
32+
schema_view = SchemaView(yaml.dump(SCHEMA))
33+
```
34+
35+
Parses the YAML text and constructs a `SchemaView` object (the LinkML runtime's schema-aware wrapper).
36+
37+
### 2. Import resolution
38+
39+
```python
40+
schema_view.merge_imports()
41+
```
42+
43+
Follows every entry in the schema's `imports:` list, fetches and merges those schemas recursively. Common imports in DataHarmonizer templates:
44+
45+
- `linkml:types`: defines the built-in scalar types (`string`, `integer`, `date`, `boolean`, etc.)
46+
- Shared parent schemas (e.g. `grdi_core` for the GRDI template family)
47+
48+
After `merge_imports()`, `schema_view.schema` contains all slot, type, class, and prefix definitions from every imported schema as if they were written inline.
49+
50+
### 3. Inheritance flattening (induced classes)
51+
52+
```python
for name, class_obj in schema_view.all_classes().items():
    if schema_view.class_slots(name):
        new_obj = schema_view.induced_class(name)
        schema_view.add_class(new_obj)
```
58+
59+
`induced_class(name)` is the critical transformation. For each class it:
60+
61+
- Walks the full `is_a` ancestry chain (parent, grandparent, …)
62+
- Collects every slot defined at each level in the hierarchy via the class's `slots:` list
63+
- Applies each level's `slot_usage:` overrides on top of the global `slots:` definition
64+
- Emits the result as a flat `attributes: { slot_name: { merged_definition } }` dictionary
65+
66+
**Before induction (schema.yaml structure)**:
67+
```yaml
classes:
  MpoxInternational:
    is_a: dh_interface
    slots:
      - specimen_collector_sample_id
      - sample_collected_by
    slot_usage:
      specimen_collector_sample_id:
        required: true
        rank: 1
```
79+
80+
**After induction (schema.json structure)**:
81+
```json
"classes": {
  "MpoxInternational": {
    "attributes": {
      "specimen_collector_sample_id": {
        "name": "specimen_collector_sample_id",
        "title": "Specimen Collector Sample ID",
        "range": "WhitespaceMinimizedString",
        "required": true,
        "rank": 1,
        "description": "...",
        ...
      },
      "sample_collected_by": { ... }
    }
  }
}
```
99+
100+
The `attributes` dict is what every consumer inside DataHarmonizer iterates:
101+
102+
| Consumer | Access pattern |
|----------|---------------|
| `AppContext.js` | `schema.classes[template_name].attributes` |
| `AppContext.js` | `schema.classes['Container'].attributes` (to discover all template classes) |
| `Validator.js` | `this.#targetClass.attributes` (iterated to build induced slot map) |
| `DataHarmonizer.js` | `this.schema.classes[this.template_name]` (columns) |
108+
109+
Additionally:
110+
111+
- `schema.slots` — global slot registry, used as base layer in `Validator.js`'s manual re-merge
112+
- `schema.enums[name].permissible_values` — picklist validation and dropdowns
113+
- `schema.types[name].uri` — XSD type URIs for `Datatypes` parser
114+
- `schema.prefixes[name].prefix_reference` — CURIE expansion in `DataHarmonizer.js`
115+
- `schema.extensions.locales.value` — DH-specific locale/translation extension data
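
To illustrate how the `prefixes` registry is consumed, CURIE expansion can be sketched roughly as follows. `expandCurie` is a hypothetical helper written for this note, not the function used in `DataHarmonizer.js`; only the `schema.prefixes[name].prefix_reference` access pattern comes from the list above, and the sample prefix entry is illustrative data.

```javascript
// Hypothetical helper (not DataHarmonizer's own code): expands "PREFIX:local"
// into a full URI using schema.prefixes[PREFIX].prefix_reference.
function expandCurie(schema, curie) {
  const idx = curie.indexOf(':');
  if (idx < 0) return curie; // no prefix separator, so not a CURIE
  const prefix = curie.slice(0, idx);
  const reference = schema.prefixes?.[prefix]?.prefix_reference;
  // Unknown prefixes are returned unchanged rather than guessed at
  return reference ? reference + curie.slice(idx + 1) : curie;
}

// Illustrative schema fragment
const schema = {
  prefixes: {
    GENEPIO: {
      prefix_prefix: 'GENEPIO',
      prefix_reference: 'http://purl.obolibrary.org/obo/GENEPIO_',
    },
  },
};

console.log(expandCurie(schema, 'GENEPIO:0001123')); // http://purl.obolibrary.org/obo/GENEPIO_0001123
console.log(expandCurie(schema, 'UNKNOWN:42')); // UNKNOWN:42
```

Because the lookup only reads `schema.prefixes`, it behaves the same whether the schema object came from `schema.json` or from parsed YAML, provided imported prefixes have been merged in first.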
116+
117+
---
118+
119+
## What Already Exists on the JavaScript Side
120+
121+
### `yaml` npm package
122+
123+
The project already depends on `yaml` v2.8.0 (listed in `package.json`). `DataHarmonizer.js` already imports it:
124+
125+
```javascript
126+
import YAML from 'yaml';
127+
```
128+
129+
Parsing a schema.yaml text string in the browser is therefore a one-liner:
130+
131+
```javascript
132+
const schema = YAML.parse(text);
133+
```
134+
135+
### `buildTemplateFromUploadedSchema()` in `templates.js`
136+
137+
This function, used when the Schema Editor uploads a file, already accepts a raw schema object. The upload path needs no induction because the Schema Editor works with the raw YAML structure. For normal data templates, however, the calling code paths depend on `attributes` being fully induced.
138+
139+
### `Validator.js` partial re-merge
140+
141+
`Validator.useTargetClass()` already performs a three-way merge at runtime:
142+
143+
```javascript
this.#targetClassInducedSlots[slotName] = Object.assign(
  {},
  this.#schema.slots?.[slotName],           // global slot definition
  this.#targetClass.slot_usage?.[slotName], // class-level override
  this.#targetClass.attributes[slotName]    // induced attributes (from JSON)
);
```
151+
152+
This code was added because `gen-linkml --materialize-attributes` did not merge correctly from the LinkML side. It shows the pattern already exists in JS — the challenge is doing the equivalent for `AppContext.js`'s column-building path, which relies on `attributes` being pre-populated.
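
The precedence in that merge follows from `Object.assign` semantics: later sources overwrite earlier ones key by key. A minimal sketch with made-up slot data (the field values are illustrative and not taken from a real schema):

```javascript
// Illustrative inputs for the three layers of the merge
const globalSlot = { name: 'sample_collected_by', range: 'string', required: false };
const slotUsage = { required: true };
const inducedAttribute = {
  name: 'sample_collected_by',
  range: 'WhitespaceMinimizedString',
  rank: 2,
};

// Later arguments win: slot_usage overrides the global slot,
// and the induced attribute overrides both.
const merged = Object.assign({}, globalSlot, slotUsage, inducedAttribute);

console.log(merged);
// { name: 'sample_collected_by', range: 'WhitespaceMinimizedString', required: true, rank: 2 }
```

The merge is shallow, so a nested value such as an `annotations` object is replaced wholesale by the last layer that defines it rather than combined.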
153+
154+
---
155+
156+
## Required Transformations to Implement in JavaScript
157+
158+
### Step 1 — Fetch and parse YAML
159+
160+
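```javascript
import YAML from 'yaml';

async function loadYamlSchema(folder) {
  const response = await fetch(`/templates/${folder}/schema.yaml`);
  // fetch() only rejects on network errors, so check the HTTP status explicitly
  if (!response.ok) throw new Error(`schema.yaml fetch failed: ${response.status}`);
  return YAML.parse(await response.text());
}
```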
166+
167+
The schema YAML text is already served from dist (via webpack CopyPlugin). No changes to webpack are needed for fetching.
168+
169+
### Step 2 — Resolve imports
170+
171+
`merge_imports()` is the most complex transformation to reproduce. Most DataHarmonizer schemas import from two sources:
172+
173+
**a. `linkml:types`** defines `string`, `integer`, `float`, `boolean`, `date`, `datetime`, `time`, `uri`, `uriorcurie`, and related scalars. Because these are needed frequently and rarely change, they can be **baked into a JS constant** rather than fetched:
174+
175+
```javascript
// lib/utils/linkml_types.js
export const LINKML_BUILTIN_TYPES = {
  string:   { uri: 'xsd:string',   base: 'str' },
  integer:  { uri: 'xsd:integer',  base: 'int' },
  float:    { uri: 'xsd:float',    base: 'float' },
  boolean:  { uri: 'xsd:boolean',  base: 'Bool' },
  date:     { uri: 'xsd:date',     base: 'XSDDate' },
  datetime: { uri: 'xsd:dateTime', base: 'XSDDateTime' },
  time:     { uri: 'xsd:time',     base: 'XSDTime' },
  uri:      { uri: 'xsd:anyURI',   base: 'URI' },
  // ... (full list from linkml-model/types.yaml)
};
```
189+
190+
**b. Schema-relative imports** — e.g. `imports: [../../grdi_core/schema]`. These need to be fetched relative to the current schema's URL, parsed, and merged. A simple recursive fetch resolves these:
191+
192+
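```javascript
// Assumes YAML, LINKML_BUILTIN_TYPES, and mergeSchemas() are in scope.
async function resolveImports(schema, baseUrl, visited = new Set()) {
  for (const imp of schema.imports ?? []) {
    if (imp === 'linkml:types') {
      // Built-ins go first so the schema's own type definitions take precedence
      schema.types = { ...LINKML_BUILTIN_TYPES, ...(schema.types ?? {}) };
      continue;
    }
    const importUrl = new URL(imp.replace(/\.ya?ml$/, '') + '.yaml', baseUrl).href;
    if (visited.has(importUrl)) continue; // skip repeated or circular imports
    visited.add(importUrl);
    const response = await fetch(importUrl);
    if (!response.ok) throw new Error(`Import fetch failed (${response.status}): ${importUrl}`);
    const importedSchema = YAML.parse(await response.text());
    await resolveImports(importedSchema, importUrl, visited); // recurse
    mergeSchemas(schema, importedSchema);
  }
}
```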
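
The URL resolution step can be checked in isolation. The sketch below pulls it into a hypothetical `resolveImportUrl` helper; the base URL and its two-level folder layout are assumptions chosen so that the `../../grdi_core/schema` example resolves to a sibling template folder.

```javascript
// Hypothetical helper: turns an entry from `imports:` into an absolute URL.
function resolveImportUrl(imp, baseUrl) {
  // LinkML imports usually omit the extension, so add it when missing
  const withExtension = /\.ya?ml$/.test(imp) ? imp : `${imp}.yaml`;
  // The WHATWG URL constructor resolves relative paths against the base URL
  return new URL(withExtension, baseUrl).href;
}

// Assumed base URL, for illustration only
const base = 'https://example.org/templates/grdi/v1/schema.yaml';

console.log(resolveImportUrl('../../grdi_core/schema', base));
// https://example.org/templates/grdi_core/schema.yaml
```

Prefixed imports other than `linkml:types` (for example another `prefix:name` entry) would be read by `URL` as an absolute URL with an unknown scheme, so they would need their own handling before reaching this step.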
206+
207+
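`mergeSchemas()` needs to combine `classes`, `slots`, `enums`, `types`, `prefixes`, and `subsets` from the imported schema into the main one without overwriting, since the main schema's own definitions take precedence. `classes` must be included because an `is_a` parent may be defined in an imported schema.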
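
A minimal sketch of what `mergeSchemas()` could look like is shown below. The section list is an assumption: it includes `classes` so that `is_a` parents defined in an imported schema remain reachable, and it performs a name-level merge only (a definition in the main schema replaces an imported one with the same name rather than being combined with it).

```javascript
// Hypothetical implementation sketch; the list of merged sections is an assumption.
const MERGED_SECTIONS = ['prefixes', 'types', 'enums', 'slots', 'classes', 'subsets'];

function mergeSchemas(target, imported) {
  for (const section of MERGED_SECTIONS) {
    // Imported entries are spread first so the target's own entries win on name clashes
    const merged = { ...(imported[section] ?? {}), ...(target[section] ?? {}) };
    if (Object.keys(merged).length) target[section] = merged;
  }
  return target;
}

// Illustrative data
const main = { slots: { sample_id: { required: true } } };
const core = {
  slots: { sample_id: { required: false }, host: {} },
  enums: { HostMenu: {} },
};

mergeSchemas(main, core);
console.log(main.slots.sample_id.required); // true (main schema wins)
console.log('host' in main.slots); // true (added from the import)
console.log('HostMenu' in main.enums); // true
```

Whether `SchemaView.merge_imports()` resolves name clashes the same way is not confirmed here, so clashes are worth logging during validation.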
208+
209+
### Step 3 — Induce classes (flatten inheritance)
210+
211+
This is the core transformation. For each class with slots, produce an `attributes` dictionary equivalent to `SchemaView.induced_class()`:
212+
213+
```javascript
function induceClass(schema, className) {
  // Collect the is_a chain, root first: [GreatGrandparent, ..., Parent, ClassName]
  const chain = [];
  const seen = new Set(); // stops the walk on a circular is_a reference
  let current = className;
  while (current && !seen.has(current)) {
    seen.add(current);
    chain.unshift(current);
    current = schema.classes?.[current]?.is_a ?? null;
  }

  const attributes = {};

  for (const ancestorName of chain) {
    const ancestor = schema.classes?.[ancestorName];
    if (!ancestor) continue;

    // A slot listed in 'slots:' starts from a copy of its global definition.
    // Slots already accumulated from an ancestor keep their earlier overrides.
    for (const slotName of ancestor.slots ?? []) {
      if (!attributes[slotName]) {
        attributes[slotName] = { ...(schema.slots?.[slotName] ?? {}) };
      }
    }

    // Class-level attributes (defined directly, not via the slots list)
    for (const [name, def] of Object.entries(ancestor.attributes ?? {})) {
      attributes[name] = { ...(attributes[name] ?? {}), ...(def ?? {}) };
    }

    // slot_usage overrides apply to own and inherited slots alike
    for (const [name, usage] of Object.entries(ancestor.slot_usage ?? {})) {
      if (attributes[name]) Object.assign(attributes[name], usage);
    }
  }

  for (const [name, def] of Object.entries(attributes)) def.name = name;
  return attributes;
}
```
252+
253+
Then for each class in the schema:
254+
255+
```javascript
function induceAllClasses(schema) {
  // Induce every class before assigning, so no class reads a half-updated parent
  const induced = {};
  for (const className of Object.keys(schema.classes ?? {})) {
    induced[className] = induceClass(schema, className);
  }
  for (const [className, attributes] of Object.entries(induced)) {
    // Like linkml.py's class_slots() check, this also counts inherited slots
    if (Object.keys(attributes).length) {
      schema.classes[className].attributes = attributes;
    }
  }
}
```
265+
266+
### Step 4 — Handle `in_language` coercion
267+
268+
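`linkml.py` has an explicit workaround that copies the original `in_language` value from the parsed YAML back onto the `SchemaView` schema, because the Python library does not keep it as an array: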
269+
270+
```python
if 'in_language' in SCHEMA:
    schema_view.schema['in_language'] = SCHEMA['in_language']
```
274+
275+
In JS, YAML.parse preserves arrays faithfully, so no workaround is needed — this issue is Python-library-specific.
276+
277+
---
278+
279+
## Where to Hook Into the Existing Code
280+
281+
The single integration point is `fetchSchema()` in `lib/utils/templates.js`. Currently:
282+
283+
```javascript
async function fetchSchema(path) {
  if (window.location.href.startsWith('http://localhost:') || ...) {
    const schema_path = path.replace(/\/templates\/(.+)\/schema.json/, '$1');
    return await getSchema(schema_path); // webpack dynamic import of .json
  } else {
    return await fetchFileAsync(path); // HTTP fetch of .json
  }
}
```
293+
294+
**Proposed change**: attempt to load `schema.yaml`, process it, and fall back to `schema.json` if YAML is not available:
295+
296+
```javascript
async function fetchSchema(path) {
  const yamlPath = path.replace(/schema\.json$/, 'schema.yaml');
  try {
    const text = await fetch(yamlPath).then(r => {
      if (!r.ok) throw new Error(`HTTP ${r.status}`);
      return r.text();
    });
    const schema = YAML.parse(text);
    const baseUrl = new URL(yamlPath, window.location.href).href;
    await resolveImports(schema, baseUrl);
    induceAllClasses(schema);
    return schema;
  } catch (err) {
    // Log the reason so a failed import or an induction bug is not silently hidden
    console.warn(`Falling back to schema.json for ${path}:`, err);
    if (window.location.href.startsWith('http://localhost:') || ...) {
      const schema_path = path.replace(/\/templates\/(.+)\/schema.json/, '$1');
      return await getSchema(schema_path);
    } else {
      return await fetchFileAsync(path);
    }
  }
}
```
320+
321+
No other existing file needs to change: downstream code consumes `template.default.schema` identically regardless of source.
322+
323+
---
324+
325+
## Known Gaps and Edge Cases
326+
327+
| Area | Status | Notes |
|------|--------|-------|
| YAML parsing | Ready | `yaml` npm package already imported |
| `linkml:types` import | Requires a baked-in constant | The canonical types list is stable; rarely changes |
| Schema-relative imports | Requires async fetch loop | URL resolution works with standard `URL()` |
| `is_a` chain flattening | Requires JS implementation | Single-level and two-level `is_a` cover 99% of DH schemas |
| `slot_usage` override merge | Requires JS implementation | Same merge pattern `Validator.js` already uses |
| `unique_keys` on classes | No change needed | YAML structure matches JSON; no transformation required |
| `foreign_key` annotations | No change needed | Passed through as-is in both formats |
| `extensions.locales.value` | No change needed | YAML structure matches JSON |
| `in_language` array | No change needed | `YAML.parse` preserves arrays; Python issue does not apply |
| Circular `is_a` references | Edge case | Guard against infinite loops in chain-walking |
| Remote imports (URLs) | Requires CORS-compliant server | Schemas importing from external HTTPS URLs need CORS headers |
340+
341+
---
342+
343+
## Benefits of the Direct YAML Approach
344+
345+
1. **Eliminates the build step** for schema authors: edit `schema.yaml`, reload the browser — no `linkml.py` run needed.
346+
2. **Removes Python `linkml-runtime` as a build dependency** for end-users deploying their own schemas.
347+
3. **Simpler schema distribution**: ship only `schema.yaml`; the `.json` is no longer required in the dist package.
348+
4. **Live schema editing**: the Schema Editor could save a `schema.yaml` and reload it immediately without a Python step (useful for local development mode).
349+
5. **Consistency**: the same schema source file is used for both browser display and any downstream LinkML tooling.
350+
351+
## Costs and Risks
352+
353+
1. **Increased initial page load time**: the YAML fetch, import resolution, and induction run on every page load, versus a single pre-built JSON import.
354+
2. **Import fetch failures**: if an imported schema URL is unreachable (e.g. offline use, CORS), induction fails. The fallback to `schema.json` mitigates this.
355+
3. **JS induction vs Python induction gap**: the Python `SchemaView.induced_class()` handles edge cases (multiple inheritance via `mixins:`, `apply_to:`, abstract classes). A JS re-implementation would need to be validated against the full DH schema corpus.
356+
4. **`linkml:types` baked-in constant** must be kept in sync with the upstream `linkml-model` project if types are added or changed.
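
As a starting point for closing the `mixins:` gap in item 3, the ancestor walk could be widened as sketched below. `classAncestors` is a hypothetical helper, and the ordering (mixins before the `is_a` parent at each level) is an assumption that would have to be checked against `SchemaView` before relying on it. The `seen` set also covers circular references.

```javascript
// Hypothetical helper: lists a class's ancestors, root first, following
// both is_a and mixins. The relative order of mixins and is_a parents is
// an assumption, not confirmed against the Python SchemaView behaviour.
function classAncestors(schema, className, seen = new Set()) {
  if (seen.has(className)) return []; // circular reference or shared ancestor
  seen.add(className);
  const classes = schema.classes ?? {};
  if (!(className in classes)) return []; // unknown class name
  const cls = classes[className] ?? {}; // an empty YAML class body parses as null
  const parents = [...(cls.mixins ?? []), ...(cls.is_a ? [cls.is_a] : [])];
  const result = [];
  for (const parent of parents) {
    result.push(...classAncestors(schema, parent, seen));
  }
  result.push(className);
  return result;
}

// Illustrative schema fragment
const schema = {
  classes: {
    Base: {},
    Timestamped: {},
    Sample: { is_a: 'Base', mixins: ['Timestamped'] },
    LoopA: { is_a: 'LoopB' },
    LoopB: { is_a: 'LoopA' },
  },
};

console.log(classAncestors(schema, 'Sample')); // [ 'Timestamped', 'Base', 'Sample' ]
console.log(classAncestors(schema, 'LoopA')); // [ 'LoopB', 'LoopA' ] (terminates despite the cycle)
```

An induction routine would iterate this list in order, so classes later in the list override earlier ones.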
357+
358+
---
359+
360+
## Recommended Implementation Order
361+
362+
1. **Add `lib/utils/schema_induction.js`** containing `resolveImports()`, `induceAllClasses()`, `induceClass()`, and the `LINKML_BUILTIN_TYPES` constant.
363+
2. **Modify `fetchSchema()` in `lib/utils/templates.js`** to try YAML first, fall back to JSON.
364+
3. **Validate** by running the existing test suite with the YAML load path active on the `grdi`, `mpox`, and `grdi_1m` schemas (these cover single-class, multi-class, and 1-to-many hierarchies).
365+
4. **Optionally deprecate** the `script/linkml.py` step from the schema publishing workflow once the YAML path has been validated against all production schemas.
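
For the validation step, one option is to compare the JS-induced `attributes` of each class with the `attributes` in the existing `schema.json`. The helper below is a hypothetical sketch of such a comparison; it is not part of the existing test suite, and the sample data is invented for illustration.

```javascript
// Hypothetical validation helper: reports where JS-induced attributes
// differ from the attributes found in the pre-built schema.json.
function diffInducedAttributes(expected, actual) {
  const problems = [];
  const slotNames = new Set([...Object.keys(expected), ...Object.keys(actual)]);
  for (const slot of slotNames) {
    if (!(slot in actual)) {
      problems.push(`${slot}: missing from JS induction`);
      continue;
    }
    if (!(slot in expected)) {
      problems.push(`${slot}: not present in schema.json`);
      continue;
    }
    const keys = new Set([...Object.keys(expected[slot]), ...Object.keys(actual[slot])]);
    for (const key of keys) {
      // JSON.stringify gives a simple deep comparison; it is sensitive to
      // key order inside nested objects, which is acceptable for a first pass
      const want = JSON.stringify(expected[slot][key]);
      const got = JSON.stringify(actual[slot][key]);
      if (want !== got) problems.push(`${slot}.${key}: expected ${want}, got ${got}`);
    }
  }
  return problems;
}

// Illustrative data
const fromJson = { sample_id: { name: 'sample_id', required: true, rank: 1 } };
const fromYaml = { sample_id: { name: 'sample_id', required: true }, host: { name: 'host' } };

console.log(diffInducedAttributes(fromJson, fromYaml));
// [ 'sample_id.rank: expected 1, got undefined', 'host: not present in schema.json' ]
```

An empty result for every class in a schema would indicate that the two pipelines agree on that schema's `attributes`.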
