FT-Transformer EHR Retrieval: Controlled Experiment Summary

1. What method did I test?

This project tests an FT-Transformer-style tabular EHR encoder against a baseline MLP EHR encoder in a two-tower multimodal retrieval setup.

The goal is to study whether a transformer-based tabular encoder improves retrieval alignment compared with a simpler MLP encoder.

Both models use the same retrieval framework:

CXR branch: synthetic image encoder input
EHR branch: MLP or FT-Transformer encoder
Loss: symmetric InfoNCE contrastive loss
Task: retrieve the matching EHR sample for a given CXR sample

2. What dataset setup did I use?

This repository uses controlled synthetic multimodal data, not real clinical data.

Three synthetic dataset modes were tested:

Setup	Description
Linear	Image patterns are generated from mostly linear combinations of tabular features
Interaction	Image patterns depend more strongly on feature interactions
Noisy	A fraction of image-EHR pairings are intentionally corrupted

The purpose is to evaluate method behavior under controlled conditions without using restricted patient data.

3. What metric did I measure?

The experiments report:

Recall@1
Recall@5
Recall@10
Recall@50
Lift over random baseline
Positive-pair cosine similarity
Training loss

Lift over random is reported because retrieval difficulty changes with candidate pool size.

4. What changed across experiments?

Variable	Values tested
Encoder type	MLP, FT-Transformer
Data pattern	Linear, interaction, noisy
Pairing quality	Clean pairs, 25% noisy/corrupted pairs
Sample size	1000, 2000
Training duration	10 epochs

5. Results

Pilot sample-size analysis: 500 samples

Before running the larger 1000- and 2000-sample benchmarks, I first ran a smaller 500-sample pilot study. The goal was to verify that the training pipeline, retrieval metrics, and synthetic data modes behaved as expected.

These runs are not treated as the main benchmark. They are included to show the progression from a small controlled pilot to larger sample-size experiments.

Setup	Samples	Encoder	R@1	R@5	R@10	R@50	Lift@50	Pos Sim	Train Loss
Linear	500	MLP	0.04	0.15	0.34	0.94	1.88x	0.3630	3.0041
Linear	500	FT-Transformer	0.04	0.17	0.33	0.97	1.94x	0.4992	2.8632
Interaction	500	MLP	0.03	0.13	0.26	0.83	1.66x	0.2264	3.1512
Interaction	500	FT-Transformer	0.03	0.09	0.24	0.90	1.80x	0.2902	2.9687
Noisy	500	MLP	0.02	0.08	0.22	0.74	1.48x	0.1293	3.2743
Noisy	500	FT-Transformer	0.03	0.11	0.20	0.69	1.38x	0.2104	3.2186

The 500-sample pilot showed that FT-Transformer often increased positive-pair similarity, but this did not always translate into better top-k retrieval. This motivated the larger 1000- and 2000-sample runs, where sample-size effects became clearer.

Final 10-Epoch Results

Setup	Samples	Encoder	R@1	R@5	R@10	R@50	Lift@50	Pos Sim	Train Loss
Linear	1000	MLP	0.025	0.125	0.270	0.925	3.70x	0.7758	2.6304
Linear	1000	FT-Transformer	0.030	0.135	0.305	0.900	3.60x	0.8360	2.5074
Linear	2000	MLP	0.025	0.100	0.195	0.7475	5.98x	0.8304	2.4916
Linear	2000	FT-Transformer	0.015	0.1025	0.1825	0.6800	5.44x	0.8958	2.3236
Interaction	1000	MLP	0.025	0.105	0.210	0.685	2.74x	0.4580	2.9626
Interaction	1000	FT-Transformer	0.040	0.130	0.235	0.795	3.18x	0.5265	2.7065
Interaction	2000	MLP	0.025	0.155	0.250	0.8150	6.52x	0.7708	2.3643
Interaction	2000	FT-Transformer	0.010	0.0925	0.2175	0.6725	5.38x	0.7972	2.1039
Noisy	1000	MLP	0.030	0.090	0.200	0.685	2.74x	0.4477	3.0976
Noisy	1000	FT-Transformer	0.020	0.110	0.190	0.660	2.64x	0.2625	3.0341
Noisy	2000	MLP	0.010	0.055	0.1075	0.5125	4.10x	0.4121	3.0226
Noisy	2000	FT-Transformer	0.0175	0.0925	0.1575	0.5775	4.62x	0.3136	2.9554

6. What did I learn?

Linear setup

In the linear setup, the MLP was already highly competitive. FT-Transformer achieved higher positive-pair similarity and lower training loss, but it did not consistently improve retrieval ranking.

This suggests that for simple linear tabular-image relationships, a well-tuned MLP can be sufficient.

Interaction setup

At 1000 samples, FT-Transformer improved all main retrieval metrics compared with MLP:

R@1 improved from 0.025 to 0.040
R@5 improved from 0.105 to 0.130
R@10 improved from 0.210 to 0.235
R@50 improved from 0.685 to 0.795
Lift@50 improved from 2.74x to 3.18x

However, at 2000 samples, MLP achieved stronger retrieval ranking even though FT-Transformer still had higher positive similarity and lower training loss.

This shows that higher embedding similarity does not always translate into better top-k retrieval.

Noisy setup

With corrupted pairings, both models degraded. FT-Transformer improved several top-k metrics at 2000 samples, but MLP remained competitive at 1000 samples.

This supports an important conclusion: better encoder architecture cannot fully compensate for noisy or weak pairing quality.

Main conclusion

FT-Transformer did not universally outperform MLP. Its advantage depended on the data pattern, sample size, and pairing quality.

The strongest lesson from this experiment is that representation learning performance is controlled by both architecture and data construction. A more expressive encoder can improve alignment, but retrieval quality still depends heavily on clean pairing and evaluation design.

7. Limitations

The dataset is synthetic and does not represent clinical performance.
The generated images are controlled synthetic visual patterns rather than clinical medical images, so the results should be interpreted as method-behavior analysis rather than clinical performance.
Experiments were limited to 10 epochs.
The goal is method behavior analysis, not clinical diagnosis or deployment.
More seeds should be tested before making strong claims about model superiority.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FT-Transformer EHR Retrieval: Controlled Experiment Summary

1. What method did I test?

2. What dataset setup did I use?

3. What metric did I measure?

4. What changed across experiments?

5. Results

Pilot sample-size analysis: 500 samples

Final 10-Epoch Results

6. What did I learn?

Linear setup

Interaction setup

Noisy setup

Main conclusion

7. Limitations

FilesExpand file tree

results_summary.md

Latest commit

History

results_summary.md

File metadata and controls

FT-Transformer EHR Retrieval: Controlled Experiment Summary

1. What method did I test?

2. What dataset setup did I use?

3. What metric did I measure?

4. What changed across experiments?

5. Results

Pilot sample-size analysis: 500 samples

Final 10-Epoch Results

6. What did I learn?

Linear setup

Interaction setup

Noisy setup

Main conclusion

7. Limitations