This project studies whether an FT-Transformer-style tabular encoder improves multimodal retrieval alignment compared with a simpler MLP encoder. The benchmark uses controlled synthetic EHR-like tabular features and synthetic CXR-like visual patterns instead of restricted clinical data. The goal is not to make clinical claims, but to study method behavior under known data-generating conditions.
The project compares MLP and FT-Transformer EHR encoders inside the same two-tower contrastive retrieval pipeline. Experiments vary dataset structure, sample size, and pair quality using linear, interaction-based, and noisy-pair settings. The results show that FT-Transformer does not universally outperform the MLP. Its usefulness depends on data structure, sample size, and pair quality. In some interaction settings, FT-Transformer improves retrieval, but in noisy-pair settings both models degrade, showing that better architecture cannot fully compensate for weak supervision.
Electronic health record data is often represented as structured tabular data. Unlike images or free text, tabular clinical data contains heterogeneous features such as continuous measurements, binary indicators, categorical variables, and derived medical attributes.
In multimodal medical AI, tabular EHR features may need to be aligned with another modality, such as medical images or reports. A common question is whether more expressive tabular encoders, such as transformer-style models, improve representation learning compared with simpler MLP encoders.
This project studies that question in a controlled retrieval setting.
The public version uses synthetic data so that the data-generating process, feature interactions, pair quality, and evaluation setup are known and reproducible.
The main research question is:
Can an FT-Transformer-style tabular encoder improve retrieval alignment compared with a simpler MLP encoder when the data contains feature interactions, noisy pairings, or larger sample sizes?
The project studies this question by varying:
- encoder architecture
- dataset mode
- sample size
- pair quality
- retrieval behavior
- positive-pair similarity
Structured EHR data is often tabular: each patient or admission can be represented by a row of clinical variables. These variables may include demographics, diagnoses, measurements, medications, laboratory-style features, or synthetic proxies for those concepts.
A simple MLP can model nonlinear relationships after preprocessing, but it treats the feature vector as a flat input.
Transformer-style tabular models instead tokenize features and use attention across feature tokens. This can help when interactions between features matter.
The FT-Transformer idea is based on representing tabular features as tokens and applying Transformer layers over those feature tokens. In this project, the FT-Transformer-style encoder is used as the EHR-side encoder inside a two-tower retrieval model.
The goal is not to reproduce FT-Transformer exactly for supervised tabular classification. Instead, the goal is to test a transformer-style tabular encoder inside a contrastive retrieval pipeline.
The model uses a two-tower setup:
- one encoder processes synthetic CXR-like visual inputs
- one encoder processes synthetic EHR-like tabular inputs
The two encoders produce embeddings in a shared space. A symmetric InfoNCE-style contrastive loss pulls matched image-EHR pairs together and pushes non-matching pairs apart.
The benchmark compares two EHR encoder settings:
| Method | Description |
|---|---|
| MLP EHR Encoder | A baseline feed-forward encoder for transformed tabular features |
| FT-Transformer EHR Encoder | A feature-tokenizer plus Transformer encoder that applies self-attention over tabular feature tokens |
Both methods use:
- the same synthetic data generator
- the same image-side encoder
- the same contrastive objective
- the same retrieval metrics
- the same train/evaluation protocol
This isolates the main comparison: whether the EHR encoder architecture changes retrieval behavior.
The public repository uses controlled synthetic multimodal data.
Each sample contains:
| Component | Description |
|---|---|
| Synthetic EHR profile | Tabular feature vector representing structured clinical-style information |
| Synthetic CXR-like input | Controlled visual pattern associated with the EHR profile |
| Pair ID | Known image-EHR match used for retrieval evaluation |
| Dataset mode | Controls whether the signal is linear, interaction-based, or noisy |
Three dataset modes are included:
| Setup | Purpose |
|---|---|
| Linear | Tests whether a simple MLP is already sufficient when the signal is mostly linear |
| Interaction | Tests whether FT-Transformer helps when feature interactions matter |
| Noisy | Tests robustness when a fraction of image-EHR pairings are intentionally corrupted |
The visual inputs are controlled synthetic patterns, not real clinical images. This makes the benchmark privacy-safe and reproducible.
The final benchmark varies:
| Factor | Values |
|---|---|
| EHR encoder | MLP, FT-Transformer |
| Dataset mode | Linear, interaction, noisy |
| Sample size | 500 pilot, 1000, 2000 |
| Training duration | 10 epochs |
The benchmark is designed to test whether architecture improvements appear consistently or only under specific data conditions.
The project evaluates retrieval quality and embedding behavior.
| Metric | Meaning |
|---|---|
| Recall@1 | Whether the correct match is ranked first |
| Recall@5 | Whether the correct match appears in the top 5 |
| Recall@10 | Whether the correct match appears in the top 10 |
| Recall@50 | Whether the correct match appears in the top 50 |
| Lift@K | Improvement over random retrieval |
| Positive-pair cosine similarity | Average similarity between true image-EHR pairs |
| Training loss | Final contrastive optimization loss |
This metric set is important because positive-pair similarity alone does not guarantee strong top-k retrieval. A model may pull positive pairs closer on average while still failing to rank the exact match highly among many candidates.
The results show that FT-Transformer does not universally outperform the MLP encoder.
Instead, model behavior depends on the dataset condition.
In mostly linear settings, the MLP is already highly competitive. This suggests that a stronger architecture is not automatically useful when the data structure is simple enough for a baseline encoder.
In interaction-based settings, FT-Transformer can improve retrieval in some sample-size regimes. This supports the idea that attention over feature tokens may help when feature interactions are important.
However, the improvement is not uniform across all settings. This means the architecture should be evaluated under controlled conditions rather than assumed to be better.
In noisy-pair settings, both MLP and FT-Transformer degrade. This shows that architecture alone cannot fix weak or corrupted supervision.
This is an important result for multimodal learning: pair quality can dominate model architecture.
The FT-Transformer-style encoder does not universally outperform the simpler MLP baseline. Its benefit depends on the data structure.
When the signal is mostly linear or easy to model, the MLP baseline can already perform well.
In interaction-based synthetic settings, FT-Transformer can improve retrieval behavior, especially when the data contains relationships that may benefit from feature-token attention.
When pair assignments are noisy, both models degrade. Better architecture cannot fully compensate for weak image-EHR pairing.
FT-Transformer may improve positive-pair similarity without always improving top-k retrieval. This shows why retrieval evaluation requires ranking metrics such as Recall@K and Lift@K, not only cosine similarity.
This repository is a controlled method-analysis project, not a clinical deployment study.
The data is synthetic, and the visual inputs are controlled patterns rather than real chest X-ray images. Therefore, the results should be interpreted as evidence of experimental design, retrieval-evaluation reasoning, and model-behavior analysis, not as clinical performance.
The benchmark also uses a limited number of sample sizes, dataset modes, and training epochs. Stronger evidence would require repeated random seeds, larger sample-size sweeps, longer training, additional baselines, and authorized real-world datasets.
Possible extensions include:
- repeating all experiments across multiple random seeds
- testing larger sample sizes
- training for longer durations
- adding stronger tabular baselines
- testing additional transformer-style tabular encoders
- adding pair-noise sweeps
- analyzing embedding geometry after training
- applying the pipeline to authorized real multimodal datasets
- comparing retrieval behavior under different contrastive loss variants
This project taught me that stronger architecture does not automatically mean stronger retrieval.
The most important lesson is:
Architecture, data structure, sample size, and pair quality must be evaluated together.
FT-Transformer-style encoders can be useful when feature interactions matter, but they are not magic. If the data is simple, an MLP may be enough. If the pairs are noisy, both models can fail.
This changed how I think about multimodal retrieval. A serious benchmark should not only compare models. It should also test when a model helps, when it fails, and what data conditions make the difference.
This project is inspired by deep learning methods for tabular data and contrastive retrieval.
The FT-Transformer-style encoder is motivated by work on feature tokenization and Transformer layers for tabular prediction tasks. The broader idea of applying self-attention to tabular features is also related to TabTransformer, which uses contextual embeddings for categorical tabular features.
This repository does not reproduce those papers directly. Instead, it adapts the idea of transformer-style tabular encoding to a controlled multimodal retrieval setting.
- Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting Deep Learning Models for Tabular Data. NeurIPS, 2021.
- Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin. TabTransformer: Tabular Data Modeling Using Contextual Embeddings. arXiv, 2020.