
ML for Personalized Psychiatric Diagnostics

Multimodal models for psychiatric prediction: digital phenotyping, EHR plus wearable fusion, RDoC-style biotyping, treatment-response prediction, and the cross-site replication failures that have limited clinical adoption.


Why This Matters

Psychiatry has weak predictive validity at the individual level. Two patients with the same DSM-5 diagnosis can have nearly disjoint symptom profiles, biological markers, and treatment responses. The hope behind personalized psychiatric ML is that combining many weak signals (passive sensor data, EHR text, lab values, imaging, genotype) will recover the heterogeneity that diagnostic categories average over.

That hope has come at a cost. The same heterogeneity that makes the problem interesting also breaks the standard ML pipeline: models trained on one hospital's patients tend not to transfer; models that look strong in cross-validation collapse on a held-out site; effect sizes that look clinically meaningful in a discovery cohort shrink in independent samples. Reading this literature requires keeping the prediction-versus-discovery distinction sharp at all times.

Core Methods

Digital phenotyping (Insel 2017, JAMA) names the program of using continuous passive measurements from a smartphone or wearable (typing rhythm, GPS variance, sleep regularity from accelerometry, voice features from call audio) as an objective behavioral substrate. Models built on these features predict mood transitions, relapse, and adherence. The strengths are sampling frequency (thousands of observations per patient per week) and ecological validity. The weaknesses are device drift, missingness that correlates with state (a depressed patient stops carrying the phone), and informed consent that becomes harder with each added stream.
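The feature families above can be made concrete. A minimal sketch on synthetic streams; every number and feature name here is invented for illustration and does not come from any cited study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-week passive streams for a single patient (illustrative only).
gps = rng.normal(loc=[41.31, -72.92], scale=0.01, size=(7 * 24, 2))  # hourly lat/lon
sleep_onset_hour = rng.normal(23.5, 0.8, size=7)                     # nightly onset, hours
screen_on = rng.random(7 * 24) < 0.6                                 # hourly carry/usage proxy

def gps_variance(track):
    """Total variance of location -- a crude mobility feature."""
    return float(np.var(track, axis=0).sum())

def sleep_regularity(onsets):
    """Std of sleep onset time: lower means more regular sleep."""
    return float(np.std(onsets))

def missingness(mask):
    """Fraction of epochs with no data; in practice this correlates with state."""
    return float(1.0 - mask.mean())

features = {
    "gps_variance": gps_variance(gps),
    "sleep_onset_std_h": sleep_regularity(sleep_onset_hour),
    "missing_frac": missingness(screen_on),
}
print(features)
```

Note that `missing_frac` is itself a feature, not just a data-quality flag: because missingness is state-correlated, dropping incomplete epochs silently biases the sample toward well periods.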

EHR plus wearable fusion combines structured data (diagnoses, prescriptions, labs) with unstructured clinical notes (handled by BERT-style encoders) and continuous signals from devices. Architectures range from late-fusion ensembles to multimodal transformers with modality-specific tokenizers. The hard problem is alignment: an EHR record updates on visits, a wearable updates every minute, and a self-report updates daily. Most reported gains over single-modality baselines are modest (a few AUC points) and unstable across sites.
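The alignment problem can be sketched directly: resample each modality onto a common daily grid before any fusion. The 14-day window, the visit days, and all values below are hypothetical:

```python
import numpy as np

# Three modalities with different update rates, aligned to a daily grid
# (a minimal sketch; real pipelines handle gaps and time zones carefully).
days = 14
wearable = np.random.default_rng(1).normal(70, 5, size=(days, 24 * 60))  # minute-level HR
self_report = np.random.default_rng(2).integers(0, 27, size=days)        # daily symptom score
ehr_visits = {0: 1.0, 9: 0.0}  # visit day -> coded note feature (hypothetical)

wearable_daily = wearable.mean(axis=1)  # minute -> day by averaging

# Forward-fill the visit-level EHR feature onto the daily grid:
# between visits, the last observed value carries forward.
ehr_daily = np.empty(days)
last = np.nan
for d in range(days):
    if d in ehr_visits:
        last = ehr_visits[d]
    ehr_daily[d] = last

X = np.column_stack([wearable_daily, self_report, ehr_daily])
print(X.shape)  # (14, 3): one aligned feature row per day
```

A late-fusion ensemble would score each modality's columns separately and combine the scores; a multimodal transformer would instead tokenize each stream at its native rate. Either way, the forward-fill step above is a modeling choice, and a consequential one: it asserts that an EHR feature is constant between visits.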

The Research Domain Criteria (RDoC) framework, introduced by NIMH in 2010, reframes psychiatric phenotypes around dimensional constructs (negative valence, cognitive control, arousal regulation) rather than DSM categories. ML on RDoC data tries to recover patient subgroups that cut across diagnostic labels. Drysdale et al. 2017 reported four resting-state fMRI biotypes of depression with differential rTMS response. Subsequent replication attempts found weaker and partially inconsistent biotypes. The RDoC program continues; the specific biotype claims have not stabilized.
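Mechanically, biotyping is unsupervised clustering on connectivity-derived features. The toy sketch below plants two subgroups in synthetic features and recovers them with a hand-rolled two-cluster k-means; it is a stand-in for the far more involved published pipelines, and the clean separation here is exactly what real fMRI features do not give you:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "connectivity features" with two planted subgroups that cut
# across a diagnostic label (illustrative stand-in for fMRI features).
a = rng.normal(0.0, 0.5, size=(50, 10))
b = rng.normal(2.0, 0.5, size=(50, 10))
X = np.vstack([a, b])

def kmeans2(X, iters=50):
    """Two-cluster k-means with deterministic farthest-point initialization."""
    c0 = X[0].copy()
    c1 = X[((X - c0) ** 2).sum(axis=1).argmax()].copy()
    centers = np.stack([c0, c1])
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        centers = np.stack([X[labels == j].mean(axis=0) for j in range(2)])
    return labels

labels = kmeans2(X)
# Each planted half should land (almost entirely) in one cluster.
print(np.bincount(labels[:50]), np.bincount(labels[50:]))
```

The replication problem enters here: cluster assignments are sensitive to preprocessing, feature selection, and sample composition, so "the four biotypes" found at one site need not reappear at another even when the clustering code is identical.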

Treatment-response prediction is the most policy-relevant application. Chekroud et al. 2016 (Lancet Psychiatry) trained gradient-boosted trees on baseline clinical and demographic features from STAR*D and predicted citalopram response with internally validated AUC near 0.7, outperforming random selection. The same group's 2024 Science paper revisited the field at scale: across 53 trials and 36,000 patients, models that performed well within a trial generalized poorly across trials. Median AUC dropped from 0.7 to roughly 0.5 — chance — when train and test came from different studies. The within-trial signal was real and the cross-trial collapse was a property of the underlying problem (different inclusion criteria, different outcome measurement, different placebo effects), not of any specific model class.
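The within-trial/cross-trial gap can be demonstrated in a few lines. The sketch below plants a trial-specific association direction (a toy stand-in for differing inclusion criteria and outcome measurement), fits that direction on a discovery trial, and scores both trials with a rank-based AUC. All data are synthetic; no claim is made about the actual mechanism in the cited studies:

```python
import numpy as np

rng = np.random.default_rng(4)

def make_trial(effect, n=400):
    """One synthetic trial: response depends on a baseline feature, but the
    direction of the association is trial-specific."""
    x = rng.normal(size=(n, 1))
    p = 1 / (1 + np.exp(-effect * x[:, 0]))
    y = (rng.random(n) < p).astype(int)
    return x, y

def auc(scores, y):
    """Rank-based AUC (Mann-Whitney statistic)."""
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

Xa, ya = make_trial(effect=+1.5)   # discovery trial
Xb, yb = make_trial(effect=-1.5)   # independent trial, association reversed

# "Model": the sign of the within-trial association, applied to the feature.
sign_a = np.sign(np.corrcoef(Xa[:, 0], ya)[0, 1])
print(round(auc(sign_a * Xa[:, 0], ya), 2))  # well above chance within the trial
print(round(auc(sign_a * Xb[:, 0], yb), 2))  # at or below chance on the other trial
```

No amount of capacity added to the "model" fixes the second number: the learned association is true within trial A and false within trial B, which is the structural sense of distribution shift described above.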

Watch Out

In-sample AUC is not clinical readiness

A treatment-response model with cross-validated AUC 0.75 on STAR*D is not a deployable tool. The Chekroud 2024 result is the explicit evidence: when the test set came from a different trial, performance collapsed to chance for most models. This is distribution shift in the structural sense, not an optimization or hyperparameter problem. Adding more model capacity does not fix it. External validation across heterogeneous sites is the only honest readiness criterion.
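The split that operationalizes this readiness criterion is leave-one-site-out (or leave-one-trial-out) rather than record-level cross-validation. A minimal sketch of such a splitter, with hypothetical site labels:

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) pairs, one fold per site/trial label.
    Every test fold is an entire held-out site."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        yield np.flatnonzero(groups != g), np.flatnonzero(groups == g)

# Hypothetical site labels for seven patient records.
site = ["A"] * 3 + ["B"] * 2 + ["C"] * 2
for train, test in leave_one_group_out(site):
    print(train.tolist(), "->", test.tolist())
```

Record-level k-fold lets each site appear on both sides of the split, so site-specific artifacts (scanner, intake criteria, coding habits) leak into the score; grouping by site is what makes the estimate speak to deployment at a new site.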

Watch Out

Prediction does not equal recommendation

A model that predicts which patients respond to citalopram, even with perfect calibration, does not on its own license a recommendation to prescribe citalopram to predicted responders. The decision-theoretic analysis requires the counterfactual: what would have happened under the alternative drug. Most studies in this space use observational or single-arm designs; the predicted-responder cohort and the predicted-non-responder cohort differ on baseline characteristics that are themselves prognostic. See causal inference for policy evaluation for the framing.
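The counterfactual point can be illustrated with a potential-outcomes toy model: if a shared prognostic factor drives response under both drugs, a predictor of citalopram response mostly selects good-prognosis patients, not patients who specifically need citalopram. All quantities below are synthetic and hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)

# Potential outcomes: prognosis drives response under BOTH drugs.
n = 10_000
prognosis = rng.normal(size=n)
y_citalopram = (rng.random(n) < 1 / (1 + np.exp(-prognosis))).astype(int)
y_alternative = (rng.random(n) < 1 / (1 + np.exp(-prognosis))).astype(int)

# A "perfect" prognostic model for citalopram response.
predicted_responder = prognosis > 0.5

# The quantity a recommendation needs: benefit over the alternative drug
# among predicted responders -- which here is near zero, because predicted
# responders would have done equally well on either drug.
benefit = (y_citalopram[predicted_responder].mean()
           - y_alternative[predicted_responder].mean())
print(round(benefit, 3))
```

In real data the alternative-drug column is unobserved for patients who got citalopram, which is why single-arm response prediction cannot by itself answer the prescribing question.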

References

Insel 2017

Insel (2017). "Digital phenotyping: technology for a new science of behavior." JAMA 318(13):1215-1216. The naming paper for the program; the case for passive sensor data as a clinical primitive.

Chekroud 2016

Chekroud, Zotti, Shehzad, Gueorguieva et al. (2016). "Cross-trial prediction of treatment outcome in depression: a machine learning approach." Lancet Psychiatry 3(3):243-250. Gradient-boosted treatment-response model on STAR*D with external validation on CO-MED.

Chekroud 2024

Chekroud et al. (2024). "Illusory generalizability of clinical prediction models." Science 383(6679):164-167. The cross-trial replication failure paper; 53 trials, 36,000 patients, AUC dropping to chance across studies.

Drysdale 2017

Drysdale et al. (2017). "Resting-state connectivity biomarkers define neurophysiological subtypes of depression." Nature Medicine 23:28-38. The four-biotype paper; later replication concerns.

Cuthbert Insel 2013

Cuthbert and Insel (2013). "Toward the future of psychiatric diagnosis: the seven pillars of RDoC." BMC Medicine 11:126. The RDoC framework as an alternative to DSM-style categorical diagnosis.

Onnela 2021

Onnela (2021). "Opportunities and challenges in the collection and analysis of digital phenotyping data." Neuropsychopharmacology 46:45-54. The methodological caveats for passive sensing studies, with a focus on missingness and informativeness.


Last reviewed: April 18, 2026