Clinical AI · 14 min read · Field Notes

AI as Clinical Tool, Not Replacement: What the Research Says and Why the Distinction Matters

2026-05-05 · Matthew Sexton, LCSW

There is a version of the AI-in-mental-health story that gets pitched in venture decks and on podcasts. It goes like this: the access problem is enormous, therapists are scarce and expensive, and large language models can scale infinitely. Therefore, replace the therapist. The implication is sometimes loud and sometimes quiet, but it is there. I want to be specific about why that framing is not where the clinical literature points, and why I think the more honest pitch — AI as clinical tool inside a clinician’s workflow — is also the more useful one.

I am not anti-AI. I run a mental-health software company that uses Vertex AI under the Google Cloud Business Associate Agreement to assist with structured intake, validated-measure scoring, between-session continuity, and clinician-reviewed note scaffolding. I have spent more time on prompt engineering than I would like to admit. The argument I am making is narrower than “AI is bad for mental health.” The argument is: the strongest evidence we have is for AI-as-tool inside a clinical relationship, and we should build accordingly.

Where the evidence actually is

Linardon and colleagues published a meta-analysis in 2019 of randomized controlled trials of smartphone mental-health interventions. The result that gets cited most is that smartphone interventions outperformed inactive controls on outcomes including depression and anxiety, with small-to-moderate effect sizes — but the strongest effects in subgroup analyses came from interventions that included some form of human guidance.[1] That is one finding inside one meta-analysis, but the pattern is consistent.

Olthuis et al.’s 2016 Cochrane review of therapist-supported internet-delivered cognitive behavioral therapy for anxiety disorders concluded that therapist-supported iCBT was effective compared with waiting-list controls, with effect sizes broadly comparable to those reported for face-to-face CBT in similar populations.[2] The qualifier “therapist-supported” is doing real work in that finding. The interventions that had a clinician in the loop performed differently than the interventions that did not.

The World Health Organization’s 2022 World Mental Health Report identified digital tools as a meaningful part of the strategy to scale mental-health support, but the report frames digital tools explicitly as components of a stepped-care model where guided self-help and clinician-supervised digital interventions sit between unguided self-help and full clinical care.[3] The report does not endorse a model in which a chatbot is the clinical care.

Why the alliance literature pushes back on replacement framing

The therapeutic alliance is one of the most consistent predictors of outcome in psychotherapy across modalities, populations, and decades of meta-analytic synthesis. Wampold’s contextual model and the alliance literature he and Norcross have summarized find that the relational components of psychotherapy account for a non-trivial share of outcome variance, often comparable to the share attributable to specific technique.[4][5] Lambert and Barley’s widely cited summary placed the alliance and related common factors at the top of empirically supported predictors of psychotherapy outcome.[6] Flückiger and colleagues’ 2018 meta-analysis of the alliance in adult psychotherapy synthesized data from nearly three hundred studies and reaffirmed a robust, modality-agnostic alliance–outcome association.[7]

The alliance is a relationship. The relationship is between two human beings, in a defined clinical frame, over time. A language model can produce text that mimics empathic acknowledgment. It does not have the things the alliance literature is describing: a continuous developmental relationship, an ethical and licensed accountability structure, a body that is present in the room, a clinical history with the client, and a set of professional obligations that bind the relationship in ways that text generation cannot. That is not a sentimental claim. That is the literature.

What AI is good at, in a clinical workflow

The tasks where large language models, in a HIPAA-compliant infrastructure, can earn their keep inside clinical work are narrower and more boring than the venture pitch suggests. They are also genuinely useful.

Structured intake summarization. The client fills in a long structured intake; the model generates a clinically organized summary that the clinician reviews and edits before the first session. The clinician’s thirty minutes of intake review collapses into ten minutes of editing a draft that follows the clinician’s preferred structure. The clinician is the editor, not the consumer.
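
To make “scoped to the specific clinical task” concrete, here is a minimal sketch of the shape of that call against the standard Vertex AI Python SDK. The project ID, model name, prompt wording, and section headings are illustrative assumptions, not VibeCheck’s production prompts.

```python
# Sketch only: names and prompt below are illustrative, not production values.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="example-clinical-project", location="us-central1")

INTAKE_PROMPT = """You are drafting an intake summary for a licensed clinician.
Summarize ONLY what appears in the structured intake below.
Use these sections: Presenting Concerns; History; Current Supports; Risk Flags.
Do not infer diagnoses. Mark anything ambiguous as 'clinician to clarify'.

Structured intake:
{intake}
"""

def draft_intake_summary(intake_text: str) -> str:
    """Return a DRAFT summary; the clinician reviews and edits before use."""
    model = GenerativeModel("gemini-1.5-pro")  # model choice is an assumption
    response = model.generate_content(INTAKE_PROMPT.format(intake=intake_text))
    return response.text  # stored as a draft, never auto-filed to the record
```

The constraints in the prompt are what keep the draft inside what the intake actually says; the clinician’s edit pass is where clinical meaning gets added.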

Validated-measure scoring and trajectory tracking. The PHQ-9 score is arithmetic. The reliable-change threshold is documented. Surfacing the trajectory in the right view at the right time — a pre-session brief, an alert when the delta crosses the published threshold — is a tooling problem, not a clinical-judgment problem. The clinical judgment is what the clinician does with the trajectory once they can see it.[8]
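
The tooling half really is this small. A minimal sketch, assuming a 5-point delta as the threshold; a 5-point change is a commonly cited clinically meaningful shift on the PHQ-9, but a real deployment should pin whatever published threshold its clinical documentation names.

```python
from typing import List

def score_phq9(items: List[int]) -> int:
    """Sum the nine PHQ-9 items, each scored 0-3, for a total of 0-27."""
    if len(items) != 9 or any(not 0 <= item <= 3 for item in items):
        raise ValueError("PHQ-9 requires nine items scored 0-3")
    return sum(items)

def crosses_threshold(previous: int, current: int, threshold: int = 5) -> bool:
    """Flag when the score delta crosses the documented threshold."""
    return abs(current - previous) >= threshold

# Surfaced in a pre-session brief; the clinician interprets the trajectory.
if crosses_threshold(previous=18, current=11):
    print("PHQ-9 delta crossed the documented threshold")
```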

Between-session check-in synthesis. A client’s seven daily check-ins between sessions are useful only if the clinician can read them at a glance before the session. The model can summarize seven entries into one paragraph, flag deltas, and surface direct quotes that warrant attention. The clinician still walks into the room with full responsibility for the clinical interpretation.
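
A sketch of the assembly step, with a hypothetical check-in shape; the entry fields, the 3-point delta flag, and the prompt wording are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CheckIn:
    day: date
    mood: int   # e.g., a 1-10 self-report scale (assumed shape)
    text: str   # free-text entry

def build_synthesis_prompt(entries: list[CheckIn]) -> str:
    """Assemble a scoped synthesis prompt from one week of check-ins."""
    lines = [f"{e.day.isoformat()} mood={e.mood}: {e.text}" for e in entries]
    return (
        "Summarize the week below in one paragraph for a licensed clinician.\n"
        "Flag mood deltas of 3+ points and quote verbatim any statement\n"
        "about safety, sleep, substances, or medication changes.\n\n"
        + "\n".join(lines)
    )
```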

Note scaffolding. The clinician dictates or writes the clinical content; the model produces a structured-note draft (BIRP, DAP, SOAP, the clinician’s preference) that the clinician reviews, edits, and signs. The clinician owns the note. The model produces a starting point. This is the use case where most therapists I talk to recover meaningful time without losing fidelity, when the prompt and the output structure are tight.[9]
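
The fidelity here is structural, not rhetorical: an unsigned draft can never be filed. A minimal sketch of that invariant, with hypothetical field names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NoteDraft:
    note_format: str          # "BIRP", "DAP", or "SOAP" -- clinician's preference
    model_draft: str          # the starting point the model produced
    clinician_text: str = ""  # the content the clinician actually owns
    signed_by: Optional[str] = None

    def sign(self, clinician_id: str, final_text: str) -> None:
        """Signing requires the clinician's own edited text."""
        self.clinician_text = final_text
        self.signed_by = clinician_id

def file_to_record(note: NoteDraft) -> str:
    """Only a clinician-signed note ever enters the clinical record."""
    if note.signed_by is None:
        raise PermissionError("an unsigned model draft cannot be filed")
    return note.clinician_text
```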

Triage and routing inside a clinician’s panel. A client check-in that includes a meaningful elevation on the Columbia Suicide Severity Rating Scale (C-SSRS) or a free-text mention that meets a documented escalation rule should produce a clinician notification, not get lost in a queue. The model is helping the routing logic, not making the clinical decision.
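
A sketch of what a documented escalation rule looks like as routing logic. The severity cut-point and keyword list below are placeholders, not clinical guidance; the real rule set is written, versioned, and owned by the clinical team.

```python
# Placeholder terms for illustration only.
ESCALATION_KEYWORDS = ("hurt myself", "end it all", "no reason to live")

def needs_clinician_notification(cssrs_ideation: int, free_text: str) -> bool:
    """Routing only: True notifies a clinician; nothing here adjudicates risk."""
    if cssrs_ideation >= 3:  # placeholder cut-point on ideation severity
        return True
    lowered = free_text.lower()
    return any(keyword in lowered for keyword in ESCALATION_KEYWORDS)
```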

Where AI is dangerous in mental health

The danger zones are well-documented and worth listing plainly.

Direct-to-consumer chatbot “therapy” with no licensed clinician in the loop. The clinical literature does not support this as a substitute for psychotherapy. The reputational and regulatory risk for the field is high, and there have been documented harms — including, in well-publicized cases, encouragement of harmful behaviors — produced by general-purpose chat models that were never designed as clinical tools.[10]

Crisis triage without a clinician backstop. A model can be tuned to produce crisis-resource language. A model is not licensed to assess imminent risk. The 988 Lifeline, the Stanley-Brown safety plan, the C-SSRS, and the licensed clinician are the clinical infrastructure for risk. AI in the workflow can route to those resources faster; AI cannot replace them.[11][12]

Diagnostic claims at scale. The DSM and ICD diagnostic frameworks are operationalized by trained clinicians inside a clinical relationship. A model that produces diagnostic-language output without that clinical container is producing something that looks like a diagnosis but is not a diagnosis. The downstream harm — insurance, employment, custody, self-perception — is not theoretical.[13]

Long-context summarization of clinical material without source verification. Models hallucinate. Hallucinations in clinical text matter more than hallucinations in marketing copy. A summary that invents a symptom, a date, or a quoted statement is dangerous in a clinical record. The mitigation is human-in-the-loop verification at every point where the model output enters the clinical record.[14]
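
One cheap guard worth running before the human review even starts: every direct quote in a model summary must appear verbatim in the source text. A minimal sketch.

```python
import re

def unverified_quotes(summary: str, source: str) -> list[str]:
    """Quoted strings in the summary that never appear verbatim in the source."""
    quotes = re.findall(r'"([^"]+)"', summary)
    return [q for q in quotes if q not in source]

# A non-empty result blocks auto-display and routes the draft to review.
```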

The HIPAA architecture that has to come first

Before the conversation about AI is even a clinical conversation, it is an architecture conversation. PHI cannot transit infrastructure that is not under a Business Associate Agreement. That is not a preference; it is the regulation.[15] Most consumer chat APIs are not under a BAA suitable for PHI. Some enterprise-tier offerings from Google Cloud, AWS, and Microsoft Azure are.[16][17] The clinician-facing tool that uses AI to assist with PHI must use the BAA-covered surface. There is no clever way around this.
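
In code, that posture is an allow-list, not a policy memo. A minimal sketch with an illustrative hostname; the real list is whichever endpoints sit under your executed BAAs.

```python
BAA_COVERED_HOSTS = {
    # Illustrative: a regional Vertex AI endpoint under a Google Cloud BAA.
    "us-central1-aiplatform.googleapis.com",
}

def assert_baa_covered(host: str, contains_phi: bool) -> None:
    """Refuse any PHI-bearing request to a surface outside the allow-list."""
    if contains_phi and host not in BAA_COVERED_HOSTS:
        raise RuntimeError(f"PHI may not transit a non-BAA surface: {host}")
```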

The architecture for VibeCheck reflects that. PHI lives in PostgreSQL on AWS with pgcrypto encryption at rest, in an environment under an executed AWS BAA. AI assistance routes through Vertex AI on Google Cloud under an executed Google Cloud BAA. Prompts are scoped to the specific clinical task, log retention is configured for the clinical-grade environment, and the marketing site you are reading does not see PHI at all. The application lives at app.vibecheck.luxury; the marketing site at vibecheck.luxury. That split is not aesthetic; it is HIPAA hygiene.
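
For the encryption-at-rest piece, pgcrypto works at the column level. A minimal sketch, assuming a hypothetical notes table and psycopg2; key management is deliberately elided, and in practice the key comes from a managed secret, never application source.

```python
import psycopg2

client_id, note_text = 42, "clinician-signed note text"
encryption_key = "from-a-managed-secret"  # illustrative placeholder

conn = psycopg2.connect("dbname=clinical")  # connection details illustrative
with conn, conn.cursor() as cur:
    # pgp_sym_encrypt stores the note as an encrypted bytea column.
    cur.execute(
        "INSERT INTO notes (client_id, body) VALUES (%s, pgp_sym_encrypt(%s, %s))",
        (client_id, note_text, encryption_key),
    )
    # pgp_sym_decrypt recovers plaintext only for callers holding the key.
    cur.execute(
        "SELECT pgp_sym_decrypt(body, %s) FROM notes WHERE client_id = %s",
        (encryption_key, client_id),
    )
```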

The HRSA workforce data on mental-health professional shortages is the reason the access argument keeps coming up.[18] The shortage is real. The response to a real shortage is not to lower the clinical floor; it is to make the existing clinicians more effective per hour without sacrificing the clinical responsibility that makes mental-health treatment work in the first place.

What the published reasoning on LLMs in clinical contexts says

Thirunavukarasu and colleagues’ 2023 review of large language models in medicine summarized the technical capabilities and the clinical-translation pitfalls and concluded that LLMs in healthcare require disciplined evaluation, human oversight, and domain-specific guardrails before they can be relied on for any task adjacent to direct clinical care.[19] Miner and colleagues’ earlier 2016 work on conversational agents and mental-health responses found that general-purpose conversational agents at that time responded inconsistently and sometimes inappropriately to statements about mental-health crises — a finding that has been partially mitigated by intentional guardrails in newer systems but that has not gone away as a category of risk.[10]

Lui and Penney’s 2021 review of digital mental-health interventions for diverse populations highlights another piece of the picture: the evidence base is much thinner for interventions targeted at clinical populations defined by ethnicity, language, age, and socioeconomic status. The interventions that work in the published trials work for the populations that were enrolled in the trials. Generalizing AI-mediated mental-health interventions to populations underrepresented in the literature without clinician oversight is a known equity risk.[20]

Wampold’s framing — that psychotherapy works in significant part because of the relational, contextual, and meaning-making properties of the clinician–client relationship — is the part of the literature that the replacement framing tends to skip past.[4] A model can produce empathic-sounding text. The model is not in a relationship with the client in the sense the alliance literature is describing.

What I tell other clinicians who ask about this

If you are a licensed clinician evaluating an AI-assisted tool for your practice, these are the questions I would ask, in this order:

  1. Is the infrastructure under a Business Associate Agreement appropriate for PHI?
  2. Is the AI assist scoped to specific workflow tasks, or is it pitched as a substitute for clinical judgment?
  3. Is the human clinician the editor and signer of every clinical artifact the tool produces?
  4. Is there a documented crisis-handoff path that does not depend on the AI to make a clinical-acuity call?
  5. Is the tool built and reviewed by clinicians, not just by engineers?

If those answers are clean, the tool is probably useful. If those answers are slippery, the tool is probably going to create work and risk you cannot see yet.

The reason I built VibeCheck is that I wanted an AI-assisted clinical layer that I, as a licensed clinician, would feel honest using on my own caseload. Every clinical decision was made by a licensed clinician. The architecture lives under appropriate BAAs. The clinician is the editor of every clinical artifact. The crisis handoff routes to 988, to the Stanley-Brown safety plan, and to clinician notification — not to a model output. The relationship between the clinician and the client stays where the relationship belongs.

That is the version of AI in mental health that the literature actually supports. That is what I am willing to put my license behind. That is what the next decade of clinically credible AI in mental-health work needs to look like. The marketing pitch will keep saying something else. The literature will keep saying what it has been saying for thirty years: the alliance does the work, the clinician carries the responsibility, and the tools that earn their keep are the ones that make the clinician more effective without pretending to be the clinician.

References

  1. Linardon, J., Cuijpers, P., Carlbring, P., Messer, M., & Fuller-Tyszkiewicz, M. (2019). The efficacy of app-supported smartphone interventions for mental health problems: A meta-analysis of randomized controlled trials. World Psychiatry, 18(3), 325–336.
  2. Olthuis, J. V., Watt, M. C., Bailey, K., Hayden, J. A., & Stewart, S. H. (2016). Therapist-supported internet cognitive behavioural therapy for anxiety disorders in adults. Cochrane Database of Systematic Reviews, (3), CD011565.
  3. World Health Organization. (2022). World Mental Health Report: Transforming Mental Health for All. Geneva: WHO. who.int
  4. Wampold, B. E. (2001). The Great Psychotherapy Debate: Models, Methods, and Findings. Mahwah, NJ: Lawrence Erlbaum.
  5. Norcross, J. C., & Wampold, B. E. (2011). Evidence-based therapy relationships: Research conclusions and clinical practices. Psychotherapy, 48(1), 98–102.
  6. Lambert, M. J., & Barley, D. E. (2001). Research summary on the therapeutic relationship and psychotherapy outcome. Psychotherapy: Theory, Research, Practice, Training, 38(4), 357–361.
  7. Flückiger, C., Del Re, A. C., Wampold, B. E., & Horvath, A. O. (2018). The alliance in adult psychotherapy: A meta-analytic synthesis. Psychotherapy, 55(4), 316–340.
  8. Lewis, C. C., Boyd, M., Puspitasari, A., et al. (2019). Implementing measurement-based care in behavioral health: A review. JAMA Psychiatry, 76(3), 324–335.
  9. American Psychological Association. (2013). Guidelines for the Practice of Telepsychology. apa.org
  10. Miner, A. S., Milstein, A., Schueller, S., et al. (2016). Smartphone-based conversational agents and responses to questions about mental health, interpersonal violence, and physical health. JAMA Internal Medicine, 176(5), 619–625.
  11. Stanley, B., & Brown, G. K. (2012). Safety planning intervention: A brief intervention to mitigate suicide risk. Cognitive and Behavioral Practice, 19(2), 256–264.
  12. Posner, K., Brown, G. K., Stanley, B., et al. (2011). The Columbia–Suicide Severity Rating Scale: Initial validity and internal consistency findings from three multisite studies. American Journal of Psychiatry, 168(12), 1266–1277.
  13. American Psychiatric Association. (2013). Diagnostic and Statistical Manual of Mental Disorders (5th ed.). Washington, DC: APA.
  14. Ji, Z., Lee, N., Frieske, R., et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38.
  15. U.S. Department of Health & Human Services. (2022). HIPAA for Professionals: Business Associates. hhs.gov
  16. Amazon Web Services. HIPAA Eligible Services Reference. aws.amazon.com
  17. Google Cloud. HIPAA Compliance on Google Cloud Platform. cloud.google.com
  18. Health Resources & Services Administration. (2024). Designated Health Professional Shortage Areas Statistics: Mental Health. data.hrsa.gov
  19. Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., Gutierrez, L., Tan, T. F., & Ting, D. S. W. (2023). Large language models in medicine. Nature Medicine, 29(8), 1930–1940.
  20. Lui, J. H. L., & Penney, L. S. (2021). Digital mental health interventions for ethnic minority populations: A scoping review of the empirical literature. Clinical Psychology Review, 87, 102031.