This is the first in a series of blog posts on AI in the clinical trials and research space, highlighting topics to be discussed at the upcoming BIO International Convention, June 22-25, 2026.
A major research institution combines 15 years of patient genomic data with metabolomic profiles and clinical outcomes, feeding the combined dataset into an AI model for drug target discovery. The resulting insights are groundbreaking—until the legal department asks: “Did any of those consent forms authorize AI analysis? Or data sharing across these datasets? Or commercial drug development?”
The answer is almost always no. It’s an emerging problem as AI transforms biomedical research.
The Consent Gap: Why Historical Forms Don’t Cover AI
Traditional informed consent forms were designed for a different era of research. They contemplated specific research projects focused on defined disease areas, with de-identification as the primary privacy protection and limited data sharing within named research teams.
AI research operates in a fundamentally different paradigm. It requires cross-dataset integration, combining genomics, proteomics, and metabolomics (collectively, “-omics” data) with clinical records and environmental data. AI models demand indefinite data retention for ongoing training and validation. The applications are often unanticipated: drug repurposing, target discovery, and commercial uses far beyond the original study’s intent. Most critically, AI can derive information through algorithmic inference that was never explicitly collected.
The legal problem is straightforward: informed consent must be both “informed” and “specific” under the Common Rule, HIPAA, and state privacy laws. Broad “future research” language may not cover AI-driven commercial applications. The consequences of getting this wrong are severe: institutions face potential lawsuits from participants alleging unauthorized use of their biological data; investigations by the Office for Civil Rights (OCR) for HIPAA violations, which can result in millions of dollars in fines; state attorney general enforcement actions under consumer protection and health privacy laws; and even suspension of research activities by the Office for Human Research Protections (OHRP) if human subjects protections are found inadequate.[1]
This isn’t a theoretical concern. It’s happening now at institutions racing to stay competitive in AI-powered drug discovery.
The Genomic Data Multiplier Effect
Genomic data presents unique challenges that AI amplifies dramatically. Unlike other health information, genomic data is immutable, and it’s inherently familial, revealing information about blood relatives who never consented to participate. It’s also perpetually identifiable, even when “de-identified,” through public genealogy databases, cross-referencing with other datasets, and advancing AI re-identification techniques.
When AI enters the equation, these risks multiply. Combining genomic data with other “-omics” and clinical information through machine learning creates exponential re-identification risk—each additional data layer makes individuals more identifiable. Research has demonstrated that just 15 demographic attributes can uniquely identify 99.98% of Americans,[2] and genomic data is inherently far more identifying. Now consider adding proteomic profiles, metabolomic signatures, and clinical outcomes. The resulting multi-dimensional profile is effectively a unique biological identifier that no traditional de-identification technique can protect.
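To make the multiplier effect concrete, here is a minimal Python sketch, our illustration of the general point rather than a reproduction of any cited study. It builds a hypothetical cohort and reports how the share of unique records climbs as each additional quasi-identifier layer is linked; the cohort size, column names, and value ranges are all invented.

```python
# Minimal sketch (hypothetical data) of the re-identification multiplier:
# the more quasi-identifier layers you link, the more records become unique.
import random
from collections import Counter

random.seed(42)
N = 100_000  # hypothetical cohort size

# Each entry simulates linking one more data layer onto every record.
# Names, ranges, and distributions are invented for illustration.
layers = {
    "zip3": lambda: random.randint(0, 899),            # coarse geography
    "birth_year": lambda: random.randint(1940, 2005),
    "sex": lambda: random.choice("MF"),
    "diagnosis_code": lambda: random.randint(0, 499),
    "metabolite_bin": lambda: random.randint(0, 999),  # binned lab value
}

records = [tuple() for _ in range(N)]
for name, sample in layers.items():
    records = [r + (sample(),) for r in records]  # link the new attribute
    counts = Counter(records)
    unique = sum(1 for c in counts.values() if c == 1)
    print(f"after linking {name!r}: {unique / N:.1%} of records are unique")
```

With only the first coarse attribute, almost no record stands alone; after a handful of linked layers, nearly every record is unique. That is the dynamic that lets layered “-omics” profiles defeat traditional de-identification.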
AI can predict sensitive information like disease susceptibility or behavioral traits that were never disclosed in the original consent. These inferences may affect not just participants, but their family members who never agreed to have their genetic privacy exposed.
Consider this hypothetical: A researcher uses AI to link genomic data from a cancer study done 15 years ago with recent metabolomic data and social determinants of health. The AI identifies a genetic marker linked to early-onset Alzheimer’s, information never contemplated in the original consent. When participants or their families discover such unauthorized uses—whether through data breach notifications, published research, or commercial product development—they can bring individual or class action lawsuits for breach of consent, violation of state genetic privacy laws, and even fraud claims if commercial profits are involved.[3] Yet such scenarios are happening in research institutions across the country.
The Grant Submission Blind Spot
While institutions focus on patient data privacy, many researchers are creating exposure through an unexpected channel: grant submissions. Researchers increasingly use ChatGPT, Claude, and other large language models to edit and refine grant applications. These tools may retain data, use it for training, or expose it to third parties.
Grant applications are treasure troves of sensitive information: preliminary unpublished data, methodological innovations that may be patentable, collaborator details, and budget information. Most funding agencies have strict confidentiality requirements for peer review processes.
Uploading grant text to AI systems may violate grant submission agreements, institutional IP policies, and collaborator NDAs. Some funders now explicitly prohibit AI tool use for this reason, but many researchers remain unaware.
The irony is stark: the tool helping you write the grant could torpedo your IP protection before you even receive funding.
Multi-Omics Integration: Where Risks Converge
AI’s transformative power in drug discovery lies in its ability to integrate diverse data types: genomics, proteomics, metabolomics, clinical data, environmental exposures, and in vitro and in vivo model results. This integration creates unprecedented opportunities for identifying drug targets and biomarkers.
But here’s the legal collision: each data type likely came from different consent forms, different time periods, different institutions, and different regulatory frameworks. Some data may predate HIPAA. Some may have been collected before GDPR existed. Some may have come from collaborators with their own consent language.
When AI combines these datasets, you’re creating something new with compounded privacy risks. Original consents almost certainly didn’t authorize cross-dataset linkage. De-identification becomes exponentially harder. And commercial applications may trigger entirely new consent requirements under evolving state privacy laws.
The compliance maze is daunting. HIPAA requires authorization for research uses, but its de-identification standards don’t account for AI re-identification capabilities. State laws like California’s CPRA and Washington’s My Health My Data Act impose stricter requirements than federal frameworks. The Common Rule requires IRB review of consent adequacy,[4] but many IRBs haven’t updated their frameworks to address AI-specific risks.[5] Standard IRB review criteria don’t typically assess algorithmic re-identification potential, cross-dataset integration risks, or commercial AI model training.
The more data types you integrate, the more consent forms you need to audit, and the more likely you’ll find critical gaps.
The Five Must-Do Steps
Research institutions combining datasets and deploying AI for discovery must take immediate action to mitigate legal and regulatory risks:
Step 1: Audit Existing Consent Forms for AI and Future Use Language
Identify all studies with data you plan to use in AI research. Flag genomic data studies as highest priority given their unique re-identification risks and familial implications. Review consent language for explicit authorization of AI analysis, cross-dataset integration, commercial applications, and indefinite retention. Document gaps and assess whether re-consent is feasible or required.
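As a starting point for that audit, a simple keyword triage can flag which consent files even mention the relevant authorizations. The sketch below is hypothetical throughout (the file layout, keyword lists, and genomic-priority heuristic are our assumptions), and it is a first pass to queue forms for attorney and IRB review, not a substitute for that review.

```python
# Hypothetical first-pass triage of consent forms stored as plain text.
# Keyword lists are illustrative only, not a legal standard.
from pathlib import Path

CATEGORIES = {  # authorization language to look for, per gap category
    "ai_analysis": ["artificial intelligence", "machine learning", "algorithm"],
    "cross_dataset_linkage": ["combined with", "linked to", "other studies"],
    "commercial_use": ["commercial", "licens", "product development"],
    "indefinite_retention": ["indefinite", "future research", "no expiration"],
}
GENOMIC_MARKERS = ["genomic", "genetic", "dna", "sequencing"]

def triage(folder: str) -> None:
    """Print each consent form's review priority and missing categories."""
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        missing = [cat for cat, kws in CATEGORIES.items()
                   if not any(kw in text for kw in kws)]
        # Genomic studies go to the top of the review queue.
        priority = "HIGH" if any(m in text for m in GENOMIC_MARKERS) else "normal"
        print(f"{path.name}: priority={priority}, missing={missing or 'none'}")

triage("consent_forms/")  # hypothetical folder of extracted consent text
```

A form the script flags is not necessarily non-compliant, and one it passes is not necessarily sufficient; the output only prioritizes human review and documents where the gaps are likely to be.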
Step 2: Update Consent Templates to Address AI-Specific Risks
Revise institutional consent templates to explicitly address AI analysis and algorithmic processing, cross-dataset integration across multiple “-omics” platforms, commercial applications including licensing and spin-outs, and indefinite retention for ongoing model training and validation. Ensure language is specific enough to meet regulatory requirements while flexible enough to accommodate evolving AI applications.
Step 3: Implement AI Use Policies for Grant Submissions
Develop clear policies on what can and cannot be uploaded to AI systems. Train researchers on IP and confidentiality risks associated with using commercial AI tools for grant writing. Require researchers to certify compliance with funder policies on AI use before submission.
Step 4: Review and Strengthen AI Vendor Contracts
Ensure all contracts with AI platform vendors include clear data retention and deletion policies, robust audit trail requirements for regulatory compliance, indemnification clauses specifically addressing privacy breaches and re-identification risks, and regulatory compliance warranties. Require vendors to disclose whether models are trained on your data and whether data is shared with third parties.
Step 5: Engage IRBs Early on AI Research Protocols
Update IRB review criteria to specifically address AI-related risks including re-identification potential, algorithmic bias, and commercial applications. Require researchers to document AI model validation assumptions and limitations. Mandate re-consent analysis when using historical data for new AI applications. Ensure IRB members receive training on AI-specific privacy and ethical considerations.
The Bottom Line
AI is moving fast, and consent forms, IRB frameworks, and institutional policies can’t keep up. Research institutions integrating AI into drug discovery, target identification, and biomarker development are building on a foundation of sand: outdated consent language.
The time to address this isn’t after the first privacy breach, regulatory action, or licensing dispute—it’s now, before your groundbreaking AI research becomes a cautionary tale about inadequate consent.
[1] See, e.g., 45 C.F.R. Part 46; National Research Act, Pub. L. No. 93-348, 88 Stat. 342 (1974); U.S. Dep’t of Health & Human Servs., The Belmont Report, https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html (last accessed Apr. 25, 2026); Henrietta Lacks: Science Must Right a Historical Wrong, Nature (Sept. 1, 2020), https://www.nature.com/articles/d41586-020-02494-z; The Legacy of Henrietta Lacks – Frequently Asked Questions, Johns Hopkins Medicine, https://www.hopkinsmedicine.org/henrietta-lacks/frequently-asked-questions (last accessed Sept. 10, 2025).
[2] Luc Rocher, Julien M. Hendrickx & Yves-Alexandre de Montjoye, Estimating the Success of Re-identifications in Incomplete Datasets Using Generative Models, 10 Nature Commc’ns 3069 (2019).
[3] See, e.g., Moore v. Regents of Univ. of California, 793 P.2d 479 (Cal. 1990); Washington Univ. v. Catalona, 490 F.3d 667 (8th Cir. 2007).
[4] 45 C.F.R. §§ 46.111(a)(4), 46.116.
[5] Agata Ferretti et al., Big Data, Biomedical Research, and Ethics Review: New Challenges for IRBs, 42 Ethics & Hum. Res. 17 (2020), https://doi.org/10.1002/eahr.500065; erratum at 42 Ethics & Hum. Res. 20 (2020).