Wednesday, July 23, 2025

Learning to Fish in Data Lakes

The following is a guest article by Sujay Jadhav, Chief Executive Officer at Verana Health

Life sciences companies are continually searching for a way to improve the odds in the costly, high-stakes quest that is clinical research. One massive resource: the growing volume of data generated across the healthcare spectrum, holding potential clues pointing to research breakthroughs.

More data, in theory, means more clues. What if we assembled a LOT of data?

As research organizations do just that, “data lakes” are playing an increasing role in the life sciences industry. Driven by advancements in cloud technology, data lakes act as a central repository, enabling the collection, storage, and analysis of petabytes of data in its raw, native format.

But without proper data governance and a coherent strategy, a data lake can devolve into a data swamp. Hoping for fish to jump into your boat isn’t a strategy. Turning big data into big insights requires actionable analytics.

Bringing Intelligence to the Lake 

Unlike a data warehouse (which typically contains processed, structured data), a data lake holds both structured and unstructured data. This allows organizations to amass the rich, unwieldy information from real-world data (RWD) sources beyond clinical trials, such as claims and billing data, electronic health records (EHRs), clinical notes, radiology, imaging, and more.

While data lakes can be crucial for managing this diverse pool of information, they must be paired with business intelligence to generate meaningful insights that drive informed decision-making, efficient clinical operations, and improved patient care.

Finding actionable insights is driven as much by smart humans as it is by actual data. To support researchers, life sciences companies need sophisticated data analysis tools and techniques to navigate massive data lakes.

Specialized Tools: Navigating to Insight

With the right resources, researchers can begin to make sense of RWD from various healthcare settings and sources. This RWD can be critical for filling the knowledge gaps left by clinical study data alone, offering an understanding of disease progression, diverse patient populations, and long-term patient outcomes.

Artificial intelligence (AI), machine learning (ML), and natural language processing (NLP) techniques have become critical tools for facilitating the curation of real-world evidence (RWE) from vast amounts of unstructured data such as clinical notes contained in EHRs. Meanwhile, tokenization has emerged as a valid solution to link RWD from disparate data sets while protecting patient confidentiality and complying with health information regulations.
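To make the tokenization idea concrete, here is a minimal sketch in Python. It assumes a simple keyed-hash scheme: direct identifiers are normalized, hashed with a secret key, and dropped, so two data sets can be joined on the resulting token without exchanging names or birth dates. The key, field names, and sample records are all hypothetical, and production tokenization services are considerably more involved (handling typos, key management, and regulatory certification, for example).

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be managed by a trusted party.
SECRET_KEY = b"example-key-not-for-production"

# Assumed names of the direct-identifier fields in each record.
IDENTIFIER_FIELDS = ("first_name", "last_name", "birth_date")


def make_token(first_name: str, last_name: str, birth_date: str) -> str:
    """Derive a deterministic keyed hash from normalized identifiers."""
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, birth_date)
    )
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenize(records):
    """Replace direct identifiers with a token, keeping the remaining fields."""
    tokenized = []
    for rec in records:
        token = make_token(*(rec[field] for field in IDENTIFIER_FIELDS))
        rest = {k: v for k, v in rec.items() if k not in IDENTIFIER_FIELDS}
        tokenized.append({"token": token, **rest})
    return tokenized


def link(left, right):
    """Join two tokenized data sets on their shared token."""
    right_by_token = {rec["token"]: rec for rec in right}
    return [
        {**rec, **right_by_token[rec["token"]]}
        for rec in left
        if rec["token"] in right_by_token
    ]


# Made-up sample records from two hypothetical sources.
ehr = [{"first_name": "Ana", "last_name": "Lopez", "birth_date": "1970-01-02",
        "diagnosis": "condition-a"}]
claims = [{"first_name": " ana", "last_name": "LOPEZ ", "birth_date": "1970-01-02",
           "claim_code": "C-001"}]

linked = link(tokenize(ehr), tokenize(claims))
```

In this sketch the two records link despite differences in case and whitespace, and the linked record carries the diagnosis and claim code but no name or birth date.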

In light of these advances, it’s important to remember that AI-driven models are only as good as the data that powers them. Underlying the successful generation of RWE from unstructured data is a deliberate, clinically informed approach.

It takes a team of clinicians, nurses, clinical informaticians, data scientists, epidemiologists, biostatisticians, and engineers working together to effectively curate and standardize data while retaining its original clinical context. And harmonized data (integrated structured and unstructured data) and models must be continuously refined to prevent bias and maintain accuracy.

Done correctly, these processes can provide researchers with access to high-quality, curated datasets specific to a therapeutic area and disease indication.

Reaping the Harvest

Within the digital records of doctors’ visits, lab results, and treatment histories lies a wealth of information to advance clinical trial design and execution.

When data scientists figure out where the big insights are and how to build the data pipeline that turns raw, unstructured data into business insights, the application of RWD offers a range of benefits to clinical researchers, patients, and healthcare professionals.

By analyzing historical recruitment success, patient demographics, and disease burden, researchers can identify high-performing clinical trial sites and speed protocol optimization and site selection. They can evaluate trial-eligibility criteria, match eligible patients, and recruit potential participants to clinical trials.
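As a rough illustration of the kind of analysis described above, the Python sketch below ranks sites by historical enrollment rate and screens patients against simple eligibility criteria. The record layouts, field names, criteria, and numbers are all hypothetical; a real analysis would draw on far richer curated data and clinically reviewed protocol criteria.

```python
from dataclasses import dataclass


@dataclass
class SiteHistory:
    site_id: str
    enrolled: int      # patients enrolled across past trials
    months_open: int   # total months the site spent recruiting


def enrollment_rate(site: SiteHistory) -> float:
    """Patients enrolled per recruiting month (0.0 if the site never recruited)."""
    return site.enrolled / site.months_open if site.months_open else 0.0


def rank_sites(sites):
    """Order sites from highest to lowest historical enrollment rate."""
    return sorted(sites, key=enrollment_rate, reverse=True)


def is_eligible(patient: dict, min_age: int, max_age: int, diagnosis: str) -> bool:
    """Check a patient against a simplified age range and required diagnosis."""
    return min_age <= patient["age"] <= max_age and diagnosis in patient["diagnoses"]


# Made-up historical site data.
sites = [
    SiteHistory("site-a", enrolled=12, months_open=6),
    SiteHistory("site-b", enrolled=30, months_open=10),
    SiteHistory("site-c", enrolled=5, months_open=5),
]
ranked_ids = [site.site_id for site in rank_sites(sites)]

# Made-up patient records.
patients = [
    {"id": "p1", "age": 67, "diagnoses": {"condition-a"}},
    {"id": "p2", "age": 45, "diagnoses": {"condition-a"}},
    {"id": "p3", "age": 70, "diagnoses": {"condition-b"}},
]
eligible_ids = [
    p["id"] for p in patients
    if is_eligible(p, min_age=60, max_age=80, diagnosis="condition-a")
]
```

Enrollment per recruiting month is only one possible performance measure; screen-failure rates, retention, or population diversity could be weighted in the same way.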

This increases efficiency, leads to shorter timelines, and improves patient access to research. Data-driven trials informed by RWD start with a stronger foundation, potentially avoiding mismatched enrollment, unexpected side effects, and costly delays that plague traditional trials.

RWD also helps researchers understand the natural progression of disease, compare the effectiveness of different treatments, learn how treatments are used in practice, and gain insight into commercial applications.

Recent FDA guidance and a growing range of use cases have accelerated RWE usage. Increasingly, RWE is finding its way into regulatory submissions for new products. Armed with RWE, sponsors have compelling, complementary data to augment randomized controlled trials (RCTs), enabling them to accelerate the development of innovative treatment approaches, including discovering new indications for approved therapies.

Capturing clinical data before, during, and after a trial enables researchers to obtain a greater breadth and depth of information. RWD fills in insights that a trial doesn’t capture but that are essential to understanding a therapy’s impact. It can also improve interpretability and generalizability, for example by remediating missing data or losses to follow-up, extending follow-up beyond trial closeout, and characterizing the applicability of trial results to under-represented groups.

Data lakes provide a powerful and flexible platform for storing and processing the massive volumes of RWD relevant to modern research. With a skillful, deliberate approach, this disparate data can be transformed into RWE, driving better-informed decisions and improving healthcare practices.

About Sujay Jadhav

Sujay Jadhav is the Chief Executive Officer at Verana Health, where he is helping to accelerate the company’s growth and sustainability by advancing clinical trial capabilities, data-as-a-service offerings, medical society partnerships, and data enrichment. Sujay joins Verana Health with more than 20 years of experience as a seasoned executive, entrepreneur, and global business leader. Most recently, Sujay was the Global Vice President, Health Sciences Business Unit at Oracle, where he ran the organization’s entire product and engineering teams. Before Oracle, Sujay was the CEO at cloud-based clinical research platform goBalto, where he oversaw the acquisition of the company by Oracle. Sujay is also a former executive for the life sciences technology company Model N, where he helped to oversee its transition to a public company. Sujay holds an MBA from Harvard University and a bachelor’s degree in electronic engineering from the University of South Australia.



