Friday, March 27, 2026

From Programmer to Director: My 25-Year Journey into the Heart of Data at Dana-Farber

The following is a guest article by Doug Buell, retired Technical Director, Research Computing Services at Dana-Farber Cancer Institute.

When I walked into Dana-Farber, a nonprofit cancer treatment and research center based in Boston, 25 years ago, I didn’t think of myself as a “data person.” I was hired as a computer programmer – someone who made systems work, fixed what broke, and built tools when nothing existed.

Over time, the work revealed something deeper: data isn’t the byproduct of science. It is the science.

Back then, data lived in scattered documents across shared drives, desktops, and even floppy disks. Clinical research teams were using Adobe Acrobat to link protocol files together. The master file was constantly overwritten, connections were lost, and no one trusted what they were looking at.

My first major assignment was to bring order to that chaos. We built a web-based system to organize hundreds of protocol documents, forms, and amendments, making them accessible and consistent. That project taught me an early lesson in stewardship: if you don’t control your information, it controls you.

Evolving from Developer to Data Leader

For years, I stayed close to the code – building applications, managing clinical trial systems, supporting administrative data needs. The real shift came when I stepped into leadership and was asked to guide the institute’s research data ecosystem.

That’s when the full scope of the problem became clear. Data wasn’t centralized, tagged, cataloged, or consistently backed up. Dozens of aging servers held billions of files across multiple petabytes, with little redundancy beyond RAID. A single catastrophic event could have erased years of irreplaceable research.

When a power outage exposed that vulnerability, we faced a hard truth: storage isn’t strategy. We needed a comprehensive data management approach.

Building a Modern Data Infrastructure

My goal became straightforward: make research data discoverable, resilient, secure, and trustworthy.

We rebuilt from the hardware up, adopting a three-tier model:

  • Hot copy for day-to-day research access
  • Warm copy for operational recovery
  • Cold copy for long-term preservation
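To make the tiers concrete, here is a minimal sketch of how such a policy might decide where a file belongs, keyed off last-access age. This is illustrative only – the thresholds, paths, and tier names are assumptions, not our production rules:

    from datetime import datetime, timedelta
    from pathlib import Path

    # Hypothetical thresholds; a real policy would be tuned per lab and project.
    WARM_AFTER = timedelta(days=30)    # untouched for a month -> warm copy
    COLD_AFTER = timedelta(days=365)   # untouched for a year  -> cold copy (tape)

    def classify_tier(path: Path) -> str:
        """Return 'hot', 'warm', or 'cold' based on the file's last access time."""
        age = datetime.now() - datetime.fromtimestamp(path.stat().st_atime)
        if age >= COLD_AFTER:
            return "cold"
        if age >= WARM_AFTER:
            return "warm"
        return "hot"

    # Example: report how an existing directory tree would be tiered.
    if __name__ == "__main__":
        for f in Path("/data/research").rglob("*"):
            if f.is_file():
                print(classify_tier(f), f, sep="\t")

A production policy weighs more than age – project status, retention requirements, researcher intent – but simple age-based rules like this are a common starting point.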

We explored AWS S3 Glacier Deep Archive early on, but hidden retrieval costs and multi-day delays made it impractical for research environments that require responsiveness. Tape – surprisingly – emerged as our most reliable and cost-effective long-term solution. Multiple petabytes on-premises, fast retrieval, predictable costs, and no tolls to access our own data.

Sometimes the future looks like a smarter version of the past.

Mediaflux and the Power of Metadata

We had been using Mediaflux to manage our initial tape system, and it became the backbone of our file management strategy. It allowed us to track data across environments and harvest rich metadata.

But we made an early mistake: we treated Mediaflux like a newer version of our old system instead of leveraging its full tagging and cataloging capabilities. That limited our ability to find and interpret data after the fact.

Which brings me to one of my strongest convictions: Metadata must be created at birth.

Trying to tag data retroactively is nearly impossible at scale. We still have hundreds of reel-to-reel tapes from the 1970s stored off-site. We know they contain data – but no one knows what data. Without metadata, they’re essentially useless.

The same problem applies today. New instruments generate multi-terabyte datasets in hours. Without tagging at creation, neither researchers nor AI systems can meaningfully interpret what they’re looking at.

Some research teams began experimenting with scripts that write meaningful metadata the moment a file is created. It isn’t perfect, but it’s the right direction. True data visibility requires metadata from day one.
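As a rough illustration of what those scripts do, here is a minimal sketch using the open-source watchdog library to write a JSON “sidecar” record the moment a file lands. The watched directory, instrument name, and metadata fields are hypothetical stand-ins:

    import json
    import time
    from datetime import datetime, timezone
    from pathlib import Path

    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    WATCH_DIR = "/data/instruments"  # hypothetical instrument output directory

    class MetadataAtBirth(FileSystemEventHandler):
        """Write a JSON 'sidecar' describing each new file the moment it appears."""

        def on_created(self, event):
            if event.is_directory:
                return
            src = Path(event.src_path)
            if src.name.endswith(".meta.json"):
                return  # don't generate sidecars for our own sidecar files
            record = {
                "file": src.name,
                "created_utc": datetime.now(timezone.utc).isoformat(),
                "instrument": "sequencer-01",  # illustrative; read from real config
                "project": "unassigned",       # researchers set this at capture time
            }
            Path(str(src) + ".meta.json").write_text(json.dumps(record, indent=2))

    if __name__ == "__main__":
        observer = Observer()
        observer.schedule(MetadataAtBirth(), WATCH_DIR, recursive=True)
        observer.start()
        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            observer.stop()
        observer.join()

The specific tool matters less than the principle: the metadata is captured automatically at creation, while the context still exists, rather than reconstructed years later.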

The Human Challenge

The biggest obstacle was never technical. It was human.

Researchers treat their data like gold – and they should. When they didn’t fully trust central systems, they kept private copies. Sometimes dozens of them. After several power outages, some labs even built their own mini data centers rather than rely on shared infrastructure.

Rebuilding trust takes years. Losing it takes minutes.

Fragmented data becomes invisible data. And if we can’t see it, we can’t protect it, catalog it, or prepare it for the future – especially a future driven by AI.

Preparing for an AI-Driven World

My philosophy on AI is simple: AI is only as good as the data it learns from.

If AI is going to accelerate cancer research, then:

  • The data must be accurate and timely
  • The metadata must be meaningful
  • The infrastructure must scale
  • The expertise to curate it must grow

One of the most important emerging roles in research is the data librarian – professionals who understand both science and information architecture.

AI can help predict failures, optimize storage, and streamline operations. But it must be fed truth, not noise.

What I Leave Behind

As I retired, there was a quiet symmetry to my final major decision: purchasing a new tape library from Spectra Logic.

This wasn’t the tape system of decades past – cartridges on the floor and manual handling. It was automated, intelligent, scalable to virtually limitless capacity, dramatically more energy-efficient than disk solutions, and fully air-gapped for privacy.

It delivered an on-premises Glacier-like solution – but on our terms. Predictable costs. Fast access. No penalties to retrieve our own data.

The lesson is familiar: old ideas don’t disappear. They evolve.

The challenges ahead are real. Data growth is exponential. Instruments generate larger files every year. Storage needs double rapidly. But the cultural challenge is just as significant: encouraging researchers to tag, curate, and trust.

My successors must balance both – building resilient infrastructure while helping researchers understand how their data becomes searchable, reusable, and foundational for breakthroughs.

What makes this work exciting is that it never stops changing. New data types, new storage models, new computing platforms, new discovery tools – they arrive constantly. Through it all, one truth remains: Data only becomes powerful when it is managed with intention.

That was my journey. And it’s a journey that will continue long after I’m gone.

About Doug Buell 

Doug Buell is a seasoned technology leader and retired Technical Director for Research Computing Services in the Informatics & Analytics division at Dana-Farber Cancer Institute, where he provided advanced computing solutions and user support to the research community. He previously held senior roles in data integration and application support at the University of Massachusetts Medical School and Partners HealthCare, contributing to large-scale research IT operations. With decades of experience at the intersection of scientific research and computing, Doug combines deep technical expertise with a collaborative approach to problem-solving. He resides in Massachusetts and remains engaged with technology and research communities.


