Big Biotech Data: Implementing Large-Scale Data Processing and Analysis for Bioprocessing

View PDF

Managing large amounts of data presents biopharmaceutical companies of all sizes with the need to adopt more efficient ways to handle the ongoing influx of information. At KNect365‚Äôs September 2017 Cell and Gene Therapy conference in Boston, Lisa Graham (founder and chief executive officer of Alkemy Innovation, Inc. in Bend, OR) spoke about the need for data management, data analysis, and process monitoring systems to evolve. Although she was speaking at a cell therapy event, her points are applicable to the biopharmaceutical industry at large ‚ÄĒ which can learn a great deal overall from data management practices in other industries.

Alkemy Innovation is an engineering company delivering science-based solutions in business/engineering intelligence and strategic data acquisition and process analytical technology (PAT). In a recent eBook for BPI, Graham examined how different software and hardware applications can provide required scientific insight for success with bioprocessing in two industries: brewing and biopharmaceuticals (1). Techniques for developing both PAT and novel data analytics approaches could cross over between those two. After the conference session in Boston, we spoke about the topic of ‚Äúbig data‚ÄĚ in more depth. Highlights of our conversation follow.

Business Value of Big-Data Strategies
Revenue recovery from historian data
Upstream‚Äďdownstream data integration
Improved product quality
Minimized waste
Forecasting asset failure
Remote monitoring, predictive maintenance
Increased line availability
Optimized maintenance schedules
Data cleansing and contextualization

What We Measure and Why
Whether a company is creating cell/ gene therapies or other categories of products, your presentation emphasizes that the first step is to determine what should be measured and why. What information is important, how is that information stored for later access, and how can it be analyzed? ‚ÄúWe don‚Äôt necessarily look for a single answer regarding what software to use,‚ÄĚ Graham said. ‚ÄúInstead, we need to develop a strategy or methodology for how to analyze information. Whether we‚Äôre brewing beer or creating medicines for injection ‚ÄĒ or even creating some foods ‚ÄĒ the biotechnology industry needs an optimized PAT coupled with comprehensive data analytics. Companies should develop such a strategy after they define the product quality attributes (PQAs) that they want to understand. They can use tools to develop functionality in the chemistry or the biology at work with their processing environments.‚ÄĚ

She went on: ‚ÄúPATs provide part of that functionality, but a core important determination is how to couple PAT with a comprehensive data analytics application. Without a fundamental understanding of your bioprocess, you cannot achieve desired PQAs consistently. And whether your cells are making a protein or just increasing their own numbers, a centralized manufacturing model can be applied across the board to apply what we learn from other industries.”

Figure 1: The challenge of working with process data

How does that differ from¬†traditional approaches driven by¬†empirical data collection and business¬†needs?¬† Graham recalled, ‚ÄúBefore the¬†advent of cloud computing and our¬†ability to take advantage of high-throughput screening, the¬†biotechnology industry brought quite a¬†few PAT tools into practice: sensors¬†and other technologies for measuring¬†parameters in real time, for example.¬†We would look at the big ‚Äėlake‚Äô of data¬†we generated, then decide from that¬†what it tells us, and then figure out¬†what our insights are‚ÄĚ (Figure 1).¬†

‚ÄúBut with PAT, cloud storage, and¬†a wealth of high-throughput data,‚ÄĚ she¬†continued, ‚Äúwe approach the situation¬†in reverse: beginning with the insight¬†that we need and what information¬†will provide that for us, then using¬†that set of expectations to help us
choose our PAT. If you look at it that¬†way, then perhaps it is no longer a¬†‚Äėbig-data‚Äô problem but a matter of¬†selectively gathering the data we plan¬†to use. Of course, we don‚Äôt always¬†know what insights we‚Äôre looking for.¬†We need a way to go down that path
quickly without a lot of cutting and¬†pasting of process data into¬†spreadsheets.‚ÄĚ

Spreadsheets are still ubiquitous in¬†bioprocess development.¬† ‚ÄúYes,‚Ä̬†Graham said, ‚Äúand those can be a¬†disaster. Somebody once joked that¬†spreadsheets are like rabbits: They¬†multiply rapidly until you‚Äôre wading¬†through a sea of them, and nobody¬†knows where they are going or where¬†they went. Add to that the challenges¬†that come with collaboration,¬†compounded by how frequently people¬†change jobs in this industry, and you¬†see that this is not a sustainable¬†solution. Even many financial¬†organizations and finance officers are¬†urging a move away from¬†spreadsheets. It‚Äôs time to use¬†something better.”

So it sounds like you are¬†encouraging a change in human¬†behavior toward acceptance of a topdown approach.¬†‚ÄúYes,‚ÄĚ Graham¬†confirmed. ‚ÄúTo illustrate this, I
generally ask my audience how many¬†people have fitness programs (or think¬†they should). In a fitness program, our¬†‚ÄėPQAs‚Äô are ourselves as related to our¬†activity and our field and our genetics.¬†It‚Äôs actually analogous.

‚ÄúIn my case,‚ÄĚ she explained, ‚ÄúI¬†started a new fitness program: I went¬†out and started running, ran a few¬†laps, and came home. Then what did I¬†do? I opened up an Excel sheet to¬†keep track of my laps with some¬†details, and I continued that for a few¬†weeks. As I started to progress to¬†running one mile, then two miles,¬†then five kilometers, I wished I had¬†gathered more information in my¬†spreadsheet. There were new questions
I wanted to ask about my progress, but I couldn’t because I hadn’t bothered to record the data. And then I started to see errors in the spreadsheet because I knew I didn’t run 25 miles, but that’s what the spreadsheet indicated.

‚ÄúSo when someone says to me, ‚ÄėI¬†can use Excel with all my templating¬†to handle my data,‚Äô I know that such¬†tracking systems are unsustainable and¬†can become unreliable over time.¬†Particularly when the number of¬†experiments increases and the need to
overlay your breadth of experience¬†develops, you need a more facile¬†solution than trying to combine¬†multiple spreadsheets. Without a real¬†solution, we simply won‚Äôt ask the¬†questions that we should ask.‚ÄĚ

Recalling her own fitness example,¬†Graham pointed out, ‚ÄúFor the whole¬†time I was doing my running¬†program, I had a Fitbit device sitting next to my desk in a box. And I was¬†too hesitant to change, to open it up¬†and learn about the technology and¬†use it. So when I reached the point of¬†losing interest in keeping my¬†spreadsheet updated with all its errors,¬†I finally put the Fitbit on and ‚ÄĒ wow¬†‚ÄĒ¬†now I didn‚Äôt have to do anything¬†after I ran. I eliminated work, and I¬†could instantly see the results of what¬†I was doing. I even started analyzing¬†my progress differently because it was¬†so easy to model the data. So this¬†underscores the importance of using¬†the right tool and accepting that it¬†could require a change in behavior.¬†Effort today brings a big payoff¬†tomorrow.‚ÄĚ

Framework For A ‚ÄúSpeed-of-Thought‚ÄĚ Workflow

Manufacturing intelligence using Seeq software . . .
8:00 am ‚ÄĒ Request: A process¬†deviation or optimization study requires¬†ad hoc analysis across data silos to¬†achieve quality and yield.
9:00 am ‚ÄĒ Assemble and¬†contextualize data. Seeq software¬†connects to other process historians and¬†databases to aggregate, organize, and¬†model the required information from¬†disparate data sources and types.
9:30 am ‚ÄĒ Investigate and collaborate.¬†Advanced trending and visualization
tools, pattern and limit searches, and user-defined functions accelerate insight
with colleagues in real time.
10:15 am ‚ÄĒ Monitor and publish.¬†Insights can be shared directly with
colleagues, captured as a report or dashboard in Seeq Topics, or exported to
other tools (e.g., SIMCA, JMP, Excel).

Figure 2: A data analytics framework can provide access to discrete‚Äďtime-point, batch data sets for analysis; VCD = viable cell density.

How We Store It
So data storage and connectivity systems are needed to collect information for a customer’s project. Do those include visualization?
‚ÄúEven¬†though some programs offer¬†visualization or trending of charts,‚Ä̬†said Graham, ‚Äúunless you collect all¬†your data into databases and/or data¬†historians, you‚Äôre still missing that¬†ability to connect what you‚Äôre¬†measuring from offline analytical¬†instruments with what the online¬†sensors measure. For a complete¬†strategy, you need data storage that¬†captures the metadata about an¬†experiment, too, perhaps in an¬†electronic laboratory notebook (ELN)¬†or other application. But that is only¬†part of the solution. You may have a¬†data historian and ELNs, but without¬†those sources connected to one¬†another, you are missing the big
picture‚ÄĚ (Figure 2).

How we Analyze It
Graham explained that at the start of a project, she helps a client company take stock of what it wants to accomplish and what programs already are in place for gathering the
necessary information. Working from client spreadsheets, she can help engineers create process models and develop insight into what should be measured and how. But for longer-term solutions, she encourages companies to redefine their workflows and data-analytics strategies for gaining new insights into the products under development through iterative data interaction.

‚ÄúOne example is the Seeq data¬†analytics application with its Biotech¬†Lab Module feature,‚ÄĚ she said ( ‚ÄúThis is a batch-focused application that we use to¬†connect different data sources so that¬†pharmaceutical scientists and biologists
can perform rapid ad hoc interactive¬†analyses without needing to hire¬†additional data analytics experts. The¬†program gives everyone the necessary¬†historical information in a single¬†environment‚ÄĚ (see the ‚ÄúFramework‚Ä̬†box). ‚ÄúWe use such tools to assemble
and contextualize data, which then enables an interactive investigation and real-time collaboration with colleagues who have developed other data sets. And of course it’s important to be able to monitor, publish, and share your results. Contrasting with the
proliferating spreadsheet system, here all the information resides in a centrally available location. And analysts can continue to use sophisticated visualization software (e.g., Tableau or Spotfire) if they want to.

Case Studies: Toward Improving Business Value
The ‚ÄúBusiness Value‚ÄĚ box offers¬†examples of benefits biopharmaceutical¬†companies might realize through¬†implementing a big-data strategy.¬†Graham typically begins by helping her¬†clients assess their goals: Do they want¬†to integrate upstream with downstream¬†bioproduction? Do they need to¬†minimize the waste spent in¬†bioprocessing or set up remote¬†monitoring? She emphasizes that this is¬†not a new approach: Other industries¬†use scalable data analytics and¬†visualization software already, in some
cases (e.g., the oil and gas industry) on a far grander scale. “Can you imagine
trying to monitor a thousand oil wells¬†using spreadsheets?‚ÄĚ she asked.

‚ÄúThe first example of the approach¬†that we take is from work done with¬†Merrimack Pharmaceuticals‚ÄĚ (2, 3). The¬†company had scaled up production, and¬†all results looked good when it¬†produced its good manufacturing¬†practice (GMP) batch at 1,000-L scale.
“The team saw excellent viable growth in the cells. Everything looked wonderful. But then, the downstream group said that the protein was not high-enough quality to proceed to the clinic. That group saw aggregation, among other issues, which emphasizes the point that you can’t just judge success based on one half of your
process. Development and¬†manufacturing groups tend to be¬†fractured from each other.‚ÄĚ

To troubleshoot the problem, Merrimack needed to conduct a number of benchtop experiments. “Using a spreadsheet approach, this would have involved cutting and
pasting in Excel documents, taking¬†measurements of cell health that come¬†from one offline device and adding¬†cell environment measurements from¬†another offline device, as well as¬†adding performance for the¬†measurement of protein productivity¬†and titer, then overlaying all that with¬†real-time processing data.‚ÄĚ If all those¬†measurements are pasted into a¬†
Microsoft Excel file from thumb drives and other spreadsheets, none of
the timestamps will be aligned because they come from different times and places.

‚ÄúAnd what if you have 20¬†bioreactors?‚ÄĚ Graham posited. ‚ÄúA¬†company shouldn‚Äôt have to do all that¬†prework. The rate at which new¬†medicines are developed is¬†bottlenecked not only by the time it¬†takes for people to cut and paste in¬†spreadsheet files, but also because it is¬†too difficult to overlay, say, the past¬†six months of bioreactor information
for a certain cell line and see what can¬†be learned from those data. So people¬†are not asking the questions they need¬†to because they don‚Äôt want to be¬†immersed in manually integrating all¬†that information.‚ÄĚ

Figure 3: Integrating disparate data sets ‚ÄĒ upstream/production examples

In the case study, the upstream¬†group saw positive signs before the¬†downstream group detected the¬†problem ‚ÄĒ but troubleshooting was¬†complicated by the need to assess data¬†from six bioreactors, 14 individual¬†sample points, several offline pieces of¬†analytical equipment, and thousands¬†of online trending data points ‚ÄĒ all¬†manually handled in Excel¬†spreadsheets (Figure 3). By contrast,¬†Figure 4 charts an approach using a¬†data-analytics framework providing¬†access to discrete time-point batch-data sets. It takes the click of two or¬†three buttons to calculate what is¬†happening inside a given bioreactor¬†and/or downstream purification¬†process and export those results ‚ÄĒ¬†instead of needing to find where the¬†data are sitting, then copy and paste¬†them manually (with an inherent¬†potential for error).

Figure 4: Case study: overview of a scientist or engineer’s workflow…

‚ÄúIn this project,‚ÄĚ Graham said, ‚Äúwe¬†used offline data for viability,¬†metabolites, and titer measurements,¬†as well as on-line process data from an¬†Emerson DeltaV data historian. In¬†essence, this was a scale-down study¬†to assess the impact of media changes¬†on cell growth and productivity to¬†help us understand the science of the¬†process and enable a successful¬†technology transfer back to the 1,000-L bioreactor scale for GMP¬†manufacturing. As the scientists¬†executed the work, they captured¬†experimental data from online and¬†offline instruments, all of which were¬†read into the data analytics application¬†for analysis and sharing with¬†colleagues.‚ÄĚ

I commented about legacy issues¬†and the fact that this new approach¬†will keep all of a company‚Äôs data in a¬†much more accessible format rather¬†than forcing future engineers to go¬†back through 20-year-old¬†spreadsheets. ‚ÄúAnd the raw data are¬†preserved,‚ÄĚ Graham added, ‚Äúso you¬†can indeed go back to do whatever you¬†needed to do from a filing perspective¬†or apply lessons learned.‚ÄĚ

Her next example concerned a research focus area in which scientists routinely perform screening assays to optimize media or cell-line performance. Here the Seeq-based
Biotech Lab Module feature applied¬†with design of experiments (DoE) to¬†compare performance among different¬†media-development assays. ‚ÄúIn this¬†project,‚ÄĚ she said, ‚Äúwe used offline¬†data for viability, metabolites, and titer¬†measurements. Imagine the ability to
look at not only one assay with 24 centrifuge tubes or 96 plate wells, but also to overlay assays performed across time. A scientist can select a cell line or media component of interest, then let the application filter everything related for him or her to analyze and
then share learnings without hours¬†spent on spreadsheets.‚ÄĚ

In such an approach, metadata¬†capture is key for context of the work.¬†‚ÄúFor example,‚ÄĚ Graham said, ‚Äúan ELN¬†network or database can capture what¬†is done and why with the names of¬†project members involved, lot numbers,¬†cell-line information, observations, and¬†conclusions ‚ÄĒ all important¬†information that could be needed later.‚ÄĚ

‚ÄúAn automated system removes the¬†need for individual templates,‚Ä̬†Graham pointed out, ‚Äúcreating¬†thumbnail views for all the data that¬†team members have chosen to look at.
They can compare cell growth curves, for example, with metabolite
concentrations or titer levels from a¬†central set of experimental data.‚ÄĚ

Thinking About the Entire Data Analytics Experience
Graham concluded our conversation by noting that the biopharmaceutical industry is very interested in knowledge management right now. And it ought to address the
operational side, choosing optimal data processing and analysis technologies and ensuring scientific leadership throughout bioprocessing. She emphasized that successful knowledge management is a team effort, requiring acceptance of new
tools that streamline access to the growing amount of data accumulating throughout a product’s life cycle.

Strategies exist to address large-scale data processing and analysis needs in the realm of biopharmaceutical development. Options are available to support immediate action for bringing in solutions that can improve the pipeline now while providing flexibility for future changes. An important aspect of implementing data processing and analysis includes defining a strategy in which the scientific community can take a lead role in
strong partnership with information technology professionals.

Graham LJ. Of Microbrews and Medicines: Understanding Their Similarities and Differences in Bioprocessing Can Help Improve Yields and Quality While Reducing
BioProcess Int. eBook 8 February 2018:

2 Graham LJ. Deriving Insight at the Speed of Thought. Bioprocessing J. 16(1) 2017:
32‚Äď41; doi:10.12665/J161.Graham.

3 Barreira T. Merrimack Pharmaceuticals, When Novel Molecules Meet Bioreactor
Realities: A Case Study in Scale-Up Challenges and Innovative Solutions. ISBioTech Conference: Virginia Beach, VA, 12 December 2016.

Further Reading
Brown F, Hahn M. Informatics¬†Technologies in an Evolving R&D Landscape.¬†BioProcess Int. 10(6) 2012: 64‚Äď69;

Chviruk B, et al. Transforming Deviation¬†Management. BioProcess Int. 15(8) 2017: 12‚Äď17;¬†

Fan H, Scott C. From Chips to CHO Cells: IT Advances in Upstream Bioprocessing.
BioProcess Int. 13(11) 2015: 24‚Äď29;

Moore C. Harnessing the Power of Big Data to Improve Drug R&D. BioProcess Int. 14(8) 2016: 64;

Rekhi R, et al. Decision-Support Tools for Monoclonal Antibody and Cell Therapy
Bioprocessing: Current Landscape and Development Opportunities.
BioProcess Int.
13(11) 2015: 30‚Äď39;

Rios M. Analytics for Modern Bioprocess Development. BioProcess Int. 12(3) 2014: insert;

Sinclair A, Monge M, Brown A. A¬†Framework for Process Knowledge¬†Management. BioProcess Int. 10(11) 2012:¬†22‚Äď29;

Snee RD. Think Strategically for Design of Experiments Success. BioProcess Int. 9(3)
2011: 18‚Äď25;

Whitford W. The Era of Digital Biomanufacturing. BioProcess Int. 15(3) 2017:

S. Anne Montgomery is cofounder and editor in chief of BioProcess International, Lisa J. Graham, PhD, PE, is founder and chief executive officer of Alkemy Innovation, 63527 Boyd Acres Road, Bend, OR, 97701; 1-541-408-4175;