Systems Design | Data · 12 min read

AI Systems: The Data Debt in Biotech

I spend less time thinking about AI directly and more time thinking about the data. Specifically, the data issues that plague me day and night in biotech and at BSD (Biosolution Designs).

Everyone is obsessed with the model architecture, but look at the foundation we are building on. Papers have an insanely low replicability rate... and that's just among the tiny fraction anyone has even attempted to reproduce. In gene and cell therapy, the data collected is often filtered in any number of semi-arbitrary ways. Multiple sites exist to do just that: curate some specific subset in their own specific way, creating silos of data that are then combined and filtered ad nauseam.

Oh, and the original source files hosted on a seemingly reputable university site from 2014? 404 Not Found.

Then there’s the GenBank flatfile. Unless it’s coming directly from NCBI, it’s probably outside their spec. At BSD, the format can’t hold the information we need, so we end up with a bastardized version of it.

Luckily for us, it’s always internal and never meant to be shared, and we are figuring out ways to ditch it. But grabbing a file from the wild? Good luck ensuring your program imports it correctly. Or worse: it imports just fine, you make a few edits, export to a different program, and suddenly the file is all wonky. Was the exporting program wrong? The importing one? Who knows. The data integrity is broken.
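A cheap way to catch the worst of it before a file enters your system is a lint pass over the parts of the spec that are easy to check: the LOCUS line declares a length in bp, the ORIGIN block carries the sequence, and each record ends with a `//` terminator. This is a minimal sketch, not a real validator:

```python
import re

def lint_genbank(text: str) -> list[str]:
    """Flag basic GenBank flatfile problems before trusting an import.

    Illustrative checks only: LOCUS length vs. ORIGIN sequence length,
    presence of an ORIGIN block, and the '//' record terminator.
    """
    problems = []
    # LOCUS line: name, then length in bp.
    locus = re.search(r"^LOCUS\s+\S+\s+(\d+)\s+bp", text, re.MULTILINE)
    if locus is None:
        problems.append("missing or malformed LOCUS line")
    if not text.rstrip().endswith("//"):
        problems.append("record does not end with '//' terminator")
    if "ORIGIN" in text:
        # Sequence letters between ORIGIN and the terminator.
        seq_block = text.split("ORIGIN", 1)[1].split("//")[0]
        seq = "".join(re.findall(r"[A-Za-z]", seq_block))
        if locus and len(seq) != int(locus.group(1)):
            problems.append(
                f"LOCUS says {locus.group(1)} bp but ORIGIN has {len(seq)}"
            )
    else:
        problems.append("no ORIGIN block")
    return problems
```

It won’t catch a bastardized feature table, but it does catch the silent length mismatches that round-tripping between programs tends to introduce.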

So you rewrite it all in the new program, or keep a custom script to fix the file between programs, plus a system to reannotate, and so on. Then you do it all again when the sequencing results come back.

Oh, and you want to combine some data from UniProt and Reactome and and and... Time for a new script, and get ready to transform that data.
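That script is almost always the same shape: normalize each export, then join on accession. A toy sketch of the join, with field names (`accession`, `pathway`) invented for illustration — real UniProt and Reactome exports each need their own cleanup step first:

```python
def merge_by_accession(uniprot_rows, reactome_rows):
    """Join two hypothetical exports on UniProt accession.

    uniprot_rows: one dict per protein, keyed records with 'accession'.
    reactome_rows: one dict per (protein, pathway) pair.
    Returns the UniProt rows with an added 'pathways' list.
    """
    # Group pathway annotations by accession.
    pathways = {}
    for row in reactome_rows:
        pathways.setdefault(row["accession"], []).append(row["pathway"])
    # Left join: keep every protein, attach [] when nothing matches.
    return [
        {**row, "pathways": pathways.get(row["accession"], [])}
        for row in uniprot_rows
    ]
```

The join itself is trivial; the pain is that every source disagrees on identifiers and filtering, which is the whole point of this post.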

In come the current AI solutions. They will glue the mess together for you. Yet the underlying layer is still bad, and in some cases, as we are starting to see now, it just deals another blow to the OSS these tools rely on. Especially in the life sciences, most tools are built and abandoned after a few years. To get them to work? Install an old version of Linux with an old version of Python and...

Not to mention that ~90% of the time (in my case), the requirements don't fit any existing tool, OSS or not. Maybe these platforms will vibe-code new ones for you. These are mostly scripts; they don't run long enough for major bugs to show... I do like the idea of auditability with pipelines built this way, but I am an extremist: I think data provenance, from generation through pipelines to results, should be partially solved at the OS level.

But this data issue is why I find BoP (Bird of Prey Vector Construction) so interesting: the standardization.

It's allowing BSD to build unified software systems and data pipelines across its therapy, manufacturing, and product companies.

I think standardization in vector construction is something other companies will start to put together soon, because the value proposition is undeniable. Whatever you build and get results out of, anyone else can build from scratch and replicate. No translation errors, no "bastardized" file formats, no need to send your DNA somewhere to be held against the small chance that someone actually attempts to reproduce your results.

Imagine this: even your own team can reproduce it when one of your freezers fails... or something gets misplaced... small-company problems.

So, what does this mean for AI? It allows a stochastic model (and yes, they are all probabilistic one way or another) to be built on structured, clean data across design, construction, and functional blocks. That has been the holy grail for some time.

People like to talk about specific LLMs and experiment with applying GNNs and GANs to biology using what's currently there. And sure, sometimes great stuff comes out. But mostly... not. With structured, cell-specific data, though, even simple MLPs (multilayer perceptrons; anyone remember those?) have a massive place in this system.
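For the record, that model family really is tiny. A forward pass for a one-hidden-layer MLP with tanh activation fits in a few lines of plain Python; the weights below are placeholders, and a real system would of course learn them from that clean, structured data:

```python
import math

def mlp_forward(x, w1, b1, w2, b2):
    """One hidden layer, tanh activation: the entire 'architecture'.

    x: input feature vector; w1/b1 and w2/b2 are the layer weights
    and biases (illustrative placeholders, not trained values).
    """
    hidden = [
        math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return [
        sum(wi * hi for wi, hi in zip(row, hidden)) + b
        for row, b in zip(w2, b2)
    ]
```

The point isn't that MLPs are exciting; it's that once the inputs are standardized and clean, you don't need an exciting model to get signal.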

Without clean data... well, there’s a reason so many AI therapeutic companies fail, and why Eli Lilly seems to be doing it right.