Many of you shared this article with me recently, saying this group in China has been able to store 36 petabytes of data.
At first glance it seems cool. But I dug into the research paper.
They haven’t stored 36 petabytes of data. The paper says they have created a medium that can store 36 petabytes. The key word here is CAN.
They have actually stored only 156.71 KB and extrapolate the potential of their medium to 36 petabytes if they increase the length of the tape.
How does the system work?
They have created a system similar to a cassette tape using nylon.
They then create compartments in the cassette by alternating between hydrophobic and hydrophilic sections. Hydrophilic (literally meaning water-loving) regions are where DNA can be deposited. Hydrophobic (literally water-hating) regions separate the hydrophilic regions from each other, so that data from one region can be selectively read out without disturbing the others. A barcode (exactly like the ones you see on products you buy from the supermarket) is printed on each section. This acts as an address marker. Note, this is a physical barcode and not a DNA barcode, so it can be read out using an optical scanner similar to the ones used in supermarkets.
i) Writing data in
First, a short DNA sequence of 20 nucleotides (i.e. fundamental units of DNA - A, T, G, C) is immobilized in each compartment. This is referred to as a handle sequence. The handle sequence in each compartment is different.
Now the files to be stored are synthesized (using chemical methods), with an extra short sequence at the 5’ end that is complementary to the handle sequence. Complementarity is the basic principle by which double-stranded DNA is formed. An A on the first strand always binds to a T on the second (and vice versa). Similarly, a C on the first strand always binds to a G on the second (and vice versa).
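To make the pairing rule concrete, here is a minimal Python sketch. The 20-nucleotide handle below is made up for illustration and is not taken from the paper, and the function names are my own.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the strand that pairs with seq, written in the 5'->3' direction."""
    return "".join(PAIR[base] for base in reversed(seq))

def binds(adapter: str, handle: str) -> bool:
    """True if the adapter is the exact reverse complement of the handle."""
    return adapter == reverse_complement(handle)

# Hypothetical handle sequence (assumption, not from the paper)
handle = "ATGCGTACCTGAACGTTAGC"
adapter = reverse_complement(handle)

print(adapter)
print(binds(adapter, handle))
```

An adapter built this way only matches its own handle, which is why each compartment having a different handle lets a file land in the right place.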
So when the synthesized DNA is added to the specific section of the cassette where it should bind, the short sequence at the end (called an adapter) binds to the handle sequence. And then a PCR mix is added to extend the handle sequence so that the entire thing becomes a double stranded DNA, with the lower strand (3’-5’) bound to the nylon and the upper strand (5’-3’) bound by hydrogen bonds to the lower strand because of complementarity. PCR is a process by which multiple copies of a DNA are made. Here is an explainer video if you want to learn more about PCR.
Data in this case is stored in DNA sequences 100 nucleotides long. So if you need to store a file of 1 KB (and each hydrophilic section stores one file), this is what you would need to do. Assuming each base stores 2 bits, one 100 nucleotide sequence stores 200 bits. 1 KB = 1024 bytes = 8192 bits. So we would need 8192/200 = 40.96, i.e. 41 sequences of 100 nucleotide length to store this data. So you synthesize each of these 41 sequences with the adapter, add them to the mix, and when the cassette is dipped in the solution, they bind to the specific hydrophilic region. This is how data is written in.
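The arithmetic above can be wrapped in a small Python helper so you can try other file sizes. This is a back-of-the-envelope sketch under the same assumptions as the text (2 bits per base, 100 data-carrying nucleotides per sequence). It ignores any overhead for addressing or error correction, and the names are my own.

```python
import math

BITS_PER_BASE = 2        # assumption from the text: each of A, T, G, C carries 2 bits
PAYLOAD_LENGTH_NT = 100  # data-carrying nucleotides per sequence

def sequences_needed(file_size_bytes: int) -> int:
    """Number of 100-nucleotide sequences needed to hold a file of the given size."""
    bits = file_size_bytes * 8
    bits_per_sequence = BITS_PER_BASE * PAYLOAD_LENGTH_NT
    return math.ceil(bits / bits_per_sequence)

print(sequences_needed(1024))  # 1 KB file -> 41
```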
The DNA is then protected by depositing a metal organic framework (of recent Nobel Prize fame) to prevent exposure to moisture and oxygen.
I am skipping the specifics of the chemicals and reactions involved here. If you are curious, feel free to check out the paper.
ii) Reading data out
In the paper, the authors have stored 5 different files in 5 compartments of the cassette. When they want to retrieve file 1 for instance, they insert the cassette into the cassette reader they have built, connect this cassette reader to their computer, and enter the file they want. The cassette starts rotating, and stops at the point where the reader identifies the right barcode for the file.
A mini-enclosure is then created to enclose the specific hydrophilic region associated with the barcode of the desired file. This is where their engineering prowess comes in. They have designed the compartment in such a way that all the reagents needed to dissolve the protective cover and to separate the upper strand (5’-3’) from the lower strand are accessible through different valves in the same fluidic system. The strands are separated by breaking the hydrogen bonds between them using sodium hydroxide (NaOH). You can imagine this like having different taps connected to the same water supply in your house.
Once the upper strand goes into the solution, the cassette is removed from it, the solution is collected, and the DNA is separated and purified from the solution and then sequenced using a separate sequencer. When you initially read the paper, you might wrongly assume that the reading of data is happening within the cassette-reader system. But that’s not the case. You still need a separate reader.
Which sequencer do they use?
MGI DNBSEQ-T7. This sequencer is developed and made in China. More on how this sequencer works, and how it compares to other sequencing systems in a different blog soon.
Once the upper strand goes into solution, the PCR mix is added to regenerate the upper strand by using the handle sequence on the lower strand as the template. That way, the data is regenerated each time you retrieve it. And then you put in the protective coating as before.
The authors were able to achieve 10 read cycles with 100% accuracy. They were able to store 1 text file of 0.11 KB and four image files of 51.6, 24.8, 46.9 and 33.3 KB respectively.
iii) Erasing and Rewriting data
Now what happens if you want to erase the data in a compartment and rewrite it with a different piece of data?
So here you would need to keep the handle sequence as it is, and cut out the remaining part of the sequence that has the data that was initially stored. You can use an enzyme called a restriction endonuclease to achieve this (note: all files are synthesized with a recognition site for the Mbo I restriction endonuclease, ignore this if you don’t understand what this means). Once the cutting and clearing are done, the new data with an adapter complementary to the handle sequence can be added.
The authors performed this operation once, and they had reasonable success with it.
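For those who want a mental model of the cut, here is a toy Python sketch. Mbo I recognizes the sequence GATC and cuts just before the G. Everything else here is an assumption for illustration: the handle and data sequences are made up, the handle-site-data layout is my simplification of the paper's construct, and a real enzyme acts on double-stranded DNA rather than on a string.

```python
MBOI_SITE = "GATC"  # Mbo I recognition sequence; the cut falls just before the G

def erase_payload(strand: str) -> str:
    """Keep everything upstream of the first Mbo I site and drop the rest."""
    cut = strand.find(MBOI_SITE)
    if cut == -1:
        raise ValueError("no Mbo I site found; nothing to cut")
    return strand[:cut]

# Hypothetical sequences (assumptions, not from the paper)
handle = "ATGCGTACCTGAACGTTAGC"
old_data = "TTACGGCATA"

strand = handle + MBOI_SITE + old_data
kept = erase_payload(strand)

print(kept == handle)
```

After the cut only the handle is left on the tape, ready to capture a newly synthesized file with the matching adapter.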
iv) Data Reconstruction
This is done using an algorithm called DNA Fountain. I will write in detail about this in another blog soon.
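As a rough preview, the core idea of a fountain code can be sketched in a few lines of Python: the file is split into segments, each "droplet" is the XOR of a subset of segments, and a peeling decoder recovers the segments one at a time. This is a toy with hand-picked subsets, not the paper's implementation. The real DNA Fountain algorithm picks the subsets pseudo-randomly from a seed carried in each sequence and screens out droplets whose DNA would be hard to synthesize or sequence.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_droplet(segments, indices):
    """A droplet is the XOR of the chosen segments, plus the record of which ones."""
    payload = reduce(xor_bytes, (segments[i] for i in indices))
    return set(indices), payload

def decode(droplets, n_segments):
    """Peeling decoder: resolve any droplet that covers exactly one unknown segment."""
    known = {}
    progress = True
    while progress and len(known) < n_segments:
        progress = False
        for indices, payload in droplets:
            unknown = indices - set(known)
            if len(unknown) == 1:
                # Strip the already-known segments out of the droplet
                for i in indices & set(known):
                    payload = xor_bytes(payload, known[i])
                known[unknown.pop()] = payload
                progress = True
    return [known.get(i) for i in range(n_segments)]

# Three equal-length segments of a tiny "file"
segments = [b"DNA ", b"on t", b"ape!"]

# Hand-picked subsets (assumption; a real encoder draws these at random)
droplets = [make_droplet(segments, idx) for idx in ([0], [0, 1], [1, 2])]

recovered = decode(droplets, len(segments))
print(b"".join(recovered))
```

The appeal for DNA storage is that no single droplet is essential: if some sequences are lost, enough other droplets can still reconstruct the file.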
A few notes on this paper:
They have used a 15-meter-long tape to store 156.71 KB, and calculate that a 1 km long tape would be able to store 36 petabytes.
Another claim they make is about the time taken to encode data and retrieve it: 25 minutes if you are writing for the first time and 50 minutes if you are erasing and re-writing. This is misleading, because it does not include the time taken for DNA synthesis (which was done through an external vendor), DNA purification, sequencing and computational analysis. The entire cycle would take at least 20 hours.
The only real innovation in this paper is the building of the cassette and cassette reader, which offers a good electro-mechanical system to separate different files and retrieve selectively, as opposed to trying to retrieve data from an oligo pool (fancy word for a mix of short DNA sequences in a single solution). Constructing fluidic systems of this precision controlled by computers is by itself an admirable feat, but this is not an end-to-end DNA data storage system.
The cost of synthesis remains a big bottleneck. I believe this is why they didn’t extend their work to the 1 km tape and store 36 petabytes. In this system, if you need to rewrite data, you need to re-synthesize it.
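The gap between what was stored and what is claimed can be put in numbers. The Python sketch below sums the five file sizes quoted earlier and scales that payload linearly from 15 m to 1 km of tape. The linear scaling and the binary unit conversion (1 KB = 1024 bytes) are my assumptions, not the paper's method.

```python
# File sizes in KB, as quoted from the paper above
file_sizes_kb = [0.11, 51.6, 24.8, 46.9, 33.3]
stored_kb = sum(file_sizes_kb)

demo_tape_m = 15      # tape length used in the demonstration
target_tape_m = 1000  # 1 km tape used for the capacity claim

# Naive linear scaling of what was actually written
scaled_kb = stored_kb * target_tape_m / demo_tape_m
scaled_mb = scaled_kb / 1024

# Claimed capacity, converted from PB to KB with binary prefixes (assumption)
claimed_kb = 36 * 1024**4

print(f"stored:        {stored_kb:.2f} KB")
print(f"scaled to 1km: {scaled_mb:.1f} MB")
print(f"claim / scaled: {claimed_kb / scaled_kb:.1e}")
```

Scaling the demonstrated payload by length alone gives roughly 10 MB per kilometer, which is billions of times short of 36 petabytes. That suggests the headline figure rests on the theoretical capacity of each compartment rather than on anything that was actually written.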
*********
Some updates:
BioCompute has won the CXXO Deep Tech Disruptor Award which comes with a 5 lakh ($5000) grant. I will be in Delhi on 30th October at the Kalaari Capital CXXO Summit to share what we are building and receive the award. I am in Delhi till 1st November EOD, if you want to catch up please reach out.
Note: Just because you reached out, I don’t have an obligation to respond or meet. You are extending an invitation and I can choose to decline.
We have been offered a fully-funded opportunity to present our work at the ‘MoleculArXiv Autumn School on DNA Data Storage’ being held at the Institut d’Études Scientifiques de Cargèse (IESC) in France. We are the only non-academic lab to be selected. We are still planning our trip. If you are in France and would love to catch up, drop me a note, especially if you are a researcher, in academia or otherwise.
We FINALLY got our automation system. We are setting it up and getting it ready to zoom us into scale up.
Image Description: Our intern Franci setting up the system



