[3dem] Advice on storage server

Kikuti Carlos Carlos.Kikuti at curie.fr
Thu Feb 15 00:08:13 PST 2024


We are relatively new to SPA and are also keeping drives on shelves for the moment… I’m considering the following strategy:

  1.  Movies corresponding to published structures: ship to EMPIAR, delete from local disks unless we intend to reprocess data in the short term;
  2.  Output of the data-processing pipeline (only the jobs that lead to the good maps, or to relevant observations): keep on a drive on the shelf – even after job selection this can take a few TB per dataset – mirror two disks for safety;
  3.  Movies corresponding to unpublished but important structures: keep on a drive on the shelf – mirror two disks for safety;
  4.  Movies corresponding to bad collections or bad samples from which we could never get anything useful: delete.

My doubts are:
For point 1, how hard is it to get everything uploaded? Any reasons not to do it?
On point 2, I usually keep maps, extracted particles and aligned movies, but maybe I only need to keep the particle locations and a detailed description of the pipeline besides the maps? Then it would only take a few MB, and I could place them in our eLabFTW. The major problem here is that selecting the jobs already takes a while, and there is no reliable way to do it automatically.
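
To make that concrete, something like the following minimal sketch (Python) is what I have in mind, assuming a RELION/cryoSPARC-style project where coordinates and job parameters live in small metadata files; the paths, extensions and size cut-off below are placeholders, not recommendations:

# Sketch: copy only the small pipeline metadata (coordinates, job parameters,
# logs) of a processing project, skipping the large binaries (movies,
# micrographs, particle stacks). Paths and size threshold are illustrative only.
import shutil
from pathlib import Path

PROJECT = Path("/data/my_dataset/processing_project")   # hypothetical project
ARCHIVE = Path("/archive/my_dataset_metadata")           # hypothetical destination

KEEP_SUFFIXES = {".star", ".json", ".txt", ".log", ".xml", ".cs", ".csg"}
MAX_SIZE = 50 * 1024 * 1024  # skip anything above ~50 MB, just in case

for src in PROJECT.rglob("*"):
    if src.is_file() and src.suffix.lower() in KEEP_SUFFIXES and src.stat().st_size <= MAX_SIZE:
        dst = ARCHIVE / src.relative_to(PROJECT)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 preserves timestamps

print("Metadata copy finished; compress ARCHIVE and attach it to the lab notebook entry.")
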
On point 4, I’m always afraid of deleting something that some fancy new software will be able to process… we often work with flexible proteins that really resist processing to high resolution, despite sometimes good contrast. But at some point one needs to make decisions… in the end, I have to admit that I’ve only ever deleted one dataset, with a cold sweat running down my spine.

Please let me know of your opinions on that.

Regarding long-term storage, I’ve been told that:

  1.  The famous tape systems have a lot of logistical drawbacks: software and hardware are updated often, and old tapes periodically need to be migrated to new formats (very time consuming, and expensive if you need to replace equipment) – places that have this kind of resource usually have a dedicated crew;
  2.  Transfer to tape is prone to errors, and nobody checks byte by byte whether the copy went fine (see the checksum sketch after this list);
  3.  Hard drives fail if they are used too much, and also if they are not used at all. So the best would be to plug them in every now and then, a bit like the old car in the garage. Not very time consuming, but one needs to remember to do it and keep track of which disk was plugged in when (mental load, as if we didn’t have enough) – and even then, there is no guarantee that they will last 10 years;
  4.  Some people are praying for the development of data storage in DNA, but I expect the copying to be extremely slow…
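
For point 2 of that list, even a very simple checksum manifest would catch silent copy errors. A rough sketch in Python (standard tools such as sha256sum -c do the same job; the command-line layout here is made up for the example):

# Write a SHA-256 manifest of a directory tree, or verify a copy against one.
# usage:  python checksums.py /path/to/original            > manifest.sha256
#         python checksums.py /path/to/copy --verify manifest.sha256
import hashlib
import sys
from pathlib import Path

def sha256(path, chunk=8 * 1024 * 1024):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_manifest(root):
    for f in sorted(p for p in root.rglob("*") if p.is_file()):
        print(f"{sha256(f)}  {f.relative_to(root)}")

def verify(root, manifest_file):
    ok = True
    for line in manifest_file.read_text().splitlines():
        digest, rel = line.split(maxsplit=1)
        if sha256(root / rel) != digest:
            print(f"MISMATCH: {rel}")
            ok = False
    print("all files match" if ok else "copy is NOT identical")

if __name__ == "__main__":
    root = Path(sys.argv[1])
    if len(sys.argv) > 3 and sys.argv[2] == "--verify":
        verify(root, Path(sys.argv[3]))
    else:
        write_manifest(root)
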

This is a very serious issue, and I only see it getting worse as we accumulate more and more data. I know I sound pessimistic, but I wish everyone a great day.

Cheers,

---------------------------------------------------
Carlos KIKUTI, PhD
UMR144 - CNRS - Institut Curie
Pavillon Trouillet Rossignol
26 Rue d’Ulm - 75005 Paris, France
carlos.kikuti at curie.fr


Message: 3
Date: Thu, 15 Feb 2024 03:57:05 +0000
From: "Ludtke, Steven J." <sludtke at bcm.edu>
To: Jobichen <jobichenc at yahoo.com>
Cc: 3DEM Mailing List <3dem at ncmir.ucsd.edu>
Subject: Re: [3dem] Advice on storage server
Message-ID: <26C5131B-164C-4C2F-A578-87D6C5797849 at bcm.edu>
Content-Type: text/plain; charset="utf-8"

I should add that for long-term backup, the most typical strategy is the convenient but unsafe "drives on a shelf". That would be a one-time purchase of ~$2k, but the chances that all of the drives still work and you can fully recover the data in 5 or 10 years may be a little marginal. Worth noting also that portable USB drives, as opposed to drives designed to be internal drives in a PC, have massively lower reliability ratings in general. Also note that SSDs lose data over time if they aren't plugged into a power source periodically for a "refresh".

---
Steven Ludtke, Ph.D. <sludtke at bcm.edu>                      Baylor College of Medicine
Charles C. Bell Jr., Professor of Structural Biology        Dept. of Biochemistry
Deputy Director, Advanced Technology Cores                  and Molecular Pharmacology
Academic Director, CryoEM Core
Co-Director CIBR Center


On Feb 14, 2024, at 8:06 PM, Ludtke, Steven J. <sludtke at bcm.edu> wrote:

If you don't expect to need to access it again, i.e. purely an emergency backup, Amazon Glacier is a cost-effective solution, as long as you have the funds to keep paying for it. 100 TB of Glacier Deep Archive storage would run about $1200/year (plus additional cost if you need to retrieve it).
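
Back of the envelope, assuming Deep Archive pricing of roughly $0.00099 per GB-month (US regions at the time of writing; check current AWS pricing before budgeting anything):

# Rough check of the ~$1200/year figure for 100 TB in Glacier Deep Archive.
tb = 100
gb = tb * 1000                       # decimal TB, as storage is usually billed
price_per_gb_month = 0.00099         # assumed Deep Archive rate, verify first
print(f"${gb * price_per_gb_month * 12:,.0f} per year")   # -> about $1,188/year
# Bulk retrieval and egress fees come on top if the data ever has to come back.
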

If you are storing it for possible additional processing, then you want the storage to be "close", in data-transfer terms, to the processing power, i.e. if you are processing in the cloud, then storing the data in the cloud makes sense. Clearly you would not want to process the data directly from cloud storage. Keep in mind the relative speeds of transfer for different devices/transfer methods:

M.2 SSD -> 2-4 GB/s
8-drive RAID array with spinning platters directly on the machine -> ~1 GB/s
SATA SSD -> 0.6 GB/s
single spinning platter on the machine -> 0.15 GB/s
gigabit network remote access -> 0.1 GB/s
less than gigabit remote access (cloud at typical institutions) -> <0.1 GB/s

For size comparison, a 4k x 4k x 1k tomogram at 8 bits is 16 GB, so opening that from an M.2 SSD might take 4-8 seconds, whereas opening the same file over a gigabit NAS would take almost 3 minutes.
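
For reference, rough transfer times for that 16 GB file at the rates listed above (real numbers will vary with controllers, caching and network load):

# Estimated time to read a 16 GB file at the sustained rates quoted above.
file_gb = 16
rates_gb_per_s = {
    "M.2 SSD": 3.0,               # middle of the 2-4 GB/s range
    "8-drive RAID (local)": 1.0,
    "SATA SSD": 0.6,
    "single spinning disk": 0.15,
    "gigabit network": 0.1,
}
for device, rate in rates_gb_per_s.items():
    print(f"{device:>22}: {file_gb / rate:6.0f} s")
# gigabit network -> 160 s, i.e. the "almost 3 minutes" above
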

Personally, I have a 12-bay Synology NAS box with a 10 Gb network card in it under my desk. With 16 TB drives and RAID6 this gives about 150 TB of usable storage space, which you can access at ~1 GB/s. Cost ~$5000, with an expected drive life of ~5 years, i.e. expect to have to replace the occasional bad drive after the first few years.

It's worth noting here that at $5000, with an expected life of ~5 years before you start having to pay for replacement drives, this works out to $1000/year with high-speed access, compared to the $1200/year for Glacier Deep Archive storage above. However, the Glacier storage has much better reliability than a single RAID6 array with no additional backup.
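
Quick sanity check of those numbers (RAID6 keeps the capacity of n-2 drives; the per-year figures are just purchase price spread over the expected drive life):

# 12 bays of 16 TB drives in RAID6, and annualized cost vs the Glacier figure.
bays, drive_tb = 12, 16
usable_tb = (bays - 2) * drive_tb          # RAID6 uses two drives for parity
print(f"usable: ~{usable_tb} TB")          # 160 TB raw, ~150 TB after overhead
nas_per_year = 5000 / 5                    # ~$5000 over a ~5-year drive life
glacier_per_year = 1200
print(f"NAS: ${nas_per_year:.0f}/yr  vs  Glacier Deep Archive: ${glacier_per_year}/yr")
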

Anyway, some food for thought  :^)

---
Steven Ludtke, Ph.D. <sludtke at bcm.edu>                      Baylor College of Medicine
Charles C. Bell Jr., Professor of Structural Biology        Dept. of Biochemistry
Deputy Director, Advanced Technology Cores                  and Molecular Pharmacology
Academic Director, CryoEM Core
Co-Director CIBR Center


On Feb 14, 2024, at 6:14 PM, Jobichen <jobichenc at yahoo.com> wrote:

Dear All,
We are looking for some suggestions on storing raw datasets/movies. What would be the best option for storing around 100 TB of movies/processed data?
What would be the pros/cons of having our own storage server vs cloud storage options?
Thank you for your time.
Jobi




------------------------------

Subject: Digest Footer

_______________________________________________
3dem mailing list
3dem at ncmir.ucsd.edu
https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem


------------------------------

End of 3dem Digest, Vol 198, Issue 21
*************************************