[3dem] Advice on storage server

Takanori Nakane tnakane.protein at osaka-u.ac.jp
Thu Feb 15 00:45:37 PST 2024


Hi,

 >  1. Movies corresponding to published structures: ship to EMPIAR, delete
 >     from local disks unless we intend to reprocess data in the short term;
 > For point 1, how hard is it to get everything uploaded?

This depends on your network connection.

Around 2020, uploads from MRC-LMB to EBI were very fast,
reaching 250 MB/sec (not megabits, megabytes!).
Uploads from Japan to EBI are not as fast, but we can still transfer
one entry (say 5 TB) in two or three days.
Those in Japan with a poor internet connection can send their HDDs
to us (PDBj/EMPIAR Japan) and we will upload the files on their behalf:
https://empiar.pdbj.org/en/deposition/
I wish EBI-EMPIAR offered the same mail-in deposition
service, but they said it is impossible due to staff shortages and
strict security rules at EBI.
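To put rough numbers on this, transfer time is just data size divided by sustained rate. A back-of-the-envelope sketch (my own illustration; `transfer_days` is a made-up helper, and real transfers add protocol overhead):

```python
# Back-of-the-envelope upload-time estimate (ignores protocol overhead).
def transfer_days(size_tb: float, rate_mb_per_s: float) -> float:
    """Days needed to move size_tb terabytes at rate_mb_per_s megabytes/sec."""
    seconds = size_tb * 1_000_000 / rate_mb_per_s  # 1 TB = 1,000,000 MB
    return seconds / 86_400  # 86,400 seconds per day

print(f"{transfer_days(5, 250):.2f}")  # a 5 TB entry at 250 MB/s: ~0.23 days
print(f"{transfer_days(5, 25):.1f}")   # the same entry at 25 MB/s: ~2.3 days
```

At a sustained 25 MB/s, a 5 TB entry indeed lands in the "two or three days" range mentioned above.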

I wouldn't delete raw data (even after uploading to EMPIAR).
Considering the cost of protein expression,
purification, etc., and the machine time, the cost of storage is
relatively small. I would make two copies
(in physically different locations) of the raw data for cases 1 and 3.
One copy is on our cluster (not an expensive distributed file system,
just a JBOD RAID) and the other copy is on my shelf.
For case 4, I might keep just one copy instead of two.

The big drawback of "lots of HDDs on the shelf" is that
one quickly loses track of which disk contains what.
Once the owner leaves the group, this becomes intractable.
Moreover, one has to regularly check that each disk is still alive
and make another copy when one fails. This is very tedious.
If you have central storage, everything is
in one place and you will be notified of failing or failed disks.
Unless many disks fail simultaneously, the data is safe.
(But you might run "rm" by mistake, or your computer might be
infected by malware; so I keep one offline disk as a backup.)
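To make those periodic shelf-disk checks less tedious, one option (a minimal sketch of my own, not any established pipeline; file and manifest names are made up) is to write a checksum manifest next to the data when the copy is made, and re-verify it whenever the disk is plugged in:

```python
import hashlib
import os

def sha256_file(path: str, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so even multi-TB movies fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_manifest(root: str, manifest: str = "MANIFEST.sha256") -> None:
    """Record a checksum for every file under root (skipping the manifest itself)."""
    with open(os.path.join(root, manifest), "w") as out:
        for dirpath, _, files in os.walk(root):
            for name in sorted(files):
                if name == manifest:
                    continue
                path = os.path.join(dirpath, name)
                # "hash  relative/path" matches the sha256sum manifest format
                out.write(f"{sha256_file(path)}  {os.path.relpath(path, root)}\n")
```

On Linux, `sha256sum -c MANIFEST.sha256` run from the disk's root re-verifies every file against the manifest, so silent corruption is caught while the other copy is still good.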

Ultimately this depends on what one values.
I believe that keeping raw data and making it public is
essential for science. Others think it is a waste of time and
money and that one should focus on new research. Some people even say
it is a waste of energy and bad for the climate.

Just my two cents.

Best regards,

Takanori Nakane

On 2024/02/15 17:08, Kikuti Carlos wrote:
> We are relatively new to SPA, also keeping drives on shelves for the 
> moment… I’m considering the following strategy:
> 
>  1. Movies corresponding to published structures: ship to EMPIAR, delete
>     from local disks unless we intend to reprocess data in the short term;
>  2. Output of the data processing pipeline (only the jobs that lead to the
>     good maps, or relevant observations): keep on a drive on the shelf
>     – as even after job selection, this can take a few TB per dataset –
>     mirror two disks for safety;
>  3. Movies corresponding to unpublished, but important structures: keep
>     on a drive on the shelf – mirror two disks for safety;
>  4. Movies from bad collections or bad samples, from which we could
>     never get anything useful: delete.
> 
> My doubts are:
> 
> For point 1, how hard is it to get everything uploaded? Any reasons not 
> to do it?
> 
> On point 2, I usually keep maps, extracted particles and aligned movies, 
> but maybe I only need to keep particle locations and a detailed 
> description of the pipeline beside the maps? Then it would only take a 
> few MB, and I could place them in our eLabFTW. The major problem here is 
> that selecting the jobs already takes a while, and there is no reliable 
> way to do it automatically.
> 
> On point 4, I’m always afraid of deleting something that some fancy new 
> software will be able to process… we often work with flexible proteins 
> that are really reluctant to process to high resolution, despite good 
> contrast sometimes. But at some point one needs to make decisions… in 
> the end, I have to admit that I’ve only deleted one dataset, with a cold 
> sweat running over my spine.
> 
> Please let me know of your opinions on that.
> 
> Considering the long-term storage, I’ve been told that:
> 
>  1. The famous tape system has a lot of logistical drawbacks: software
>     and hardware are often updated, and old tapes periodically need to
>     be converted to new formats (very time consuming, and it gets
>     expensive if you need to replace equipment) – places that have this
>     kind of resource usually have a dedicated crew;
>  2. Transfer to tape is often prone to errors, and nobody checks
>     byte by byte whether the copy went fine;
>  3. Hard drives fail if they are used too much, and also if they are not
>     used at all. So the best would be to plug them in every now and then,
>     a bit like the old car in the garage. Not very time consuming, but
>     one needs to remember to do this, and to keep track of which disk
>     was plugged in when (mental load, who hasn’t enough?) – and still,
>     this doesn’t guarantee that they will last 10 years;
>  4. Some people are praying for the development of data storage in DNA,
>     but I expect copying to be extremely slow…
> 
> This is a very serious issue, and I only see it getting worse as we 
> accumulate more and more data. I know I sound pessimistic, but I wish 
> everyone a great day.
> 
> Cheers,
> 
> ---------------------------------------------------
> Carlos KIKUTI, PhD
> UMR144 - CNRS - Institut Curie
> 
> Pavillon Trouillet Rossignol
> 26 Rue d’Ulm - 75005 Paris, France
> carlos.kikuti at curie.fr <mailto:carlos.kikuti at curie.fr>
> 
> 
> Message: 3
> Date: Thu, 15 Feb 2024 03:57:05 +0000
> From: "Ludtke, Steven J." <sludtke at bcm.edu>
> To: Jobichen <jobichenc at yahoo.com>
> Cc: 3DEM Mailing List <3dem at ncmir.ucsd.edu>
> Subject: Re: [3dem] Advice on storage server
> 
> I should add that for long-term backup, the most typical strategy is the 
> convenient but unsafe "drives on a shelf". That would be a one-time 
> purchase of ~$2k, but the chances that all of the drives still work and 
> you can fully recover the data in 5 or 10 years may be a little marginal. 
> Worth noting also that portable USB drives, as opposed to drives designed 
> to be internal PC drives, have massively lower reliability ratings 
> in general. Also note that SSDs lose data over time if they aren't 
> plugged in to a power source periodically for a "refresh".
> 
> ---
> Steven Ludtke, Ph.D. <sludtke at bcm.edu>                      Baylor 
> College of Medicine
> Charles C. Bell Jr., Professor of Structural Biology        Dept. of 
> Biochemistry
> Deputy Director, Advanced Technology Cores                  and 
> Molecular Pharmacology
> Academic Director, CryoEM Core
> Co-Director CIBR Center
> 
> 
> On Feb 14, 2024, at 8:06 PM, Ludtke, Steven J. <sludtke at bcm.edu> wrote:
> 
> If you don't expect to need to access it again, ie - purely an emergency 
> backup, Amazon Glacier is a cost-effective solution, as long as you have 
> $ to continue paying for it. 100 TB of deep-archive Glacier storage 
> would run about $1200/year (+ additional cost if you need to retrieve it).
> 
> If you are storing it for possible additional processing, then you want 
> the storage to be "close" in data-transfer terms to the processing 
> power. ie - if you are processing in the cloud, then storing the data in 
> the cloud makes sense; if you are processing locally, you would not want 
> to read the data directly from cloud storage. Keep in mind the relative 
> speeds of transfer for different devices/transfer methods:
> 
> M.2 SSD -> 2-4 GB/s
> 8 drive RAID array with spinning platters directly on the machine -> ~1 GB/s
> SATA SSD -> 0.6 GB/s
> single spinning platter on machine -> 0.15 GB/s
> gigabit network remote access -> 0.1 GB/s
> less than gigabit remote access (cloud at typical institutions) -> <0.1 GB/s
> 
> For size comparison, a 4k x 4k x 1k tomogram at 8 bits is 16 GB, so 
> opening that from an M.2 SSD might take 4-8 seconds, whereas opening the 
> same file over a gigabit NAS would take almost 3 minutes.
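The open times quoted above follow directly from size / bandwidth; a quick check (my own arithmetic, using decimal GB and the midpoint of the M.2 range given above):

```python
# Quick check of the figures above: time = size / sustained bandwidth.
# A 4k x 4k x 1k tomogram at 8 bits/voxel:
size_gb = 4096 * 4096 * 1024 / 1e9  # ~17.2 GB decimal ("16 GB" in binary GiB)

rates_gb_per_s = {
    "M.2 SSD (midpoint of 2-4 GB/s)": 3.0,
    "8-drive RAID": 1.0,
    "gigabit network": 0.1,
}
for device, rate in rates_gb_per_s.items():
    print(f"{device}: ~{size_gb / rate:.0f} s")
```

The gigabit-network case comes out around 170 seconds, i.e. the "almost 3 minutes" above.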
> 
> Personally, I have a 12-bay Synology NAS box with a 10 Gb network card 
> in it under my desk. With 16 TB drives and RAID6 this gives about 150 TB 
> of usable storage space, which you can access at ~1 GB/s. Cost ~$5000, 
> with an expected drive life of ~5 years, ie - expect to have to replace 
> failed drives occasionally after the first few years.
> 
> It's worth noting here that at $5000, with an expected life of ~5 years 
> before you start having to pay for more drives, this is $1000/year and 
> gives high speed access, compared to the $1200/year for deep Glacier 
> storage above. However, the Glacier storage has much better reliability 
> than a single RAID6 array with no additional backup.
> 
> Anyway, some food for thought  :^)
> 
> ---
> Steven Ludtke, Ph.D. <sludtke at bcm.edu>                      Baylor 
> College of Medicine
> Charles C. Bell Jr., Professor of Structural Biology        Dept. of 
> Biochemistry
> Deputy Director, Advanced Technology Cores                  and 
> Molecular Pharmacology
> Academic Director, CryoEM Core
> Co-Director CIBR Center
> 
> 
> On Feb 14, 2024, at 6:14 PM, Jobichen <jobichenc at yahoo.com> wrote:
> 
> Dear All,
> We are looking for some suggestions on storing raw datasets/movies. 
> What would be the best option for storing around 100 TB of movies/processed 
> data? What would be the pros/cons of having our own storage server vs 
> cloud storage options?
> Thank you for your time.
> Jobi
> 
> 
> 
> _______________________________________________
> 3dem mailing list
> 3dem at ncmir.ucsd.edu
> https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem
> 
> 
> 
> 
> 
> ------------------------------
> 
> End of 3dem Digest, Vol 198, Issue 21
> *************************************
> 

