[3dem] Data storage system

Takanori Nakane tnakane at mrc-lmb.cam.ac.uk
Fri Apr 2 03:06:05 PDT 2021


Hi,

RAID-NAS solutions as proposed by Tim and Steven are appropriate for
most per-research-group storage. Of course this depends on the use
case and the number of clients; if you have many computers running
I/O-heavy jobs simultaneously (e.g. Polish), the storage
can become a bottleneck. For such cases, a distributed file system is
more suitable, but "manual load balancing" across two NASes
(e.g. data collected on odd days go to NAS 1 and on even days to NAS 2)
also works.
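
As a rough illustration of the odd/even split, here is a minimal Python
sketch. The mount points /mnt/nas1 and /mnt/nas2 and the function name
are made up for this example; substitute whatever your site uses.

```python
from datetime import date

# Hypothetical mount points (assumption); adjust to your own NAS paths.
NAS_MOUNTS = ("/mnt/nas1", "/mnt/nas2")


def nas_for_collection_day(day: date) -> str:
    """Return NAS 1 for odd days of the month and NAS 2 for even days."""
    return NAS_MOUNTS[0] if day.day % 2 == 1 else NAS_MOUNTS[1]


# Example: pick the destination for a session collected on 2 April 2021.
print(nas_for_collection_day(date(2021, 4, 2)))  # prints /mnt/nas2
```

Note that the 31st and the following 1st are both odd, so the split is
only approximately even. That should not matter for this purpose.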

Even if you use RAID-6, you should make at least one backup copy
of raw movies outside the NAS to protect against "rm -fr" mistakes and
malware. LTO tapes are reliable but drives are expensive and
tricky to use. "HDDs on a shelf" are far from ideal, but often the
only realistic backup solution. It is better to have a backup than nothing!
In any case, one should keep a list of the files on each disk/tape
somewhere online; otherwise you have to mount and inspect media one by one
to find a dataset. This is very cumbersome and almost impossible
once the student/post-doc who made the backup has left the lab.

I think the above strategy is sufficient to keep data for 3 to 5 years
after creation. 3 to 5 years should be long enough to publish a
paper. After publication, you can upload raw movies to EMPIAR.
It is free, safe (data are mirrored to Japan and China) and you contribute
to open science.

I think commercial clouds are way too expensive for our use cases. For example,
AWS S3 Glacier Deep Archive costs $0.0018 per GB per month. In addition, they charge
for data retrieval! https://aws.amazon.com/s3/pricing/?nc1=h_ls
If one stores 5 TB (5000 GB) for 5 years and then retrieves it,
it will cost 0.0018 * 5000 * 12 * 5 = $540 for storage and 0.09 * 5000 = $450 for
transfer.
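
The same estimate can be written as a small Python sketch, which makes it
easy to plug in your own dataset size. The function name is made up, and
the prices are simply the ones quoted above (per GB per month for
storage, per GB for transfer out); check the current price list for your
region before relying on them.

```python
def cloud_archive_cost_usd(size_gb, years, storage_per_gb_month, transfer_per_gb):
    """Return (storage, transfer) cost in USD for one store-then-retrieve cycle."""
    storage = storage_per_gb_month * size_gb * 12 * years
    transfer = transfer_per_gb * size_gb
    return storage, transfer


# 5 TB taken as 5000 GB, kept for 5 years, then downloaded once.
storage, transfer = cloud_archive_cost_usd(5000, 5, 0.0018, 0.09)
print(f"storage ${storage:.0f}, transfer ${transfer:.0f}, total ${storage + transfer:.0f}")
# prints: storage $540, transfer $450, total $990
```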

In Japan, my university's HPC system used to charge only ~60 USD/TB/year even
for hot storage. In Cambridge, cold storage is 21.6 GBP/TB/year
(https://www.hpc.cam.ac.uk/research-data-storage-services/price-list).
Of course this is not a fair comparison to commercial cloud providers,
because university IT systems are subsidized. But from an individual lab's point of view,
there is no point purchasing external clouds when some of the grant money has
already been poured into university IT via indirect costs and the storage
is already available locally.

I also note that we don't really need the high reliability and availability provided
by commercial clouds. Losing one movie out of a 10,000-movie dataset won't
change the resolution. We have many tasks to do and can happily wait
several days if the university HPC system goes offline for maintenance.

Best regards,

Takanori Nakane

On 2021/04/01 19:58, Krishan Pandey wrote:
> Hello,
> 
> I am requesting suggestions and cost estimates about off-the-shelf data storage systems to store raw cryo-EM movies and processed data for our lab. Our
> initial target is 150-200 TB with options to expand it in future.
> We don't have much local IT support for Linux-based systems, that's why I am asking for an off-the-shelf system which should be easier to install and
> manage.
> 
> Thank you
> best regards
> 
> Krishan Pandey
> 
> 
> _______________________________________________
> 3dem mailing list
> 3dem at ncmir.ucsd.edu
> https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem
> 


