[3dem] Data storage and compression

Thu Aug 29 13:01:56 PDT 2019

Somehow, just buying the $40K storage doesn’t cut it. And who decides on what gets saved? I’d rather discuss who decides what actually gets collected.

1 TB of tier-one storage currently costs €130 a year at EMBL Heidelberg, counting the total cost of storage including backups (tape, tape robot, etc.), and that is underpriced considering power, IT personnel costs, etc.
And synchrotron budgets still make EM look like the cheapest stuff around, until you want to buy another EM...
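
To put that yearly price next to the ~$40 k/PB hardware figure quoted below, here is a minimal sketch. It assumes 1 PB = 1000 TB and treats €130/TB/year as a flat rate; the constant and function names are only illustrative.

```python
# Assumptions: decimal units (1 PB = 1000 TB) and a flat rate of
# 130 EUR per TB per year, the tier-one figure quoted above.
EUR_PER_TB_YEAR = 130
TB_PER_PB = 1000


def yearly_cost_eur(petabytes: float) -> float:
    """Recurring yearly cost of keeping `petabytes` on tier-one storage."""
    return petabytes * TB_PER_PB * EUR_PER_TB_YEAR


# 1 PB, plus the 2.5 and 25 PB/year range discussed in the thread.
for pb in (1, 2.5, 25):
    print(f"{pb:>5} PB -> {yearly_cost_eur(pb):,.0f} EUR/year")
```

On these assumptions 1 PB comes to €130,000 every year, a recurring cost rather than a one-off purchase.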

Wim

> On 29. Aug 2019, at 15:50, Ludtke, Steven J. <sludtke at bcm.edu> wrote:
> 
> I agree that my estimate was sort of a worst case scenario. However, 1 TB/structure is also not realistic, as most structures I'm aware of involve many days of Krios data collection, and are closer to 10 TB than 1 TB. If we estimate 50 projects/week, and set a range of 1-10 TB/project, that would be 2.5-25 PB of data per year. While this is feasible bandwidth-wise, it is still not a trivial amount of storage. While EMPIAR is talking about going 'peta-scale', I'm not sure they meant 25 PB/year.
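
As a quick check of the range above, here is a minimal sketch, assuming 52 weeks per year and 1 PB = 1000 TB (the constant and function names are illustrative only).

```python
# Assumptions: 52 weeks/year and decimal units (1 PB = 1000 TB).
PROJECTS_PER_WEEK = 50
WEEKS_PER_YEAR = 52
TB_PER_PB = 1000


def yearly_volume_pb(tb_per_project: float) -> float:
    """Community-wide data volume per year, in PB."""
    return PROJECTS_PER_WEEK * WEEKS_PER_YEAR * tb_per_project / TB_PER_PB


low = yearly_volume_pb(1)    # 1 TB per project
high = yearly_volume_pb(10)  # 10 TB per project
print(f"{low:.1f} to {high:.1f} PB/year")
```

It gives 2.6 to 26 PB/year, consistent with the rounded figure in the message.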
> 
> While I do agree with you that making the centers responsible for long-term data archival would be quite reasonable, there are certainly still some issues. In that situation you would probably, again, be talking about archiving all of the data produced by the center, or at least all of the "useful" data. If the center had 3 Krios, and ~50% of the data were "useful", each center would have to archive ~2 TB/day, or <1 PB/year.
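
The per-center figure can be reproduced with a short sketch. It assumes the per-microscope rate from the earlier back-of-the-envelope estimate in this thread (2000 movies/day at 500 MB each), 365 collection days per year and decimal units; all names are illustrative.

```python
# Assumptions: 2000 movies/day/Krios at 500 MB each (from the earlier
# estimate in the thread), 365 collection days/year, 1 TB = 1,000,000 MB.
MOVIES_PER_DAY = 2000
MB_PER_MOVIE = 500
KRIOS_PER_CENTER = 3
USEFUL_FRACTION = 0.5
MB_PER_TB = 1_000_000

tb_per_day = (MOVIES_PER_DAY * MB_PER_MOVIE * KRIOS_PER_CENTER
              * USEFUL_FRACTION / MB_PER_TB)
tb_per_year = tb_per_day * 365

print(f"{tb_per_day:.1f} TB/day, {tb_per_year:.1f} TB/year")
```

This comes to 1.5 TB/day and 547.5 TB/year, in line with "~2 TB/day, or <1 PB/year".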
> 
> Of course, storage costs fall over the long term, averaging a 2-fold reduction every 14 months, so while 25 PB sounds like a lot right now, in 5 years it should be extremely feasible. While data production rates also increase with time, there are some physical limits there which we are beginning to approach. Given that 1 PB of raw RAID 6 storage only costs ~$40k now, it seems like a $40M center ought to be able to allocate enough budget to do this sort of archival. If you look at the NYSBC's recent paper on JPG compression, though, it seems they are far from anxious to take on this role other than perhaps saving degraded JPG versions of the aligned averages.
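
A minimal sketch of the cost projection above, assuming the 14-month halving continues smoothly for the whole 5 years (which is an extrapolation, not a guarantee); the function name is illustrative.

```python
# Assumption: cost per PB halves every 14 months, continuously,
# starting from the ~$40k/PB raw RAID 6 figure quoted above.
HALVING_MONTHS = 14


def cost_after(cost_now: float, months: float) -> float:
    """Projected cost after `months`, given one halving per 14 months."""
    return cost_now / 2 ** (months / HALVING_MONTHS)


per_pb_in_5_years = cost_after(40_000, 60)
print(f"1 PB in 5 years:  ${per_pb_in_5_years:,.0f}")
print(f"25 PB in 5 years: ${25 * per_pb_in_5_years:,.0f}")
```

Under that assumption the price drops roughly 20-fold in 5 years, to about $2k per PB of raw storage.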
> 
> 
> --------------------------------------------------------------------------------------
> Steven Ludtke, Ph.D. <sludtke at bcm.edu>                      Baylor College of Medicine 
> Charles C. Bell Jr., Professor of Structural Biology
> Dept. of Biochemistry and Molecular Biology                      (www.bcm.edu/biochem)
> Academic Director, CryoEM Core                                        (cryoem.bcm.edu)
> Co-Director CIBR Center                                    (www.bcm.edu/research/cibr)
> 
> 
> 
>> On Aug 29, 2019, at 8:27 AM, Takanori Nakane <tnakane at mrc-lmb.cam.ac.uk> wrote:
>> 
>> Hi,
>> 
>>> back of the envelope:
>>> 2000 images/day * 500 MB/compressed counting movie * 100 krios ~ 100
>>> TB/day
>> 
>> But datasets that result in published maps are only a
>> fraction of what was collected. Most of the scope time
>> is spent on sample screening and optimisation, and
>> these preliminary datasets don't have to be made public.
>> 
>> There are 52 new maps released in EMDB this week.
>> Some maps came from a single dataset by classification.
>> At 1 TB per map, that is 52 TB/week, not 100 TB/day.
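
To compare the two estimates in the same units, here is a minimal sketch, assuming 1 TB per released map as stated above (the variable names are illustrative).

```python
# Assumption: every released map corresponds to 1 TB of raw data.
MAPS_PER_WEEK = 52
TB_PER_MAP = 1
WORST_CASE_TB_PER_DAY = 100  # the earlier all-Krios estimate

published_tb_per_day = MAPS_PER_WEEK * TB_PER_MAP / 7
ratio = WORST_CASE_TB_PER_DAY / published_tb_per_day

print(f"published data: {published_tb_per_day:.1f} TB/day")
print(f"worst case is {ratio:.1f}x larger")
```

That is about 7.4 TB/day, roughly 13 times less than the 100 TB/day worst case.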
>> 
>> By the way, some X-ray synchrotrons and XFELs are implementing data
>> policies under which all raw data will be stored on site and made public
>> after a certain embargo period. It would be nice if CryoEM facilities
>> followed this trend.
>> 
>>> You mean the "final" particle stack, or movies, when you say raw data ?
>>> The latter might be too large for EMPIAR ?
>> 
>> I recommend raw movies, so that others can reprocess from the beginning,
>> possibly with better strategies and programs than those used initially.
>> Of course, having cleaned particle coordinates and stacks is useful,
>> because some users are interested only in later steps of processing.
>> 
>> Best regards,
>> 
>> Takanori Nakane
>> 
>>> EMPIAR is great, and while they say they will take whatever people
>>> provide, if all of the raw data (even if it were limited to 'good' data)
>>> from all of the Krios around the world were archived in EMPIAR, there
>>> would be massive bandwidth and storage issues. When used as it is now, as
>>> an archive for important reference data sets, it's great, but I'm not sure
>>> it's a viable strategy for archiving everything produced in the CryoEM
>>> community.
>>>
>>> back of the envelope:
>>> 2000 images/day * 500 MB/compressed counting movie * 100 krios ~ 100 TB/day
>>> even a dedicated 10 Gb network running flat out couldn't keep up.
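
A minimal sketch of this estimate, assuming "10 Gb" means a 10 gigabit/s link, decimal units throughout, and no protocol overhead (all names are illustrative).

```python
# Assumptions: 10 gigabit/s line rate with zero overhead, decimal units
# (1 TB = 1,000,000 MB = 1000 GB), data produced evenly over 24 hours.
MOVIES_PER_DAY = 2000
MB_PER_MOVIE = 500
KRIOS_COUNT = 100
LINK_GBIT_PER_S = 10
SECONDS_PER_DAY = 86_400

produced_tb_per_day = MOVIES_PER_DAY * MB_PER_MOVIE * KRIOS_COUNT / 1_000_000
link_tb_per_day = LINK_GBIT_PER_S / 8 * SECONDS_PER_DAY / 1000

print(f"produced: {produced_tb_per_day:.0f} TB/day")
print(f"10 Gb/s link, theoretical maximum: {link_tb_per_day:.0f} TB/day")
```

The production figure is 100 TB/day and the theoretical link capacity is 108 TB/day, so the two are about equal; with real-world protocol overhead and shared traffic, sustained throughput would likely fall below the production rate.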
>>> On Aug 29, 2019, at 7:04 AM, Takanori Nakane <tnakane at mrc-lmb.cam.ac.uk> wrote:
>>> Hi,
>>>
>>> Most people just opt for the "hard drive on a shelf" method for completed
>>> projects, which has advantages (cheap/simple) and disadvantages (what
>>> happens if the drive dies)...
>>>
>>> After publication of your structures, I recommend depositing the raw data
>>> in EMPIAR.
>>> Not only is it useful for reproducibility, education and method development,
>>> it also serves as an additional layer of backup. You might drop your disk,
>>> water might leak from the ceiling, etc. Having backups in a physically
>>> distant place is good practice.
>>>
>>> Best regards,
>>> Takanori Nakane
>>> Julien,
>>> are you referring to the raw data, or are you trying to archive all of the
>>> files associated with a project?
>>>
>>> Counting-mode movies are generally stored and archived as compressed TIFF
>>> stacks, though if they are collected on a Falcon, there are issues with
>>> this, as good compression is achieved only pre-normalization (or
>>> post-normalization if you decide you are willing to switch back to an
>>> integer format).
>>>
>>> If you want to perfectly archive everything exactly as it is (losslessly),
>>> some compression algorithms may do very slightly better than others, but
>>> pretty much any of the commonly used algorithms will do about the same.
>>> Usually the slower ones will do slightly better, but you have to decide if
>>> it's worth the CPU time the compression takes. By definition, the noisier
>>> the data is, the less compressible it is, unless you are willing to invoke
>>> "lossy" compression and throw away some of the bits of pure noise.
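
The link between noise and compressibility can be illustrated with a small sketch. It uses Python's standard zlib (DEFLATE) on synthetic byte arrays, not on real movie frames, so the numbers only show the trend; the helper name and the test signals are made up for illustration.

```python
import random
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Uncompressed size divided by zlib-compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, level))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000

# Synthetic signals (assumptions, not real detector data):
smooth = bytes((i // 400) % 256 for i in range(n))            # no noise
lightly_noisy = bytes((s + rng.randrange(4)) % 256 for s in smooth)  # 2 noisy bits
pure_noise = bytes(rng.randrange(256) for _ in range(n))      # 8 noisy bits

for name, data in [("smooth", smooth),
                   ("lightly noisy", lightly_noisy),
                   ("pure noise", pure_noise)]:
    print(f"{name:>14}: ratio {compression_ratio(data):.2f}")
```

The ratio drops as more of each byte is noise, and pure noise does not compress at all. The `level` argument (1 to 9) trades CPU time for compression, which is the speed trade-off mentioned above.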
>>>
>>> Most people just opt for the "hard drive on a shelf" method for completed
>>> projects, which has advantages (cheap/simple) and disadvantages (what
>>> happens if the drive dies)...
>>> On Aug 29, 2019, at 6:30 AM, Julien Bous <julien.bous at etu.umontpellier.fr> wrote:
>>> Dear Community,
>>> I have a question about the best way to store my data once SPA projects
>>> are completed. Can you advise me on which compression format is
>>> preferable?
>>> Thank you for your interest,
>>> Julien
> 
> _______________________________________________
> 3dem mailing list
> 3dem at ncmir.ucsd.edu
> https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem