[3dem] Data storage and compression

Carlos Oscar S. Sorzano coss at cnb.csic.es
Thu Aug 29 21:06:20 PDT 2019


Thank you, Steve, for these detailed calculations, which give an upper 
bound on what is needed. As part of a European project, we are 
designing and implementing a data management policy in which the data 
acquired by the microscope is immediately copied to a temporary 
storage server. This is partly motivated by the need to make publicly 
available data that has been funded by public money. At this point the 
data is given a unique identifier that is used throughout the process 
to track the acquisition. This copy of the raw data can be accompanied 
by derived data (aligned micrographs, CTFs, coordinates, image 
processing workflow, ...), which is small compared to the raw data. 
After a period of time, typically 3 years, we check whether this data 
has resulted in a deposited EMDB entry. If it has, the data is 
automatically moved from the temporary repository to EMPIAR. If it has 
not resulted in a successful structure, the user is contacted and, 
upon approval, the data can be deleted. This strategy limits the 
storage needs at EMPIAR and at the temporary repository (limited to 3 
years), and accounts for the fact that not all acquisitions end in 
successful reconstructions. The temporary repository can be local to 
the institution or more global (within the European project we are 
running a pilot with EMPIAR itself).
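The triage logic of the policy described above can be sketched as follows. This is a minimal illustration, not an actual implementation: all function and field names are hypothetical; only the 3-year window, the EMDB check, the move to EMPIAR, and the user-approved deletion come from the text.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the triage policy described above. Raw data gets a
# unique identifier on acquisition; after the retention period it is either
# promoted to EMPIAR (if an EMDB entry resulted) or flagged for deletion
# pending user approval. All names here are illustrative.

RETENTION = timedelta(days=3 * 365)  # "typically 3 years"

def triage(dataset, now, has_emdb_entry):
    """Decide what to do with a dataset sitting in the temporary repository."""
    if now - dataset["acquired"] < RETENTION:
        return "keep"                    # still within the retention window
    if has_emdb_entry:
        return "move_to_empiar"          # successful structure: archive it
    return "ask_user_then_delete"        # contact the user before deleting

ds = {"id": "EMP-2019-0001", "acquired": datetime(2016, 1, 10)}
print(triage(ds, datetime(2019, 8, 29), has_emdb_entry=True))   # move_to_empiar
print(triage(ds, datetime(2019, 8, 29), has_emdb_entry=False))  # ask_user_then_delete
```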

Kind regards, Carlos Oscar

On 29/08/2019 at 15:50, Ludtke, Steven J. wrote:
> I agree that my estimate was sort of a worst-case scenario. However, 1 
> TB/structure is also not realistic, as most structures I'm aware of 
> involve many days of Krios data collection, and are closer to 10 TB 
> than 1 TB. If we estimate 50 projects/week, and set a range of 1-10 
> TB/project, that would be 2.5-25 PB of data per year. While this is 
> feasible bandwidth-wise, it is still not a trivial amount of storage. 
> While EMPIAR is talking about going 'peta-scale', I'm not sure they 
> meant 25 PB/year.
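A quick sanity check of the figures above (this uses a 52-week year; the quoted 2.5-25 PB corresponds to roughly 50 weeks):

```python
# Back-of-envelope check: 50 projects/week at 1-10 TB per project,
# accumulated over a 52-week year, expressed in PB.
projects_per_week = 50
weeks = 52
low_tb, high_tb = 1, 10   # TB per project

lo_pb = projects_per_week * weeks * low_tb / 1000    # TB -> PB
hi_pb = projects_per_week * weeks * high_tb / 1000
print(f"{lo_pb:.1f} - {hi_pb:.1f} PB/year")          # 2.6 - 26.0 PB/year
```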
>
> I do agree with you that making the centers responsible for long-term 
> data archival would be quite reasonable, though there are certainly 
> still some issues. In that situation you would probably, again, be 
> talking about archiving all of the data produced by the center, or at 
> least all of the "useful" data. If the center had 3 Krios and ~50% of 
> the data were "useful", each center would have to archive ~2 TB/day, 
> or <1 PB/year.
>
> Of course, storage costs fall with time, averaging over the long term 
> a 2-fold cost reduction every 14 months, so while 25 PB sounds like a 
> lot right now, in 5 years it should be extremely feasible. While data 
> production rates also increase with time, there are some physical 
> limits there which we are beginning to approach. Given that 1 PB of 
> raw RAID 6 storage only costs ~$40k now, it seems like a $40M center 
> ought to be able to allocate enough budget to do this sort of 
> archival. If you look at the NYSBC's recent paper on JPEG compression, 
> though, it seems they are far from anxious to take on this role other 
> than perhaps saving degraded JPEG versions of the aligned averages.
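The cost projection above can be checked with a little arithmetic, taking the quoted $40k/PB today and a 2-fold reduction every 14 months at face value:

```python
# Projecting the storage cost figures quoted above 5 years forward.
cost_now = 40_000          # USD per PB of raw RAID 6 today (quoted figure)
halving_months = 14        # quoted long-term cost-halving period
months = 5 * 12

factor = 2 ** (months / halving_months)   # roughly a 19-20x reduction
print(f"~${cost_now / factor:,.0f} per PB in 5 years")
```

So at the quoted trend, a petabyte of raw storage would cost on the order of $2k in 5 years, which supports the point that 25 PB/year stops looking daunting fairly quickly.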
>
>
> --------------------------------------------------------------------------------------
> Steven Ludtke, Ph.D. <sludtke at bcm.edu>       Baylor College of Medicine
> Charles C. Bell Jr., Professor of Structural Biology
> Dept. of Biochemistry and Molecular Biology    (www.bcm.edu/biochem)
> Academic Director, CryoEM Core                 (cryoem.bcm.edu)
> Co-Director CIBR Center                        (www.bcm.edu/research/cibr)
>
>
>
>> On Aug 29, 2019, at 8:27 AM, Takanori Nakane
>> <tnakane at mrc-lmb.cam.ac.uk> wrote:
>>
>> Hi,
>>
>>> back of the envelope:
>>> 2000 images/day * 500 MB/compressed counting movie * 100 krios ~ 100
>>> TB/day
>>
>> But datasets that result in published maps are only a
>> fraction of what is collected. Most of the scope time
>> is spent on sample screening and optimisation, and
>> these preliminary datasets don't have to be made public.
>>
>> There were 52 new maps released in EMDB this week.
>> Some maps come from a single dataset via classification.
>> At 1 TB per map, that is 52 TB/week, not 100 TB/day.
>>
>> By the way, some X-ray synchrotrons and XFELs are implementing data
>> policies whereby all raw data are stored on site and made public after
>> a certain embargo period. It would be nice if cryo-EM facilities
>> followed this trend.
>>
>>> You mean the "final" particle stack, or movies, when you say raw data?
>>> The latter might be too large for EMPIAR?
>>
>> I recommend raw movies, so that others can reprocess from the beginning
>> with strategies and programs possibly improved over those initially used.
>> Of course, having cleaned particle coordinates and stacks is also useful,
>> because some users are interested only in the later steps of processing.
>>
>> Best regards,
>>
>> Takanori Nakane
>>
>>> EMPIAR is great, and while they say they will take whatever people
>>> provide, if all of the raw data (even if it were limited to 'good' data)
>>> from all of the Krios around the world were archived in EMPIAR, there
>>> would be massive bandwidth and storage issues. When used as it is now, as
>>> an archive for important reference data sets, it's great, but I'm not sure
>>> it's a viable strategy for archiving everything produced in the CryoEM
>>> community.
>>>
>>> back of the envelope:
>>> 2000 images/day * 500 MB/compressed counting movie * 100 krios ~ 100
>>> TB/day
>>> even a dedicated 10 Gb network running flat out couldn't keep up.
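As a sanity check of the bandwidth claim above: a 10 Gb/s link running flat out at line rate moves about 108 TB/day, so a sustained 100 TB/day would leave essentially no headroom for protocol overhead, retries, or any other traffic.

```python
# Line-rate capacity of a dedicated 10 Gb/s link over a full day.
gbps = 10
tb_per_day = gbps / 8 * 86400 / 1000   # Gb/s -> GB/s -> GB/day -> TB/day
print(f"{tb_per_day:.0f} TB/day")      # 108 TB/day
```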
>>> On Aug 29, 2019, at 7:04 AM, Takanori Nakane
>>> <tnakane at mrc-lmb.cam.ac.uk> wrote:
>>> Hi,
>>>
>>> Most people just opt for the "hard drive on a shelf" method for completed
>>> projects, which has advantages (cheap/simple) and disadvantages (what
>>> happens if the drive dies)...
>>>
>>> After publication of your structures, I recommend depositing the raw data
>>> in EMPIAR.
>>> Not only is it useful for reproducibility, education and method
>>> development, it also serves as an additional layer of backup. You might
>>> drop your disk, water might leak from the ceiling, etc. Having backups in
>>> a physically distant place is good practice.
>>>
>>> Best regards,
>>>
>>> Takanori Nakane
>>> Julien,
>>>
>>> are you referring to the raw data, or are you trying to archive all of the
>>> files associated with a project?
>>>
>>> Counting-mode movies are generally stored and archived as compressed tiff
>>> stacks, though if they are collected on a Falcon, there are issues with
>>> this, as good compression is achieved only pre-normalization (or
>>> post-normalization if you decide you are willing to switch back to an
>>> integer format).
>>>
>>> If you want to perfectly archive everything exactly as it is (losslessly),
>>> some compression algorithms may do very slightly better than others, but
>>> pretty much any of the commonly used algorithms will do about the same.
>>> Usually the slower ones will do slightly better, but you have to decide if
>>> it's worth the CPU time the compression takes.  By definition, the noisier
>>> the data is, the less compressible it is, unless you are willing to invoke
>>> "lossy" compression and throw away some of the bits of pure noise.
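The point about noise and compressibility can be illustrated with synthetic data. This sketch uses uniform random integer "frames" (not real movie data, and not a Poisson dose model) and zlib purely as a stand-in for the lossless compressors discussed above:

```python
import random
import zlib

# Illustration: lossless compression gains shrink as data gets noisier.
# Synthetic integer frames stand in for counting movies; this is not
# real microscope data and uses uniform (not Poisson) counts.
random.seed(0)

def frame(max_count):
    # 64k "pixels" of random counts in [0, max_count], one byte each
    return bytes(random.randint(0, max_count) for _ in range(65536))

sparse = frame(1)    # low-dose counting data: mostly 0s and 1s, compresses well
noisy = frame(255)   # high-variance data: nearly incompressible

for name, data in [("sparse", sparse), ("noisy", noisy)]:
    ratio = len(data) / len(zlib.compress(data, 9))
    print(f"{name}: {ratio:.1f}x")
```

The sparse frame compresses several-fold because its entropy is about 1 bit per pixel, while the full-range random frame barely compresses at all, which is the "noise is incompressible" point in the message above.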
>>> Most people just opt for the "hard drive on a shelf" method for completed
>>> projects, which has advantages (cheap/simple) and disadvantages (what
>>> happens if the drive dies)...
>>> On Aug 29, 2019, at 6:30 AM, Julien Bous
>>> <julien.bous at etu.umontpellier.fr> wrote:
>>> Dear Community,
>>>
>>> I have a question about the best way to store my data once SPA projects
>>> are completed. Can you advise me on which compression format is
>>> preferable?
>>>
>>> Thank you for your interest,
>>> Julien
>
>
> _______________________________________________
> 3dem mailing list
> 3dem at ncmir.ucsd.edu
> https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem

-- 
------------------------------------------------------------------------
Carlos Oscar Sánchez Sorzano                  e-mail:   coss at cnb.csic.es
Biocomputing unit                             http://i2pc.es/coss
National Center of Biotechnology (CSIC)
c/Darwin, 3
Campus Universidad Autónoma (Cantoblanco)     Tlf: 34-91-585 4510
28049 MADRID (SPAIN)                          Fax: 34-91-585 4506
------------------------------------------------------------------------
