[3dem] Data storage and compression

Ludtke, Steven J. sludtke at bcm.edu
Thu Aug 29 06:50:46 PDT 2019


I agree that my estimate was something of a worst-case scenario. However, 1 TB/structure is also not realistic, as most structures I'm aware of involve many days of Krios data collection, and are closer to 10 TB than 1 TB. If we estimate 50 projects/week, and set a range of 1-10 TB/project, that would be 2.5-25 PB of data per year. While this is feasible bandwidth-wise, it is still not a trivial amount of storage. While EMPIAR is talking about going 'peta-scale', I'm not sure they meant 25 PB/year.
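As a quick sanity check, the annual figure follows directly from the per-project range (assuming ~50 data-collection weeks per year, which reproduces the 2.5-25 PB range quoted here):

```python
# Annual archive volume from the estimates above.
projects_per_week = 50
tb_per_project = (1, 10)   # low/high range from this thread
weeks_per_year = 50        # round figure assumed here

low, high = (projects_per_week * t * weeks_per_year / 1000 for t in tb_per_project)
print(f"{low:.1f}-{high:.1f} PB/year")  # 2.5-25.0 PB/year
```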

I do agree with you that making the centers responsible for long-term data archival would be quite reasonable, though there are certainly still some issues. In that situation you would probably, again, be talking about archiving all of the data produced by the center, or at least all of the "useful" data. If a center had three Krios and ~50% of the data were "useful", each center would have to archive ~2 TB/day, or <1 PB/year.
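Spelling out the per-center arithmetic (the ~1 TB/day per-scope rate comes from the 2000 movies/day x ~500 MB estimate elsewhere in this thread):

```python
# Per-center archival load under the assumptions above.
krios_per_center = 3
tb_per_scope_per_day = 1.0   # ~2000 movies/day * ~500 MB, per this thread
useful_fraction = 0.5        # half the data deemed worth keeping

archive_tb_per_day = krios_per_center * tb_per_scope_per_day * useful_fraction
archive_pb_per_year = archive_tb_per_day * 365 / 1000

print(f"{archive_tb_per_day:.1f} TB/day, {archive_pb_per_year:.2f} PB/year")
```

This lands at ~1.5 TB/day and ~0.55 PB/year, consistent with the ~2 TB/day and <1 PB/year figures above.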

Of course, storage costs fall with time, over the long term averaging a 2-fold cost reduction every 14 months, so while 25 PB sounds like a lot right now, in 5 years it should be extremely feasible. While data production rates also increase with time, there are some physical limits there which we are beginning to approach. Given that 1 PB of raw RAID 6 storage only costs ~$40k now, it seems like a $40M center ought to be able to allocate enough budget to do this sort of archival. If you look at the NYSBC's recent paper on JPG compression, though, it seems they are far from anxious to take on this role, other than perhaps saving degraded JPG versions of the aligned averages.
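To put numbers on that projection, a sketch assuming the historical ~2x/14-month price decline quoted above simply continues (which is an extrapolation, not a guarantee):

```python
# Cost of archiving 25 PB five years out, extrapolating the trend above.
cost_per_pb_now = 40_000   # USD, raw RAID 6 figure from this thread
halving_months = 14        # historical ~2x price-drop interval
years = 5

halvings = years * 12 / halving_months
cost_per_pb_future = cost_per_pb_now / 2 ** halvings
total_25pb = 25 * cost_per_pb_future

print(f"~${cost_per_pb_future:,.0f}/PB, ~${total_25pb:,.0f} for 25 PB")
```

Under that assumption, a full year's 25 PB would cost on the order of $50k in raw disk five years from now.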


--------------------------------------------------------------------------------------
Steven Ludtke, Ph.D. <sludtke at bcm.edu>                      Baylor College of Medicine
Charles C. Bell Jr., Professor of Structural Biology
Dept. of Biochemistry and Molecular Biology                      (www.bcm.edu/biochem)
Academic Director, CryoEM Core                                        (cryoem.bcm.edu)
Co-Director CIBR Center                                    (www.bcm.edu/research/cibr)



On Aug 29, 2019, at 8:27 AM, Takanori Nakane <tnakane at mrc-lmb.cam.ac.uk> wrote:

Hi,

back of the envelope:
2000 images/day * 500 MB/compressed counting movie * 100 Krios ~ 100 TB/day

But datasets that result in published maps are only a fraction of what is collected. Most of the scope time is spent on sample screening and optimisation, and these preliminary datasets don't have to be made public.

There were 52 new maps released in the EMDB this week, and some of those maps came from a single dataset via classification. At 1 TB per map, that is 52 TB/week, not 100 TB/day.
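The two back-of-envelope figures above can be laid side by side (same numbers as in the thread, decimal units throughout):

```python
# Raw collection rate across all scopes vs. volume behind published maps.
images_per_day = 2000   # movies per Krios per day
mb_per_movie = 500      # compressed counting movie
n_krios = 100           # rough worldwide count used above

raw_tb_per_day = images_per_day * mb_per_movie * n_krios / 1e6
published_tb_per_week = 52 * 1   # 52 EMDB maps/week * 1 TB/map

print(raw_tb_per_day)            # 100.0 (TB/day collected)
print(published_tb_per_week)     # 52 (TB/week behind released maps)
```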

By the way, some X-ray synchrotrons and XFELs are implementing data policies under which all raw data are stored on site and made public after a certain embargo period. It would be nice if CryoEM facilities followed this trend.

You mean the "final" particle stack, or movies, when you say raw data? The latter might be too large for EMPIAR?

I recommend raw movies, so that others can reprocess from the beginning with strategies and programs possibly improved over those used initially. Of course, having cleaned particle coordinates and stacks is also useful, because some users are interested only in the later steps of processing.

Best regards,

Takanori Nakane

EMPIAR is great, and while they say they will take whatever people provide, if all of the raw data (even if it were limited to 'good' data) from all of the Krios around the world were archived in EMPIAR, there would be massive bandwidth and storage issues. When used as it is now, as an archive for important reference data sets, it's great, but I'm not sure it's a viable strategy for archiving everything produced in the CryoEM community.

back of the envelope:
2000 images/day * 500 MB/compressed counting movie * 100 Krios ~ 100 TB/day

even a dedicated 10 Gb network running flat out couldn't keep up.
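The bandwidth claim is easy to check (assuming decimal units and zero protocol overhead). The theoretical ceiling of a 10 Gb/s link comes out barely above 100 TB/day, so sustaining that transfer around the clock, once real-world overhead and contention are included, is indeed implausible:

```python
# Theoretical maximum daily throughput of a dedicated 10 Gb/s link.
link_gbps = 10
bytes_per_second = link_gbps * 1e9 / 8         # 1.25 GB/s
tb_per_day = bytes_per_second * 86_400 / 1e12  # seconds/day -> TB

print(f"{tb_per_day:.0f} TB/day at 100% utilization")  # 108 TB/day
```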
On Aug 29, 2019, at 7:04 AM, Takanori Nakane <tnakane at mrc-lmb.cam.ac.uk> wrote:
Hi,
Most people just opt for the "hard drive on a shelf" method for completed projects, which has advantages (cheap/simple) and disadvantages (what happens if the drive dies)...

After publication of your structures, I recommend depositing the raw data in EMPIAR. Not only is it useful for reproducibility, education and method development, it also serves as an additional layer of backup. You might drop your disk, water might leak from the ceiling, etc. Having backups in a physically distant place is good practice.

Best regards,

Takanori Nakane
Julien,
are you referring to the raw data, or are you trying to archive all of the files associated with a project?

Counting-mode movies are generally stored and archived as compressed TIFF stacks, though if they are collected on a Falcon, there are issues with this, as good compression is achieved only pre-normalization (or post-normalization if you decide you are willing to switch back to an integer format).
If you want to perfectly archive everything exactly as it is (losslessly), some compression algorithms may do very slightly better than others, but pretty much any of the commonly used algorithms will do about the same. Usually the slower ones will do slightly better, but you have to decide if it's worth the CPU time the compression takes. By definition, the noisier the data is, the less compressible it is, unless you are willing to invoke "lossy" compression and throw away some of the bits of pure noise.
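The noise/compressibility point can be demonstrated with any lossless codec. A sketch using Python's zlib on synthetic data (the 5% count rate and buffer size are made-up illustrative values, not real detector parameters; requires Python 3.9+ for random.randbytes):

```python
import random
import zlib

random.seed(0)
n = 1 << 18  # 256 KiB test buffer, standing in for one movie frame

# Counting-mode data: mostly zero electrons per pixel, occasional single counts.
sparse = bytes(1 if random.random() < 0.05 else 0 for _ in range(n))
# Pure noise: every byte value equally likely, near-maximal entropy.
noise = random.randbytes(n)

def ratio(data: bytes) -> float:
    """Lossless compression ratio (higher = more compressible)."""
    return len(data) / len(zlib.compress(data, 6))

print(f"sparse counts: {ratio(sparse):.1f}x")  # compresses heavily
print(f"uniform noise: {ratio(noise):.1f}x")   # barely compresses at all
```

The sparse buffer shrinks many-fold while the noise buffer stays essentially the same size, which is why noisier micrographs compress worse regardless of the codec chosen.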
Most people just opt for the "hard drive on a shelf" method for completed projects, which has advantages (cheap/simple) and disadvantages (what happens if the drive dies)...
On Aug 29, 2019, at 6:30 AM, Julien Bous <julien.bous at etu.umontpellier.fr> wrote:
Dear Community,

I have a question about the best way to store my data once SPA projects are completed. Can you advise me on which compression format is preferable?

Thank you for your interest,
Julien
_______________________________________________
3dem mailing list
3dem at ncmir.ucsd.edu
https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem




