[3dem] Data storage and compression

Takanori Nakane tnakane at mrc-lmb.cam.ac.uk
Thu Aug 29 06:27:23 PDT 2019


Hi,

> back of the envelope:
> 2000 images/day * 500 MB/compressed counting movie * 100 krios ~ 100
> TB/day

But the datasets that result in published maps are only a
fraction of what was collected. Most of the scope time
is spent on sample screening and optimisation, and
these preliminary datasets don't have to be made public.

There are 52 new maps released in EMDB this week.
Some of these maps came from a single dataset by classification.
At 1 TB per map, that is at most 52 TB/week, not 100 TB/day.
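
To make the arithmetic explicit, here is a small Python sketch of both
estimates. The function names are made up for illustration, decimal units
are assumed (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), and the 1 TB per map
figure is only the rough guess used above, not a measured value.

```python
MB = 10**6
TB = 10**12
SECONDS_PER_DAY = 86400


def daily_volume_bytes(movies_per_day, bytes_per_movie, n_microscopes):
    """Raw data produced per day if every microscope archives every movie."""
    return movies_per_day * bytes_per_movie * n_microscopes


def link_capacity_bytes_per_day(gigabits_per_second):
    """Upper bound for a network link, ignoring all protocol overhead."""
    return gigabits_per_second * 10**9 * SECONDS_PER_DAY // 8


# Quoted estimate: 2000 movies/day * 500 MB/movie * 100 Krios.
everything = daily_volume_bytes(2000, 500 * MB, 100)

# Published-only estimate: 52 maps/week at an assumed 1 TB per map.
published_per_day = 52 * TB / 7

# A 10 Gb/s link running flat out, as in the quoted message.
link = link_capacity_bytes_per_day(10)

print(f"everything:     {everything / TB:.1f} TB/day")
print(f"published only: {published_per_day / TB:.1f} TB/day")
print(f"10 Gb/s link:   {link / TB:.1f} TB/day")
```

Under these assumptions, archiving everything comes to 100 TB/day, which
leaves a 10 Gb/s link (about 108 TB/day at best) with essentially no
headroom, while the published-only estimate is roughly 7 TB/day.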

By the way, some X-ray synchrotrons and XFELs are implementing data
policies under which all raw data are stored on site and made public after
a certain embargo period. It would be nice if CryoEM facilities followed
this trend.

> You mean the "final" particle stack, or movies, when you say raw data ?
> The latter might be too large for EMPIAR ?

I recommend raw movies, so that others can reprocess from the beginning,
possibly with better strategies and programs than those used initially.
Of course, having cleaned particle coordinates and stacks is also useful,
because some users are interested only in the later steps of processing.

Best regards,

Takanori Nakane

> EMPIAR is great, and while they say they will take whatever people
> provide, if all of the raw data (even if it were limited to 'good' data)
> from all of the Krios around the world were archived in EMPIAR, there
> would be massive bandwidth and storage issues. When used as it is now, as
> an archive for important reference data sets, it's great, but I'm not sure
> it's a viable strategy for archiving everything produced in the CryoEM
> community.
> back of the envelope:
> 2000 images/day * 500 MB/compressed counting movie * 100 krios ~ 100
> TB/day
> even a dedicated 10 Gb network running flat out couldn't keep up.
> --------------------------------------------------------------------------------------
> Steven Ludtke, Ph.D. <sludtke at bcm.edu>
>          Baylor College of Medicine
> Charles C. Bell Jr., Professor of Structural Biology
> Dept. of Biochemistry and Molecular Biology
> (www.bcm.edu/biochem)
> Academic Director, CryoEM Core
> (cryoem.bcm.edu)
> Co-Director CIBR Center
> (www.bcm.edu/research/cibr)
> On Aug 29, 2019, at 7:04 AM, Takanori Nakane
> <tnakane at mrc-lmb.cam.ac.uk> wrote:
> Hi,
> Most people just opt for the "hard drive on a shelf" method for completed
> projects, which has advantages (cheap/simple) and disadvantages (what
> happens if the drive dies)...
> After publication of your structures, I recommend raw data to be deposited
> in EMPIAR.
> Not only is it useful for reproducibility, education and method development,
> it also serves as an additional layer of backup. You might drop your disk,
> water might leak from the ceiling, etc. Having backups in a physically
> distant place is a good practice.
> Best regards,
> Takanori Nakane
> Julien,
> are you referring to the raw data, or are you trying to archive all of the
> files associated with a project?
> Counting-mode movies are generally stored and archived as compressed tiff
> stacks, though if they are collected on a Falcon, there are issues with
> this, as good compression is achieved only pre-normalization (or
> post-normalization if you decide you are willing to switch back to an
> integer format).
> If you want to perfectly archive everything exactly as it is (losslessly),
> some compression algorithms may do very slightly better than others, but
> pretty much any of the commonly used algorithms will do about the same.
> Usually the slower ones will do slightly better, but you have to decide if
> it's worth the CPU time the compression takes.  By definition, the noisier
> the data is, the less compressible it is, unless you are willing to invoke
> "lossy" compression and throw away some of the bits of pure noise.
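
The points in the quoted paragraph above can be tried out with a small
Python sketch using only standard-library codecs. Everything here is an
illustrative assumption: the synthetic "frames" are random 0/1 event maps
with one byte per pixel, the gain values are made up, and zlib, bz2 and
lzma merely stand in for whatever codec a real TIFF writer would use. It is
not a model of any particular detector or file format.

```python
import array
import bz2
import lzma
import random
import zlib


def counting_frame(n_pixels, event_probability, seed=0):
    """Return a synthetic integer frame, one byte per pixel (0 or 1 event)."""
    rng = random.Random(seed)
    return bytes(1 if rng.random() < event_probability else 0
                 for _ in range(n_pixels))


def gain_normalized(frame, seed=1):
    """Scale each pixel by a made-up gain and store it as 32-bit floats."""
    rng = random.Random(seed)
    scaled = (pixel * rng.uniform(0.9, 1.1) for pixel in frame)
    return array.array("f", scaled).tobytes()


def compressed_sizes(data):
    """Compressed size in bytes for three lossless stdlib codecs."""
    return {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }


if __name__ == "__main__":
    n_pixels = 256 * 256
    sparse = counting_frame(n_pixels, 0.02)  # few events, low entropy
    dense = counting_frame(n_pixels, 0.20)   # more events, higher entropy
    cases = [
        ("integer, sparse", sparse),
        ("integer, dense", dense),
        ("float32, gain-normalized", gain_normalized(sparse)),
    ]
    for label, data in cases:
        print(f"{label:26s} raw={len(data):7d} {compressed_sizes(data)}")
```

Comparing the compressed sizes (not the ratios, since the float frame is
four times larger before compression) shows the same events costing more
bytes once they are gain-normalized floats, and the denser frame costing
more than the sparse one. How the three codecs rank against each other will
depend on the data, so the sketch prints the sizes without claiming a winner.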
> Most people just opt for the "hard drive on a shelf" method for completed
> projects, which has advantages (cheap/simple) and disadvantages (what
> happens if the drive dies)...
> --------------------------------------------------------------------------------------
> Steven Ludtke, Ph.D. <sludtke at bcm.edu>
>         Baylor College of Medicine
> Charles C. Bell Jr., Professor of Structural Biology
> Dept. of Biochemistry and Molecular Biology
> (www.bcm.edu/biochem)
> Academic Director, CryoEM Core
> (cryoem.bcm.edu)
> Co-Director CIBR Center
> (www.bcm.edu/research/cibr)
> On Aug 29, 2019, at 6:30 AM, Julien Bous
> <julien.bous at etu.umontpellier.fr> wrote:
> Dear Community,
> I have a question about the best way to store my data once SPA projects
> are completed. Can you advise me on which compression format is
> preferable?
> Thank you for your interest,
> Julien
> _______________________________________________
> 3dem mailing list
> 3dem at ncmir.ucsd.edu
> https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.ncmir.ucsd.edu_mailman_listinfo_3dem&d=DwICAg&c=ZQs-KZ8oxEw0p81sqgiaRA&r=GWA2IF6nkq8sZMXHpp1Xpg&m=-Yu84q3MdcWvESYXpaK7NQdEWch6tE1eG9IVNTjLay4&s=NSrJg_YgFffwLELO1auXSC6yYLEsGHVoNV5TI_1eBqM&e=