[3dem] File systems

Ludtke, Steven J sludtke at bcm.edu
Sun Jan 3 15:11:42 PST 2016


Ok, that helps a lot. This is a complicated issue, but here is my take on it, with some numbers:

Individual "spinning platter" desktop hard drives typically deliver ~150 MB/sec of data. A modern SSD can easily do 500+ MB/sec. The absolute maximum you can get over a 1 Gb (note that is gigaBIT not gigaBYTE) network connection is ~120 megabytes/sec, and with typical filesystem tunings, real-world performance is substantially less than that. If you properly tune an NFS setup to a centralized SAN, you can probably achieve 100 MB/sec, but if the SAN has a 1GB connection than that 100 MB/sec is shared among all of the workstations in the lab, so if you are doing very data intensive processing from 5 workstations at the same time, your maximum bandwidth is likely ~20 MB/sec per computer, or about 1/8 the speed of a decent single local hard drive. 

A common setup for "big data" workstations nowadays is an 8-drive RAID array with a PCIe RAID controller in the workstation. A setup like this with traditional spinning-platter hard drives can achieve real-world throughput of ~1200 MB/sec (more than 10x what you will get from a SAN solution even if you are the only user). For a lot of CryoEM processing this specific point doesn't really matter, since much of the data-intensive work is "bursty" and you are limited by processing time. However, with movie-mode imaging on direct detectors, disk bandwidth can EASILY be the limiting factor when doing alignment and averaging. For those purposes, using a SAN is a VERY bad option.
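
A similarly rough estimate of where the RAID figure comes from, assuming throughput scales more or less linearly with drive count (a decent PCIe controller only approximates this in practice):

# Naive aggregate estimate for an 8-drive striped RAID array.
DRIVE_MB_S = 150.0             # one spinning-platter drive
N_DRIVES = 8
SAN_SINGLE_USER_MB_S = 100.0   # tuned NFS to a SAN, one active user

raid_mb_s = DRIVE_MB_S * N_DRIVES   # ~1200 MB/sec
print(f"local RAID : ~{raid_mb_s:.0f} MB/sec")
print(f"SAN (1 Gb) : ~{SAN_SINGLE_USER_MB_S:.0f} MB/sec "
      f"(~{raid_mb_s / SAN_SINGLE_USER_MB_S:.0f}x slower)")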

Lustre and similar filesystems cannot get around the fundamental limit of the network connection speed, but they can provide more TOTAL bandwidth. If you have a central SAN storing your data and 10 workstations are simultaneously reading from it at full speed, each gets only 1/10 of the available bandwidth. With Lustre, however, the data is distributed among all 10 workstations instead of sitting on one central server, and in many cases each workstation can then read close to 100 MB/sec from the storage simultaneously. That is what makes it so attractive for clusters, where there may be 100 or 1000 computers (nodes) all trying to share the same filesystem at once.
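
Here is a toy model of why distributing the storage helps. It is only an illustration; it glosses over striping, metadata servers, and everything else that makes a real Lustre deployment work:

# Toy model: one central storage server vs. data spread across N hosts,
# every machine behind its own ~1 Gb (~100 MB/sec usable) link.
LINK_MB_S = 100.0

def central(n_clients):
    # One server link shared by every client reading at once.
    return LINK_MB_S / n_clients

def distributed(n_clients, n_storage_hosts):
    # Aggregate bandwidth grows with the number of storage hosts,
    # capped by each client's own network link.
    return min(LINK_MB_S, LINK_MB_S * n_storage_hosts / n_clients)

print(central(10))           # 10.0  -> ~10 MB/sec per workstation
print(distributed(10, 10))   # 100.0 -> ~100 MB/sec per workstation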

The reason I asked about the physical proximity of the computers is that network bandwidth is so often the limiting factor. You can buy a fast SAN that can do 2000 MB/sec, but if it is connected through a single gigabit network link, no matter what you do you won't get more than ~120 MB/sec out of it. This is where interfaces like InfiniBand and 10 Gb Ethernet come in. These can provide 10-100 times faster networking among computers, but you will pay for it.

At the NCMI, our central database server can do ~2000 MB/sec, but the individual network ports at BCM are only gigabit (with a 10 Gb backbone). So we paid ~$10,000 for a single 10 Gb fiber connection to the database server. The individual microscope and desktop computers still have only 1 Gb connections and are limited to ~120 MB/sec each, but roughly 10 machines can draw that bandwidth simultaneously before saturating the server. The 10 Gb connection was so expensive because we had to buy into "enterprise" grade hardware compatible with BCM's network backbone.
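
The same kind of arithmetic for that setup, using round hypothetical numbers rather than measurements of our actual server:

# How many 1 Gb clients a 10 Gb server uplink can feed at full speed.
CLIENT_LINK_MB_S = 120.0    # usable MB/sec on a 1 Gb port
SERVER_LINK_MB_S = 1200.0   # roughly 10x that on the 10 Gb fiber
STORAGE_MB_S = 2000.0       # what the storage behind it can sustain

# The server can never deliver more than the slower of its storage
# and its network link.
ceiling = min(STORAGE_MB_S, SERVER_LINK_MB_S)

print(int(ceiling // CLIENT_LINK_MB_S))   # ~10 clients before saturation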

If you have a small number of computers in reasonable proximity to each other, you can set up your own little 10 Gb or InfiniBand network among those machines for faster shared access to the storage at considerably lower cost, but they do need to be physically close together. Their connection to anything outside that little group would still run at the slower 1 Gb speed.

For general non-data-intensive processing, there is absolutely nothing wrong with NFS, particularly if your machines are Linux or Mac. The other "modern" solutions each have their advantages, but will take a lot more effort and knowledge to configure and keep operational.

Honestly (again, at the NCMI) we have largely abandoned shared filesystems within the lab, and really only use the concept on the clusters (we operate five of them). It works out to be more cost-effective overall to simply give each person a decent workstation with fast local storage; they then copy their data to the cluster for final processing. We also have a couple of dedicated machines with high-end hardware for things like DDD movie processing. While this setup does mean the individual workstations at people's desks are more expensive, the savings in setting up and maintaining a high-performance shared filesystem (there is a time cost as well) balance this out, and each person gets a more powerful machine to work on.

Anyway, there are a lot of possible ways to go, and much will depend on both your financial resources and what sort of skilled manpower you have available to configure machines and networks. Keep in mind that a skilled sysadmin in most places will draw $50-100k/year. If your department provides good system administration for you, or if you have really computer-savvy students/postdocs, maybe you will opt for a more labor-intensive solution with lower up-front costs; otherwise you need to figure this cost into your estimates for any solution you propose.

As to the earlier comment about the big performance boost when the cluster switched to InfiniBand and a better filesystem, I strongly suspect the faster processing they observed was almost entirely due to the higher-speed MPI communication among nodes, not faster storage access. You'd have to ask Sjors to be sure, but I believe Relion processing is really not very disk-intensive at all.

----------------------------------------------------------------------------
Steven Ludtke, Ph.D.
Professor, Dept. of Biochemistry and Mol. Biol.                Those who do
Co-Director National Center For Macromolecular Imaging	           ARE
Baylor College of Medicine                                     The converse
sludtke at bcm.edu  -or-  stevel at alumni.caltech.edu               also applies
http://ncmi.bcm.edu/~stevel

> On Jan 3, 2016, at 2:05 PM, Reza Khayat <rkhayat at ccny.cuny.edu> wrote:
> 
> Thanks to Steve Ludtke and Bob Sinkovits for their fast reply. In response to Steve's questions:
> 
> 1. The group's workstations will be accessing the file systems. I will not be tackling the issue of building and maintaining a cluster. I'll leave that to the more technically apt. 
> 
> 2. There are currently no external storage capabilities (if that's what you're asking). I do have the needs and the funds to establish such a system, and I'd be grateful for any suggestions.
> 
> 3. Apologies, I'm reaching the limits of my networking knowledge. Ookla's Speedtest gives me 500 Mbps for both upload and download; however, my particular workstation is a bit old and I don't use it for data processing. The other workstations have Intel 82574L Gigabit Ethernet controllers. All are located in different rooms on the same floor, and each is connected to its own Ethernet switch on that floor.
> 
> Thanks again.
> 
> Best wishes,
> Reza
> 
> Reza Khayat, PhD
> Assistant Professor
> City College of New York
> Department of Chemistry
> New York, NY 10031
> 
> ________________________________________
> From: 3dem <3dem-bounces at ncmir.ucsd.edu> on behalf of Sinkovits, Robert <sinkovit at sdsc.edu>
> Sent: Sunday, January 3, 2016 1:46 PM
> To: 3dem at ncmir.ucsd.edu
> Subject: Re: [3dem] File systems
> 
> Btw, I obviously meant NFS and not NSF in my previous email
> 
> On 1/3/16 10:43 AM, "Sinkovits, Robert" <sinkovit at sdsc.edu> wrote:
> 
>> Hi Reza,
>> 
>> I'm no longer working in 3DEM, but felt that I could make some comments on
>> file systems. We use NSF for users' home directories and Lustre where we
>> need performance. You already know the limitations of NSF, but if you're
>> not facing any scaling issues or performance bottlenecks, I wouldn't feel
>> compelled to switch.
>> 
>> Lustre is not for the faint of heart! While it's free and can deliver
>> great performance, we often find that it requires a good deal of expert
>> babysitting. When we have problems with the nationally allocated systems
>> at SDSC, nine times out of ten it's due to Lustre issues.
>> 
>> If you really need a parallel file system and have a budget to pay for it,
>> you may want to think about GPFS. One of my colleagues who recently
>> returned from NCAR said that they used GPFS on their big (multi-petaflop)
>> systems and found that it was rock solid.
>> 
>> I haven't had a chance to work with the other new filesystems that you
>> mentioned, but do hear a lot of good things about Ceph. It's open source
>> and there is a community of developers. The one downside (pulled verbatim
>> from Wikipedia) is
>> 
>> "Ceph currently lacks standard file system repair tools, and the Ceph user
>> documentation currently does not recommend storing mission critical data
>> on this architecture because it lacks disaster recovery capability and
>> tools."
>> 
>> Of course, I would expect mission-critical data (e.g., raw image data)
>> to be backed up somewhere else just to be safe.
>> 
>> -- Bob
>> 
>> Robert Sinkovits, Ph.D.
>> Director Scientific Applications Group
>> San Diego Supercomputer Center
>> University of California, San Diego
>> 
>> 
>> 
>> 
>> On 1/3/16 9:53 AM, "Reza Khayat" <rkhayat at ccny.cuny.edu> wrote:
>> 
>>> Hi,
>>> 
>>> Can anyone describe some of their experience with deploying and using a
>>> distributed filesystem for image analysis? Is it appropriate to say that
>>> NFS is antiquated, slow and less secure than the younger systems like
>>> Lustre, Gluster, Ceph, PVFS2, or Fraunhofer?
>>> 
>>> Best wishes,
>>> Reza
>>> 
>>> Reza Khayat, PhD
>>> Assistant Professor
>>> City College of New York
>>> 85 St. Nicholas Terrace CDI 12308
>>> New York, NY 10031
>>> (212) 650-6070
>>> www.khayatlab.org
>>> 
>>> _______________________________________________
>>> 3dem mailing list
>>> 3dem at ncmir.ucsd.edu
>>> https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem
>> 
>> _______________________________________________
>> 3dem mailing list
>> 3dem at ncmir.ucsd.edu
>> https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem
> 
> _______________________________________________
> 3dem mailing list
> 3dem at ncmir.ucsd.edu
> https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem


