[3dem] Utilizing the Xeon Phi

Tue Nov 18 07:04:52 PST 2014

Dear all,

I can only agree with what Matthias wrote. I had a Phi 7120 in my hands for
a week and tried to evaluate it for highly parallelizable algorithms that
also map well to GPUs. While easier portability is one of Intel's
marketing claims, it appears to hold in only very few cases. None of
the popular CPU-based EM codes I have seen would experience significant
speed-ups without modifications. As for such modifications, porting from
CUDA C to OpenCL (which Phi will happily execute) is much easier than
modifying C++ code to make better use of Phi. However, while Phi and GPUs
are similar enough to be addressed properly by OpenCL, getting the most out
of either will require device-specific optimization. I am not sure if
getting the peak performance out of a Phi is even possible with OpenCL; it
certainly isn't for Nvidia GPUs (for more complex code).

Thus, you can choose among three options: no code changes and virtually no
speed-ups; porting to OpenCL for reasonable performance across many
platforms, but still under-utilizing every platform (and having a rather
poor development environment compared to CUDA/C++); or writing
device-specific code to achieve the maximum.

Then there is Intel MKL for common tasks, such as FFT or BLAS functions. It
is highly optimized for every architecture Intel currently offers,
including Phi, to an extent only Intel can afford. MKL can be used as a
straight drop-in for popular libraries like FFTW, so porting is just a
matter of changing library names. However, using Phi only for these tasks
doesn't make much sense, as it's connected over the same slow PCIe bus that
makes pushing data to a GPU such a pain. You would want to do more with the
data once they are in Phi's memory. It's really the same situation as with
CUDA, where Nvidia offers a comparable set of vendor-supplied, optimized
libraries.
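
To put rough numbers on the PCIe argument, here is a back-of-the-envelope
sketch in Python. Everything in it is an assumption chosen for illustration,
not a measurement: the 6 GB/s effective bus bandwidth, the 5x device
advantage, the 512^3 volume, and the function name offload_speedup itself.

```python
def offload_speedup(n_bytes, host_seconds, device_seconds, pcie_bytes_per_s):
    """Net speed-up of offloading a task, counting PCIe transfers both ways."""
    transfer_seconds = 2 * n_bytes / pcie_bytes_per_s  # to the card and back
    return host_seconds / (device_seconds + transfer_seconds)


# Assumed, illustrative numbers only:
n_bytes = 512**3 * 8      # 512^3 volume, single-precision complex (8 bytes)
pcie = 6e9                # effective PCIe bandwidth in bytes per second
host_s = 1.0              # time for one FFT-like step on the host
device_s = host_s / 5     # the same step on the card, assumed 5x faster

# One step per round trip over the bus:
single = offload_speedup(n_bytes, host_s, device_s, pcie)

# Ten steps on the card while the data cross the bus only once each way:
chained = offload_speedup(n_bytes, 10 * host_s, 10 * device_s, pcie)

print(f"single step offloaded: {single:.2f}x")
print(f"ten steps chained:     {chained:.2f}x")
```

Under these assumptions a single offloaded step gains well under 2x, while
chaining ten steps on the card gets much closer to the assumed 5x. That is
the reason to keep the data in device memory for more than one library call.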

Going Phi causes exactly the same problems as going GPU (specifically,
Nvidia & CUDA). It is only a question of whom you trust to deliver more
FLOPS per $ in the long run. Right now, Nvidia is doing a much better job,
but that can change if Intel ever decides to make Phi a first-class citizen
(1 release in 3 years – really?).

As for the scalability issues on AMD chips: Their current approach to
counting cores is selling a "module" with 1 float and 2 integer pipelines
as 2 cores. If your code can saturate the float pipeline with 1 thread, the
second "core" will be useless.
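
As a toy model of that last point (my simplification, not a description of
the actual hardware scheduler): if each thread can saturate a float pipeline
on its own, float-bound throughput stops growing once every module has one
thread. The function name and the 16-module example are hypothetical.

```python
def float_bound_speedup(threads, modules):
    """Idealized speed-up over one thread for purely float-bound work.

    Assumes one shared float pipeline per module and that a single
    thread saturates it, so extra threads on a module add nothing.
    """
    return min(threads, modules)


# Hypothetical machine sold as 32 "cores", i.e. 16 modules:
for threads in (1, 8, 16, 24, 32):
    print(threads, "threads ->", float_bound_speedup(threads, modules=16), "x")
```

Real codes mix integer and float work and are also limited by memory
bandwidth, so the measured knee can sit elsewhere.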

Best,
Dimitry

On Tue, Nov 18, 2014 at 11:14 AM, Matthias Wolf <matthias.wolf at oist.jp>
wrote:

> Hi Dewight, Alexis,
>
> Just to chime in after Alexis’ message: I did compile Frealign on a Phi
> 7120 about one year ago during a visit by Intel. While the procedure was
> straightforward, I did not attempt any non-standard optimizations. Out of
> the box, the performance was rather disappointing, and at the time I decided
> it was better to use standard multi-core Xeon processors.
>
> Compared to Xueming Li’s GPU version of Frealign, the Xeon Phi I tested
> was no competition. I use a 16-GPU box (8x Nvidia GTX 590 in a Tyan
> barebone), which accelerates the program ~1500-fold compared to a single
> 2.7 GHz Xeon core.
>
> While the concept of the Phi is nice (it feels like having a little Linux
> cluster in your PC to which you can ssh and run multi-threaded programs), it
> has clear limitations: the one I tested had only 16 GB of memory, which makes
> large reconstructions problematic. The 61 (Intel Pentium-derived) cores per
> board run at only 1 GHz and they have a small cache. Now this is not much
> different from GPUs, but there are many more cores on most GPUs. Maybe with
> the right optimizations, the Phi would be a worthy adversary, but I did not
> have the time to find out.
>
> Regarding Intel vs AMD, I agree 100% with Steve Ludtke’s statements. I
> tested a 32-core Opteron system against the latest quad-core Xeon a couple
> of years ago and, while roughly comparable in single-threaded performance,
> the Xeon scaled linearly with the number of threads (frealign-mp), whereas
> the Opteron quickly saturated (more than 12 cores were useless) and its
> performance was significantly lower. I believe this has to do with AMD’s
> HyperTransport interconnect having lower bandwidth than Intel’s QPI. In
> particular, the E-series Xeons are really very good.
>
> Finally (this came up in a previous thread): there is no problem
> operating Nvidia gaming GPUs in headless mode with Linux. My box is
> sitting in a rack in the datacenter and I simply ssh to it. Actually, the
> gaming cards use the same chips as their corresponding Quadro or Tesla
> relatives, minus ECC memory. They are usually clocked even higher than the
> more expensive “professional” cards, but the chief difference is that the
> GTX series has less memory. So unless you need quad-buffered graphics for
> windowed stereo and a lot of memory, there is no point in buying anything
> else. The main issues are feeding them with data, which requires an SSD
> RAID, and providing sufficient current. Cooling can be improved by removing
> their on-board fans in a good rack-mounted case, which brings the temps
> down by 20-30 °C.
>
>    Matthias
>
> _______________________________________________________
> Matthias Wolf, PhD MPharm - Assistant Professor
> Molecular Cryo-Electron Microscopy Unit
> Okinawa Institute of Science and Technology Graduate University
> 1919-1 Tancha, Onna-son, Kunigami-gun
> Okinawa 904-0495, Japan
> Phone +81-(0)98-966-8987
>
> From: 3dem-bounces at ncmir.ucsd.edu [mailto:3dem-bounces at ncmir.ucsd.edu]
> On Behalf Of Alexis Rohou
> Sent: Tuesday, November 18, 2014 2:11 PM
> To: 3dem at ncmir.ucsd.edu
> Subject: Re: [3dem] Utilizing the Xeon Phi
>
> Dear Dewight,
>
> As far as I know, none of the 3DEM packages have been adapted to run on
> Phi boards. This means you could run them (provided you recompiled them
> using the Intel compilers) but only in native mode, which involves SSH'ing
> onto the boards. And even then, without optimization, you'd probably get
> worse performance than on a top-of-the-range Xeon chip. However I guess if
> you pack enough cards per node you might get improved density for your
> cluster.
>
> The topic of Phi boards was brought up at the NRAMM meeting last week at
> Scripps and it seemed no-one had tried them yet.
>
> Here at Janelia we bought a Phi 7200 to test out, but haven't got round to
> doing much with it because of the time required to investigate program
> optimization and the relatively meager prospective gains.
>
> So, bottom line: don't go for a cluster with Phi boards, because none of
> the 3DEM software will be ready for them.
>
> Hope this helps,
> Alexis
>
> On 11/12/2014 10:08 AM, Dewight R. Williams wrote:
>
> Dear 3dem,
>
> Has anyone performed 3D single particle reconstruction on the new Intel
> Xeon Phi boards? When you performed this work, did the software need to be
> recompiled or was it implemented through standard Open MPI? What software
> were you using: Frealign, Relion, Xmipp, EMAN2, etc.? Thanks. I’m debating
> which architecture I want to invest in for a local cluster, and any feedback
> on these questions would be much appreciated.
>
> Dewight
>
>  _______________________________________________
>
> 3dem mailing list
>
> 3dem at ncmir.ucsd.edu
>
> https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem
>
>  --
> Alexis Rohou
> Research Specialist
> Grigorieff Lab
> http://grigoriefflab.janelia.org
> Tel. +1 571 209 4000 x3485
>