[3dem] weird server issues, suggestions/advice would be helpful!

Liz Kellogg lizkellogg at gmail.com
Mon Mar 23 15:25:13 PDT 2020


Hi 3dem-ers,


I hope everyone is safely at home.  I have a non-COVID-19-related problem
that I hope others can advise on.


I bought an 8-GPU server last April to be used by my lab for image
processing work. Everything seemed fine initially; however, once more users
came on board, the server became noticeably unstable around December and
started randomly rebooting itself. It was happening so often that, at its
worst, we couldn't get through a single refinement job without a reboot.
Here are some technical details and hints at what could be going wrong:


*Configuration of the server:*

TYAN Thunder HX FT77D-B7109 8GPU 2P 4x3

Intel Xeon Gold 6138 20C 2.0-3.7 GHz

384 GB DDR4 2400/2666 ECC/REG (12x32GB)

Samsung 480 GB 883 DCT SSD x 2

Seagate 12TB SAS x 16

GeForce RTX-2080Ti 11 GB x 8


The most noticeable errors we see while the server is up involve the GPU
devices becoming undetectable, along the lines of:


$ nvidia-smi

Unable to determine the device handle for GPU 0000:B1:00.0: GPU is lost.
Reboot the system to recover this GPU


Or


$ nvidia-smi

No devices were found


Replacing the GPUs, which we did back in January, did not seem to help; we
are back to the same issues.

We also tried updating the GPU drivers to version 440.33.01 (they were
previously 410.48).

However, we experienced pretty much the same behavior before and after the
driver update.


Since updating the drivers changed nothing, I doubt it's a driver issue. It
could be a PCI bus issue, but that doesn't seem likely to me because each of
the 8 cards tends to go down at random (during one strange episode, they were
flickering on and off). My gut feeling is that there is either a power issue,
where the system's power supply was not dimensioned properly (though looking
at the chassis specs this seems unlikely as well), or a cooling issue. I am
planning to monitor the GPU temperatures under heavy load (I wrote a bash
script using nvidia-smi -q) and see whether the current temperature of each
GPU exceeds its maximum.


Any idea what could be going on? I think I have a pretty standard server
config... has anyone experienced similar problems? For anyone whose
configuration works well, would you mind sharing your specs and your NVIDIA
driver versions? Even if they're exactly the same specs, that would help.
Did you take any non-standard steps to configure the machine or the drivers?
I am mystified as to why we are experiencing these issues... and it doesn't
help that we're all working from home at the moment :*(


Thanks everyone, stay safe.


Best wishes,

Liz

Elizabeth H. Kellogg, Ph.D.
Assistant Professor, Cornell University
Molecular Biology and Genetics

