<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
Hi Liz,
<div class="">With a machine like this, I suspect you purchased it from a vendor rather than building it yourself? If so, this seems like an issue for vendor support. Other things you omitted:</div>
<div class=""><br class="">
</div>
<div class="">- power supply model and ratings</div>
<div class="">- how many Xeon Gold processors?</div>
<div class="">- have you run a memory test?</div>
<div class=""><br class="">
</div>
<div class="">Power is the most likely culprit...</div>
<div class=""><br class="">
<div class="">
<div dir="auto" style="color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div dir="auto" style="color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div style="color: rgb(0, 0, 0); font-variant-caps: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;">
<font face="Courier" class=""><span style="font-size: 14px;" class="">--------------------------------------------------------------------------------------<br class="">
Steven Ludtke, Ph.D. &lt;<a href="mailto:sludtke@bcm.edu" class="">sludtke@bcm.edu</a>&gt;                      Baylor College of Medicine <br class="">
Charles C. Bell Jr., Professor of Structural Biology<br class="">
Dept. of Biochemistry and Molecular Biology                      (<a href="http://www.bcm.edu/biochem" class="">www.bcm.edu/biochem</a>)<br class="">
Academic Director, CryoEM Core                                        (<a href="http://cryoem.bcm.edu" class="">cryoem.bcm.edu</a>)<br class="">
Co-Director CIBR Center                                    (<a href="http://www.bcm.edu/research/cibr" class="">www.bcm.edu/research/cibr</a>)<br class="">
<br class="">
</span></font><br class="">
</div>
</div>
</div>
</div>
</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Mar 23, 2020, at 5:25 PM, Liz Kellogg &lt;<a href="mailto:lizkellogg@GMAIL.COM" class="">lizkellogg@GMAIL.COM</a>&gt; wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div dir="ltr" class="">
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
Hi 3dem-ers,</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<br class="">
</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
I hope everyone is safely at home.<span class="gmail-Apple-converted-space">  </span>
I have a non-COVID-19-related problem that I hope others can advise on.</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<br class="">
</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
I bought an 8-GPU server last April to be used by my lab for image processing work. Everything seemed fine initially; however, once I started getting more users, the server became noticeably unstable around December and started randomly rebooting itself. It was happening so often that at its worst we couldn’t get through a single refinement job without a reboot. Here are some technical details and hints at what could be going wrong:</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<br class="">
</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<u class="">Configuration of the server:</u></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
TYAN Thunder HX FT77D-B7109 8GPU 2P 4x3</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
Intel Xeon Gold 6138 20C 2.0-3.7 GHz</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
384 GB DDR4 2400/2666 ECC/REG (12x32GB)</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
Samsung 480GB 883 DCT SSD x 2</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
Seagate 12TB SAS x 16<span class="gmail-Apple-converted-space"> </span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
GeForce RTX-2080Ti 11 GB x 8</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<br class="">
</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
The most noticeable errors we see when the server is up are the GPU devices becoming undetectable, along the lines of:</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<br class="">
</div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none">$ nvidia-smi</span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none">Unable to determine the device handle for GPU 0000:B1:00.0: GPU is lost.  Reboot the system to recover this GPU</span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none"><br class="">
</span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; color: rgb(26, 26, 26); background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none">Or<span class="gmail-Apple-converted-space"> </span></span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; color: rgb(26, 26, 26); background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none"><span class="gmail-Apple-converted-space"><br class="">
</span></span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none">$ nvidia-smi</span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none">No devices were found</span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none"><br class="">
</span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
Replacing the GPUs, which we did back in January, did not seem to help; we are back to the same issues.<span class="gmail-Apple-converted-space"> </span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s2" style="font-kerning:none">We also tried updating the GPU drivers to
</span><span class="gmail-s1" style="font-kerning:none">NVIDIA-SMI 440.33.01    Driver Version: 440.33.01 (before they were 410.48)</span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none">However, we experience pretty much the same behavior before and after the driver update.<span class="gmail-Apple-converted-space"> </span></span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none"><span class="gmail-Apple-converted-space"><br class="">
</span></span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none">Since updating the drivers changed nothing, I doubt it’s a driver issue. It could be a PCI bus issue, but that doesn’t seem likely to me because each of the 8 cards tends to go down at random (during one strange episode, they were flickering on and off). My gut feeling is that there is either a power issue, where the system’s power supply was not sized properly (though looking at the chassis specs this seems unlikely as well), or a cooling issue. I am planning to monitor the GPU temperature under heavy load (I wrote a bash script using nvidia-smi -q) and see whether the current temperature exceeds the maximum temperature of each GPU.<span class="gmail-Apple-converted-space"> </span></span></div>
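<div class="">A minimal sketch of this kind of monitoring loop, offered as an illustration and not as the bash script mentioned above: it is written in Python and polls nvidia-smi's CSV query mode instead of parsing nvidia-smi -q. The names parse_gpu_csv, format_reading, query_gpus and log_forever are hypothetical, as is the 5-second interval. Logging power draw next to temperature may help tell a cooling problem from a power problem.</div>

```python
import subprocess
import time

# Properties requested from nvidia-smi via --query-gpu.
FIELDS = ["index", "temperature.gpu", "power.draw", "power.limit"]


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into a list of dicts, one per GPU.

    Values are kept as strings because nvidia-smi can report placeholders
    such as "[N/A]" instead of numbers.
    """
    readings = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip non-data lines such as "No devices were found"
        readings.append(dict(zip(FIELDS, values)))
    return readings


def format_reading(reading):
    """Render one GPU reading as a single log line."""
    return "GPU {} temp={}C power={}W limit={}W".format(
        reading["index"], reading["temperature.gpu"],
        reading["power.draw"], reading["power.limit"])


def query_gpus():
    """Run nvidia-smi once and return whatever it printed to stdout."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout


def log_forever(interval_s=5):
    """Print timestamped readings for every GPU until interrupted."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        readings = parse_gpu_csv(query_gpus())
        if not readings:
            print(stamp, "no GPUs reported", flush=True)
        for reading in readings:
            print(stamp, format_reading(reading), flush=True)
        time.sleep(interval_s)
```

<div class="">Calling log_forever() with its output redirected to a file on local disk would leave a record of the last readings before a reboot, which may show whether the cards were hot or near their power limit just beforehand.</div>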
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none"><span class="gmail-Apple-converted-space"><br class="">
</span></span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none">Any idea what could be going on? I think I have a pretty standard server config. Has anyone experienced similar problems? If you have a configuration that works well for you, would you mind sharing your specs and your NVIDIA driver versions? Even if it's exactly the same specs, that would help. Are there any non-standard steps to configure the machine or the drivers? I am mystified as to why we are experiencing these issues, and it doesn’t help that we’re all working from home at the moment :*(</span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none"><br class="">
</span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none">Thanks everyone, stay safe.</span></div>
<div style="margin: 0px; font-variant-numeric: normal; font-variant-east-asian: normal; font-stretch: normal; line-height: normal; font-family: Helvetica; background-color: rgba(255, 255, 255, 0);" class="">
<span class="gmail-s1" style="font-kerning:none"><br class="">
</span></div>
<div class="">
<div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">
<div dir="ltr" class="">
<div class="">
<div dir="ltr" class="">
<div class="">
<div class="">Best wishes,</div>
<div class=""><br class="">
Liz</div>
<div dir="ltr" class=""><br class="">
</div>
<div dir="ltr" class="">Elizabeth H. Kellogg, Ph.D.
<div class="">Assistant Professor, Cornell University</div>
<div class="">Molecular Biology and Genetics</div>
<div class=""><br class="">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
_______________________________________________<br class="">
3dem mailing list<br class="">
<a href="mailto:3dem@ncmir.ucsd.edu" class="">3dem@ncmir.ucsd.edu</a><br class="">
https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.ncmir.ucsd.edu_mailman_listinfo_3dem&d=DwICAg&c=ZQs-KZ8oxEw0p81sqgiaRA&r=Dk5VoQQ-wINYVssLMZihyC5Dj_sWYKxCyKz9E4Lp3gc&m=I-DCLW8-a8G9CUwNPojXeXKScOa76l1CszApIJVedsI&s=Xhi1X3aXYltnUN2kig-0Dgg45iF3QszB2XwpyImF4v0&e=
<br class="">
</div>
</blockquote>
</div>
<br class="">
</div>
</body>
</html>