<div dir="ltr">Hi, Liz<div><br></div><div>

Regarding your GPU error messages, can you recover after a reboot? And do the errors show up even when you are not using the machine much?  </div><div><br></div><div>I'd concur with Steve and Hideki. It is most likely a power problem. </div><div><br></div><div>In our experience with 8-GPU machines (we use a different model though), you'd better use 220V power outlets. With 110V outlets, even if you use different outlets, the total available power might still fall short of what the machine can draw when running at full load. </div><div><br></div><div><div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>-Clara</div><div>SingleParticle.com</div></div></div></div><br></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 23, 2020 at 3:31 PM Liz Kellogg &lt;<a href="mailto:lizkellogg@gmail.com">lizkellogg@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">





<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)">Hi 3dem-ers,</p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)"><br></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)">I hope everyone is safely at home.<span>  </span>I have a non-COVID-19-related problem that I hope others can advise on.</p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)"><br></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)">I bought an 8-GPU server last April for my lab's image processing work. Everything seemed fine initially; however, once I started getting more users, the server became noticeably unstable around December and started randomly rebooting itself. At its worst, it happened so often that we couldn’t get through a single refinement job without a reboot.<span>  </span>Here are some technical details and hints at what could be going wrong:</p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)"><br></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)"><u>Configuration of the server:</u></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)">TYAN Thunder HX FT77D-B7109 8GPU 2P 4x3</p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)">Intel Xeon Gold 6138 20C 2.0-3.7 GHz</p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)">384 GB DDR4 2400/2666 ECC/REG (12x32GB)</p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)">Samsung 480GB 883 DCT SSD x 2</p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)">Seagate 12TB SAS x 16<span> </span></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)">GeForce RTX-2080Ti 11 GB x 8</p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)"><br></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)">The most noticeable errors we see when the server is up are the GPU devices becoming undetectable, along the lines of:</p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)"><br></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none">$ nvidia-smi</span></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none">Unable to determine the device handle for GPU 0000:B1:00.0: GPU is lost.  Reboot the system to recover this GPU</span></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none"><br></span></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(26,26,26);background-color:rgba(255,255,255,0)"><span style="font-kerning:none">Or<span> </span></span></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(26,26,26);background-color:rgba(255,255,255,0)"><span style="font-kerning:none"><span><br></span></span></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none">$ nvidia-smi</span></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none">No devices were found</span></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none"><br></span></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;background-color:rgba(255,255,255,0)">Replacing the GPUs back in January did not seem to help; we are back to the same issues.<span> </span></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none">We also tried updating the GPU drivers to </span><span style="font-kerning:none">NVIDIA-SMI 440.33.01    Driver Version: 440.33.01 (before they were 410.48)</span></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none">However, we experience pretty much the same behavior before and after the driver update.<span> </span></span></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none"><span><br></span></span></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none">Since we have updated the drivers, I doubt that it’s a driver issue. Although it could be a PCI bus issue, that doesn’t seem likely to me because each of the 8 cards tends to go down randomly (during one strange episode, they were flickering on and off). My gut feeling is that there is either a power issue where the system’s power was not dimensioned properly (though looking at the chassis specs this seems unlikely as well), or a cooling issue. I am planning to monitor the GPU temperature (I wrote a bash script using nvidia-smi -q) under heavy load and see whether the current temperature exceeds the maximum temperature of each GPU.<span> </span></span></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none"><span><br></span></span></p>
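<p style="margin:0px;font-family:Helvetica">A minimal sketch of that kind of monitoring, written in Python rather than bash so the parsing can be checked without a GPU. It assumes the CSV query mode of nvidia-smi (<code>--query-gpu=index,temperature.gpu,power.draw</code>); the function names, the 5-second interval and the 85 °C limit are hypothetical choices for illustration, not values from the original script or from NVIDIA.</p>
<pre>
```python
import subprocess
import time

# Standard nvidia-smi query properties; adjust to taste.
QUERY_FIELDS = "index,temperature.gpu,power.draw"


def parse_gpu_csv(text):
    """Parse 'index, temperature, power' rows from nvidia-smi CSV output."""
    readings = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip messages such as "No devices were found"
        index, temp, power = parts
        try:
            readings.append(
                {"index": int(index), "temp_c": float(temp), "power_w": float(power)}
            )
        except ValueError:
            continue  # skip rows with non-numeric values such as "[N/A]"
    return readings


def hot_gpus(readings, limit_c):
    """Return the indices of GPUs at or above limit_c (an assumed threshold)."""
    return [r["index"] for r in readings if r["temp_c"] >= limit_c]


def sample():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY_FIELDS, "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_csv(result.stdout)


def monitor(interval_s=5.0, limit_c=85.0):
    """Print a timestamped reading every interval_s seconds until interrupted."""
    while True:
        readings = sample()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, readings, "hot:", hot_gpus(readings, limit_c), flush=True)
        time.sleep(interval_s)
```
</pre>
<p style="margin:0px;font-family:Helvetica">Calling <code>monitor()</code> with its output redirected to a file on disk would leave a log whose last lines before a crash show whether temperature or power draw was climbing. A missing GPU simply drops out of the list, which is also worth noting.</p>
<p style="margin:0px;font-family:Helvetica"><br></p>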
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none">Any idea what could be going on? I think I have a pretty standard server config. Has anyone experienced similar problems? If you have a configuration that works well for you, would you mind sharing your specs and your NVIDIA driver versions? Even if it's exactly the same specs, that would help. Are there any non-standard steps to configure the machine or the drivers? I am mystified as to why we are experiencing these issues, and it doesn’t help that we’re all working from home at the moment :*(</span></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none"><br></span></p>
<p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none">Thanks everyone, stay safe.</span></p><p style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;line-height:normal;font-family:Helvetica;color:rgb(0,0,0);background-color:rgba(255,255,255,0)"><span style="font-kerning:none"><br></span></p><div><div dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div>Best wishes,</div><div><br>Liz</div><div dir="ltr"><br></div><div dir="ltr">Elizabeth H. Kellogg, Ph.D.<div>Assistant Professor, Cornell University</div><div>Molecular Biology and Genetics</div><div><br></div></div></div></div></div></div></div></div></div>
_______________________________________________<br>
3dem mailing list<br>
<a href="mailto:3dem@ncmir.ucsd.edu" target="_blank">3dem@ncmir.ucsd.edu</a><br>
<a href="https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem" rel="noreferrer" target="_blank">https://mail.ncmir.ucsd.edu/mailman/listinfo/3dem</a><br>
</blockquote></div>
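<div><br></div>
<div>The power argument in the reply above can be made concrete with some rough arithmetic. The sketch below is only an estimate under stated assumptions: 250 W per RTX 2080 Ti and 125 W per Xeon Gold 6138 are nominal TDP figures, while the 10 W per disk, 150 W for everything else, 92% power-supply efficiency, 15 A breakers and 80% continuous-load derating are illustrative guesses, not measurements of this machine or of the building wiring.</div>
<pre>
```python
import math


def component_watts(gpus, gpu_w, cpus, cpu_w, disks, disk_w, other_w):
    """Sum the nominal DC-side load of the listed components."""
    return gpus * gpu_w + cpus * cpu_w + disks * disk_w + other_w


def wall_watts(dc_watts, psu_efficiency):
    """Estimate the draw at the outlet for a given PSU efficiency (0-1)."""
    return dc_watts / psu_efficiency


def circuit_watts(volts, amps, derate=0.8):
    """Continuous power one circuit can supply under an assumed derating."""
    return volts * amps * derate


def circuits_needed(load_w, per_circuit_w):
    """Number of separate circuits required to carry load_w."""
    return math.ceil(load_w / per_circuit_w)


# Assumed full-load figures for the configuration listed in the quoted post.
dc_load = component_watts(
    gpus=8, gpu_w=250, cpus=2, cpu_w=125, disks=16, disk_w=10, other_w=150
)
at_wall = wall_watts(dc_load, psu_efficiency=0.92)

for volts in (110, 220):
    per_circuit = circuit_watts(volts, amps=15)
    print(
        f"{volts} V: about {at_wall:.0f} W at the wall, "
        f"{per_circuit:.0f} W per circuit, "
        f"{circuits_needed(at_wall, per_circuit)} circuit(s) needed"
    )
```
</pre>
<div>Under these assumptions the machine needs more independent 110V circuits than 220V ones, and two outlets on the same breaker count as one circuit. Real transient peaks on the GPUs can exceed the nominal figures, so the estimate is a lower bound at best.</div>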