Dori Sajdak
Update 1/18/18: See our research paper here and the HPC Wire article about it here
Dear CCR users,
Thank you for your patience over the last few days. As most of you probably suspect, our decision to change the system downtime and halt new jobs from starting was due to the recently announced Intel security flaws. More details can be found in these articles:
https://www.wired.com/story/critical-intel-flaw-breaks-basic-security-for-most-computers/
https://www.theregister.co.uk/2018/01/02/intel_cpu_design_flaw/
We felt the security implications were severe enough that we needed to apply patches as soon as possible. However, because the issue and its mitigation techniques were developing so rapidly, we wanted to hold off until we were sure the patches issued by operating system vendors would successfully address the problems. This morning we patched all the nodes of all the clusters, and everything is back in production.
You may be aware that some believe these patches will cause a decrease in system performance. We weighed the options and felt that, because CCR allows sharing of nodes, the security patches were a requirement, regardless of the potential performance hit we might see. We do not have definitive data on CCR’s systems yet, but results will vary depending on what applications you’re running and whether you’re running on one node or more than one. We have a robust application kernel monitoring system built into XDMoD (https://metrics.ccr.buffalo.edu/), so we will be able to track the performance differences in many of the more popular applications over time. Red Hat has released this statement based on preliminary testing on their HPC systems: https://access.redhat.com/articles/3307751
While these problems can only be completely fixed at the hardware level, we fully expect to see additional operating system patches released by vendors as they determine ways to improve system performance. We will apply these at every downtime, as we always do.
Thank you again for your patience! Please email ccr-help@buffalo.edu if you run into any problems or have questions about this update.