Tuesday June 13th, 2023 NCLUG Meeting

Bob Proulx bob at proulx.com
Wed Jun 14 01:56:13 UTC 2023


j dewitt wrote:
> What: Tuesday June 13th, 2023 NCLUG Meeting

Tonight we had a full house!  AWESOME!  Summer is here and people are
coming out.

Mory started things off with a very nice talk "ExaFlop Clusters Use
Linux".  I'll just note down some words about the supercomputers he
talked about.  "Just shy of 10,000 systems."  "I brought my compute
cluster tonight."  "Frontier ExaFlop AMD-HPE, a multi-million dollar
machine, has their documentation online.  Anyone can read all of the
details of it."  "Just shy of 39,000 GPUs."  "5TB of RAM. 12.8TBits/s
I/O transfer."

Warewulf is a computer cluster implementation toolkit that facilitates
the process of installing a cluster and long term administration.
Clusters run from RAM instead of from disk.  Load the OS into RAM and
then run.  Otherwise there will be too much disk failure.  And it is
all about speed.  PXE, TFTP, DHCP, NFS, no local disk storage.

    https://en.wikipedia.org/wiki/Warewulf
    https://warewulf.org/

Reboots are very slow when there is so much RAM.  So reboots are
avoided.  Instead Warewulf uses overlays which are live and created on
the fly.  Demo!  Mory used his compute cluster (a few laptops) to
demonstrate booting and loading the system into RAM.

Some complaints about the proprietary nvidia driver.  It was a pain to
make work in the overlay.  Since it is not part of the OS it has to be
installed separately.  But then it always must be installed
separately.  Not impossible.  Just more difficult to get going.  But
required in order to use the GPUs in the compute cluster.

The way things work is that there are chroots with the raw file system
for the OS.  But then that gets packed into .img files.  Then there
are overlays on top of them that contain the application.
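
To make the layering idea concrete, here is a minimal sketch in
Python.  It is not Warewulf's code or its image format.  It uses
gzipped tar archives as stand-ins for the .img files, and the names
pack_image and merged_view are made up for this illustration.  The
point is only the ordering: the base image goes down first, and files
from a later overlay replace files with the same path.

```python
import tarfile
from pathlib import Path


def pack_image(root: Path, image: Path) -> None:
    """Pack the regular files under root (our stand-in chroot) into one archive."""
    with tarfile.open(image, "w:gz") as tar:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to the chroot, e.g. "etc/hostname".
                tar.add(path, arcname=path.relative_to(root).as_posix())


def merged_view(layers: list[Path]) -> dict[str, bytes]:
    """Return path -> contents after applying the layers in order.

    A file in a later layer replaces the same path from an earlier one,
    which is the overlay behaviour this sketch is meant to show.
    """
    files: dict[str, bytes] = {}
    for layer in layers:
        with tarfile.open(layer, "r:gz") as tar:
            for member in tar.getmembers():
                if member.isfile():
                    files[member.name] = tar.extractfile(member).read()
    return files
```

Calling merged_view([base_img, overlay_img]) gives the base OS files
plus whatever the overlay adds or replaces.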

    https://openhpc.community/downloads/
    https://github.com/stanfordhpccenter/OpenHPC/tree/main/hpc-for-the-rest-of-us/recipes/rocky8/warewulf4/slurm

Warewulf manages all aspects of the cluster.  PXE boot.  DHCP.  DNS.
And on and on for everything that is needed to diskless boot each of
the cluster machines and make them available in the cluster.  Though
we had some conversation about NTP.  Time synchronization is critical
just the same, since nodes cannot authenticate if their time is offset
from the manager host.
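
The time requirement can be pictured as a simple tolerance check.
This is only a sketch in Python.  The talk did not give a number, so
the 300 second MAX_SKEW below is a hypothetical value, and
within_tolerance is a made-up name, not part of Warewulf or any
authentication tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tolerance; the real limit depends on the authentication
# service in use and was not stated in the talk.
MAX_SKEW = timedelta(seconds=300)


def within_tolerance(node_time: datetime,
                     manager_time: datetime,
                     max_skew: timedelta = MAX_SKEW) -> bool:
    """True if the node clock is within max_skew of the manager clock."""
    return abs(node_time - manager_time) <= max_skew


manager = datetime(2023, 6, 13, 19, 0, 0, tzinfo=timezone.utc)
print(within_tolerance(manager + timedelta(seconds=2), manager))
print(within_tolerance(manager - timedelta(minutes=10), manager))
```

A node that drifts past the tolerance in either direction fails the
check, which is why the nodes need a working time source before
anything else can talk to them.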

We had a little discussion about Intel HyperThreading.  It makes the
OS process scheduler more complicated.  It was amusing that we had at
least five people who voiced that they disable HT in order to improve
the total performance for high performance computing.  (And I am in
that camp too because when we benchmarked we were faster with HT off
than on and could get more simulations through.)  So many of us
disable HT as a matter of routine now.
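
For anyone who wants to check a node before benchmarking, here is a
small Python sketch.  It assumes a reasonably recent Linux kernel that
exposes the SMT state in sysfs at /sys/devices/system/cpu/smt/active.
The function name smt_active is made up for this example, and it
returns None when the file is missing or unreadable rather than
guessing.

```python
from pathlib import Path
from typing import Optional

# sysfs file reporting whether SMT (HyperThreading on Intel) is in use.
# Assumed to be present; older kernels and other OSes will not have it.
SMT_ACTIVE = Path("/sys/devices/system/cpu/smt/active")


def smt_active(path: Path = SMT_ACTIVE) -> Optional[bool]:
    """Return True/False for SMT on/off, or None if it cannot be determined."""
    try:
        text = path.read_text().strip()
    except OSError:
        return None
    if text == "1":
        return True
    if text == "0":
        return False
    return None


print("SMT active:", smt_active())
```

This only reports the current state.  Whether HT off is actually
faster for a given workload is something to benchmark, as we did.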

Mory was very enthused about supercomputing!  But all good things must
come to an end!  We had many new people so we decided to do a round
robin to give everyone who wanted to say something to the group a
chance to do so.  Then we adjourned the meeting.  Many of us then went
to Slyce Pizza afterward for dinner.

