[NCLUG] system hangs for 5-10 seconds every few minutes
John L. Bass
jbass at dmsd.com
Sun Sep 2 12:07:49 MDT 2007
Marcio Luis Teixeira wrote:
> Try using a script to execute "top" every few seconds. Something like:
>
> while true; do top -n 1 > "`date`"; sleep 1; done
>
The problem with top is that it doesn't tell you much about why the
processes are stalling. Better to use "ps -lax" instead, so you at least
can see which wait channel things are getting suspended on.
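A rough sketch of that idea, assuming a procps-style ps that accepts "-eo" with the stat and wchan columns. The function names and the column list are illustrative choices, not anything from the original thread. It keeps the header plus any process in uninterruptible sleep (state D), which is typically where a stall shows up:

```shell
#!/bin/sh
# Keep the header line plus processes whose STAT starts with D
# (uninterruptible sleep). Expects columns: PID PPID STAT WCHAN COMMAND.
filter_blocked() {
    awk 'NR == 1 || $3 ~ /^D/'
}

# Take one timestamped snapshot. The column list is an assumption
# (procps ps); "ps -lax" also shows WCHAN, but in a different layout.
blocked_snapshot() {
    date
    ps -eo pid,ppid,stat,wchan:20,comm | filter_blocked
}
```

Run in a loop, e.g. "while true; do blocked_snapshot >> wchan.log; sleep 1; done", this leaves a log whose WCHAN column can be compared against the times of the hangs.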
> Anyhow, in my prior experiences with computers (and not
> necessarily using Linux), periodic hanging like that is caused by
> bad hardware on a bus which is causing bus resets and stalled
> interrupt handlers. I've seen stuff like that with bad IDE devices, bad
> USB devices or bad SCSI devices. Checking syslog for errors is
> definitely a first step (and also run "dmesg"). Be suspicious of
> any bus type error messages.
>
Few hardware devices fail like clockwork, i.e. failing every several
minutes and then working for several minutes.
Bad device drivers that are not managing watchdog timers well will
certainly produce nice regular failures, which, as you note, should
make it out to syslog.
Since it started with a kernel upgrade, there is some possibility that
the new release has a driver problem, and certainly that should be
visible from syslog.
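One way to skim the kernel log for that kind of trouble is a simple filter. This is only a sketch: the pattern list is a guess at common driver and bus error wording, not an exhaustive or authoritative set, and the function name is made up for illustration.

```shell
#!/bin/sh
# Read log text on stdin and print lines that look like bus or driver
# trouble. The patterns are assumptions about typical message wording.
scan_driver_errors() {
    grep -iE 'reset|timeout|timed out|i/o error|watchdog|lost interrupt'
}
```

Typical use would be "dmesg | scan_driver_errors", or feeding it the syslog file (often /var/log/messages or /var/log/syslog, depending on the distribution) and comparing the timestamps of any hits with the times of the hangs.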
The small memory size, for a "modern" kernel, generally also means that
it has dynamically allocated a lot of resources based on percentages of
available memory. With a lot of daemon processes running off cron and
the system timer, regular bursts of activity are common, and are another
source of periodic problems.
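To check whether the hangs line up with scheduled jobs, it can help to log when they happen. A minimal sketch, assuming a date that supports "+%s" (GNU and BSD date do); the 3-second limit, the log format and the function names are arbitrary choices for illustration:

```shell
#!/bin/sh
# Succeeds when the gap between two epoch timestamps ($1 earlier,
# $2 later) is greater than the limit in seconds ($3).
stalled() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Sleep one second at a time and log any wakeup that came late.
# Runs until interrupted.
watch_stalls() {
    prev=$(date +%s)
    while sleep 1; do
        now=$(date +%s)
        if stalled "$prev" "$now" 3; then
            echo "$(date): stalled about $((now - prev - 1))s"
        fi
        prev=$now
    done
}
```

Leaving "watch_stalls >> stalls.log" running for an hour or so and comparing the logged times with the schedules in "crontab -l" and /etc/crontab should show whether the stalls follow a cron job.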
Network stack issues are another. It's not uncommon to find that a bad
Ethernet port, cable, or switch causes packet loss. This results in
problems like ARP timeouts, which cause periodic stalls while ARP has to
hit the network interface a few times to refresh its IP-to-MAC table
mappings. Regular light packet loss in the low double digits will not
directly stall ssh (as packets make it through after a few retries), but
it will impact ARP and cause small TCP windows with increasingly
longer retry backoffs, into the range of seconds, because of TCP
congestion control. It's not uncommon for these kinds of network errors
to be data dependent ... failing reliably for certain data patterns, and
passing most others with little problem.
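A quick way to put a number on the loss is to pull the percentage out of ping's summary line. This is a sketch that assumes the usual "N% packet loss" wording printed by iputils and BSD ping; the function name is made up, and the address in the usage note is a placeholder.

```shell
#!/bin/sh
# Extract the loss percentage from ping's summary line on stdin.
# Assumes the summary contains the text "N% packet loss".
loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

For example, "ping -c 20 192.0.2.1 | loss_percent", with 192.0.2.1 replaced by the local gateway. Since the errors can be data dependent, repeating the run with a different payload size (ping -s) may be worthwhile, and the interface error counters from "ip -s link" are worth a look as well.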
The watchdog timers in the USB and SCSI stacks generally are not that
predictable, and will largely generate hard errors for the service,
resulting in process failures.
However, resource exhaustion caused by a few cron-initiated processes
can cause stalls until they finish ... impacting the dynamic allocation
in many message-structured driver subsystems (USB, SCSI, network, etc.).
John