[NCLUG] system hangs for 5-10 seconds every few minutes

Sun Sep 2 12:07:49 MDT 2007

Marcio Luis Teixeira wrote:
> Try using a script to execute "top" every few seconds. Something like:
>
>    while true; do top -n 1 > "`date`"; sleep 1; done
>   
The problem with top is that it doesn't tell you much about why the 
processes are stalling. Better to use "ps -lax" instead, so you at least 
can see which wait channel things are getting suspended on.
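
A rough sketch of what I mean, assuming the procps version of ps, where 
STAT is the 10th column of "ps -lax" output. The function names and the 
log directory are just made up for illustration:

```shell
# Save one timestamped "ps -lax" snapshot into directory $1 and print its path.
snapshot() {
    out="$1/ps-$(date +%Y%m%d-%H%M%S).txt"
    ps -lax > "$out"
    echo "$out"
}

# Print the header plus any process in uninterruptible sleep (STAT starts
# with D), so the WCHAN column shows what it is blocked on.
# Assumption: procps column order, where STAT is field 10.
stalled() {
    awk 'NR == 1 || $10 ~ /^D/' "$1"
}

# Example use (runs until interrupted):
#   mkdir -p /tmp/pslog
#   while true; do stalled "$(snapshot /tmp/pslog)"; sleep 5; done
```

Processes that show up in state D across several snapshots, always on 
the same wait channel, are the ones worth chasing.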

> Anyhow, in my prior experiences with computers (and not
> necessarily using Linux), periodic hanging like that is caused by
> bad hardware on a bus which is causing bus resets and stalled
> interrupt handlers. I've seen stuff like that with bad IDE devices, bad
> USB devices or bad SCSI devices. Checking syslog for errors is
> definitely a first step (and also run "dmesg"). Be suspicious of
> any bus type error messages.
>   

Few hardware devices fail like clockwork, i.e. failing every several 
minutes and then working normally for several minutes.

Bad device drivers that manage watchdog timers poorly will certainly 
produce nice regular failures, which, as you note, should make it out 
to syslog.

Since it started with a kernel upgrade, there is some possibility that 
the new release has a driver problem, and certainly that should be 
visible from syslog.
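
Something along these lines would pull the suspicious entries out. This 
is only a sketch: the keyword list is my guess at what is worth 
flagging, and the log file location varies by distribution 
(/var/log/messages and /var/log/syslog are the usual candidates):

```shell
# Print log lines that mention common driver/bus trouble keywords.
# With a file argument it scans that file; with no argument it reads stdin.
# The keyword list is illustrative, not exhaustive.
scan_errors() {
    grep -i -E 'error|timeout|reset|watchdog' "$@"
}

# Example use:
#   scan_errors /var/log/messages
#   dmesg | scan_errors
```

Compare the timestamps of whatever it finds against the times the 
system hangs.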

The small memory size, for a "modern" kernel, generally also means that 
the kernel has sized a lot of its dynamically allocated resources as 
percentages of available memory. With a lot of daemon processes running 
off cron and the system timer, regular bursts of activity are common, 
and they are another source of periodic problems.

Network stack issues are another. It's not uncommon to find that a bad 
ethernet port, cable, or switch causes packet loss. This results in 
problems like ARP timeouts, which cause periodic stalls while ARP has to 
hit the network interface a few times to refresh its IP-to-MAC table 
maps. Regular light packet loss in the low double digits will not 
directly stall ssh (as packets make it through after a few retries), but 
it will impact ARP and shrink TCP windows, with increasingly long retry 
backoffs reaching into the seconds range because of the congestion 
control protocol. It's not uncommon for these kinds of network errors to 
be data dependent ... failing reliably for certain data patterns, and 
passing most others with little problem.
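
One quick way to see whether an interface is quietly dropping packets 
is to watch the error counters in /proc/net/dev. A sketch, assuming the 
usual Linux layout of that file (two header lines, then one line per 
interface); the function name is made up:

```shell
# Print per-interface receive/transmit error and drop counters.
# $1 = file in /proc/net/dev format (defaults to /proc/net/dev itself).
iface_errors() {
    awk 'NR > 2 {
        sub(/:/, " ")   # some kernels print "eth0:1234" with no space
        print $1, "rx_errs=" $4, "rx_drop=" $5, "tx_errs=" $12, "tx_drop=" $13
    }' "${1:-/proc/net/dev}"
}

# Example use: run it twice a few minutes apart and compare the numbers.
#   iface_errors; sleep 300; iface_errors
```

Counters that keep climbing between runs point at the port, cable, or 
switch rather than at the kernel.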

The watchdog timers in the USB and SCSI stacks are generally not that 
predictable, and will largely generate hard errors for the service, 
resulting in process failures.

However, resource exhaustion caused by a few cron-initiated processes 
can cause stalls until they finish ... impacting the dynamic allocation 
in many message-structured driver subsystems (USB, SCSI, network, etc).
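
To check for that, log free memory and swap with timestamps and see 
whether the dips line up with both the cron schedule and the stalls. A 
sketch, assuming the Linux /proc/meminfo format; the function name and 
the log path are only for illustration:

```shell
# Print timestamped MemFree and SwapFree values.
# $1 = file in /proc/meminfo format (defaults to /proc/meminfo itself).
mem_snapshot() {
    awk -v ts="$(date '+%H:%M:%S')" '
        /^(MemFree|SwapFree):/ { printf "%s %s %s kB\n", ts, $1, $2 }
    ' "${1:-/proc/meminfo}"
}

# Example use (runs until interrupted):
#   while true; do mem_snapshot >> /tmp/mem.log; sleep 5; done
```

If free memory collapses at the same minute a cron job fires, and the 
hang follows, that is a strong hint.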

John