[NCLUG] system hangs for 5-10 seconds every few minutes
Daniel Herrington
nclug at iherr.com
Tue Sep 4 21:40:06 MDT 2007
Thanks for all the responses and helpful suggestions! I'm still
trying to narrow it down, but when I tried "ps lax" on a serial
console during a stall observed over the wireless ssh connection, it
went through just fine, and the "top" process had a dash ("-") in the
WCHAN column, so it was running and not waiting on anything. Maybe
you guys are on to something with the wireless usb adapter. It's a
Belkin F5D6050 using the at76c503a driver from at76c503a.berlios.de,
which is why I'm on the 2.6.20 kernel: the driver wouldn't compile
with 2.6.22.
Daniel
On Sep 2, 2007, at 12:07 PM, John L. Bass wrote:
> Marcio Luis Teixeira wrote:
>> Try using a script to execute "top" every few seconds. Something
>> like:
>>
>> while true; do top -b -n 1 > "top-$(date +%Y%m%d-%H%M%S).log"; sleep 1; done
>>
> The problem with top is that it doesn't tell you much about why the
> processes are stalling. Better to use "ps lax" instead, so you can
> at least see which wait channel things are getting suspended on.
>
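If it helps anyone repeat this, here is a rough sketch of a filter for
"ps lax" output that keeps only the processes whose WCHAN column is not
a dash, i.e. the ones actually waiting on something. The function name
waiting_procs is made up for illustration, and it assumes the usual
procps column order (PID in column 3, WCHAN in column 9, STAT in column
10, COMMAND starting at column 13).

```shell
# Hypothetical helper: read "ps lax" output on stdin and print
# PID, WCHAN, STAT and the command name for every process that is
# waiting on something (WCHAN is not "-").
# Assumes the procps column order:
#   F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
waiting_procs() {
    awk 'NR > 1 && $9 != "-" { print $3, $9, $10, $13 }'
}
```

Something like "ps lax | waiting_procs" run from the serial console
during a stall should then show only the suspended processes and the
wait channel each one is sitting on.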
>> Anyhow, in my prior experiences with computers (and not
>> necessarily using Linux), periodic hanging like that is caused by
>> bad hardware on a bus which is causing bus resets and stalled
>> interrupt handlers. I've seen stuff like that with bad IDE
>> devices, bad USB devices or bad SCSI devices. Checking syslog for
>> errors is definitely a first step (and also run "dmesg"). Be
>> suspicious of any bus-type error messages.
>>
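As a starting point for the syslog/dmesg check, a coarse filter along
these lines might help. The function name bus_errors and the keyword
list are my own guesses, not anything the kernel guarantees. Real
messages vary by driver, so an empty result is inconclusive.

```shell
# Hypothetical helper: read kernel log text on stdin and keep lines
# that mention a bus or disk subsystem together with a word that
# suggests trouble. The keyword lists are assumptions and will both
# miss some real errors and flag some harmless lines.
bus_errors() {
    grep -iE '(^|[^[:alpha:]])(usb|ata|ide|scsi|hd[a-z]|sd[a-z])[^[:alpha:]].*(reset|error|timeout|timed out)'
}
```

Typical use would be "dmesg | bus_errors", or feeding it the syslog
file (the path depends on the distribution).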
>
> Few hardware devices fail like clockwork, i.e. failing every several
> minutes and then working for several minutes.
>
> Bad device drivers that manage watchdog timers poorly will
> certainly produce nice regular failures, which as you note
> should make it out to syslog.
>
> Since it started with a kernel upgrade, there is some possibility
> that the new release has a driver problem, and certainly that
> should be visible from syslog.
>
> The small memory size for a "modern" kernel generally also means
> that it has dynamically allocated a lot of resources based on
> available memory percentages. With a lot of daemon processes
> running off cron and the system timer, regular bursts of activity
> are common, and another source of periodic problems.
>
> Network stack issues are another. It's not uncommon to find that a
> bad ethernet port, cable, or switch causes packet loss. This results
> in problems like ARP timeouts, which cause periodic stalls while ARP
> has to hit the network interface a few times to refresh its
> IP-to-MAC table mappings. Regular light packet loss in the low
> double-digit percentages will not directly stall ssh (as packets
> make it through after a few retries), but it will impact ARP and
> cause small TCP windows with increasingly longer retry backoffs,
> into the seconds range, because of TCP congestion control. It's not
> uncommon for these kinds of network errors to be data dependent,
> failing reliably for certain data patterns and passing most others
> with little problem.
>
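To put a number on the packet loss, one option is to pull the
percentage out of the summary line that ping prints. This is only a
sketch: loss_pct is a made-up name, and it assumes the summary contains
the text "N% packet loss", as the Linux iputils ping does.

```shell
# Hypothetical helper: read ping output on stdin and print the
# packet loss percentage from the summary line, without the "%".
# Assumes a summary of the form "..., 15% packet loss, ...".
loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

For example, "ping -c 20 192.168.1.1 | loss_pct", with the address
replaced by the gateway or whatever host sits on the far side of the
suspect link (192.168.1.1 is just a placeholder).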
> The watchdog timers in the USB and SCSI stack generally are not
> that predictable, and will largely generate hard errors for the
> service, resulting in process failures.
>
> However, resource exhaustion caused by a few cron-initiated
> processes can cause stalls until they finish, impacting the
> dynamic allocation in many message-structured driver subsystems
> (USB, SCSI, network, etc.).
>
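One way to line stalls up against cron run times, assuming the machine
itself pauses rather than just the network path, is to log a timestamp
every second and then look for gaps. The sketch below is only
illustrative: find_gaps and the file name ticks.log are made-up names,
and the 3-second default threshold is arbitrary.

```shell
# Hypothetical helper: read epoch seconds on stdin, one per line
# (for example from "date +%s" run once a second), and report every
# gap longer than the threshold given as $1 (default: 3 seconds).
find_gaps() {
    awk -v max_gap="${1:-3}" '
        NR > 1 && $1 - prev > max_gap {
            printf "stall after %d lasting %d s\n", prev, $1 - prev
        }
        { prev = $1 }
    '
}
```

The log could come from something like
"while true; do date +%s; sleep 1; done > ticks.log" left running on
the affected box, followed by "find_gaps < ticks.log". The reported
times can then be compared with the cron entries in syslog.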
> John
> _______________________________________________
> NCLUG mailing list NCLUG at nclug.org
>
> To unsubscribe, subscribe, or modify
> your settings, go to:
> http://www.nclug.org/mailman/listinfo/nclug