[NCLUG] system hangs for 5-10 seconds every few minutes

Daniel Herrington nclug at iherr.com
Tue Sep 4 21:40:06 MDT 2007


Thanks for all the responses and helpful suggestions! I'm still  
trying to narrow it down, but when I tried "ps lax" on a serial  
console during a stall observed over the wireless ssh connection, it  
went through just fine, and the "top" process had a dash ("-") in the  
WCHAN column, so it was running and not waiting on anything. Maybe
you guys are on to something with the wireless USB adapter. It's a
Belkin F5D6050 using the at76c503a driver from at76c503a.berlios.de,
which forced me to stay on the 2.6.20 kernel, since the driver
wouldn't compile against the 2.6.22 kernel.
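
For anyone who wants to catch the next stall in the act, here is a minimal sketch of a sampler that logs process states and wait channels once a second. The log path, the one-second interval, the `snapshot` helper name and the `run` argument are all assumptions for illustration, and it uses the procps `ps -eo` form rather than the "ps lax" I ran by hand:

```shell
#!/bin/sh
# Sketch only: periodically log process state and kernel wait channel.
# LOG and the 1-second interval are assumptions, not from the thread.
LOG="${LOG:-/tmp/wchan.log}"

snapshot() {
    # Timestamp each sample so it can be lined up with an observed stall.
    date '+=== %Y-%m-%d %H:%M:%S ==='
    # stat: process state (D = uninterruptible sleep)
    # wchan: kernel function the process is sleeping in ("-" if running)
    ps -eo pid,stat,wchan:24,comm
}

# Start the sampling loop only when asked, e.g.: sh wchan-log.sh run
if [ "${1:-}" = "run" ]; then
    while true; do
        snapshot >> "$LOG"
        sleep 1
    done
fi
```

Running it from the serial console rather than over ssh should keep the samples coming even while the wireless link is stalled. Afterwards, look at the samples whose timestamps fall inside a stall.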

Daniel


On Sep 2, 2007, at 12:07 PM, John L. Bass wrote:

> Marcio Luis Teixeira wrote:
>> Try using a script to execute "top" every few seconds. Something  
>> like:
>>
>>    while true; do top -b -n 1 > "$(date)"; sleep 1; done
>>
> The problem with top is that it doesn't tell you much about why the  
> processes are stalling. Better to use "ps -lax" instead, so you at  
> least can see which wait channel things are getting suspended on.
>
>> Anyhow, in my prior experiences with computers (and not
>> necessarily using Linux), periodic hanging like that is caused by
>> bad hardware on a bus which is causing bus resets and stalled
>> interrupt handlers. I've seen stuff like that with bad IDE  
>> devices, bad
>> USB devices or bad SCSI devices. Checking syslog for errors is
>> definitely a first step (and also run "dmesg"). Be suspicious of
>> any bus type error messages.
>>
>
> Few hardware devices fail like clockwork, i.e. failing every few
> minutes and then working for several minutes in between.
>
> Bad device drivers that do not manage watchdog timers well will
> certainly produce nice regular failures, which, as you note,
> should make it out to syslog.
>
> Since it started with a kernel upgrade, there is some possibility  
> that the new release has a driver problem, and certainly that  
> should be visible from syslog.
>
> The small memory size for a "modern" kernel generally also means
> that it has dynamically allocated a lot of resources based on
> percentages of available memory. With a lot of daemon processes
> running off cron and the system timer, regular bursts of activity
> are common, and are another source of periodic problems.
>
> Network stack issues are another. It's not uncommon to find that a
> bad ethernet port, cable, or switch causes packet loss. This results
> in problems like ARP timeouts, which cause periodic stalls while ARP
> has to hit the network interface a few times to refresh its
> IP-to-MAC table entries. Regular light packet loss in the low double
> digits will not directly stall ssh (as packets make it through after
> a few retries), but it will impact ARP and cause small TCP windows
> with increasingly longer retry backoffs, into the range of seconds,
> because of TCP congestion control. It's not uncommon for these kinds
> of network errors to be data dependent: failing reliably for
> certain data patterns, and passing most others with little problem.
>
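
The packet-loss theory above is easy to check from the affected box. A rough sketch, assuming a `ping` whose summary line contains the usual "N% packet loss" text; the `loss_percent` helper name and the gateway address are made up for illustration:

```shell
#!/bin/sh
# Sketch only: pull the packet-loss percentage out of ping's summary.
loss_percent() {
    # Reads ping output on stdin and prints just the number,
    # e.g. "12" for a summary line containing "12% packet loss".
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical usage; 192.168.1.1 stands in for the local gateway:
#   ping -q -c 100 192.168.1.1 | loss_percent
```

Repeating the measurement during and between stalls, and again over a wired link if one is available, would show whether the loss tracks the wireless adapter.
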
> The watchdog timers in the USB and SCSI stacks generally are not
> that predictable, and will largely generate hard errors for the
> service, resulting in process failures.
>
> However, resource exhaustion caused by a few cron-initiated
> processes can cause stalls until they finish, impacting the
> dynamic allocation in many message-structured driver subsystems
> (USB, SCSI, network, etc.).
>
> John
