Re: [squid-users] Squid Performance (with Polygraph) from Marcello Romani on 2007-11-09 (squid-users)

From: Marcello Romani <mromani@dont-contact.us>
Date: Fri, 09 Nov 2007 17:45:27 +0100

Dave Raven wrote:
> Hi Adrian,
>
> It works for the full 4 hours with a null cache directory. How would
> I see any kind of stats/information on disk IO? From the stats I can see so
> far, the disk stats don't change at all when it fails ...
>
> I'm currently using COSS, but I've also tried this with ufs and diskd (with
> the same results, just different times that it fails after).
>
> Thanks
> Dave
>
> -----Original Message-----
> From: Adrian Chadd [mailto:adrian@creative.net.au]
> Sent: Friday, November 09, 2007 3:35 PM
> To: Dave Raven
> Cc: squid-users@squid-cache.org
> Subject: Re: [squid-users] Squid Performance (with Polygraph)
>
> Rightio; this reads like you're running out of disk IO.
> Try running the test with a null cache dir and make sure the box can handle
> that load.
>
> Squid unfortunately had crap disk IO code for what's available these days.
>
>
>
>
> Adrian
>
> On Fri, Nov 09, 2007, Dave Raven wrote:
>> Hi all,
>> Okay I managed to do a lot more testing at the office today. Firstly
>> some of the questions asked --
>>
>> CPU Usage: The CPU usage is around 30% during the test, when the unit
>> begins to fail it actually goes down a bit.
>>
>> Mbufs/Clusters: All fine - they do rise quickly after the problem happens
>> but this is because the established network connections are still coming
>> in at 600 a second, but only being satisfied at a rate of say 200 a
>> second. The send queues then get big, and mbuf usage goes up - this is
>> not the cause of the failure though, it's a side effect. For the first x
>> minutes it's between 250 and 3000 mbufs (and clusters) used, and my max
>> is 65k/32k.
>>
>> As for system logs there are none - there is nothing suspicious anywhere
>> until the side effects kick in, e.g. mbufs running out etc. Squid also
>> logs nothing at all. I've also checked if I'm using too much memory and
>> that's not the case - swap is not used at all during the entire test.
>>
>> This is the process of what happens --
>>
>> 1. PolyClt + PolySrv begin, 800 RPS.
>>
>> 2. ESTABLISHED netstat connections are around 2000 once 800RPS is reached
>> (about 20 seconds). CPU load is 30%, mbufs are available etc.
>>
>> 3. Once memory becomes full (quickly) disk drive usage begins - squid -z
>> puts the TPS per drive at well over 1000/s when I run it, when the cache
>> is doing 800 RPS the TPS is about 30 per drive (low..).
>>
>> 4. After a period of time (almost always the same (+/- 60 seconds)
>> depending on RPS) the ESTABLISHED connections start rising, at the exact
>> same time the PolyClt starts showing less RPS. This is the "slow down"
>> as such.
>>
>> 5. Because of this, polyclt continues to send requests which the unit
>> continues to accept - quickly all available sockets are used, and the unit
>> will then crash
>>
>> Interestingly enough though - if I stop the polyclt when this happens and
>> restart it - in under 10 seconds - it continues on for another x minutes
>> without problem. If I leave it running the unit never comes right.
>>
>> I have used "systat -vmstat 1", "systat -tcp 1", "systat -iostat 1" and
>> all the stats from Munin, and a MRTG graphing config for squid and they
>> all show nothing of interest. The only result that changes between working
>> time and slow down is that the connections go through the roof as
>> explained above...
>> I have also seen it fail at 300RPS, but only after 82 minutes - which
>> seems like a very long time if it was going to fail because of disk load.
>> The entire time the disks are very underloaded. That said, if I use a
>> null cache directory this doesn't happen....
>>
>> I know that sounds like it's clearly the drives - but 82 minutes ??
>>
>> Thanks for all the help
>> Dave
>>
>> -----Original Message-----
>> From: Adrian Chadd [mailto:adrian@creative.net.au]
>> Sent: Friday, November 09, 2007 11:55 AM
>> To: Dave Raven
>> Cc: 'Adrian Chadd'; squid-users@squid-cache.org
>> Subject: Re: [squid-users] Squid Performance (with Polygraph)
>>
>> Check netstat -mb and see if you're running out of mbufs?
>> You haven't mentioned whether the CPU is being pegged at this point?
>>
>>
>>
>> Adrian
>>
>> On Fri, Nov 09, 2007, Dave Raven wrote:
>>> Hi all,
>>> Okay I've done some of what you requested, and unfortunately failed
>>> to find anything specific. I can pretty much guarantee the times at
>>> which the requests will slow down now. 600 RPS = 15 minutes, 800 RPS =
>>> 11 minutes, 400 RPS = ~80 minutes.
>>>
>>> During that time (before and during the problem) systat -vmstat 1 shows
>>> the same interrupts - about 4000 on em1 (ifac) and 250 on hptmv0 - my
>>> controller for the SATA drives.
>>>
>>> If I use a systat -iostat 1 I can see that none of the drives are 100%
>>> utilized at any time during the test. Systat -tcp 1 also doesn't show me
>>> anything out of the ordinary. I have set up munin to monitor the host but
>>> unfortunately it's not showing much.
>>>
>>> Also the problem is that when the problem begins, it starts filling up
>>> network connections - once it fills all the available ports nothing can
>>> monitor it :/
>>>
>>> I'm going to try using a different network card, then a different
>>> motherboard etc - try some different setups today. Thanks again for all
>>> the help and please let me know if anyone has any ideas...
>>>
>>> Thanks
>>> Dave
>>>
>>> -----Original Message-----
>>> From: Adrian Chadd [mailto:adrian@creative.net.au]
>>> Sent: Friday, November 09, 2007 4:08 AM
>>> To: Dave Raven
>>> Cc: squid-users@squid-cache.org
>>> Subject: Re: [squid-users] Squid Performance (with Polygraph)
>>>
>>> On Thu, Nov 08, 2007, Dave Raven wrote:
>>>> Hi Adrian,
>>>> What would cause it to fail after a specific time though - if the
>>>> cache_mem is already full and it's using the drives? I would have
>>>> thought it would fail immediately?
>>>>
>>>> Also there are no log messages about failures or anything...
>>> Who knows :) it's hard without having remote access, or lots of logging/
>>> statistics to correlate the trouble times with.
>>>
>>> Try installing munin and graph all the system-specific stuff. See what
>>> correlates against the failure time. You might notice something, like
>>> out of memory/paging, or an increase in interrupts, or something. ;)
>>>
>>> That's all I can offer at the present time, sorry.
>>>
>>>
>>>
>>> Adrian
>>>
>>> --
>>> - Xenion - http://www.xenion.com.au/ - VPS Hosting - Commercial Squid
>>> Support -
>>> - $25/pm entry-level VPSes w/ capped bandwidth charges available in WA -
>

I'll spend my remaining 1 cent suggesting a full dump of the SMART
parameters before and after each test. Comparing how the SMART counters
change between the two dumps might turn up a clue... :-)
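
For what it's worth, here is a minimal sketch of that before/after
comparison. It assumes each dump is the text printed by "smartctl -A" for
one drive (from smartmontools) and that the attribute table has the usual
ten columns. The function names and the sample rows are made up for
illustration; they are not readings from Dave's drives.

```python
import re

# One row of the attribute table printed by `smartctl -A`:
# ID#, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
# Only the leading integer of the raw value is kept (some drives append text).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} parsed from one dump."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(6))
    return attrs


def diff_smart(before, after):
    """Return {name: (raw_before, raw_after)} for raw values that changed."""
    changed = {}
    for name, new in after.items():
        old = before.get(name)
        if old is not None and old != new:
            changed[name] = (old, new)
    return changed


# Hypothetical dumps, trimmed to three attributes.
HEADER = ("ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH "
          "TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n")
BEFORE = HEADER + (
    "  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0\n"
    "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0\n"
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3\n"
)
AFTER = HEADER + (
    "  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0\n"
    "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0\n"
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       41\n"
)

changes = diff_smart(parse_smart_attributes(BEFORE), parse_smart_attributes(AFTER))
for name, (old, new) in sorted(changes.items()):
    print(f"{name}: {old} -> {new}")
```

Save one dump per drive before the Polygraph run and one after it fails,
then feed the two files' contents to parse_smart_attributes(). Counters
that only moved during the failing run are the ones worth a closer look.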

HTH

-- 
Marcello Romani
IT Manager
Ottotecnica s.r.l.
http://www.ottotecnica.com

Received on Fri Nov 09 2007 - 09:45:41 MST
