Re: pps or other unknown upper bound?

On Thu, Nov 17, 2005 at 04:52:40PM -0500, Jon Hart wrote:
> Bingo.  There are entries in the logs when this condition happens but it
> is not entirely clear what the problem aside from the fact that it is
> a "BAD STATE":
> Nov 17 21:44:48 fw-1 /bsd: pf: BAD state: TCP
> [lo=3722728956 high=3722735388
>    win=6432 modulator=4006337120 wscale=0] [lo=3737716700
>    high=3737723132 win=6432 modulator=3433376110 wscale=0] 9:9
>    S seq=3723083242 ack=3737716700 len=0 ackskew=0 pkts=5:5 dir=in,fwd
> Nov 17 21:44:48 fw-1 /bsd: pf: State failure on: 1       | 5  

The address/port pairs are probably clear (there are three pairs, two of
which are equal unless the state involves translation).
What you see is the existing state pf has and the packet that was
associated with it (based on the source/destination addresses/ports),
but failed the sequence number checks.
The square brackets [] contain the sequence number windows the state
allows. The digits 1 and 5 in the last part indicate which window rules
were violated: the packet's seq=3723083242 is higher than the upper
limit high=3722735388. That's why the packet is blocked.
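
To make the failed check concrete, here is a minimal Python sketch of
the comparison, using the numbers from the log above. It is only an
illustration, not pf's code: the helper name seq_gt is made up, and the
wraparound-aware comparison is an assumption about how 32-bit TCP
sequence numbers are usually compared.

```python
MASK32 = 0xFFFFFFFF


def seq_gt(a, b):
    """Return True if sequence number a lies after b, modulo 2**32."""
    # Hypothetical helper: treats a forward distance of less than
    # half the sequence space as "greater than".
    return 0 < ((a - b) & MASK32) < 0x80000000


# Values copied from the log entry.
window_high = 3722735388   # upper limit of the first [lo..high] window
packet_seq = 3723083242    # seq of the blocked SYN

beyond_window = seq_gt(packet_seq, window_high)
overshoot = (packet_seq - window_high) & MASK32  # 347854
```

The new SYN starts 347854 sequence numbers past the upper limit the old
state still accepts, so it cannot match the old connection's window.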
Now, the theory is that the client is reusing the source port 59635
before the time-wait of the previous connection (which the state we see
represents) is over.
The 'S' part means the blocked packet was a SYN.
The '9:9' part means the FINs were exchanged and ACKed in both
directions, so the connection was closed normally (and no RSTs were
involved).
'pkts=5:5' means that the prior connection consisted of only 5 packets
in each direction.
This all makes sense. Assuming you're fetching a tiny document from the
web server in a fast loop, the client will run out of random source
ports. It's probably honouring 2MSL up to the point where it simply has
no choice (other than stalling further connect(2) calls until ports
free up).
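
A rough back-of-the-envelope sketch of that exhaustion, in Python. The
pool size (16384 ports) and the hold time (60 seconds of 2MSL) are
assumptions picked for illustration, since the thread states neither.
Only the rate of roughly 32,000 connections per minute comes from this
discussion, and the function names are made up.

```python
def max_sustainable_rate(pool_size, hold_seconds):
    """Connections per second a client can open without reusing a
    port that is still held in TIME_WAIT."""
    return pool_size / hold_seconds


def seconds_until_exhausted(pool_size, hold_seconds, rate_per_second):
    """Seconds until every port is in TIME_WAIT, or None if the pool
    never runs dry at this rate."""
    if rate_per_second * hold_seconds <= pool_size:
        return None
    # Above the sustainable rate the pool empties before the first
    # port has finished its hold time, so nothing is freed meanwhile.
    return pool_size / rate_per_second


ASSUMED_POOL = 16384        # assumption: size of the source port range
ASSUMED_HOLD = 60           # assumption: 2MSL in seconds
rate = 32000 / 60           # about 533 connections per second

limit = max_sustainable_rate(ASSUMED_POOL, ASSUMED_HOLD)
dry_after = seconds_until_exhausted(ASSUMED_POOL, ASSUMED_HOLD, rate)
```

Under these assumptions the client can sustain about 273 connections
per second, and at 533 per second it has used every port after roughly
31 seconds. From then on it must either stall or reuse a port early.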
I think the real solution in this case is to re-think the application
protocol. If the application re-connects to the server at this rate
(like 32,000 connections per minute), it's wasting a lot of network
bandwidth (for connection establishment and tear-down) and accumulating
a lot of latency. It would be much smarter to use one persistent
connection and pass multiple transactions over that. Maybe SOAP supports
that (if not, is it authenticating 32,000 times per minute, too? ;)
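
As an illustration of the persistent-connection idea, here is a small
Python sketch that sends several HTTP requests over one TCP connection,
so only one source port is used. It assumes the application speaks
HTTP/1.1 with keep-alive, which the thread does not confirm. TinyHandler,
fetch_many and demo are made-up names, and the local test server stands
in for the real web server.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class TinyHandler(BaseHTTPRequestHandler):
    """Stand-in server that answers every GET with a tiny document."""
    protocol_version = "HTTP/1.1"  # needed for keep-alive

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_many(host, port, paths):
    """Fetch all paths over a single connection; also return the set
    of local source ports that were used."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    bodies, local_ports = [], set()
    try:
        for path in paths:
            conn.request("GET", path)
            local_ports.add(conn.sock.getsockname()[1])
            bodies.append(conn.getresponse().read())
    finally:
        conn.close()
    return bodies, local_ports


def demo():
    """Run three requests against a throwaway local server."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), TinyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_many("127.0.0.1", server.server_address[1],
                          ["/a", "/b", "/c"])
    finally:
        server.shutdown()
        server.server_close()
```

All three requests share one source port here, so the firewall sees a
single state instead of one per transaction.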
If you want to adjust pf so it will expire the states earlier, you can
lower the tcp.closed timeout value (from the default 90s to 1s). Expired
states are only removed in intervals (default is 10s, adjustable), so if
you lower a timeout to < 10s, you probably also want to lower the
interval accordingly (that may increase CPU load if you have many states).
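
As a sketch, the corresponding pf.conf lines could look like this. The
value of 1 second for both settings is only the example figure from
above, not a recommendation, so pick values that suit your setup:

```
# Expire fully closed TCP states after 1s instead of the default 90s.
set timeout tcp.closed 1
# Purge expired states every 1s instead of the default 10s.
set timeout interval 1
```

After reloading the ruleset, pfctl -s timeouts shows the values in
effect.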
The reason we keep a state in FIN_WAIT or TIME_WAIT is that there might
be spurious packets arriving late (like packets that travelled through
slower alternative paths across the network). By keeping the state
entry, those are associated with the state and don't cause pflog logging
(they'd usually not have a SYN flag and would get blocked and logged
according to your policy). So you'll likely not break anything by
lowering the timeout values, but you might be getting some more packets
logged as ordinary blocks.