pps or other unknown upper bound?

This may be a pf issue, this may be an OpenBSD issue or this may be
a client issue, so let me apologize in advance.  
The setup is fairly simple -- a Debian machine hanging off each of
two interfaces on an OpenBSD -current box from 11/8 running pf.  Nothing
particularly complex about this setup.  Tight ruleset, aggressive
optimization, nothing funky, no queueing, no per-rule timeout
optimization, and the following options:

   set limit { states 500000, frags 100000 }
   set optimization aggressive
   set block-policy return
   set state-policy if-bound
   set require-order yes
   set skip on $LOCAL_IF
   set debug urgent
   set loginterface $FW_BACK_IF
   scrub all no-df random-id fragment reassemble

Yes, there is lots more to this ruleset but nothing is getting blocked
here and like I said, there are no rules in the remainder of the ruleset
that have anything related to timeouts, limits or the like.  In fact,
I actually added a specific rule as a test case that does this:

   pass in on $CLIENT_IF inet proto tcp from $CLIENT_NET to $SERVER_NET \
      port 12345 flags S/SA modulate state

I've also tried this same test on other pf installations (early 3.7)
that are vastly simpler and they behave identically.
There are no other rules relating to port 12345 and if I remove this
rule, the traffic gets blocked thanks to my default policy.
My test is simple.  While on $CLIENT_NET:

   while (true); do lynx -dump http://host.on.server.net:12345; date; done

Things spin up fast and go quickly for some number of seconds spewing
tens/hundreds of connections and then subsequent connections hang -- the
client sits in SYN_SENT and the server sits there with several hundred
connections in TIME_WAIT.  Exactly 45 seconds later, things come back to
life.  In the time 0s to time 45s, you can see the TIME_WAITs slowly
disappear, and then at 45s the loop comes back to life and the
connections rip through once again.  Some number of seconds later,
things freeze again and hang for 45s.  Both systems seem completely
usable -- I can ssh to/from them and do whatever I please.  I/O is doing
almost nothing, system is 98% idle, and interrupts seem fine.  Same with
the firewall -- not even breaking a sweat.
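
One way to watch the SYN_SENT and TIME_WAIT counts described above is to
tally netstat output by state on the client or server.  A minimal
sketch, assuming that "netstat -an" prints the TCP state as the last
field of each tcp line; count_states is a hypothetical helper name, not
something from the original test:

```shell
# count_states: tally TCP connections by state from netstat-style
# input on stdin.  Hypothetical helper; assumes the state is the last
# whitespace-separated field of every line that starts with "tcp".
count_states() {
    awk '$1 ~ /^tcp/ { n[$NF]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage (run on the client or the server while the loop is going):
#   netstat -an | count_states
```

Running it once a second during a stall should show whether the
TIME_WAIT count on the server drains over the same 45 seconds.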
Not that it really matters, but the way this problem originally popped up
was with SOAP calls from a Java client to a JBoss server.  I've
simplified the problem a bit by just using lynx on the client and Apache
on the server.
I do not know of any default settings in pf that would cause this.  My
second thought was the clients, but the problem is nonexistent when the
firewall is out of the picture.  I've twisted some sysctl knobs that
Henning and others have suggested in the past but none seem to have had
any effect.

The only thing so far that has seemed to affect the timeout was changing
pf's tcp.finwait.  When I changed that from aggressive's setting of 30s
to 10s, the timeouts went from a consistent 45s to 21s.  Aggressive has
that name for a reason so I'm hesitant to crank things any further in
that regard.
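
In pf.conf terms, the finwait change described above would look roughly
like the following.  This is a sketch rather than the exact lines used:
the explicit timeout replaces the value the aggressive preset would
otherwise apply, and 10 is simply the test value mentioned above, not a
recommendation.

```
# assumption: everything else in the ruleset stays as shown earlier
set optimization aggressive
set timeout tcp.finwait 10
```

After reloading the ruleset, the values actually in effect can be
checked with "pfctl -s timeouts".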
Any input, whether it's pf, OpenBSD or client related, would be much
appreciated.