
Re: pf/carp for redundant production use



On Sep 25, 2005, at 8:30 AM, Neil wrote:

Yep, the same behavior when the master dies. The solution that the person in #pf gave me is to use routing, but I don't know how to implement it. He told me that it's an issue in pf's NAT.

Bullshit.


Ok, here is the layman's description of the problem and the practical solution(s) to it. I'd love to be able to explain why interfaces recovering from INIT don't reclaim MASTER faster than they do (approx 30 seconds in my tests), but I don't understand the code-level logistics of everything. Hint: This is only a problem using single CARP hosts with preemption.

PROBLEM:

With a simple CARP design using a single CARP host on each segment and preemption enabled, failover occurs as expected in the case of any system offline condition (server crashes, admin reboots, etc). If a single interface goes from MASTER to INIT state (cable gets pulled, cable goes bad, card goes bad, etc), the 2nd interface on that system will go into BACKUP mode as expected. Traffic will route across the new MASTER, and will continue to do so while the failed system is in an INIT/BACKUP state.
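
For readers who want to picture the setup, the single-CARP-host design described above might be configured roughly as follows. This is only a sketch and not the poster's actual configuration: the addresses, vhids, and passwords are placeholders, and the file layout assumes the usual OpenBSD /etc/sysctl.conf and /etc/hostname.if conventions.

```
# /etc/sysctl.conf (both hosts)
# With preempt enabled, a failed physical interface raises advskew on all
# CARP interfaces so that they fail over together.
net.inet.carp.preempt=1

# /etc/hostname.carp0 on host1 (external segment, placeholder values)
inet 192.0.2.1 255.255.255.0 192.0.2.255 vhid 1 pass examplepass1

# /etc/hostname.carp1 on host1 (internal segment, placeholder values)
inet 10.0.0.1 255.255.255.0 10.0.0.255 vhid 2 pass examplepass2

# host2 would use the same lines plus a higher skew (for example
# "advskew 100") so that it stays BACKUP while host1 is healthy.
```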

However, if the failed interface returns from INIT to an available mode (we plug the cable in), we notice that the 2nd interface reclaims MASTER almost immediately, but the restored interface does not. It becomes a BACKUP host, which leaves us with a routing impossibility:

   BACKUP      MASTER
    carp0       carp0
      |           |
    host1       host2
      |           |
    carp1       carp1
   MASTER      BACKUP

Any internal clients will attempt to send traffic through the "new gateway" (host1), although neither system has any way of routing the traffic properly (not without some hokey static routes bypassing the CARP hosts). NOTE: I have found that the original MASTER does indeed return to the correct state, approximately 30 seconds later. This is reproducible, but YMMV.
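
To time the recovery window mentioned above on your own hardware, a small shell helper along these lines could log the state of both CARP interfaces once per second. It is a sketch under assumptions: the names carp_state and watch_carp are made up for illustration, and the parsing assumes ifconfig prints a status line beginning with "carp:" followed by the state, which may differ between releases.

```sh
#!/bin/sh
# Extract the CARP state (MASTER, BACKUP or INIT) from ifconfig output on stdin.
# Assumes a status line whose first field is "carp:" and second is the state.
carp_state() {
    awk '$1 == "carp:" { print $2; exit }'
}

# Print a timestamped state line for carp0 and carp1 once per second.
# Stop it with Ctrl-C once both interfaces have settled.
watch_carp() {
    while :; do
        echo "$(date +%T) carp0=$(ifconfig carp0 | carp_state) carp1=$(ifconfig carp1 | carp_state)"
        sleep 1
    done
}
```

Run watch_carp on the original MASTER, pull and replug the cable, and count the seconds during which carp0 and carp1 disagree.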

SOLUTION:

1) If you really are concerned about a partial system failure (unplugged cable, bad card, etc), then scrap the single CARP host per segment design and use arpbalance with multiple CARP hosts. The same partial-failure test using 2 CARP hosts on each segment with arpbalance resulted in a perfect failover and recovery with no packet loss.
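
As a rough illustration of option 1, the following sketch follows the arpbalance pattern as I understand it from carp(4): two CARP hosts share one address on a segment, and each firewall is the preferred MASTER for one of them. The interface numbers, addresses, vhids, and password are placeholders, not values from the tests described here, and only the internal segment is shown.

```
# /etc/sysctl.conf (both hosts)
net.inet.carp.preempt=1
net.inet.carp.arpbalance=1

# host1, internal segment (placeholder values)
# /etc/hostname.carp1: preferred MASTER for vhid 2
inet 10.0.0.1 255.255.255.0 10.0.0.255 vhid 2 pass examplepass2
# /etc/hostname.carp3: preferred BACKUP for vhid 4
inet 10.0.0.1 255.255.255.0 10.0.0.255 vhid 4 advskew 100 pass examplepass2

# host2, internal segment: same address, skews reversed
# /etc/hostname.carp1
inet 10.0.0.1 255.255.255.0 10.0.0.255 vhid 2 advskew 100 pass examplepass2
# /etc/hostname.carp3
inet 10.0.0.1 255.255.255.0 10.0.0.255 vhid 4 pass examplepass2
```

The external segment would get a matching pair of CARP hosts with its own vhids.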

2) This is not tested, but I suspect that you should be able to use the new interface grouping features in 3.8 to simply assign multiple physical interfaces to the same group. Even if one fails, the other *should* maintain the MASTER state and avoid any partial failure consequences. I'd love to hear from other users or developers that have tried the grouping feature in this sort of scenario.


--
Jason Dixon
DixonGroup Consulting
http://www.dixongroup.net