DPM Dual Homing
I've discovered a significant caveat to the dual homing arguments below: passive mode gridftp. In this mode the server opens a listening data port, to which the client connects. The address of this data port is passed as a 6 byte numerical string: 4 bytes for the IPv4 address, 2 for the port. This means that the server is identified by a numerical IP address, not by a hostname. If the client is attempting to trigger a 3rd party gridftp transfer, it may pass this numerical IP address from its unrouted space to a second server in routed space. This second server cannot route to the first server's unrouted IP address, and the gridftp transfer fails.
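To make the encoding concrete, here is a minimal Python sketch that decodes a standard FTP passive mode reply and flags an address that a host in routed space could not reach. The function names and the sample reply are illustrative assumptions, not taken from gridftp itself, and "private" here is whatever Python's ipaddress module classifies as private.

```python
import ipaddress
import re
from typing import Tuple


def parse_pasv_reply(reply: str) -> Tuple[str, int]:
    """Decode the (h1,h2,h3,h4,p1,p2) tuple from an FTP 227 reply."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if match is None:
        raise ValueError(f"not a passive mode reply: {reply!r}")
    values = [int(group) for group in match.groups()]
    if any(value > 255 for value in values):
        raise ValueError(f"field out of byte range in: {reply!r}")
    h1, h2, h3, h4, p1, p2 = values
    ip = f"{h1}.{h2}.{h3}.{h4}"
    port = p1 * 256 + p2  # the port is sent high byte first
    return ip, port


def is_unrouted(ip: str) -> bool:
    """True if the address is in private space, so a remote server cannot reach it."""
    return ipaddress.ip_address(ip).is_private


if __name__ == "__main__":
    # Hypothetical reply from a server answering on its internal interface.
    ip, port = parse_pasv_reply("227 Entering Passive Mode (10,0,0,23,78,52)")
    print(ip, port, "unrouted" if is_unrouted(ip) else "routed")
```

The point is that no hostname appears anywhere in the reply, so nothing on the receiving side can re-resolve it to a routed address.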
The only way I can see around this is:
- Ensure that all internal hosts only ever see the routed IP address of the SE.
- Insert special routes onto the routing tables of the clients, routing to these addresses over the internal unrouted interface of the client.
- Ensure that source address spoof protection on the servers is disabled, so that these servers will respond to their routed IP address, even on their internal interface.
At this point I decided the above was all too hacky for our production system and so I have currently disabled any dual homing at Glasgow - WNs talk via NAT to the DPM nodes.
Report of a Dual Homing Test at Glasgow
At Glasgow we have worker nodes on private IP space, and thus an interest in having our SRM on a dual homed machine to avoid bottlenecks in the NAT gateway. That is, we want the DPM node and pools to appear on the internal worker node subnets as well as on the external routed subnet.
So I've done some tests of DPM deployed in this manner and it seems to work as expected.
The first thing I checked was that the DPM daemons bound to all IP addresses, using netstat -tlp:
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address    Foreign Address    State    PID/Program name
tcp        0      0 *:rfio           *:*                LISTEN   3398/rfiod
tcp        0      0 *:5010           *:*                LISTEN   3348/dpnsdaemon
tcp        0      0 *:5015           *:*                LISTEN   3484/dpm
tcp        0      0 *:2811           *:*                LISTEN   4374/ftpd
tcp        0      0 *:8443           *:*                LISTEN   3593/srmv1
tcp        0      0 *:8444           *:*                LISTEN   3645/srmv2
As they do bind to * no problems were anticipated. It should just work, but I wanted to check...
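The same check can be scripted. The following is a rough Python sketch, assuming the netstat -tlp column layout shown above (local address in the fourth column); the function name is hypothetical. It returns any listening socket that is bound to a specific address rather than the wildcard.

```python
def non_wildcard_listeners(netstat_output: str) -> list:
    """Return local addresses of TCP listeners not bound to a wildcard address."""
    wildcards = ("*", "0.0.0.0", "::", "[::]")
    offenders = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a listening TCP socket.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if "LISTEN" not in fields:
            continue
        local = fields[3]
        host = local.rsplit(":", 1)[0]  # strip the trailing port or service name
        if host not in wildcards:
            offenders.append(local)
    return offenders


if __name__ == "__main__":
    sample = """\
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 *:rfio *:* LISTEN 3398/rfiod
tcp 0 0 *:5015 *:* LISTEN 3484/dpm
"""
    print(non_wildcard_listeners(sample))
```

An empty list means every daemon is reachable on both interfaces, which is the situation reported above.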
I gave an internal IP address to the second interface on our DPM host. By hacking /etc/hosts I was able to ensure that the DPM host and test clients talked to one another through the internal subnets (checked with traceroute).
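As a complement to traceroute, a client can confirm that the DPM hostname now resolves to an internal address. This is only a sketch: the hostname and subnet in the usage lines are made-up placeholders, and it assumes the system resolver consults /etc/hosts, which is the usual default but depends on local configuration.

```python
import ipaddress
import socket


def resolves_internally(hostname, internal_subnet, resolver=socket.gethostbyname):
    """True if hostname resolves to an IPv4 address inside internal_subnet."""
    address = ipaddress.ip_address(resolver(hostname))
    return address in ipaddress.ip_network(internal_subnet)


if __name__ == "__main__":
    # Placeholder values; substitute the real DPM hostname and worker node subnet.
    fake_hosts = {"dpm.example.org": "10.0.0.23"}
    print(resolves_internally("dpm.example.org", "10.0.0.0/8", resolver=fake_hosts.__getitem__))
```

The resolver argument exists only so the check can be exercised without touching DNS; leaving it at its default uses the system's own lookup.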
I then performed DPM Testing from internal clients:
- globus-url-copy was tested from a worker node (LCG 2.4.0)
- globus-url-copy, srmcp and DPNS functionality were tested from a UI (LCG 2.6.0).
All these tests succeeded, both for file placement and recovery and name server functionality.
The only caveat is that this test was performed against a single DPM host running all services, including the disk pool. However, a DPM pool node uses the same services (dpm-gridftp, dpm-rfio) as the central host, so I don't anticipate any problems with dual homed pools. (The Transfer URL is returned using a hostname, not an IP address, which is obviously important.)
DPM Dual Homing experience at Lancaster
Nothing particularly tricky was required for DPM to listen on dual NICs; in fact, nothing was required on DPM whatsoever. The changes were made on the WNs, by routing traffic destined for the dpm-pool public NIC to its private NIC. On SL4, like this:
# WN /etc/sysconfig/network-scripts/route-eth0
22.214.171.124 via 10.0.0.23
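With several pool nodes, one such line is needed per pool. A small Python sketch of generating them follows; the function name is hypothetical, the address pair is simply the one from the example above, and the output uses the same "ADDRESS via GATEWAY" form.

```python
import ipaddress


def route_lines(pool_addresses: dict) -> list:
    """Build 'PUBLIC via PRIVATE' route lines from a public-to-private address map."""
    lines = []
    for public_ip, private_ip in pool_addresses.items():
        # Fail early on typos rather than writing a broken route file.
        ipaddress.ip_address(public_ip)
        ipaddress.ip_address(private_ip)
        lines.append(f"{public_ip} via {private_ip}")
    return lines


if __name__ == "__main__":
    for line in route_lines({"22.214.171.124": "10.0.0.23"}):
        print(line)
```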
Our net in world's worst ascii art.
WN -- switch -- NAT -- router -- internet
        |                |
        |--- pool ------|
        |                |
        |--- pool ------|
        |                |
         --- dpm head --