BigIP: Fixing MQ Read and Write 104 X('68') Errors
F5 published a deployment guide, "Deploying the BigIP LTM with IBM WebSphere MQ", a very good how-to for load-balancing MQ behind a BigIP that covers all the steps necessary to get up and running quickly.
I followed all of these steps myself and my developers were off and running. The problems started a little later, however, when the application was unable to read messages off the queue. Errors like the following were showing up in the MQ "client" logs:
AMQ9208: Error on receive from host 10.X.Y.Z (10.X.Y.Z). EXPLANATION: An error occurred receiving data from 10.X.Y.Z (10.X.Y.Z) over TCP/IP. This may be due to a communications failure. ACTION: The return code from the TCP/IP read() call was 104 (X'68'). Record these values and tell the systems administrator.
In addition, we would see the following errors in the MQ "server" logs:
AMQ9206: Error sending data to host 10.a.b.c (10.a.b.c)(port#). EXPLANATION: An error occurred sending data over TCP/IP to 10.a.b.c (10.a.b.c)(port#). This may be due to a communications failure. ACTION: The return code from the TCP/IP(write) call was 104 X('68'). Record these values and tell your systems administrator.
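Both messages report the same return code, 104, which is X'68' in hexadecimal. As a minimal sketch, and assuming the MQ nodes run Linux (where errno 104 is ECONNRESET, "Connection reset by peer"), the code can be decoded with Python's standard library. The helper name describe_return_code is hypothetical, and errno numbering differs on other platforms, so the symbolic name is only meaningful when this runs on the same OS family as the MQ node.

```python
import errno
import os


def describe_return_code(code: int) -> str:
    """Render a TCP/IP return code the way the MQ logs do, plus its errno name.

    The errno lookup is platform-dependent: on Linux 104 is ECONNRESET,
    but other operating systems number their errors differently.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} (X'{code:02X}') = {name}: {os.strerror(code)}"


# Decode the value reported by AMQ9208 / AMQ9206.
print(describe_return_code(104))
```

A reset on both read() and write() is consistent with something between the two MQ nodes tearing down the connection rather than either node closing it cleanly.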
If the BigIP VIP was bypassed, the errors on both ends of the connection went away. No firewalls separated one MQ node from another. Other than citing a "network failure", IBM support wasn't much help here. Initial network traces showed that I had not followed the deployment guide's recommendations verbatim: the MQ nodes were sending keep-alive requests every 300 seconds, which matched the 300-second TCP idle timeout on the BigIP. It was easy to assume, then, that the BigIP was resetting the idle TCP connection at around the same time the MQ heartbeat was being sent. However, reducing the MQ heartbeat interval to below the BigIP's TCP idle timeout did not resolve the issue.
The resolution came with help from this F5 Solution Article: "SOL8049: Implementing TCP Keep-Alives for server-client communication using TCP profiles". Add a Keep Alive Interval to the TCP profile (or create a new TCP profile) and set it to half of your TCP idle timeout (or an even smaller value), then restart the MQ channels, and these errors should go away.
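The "half of the idle timeout" rule is simple arithmetic, sketched below under a few assumptions. The function names and the profile name mq_tcp_keepalive are hypothetical. The generated line assumes the tmsh attributes idle-timeout and keep-alive-interval on an ltm profile tcp object; check the exact syntax against your BigIP version before running it, or set the same values through the GUI.

```python
def keepalive_interval(idle_timeout_s: int, fraction: float = 0.5) -> int:
    """Return a keep-alive interval no larger than half the idle timeout."""
    if idle_timeout_s < 2:
        raise ValueError("idle timeout must be at least 2 seconds")
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    # Round down so the interval never exceeds the requested fraction.
    return max(1, int(idle_timeout_s * fraction))


def tmsh_create_profile(name: str, idle_timeout_s: int) -> str:
    """Build a tmsh command for a new TCP profile (syntax is an assumption)."""
    interval = keepalive_interval(idle_timeout_s)
    return (
        f"create ltm profile tcp {name} defaults-from tcp "
        f"idle-timeout {idle_timeout_s} keep-alive-interval {interval}"
    )


# The 300-second idle timeout from the traces gives a 150-second interval.
print(keepalive_interval(300))
print(tmsh_create_profile("mq_tcp_keepalive", 300))
```

The new profile still has to be assigned to the MQ virtual server, and the MQ channels restarted, before the change takes effect.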