Insider Details
Introduction
Click for technical details on what went wrong. For everyone else, it was "gremlins".
Ramius, the folks whose software we use for ELMAR, in turn use servers from a company called Rackspace. Some weeks ago, Rackspace updated some input/output (I/O) code and some network code in response to a virus threat. It is unclear which change caused the problem: the I/O routines or the network routines.
In either case, after the update we began to experience random thread deadlocks. These did not generate log errors; they simply shut down communication between the Ramius database and the mail servers.
After finally figuring out that their own code was not at fault, Ramius solved the problem by parcelling out each email into an individual job and running all the jobs in parallel. That way, a thread deadlock in any one job does not affect the rest of the email addresses. Since the affected processes were changed in this manner, no thread deadlocks have been seen.
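For the curious, here is a rough sketch of the one-job-per-email idea in Python. This is not Ramius's code, which I have not seen; the names `send_one` and `send_all`, the `stall_on` parameter, and the timeout values are all made up for illustration. A stuck job is simulated with a short sleep, since a genuinely deadlocked thread cannot be killed from Python.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def send_one(address, stall_on=frozenset()):
    """Hypothetical stand-in for handing one message to the mail server."""
    if address in stall_on:
        # Simulate a stuck job. The sleep is bounded so the demo can exit.
        time.sleep(1.0)
    return address


def send_all(addresses, stall_on=frozenset(), timeout=0.2):
    """Run one independent job per address and report the outcome of each."""
    delivered, stalled = [], []
    # One worker per address, so every job starts right away and
    # no job has to queue behind a stuck one.
    with ThreadPoolExecutor(max_workers=max(1, len(addresses))) as pool:
        futures = {addr: pool.submit(send_one, addr, stall_on) for addr in addresses}
        for addr, future in futures.items():
            try:
                future.result(timeout=timeout)
                delivered.append(addr)
            except FutureTimeout:
                # This job did not finish in time; the other jobs are unaffected.
                stalled.append(addr)
    return delivered, stalled
```

Calling `send_all(["a@example.org", "b@example.org"], stall_on={"b@example.org"})` reports the first address as delivered and the second as stalled, which is the point of the design: one stuck job no longer takes the whole mailing down with it.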
(This slightly reminds me of what a systems programmer once said at a meeting of UCLA computing technical staff in the late '70s: "There are no unknown bugs of which we are aware.")
I am pleased they solved the problem and I thank you all for your patience.
[-ch]