Dataverse doing down with connection refused error · troubleshooting

It goes down and then comes back up again. If it happens at a time when I can catch it (rather than in the middle of the night), I'll restart Payara. The error seems to be:

[#|2024-04-08T17:32:50.037+0000|SEVERE|Payara 5.2022.4|org.glassfish.jersey.server.ServerRuntime$Responder|_ThreadI=97;_ThreadName=http-thread-pool::jk-connector(2);_TimeMillis=1712597570037;_LevelValue=1000;|
An I/O error has occurred while writing a response message entity to the container output stream.
org.glassfish.jersey.server.internal.process.MappableException: java.io.IOException: Connection closed
at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:67)

Philip Durbin 🚀 (Apr 08 2024 at 20:23):

jamie jamison (Apr 09 2024 at 00:02):

Philip Durbin 🚀 (Apr 09 2024 at 00:21):

jamie jamison (Apr 09 2024 at 00:27):

Philip Durbin 🚀 (Apr 09 2024 at 00:45):

#!/bin/bash
# Script for use with Nagios to bounce Tomcat Services when they are critical
# $1 is the service state, $2 is the state type (SOFT or HARD)

PATH=/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin

# address that should receive the restart notice
mailTO=""

case "$1" in
   OK)
      ;;
   WARNING)
      ;;
   CRITICAL)
      # only act on a HARD state; quoting "$2" avoids a syntax error when it is empty
      if [ "$2" = "HARD" ]; then
         echo "restarting glassfish for Dataverse"
         # get stack dump first
         glassfish_pid=$(ps -ef | grep java | grep glassfish | grep -v grep | awk '{print $2}')
         if [ -n "${glassfish_pid}" ]; then
            /usr/bin/jstack $glassfish_pid > "/tmp/glassfish-jstack.${glassfish_pid}"
         fi
         /etc/init.d/glassfish stop
         sleep 20
         # check whether the process survived the init script
         glassfish_pid=$(ps -ef | grep java | grep glassfish | grep -v grep | awk '{print $2}')
         if [ -n "${glassfish_pid}" ]; then
            echo "Dataverse glassfish is not responding to init script; trying kill -9"
            kill -9 $glassfish_pid
            sleep 10
         fi
         /etc/init.d/glassfish start
         echo "" | mail -s "bounced glassfish on $(hostname)" $mailTO
      fi
      ;;
   UNKNOWN)
      ;;
esac
exit 0

Philip Durbin 🚀 (Apr 09 2024 at 00:46):

jamie jamison (Apr 09 2024 at 03:12):

Oliver Bertuch (Apr 09 2024 at 04:52):

Please note that with systemd it's easy to write a timer and a oneshot service that monitors something to detect Payara hangs and restarts the payara service when one is found.

Oliver Bertuch (Apr 09 2024 at 04:53):

Basically the same idea as the healthcheck / liveness probe in Kubernetes/Compose.
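
For illustration, the timer plus oneshot idea could look roughly like the sketch below. Everything named here is an assumption rather than something from this thread: the unit names, the /usr/local/bin/payara-healthcheck.sh path, and the 5 minute interval are all hypothetical. The function only writes the two unit files into a directory you pass in, so it can be tried against a scratch directory first.

```shell
#!/bin/bash
# Sketch only: unit names, script path, and intervals are assumptions.

# Write a oneshot service and the timer that triggers it into the given directory.
write_healthcheck_units() {
    local unit_dir="$1"

    cat > "${unit_dir}/payara-healthcheck.service" <<'EOF'
[Unit]
Description=Check Payara and restart it if it does not respond

[Service]
Type=oneshot
# Hypothetical script that probes the site and restarts payara on failure
ExecStart=/usr/local/bin/payara-healthcheck.sh
EOF

    cat > "${unit_dir}/payara-healthcheck.timer" <<'EOF'
[Unit]
Description=Run the Payara health check periodically

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
EOF
}
```

Typical use would be `write_healthcheck_units /etc/systemd/system`, followed by `systemctl daemon-reload` and `systemctl enable --now payara-healthcheck.timer`. A timer activates the service of the same name by default, which is why the two files share the payara-healthcheck prefix.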

Leo Andreev (Apr 12 2024 at 20:21):

Yes, we used to use Nagios a while back. What we are doing now is much simpler: the script below is run every 5 minutes from root's crontab, like this:

# Monitor Payara and bounce if down. To PAUSE, touch /tmp/stop, then remove /tmp/stop when done!
*/5 * * * * /usr/local/bin/check_site.sh

The script just requests a low-cost page and forces a restart if it doesn't get a 200 response. It's super simple; the only remotely non-trivial parts are the hook for pausing it (for when you actually want to restart or redeploy yourself) and the lock file that prevents multiple instances of the script from stacking up if, for whatever reason, a check takes more than 5 minutes.

#!/bin/bash

SITE="http://localhost:8080/loginpage.xhtml"
#SITE="http://localhost:8080/file.xhtml?persistentId=doi:10.7910/DVN/XXXXXX/YYYYYY"
# (we have experimented over the years using more and less expensive pages for
# the purposes of monitoring; using login page, one of the less resource-consuming
# pages, now)


# timeout value is in seconds:
TIMEOUT=180

# This pauses the check. Touch /tmp/stop to disable, remove /tmp/stop to enable
if [ -f /tmp/stop ];then
    exit 0
fi

# Since the curl timeout + whatever overhead may be more than the interval
# at which the cron job runs, let's prevent these jobs from stacking up:

LOCKFILE=/var/lock/check_website.lock

if [ -f $LOCKFILE ]
then
    echo "Site check taking more than 5 minutes!!" | mailx -s "ATTENTION: DELAYED SITE RESPONSE" xxxxx@yyyyy.harvard.edu
    # this implies that the previous instance of this job (run by cron every 5 min.)
    # is still running.

    # the only excuse for a page call to take this long would be if the payara
    # process were in a stop-the-world phase of garbage collection.
    # GC logging (above, and in the Payara log directory) should tell
    # whether that was the case.
    echo -n "bailing out; check already in progress. " >> /var/log/dataverse-bounce.log
    date +%m/%d/%y" "%H:%M >> /var/log/dataverse-bounce.log
    exit 0
fi

touch $LOCKFILE

STAT=`curl -s -o /dev/null -m ${TIMEOUT} -w "%{http_code}" $SITE`

if [ ${STAT}"x" != "200x" ];then

    APP=`cat /xxxxx/payara6/domain1/config/pid.prev`
    /sbin/service payara stop
    sleep 5

    # and, to make sure it's really dead:
    kill -9 $APP 2>/dev/null
    sleep 5
    /sbin/service payara start

    # curl prints 000 when it got no HTTP response at all (e.g. the request timed out)
    if [ ${STAT}"x" == "000x" ]
    then
        STAT="TIMEOUT"
    fi

    # We have two application nodes in our prod. setup, so on each of the hosts
    # the <hostname> entry in the next two lines indicates the local hostname
    # of the specific node that is being restarted:

    echo "Restarted Payara on <hostname>" | mailx -s "Bounced Payara <hostname>" xxxxx@yyyyy.harvard.edu
    echo -n "Restarted Payara on <hostname>. (http code: ${STAT}) " >> /var/log/dataverse-bounce.log
    date +%m/%d/%y" "%H:%M >> /var/log/dataverse-bounce.log

    # With Payara 6, there is some appreciable time before the completion of asadmin start-domain
    # and the application becoming active and beginning to respond to requests!
    sleep 60
fi

/bin/rm $LOCKFILE
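
One thing to be aware of with the touch/rm lock above: if the script is killed between the touch and the rm, the lock file stays behind and every later run bails out until someone removes it by hand. A possible alternative, sketched here under the assumption that flock(1) from util-linux is installed, is to hold a kernel lock on the file instead. The function name run_exclusively and the exit code 75 are made up for this example and are not part of the script above.

```shell
#!/bin/bash
# Sketch only: the function name and the 75 exit code are assumptions.

# Run a command while holding an exclusive lock on the given lock file.
# Returns 75 without running the command if another process holds the lock.
run_exclusively() {
    local lockfile="$1"
    shift
    (
        # -n: fail immediately instead of waiting for the lock
        flock -n 9 || exit 75
        "$@"
    ) 9>"$lockfile"
}
```

The check itself would then be wrapped in a function (hypothetically named `check_site` here) and called as `run_exclusively /var/lock/check_website.lock check_site`. The lock is released when the subshell exits, however it exits, so a crashed run cannot leave a stale lock behind.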

Philip Durbin 🚀 (Apr 12 2024 at 20:23):

Leo Andreev (Apr 12 2024 at 20:26):

There was some weirdness with my initial copy-and-paste, but I have fixed it, I think...

Stream: troubleshooting

Topic: Dataverse doing down with connection refused error

jamie jamison (Apr 08 2024 at 19:53):

Philip Durbin 🚀 (Apr 12 2024 at 20:31):

Philip Durbin 🚀 (Apr 16 2024 at 15:31):

jamie jamison (Apr 17 2024 at 20:30):

Philip Durbin 🚀 (Apr 17 2024 at 20:34):