Stream: troubleshooting

Topic: Dataverse going down with connection refused error


jamie jamison (Apr 08 2024 at 19:53):

It goes down and then comes back up again. If it happens at a time when I can catch it (rather than in the middle of the night), I'll restart Payara. The error seems to be:

[#|2024-04-08T17:32:50.037+0000|SEVERE|Payara 5.2022.4|org.glassfish.jersey.server.ServerRuntime$Responder|_ThreadI=97;_ThreadName=http-thread-pool::jk-connector(2);_TimeMillis=1712597570037;_LevelValue=1000;|
  An I/O error has occurred while writing a response message entity to the container output stream.
org.glassfish.jersey.server.internal.process.MappableException: java.io.IOException: Connection closed
        at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:67)

Philip Durbin 🚀 (Apr 08 2024 at 20:23):

Hmm, I don't think I've seen that error.

Philip Durbin 🚀 (Apr 08 2024 at 20:23):

We have a restart script at Harvard Dataverse.

jamie jamison (Apr 09 2024 at 00:02):

Is that part of the Dataverse package or a separate script?

Philip Durbin 🚀 (Apr 09 2024 at 00:21):

It's separate. I'm not sure if we've ever published it.

jamie jamison (Apr 09 2024 at 00:27):

Could I take a look at it?

Philip Durbin 🚀 (Apr 09 2024 at 00:45):

The version I have is from 2015:

#!/bin/bash
# Script for use with Nagios to bounce Tomcat Services when they are critical

PATH=/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin

mailTO=""

case $1 in
   OK)
      ;;
   WARNING)
      ;;
   CRITICAL)
      if [ "$2" == "HARD" ]; then
         echo "restarting glassfish for Dataverse"
         # get stack dump first
         glassfish_pid=`ps -ef | grep java | grep glassfish | grep -v grep | awk '{print $2}'`
         if [ "x${glassfish_pid}" != "x" ]; then
            /usr/bin/jstack $glassfish_pid > "/tmp/glassfish-jstack.${glassfish_pid}"
         fi
         /etc/init.d/glassfish stop
         sleep 20
         glassfish_pid=`ps -ef | grep java | grep glassfish | grep -v grep | awk '{print $2}'`
         if [ "x${glassfish_pid}" != "x" ]; then
            echo "Dataverse glassfish is not responding to init script; trying kill -9"
            kill -9 $glassfish_pid
            sleep 10
         fi
         /etc/init.d/glassfish start
         echo "" | mail -s "bounced glassfish on `hostname`" $mailTO
      fi
      ;;
   UNKNOWN)
      ;;
esac
exit 0

Philip Durbin 🚀 (Apr 09 2024 at 00:46):

Let me ask around tomorrow to see if we have something newer.

jamie jamison (Apr 09 2024 at 03:12):

thank you

Oliver Bertuch (Apr 09 2024 at 04:52):

Please note that with systemd it's easy to write a timer and a oneshot service. The oneshot would monitor something to detect Payara hangs and restart the payara service when it finds one.

Oliver Bertuch (Apr 09 2024 at 04:53):

Basically the same idea as the healthcheck / liveness probe in Kubernetes/Compose.
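
The timer-plus-oneshot idea can be sketched as a small check script that the oneshot service would run. This is a minimal sketch, not a tested deployment: the URL is the login page used by the cron script later in this thread, and the unit name `payara.service`, the function names, and the `HEALTH_URL`/`HEALTH_TIMEOUT` variables are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical health check intended to be run by a systemd oneshot service.
# HEALTH_URL, HEALTH_TIMEOUT and the unit name "payara.service" are assumptions.

HEALTH_URL="${HEALTH_URL:-http://localhost:8080/loginpage.xhtml}"
HEALTH_TIMEOUT="${HEALTH_TIMEOUT:-180}"   # seconds

# Print the HTTP status code returned for the given URL.
# curl prints 000 when it gets no HTTP response at all (e.g. on timeout).
probe() {
    curl -s -o /dev/null -m "$HEALTH_TIMEOUT" -w '%{http_code}' "$1"
}

# Succeed (exit status 0) only when the status code is exactly 200.
is_healthy() {
    [ "$1" = "200" ]
}

# Probe the site and restart the service if the response is not healthy.
check_and_restart() {
    local status
    status=$(probe "$HEALTH_URL")
    if ! is_healthy "$status"; then
        echo "unhealthy response (http code: ${status}); restarting payara"
        systemctl restart payara.service
    fi
}
```

A service unit with `Type=oneshot` would run a script that ends with a call to `check_and_restart`, and a matching `.timer` unit (for example with `OnUnitActiveSec=5min`) would trigger it periodically.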

Leo Andreev (Apr 12 2024 at 20:21):

Philip Durbin said:

Let me ask around tomorrow to see if we have something newer.

Yes, we used to use Nagios a while back. What we are doing now is much simpler: the script below is run every 5 minutes from root's crontab, like this:

# Monitor Payara and bounce if down. To PAUSE, touch /tmp/stop, then remove /tmp/stop when done!
*/5 * * * * /usr/local/bin/check_site.sh

The script just calls a low-cost page and forces a restart if it doesn't get a happy response. It's super simple; the only remotely non-trivial parts are the hook for pausing it (when you actually want to restart or redeploy) and the lock that prevents multiple instances of the script from stacking up if, for whatever reason, a run takes more than 5 minutes.

Here it is, lightly redacted.

#!/bin/bash

SITE="http://localhost:8080/loginpage.xhtml"
#SITE="http://localhost:8080/file.xhtml?persistentId=doi:10.7910/DVN/XXXXXX/YYYYYY"
# (we have experimented over the years using more and less expensive pages for
# the purposes of monitoring; using login page, one of the less resource-consuming
# pages, now)


# timeout value is in seconds:
TIMEOUT=180

# This pauses the check. Touch /tmp/stop to disable, remove /tmp/stop to enable
if [ -f /tmp/stop ];then
    exit 0
fi

# Since the curl timeout + whatever overhead may be more than the interval
# at which the cron job runs, let's prevent these jobs from stacking up:

LOCKFILE=/var/lock/check_website.lock

if [ -f $LOCKFILE ]
then
    echo "Site check taking more than 5 minutes!!" | mailx -s "ATTENTION: DELAYED SITE RESPONSE" xxxxx@yyyyy.harvard.edu
    # this implies that the previous instance of this job (run by cron every 5 min.)
    # is still running.

    # the only excuse for a page call to take this long would be if the payara
    # process were in a stop-the-world phase of garbage collection.
    # GC logging (above, and in the Payara log directory) should tell
    # whether that was the case.
    echo -n "bailing out; check already in progress. " >> /var/log/dataverse-bounce.log
    date +%m/%d/%y" "%H:%M >> /var/log/dataverse-bounce.log
    exit 0
fi

touch $LOCKFILE

STAT=`curl -s -o /dev/null -m ${TIMEOUT} -w "%{http_code}" "$SITE"`

if [ ${STAT}"x" != "200x" ];then

    APP=`cat /xxxxx/payara6/domain1/config/pid.prev`
    /sbin/service payara stop
    sleep 5

    # and, to make sure it's really dead:
    kill -9 $APP 2>/dev/null
    sleep 5
    /sbin/service payara start

    if [ ${STAT}"x" == "000x" ]
    then
        STAT="TIMEOUT"
    fi

    # We have two application nodes in our prod. setup, so on each of the hosts
    # the <hostname> entry in the next two lines indicates the local hostname
    # of the specific node that is being restarted:

    echo "Restarted Payara on <hostname>" | mailx -s "Bounced Payara <hostname>" xxxxx@yyyyy.harvard.edu
    echo -n "Restarted Payara on <hostname>. (http code: ${STAT}) " >> /var/log/dataverse-bounce.log
    date +%m/%d/%y" "%H:%M >> /var/log/dataverse-bounce.log

    # With Payara 6, there is some appreciable time before the completion of asadmin start-domain
    # and the application becoming active and beginning to respond to requests!
    sleep 60
fi

/bin/rm $LOCKFILE
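
The guard against stacked runs can also be sketched with `mkdir`, which tests for and creates the lock in a single atomic step. This is only an illustrative variant, not the script Harvard Dataverse runs: the `LOCKDIR` path, the function names, and the exit status 2 for "already locked" are all assumptions.

```shell
#!/bin/bash
# Sketch of an atomic lock using a directory. LOCKDIR is a hypothetical path.

LOCKDIR="${LOCKDIR:-/tmp/check_website.lock.d}"

# mkdir fails if the directory already exists, so only one caller can win.
acquire_lock() {
    mkdir "$LOCKDIR" 2>/dev/null
}

# Remove the lock directory; ignore the error if it is already gone.
release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}

# Run the given command while holding the lock.
# Returns 2 if another run holds the lock, otherwise the command's own status.
run_locked() {
    local rc
    if ! acquire_lock; then
        echo "bailing out; check already in progress."
        return 2
    fi
    "$@"
    rc=$?
    # Release the lock whether the command succeeded or failed.
    release_lock
    return $rc
}
```

With this shape, the body of the check would live in a function (say, a hypothetical `check_site`) and the cron job would call `run_locked check_site`.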

Philip Durbin 🚀 (Apr 12 2024 at 20:23):

Awesome. Thanks, @Leo Andreev! :dataverse_man:

Leo Andreev (Apr 12 2024 at 20:26):

Philip Durbin said:

Awesome ...

There was some weirdness with my initial copy-and-paste, but I have fixed it, I think...

Philip Durbin 🚀 (Apr 12 2024 at 20:31):

Looks nice. Even has syntax coloring.

Philip Durbin 🚀 (Apr 16 2024 at 15:31):

@jamie jamison just making sure you saw the updated restart script above ^^

jamie jamison (Apr 17 2024 at 20:30):

Sorry, will look for the updated version.

Philip Durbin 🚀 (Apr 17 2024 at 20:34):

No worries. You just have to scroll up a bit. :grinning:


Last updated: Oct 30 2025 at 06:21 UTC