Goes down and then back up again. If it's at a time when I can catch it (rather than the middle of the night), I'll restart Payara. The error seems to be:
**
[#|2024-04-08T17:32:50.037+0000|SEVERE|Payara 5.2022.4|org.glassfish.jersey.server.ServerRuntime$Responder|_ThreadI=97;_ThreadName=http-thread-pool::jk-connector(2);_TimeMillis=1712597570037;_LevelValue=1000;|
An I/O error has occurred while writing a response message entity to the container output stream.
org.glassfish.jersey.server.internal.process.MappableException: java.io.IOException: Connection closed
at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:67)
**
Hmm, I don't think I've seen that error.
We have a restart script at Harvard Dataverse.
Is that part of the Dataverse package or a separate script?
It's separate. I'm not sure if we've ever published it.
Could I take a look at it?
The version I have is from 2015:
#!/bin/bash
# Script for use with Nagios to bounce Tomcat Services when they are critical
PATH=/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin
mailTO=""
# $1 is the Nagios service state, $2 is the state type (SOFT or HARD)
case $1 in
    OK)
        ;;
    WARNING)
        ;;
    CRITICAL)
        # quote $2 so the test does not break when the argument is missing
        if [ "$2" == "HARD" ]; then
            echo "restarting glassfish for Dataverse"
            # get stack dump first
            glassfish_pid=`ps -ef | grep java | grep glassfish | grep -v grep | awk '{print $2}'`
            if [ "x${glassfish_pid}" != "x" ]; then
                /usr/bin/jstack $glassfish_pid > "/tmp/glassfish-jstack.${glassfish_pid}"
            fi
            /etc/init.d/glassfish stop
            sleep 20
            glassfish_pid=`ps -ef | grep java | grep glassfish | grep -v grep | awk '{print $2}'`
            if [ "x${glassfish_pid}" != "x" ]; then
                echo "Dataverse glassfish is not responding to init script; trying kill -9"
                kill -9 $glassfish_pid
                sleep 10
            fi
            /etc/init.d/glassfish start
            echo "" | mail -s "bounced glassfish on `hostname`" $mailTO
        fi
        ;;
    UNKNOWN)
        ;;
esac
exit 0
Let me ask around tomorrow to see if we have something newer.
thank you
Please note that with systemd it's easy to write a timer and a oneshot service that periodically checks whether Payara is hung and restarts the payara service if it is.
Basically the same idea as the healthcheck / liveness probe in Kubernetes/Compose.
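A rough sketch of what that timer plus oneshot service pair could look like, written as a small generator script so it can be tried out in a scratch directory first. The unit names (`payara-healthcheck.*`), the 5-minute interval, and the `ExecStart` path are assumptions chosen for illustration; the check script itself would be whatever detects a hang and restarts Payara at your site.

```shell
#!/bin/bash
# Sketch only: writes a oneshot service and a matching timer into a directory.
# Unit names, interval, and the ExecStart path are assumptions, not site config.
write_units() {
    local unit_dir="$1"
    mkdir -p "$unit_dir"

    # The oneshot service runs the health check once and exits.
    cat > "$unit_dir/payara-healthcheck.service" <<'EOF'
[Unit]
Description=Check Payara and restart it if unresponsive

[Service]
Type=oneshot
ExecStart=/usr/local/bin/check_site.sh
EOF

    # The timer activates the service of the same name on a fixed interval.
    cat > "$unit_dir/payara-healthcheck.timer" <<'EOF'
[Unit]
Description=Run the Payara health check every 5 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
EOF
}

# Example (not run automatically):
#   write_units ./units
#   sudo cp ./units/payara-healthcheck.* /etc/systemd/system/
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now payara-healthcheck.timer
```

As far as I understand the systemd.timer documentation, a timer does not start a second copy of a unit that is still running when the timer elapses, so overlapping checks should not pile up.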
Philip Durbin said:
Let me ask around tomorrow to see if we have something newer.
Yes, we used to use Nagios a while back. What we are doing now is much simpler: the script below is run every 5 minutes from root's crontab, like this:
# Monitor Payara and bounce if down. To PAUSE, touch /tmp/stop, then remove /tmp/stop when done!
*/5 * * * * /usr/local/bin/check_site.sh
The script just calls a low-cost page and forces a restart if it doesn't get a happy response. It's super simple; the only remotely non-trivial parts are the hook for pausing it (when you actually want to restart or redeploy yourself) and the lock file that prevents multiple instances of the script from stacking up if, for whatever reason, a run takes more than 5 minutes.
Here it is, lightly redacted.
#!/bin/bash
SITE="http://localhost:8080/loginpage.xhtml"
#SITE="http://localhost:8080/file.xhtml?persistentId=doi:10.7910/DVN/XXXXXX/YYYYYY"
# (we have experimented over the years using more and less expensive pages for
# the purposes of monitoring; using login page, one of the less resource-consuming
# pages, now)

# timeout value is in seconds:
TIMEOUT=180

# This pauses the check. Touch /tmp/stop to disable, remove /tmp/stop to enable
if [ -f /tmp/stop ];then
    exit 0
fi

# Since the curl timeout + whatever overhead may be more than the interval
# at which the cron job runs, let's prevent these jobs from stacking up:
LOCKFILE=/var/lock/check_website.lock
if [ -f $LOCKFILE ]
then
    echo "Site check taking more than 5 minutes!!" | mailx -s "ATTENTION: DELAYED SITE RESPONSE" xxxxx@yyyyy.harvard.edu
    # this implies that the previous instance of this job (run by cron every 5 min.)
    # is still running.
    # the only excuse for a page call to take this long would be if the payara
    # process were in a stop-the-world phase of garbage collection.
    # GC logging (above, and in the Payara log directory) should tell
    # whether that was the case.
    echo -n "bailing out; check already in progress. " >> /var/log/dataverse-bounce.log
    date +%m/%d/%y" "%H:%M >> /var/log/dataverse-bounce.log
    exit 0
fi

touch $LOCKFILE

STAT=`curl -s -o /dev/null -m ${TIMEOUT} -w "%{http_code}" $SITE`
if [ ${STAT}"x" != "200x" ];then
    APP=`cat /xxxxx/payara6/domain1/config/pid.prev`
    /sbin/service payara stop
    sleep 5
    # and, to make sure it's really dead:
    kill -9 $APP 2>/dev/null
    sleep 5
    /sbin/service payara start
    if [ ${STAT}"x" == "000x" ]
    then
        STAT="TIMEOUT"
    fi
    # We have two application nodes in our prod. setup, so on each of the hosts
    # the <hostname> entry in the next two lines indicates the local hostname
    # of the specific node that is being restarted:
    echo "Restarted Payara on <hostname>" | mailx -s "Bounced Payara <hostname>" xxxxx@yyyyy.harvard.edu
    echo -n "Restarted Payara on <hostname>. (http code: ${STAT}) " >> /var/log/dataverse-bounce.log
    date +%m/%d/%y" "%H:%M >> /var/log/dataverse-bounce.log
    # With Payara 6, there is some appreciable time before the completion of asadmin start-domain
    # and the application becoming active and beginning to respond to requests!
    sleep 60
fi

/bin/rm $LOCKFILE
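One possible refinement of the lock handling in the script above, offered as a sketch rather than something the original authors use: the `[ -f ]` test followed by `touch` is not atomic, and the lock file stays behind if the script dies before reaching the final `rm`. The variant below creates the lock with bash's `noclobber` option, so checking and creating happen in one step, and removes it from an `EXIT` trap. The function names and the `/tmp` default path are made up for illustration.

```shell
#!/bin/bash
# Sketch only: atomic lock acquisition plus cleanup on exit.
# LOCKFILE default and function names are illustrative assumptions.
LOCKFILE="${LOCKFILE:-/tmp/check_website.lock}"

acquire_lock() {
    # With noclobber set, ">" fails if the file already exists, so the
    # existence check and the creation are a single operation.
    ( set -o noclobber; echo "$$" > "$LOCKFILE" ) 2>/dev/null
}

release_lock() {
    rm -f "$LOCKFILE"
}

if acquire_lock; then
    # Remove the lock when the script exits (a kill -9 would still leave it behind).
    trap release_lock EXIT
    echo "lock acquired; the curl check and restart logic would go here"
else
    echo "bailing out; check already in progress"
fi
```

The lock file holds the PID of the run that created it, which can help when working out whether a leftover lock belongs to a process that is still alive.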
Awesome. Thanks, @Leo Andreev!
Philip Durbin said:
Awesome ...
There was some weirdness with my initial copy-and-paste, but I have fixed it, I think...
Looks nice. Even has syntax coloring.
@jamie jamison just making sure you saw the updated restart script above ^^
Sorry, will look for the updated version.
No worries. You just have to scroll up a bit. :grinning:
Last updated: Oct 30 2025 at 06:21 UTC