Has anyone deployed a Dataverse Swarm stack, and if so, is there any documentation on that? I am currently trying to set up a stack and fighting my way through Zulip threads, docs, and GitHub/DockerHub repositories ;) I thought I should ask here before diving too deep...
Dumb question: does Docker Swarm have Kubernetes under the hood these days?
No, those are two different things. Docker Swarm is more like an extended version of Docker Compose that scales horizontally :)
Oh, I see. Well, I obviously don't know much about Docker Swarm, but I'm happy to put it on the agenda for our next meeting ( https://ct.gdcc.io ), which you are welcome to join, to see if anyone has any ideas.
Alright, thanks! I am currently trying to figure out how the Docker Compose configuration of Dataverse works. The Swarm translation is quite straightforward but the devil is in the details...
Good luck! I did go ahead and add Docker Swarm to the agenda: https://docs.google.com/document/d/1mdziHDJTIZGgI1ks8HFNTkfEfQqfsIRmNh4ufF516pk/edit?usp=sharing
Many thanks :)
Btw. I am a bit confused about the env-variables. There is DATAVERSE_DB_HOST (and user/password/name etc.) and also POSTGRES_SERVER... etc. Why are there two databases?
There is only one database. You're using containers, right? You probably want to use DATAVERSE_DB_HOST like we do here: https://github.com/IQSS/dataverse/blob/v6.3/docker/compose/demo/compose.yml#L13
Where are you seeing POSTGRES_SERVER?
That was in the sample.env :) OK, I'll continue, thanks ;) Yes, containers!
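To make the split concrete, here is a minimal sketch of how those two sets of variables relate in a compose/stack file, loosely modeled on the linked demo compose (image tags and credentials are placeholders; check the demo compose for the authoritative setup). The DATAVERSE_DB_* variables point Dataverse at the database, while the POSTGRES_* variables configure the PostgreSQL container itself:

```yaml
# Sketch only: names modeled on the linked demo compose, values are placeholders.
services:
  dataverse:
    image: gdcc/dataverse:6.3
    environment:
      DATAVERSE_DB_HOST: postgres      # must match the database service name below
      DATAVERSE_DB_USER: dataverse
      DATAVERSE_DB_PASSWORD: secret    # use Docker secrets in a real deployment
  postgres:
    image: postgres:16                 # placeholder tag, use what your Dataverse version requires
    environment:
      POSTGRES_USER: dataverse         # must match DATAVERSE_DB_USER above
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: dataverse
```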
I just looked at the compose file... it uses different services. There is no minio, for example, and the dataverse image is not coronawhy/dataverse:... but gdcc/dataverse.
I am a bit confused now
Now I see it's from two weeks ago, so I guess that's the state of the art?
Wait, are you following https://guides.dataverse.org/en/6.3/container/ or dataverse-docker, which is community-led (please see https://guides.dataverse.org/en/6.3/developers/containers.html )?
The situation is definitely confusing and we have an issue to address it: Document competing containerization efforts and how to choose #10522
@Slava Tykhonov just told me he's moving dataverse-docker ("archive in a box") to the gdcc images, if that helps.
ok thanks, i'll try my luck... definitely a tough piece to get working on docker stack compared to the other services i deployed in the past :D
The gdcc images are maintained by core contributors like me and @Oliver Bertuch
but i see that the number of volumes and file mappings is reduced in the latest compose, which makes things a bit easier
They are somewhat new, the gdcc images. Slava filled a gap with his coronawhy images until the gdcc images were ready.
Yeah, the demo compose is intentionally smaller than the dev compose in the root of the repo.
We have all that stuff like minio in the dev compose for testing.
i see
And sorry, I keep moving your messages around. I can put them back under your original topic if you want: #containers > Docker Swarm deployment of Dataverse
You're still trying to use Docker Swarm? Or are you trying Docker Compose?
No worries, you can move stuff as it fits, I am not so handy with Zulip!
Swarm. I think I now got all the services running on different nodes with shared filesystems
Screenshot-2024-07-03-at-22.57.26.png
looks promising... now trying to get the load balancer with SSL termination to talk to Dataverse :)
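In case it helps anyone reading along, a rough sketch of the Swarm-specific bits (overlay networks shared with the database and the load balancer, a placement constraint, and a shared volume). The node label, NFS address/path, and network names are made-up examples, and the container-side data path should be checked against the container guides:

```yaml
# Sketch only: node label, NFS address/path and network names are invented for illustration.
services:
  dataverse:
    image: gdcc/dataverse:6.3
    networks:
      - backend                        # overlay network shared with postgres/solr
      - proxy                          # overlay network shared with the load balancer
    volumes:
      - dv-data:/dv                    # shared storage so any node can serve uploaded files; verify the path in the app image docs
    deploy:
      placement:
        constraints:
          - node.labels.dataverse == true   # hypothetical label set via `docker node update`

networks:
  backend:
    driver: overlay
  proxy:
    external: true                     # assumed to be created by the load balancer stack

volumes:
  dv-data:
    driver: local
    driver_opts:                       # example: NFS-backed volume; adjust to your shared filesystem
      type: nfs
      o: addr=10.0.0.10,rw
      device: ":/export/dataverse"
```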
Ah, great. Yeah, hopefully having fewer services to deal with helps. :sweat_smile:
It's a bit less confusing at least ;)
I'll put the messages back under the original topic.
Btw. is an Nginx or something similar recommended, or can I directly use the HTTP connection to Dataverse?
Well, Dataverse runs on port 8080.
Evening. Lots of chatter here, great. I would always strongly recommend putting Dataverse behind some kind of reverse proxy
You can configure Payara to serve on 80 and 443 easily enough but adding an SSL cert to Payara is a nightmare.
In production we (Harvard) front with Apache but you could use nginx or whatever.
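Just to illustrate the "proxy in front, Dataverse on 8080 behind it" setup in Swarm terms, the snippet below uses Traefik labels as one hypothetical ingress (your load balancer may work completely differently), with no published ports on the Dataverse service itself and TLS terminated at the proxy:

```yaml
# Sketch only: Traefik is just one possible ingress; hostname and network name are placeholders.
services:
  dataverse:
    image: gdcc/dataverse:6.3
    # no "ports:" section -- only the proxy is reachable from outside the swarm
    networks:
      - proxy
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.dataverse.rule=Host(`data.example.org`)"
        - "traefik.http.routers.dataverse.entrypoints=websecure"            # TLS terminated at the proxy
        - "traefik.http.routers.dataverse.tls=true"
        - "traefik.http.services.dataverse.loadbalancer.server.port=8080"   # Dataverse's HTTP port
```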
The SSL cert is not a concern, I have SSL termination in front of each service
Then you already have a reverse proxy / load balancer in place and are probably good to go
I was just wondering if the built-in webserver is "enough"
alright thanks :)
If you can make that ingress handler take care of routing as well, that's even better
Yep I do
Great. Sounds like you're good to go then.
I'll report back and share the stack configuration too
there are some differences between Swarm and Compose
Keep in mind that if you want to use the IP groups feature, you need to make your LB/RP send headers with the original client addresses and apply some config to make Dataverse accept those
yes, that will probably be tricky. is that documented somewhere?
IIRC it should be on the IP groups page
OK thanks
Hmm seems like it's not on that page. Let me dig around the guides
Ah it's in this section: https://guides.dataverse.org/en/latest/installation/config.html#blocking-api-endpoints
It's this config option: https://guides.dataverse.org/en/latest/installation/config.html#dataverse-useripaddresssourceheader
I'm afraid it's one of those leftover options we haven't yet made settable via MPCONFIG. We're still working on that. And that's no different with the coronawhy images, by the way.
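For the record, one way around that could be to pass it as a raw JVM system property. The sketch below assumes the base image forwards extra JVM options via a JVM_ARGS environment variable (worth verifying against the base image docs); the option name itself comes from the guide linked above, and the proxy of course has to actually send X-Forwarded-For (or whichever header you pick) with the real client address:

```yaml
# Sketch: since the option isn't MPCONFIG-enabled yet, set it as a JVM system property.
# Assumption to verify: the base image passes JVM_ARGS through to the application server.
services:
  dataverse:
    image: gdcc/dataverse:6.3
    environment:
      JVM_ARGS: "-Ddataverse.useripaddresssourceheader=X-Forwarded-For"
```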
If you're interested in supporting us to get Dataverse containerization production ready, feel free to join the working group meetings. Deets on https://ct.gdcc.io. Anything helps, all feedback is appreciated, coordinated PRs are very welcome.
Thanks Oliver, I'll check the docs and let's see how far I get. If I can free up some time, I will also help of course!
Still fighting with the services, they crash after 3 minutes
Which ones?
It would be very weird if Solr or Postgres crashed after 3 minutes
dataverse. Probably some communication problem with the network
[Entrypoint] running /opt/payara/scripts/init_3_wait_dataverse_db_host.sh
Operation timed out
Ah!
i'll check if the ports and internal DNS are working correctly
Yes, that sounds like you need to check the networking part
That script is a poor man's workaround for our compose file not being able to work with healthchecks yet
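In case it's useful: once healthchecks are an option, something along these lines could complement the wait script on the database side. Note that Swarm uses healthchecks to restart or deregister unhealthy tasks but, unlike Compose, does not support depends_on conditions for startup ordering (user and database name below are placeholders):

```yaml
# Sketch only: a healthcheck on the database service; user/db name are placeholders.
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U dataverse -d dataverse"]
      interval: 10s
      timeout: 5s
      retries: 5
```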
whatever works ;)
forgot to attach the backend network to dataverse, that should be it...
(deleted)
Oh OK
only the one which is shared with the load balancer was attached to the dataverse service
ok, i see some success ;) Boot Command create-system-properties returned with result SUCCESS :
Screenshot-2024-07-03-at-23.22.18.png
You need to watch out for something along the lines of "application dataverse deployed at /"
Depending on your resources this is usually done after 60 to 120 secs
Did you add the bootstrapping thing in your swarm setup?
yep
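For anyone adapting this: one Swarm-specific gotcha with a one-shot bootstrap service is the default restart policy (condition: any), which would re-run it forever. A sketch, with the exact image tag and command to be copied from the demo compose rather than from here:

```yaml
# Sketch only: image tag and command are assumptions -- copy the real ones from the demo compose.
services:
  bootstrap:
    image: gdcc/configbaker:alpha
    command: ["bootstrap.sh", "demo"]   # hypothetical invocation, check the demo compose
    networks:
      - backend
    deploy:
      restart_policy:
        condition: none                 # run once; don't let Swarm restart the finished task
```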
looks good :) running
Great. Once you see that one succeeding you're good to go
so you mean it's still configuring?
Yes. The Payara page means the appserver is responding
Deploying a large webapp like Dataverse takes some time
This is not Django :grinning_face_with_smiling_eyes:
i hope so ;)
Usually Java webapps are made to run for a lot longer than they take to start... :melting_face:
i see a bunch of warnings and one error (regarding log4j, which seems to fall back to simple logging)
Screenshot-2024-07-03-at-23.26.33.png
Most of these errors are expected. Good friends, we don't want to miss them
ok, that looks like it's working
Yes!
The real test now is creating a dataset and uploading a file
That's usually a good smoketest
Because it also means you configured your storage successfully
yep :) i'll try
i have not configured the admin port yet though
Don't hate us for the messy configuration stuff. We're actively working on improving that. It's a lot... :innocent:
Even in classic installations the amount of stuff to configure can be overwhelming
The admin port of what? Payara?
Screenshot-2024-07-03-at-23.30.26.png
That looks promising!
yep, i see the data on the shared filesystem
Phew!
i thought the admin port for Payara is needed for further configuration
sorry, i am a noob with dataverse ;) but I have my right hand for that
Nope, you usually don't need to touch that
You can use it, but it's annoying and it won't help you with database options
alright, thanks so far! i'll clean up the swarm configuration in the next few days and then publish it in case someone is interested
and i'll check the IP header forwarding. i guess currently everyone has the same IP address (the one from the load balancer's first-layer Docker instance) ;)
If you want, feel free to create an issue at IQSS/dataverse and open a PR to include this in the guides. We should probably talk about a good place for it, but it would be nice to include it. Give people options
Personally I wouldn't go for Swarm and would opt for K8s instead, but it's a valid option if you're OK with the vendor thing.
at some point I might convert all our IT services to K8s but currently we live with that ;) I like it to be honest, it's easy to maintain and I have tons of ansible playbooks to automate things like redeployment, backups and details like certificate management
Whatever works for you folks!
That's the nice thing about containers. So many options
i thought about going all in with K8s but my hope was that some of the students help out. Swarm is quite easy to get started with if you already know Docker :D
Indeed.. that might however also be a bad thing ;)
the lisp curse... :laughing:
alright, have to go, thanks again
See you around. :moon:
i'll try to join on thursday
It's next week!
ah ok ;D
Tomorrow is a holiday in the US
So no meeting until Thursday next week
ah yeah, reminds me i wanted to watch that movie again