Hi! Sorry for opening this up again, but I am again not sure where to put my question :) We have finally finished our evaluation and now want to deploy Dataverse in production, ideally using containers. Is there still a big "don't use this in production" flag on it all? If yes, perhaps we can contribute to creating a setup for production.
@Jutta Schnabel hi! Great question. In our documentation, we are still being cautious about recommending containers for production but there are brave souls already using them!
Security is top of mind for me. When working on the tutorial for using containers as a demo, I made sure to show how to run the setup-all.sh script WITHOUT the --insecure flag.
Please see https://guides.dataverse.org/en/6.2/container/running/demo.html#creating-and-running-a-demo-persona for more on this.
Especially this:
"One of the main differences between the “dev” persona and our new “demo” persona is that we are now running the setup-all script without the --insecure flag. This makes our installation more secure, though it does block “admin” APIs that are useful for configuration."
You could go to the equivalent of https://demo.dataverse.org/api/admin/index/solr/schema on your server and make sure you can't reach it from outside. If you can, please see https://guides.dataverse.org/en/6.2/installation/config.html#blocking-api-endpoints
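For anyone who wants to script the check described above, here is a minimal sketch. It assumes that any 2xx response means the admin endpoint answered and is therefore exposed. The exact status a blocked endpoint returns depends on your blocking policy and proxy, so that rule is an assumption, and the hostname in the comment is a placeholder.

```python
import urllib.error
import urllib.request


def admin_api_reachable(status_code):
    """Return True if the HTTP status suggests the admin API answered the request.

    Assumption: any 2xx status counts as "reachable"; everything else as blocked.
    """
    return 200 <= status_code < 300


def probe(url, timeout=10):
    """Fetch url and return the HTTP status code, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is what we want.
        return err.code
    # Connection failures (urllib.error.URLError) are not caught here and
    # propagate to the caller; they also mean the endpoint was not reached.


# Example, run from a machine OUTSIDE your network (placeholder hostname):
# status = probe("https://dataverse.example.edu/api/admin/index/solr/schema")
# print("exposed" if admin_api_reachable(status) else "blocked")
```

Run it from outside your network, since the point is to confirm that the endpoint cannot be reached from there.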
What do others think? What am I forgetting? What do we need for the images to be production-ready? More docs? More configurability?
Stable image tags like 6.2/6.3 are needed for me. Besides that, I'm happy with the current containers and they work great in production :)
That makes sense. Currently our tags are "alpha" and "unstable".
We've definitely talked about tags in recent meetings (#containers > weekly meeting ). Probably the best issue to watch is #10478. Plus we have the following topics going:
Great, thanks, that sounds hopeful :) We will try that out and see that we manage.
Sounds good. Please keep the feedback coming!
Besides that I assume the different projects and their means to create images https://github.com/gdcc/dataverse-kubernetes , https://github.com/EOSC-synergy/dataverse-kubernetes , https://github.com/IQSS/dataverse-docker and https://github.com/IQSS/dataverse/blob/develop/docker-compose-dev.yml are a bit confusing for someone new to the community. Hence, my wish-list contains a clean-up, or more detailed documentation or clarification about the projects and their relationship (they all look like official GDCC/IQSS projects)...
@Johannes D good idea. I just created this issue: Document competing containerization efforts and how to choose #10522
Not entirely a container issue, but it would be nice to rework the documentation so that it is independent of the installation method. In particular, the admin and installation guides are in some cases quite specific to a particular installation method. This makes both demo/eval and production use cases more complex than they need to be.
Heads up I will be out all of May (as of next Monday) and will continue working on this in June
Johannes D said:
are a bit confusing for someone new to the community
I agree. I've been meaning to sunset gdcc/dataverse-kubernetes for some time now, but didn't have the time. A PR is welcome. (So don't delete but leave a hint in the README and archive it as read-only)
And documentation of the features that do not work out of the box in a Docker environment would be nice. Like RServe, make data count, or multiple dataverse instances. ... They work, but need a bit of tweaking....
Heads up that #10672 is ready for review! It will increase production usability :smiley:
@Philip Durbin are these container-only things going through the same process with sprint planning etc., or can/should we fast-track it?
My rule of thumb is to look at the files changed and to put it on the fast track if the changes only affect containers. This one can be fast-tracked, I'd say. Please feel free to put it in "ready for review" if you like.
Done!
It's good to have this in place - we can expect Temurin images based on Ubuntu 24.04 to land within the next 4 to 5 weeks. Always good to be prepared!
yeah
I don't think anyone here can easily do this:
Suggestions on how to test this:
Run the images on a K8s cluster
Ha! I'll edit it to say just run them in Docker :smiling:
Do you think you could help get https://github.com/gdcc/api-test-runner/blob/main/.github/workflows/manual.yml working again to test it?
Not sure - we're not talking Dataverse code here
The test is done once you successfully deploy - this is infrastructure, so the application doesn't matter
You could even try the base image with some other demo app
Oh, even if it were working the "manual" workflow wouldn't test it?
I dunno if we should at some point include a minimal testing app for automation of base image acceptance tests
Well it tests that it builds (that is done in CI already), the one thing left to do is run an actual application...
That doesn't need to be Dataverse, which is huge and clunky
Sure, sounds very useful.
Do you feel like it would be a good addition to this PR to have such small-scale tests around?
Hmm, maybe? I mean, we want more testing of images before we publish them, generally.
True!
Do you feel we need to do this now or should we keep that for another issue/PR?
Meh, I don't think we need it now.
Some day I'd like to retire that api-test-runner repo and have the testing done upstream.
It might be a nice thing to try out https://github.com/arquillian/arquillian-testcontainers for this :smile_cat:
Related: #containers > failing tests in 6.3 from api-test-runner
@Oliver Bertuch I removed you as an assignee from #10672. Items in "ready for review" shouldn't have assignees so it's clear that anyone can pick them up and review them.
I just made this pull request:
update docs to suggest using Docker in production #11862
You can preview the docs here: https://dataverse-guide--11862.org.readthedocs.build/en/11862/installation/prep.html#choose-your-own-installation-adventure
What do you think?
Since we have been using those containers in production for the last couple of years, those images are ready for production usage... and the guide reflects it. However, one could add the information that Make Data Count does not work that well, and a section about backup & restore in a Docker environment is missing.
Hmm. I don't feel very qualified to write about backup and restore. What if we allow #11862 to be reviewed and merged and make a PR in the future for that?
I'm also not sure what I would write about Make Data Count. @Johannes D if you want to make a PR into my PR, please go ahead! :smile:
Maybe I should try to address this as well:
Document competing containerization efforts and how to choose #10522
I do feel pretty qualified to write about that. :smile:
I think this would be a good addition for a production scenario. Not just in containers... :wink: #11948 CC @Leo Andreev
yeah
@Oliver Bertuch any feedback on the docs I wrote?
#11862 has been merged! Docker in production! Thanks for reviewing and merging, @Steven Winship!
Can I ask a dumb question about k8s? @Oliver Bertuch I get the "not quite production-ready" part. But you have been using it in your production for a while. The way our production works here, running it and keeping it alive relies on having people with ssh access going in and making various tweaks. I'm not just talking about tasks that must be done via localhost-restricted APIs. We have an ecosystem of scripts outside of the application proper that generate reports, validate metadata, etc. It's a very common scenario where I need to add another regex to the anti-spam script, for example. If I see an aggressive bot abusing the APIs, I can quickly block the IP by adding an Apache rule. What is a proper equivalent of such real-time work under k8s? It is of course possible to ssh into a pod and mess with things the same way... but that would not be ideal, obviously.
You can work quite similarly to that in Kubernetes. There is no problem running a command inside a running container (no SSH, but kubectl exec).
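As a rough illustration of the kubectl exec route, here is a small sketch that only builds the command line. The pod, namespace and container names are made up, so substitute your own.

```python
import shlex


def kubectl_exec(pod, command, namespace="default", container=None):
    """Build the argument list for running `command` inside a running pod."""
    args = ["kubectl", "exec", "--namespace", namespace, pod]
    if container:
        # Needed when the pod has more than one container (e.g. sidecars).
        args += ["--container", container]
    # Everything after "--" is handed to the container, not parsed by kubectl.
    return args + ["--"] + list(command)


# Hypothetical names, for illustration only.
cmd = kubectl_exec("dataverse-0", ["ls", "/tmp"], namespace="dv", container="dataverse")
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, assuming kubectl is installed and pointed at the right cluster.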
The caveat is what's actually available to you inside a container
Container images should be as small as possible and as stripped down as possible
This reduces attack vectors, reduces data transfer amounts and makes things stricter, tidier
For example, all the bits like setup scripts and other things we need for configuration are not a part of the Dataverse image, but are in configbaker
This way we don't need to pollute the container filesystem of the important application with unnecessary stuff like Python etc.
It's important to keep the Kubernetes paradigm in mind: containers are not the atomic operation unit, pods are!
And a pod is more like an atom in real life: it's made up of more things :wink:
A pod can consist of three things: containers (protons), sidecar containers (electrons) and init containers (neutrons)
Usually you'll have one main container, for example for your application
This is accompanied by sidecars, that run alongside the main container
These sidecars are the helpers for logging, exposing things in a controlled fashion, running reports, etc.
All of it usually aims to follow UNIX philosophy of "do one thing and do it well". It's cheap to have more sidecars and containers, so use 'em
These containers bundled in a pod share a "localhost network". So a sidecar can reach something in the main container via localhost:port
This is quite powerful, as you sometimes might want to expose something in the app server (e.g. the JMX port of Payara) to localhost only, but then reach that from somewhere else. A sidecar can be a safe bridge into that, potentially handling TLS, auth, etc.
These containers can also share any number of mounted filesystems. This way the main container can write something like a report and a sidecar can pick that up for processing
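To make the shared network and shared volume idea concrete, here is a sketch that generates a pod manifest with a main container and a sidecar. It uses the classic pattern of listing both under `containers` with a common `emptyDir` volume. The image names, container names and mount path are hypothetical.

```python
import json


def pod_with_sidecar(name, app_image, sidecar_image):
    """Return a Pod manifest where app and sidecar share one scratch volume."""
    # Both containers mount the same volume at the same (hypothetical) path.
    shared_mount = {"name": "reports", "mountPath": "/reports"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # emptyDir lives as long as the pod and is visible to every
            # container that mounts it.
            "volumes": [{"name": "reports", "emptyDir": {}}],
            # Containers in one pod share a network namespace, so the sidecar
            # can reach the app via localhost:<port>.
            "containers": [
                {"name": "app", "image": app_image,
                 "volumeMounts": [shared_mount]},
                {"name": "report-processor", "image": sidecar_image,
                 "volumeMounts": [shared_mount]},
            ],
        },
    }


manifest = pod_with_sidecar("dataverse", "example/app:tag",
                            "example/report-processor:tag")
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed output could be saved to a file and applied with `kubectl apply -f`. In practice you would more likely put this pod template inside a Deployment or StatefulSet.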
It's also possible to add a container to a pod at runtime. This is a neat trick often used for debugging or other special purposes.
As the injected container shares the network and can mount the volumes in the pod, you can do something on the fly like running a special script you prepared and packaged in a container image.
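One way to inject such a container is `kubectl debug`, which adds an ephemeral container to a running pod. The sketch below only builds the command. The pod, image and target names are placeholders, and it covers just the shared-network part; mounting the pod's volumes into the injected container needs extra configuration that is not shown here.

```python
import shlex


def kubectl_debug(pod, image, target=None, namespace="default"):
    """Build a `kubectl debug` command that attaches an ephemeral container."""
    args = ["kubectl", "debug", "--namespace", namespace, "-it", pod,
            f"--image={image}"]
    if target:
        # --target additionally shares the process namespace of that container.
        args.append(f"--target={target}")
    return args


# Hypothetical pod, image and container names.
debug_cmd = kubectl_debug("dataverse-0", "example/my-script:tag",
                          target="dataverse", namespace="dv")
print(shlex.join(debug_cmd))
```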
What you usually should avoid: create a volume, store scripts there and execute as needed by entering the container. This is possible, but it violates the K8s principle of "no pets". You introduce unnecessary state, which is usually an anti-pattern.
If you need to experiment with stuff and need to develop a script, you're usually better off creating port forwards from your development machine to the running pods/services.
That way you can code on your machine, run the script etc all without having to deal with "how do I get this into the pod"
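A port forward like that could be set up along these lines. This sketch again only builds the command; the service name and ports are assumptions.

```python
import shlex


def kubectl_port_forward(resource, local_port, remote_port, namespace="default"):
    """Build a `kubectl port-forward` command for a pod or service."""
    return ["kubectl", "port-forward", "--namespace", namespace,
            resource, f"{local_port}:{remote_port}"]


# Hypothetical service name and ports.
forward_cmd = kubectl_port_forward("service/dataverse", 8080, 8080, namespace="dv")
print(shlex.join(forward_cmd))
```

While the forward is running, a script on your machine can talk to `localhost:8080` as if it were sitting next to the service.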
This obviously has its limits when it comes to things like capturing network traffic or other low-level stuff
For the networking part you mentioned changing rules etc
In modern Kubernetes deployments you usually will have some kind of middleware as your TLS handling and routing gateway. Formerly you'd use the "K8s Ingress API" that is being handled by an Ingress Controller. The current, newer approach that evolved from that is the Gateway API.
Any restrictions like blocking external traffic, or adding things like Anubis or Web Application Firewalls, are handled at these levels
It's of course possible to run another proxy like Apache or Nginx between ingress and the actual application, but I think this is less common these days.
To get the configuration into this middleware, you will need to handle either ConfigMaps or Custom Resources defined by and depending on the middleware.
You can either manage these yourself with kubectl
Or follow the modern approach, using GitOps
When starting new, I'd always recommend using GitOps. Infrastructure as Code has been a well-known good practice, if not best practice, for a long time now.
But again, that's just my opinion
You are free to use other tooling, do it manually, etc
Depends on your needs, and what you and the team feel comfortable with
I think it's fair to say that working with Kubernetes combines classic shell-based administration with a lot of automation and abstraction. This requires more and new skills, but it's also more reliable than your average hacky pet. Obviously Harvard Dataverse is not a hacky pet, you all know what you're doing. But it requires a lot of context and experience to know all the details, while adopting IaC makes things more reproducible and rebuildable.
Admins adopting K8s will probably need to adapt their way of thinking about how to run a server. But that's not a bad thing - it keeps your grey matter lively :wink:
Thank you, really appreciate the info!
I may bug you with followup questions later on.
:pray:
Today I learned about https://github.com/kimdre/doco-cd . Very interesting! Maybe I should switch my DCM26 workshop from Flux+K8s to a potentially simpler DoCoCD?
Last updated: May 30 2026 at 09:11 UTC