So if I am understanding the docs build process correctly, the worker node is cloning the git repo, doing a sphinx build on the Jenkins hardware, scp'ing the zipped archive from the Jenkins hardware into rocky@54.198.97.45, and then unzipping that file on the server, and setting it up.
What is stopping me from creating two GitHub Actions, one that builds the guides and uploads the built artifact for any branch and any PR, and another GitHub Action with AWS IAM role perms set for the action to be able to access the server only from the main branch of the dataverse repo, taking it and unzipping it on the server?
Well, won't the guides server get a bit clogged up if we put docs on it from every PR?
Currently we only use that Jenkins job at two times:
If we build for every PR, I'm worried we'll fill the disk on the guides server.
That's a really good point. Hmmm. Manual trigger then? But with a manual trigger, you would have to specify the exact git branch as an input every single time. It will default to main. Do you ever run build jobs for PRs or other branches? Or only on main/develop?
We use "develop" for https://guides.dataverse.org/en/6.10.1/developers/making-releases.html#build-the-guides-for-the-release-candidate
We use "master" for https://guides.dataverse.org/en/6.10.1/developers/making-releases.html#build-the-guides-for-the-release
I'm fine with a manual trigger (workflow dispatch).
Makes sense to me. I will get a staging Amazon server up and test a mock workflow to see what a workflow dispatch with inputs would look like. Thank you so much!
Sure! Thanks for working on this!
UNC hosts https://guides.dataverse.org . Is that still ok, going forward?
I wish I could give you a concrete answer, but that's a question for @Don Sizemore
@Philip Durbin I can't answer for my superiors (who are out today), but it's a lot less work for us to use GitHub Actions to publish to, say, guides.gdcc.io. How do we feel about GitHub Pages-hosted guides?
In theory, it sounds fine. Can you please do a du to see how much space is used for all the old versions of the guides? However many GB it is, I don't think it'll be completely unreasonable for a git repo. At least I hope not.
@Philip Durbin 2.3GB, though I see four old branch builds that could probably be cleaned up
Hmm, "We recommend repositories remain small, ideally less than 1 GB, and less than 5 GB is strongly recommended." -- https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github
I dunno, I'm a bit nervous that at 2.3 GB we're already halfway toward the 5 GB strongly recommended limit.
I'm not saying UNC should host it forever (of course!) but maybe we shouldn't move so much content into a git repo.
We have a lot of eternally free options. We have Cloudflare Pages/Workers. We can use S3 + GitHub pages. Or we have https://about.readthedocs.com/ which is forever free for Open Source. No GitHub Actions, no large builds, just point them at the repo, and forget about them.
The current setup works with Jenkins, but translating it into Actions, Role ARNs, staging servers for testing, and security reviews is going to be a long, drawn-out, months-long process, which we would rather not get into unless there's no other way.
Those setups are so easy that with org approval, I can have deployments up within minutes
I'm not a fan of the ads on ReadTheDocs. If we can continue to serve up Dataverse documentation without ads, this would be my preference.
I don't think there's a way we can opt out of their ethical ads feature. Cloudflare Pages then? Would that be alright?
I've never used Cloudflare but I've certainly heard good things about it.
I use it regularly for my production deployments. dash.srmanda.com and srmanda.com are some examples from my side. Heavy builds.
But you'll have to move the DNS over to Cloudflare, that's the catch. There's always a catch
I don't see any ads at https://srmanda.com (great!) but at https://dash.srmanda.com I'm seeing this:
![]()
Is that an ad? It looks like one! :sweat_smile:
No, that's my login page for the dash :sweat_smile:
oh, phew!
I never thought that it looks like an ad. Now I see it very clearly. Good advice, I'll work on that page. Thank you!
ha, sure
Are you blocked on a decision on which way to go? I'd like to discuss this with the team.
Yes please, feel free to discuss! I'm just trying to find what's the best way I can make this transition smooth
Well I was trying to build a Proof of Concept to showcase to the team in Cloudflare. I didn't realize this looking at the logs, but apparently pip install graphviz doesn't install the actual dot executable, that has to be installed separately. And serverless environments like GitHub Pages or Cloudflare don't support installation of the graphviz binary.
As long as the docs use graphviz, the docs build will always require a full server to run
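A minimal sketch of how a build script could detect the missing binary up front, assuming the Sphinx build shells out to `dot` on `PATH`. The function name `find_dot` is hypothetical and not part of the Dataverse build.

```python
import shutil
from typing import Optional


def find_dot(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path to the Graphviz `dot` binary, or None if absent.

    `pip install graphviz` provides Python bindings only, so the binary
    still has to come from a system package (apt, brew, ...).
    `search_path` overrides PATH, which is mainly useful for testing.
    """
    return shutil.which("dot", path=search_path)


if __name__ == "__main__":
    location = find_dot()
    if location is None:
        print("dot not found: install Graphviz with the system package manager")
    else:
        print(f"dot found at {location}")
```

Running a check like this as the first build step gives a clear error instead of a failure buried in the Sphinx logs.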
The only real workaround is for the docs graphs to be rewritten into mermaid, which isn't a complete rewrite, but it is still a major change. And even after we do that, I'm not entirely sure what other hiccups I will run into. But, I can always experiment and try. So I will rewrite my fork's doc graphs into mermaid and let you know how it goes
If it helps, on my Mac I install dot by installing graphviz: https://guides.dataverse.org/en/6.10.1/contributor/documentation.html#sphinx-installed-locally
mhmm I read your docs, they were very helpful, but cloudflare is serverless so brew/apt-get all are disabled and restricted
Cloudflare proof of concept is live at: https://dataverse-smr.pages.dev/
Compare the two renderers:
Graphviz: https://guides.dataverse.org/en/latest/developers/dependencies.html
Mermaid: https://dataverse-smr.pages.dev/developers/dependencies
I was just looking at the one at https://dataverse-smr.pages.dev/api/intro#what-is-an-api ! Amazing!
@Oliver Bertuch check this out ^^
@Ash Manda you converted all the dot/graphviz images to Mermaid?
Yep. Wasn't that hard
Just some syntax rewrite
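To illustrate the kind of syntax rewrite involved, here is a rough sketch that turns plain `a -> b` DOT edges into Mermaid flowchart lines. It is not how the branch was converted: it assumes bare identifiers as node names and ignores attributes, labels, and subgraphs. The name `dot_edges_to_mermaid` is hypothetical.

```python
import re

# Matches simple DOT edges with bare identifiers, e.g. `maven -> payara;`
EDGE = re.compile(r"^\s*(\w+)\s*->\s*(\w+)\s*;?\s*$")


def dot_edges_to_mermaid(dot_source: str, direction: str = "TD") -> str:
    """Convert simple DOT edges into a Mermaid `graph` definition.

    Lines that are not plain edges (the `digraph` header, braces,
    attribute statements) are skipped rather than translated.
    """
    lines = [f"graph {direction}"]
    for raw in dot_source.splitlines():
        match = EDGE.match(raw)
        if match:
            src, dst = match.groups()
            lines.append(f"    {src} --> {dst}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = "digraph deps {\n  maven -> payara;\n  payara -> dataverse\n}"
    print(dot_edges_to_mermaid(example))
```

Real diagrams with labels or styling would still need manual attention.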
Do you have a branch on GitHub for this? Maybe we should just merge it!
@Oliver Bertuch would you miss dot/graphviz terribly?
If we replace dot/graphviz with mermaid, that's fine by me.
@topic we don't need to commit the built pages as sources to Git. So GitHub Pages may still be an option.
But we have 2.3 GB of old versions we need to put somewhere.
I thought I'd read that we can use the artifact deployment method to get around that, but apparently this is no longer true. https://github.com/marketplace/actions/upload-github-pages-artifact#artifact-validation
Personally, I'd vote against using Cloudflare. They are putting hefty price tags on their services wherever they can. I know that quite a bunch of things like the Turing Way Guide are using Netlify: https://github.com/the-turing-way/the-turing-way/blob/main/netlify.toml
@Don Sizemore do you have statistics about the guide usage like bandwidth/month and requests/month?
Oliver Bertuch said:
Don Sizemore do you have statistics about the guide usage like bandwidth/month and requests/month?
@Oliver Bertuch see
I do have a branch here: https://github.com/srmanda-cs/dataverse/tree/12408-move-docs-jenkins-builds
But the PR needs to come from unc-ch rdmc, for now it's just an experiment more than anything else.
The mermaid change has to be made for ANY serverless deployment though. GitHub Pages, Cloudflare, Netlify, Railway
I'm not against using Mermaid, but wouldn't it be sufficient to just install the necessary dependencies at build time in CI?
(Or use a readily available Docker image for the job)
Looking at https://github.com/IQSS/dataverse/blob/develop/.github/workflows/guides_build_sphinx.yml this is what we use so far...
It takes care of setting up the requirements.
Yeah Cloudflare doesn't give access to apt-get or brew, it's a locked down image
Why would we need that image?
Isn't uploading the stuff (resulting artifact) to cloudflare a separate step?
Ohhhh so build in GitHub actions and upload to Cloudflare? That could work.
Here's how I do it with Maven Site and upload the artifact after for the dataverse-spi: https://github.com/gdcc/dataverse-spi/blob/main/.github/workflows/site.yml
The workflow above with building the guide in Github Actions should already give you something usable as a deployment.
let me do an upload, I'll let you know
So is the final decision to go with GitHub Pages? This is viable
No no, this won't be possible I think
If y'all want to go with Cloudflare though, I would strongly recommend Mermaid, then you can retire all the GitHub Actions and have a lot less to maintain. Just one build command that's it
and Cloudflare Pages is very very generous
GitHub Pages has a supported limit of 1 GB, independent of using a Git branch or deployment artifact.
Ok, so even smaller than the 5 GB limit I mentioned above.
Should I create a separate topic in #docs about moving from dot/graphviz to Mermaid? And maybe we can have a dedicated issue and PR?
SGTM if the team is okay with it
Cloudflare's free tier limit is 20k files, each not exceeding 25MB
That might be cutting it close...
We have a lot of versions...
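A quick way to see how close a build tree is to those limits is to count files and flag large ones. This is a sketch only: it reads "25MB" as 25 MiB, which is an assumption, and the names `check_build_dir` and `LimitReport` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import List

MAX_FILES = 20_000                 # free-tier file cap quoted in the thread
MAX_FILE_BYTES = 25 * 1024 * 1024  # per-file cap, read as MiB (assumption)


@dataclass
class LimitReport:
    file_count: int
    oversized: List[str]
    fits: bool


def check_build_dir(root: str,
                    max_files: int = MAX_FILES,
                    max_file_bytes: int = MAX_FILE_BYTES) -> LimitReport:
    """Walk `root` and compare it against a file-count and file-size cap."""
    file_count = 0
    oversized: List[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            file_count += 1
            if os.path.getsize(path) > max_file_bytes:
                oversized.append(path)
    fits = file_count <= max_files and not oversized
    return LimitReport(file_count, oversized, fits)


if __name__ == "__main__":
    report = check_build_dir(".")
    print(f"{report.file_count} files, {len(report.oversized)} oversized, fits={report.fits}")
```

Pointing it at a copy of the full guides tree, all versions included, would show whether one deployment could hold everything.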
Ash Manda said:
SGTM if the team is okay with it
Great, please see #docs > switch from dot/graphviz to mermaid
Oliver Bertuch said:
We have a lot of versions...
we might have to use R2 then. 1.5 cents a month per GB. It's all tradeoffs, no perfect solution :sob:
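For a sense of scale, a storage-only estimate at that rate can be sketched as below. It assumes the quoted $0.015 per GB per month and deliberately leaves out per-operation charges. The `free_gb` parameter is a placeholder for any free allowance, and the function name is hypothetical.

```python
def r2_storage_cost_usd(stored_gb: float,
                        rate_per_gb_month: float = 0.015,
                        free_gb: float = 0.0) -> float:
    """Estimate the monthly storage charge in USD (storage only).

    Only the amount above `free_gb` is billed; read and write
    operation charges are not modeled here.
    """
    billable_gb = max(stored_gb - free_gb, 0.0)
    return billable_gb * rate_per_gb_month


if __name__ == "__main__":
    # 2.3 GB is the `du` figure reported earlier in the thread.
    print(f"${r2_storage_cost_usd(2.3):.4f} per month")
```

At the 2.3 GB reported earlier, that works out to roughly three and a half cents a month before any free allowance.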
@Philip Durbin as you hate the ads on RTD, by using the new deployment mechanism (whatever it may be in the end) we can drop using it if we build the deploy chain on our own. Will impact the amount of files and used bandwidth though.
well do let me know what the team converges on! I'm happy to go any way. But that unc docs server is going away, that much is sure, and I want to prevent the site from going down completely.
I've also been chatting with @Leo Andreev @Steven Winship and @Ceilyn Boyd about this in Slack. Please stay tuned!
AI Summary of everything so far so that I don't get confused and cause y'all headaches
Context: The UNC docs server and Jenkins build server are both going away. We need a new home for the Dataverse documentation. Here's where things stand:
The disk reality: du on the current docs reports 2.3 GB, which already exceeds GitHub Pages' 1 GB limit and is about halfway to the 5 GB that GitHub strongly recommends staying under for a repo. GitHub Pages is effectively off the table.
Options on the table:
ReadTheDocs: Works today with zero migration effort, but has ads. It's an option, but Phil has noted the ads are a pain.
GitHub Pages: 1 GB limit for the published site, 5 GB strongly recommended maximum for the repo. GitHub actively enforces the 1 GB limit and has taken down sites for violations. GitHub Pages is fundamentally not built for this use case. At 2.3 GB we already exceed the Pages limit. Not viable without significant pruning of old versions.
Cloudflare Pages: Unlimited bandwidth, generous free tier, but capped at 20,000 files per deployment and 25 MB per file. Graphviz (dot) cannot be installed in Cloudflare's locked-down build environment, so this option requires either switching to Mermaid or building the artifact elsewhere first.
The two paths if we go Cloudflare:
Path A: Build on GitHub Actions, upload artifact to Cloudflare: Keep Graphviz, use the existing guides_build_sphinx.yml (which already installs Graphviz and runs the build but does not publish), add a deploy step using cloudflare/pages-action. More moving parts, two systems to maintain.
Path B: Mermaid + Cloudflare native build: Replace Graphviz with sphinxcontrib-mermaid, let Cloudflare handle the entire build with a single command (cd doc/sphinx-guides && pip install -r requirements.txt && make html && make epub && cp build/epub/Dataverse.epub build/html). Retires the GitHub Actions pipeline entirely. Simpler long-term, one system to maintain.
Active work: There is an experimental branch already at https://github.com/IQSS/dataverse/compare/develop...srmanda-cs:dataverse:12408-move-docs-jenkins-builds with the Mermaid migration (8 commits, 4 files changed). A live Cloudflare deployment of the docs is already running at https://dataverse-smr.pages.dev/ on v6.10.1 as proof of concept.
The 20k file ceiling concern: With multiple versioned doc releases this could become an issue on Cloudflare's free tier. Mitigation would be Cloudflare R2 at roughly $0.015/month per GB, plus per-operation pricing for Class A/Class B operations, which adds cost but is bearable.
Decision needed: Which hosting platform, and if Cloudflare, Path A or Path B?
My priority: make sure the site does not accidentally go down, the migration is smooth, and nobody is left holding a broken build or left hanging. I am happy to own the entire transition end to end and build whatever path the team decides on. No matter which option we go with, even servers, I will make it work.
Sounds right.
The part that seems missing here: how does Cloudflare Pages handle the versioning? If building in CI, we control where to put things. How does this work when building on Cloudflare Pages?
One other thing that comes to mind when building at Cloudflare and others (and not in CI): we lose some control over the process, potentially making it less portable. But this may just be me being overly cautious.
By versioning, if you mean version control, then Cloudflare makes a new deployment URL per commit
They exist indefinitely.
versioning is very important
right now we just have subdirectories for 6.9 or 6.2 or whatever
Well, technically, since Cloudflare can generate a static URL per commit, you wouldn't need to keep any old docs in the deployment. You would just point each old version at the URL of the matching commit, say the commit URL for 6.9 on master, so the number of files wouldn't ever exceed about 700. But in practice I don't know what that would look like. It's definitely possible though
I'm not sure I follow but that's ok. :smile: We want people to have the same experience as today when they visit docs for their old version of Dataverse.
Not to forget we must make sure not to break any existing links. Installations rely on the versioned links to be kept intact, as they link to the guides.
yes, exactly
so the URLs need to stay the same, https://guides.dataverse.org/en/6.2/ or whatever old version
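One way to check that requirement against any candidate host is to map each existing URL to the file it must resolve to, then look for gaps. This is a sketch: serving `index.html` for directory-style URLs is an assumption about the host, and both function names are hypothetical.

```python
from typing import Iterable, List, Set
from urllib.parse import urlsplit


def url_to_object_key(url: str) -> str:
    """Map a guides URL to the file path (object key) that must exist.

    The fragment (`#section`) is resolved by the browser, so it is dropped.
    Directory-style URLs are assumed to be served from `index.html`.
    """
    path = urlsplit(url).path.lstrip("/")
    if path == "" or path.endswith("/"):
        path += "index.html"
    return path


def missing_links(urls: Iterable[str], existing_keys: Set[str]) -> List[str]:
    """Return the URLs whose target file is absent from the new host."""
    return [u for u in urls if url_to_object_key(u) not in existing_keys]


if __name__ == "__main__":
    print(url_to_object_key("https://guides.dataverse.org/en/6.2/"))
```

Feeding `missing_links` a list of known deep links and the file listing of the new host gives a simple broken-link check to run before switching DNS.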
Philip Durbin said:
I'm not sure I follow but that's ok. :smile: We want people to have the same experience as today when they visit docs for their old version of Dataverse.
Well this is my very first time moving a Sphinx deployment from a server to serverless so this is all very very new to me ;(
From what I understand, this is something Cloudflare does. So every time you push a commit to a branch it is monitoring, it creates a separate static url for THAT specific commit and that specific docs build.
Right now, for example /6.9/ is an entire separate directory with its own files. The way Cloudflare handles it, it would just redirect to a build made using an older commit which has its own url instead of storing all the files in one place. I hope this makes sense?
sort of?
I think I'd need to see it in action to know if it's what we need.
@Philip Durbin do we have any deep links into a specific page of a version from the application that are user facing? (notifications, ...)
yep, I'm pretty sure we do
This might break depending on the possibilities for a redirect. I don't know enough about their setup, but should be verified.
Philip Durbin said:
I think I'd need to see it in action to know if it's what we need.
Yeah, I agree. As I said, it might not work as we need in practice.
What we have today is a static site served up by Apache. Very 90s. Works fine! :smile:
(or maybe it's nginx, I dunno)
it's probably going to be fine for the next few decades even, very robust, but it needs a server ; )
@Philip Durbin would IQSS be fine in putting some money into this? Or does it have to be for free?
well not just money, the server needs constant upkeep and maintenance too from personnel
I'm saying there's a ton of options when it comes to static website hosting that do not involve any maintenance. We can just mimic what we do now with Jenkins and keep all the old versions the same way we always did. This may involve spending a small fee though.
Oliver Bertuch said:
Philip Durbin would IQSS be fine in putting some money into this? Or does it have to be for free?
It's a fair question. We can ask Ceilyn Boyd about it.
I mean, maybe we could give ReadTheDocs some money to get rid of the ads. They seem like a nice company.
Leadership would love to know about a timeline for the decommissioning, is there one that y'all think is realistic from your side?
Look at what I just found... https://pico.sh It's $2/month and seems to enable just dropping off stuff via SSH, as we used to. (Would need building in CI though)
BTW a build of our guides takes ~40MB of space.
Ash Manda said:
Leadership would love to know about a timeline for the decommissioning, is there one that y'all think is realistic from your side?
Decommissioning what? Everything UNC hosts? Originally we were talking about Jenkins.
Oliver Bertuch said:
BTW a build of our guides takes ~40MB of space.
Does that include HTML, epub, and PDF?
Philip Durbin said:
Ash Manda said:
Leadership would love to know about a timeline for the decommissioning, is there one that y'all think is realistic from your side?
Decommissioning what? Everything UNC hosts? Originally we were talking about Jenkins.
Jenkins, yes!
Philip Durbin said:
Oliver Bertuch said:
BTW a build of our guides takes ~40MB of space.
Does that include HTML, epub, and PDF?
Only HTML.
@Ash Manda for Jenkins, can we pick up the conversation from here: ?
@Oliver Bertuch @Philip Durbin what do you think about an S3-compatible static host instead of a full-blown server? Cloudflare R2?
GitHub Actions can build, and upload the entire build artifact to R2 using aws s3 sync, the same thing that Jenkins is doing right now. Cloudflare will provide an SSL certificate for guides.dataverse.org.
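As a rough sketch of that upload step, the helper below assembles an `aws s3 sync` invocation aimed at R2's S3-compatible endpoint. The bucket name, account ID, and `en/<version>/` prefix in the example are placeholders, and `build_sync_command` is a hypothetical helper, not something from the existing workflow.

```python
from typing import List


def build_sync_command(build_dir: str, bucket: str, version: str,
                       account_id: str, lang: str = "en") -> List[str]:
    """Build the argv for syncing one built version of the guides to R2.

    The destination prefix mirrors the current URL layout (/en/<version>/),
    so existing versioned links would keep resolving.
    """
    endpoint = f"https://{account_id}.r2.cloudflarestorage.com"
    destination = f"s3://{bucket}/{lang}/{version}/"
    return ["aws", "s3", "sync", build_dir, destination,
            "--endpoint-url", endpoint]


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    cmd = build_sync_command("doc/sphinx-guides/build/html",
                             "example-guides-bucket", "6.11",
                             "ACCOUNT_ID")
    print(" ".join(cmd))
```

In a CI job the list could be passed to `subprocess.run(cmd, check=True)`, with the R2 access key supplied through the usual AWS credential environment variables.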
R2 is going to be free practically forever since the first 10GiB is completely free, egress is zero. Class A operations are writes, Class B operations are reads. That's A LOT.
This setup can also be emulated with AWS S3 directly, but S3 static website endpoints don't serve HTTPS, so unless we put CloudFront in front of it, the guides link will only serve HTTP.
Looks nice but let's wait to hear from @Leo Andreev and @Steven Winship on their preferences.
Some hard statistics. The current guides deployment has 42k files and weighs about 2.1GB
beefy :cow:
Cloudflare R2 PoC live at: https://guides.srmanda.com/
GitHub Actions for regular updates, for example 6.11, will look like this:
and that's about it. Cost? $0
This is a direct dump of the current docs server
Nice! You have all the versions all the way back to https://guides.srmanda.com/en/3.6.2/ ! If nothing else, this is a good backup! :smile:
Ash Manda said:
R2 is going to be free practically forever since the first 10GiB is completely free
And the first 10 GB is free, you said ^^
Last updated: May 30 2026 at 06:18 UTC