While exploring the new Dataverse Single Page Application (SPA), I came across a potential issue with the technology stack (React on the frontend and the Dataverse API on the backend). It seems there may be challenges with search engine visibility, likely due to slower page load times and the fact that key content is only rendered after asynchronous data loading. As a result, search crawlers may index incomplete pages, which negatively impacts SEO. We’ve looked into a few approaches, but so far there doesn’t appear to be a straightforward solution within the current stack.
To address this and improve both user experience and search engine performance, it seems that a more substantial change may be required. One option we’re considering is a migration to Next.js with Incremental Static Regeneration (ISR), as it offers stronger support for server-side rendering, caching, and SEO.
What are your thoughts on this approach?
Hi @Johannes D, I agree, the current stack probably won't be SEO-friendly.
I believe most pages in the Dataverse Project should use SSR (server-side rendering), since the content changes frequently. ISR is better suited for updating static pages without needing a redeploy, i.e., pages generated at build time. Generating collection (dataverse), dataset, or file pages at build time could also take a very long time, since an installation may contain thousands of them. At the moment, I can't think of any Dataverse pages that would really benefit from being generated at build time, though it might be different in your case. What do you think?
On the other hand, there's this whole tricky thing with client-side vs. server-side React apps... the "use client" directive, JavaScript hydration issues, and all that. I'm not sure we really need to server-side render the entire page. Maybe just generating solid headers with the collection, dataset, or file data on the server could give us the SEO benefits (which is really easy with Next.js), while the rest of the page stays client-only, without needing big changes to the React code.
A (longish) while ago I raised similar concerns with @Guillermo Portas . My idea at the time was to do some server-side rendering and include the necessary SEO bits etc. This would probably require adding another exporter for HTML and adding a JAX-RS endpoint that accepts text/html queries and sends a proper response. This way all of this stuff can co-exist.
The current exporters already allow us to cache rendered results, so it would probably be faster than the current JSF pages. So this would at least help with indexing of datasets. Collections might need some more infrastructure (as we currently don't export them), but this seems very doable as well. (There have been requests for collection exports before, too.)
Also, the sitemaps would need to be updated to route search engine crawlers to the JAX-RS endpoints, as the JSF pages will be gone. If this gives us too much trouble, we can try a hybrid approach: send the cached stuff over the wire and let the SPA kick off from there.
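Just to illustrate the sitemap bit: a minimal sketch, assuming a made-up /dataset?persistentId=... route and that we already have the list of PIDs from somewhere; the format itself is just the standard sitemap protocol.

import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SpaSitemapWriter {

    // Writes sitemap entries that point crawlers at the content-negotiated dataset URLs
    // instead of the old JSF pages. The /dataset?persistentId=... route is an assumption.
    public void writeSitemap(Path target, List<String> persistentIds, String baseUrl) throws IOException {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String pid : persistentIds) {
            xml.append("  <url><loc>")
               .append(baseUrl)
               .append("/dataset?persistentId=")
               .append(URLEncoder.encode(pid, StandardCharsets.UTF_8))
               .append("</loc></url>\n");
        }
        xml.append("</urlset>\n");
        Files.writeString(target, xml.toString(), StandardCharsets.UTF_8);
    }
}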
So a migration from plain React to Next.js, with server-side rendering of some pages, would be the suggested solution (roadmap)?
I'm not sure switching from React to Next.js is necessary. React also supports server-side rendering, but I don't have enough expertise to tell if it is flexible enough to accommodate our special needs.
You are totally right, yet we need some engine that executes the server-side code parts (either plain Node.js, Express, or some other framework). Here is one of Vite's (the current build system) examples of SSR with Express. I personally would like to avoid managing/developing with those low-level API calls, hence my suggestion of Next.js...
Oh wait - I misunderstood your question. I don't think the engine will be a problem at all: we have the Java backend code, which is perfectly fine for doing the server-side rendering, caching, etc.! No need to involve new tech there. As I said above, we don't necessarily have to maintain the JSF code, depending on how we approach this.
So to explain in a bit more detail what I'm suggesting we could explore...
In the interest of keeping the URLs for the SPA and SEO the same, we can easily use content negotiation and serve either HTML or JSON depending on who's asking:
@Path("/dataset")
public class DatasetResource {
@GET
@Path("/{id}")
@Produces(MediaType.TEXT_HTML)
public Response getDatasetPage(@PathParam("id") Long id,
@Context HttpServletRequest request) {
Dataset dataset = datasetService.findById(id);
if (dataset == null || !dataset.isPublic()) {
return Response.status(404).build();
}
String html = seoTemplateService.generateDatasetHTML(dataset, request);
return Response.ok(html)
.header("Cache-Control", "public, max-age=3600")
.build();
}
@GET
@Path("/{id}")
@Produces(MediaType.APPLICATION_JSON)
public Response getDatasetData(@PathParam("id") Long id) {
Dataset dataset = datasetService.findById(id);
if (dataset == null || !dataset.isPublic()) {
return Response.status(404).build();
}
return Response.ok(dataset)
.header("Cache-Control", "public, max-age=1800")
.build();
}
}
The JAX-RS endpoints would return something like the below. We can use our exporters and create a new HTML one. This would be updated on any changes to datasets. Of course we'd need to discuss how to go about stuff like files and collections, but this is an implementation detail.
<!DOCTYPE html>
<html>
<head>
    <title>Dataset: Climate Data 2023 | Dataverse</title>
    <meta name="description" content="Comprehensive climate dataset with temperature and precipitation data...">
    <meta property="og:title" content="Climate Data 2023">
    <meta property="og:description" content="...">
    <meta property="og:url" content="https://yoursite.com/dataset/123">
    <!-- JSON-LD Structured Data -->
    <script type="application/ld+json">
    {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": "Climate Data 2023",
        "description": "...",
        "creator": {...}
    }
    </script>
    <!-- React App Scripts -->
    <script src="/static/js/app.js" defer></script>
</head>
<body>
    <div id="root">
        <!-- Minimal loading content or skeleton -->
        <div class="loading-skeleton">Loading dataset...</div>
    </div>
    <!-- Hydration data for React -->
    <script>
        window.__INITIAL_DATA__ = {
            "datasetId": 123,
            "route": "/dataset/123"
        };
    </script>
</body>
</html>
@Oliver Bertuch I like your thought about exporters. All our export formats (Croissant, DataCite, etc.) are updated every time a dataset is published, so perhaps we could use that latest, static, public data (JSON or XML) to render and cache some SEO-friendly HTML to serve up to search engines.
Doing it slightly fancier by exporting CSS etc. from the React SPA build would even allow us to style it somewhat. If you browse the web with JavaScript deactivated by default, like me, you see a lot of pages that either just show a simple "activate JS" notice or a static, styled export of the site.
This approach works and provides an elegant solution with a tightly coupled integration of frontend and backend logic/code paths. An alternative would be to establish a clear separation, using Java strictly as an API-only backend and implementing a completely separate frontend stack for all web UI functionality. Both approaches have their advantages and drawbacks. Ultimately, it comes down to making a decision on the stack and roadmaps.
I’m not sure how well a page with only <head> markup (and no matching visible content) would rank in SEO. A few years ago, it was important that the metadata values also appeared in the visible content for better scoring.
I completely agree about making content on the static page available, which is why we can simply extend the pre-rendered HTML from the exporter to include more details. As we are talking about exporters, you can even create your own customized variant if need be, since you can load your own as a plugin.
This also has the advantage that someone can at least browse the data with JavaScript deactivated. Personally, I hate websites that just show me a plain text string telling me to activate JavaScript or, even worse, an empty page.
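To make the exporter idea above a bit more concrete, here is a very rough sketch of what such an HTML exporter plugin could look like. The SPI interface and method names are written from memory and may differ, and the markup is intentionally minimal, so treat the whole thing as an illustration rather than a working exporter. Note that it puts the title and description into the visible body as well, not just into <head>, which addresses the ranking concern above.

import com.google.auto.service.AutoService;
import io.gdcc.spi.export.ExportDataProvider;
import io.gdcc.spi.export.ExportException;
import io.gdcc.spi.export.Exporter;
import jakarta.json.JsonObject;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Very rough sketch of an HTML exporter plugin; SPI method names may differ slightly.
@AutoService(Exporter.class)
public class HtmlExporter implements Exporter {

    @Override
    public String getFormatName() {
        return "html";
    }

    @Override
    public String getDisplayName(Locale locale) {
        return "SEO HTML";
    }

    @Override
    public Boolean isHarvestable() {
        return false;
    }

    @Override
    public Boolean isAvailableToUsers() {
        return true;
    }

    @Override
    public String getMediaType() {
        return "text/html";
    }

    @Override
    public void exportDataset(ExportDataProvider dataProvider, OutputStream outputStream) throws ExportException {
        // Reuse the schema.org JSON-LD that Dataverse already produces as the data source.
        JsonObject schemaOrg = dataProvider.getDatasetSchemaDotOrg();
        String title = schemaOrg.getString("name", "");
        String description = schemaOrg.getString("description", "");

        // Title and description go into the visible body too, so crawlers see matching
        // content and non-JS users get something readable.
        String html = "<!DOCTYPE html><html><head>"
                + "<title>" + escape(title) + " | Dataverse</title>"
                + "<meta name=\"description\" content=\"" + escape(description) + "\">"
                + "<script type=\"application/ld+json\">" + schemaOrg + "</script>"
                + "</head><body>"
                + "<h1>" + escape(title) + "</h1>"
                + "<p>" + escape(description) + "</p>"
                + "</body></html>";
        try {
            outputStream.write(html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new ExportException("Could not write HTML export: " + e.getMessage());
        }
    }

    private String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}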
Johannes D said:
It seems there may be challenges with search engine visibility, likely due to slower page load times and the fact that key content is only rendered after asynchronous data loading.
Are we able to measure these? How slow is the page load? Does async data loading matter (and how much)?
It's all measurable. But it will require more sophisticated tooling and code changes.
Observability is a huge thing these days. It allows you to follow requests coming from somewhere (e.g. the SPA) as they travel through the application. You can take the request (or its trace ID) and include it in logging contexts (but we'd need to move away from JUL for this).
With MicroProfile there's also a standardized way to define measurements and metrics that can be collected with appropriate tooling.
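For example, timing and counting the HTML rendering endpoint would only take a couple of annotations. This is just a sketch, assuming MicroProfile Metrics is available in the app server (Payara supports it); the metric names are made up.

import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;
import org.eclipse.microprofile.metrics.MetricUnits;
import org.eclipse.microprofile.metrics.annotation.Counted;
import org.eclipse.microprofile.metrics.annotation.Timed;

@Path("/dataset")
public class DatasetResource {

    @GET
    @Path("/{id}")
    @Produces(MediaType.TEXT_HTML)
    // Collected automatically and exposed via the runtime's metrics endpoint.
    @Timed(name = "datasetHtmlRenderTime", unit = MetricUnits.MILLISECONDS,
           description = "Time spent rendering SEO HTML for a dataset")
    @Counted(name = "datasetHtmlRequests",
             description = "Number of HTML page requests for datasets")
    public Response getDatasetPage(@PathParam("id") Long id) {
        // ... same body as in the content negotiation example above ...
        return Response.ok().build();
    }
}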
And of course there's OpenTelemetry these days, giving you a standardized way to package and ship all of these things.
I would be glad to see all of this implemented, but I'm afraid it'll take a whole FTE for a year just to get this done properly.
I'm just wondering if speeding up the page will help. Like @Johannes D said, that would "improve both user experience and search engine performance". But it would be nice to measure it somehow.
There’s no real need to implement a complex measurement system to assess performance. Instead, you can use the SEO debugging tools provided by Google, Bing, or other search engines to verify whether the correct metadata is being retrieved.
The issue lies primarily in the crawling techniques used. Simple crawlers do not execute JavaScript at all, which means you can easily debug them using basic curl commands. More sophisticated crawlers, however, do execute JavaScript—but only for a limited amount of time before processing the resulting HTML. Since we don’t know the exact timeout duration, our best approach is to ensure that the page renders quickly enough and then verify whether major search engines or AI services are correctly retrieving the intended content.
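To emulate what a simple, non-JS crawler sees, a plain HTTP GET is enough. Here is a minimal sketch (the URL is hypothetical) that fetches a dataset page and checks whether the description metadata and JSON-LD show up in the raw HTML:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CrawlerCheck {

    public static void main(String[] args) throws Exception {
        // Hypothetical dataset URL; a simple crawler only sees what this request returns,
        // because it never executes the SPA's JavaScript.
        String url = "https://example.org/dataset/123";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "text/html")
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        String body = response.body();
        System.out.println("Status: " + response.statusCode());
        System.out.println("Has <meta name=\"description\">: " + body.contains("name=\"description\""));
        System.out.println("Has JSON-LD block: " + body.contains("application/ld+json"));
    }
}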
We should consider drafting an Architecture Decision Record (ADR) to outline the chosen strategy for SEO optimization. Such an ADR would define the context, the options evaluated (e.g., client-side rendering, pre-rendering, server-side rendering), the selected approach, and the rationale behind it. I believe this is the right time to discuss the topic, as the SPA has now reached a sufficient level of maturity to make an informed and sustainable decision.
@Johannes D sure, I agree it makes sense to use SEO debugging tools, but before that shouldn't some basic steps be taken, such as setting up a sitemap so that search engines can more easily crawl the site?
Does the API backend generate the sitemap? If so, it seems that the routes are defined twice — once in the Java backend and once in the SPA code. If that’s the case, then the approach suggested by Oliver is the right way to go and will most likely resolve the SEO issue.
However, the SPA architecture should also account for the fact that some information is (or could be) already present in the initial HTML markup, meaning a REST call or lookup is unnecessary. This interweaving of the backend and SPA should be implemented and tested early on to ensure consistent behavior and to avoid redundant data fetching later in the development process.
Concerning the sitemap: yes, but if no information can be obtained from the linked pages, nothing gets indexed.
Yes, the backend creates the sitemap and only for JSF pages: https://guides.dataverse.org/en/6.8/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines
Philip Durbin 🚀 said:
Yes, the backend creates the sitemap and only for JSF pages: https://guides.dataverse.org/en/6.8/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines
Is the same process planned for the SPA?
I'm not aware of any specific plan, but obviously when JSF is removed completely it doesn't make sense to have entries in a sitemap pointing to pages that don't exist.
So the sitemap generation needs to be reworked at some point.
Do you create a sitemap?
Nope, this whole SEO issue is still open...
Well, the SEO debugging tools from Google, for example, will probably work better if you submit a sitemap to them.