Stream: ui-dev

Topic: potential SEO issues with the new SPA frontend


Johannes D (Sep 17 2025 at 12:22):

While exploring the new Dataverse Single Page Application (SPA), I came across a potential issue with the technology stack (React on the frontend and the Dataverse API on the backend). It seems there may be challenges with search engine visibility, likely due to slower page load times and the fact that key content is only rendered after asynchronous data loading. As a result, search crawlers may index incomplete pages, which negatively impacts SEO. We’ve looked into a few approaches, but so far there doesn’t appear to be a straightforward solution within the current stack.

To address this and improve both user experience and search engine performance, it seems that a more substantial change may be required. One option we’re considering is a migration to Next.js with Incremental Static Regeneration (ISR), as it offers stronger support for server-side rendering, caching, and SEO.

What are your thoughts on this approach?

Germán Saracca (Sep 17 2025 at 14:20):

Hi @Johannes D, I agree, the current stack probably won't be SEO friendly.

I believe most pages in the Dataverse Project should use SSR (server-side rendering), since the content changes frequently. ISR is better suited for updating static pages without needing a redeploy, i.e., pages generated at build time. Generating collection (dataverse), dataset, or file pages at build time could also take a very long time, since an installation may contain thousands of them. At the moment, I can’t think of any Dataverse pages that would really benefit from being generated at build time, though it might be different in your case. What do you think?

On the other hand, there’s this whole tricky thing with client-side vs. server-side React apps... the "use client" directive, JavaScript hydration issues, and all that. I’m not sure we really need to server-side render the entire page. Maybe just generating solid headers with the collection, dataset, or file data on the server could give us the SEO benefits (really easy with Next.js), while the rest of the page stays client-only without needing big changes to the React code.

Oliver Bertuch (Sep 17 2025 at 14:21):

A (longish) while ago I had been raising similar concerns with @Guillermo Portas . My idea at the time was to do some server side rendering and include the necessary SEO bits etc. This would probably require adding another exporter for HTML and adding a JAX-RS endpoint to accept text/html queries and sending a proper response. This way all of this stuff can co-exist.

The current exporters already allow us to cache rendered results, so it will probably be faster than the current JSF pages. So this would at least help with indexing of datasets. Collections might need some more infrastructure (as we currently don't export them), but this seems very doable as well. (There have been requests before about exports for collections, too.)
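To make the exporter idea a bit more concrete, here is a rough sketch of the rendering step such an HTML exporter could perform: turning already-exported metadata into a minimal, cacheable page. All names here are illustrative; this deliberately does not reproduce the real Dataverse exporter SPI interface.

```java
// Illustrative sketch only: renders the SEO-critical parts of a dataset page
// from already-exported metadata. Class and method names are hypothetical.
public class SeoHtmlRenderer {

    // Minimal HTML escaping for text placed into element/attribute content.
    static String esc(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Renders a minimal, cacheable HTML page for one dataset.
    public static String render(String title, String description, String url) {
        return "<!DOCTYPE html>\n<html>\n<head>\n"
             + "  <title>" + esc(title) + " | Dataverse</title>\n"
             + "  <meta name=\"description\" content=\"" + esc(description) + "\">\n"
             + "  <meta property=\"og:title\" content=\"" + esc(title) + "\">\n"
             + "  <meta property=\"og:url\" content=\"" + esc(url) + "\">\n"
             + "</head>\n<body>\n"
             // Also repeat the metadata as visible content, not just <head> tags.
             + "  <h1>" + esc(title) + "</h1>\n"
             + "  <p>" + esc(description) + "</p>\n"
             + "</body>\n</html>\n";
    }

    public static void main(String[] args) {
        System.out.println(render("Climate Data 2023",
                "Temperature and precipitation data",
                "https://example.org/dataset/123"));
    }
}
```

Such a renderer would run once per dataset publication and the result would be cached like the other export formats.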

Also, the sitemaps would need to be updated to route search engine crawlers to the JAX-RS endpoints, as the JSF pages will be gone. If this gives us too much trouble, we can try a hybrid approach and send the cached stuff over the wire and let the SPA kick off from there.

Johannes D (Sep 18 2025 at 05:41):

So a migration from classic plain React to Next.js with server-side rendering of some pages would be the suggested solution (roadmap)?

Oliver Bertuch (Sep 18 2025 at 06:15):

I'm not sure switching from React to Next.js is necessary. React also supports server-side rendering, but I don't have enough expertise to tell if it is flexible enough to accommodate our special needs.

Johannes D (Sep 18 2025 at 07:13):

You are totally right, yet we need some engine that executes the server-side code parts (either pure Node.js, Express, or some other framework). Here is one of Vite's (the current build system) examples on SSR with Express. I personally would like to avoid managing/developing with those low-level API calls. Hence, I suggested Next.js...

Oliver Bertuch (Sep 18 2025 at 07:18):

Oh wait - I misunderstood your question. I don't think the engine will be a problem at all - we do have the Java backend code that is perfectly fine to do the server side rendering, caching etc! No need to involve new tech there. As I said above, we don't necessarily have to maintain the JSF code, depending on how we approach this.

Oliver Bertuch (Sep 18 2025 at 07:52):

So to explain in a bit more detail what I'm suggesting we could explore...

Architecture Overview

  1. SEO-optimized JAX-RS endpoints serve minimal HTML with critical SEO data
  2. React SPA takes over after initial page load for dynamic functionality
  3. Shared JSON API serves data to both server-rendered pages and client-side app

Implementation Strategy

1. JAX-RS HTML Endpoints

In the interest of keeping the URLs for SPA and SEO the same, we can easily use content negotiation and serve either HTML or JSON depending on who's asking.

import jakarta.inject.Inject;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.Context;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/dataset")
public class DatasetResource {

    @Inject
    DatasetService datasetService;

    @Inject
    SeoTemplateService seoTemplateService;

    @GET
    @Path("/{id}")
    @Produces(MediaType.TEXT_HTML)
    public Response getDatasetPage(@PathParam("id") Long id,
                                   @Context HttpServletRequest request) {
        Dataset dataset = datasetService.findById(id);
        if (dataset == null || !dataset.isPublic()) {
            return Response.status(404).build();
        }

        String html = seoTemplateService.generateDatasetHTML(dataset, request);
        return Response.ok(html)
                .header("Cache-Control", "public, max-age=3600")
                .build();
    }

    @GET
    @Path("/{id}")
    @Produces(MediaType.APPLICATION_JSON)
    public Response getDatasetData(@PathParam("id") Long id) {
        Dataset dataset = datasetService.findById(id);
        if (dataset == null || !dataset.isPublic()) {
            return Response.status(404).build();
        }

        return Response.ok(dataset)
                .header("Cache-Control", "public, max-age=1800")
                .build();
    }
}

2. Minimal HTML Template

The JAX-RS endpoints would return something like the below. We can use our exporters and create a new HTML one. This will be updated on any changes to datasets. Of course we'd need to discuss how to go about stuff like files and collections, but this is an implementation detail.

<!DOCTYPE html>
<html>
<head>
    <title>Dataset: Climate Data 2023 | Dataverse</title>
    <meta name="description" content="Comprehensive climate dataset with temperature and precipitation data...">
    <meta property="og:title" content="Climate Data 2023">
    <meta property="og:description" content="...">
    <meta property="og:url" content="https://yoursite.com/dataset/123">
    <!-- JSON-LD Structured Data -->
    <script type="application/ld+json">
    {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": "Climate Data 2023",
        "description": "...",
        "creator": {...}
    }
    </script>
    <!-- React App Scripts -->
    <script src="/static/js/app.js" defer></script>
</head>
<body>
    <div id="root">
        <!-- Minimal loading content or skeleton -->
        <div class="loading-skeleton">Loading dataset...</div>
    </div>
    <!-- Hydration data for React -->
    <script>
        window.__INITIAL_DATA__ = {
            "datasetId": 123,
            "route": "/dataset/123"
        };
    </script>
</body>
</html>

3. ✅ Advantages

  1. Leverage existing infrastructure - Use the current JAX-RS setup
  2. Minimal server overhead - Only render SEO-critical data
  3. Fast client-side experience - React SPA handles all user interactions, rendering all the hard bits (editing etc.)
  4. Gradual implementation - Roll out page by page
  5. Shared API logic - Same JAX-RS services for both SEO and SPA endpoints

4. ⚠️ Considerations

  1. Duplicate routing logic - Maintain routes in both JAX-RS and React Router. But to be fair: we don't have a ton of routes for SEO display purposes, so this should be fairly simple.
  2. Cache invalidation - Keep SEO HTML in sync with data changes. This is already the case if we follow the exporter pattern already in place. But it needs work for files and collections.
  3. URL strategy - Decide how to handle URL routing between systems. It would be good to have the exact same URLs for the sake of SEO.

Philip Durbin 🚀 (Sep 18 2025 at 11:13):

@Oliver Bertuch I like your thought about exporters. All our export formats (Croissant, DataCite, etc.) are updated every time a dataset is published so perhaps we could use that latest, static, public data (JSON or XML) to render and cache some SEO-friendly HTML to serve up to search engines.

Oliver Bertuch (Sep 18 2025 at 11:30):

Doing it slightly more fancy by exporting CSS etc. from the React SPA build would even allow us to style it somewhat. If you browse the web with JS deactivated by default, like me, you see a lot of these pages where they either just show a simple "activate JS" notice or a static, styled export of the site.

Johannes D (Sep 18 2025 at 11:31):

This approach works and provides an elegant solution with a tightly coupled integration of frontend and backend logic/code paths. An alternative would be to establish a clear separation, using Java strictly as an API-only backend and implementing a completely separate frontend stack for all web UI functionality. Both approaches have their advantages and drawbacks. Ultimately, it comes down to making a decision on the stack and roadmaps.

I’m not sure how well a page with only <head> markup (and no matching visible content) would rank in SEO. A few years ago, it was important that the metadata values also appeared in the visible content for better scoring.

Oliver Bertuch (Sep 18 2025 at 11:35):

I completely agree about making content on the static page available. Which is why we can simply extend the pre-rendered HTML from the exporter to include more details. As we are talking exporters, you can even simply create your own customized variant if need be, as you can load your own as a plugin.

This also has the advantage that someone can at least browse the data with JavaScript deactivated. Personally I hate websites that just give me a plain text string telling me to activate JavaScript or, even worse, show me an empty page.

Philip Durbin 🚀 (Sep 18 2025 at 13:10):

Johannes D said:

It seems there may be challenges with search engine visibility, likely due to slower page load times and the fact that key content is only rendered after asynchronous data loading.

Are we able to measure these? How slow is the page load? Does async data loading matter (and how much)?

Oliver Bertuch (Sep 18 2025 at 13:22):

It's all measurable. But it will require more sophisticated tooling and code changes.

Observability is a huge thing these days. It allows you to follow requests coming from somewhere (e.g. the SPA) as they are tracked along their way through the application. You can take the request and include it in logging contexts (but we'd need to move away from JUL for this).

With Microprofile there's also a standardized way to define measurements and metrics, that can be collected with appropriate tooling.

And of course there's Open Telemetry these days, allowing you to have a standardized way to package and ship all of these things.

I would be glad to see all of this implemented, but I'm afraid it'll take a whole FTE for a year just to get this done properly.

Philip Durbin 🚀 (Sep 18 2025 at 13:27):

I'm just wondering if speeding up the page will help. Like @Johannes D said, that would "improve both user experience and search engine performance". But it would be nice to measure it somehow.

Johannes D (Oct 06 2025 at 09:33):

There’s no real need to implement a complex measurement system to assess performance. Instead, you can use the SEO debugging tools provided by Google, Bing, or other search engines to verify whether the correct metadata is being retrieved.

The issue lies primarily in the crawling techniques used. Simple crawlers do not execute JavaScript at all, which means you can easily debug them using basic curl commands. More sophisticated crawlers, however, do execute JavaScript—but only for a limited amount of time before processing the resulting HTML. Since we don’t know the exact timeout duration, our best approach is to ensure that the page renders quickly enough and then verify whether major search engines or AI services are correctly retrieving the intended content.
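The simple-crawler case is easy to approximate without any extra tooling: fetch the page the way curl would (no JavaScript execution) and check whether the SEO-relevant tags are already present in the raw markup. A minimal sketch of such a check, where the class name and the tag list are illustrative:

```java
import java.util.List;

// Sketch of a crawler's-eye check: does the raw HTML (as fetched by curl or a
// non-JS crawler) already contain the SEO-relevant markers, before hydration?
public class RawHtmlSeoCheck {

    // Markers a non-JS crawler should find in the initial response.
    static final List<String> REQUIRED = List.of(
            "<title>", "name=\"description\"", "application/ld+json");

    // Returns the required markers missing from the raw (pre-hydration) HTML.
    public static List<String> missingTags(String rawHtml) {
        return REQUIRED.stream().filter(tag -> !rawHtml.contains(tag)).toList();
    }

    public static void main(String[] args) {
        // A typical SPA shell: only an empty root div, no metadata at all.
        String spaShell = "<!DOCTYPE html><html><head></head>"
                        + "<body><div id=\"root\"></div></body></html>";
        System.out.println("Missing: " + missingTags(spaShell));
    }
}
```

In practice the input would be the body of `curl -s https://your-installation/dataset/...` rather than a hard-coded string.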

We should consider drafting an Architecture Decision Record (ADR) to outline the chosen strategy for SEO optimization. Such an ADR would define the context, the options evaluated (e.g., client-side rendering, pre-rendering, server-side rendering), the selected approach, and the rationale behind it. I believe this is the right time to discuss the topic, as the SPA has now reached a sufficient level of maturity to make an informed and sustainable decision.

Philip Durbin 🚀 (Oct 06 2025 at 13:31):

@Johannes D sure, I agree it makes sense to use SEO debugging tools but before that shouldn't some basic steps be taken such as setting up sitemap so that search engines can more easily crawl the site?

Johannes D (Oct 06 2025 at 14:31):

Does the API backend generate the sitemap? If so, it seems that the routes are defined twice — once in the Java backend and once in the SPA code. If that’s the case, then the approach suggested by Oliver is the right way to go and will most likely resolve the SEO issue.

However, the SPA architecture should also account for the fact that some information is, or could already be, present in the initial HTML markup, meaning a REST call or lookup is unnecessary. This interweaving between the backend and SPA should be implemented and tested early on to ensure consistent behavior and to avoid redundant data fetching later in the development process.

Concerning the sitemap: Yes, but if no information can be obtained from the linked pages, nothing gets indexed.

Philip Durbin 🚀 (Oct 06 2025 at 14:32):

Yes, the backend creates the sitemap and only for JSF pages: https://guides.dataverse.org/en/6.8/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines

Johannes D (Oct 06 2025 at 14:33):

Philip Durbin 🚀 said:

Yes, the backend creates the sitemap and only for JSF pages: https://guides.dataverse.org/en/6.8/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines

Is the same process planned for the SPA?

Philip Durbin 🚀 (Oct 06 2025 at 14:36):

I'm not aware of any specific plan, but obviously when JSF is removed completely it doesn't make sense to have entries in a sitemap pointing to pages that don't exist.

So the sitemap generation needs to be reworked at some point.
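As a rough illustration of what that rework could boil down to, here is a sketch of the output-building step for an SPA-aware sitemap. The URLs in the example are placeholders; the actual SPA route layout is still undecided.

```java
import java.util.List;

// Sketch: building sitemap XML entries for SPA pages. The URLs passed in are
// assumptions for illustration; the real SPA routing may differ.
public class SpaSitemapBuilder {

    // Builds a sitemap.xml document from a list of absolute page URLs.
    public static String build(List<String> urls) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : urls) {
            sb.append("  <url><loc>").append(url).append("</loc></url>\n");
        }
        sb.append("</urlset>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical SPA URLs; in the backend these would come from querying
        // the published collections and datasets, as the JSF sitemap does today.
        List<String> urls = List.of(
                "https://example.org/dataset/123",
                "https://example.org/collection/root");
        System.out.print(build(urls));
    }
}
```

Since the backend already knows every published collection and dataset, regenerating this list on the Java side (as today) and simply pointing the `<loc>` entries at the SPA routes seems like the smallest viable change.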

Philip Durbin 🚀 (Oct 06 2025 at 14:36):

Do you create a sitemap?

Johannes D (Oct 06 2025 at 14:50):

Nope, this whole SEO issue is still open...

Philip Durbin 🚀 (Oct 06 2025 at 14:55):

Well, the SEO debugging tools from Google, for example, will probably work better if you submit a sitemap to them.


Last updated: Nov 01 2025 at 14:11 UTC