Stream: community

Topic: Feature : Notify harvested repo admins of harvesting errors


view this post on Zulip Dimitri Szabo (Apr 12 2024 at 10:35):

Hi there,
I'm writing the issue for a feature but it has some items that should be discussed first, so feel free to react :smile:

view this post on Zulip Dimitri Szabo (Apr 12 2024 at 10:36):

Here is the current description of the feature:

Overview of the Feature Request

As a repository administrator whose repository is harvested by Dataverse,
I can receive a notification when errors occur while my repository is being harvested,
in order to identify the datasets that were not harvested and improve the harvesting of my repository.

What kind of user is the feature intended for?

Harvested repository administrator (not necessarily a Dataverse user)

What inspired the request?

Use of Dataverse as a metadata catalog for Recherche Data Gouv.

Questions to be addressed about the feature

  1. Should we consider sending notifications even when harvesting succeeds?
  2. Should this be an optional setting, or is it sufficient that adding a contact email is optional?
    * Should there be an option to set whether the installation admin also receives such notifications?

  3. Should we use Dataverse user accounts rather than emails? This would make it possible to:
    * send Dataverse notifications in addition to, or instead of, an email
    * disable such notifications on an individual basis
    * and, in the future, maybe give harvested repository admins access to a dedicated dashboard for their harvesting clients

  4. If yes (for 3.), would it be simpler to do it in two steps (emails first, then maybe accounts) or directly with accounts?

  5. Could this feature, or some of its prerequisites, be associated with existing or future funding (e.g. NIH GREI)?

Any brand new behavior do you want to add to Dataverse?

Two prerequisites are needed and could probably be done in parallel:

1. Being able to send a summary of Harvesting errors

As a Superuser,
when errors occur in Harvesting, I can receive an email notification providing the following information:

Needs clarification of question 3 (Should we use Dataverse user accounts rather than emails?) before detailing this for either emails only or Dataverse accounts.

As a Superuser,
I can add one or more email addresses to a harvesting client's information
in order to be able to contact the harvested repository administrator.
Associated issues:

- Expand API creation of Harvesting Clients to add contact(s)
- (Directly in the SPA) Add contact(s) in the Harvesting Clients creation form

view this post on Zulip Philip Durbin 🚀 (Apr 12 2024 at 11:40):

Interesting. At a high level, sure, I can see how this could be useful. For a client we show if harvesting failed or not but we don't show if the server-side failed (neither set nor entire collection). @Leo Andreev would probably be able to guess how much effort it would be. And @Julian Gautier handles harvesting for Harvard Dataverse (client-side, at least). He might be interested in this feature.

@Dimitri Szabo if we forget about errors for a moment, do you already have an idea of how many of your datasets are harvested and from where? How would you know? Logs?

view this post on Zulip Julian Gautier (Apr 15 2024 at 14:58):

Yes the benefits of this proposal would be great!

The GitHub issue at https://github.com/IQSS/dataverse/issues/9294 feels related to me.

And last week I think I saw another GitHub issue that might be helpful to be aware of. It was also about giving installation admins more information about the status of harvesting clients and attempts. I can't seem to find that GitHub issue. @Ceilyn Boyd is this a GitHub issue you might have opened?

view this post on Zulip Juan Pablo Tosca Villanueva (Apr 15 2024 at 20:46):

I was thinking about this: could there be cases where the failure happens on the client's side while trying to harvest from an installation, for example network failures? And do administrators want to be notified when someone tries to harvest from them and it fails, if they can't do anything about it? What would be the criteria to determine whether a failure should trigger a notification? Also, if I have a Harvesting Server that is harvested by 20 installations, I probably wouldn't like to receive 20 emails :thinking:

view this post on Zulip Dimitri Szabo (Apr 16 2024 at 07:31):

Philip Durbin said:

Dimitri Szabo if we forget about errors for a moment, do you already have an idea of how many of your datasets are harvested and from where? How would you know? Logs?

We currently have no view of how many datasets are harvested; we usually only check the platforms where we want to be harvested. I don't know if you could check that globally via the logs, but it'd be very interesting to know.

view this post on Zulip Dimitri Szabo (Apr 16 2024 at 07:40):

Juan Pablo Tosca Villanueva said:

I was thinking about this: could there be cases where the failure happens on the client's side while trying to harvest from an installation, for example network failures? And do administrators want to be notified when someone tries to harvest from them and it fails, if they can't do anything about it? What would be the criteria to determine whether a failure should trigger a notification?

That's a great point. I think it could still be helpful to have this information even if there is not always a solution on the "harvested side", but, as you pointed out, maybe not in every case, and with the ability to disable notifications from an installation that could always fail.

Juan Pablo Tosca Villanueva said:

Also, if I have a Harvesting Server that is harvested by 20 installations, I probably wouldn't like to receive 20 emails :thinking:

Yes, that's where using user accounts or another mechanism that allows disabling notifications at the user level would be helpful.
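
To make the "20 emails" concern concrete, here is a minimal sketch of how failures could be grouped into one digest per contact, with a per-installation opt-out. Everything in it is an assumption for illustration: `HarvestFailure`, `build_digests`, the `muted` set, and the example addresses are hypothetical names, not existing Dataverse classes or settings.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HarvestFailure:
    # Hypothetical record of one failed harvesting attempt.
    client_installation: str  # installation that tried to harvest
    contact_email: str        # contact of the harvested repository
    message: str              # short error description


def build_digests(failures, muted=frozenset()):
    """Return one digest text per contact email.

    `muted` holds (contact_email, client_installation) pairs that the
    contact has opted out of; those failures are skipped.
    """
    grouped = defaultdict(list)
    for failure in failures:
        if (failure.contact_email, failure.client_installation) in muted:
            continue
        grouped[failure.contact_email].append(failure)

    digests = {}
    for email, items in grouped.items():
        items = sorted(items, key=lambda f: f.client_installation)
        lines = [f"{len(items)} harvesting failure(s) reported:"]
        lines += [f"- {f.client_installation}: {f.message}" for f in items]
        digests[email] = "\n".join(lines)
    return digests


failures = [
    HarvestFailure("a.example", "admin@repo.example", "timeout"),
    HarvestFailure("b.example", "admin@repo.example", "invalid XML"),
]
digests = build_digests(failures, muted={("admin@repo.example", "b.example")})
```

With this shape, a contact receives a single message per run however many installations failed, and muting one installation that always fails does not silence the others.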

view this post on Zulip Julian Gautier (Apr 16 2024 at 13:38):

For helping troubleshoot issues with the harvesting that Harvard Dataverse is doing, I've been using the Dataverse API and scraping repositories' OAI-PMH feeds to create this spreadsheet: https://docs.google.com/spreadsheets/d/1ek3AMYDWBo3ck6D18Q47FL1lTDbkLbf7tWhHliVtjbQ. Let me know if you'd like to talk more about this.
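
As a rough illustration of the feed-scraping side, the sketch below parses one OAI-PMH `ListIdentifiers` response and separates active from deleted records. It is not the script behind the spreadsheet; the function name `summarize_list_identifiers` and the sample identifiers are assumptions, and fetching the feed over HTTP is left out so the example stays self-contained.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up sample response; a real one would come from a repository's
# OAI-PMH endpoint with verb=ListIdentifiers.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>doi:10.5072/FK2/AAA</identifier>
      <datestamp>2024-04-01</datestamp>
    </header>
    <header status="deleted">
      <identifier>doi:10.5072/FK2/BBB</identifier>
      <datestamp>2024-04-02</datestamp>
    </header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def summarize_list_identifiers(xml_text):
    """Split record identifiers into active and deleted lists."""
    root = ET.fromstring(xml_text)
    active, deleted = [], []
    for header in root.findall(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", default="", namespaces=OAI_NS)
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            active.append(identifier)
    # A non-empty resumption token means more pages remain to be fetched.
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return {
        "active": active,
        "deleted": deleted,
        "resumption_token": token.strip() or None,
    }


summary = summarize_list_identifiers(SAMPLE)
```

Comparing the `active` list against what a harvesting client actually imported is one way to spot datasets that were offered by the feed but never harvested.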

The GitHub issue at https://github.com/IQSS/dataverse-pm/issues/171 lists most if not all of the issues we've been working on, too.


Last updated: Nov 01 2025 at 14:11 UTC