Stream: dev

Topic: changing collection PID provider


view this post on Zulip Vera Clemens (Apr 12 2024 at 07:49):

Hi! I am very happy to see the support for multiple PID providers in 6.2. Right now I am looking for a way to set the PID provider for a dataverse via API. Maybe I have looked in the wrong places, but I didn't find anything. Only for datasets/datafiles: https://guides.dataverse.org/en/latest/api/native-api.html#configure-the-pid-generator-a-dataset-uses-if-enabled

Is it possible for dataverses as well? Thank you!

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 08:23):

It looks like setting the PID provider for a collection is only possible via the UI for now. It seems we overlooked adding an Admin API endpoint like the one we have for a collection's storage driver.

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 08:24):

I'll ping folks on Slack, but I fear this will be released with 6.3. Usually a patch release is only done when we find very critical bugs.

view this post on Zulip Vera Clemens (Apr 12 2024 at 09:50):

Aw, I see. Thanks for checking.

view this post on Zulip Vera Clemens (Apr 12 2024 at 09:52):

Another question regarding multiple PID providers, is it possible at all to change the PID provider of a dataset after it has been created? Example: I moved a dataset from a Permalink dataverse to a DOI dataverse, but the dataset still has a Permalink. Is it possible for that dataset to lose the Permalink and receive a DOI, without re-creating it?

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 09:54):

I don't think that's possible.

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 09:54):

It kinda violates the principle that a PID is permanent...

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 09:55):

There might be an exception with the FAKE and PermaLink providers, but that has not been put into code

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 09:55):

As far as I know, that is

view this post on Zulip Vera Clemens (Apr 12 2024 at 09:59):

Hmm. Yes. In our use case, we would like to optionally mint DOIs for datasets, otherwise only a Permalink. So the default would be to receive a Permalink, with the option to receive a DOI instead or "upgrade" to a DOI later. (The Permalink would actually also continue to exist, but afaik Dataverse doesn't support multiple PIDs per dataset.)

view this post on Zulip Vera Clemens (Apr 12 2024 at 10:01):

If it were possible to change the PID provider of a dataset from FAKE or Permalink to DOI, that would be very nice. Is this something that might be put into code, you think?

view this post on Zulip Philip Durbin 🚀 (Apr 12 2024 at 10:02):

No API for this? :doh:

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 10:02):

In our use case, we would like to optionally mint DOIs for datasets, otherwise only a Permalink.

That is possible: everything that needs a real DOI goes into a DOI-PID-provider enabled collection and permalink stays the default provider for the instance.

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 10:03):

with the option to receive a DOI instead or "upgrade" to a DOI later

Feel free to open an issue for this feature request

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 10:04):

afaik Dataverse doesn't support multiple PIDs per dataset

We kind of do. We allow "alternative identifiers", but obviously these are unmanaged. On the other hand, a permalink does not really need to be managed, as the metadata is not going anywhere outside the instance.

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 10:06):

something that might be put into code, you think?

I don't see why not. It will require extending the DOI provider interface with sth like "allowMigration" so no one moves real stuff to other real stuff and trips over. Migrating datasets from FAKE to a real provider after some initial demo phase etc sounds very reasonable to me, so I don't see why it shouldn't happen.
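
A minimal sketch of what such an "allowMigration" guard could look like. This is not Dataverse code: `PidProvider`, `allow_migration`, and `can_migrate` are hypothetical names, and the assumption is simply that only providers that opt in (e.g. FAKE or PermaLink) let datasets leave them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PidProvider:
    """Hypothetical stand-in for a PID provider; not the Dataverse interface."""
    provider_id: str
    protocol: str           # e.g. "doi" or "perma"
    allow_migration: bool   # assumed flag: may datasets migrate away from here?


def can_migrate(source: PidProvider, target: PidProvider) -> bool:
    """Permit migration only away from providers that opted in."""
    if source.provider_id == target.provider_id:
        return False  # same provider, nothing to migrate
    return source.allow_migration


fake = PidProvider("fake1", "doi", allow_migration=True)
perma = PidProvider("perma1", "perma", allow_migration=True)
datacite = PidProvider("dc1", "doi", allow_migration=False)

print(can_migrate(fake, datacite))   # True: FAKE -> real DOI provider
print(can_migrate(datacite, perma))  # False: real DOIs stay where they are
```

Putting the flag on the source provider keeps the rule local: a real provider never has to know which other providers exist in order to refuse.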

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 10:07):

Obviously: if you want such a feature and you can contribute a PR it speeds up things. Please let us know in the issue description.

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 11:09):

@Vera Clemens Jim says this about migrating PIDs on the internal Slack:

As for migrating PIDs, there is no support for changing the PID of a dataset, aside from editing in the db, but that's probably not the only model (presumably people are referencing the existing PID). What is supported now is migrating a dataset into Dataverse with a PID that doesn't match the local protocol/authority/shoulder, and either 1) adding a new provider that matches the protocol/authority/shoulder to allow that dataset to be managed, or 2) asking DataCite to move that specific PID(s) to the existing account and adding those additional PIDs to the managed list for the existing provider (which would then work for the original protocol/authority/shoulder plus only the specific PIDs listed that don't match the pattern).
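
The matching rule described above (protocol/authority/shoulder pattern plus an explicit managed list) can be sketched roughly as follows. The class and field names are illustrative assumptions, not the actual Dataverse configuration keys, and the example PIDs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderConfig:
    """Hypothetical provider config; field names are illustrative only."""
    protocol: str
    authority: str
    shoulder: str
    managed_list: set[str] = field(default_factory=set)  # extra individual PIDs

    def can_manage(self, pid: str) -> bool:
        # A PID is managed if it matches the provider's pattern
        # or is listed explicitly as an additional managed PID.
        pattern = f"{self.protocol}:{self.authority}/{self.shoulder}"
        return pid.startswith(pattern) or pid in self.managed_list


provider = ProviderConfig(
    protocol="doi",
    authority="10.5072",
    shoulder="FK2/",
    managed_list={"doi:10.1234/OLD/XYZ"},  # made-up migrated PID
)

print(provider.can_manage("doi:10.5072/FK2/ABC"))    # True: matches the pattern
print(provider.can_manage("doi:10.1234/OLD/XYZ"))    # True: on the managed list
print(provider.can_manage("doi:10.1234/OLD/OTHER"))  # False: neither
```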

view this post on Zulip Oliver Bertuch (Apr 12 2024 at 11:10):

Again, please feel free to create more feature requests and discuss possible code contributions! :smiley_cat:

view this post on Zulip Philip Durbin 🚀 (Apr 12 2024 at 13:42):

@Vera Clemens I see Oliver already encouraged you to open an issue. I'll just echo that. We've been talking about it in Slack and we all agree we need this.

view this post on Zulip Vera Clemens (Apr 15 2024 at 14:51):

I have opened two issues for what we have discussed here: https://github.com/IQSS/dataverse/issues/10497 and https://github.com/IQSS/dataverse/issues/10496

view this post on Zulip Johannes D (Apr 16 2024 at 08:17):

@Philip Durbin @Oliver Bertuch I believe the current implementation of multiple PID providers has an inconvenient bug.
There is an API to move an unpublished dataset from one dataverse to another. Assume both dataverses have different PID providers configured. Which PID provider do we expect to mint the DOI? IMHO it should be the new parent's, i.e. that of the dataverse it was moved to. Hence, the PID properties must be altered during the move operation, right?

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:33):

Nope, the PID will stay consistent.

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:33):

Anything already minted will not be changed

view this post on Zulip Johannes D (Apr 16 2024 at 08:36):

An unpublished dataset does not have a minted DOI, does it?

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:36):

Oh well yeah that is a corner case

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:37):

From the data model perspective it has though...

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:37):

I'm looking through the code to learn what would happen

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:39):

So we are talking about this case: the dataset has not been published, and the create time for the PID is null. https://github.com/poikilotherm/dataverse/blob/222b326aab6be19be9d7bc8907504801bc362343/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/FinalizeDatasetPublicationCommand.java#L101-L101

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:39):

So we're now looking at this: https://github.com/poikilotherm/dataverse/blob/222b326aab6be19be9d7bc8907504801bc362343/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/AbstractDatasetCommand.java#L156-L156

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:40):

So the provider will be looked up from the PID that has been assigned when the dataset was created

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:40):

So it will still use the provider that was selected on creation
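
The corner case can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the logic in `AbstractDatasetCommand`: providers are keyed by protocol only, and all names and aliases are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    pid: str          # assigned when the dataset is created
    collection: str   # alias of the current owning collection


# Hypothetical registries: provider per PID protocol and per collection.
PROVIDER_BY_PROTOCOL = {"perma": "PermaLink provider", "doi": "DataCite provider"}
PROVIDER_BY_COLLECTION = {"perma-coll": "PermaLink provider", "doi-coll": "DataCite provider"}


def provider_from_pid(dataset: Dataset) -> str:
    """Resolve the provider from the PID itself (the behaviour described above)."""
    protocol = dataset.pid.split(":", 1)[0]
    return PROVIDER_BY_PROTOCOL[protocol]


def provider_from_collection(dataset: Dataset) -> str:
    """What one might expect instead: the current collection's provider."""
    return PROVIDER_BY_COLLECTION[dataset.collection]


draft = Dataset(pid="perma:ABC123", collection="perma-coll")
draft.collection = "doi-coll"  # move the unpublished dataset

print(provider_from_pid(draft))         # PermaLink provider
print(provider_from_collection(draft))  # DataCite provider
```

After the move the two lookups disagree, which is exactly the mismatch discussed in this thread.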

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:41):

So this definitely is a corner case. Let's see what's in MoveDatasetCommand about this

view this post on Zulip Johannes D (Apr 16 2024 at 08:42):

I agree with you that things once minted shall not be altered! However, tracing minting issues and managing the configuration of moved datasets will be a mess... but that's another story. This is possibly a special case of a previously configured provider no longer being available...

@Vera Clemens is currently hacking some unit/integration tests to see what's happening.

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:42):

Jupp, nothing in that command dealing with migrating the PID provider.

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:43):

@Vera Clemens go go go! I would love to see some tests for this! It would be great to see some BDD testing here. Describing the scenario and forming it into a test...

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:47):

It would be really great to see some Cucumber around!

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:47):

We have so many tests, but they would benefit a lot from story driven testing

view this post on Zulip Johannes D (Apr 16 2024 at 08:52):

Is it a bug or a feature? IMHO we need to think about some edge scenarios and at least document the expected behaviour. I have these cases in mind: a) a removed PID provider that was used to mint resources: what happens during publication of an update? b) minted PIDs moved to another dataverse: what happens if I publish an updated version when 1) the provider is still present and 2) the provider is no longer present (analogous to a)? c) non-minted datasets moved to another dataverse: which PID provider is used?

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:55):

You're right, this should be documented somewhere.

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:55):

I'm not sure if there is a design document somewhere, I might have lost the link

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:56):

@Philip Durbin @Gustavo Durand another reason to make those open, at least after the implementation is done...

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:56):

(Maybe even shape them into Architecture Decision Records?)

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:56):

The US is not awake yet, let's pester them once they are around

view this post on Zulip Johannes D (Apr 16 2024 at 08:58):

Maybe we can create a PID FAQ somewhere in the docs? Assuming lay/ non-technical users have the same questions and are overwhelmed with technical details in the design documents.

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 08:58):

Sounds like a great addition to the Admin Guide. Usually moving datasets is admin only

view this post on Zulip Philip Durbin 🚀 (Apr 16 2024 at 13:48):

With DataCite anyway, the DOI is "registered" (reserved) on the DataCite side when a draft dataset is created in Dataverse.

Researchers might put this DOI in their publication. So I think it would be potentially surprising to them if the DOI/PID changes.

Once the DOI is published, it changes state from "registered" to "findable". Please see also https://support.datacite.org/docs/doi-states

view this post on Zulip Oliver Bertuch (Apr 16 2024 at 14:14):

Further up in this discussion I said this might be a feature depending on the provider

view this post on Zulip Johannes D (Apr 17 2024 at 08:05):

So DRAFTS in Dataverse are registered with DataCite and thus are already present via the handle system and subsequently cannot be deleted anymore? I assumed they are just draft records within DataCite, and are deleted if the dataset is deleted in Dataverse. What happens with the DOI when a draft is deleted?

view this post on Zulip Oliver Bertuch (Apr 17 2024 at 08:14):

Registered != published. A DataCite DOI in registered mode is not discoverable and still deletable

view this post on Zulip Johannes D (Apr 17 2024 at 08:14):

https://github.com/IQSS/dataverse/blob/d9a79228ef6776aff155f8d8a03349eb4f06751d/src/main/java/edu/harvard/iq/dataverse/pidproviders/doi/datacite/DataCiteDOIProvider.java#L89

view this post on Zulip Oliver Bertuch (Apr 17 2024 at 08:14):

Honestly I'm not sure what happens :see_no_evil:

view this post on Zulip Johannes D (Apr 17 2024 at 08:15):

https://github.com/IQSS/dataverse/blob/d9a79228ef6776aff155f8d8a03349eb4f06751d/src/main/java/edu/harvard/iq/dataverse/pidproviders/doi/datacite/DataCiteDOIProvider.java#L140

view this post on Zulip Johannes D (Apr 17 2024 at 08:15):

registered != published but also registered != draft

view this post on Zulip Johannes D (Apr 17 2024 at 08:16):

From this code snippet I assume it's a DRAFT record and not a REGISTERED record. This means it can be deleted...

view this post on Zulip Johannes D (Apr 17 2024 at 08:17):

And if it's only a DRAFT, it shall not be used in any publication or published content. Thus, we can delete and create a new DOI according to another DOI-Provider configuration.

view this post on Zulip Oliver Bertuch (Apr 17 2024 at 08:18):

https://support.datacite.org/docs/doi-states

view this post on Zulip Oliver Bertuch (Apr 17 2024 at 08:18):

Reserving in Dataverse means draft state at DataCite

view this post on Zulip Oliver Bertuch (Apr 17 2024 at 08:18):

IIRC

view this post on Zulip Oliver Bertuch (Apr 17 2024 at 08:18):

Let me check

view this post on Zulip Oliver Bertuch (Apr 17 2024 at 08:19):

Yes. https://github.com/IQSS/dataverse/blob/d9a79228ef6776aff155f8d8a03349eb4f06751d/src/main/java/edu/harvard/iq/dataverse/pidproviders/doi/datacite/DataCiteDOIProvider.java#L89

view this post on Zulip Johannes D (Apr 17 2024 at 08:19):

That is a good thing, yet a bit confusing.

view this post on Zulip Oliver Bertuch (Apr 17 2024 at 08:20):

publicizeIdentifier() is used to switch from draft to findable.
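
To keep the three DataCite states apart, here is a small state model. It reflects my reading of the DataCite DOI states documentation linked above (only draft DOIs can be deleted, and a DOI never returns to draft); the Python names are illustrative and unrelated to the Dataverse code.

```python
from enum import Enum


class DoiState(Enum):
    DRAFT = "draft"
    REGISTERED = "registered"
    FINDABLE = "findable"


# Allowed transitions as understood from the DataCite docs (an assumption here).
ALLOWED = {
    DoiState.DRAFT: {DoiState.REGISTERED, DoiState.FINDABLE},
    DoiState.REGISTERED: {DoiState.FINDABLE},
    DoiState.FINDABLE: {DoiState.REGISTERED},
}


def can_delete(state: DoiState) -> bool:
    """Only draft DOIs can be deleted."""
    return state is DoiState.DRAFT


def transition(current: DoiState, target: DoiState) -> DoiState:
    """Return the new state, or raise if the change is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


state = DoiState.DRAFT                          # reserved when the draft dataset is created
print(can_delete(state))                        # True
state = transition(state, DoiState.FINDABLE)    # publication
print(can_delete(state))                        # False
```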

view this post on Zulip Johannes D (Apr 17 2024 at 08:23):

This means we could alter the moveDatasetCommand to delete the DRAFT DOI of one provider and create a new draft with another provider, so that the configured PID provider of the target dataverse is used.
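
A rough sketch of that idea, using an in-memory stand-in for the providers. Nothing here is the real `MoveDatasetCommand`: `InMemoryProvider`, `move_dataset`, and the dictionary layout are hypothetical, and for simplicity the old provider is looked up via the source collection.

```python
import itertools


class InMemoryProvider:
    """Hypothetical provider that only tracks draft PIDs in memory."""

    def __init__(self, prefix: str):
        self.prefix = prefix
        self.drafts: set[str] = set()
        self._ids = itertools.count(1)

    def reserve(self) -> str:
        pid = f"{self.prefix}{next(self._ids)}"
        self.drafts.add(pid)
        return pid

    def delete_draft(self, pid: str) -> None:
        self.drafts.remove(pid)


def move_dataset(dataset: dict, target: str, providers: dict) -> dict:
    """Move a dataset; re-mint the PID only if the dataset is unpublished."""
    old_provider = providers[dataset["collection"]]
    new_provider = providers[target]
    dataset["collection"] = target
    if dataset["published"] or old_provider is new_provider:
        return dataset  # minted PIDs stay untouched
    old_provider.delete_draft(dataset["pid"])
    dataset["previous_pid"] = dataset["pid"]  # kept so the user can be told
    dataset["pid"] = new_provider.reserve()
    return dataset


providers = {
    "perma-coll": InMemoryProvider("perma:"),
    "doi-coll": InMemoryProvider("doi:10.5072/FK2/"),
}
draft = {"pid": providers["perma-coll"].reserve(), "published": False, "collection": "perma-coll"}
move_dataset(draft, "doi-coll", providers)
print(draft["pid"], draft["previous_pid"])  # doi:10.5072/FK2/1 perma:1
```

Keeping `previous_pid` around makes it easy to show the old identifier in any warning or notification to the dataset author.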

view this post on Zulip Oliver Bertuch (Apr 17 2024 at 08:26):

Please create a feature request if not yet existing :smile_cat:

view this post on Zulip Johannes D (Apr 17 2024 at 08:40):

Already created I assume, but as said Vera is on the issue and will create some tests for it. This is not the only edge case we need to think of.

view this post on Zulip Oliver Bertuch (Apr 17 2024 at 08:41):

It might be good to talk about this at a future tech hour discussion @Gustavo Durand @Philip Durbin

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 10:27):

Sure, but again, please warn the user:

"Are you sure you want to change the PID of this dataset? If you put the dataset's original PID in your unpublished paper, please update it!"

view this post on Zulip Johannes D (Apr 17 2024 at 13:01):

Philip Durbin said:

Sure, but again, please warn the user:

"Are you sure you want to change the PID of this dataset? If you put the dataset's original PID in your unpublished paper, please update it!"

Is there a UI feature to move a dataset?

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 13:11):

Yes. It's superuser-only right now. Screenshots:

dashboard.png

move-dataset.png

Docs: https://guides.dataverse.org/en/6.2/admin/dashboard.html#move-data

view this post on Zulip Johannes D (Apr 17 2024 at 13:12):

Thanks, I've never noticed this button

view this post on Zulip Johannes D (Apr 17 2024 at 13:21):

Would this call-out/documentation be sufficient and describe the intended behaviour of the system? "This function can be used to transfer a dataset from one dataverse to another. The PID settings of the target dataverse become active when an unpublished (i.e. draft) dataset is moved to another dataverse. This invalidates the existing PID and creates a new one. If the PID has already been in use outside the system, it will have to be adjusted. The PID configuration is not adjusted for datasets that have already been published."

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 13:24):

It certainly helps! My concern is... is the dataset author in the loop? Do they know what the superuser is up to?

view this post on Zulip Johannes D (Apr 17 2024 at 13:27):

Fair enough, but I imagine this function is performed as part of a service request on behalf of a user. We could add a notification to inform the user about the performed change.

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 13:28):

Sure, that makes sense.

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 13:28):

At some point we should summarize in #10497

view this post on Zulip Johannes D (Apr 17 2024 at 13:29):

This could be something like: The dataset [] was moved from [] to []. Since it wasn't published, the planned PID changed to []. If the former PID has already been in use outside the system, it will have to be adjusted.

view this post on Zulip Johannes D (Apr 17 2024 at 13:30):

Philip Durbin said:

At some point we should summarize in #10497

This is another feature request.

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 13:30):

Sure, I would show the old PID as well.

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 13:30):

Oh, well, maybe we need a new issue?

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 13:31):

It's different from this one: set dataverse PID provider via API #10496

view this post on Zulip Johannes D (Apr 17 2024 at 13:34):

This thread talks about moved datasets: they should pick up the new PID configuration if not already published. #10497 is about upgrading published datasets from a suboptimal PID system to a "better" one (e.g. started with internal PermaLinks and later upgraded to nice DOI PIDs). #10496 is just about configuring a dataverse PID provider via API.

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 13:37):

Originally, this Zulip topic was about changing dataverse PID provider via API. :grinning:

Now it's about much more. :grinning:

We could start new topics and move messages around.

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 13:38):

Or maybe re-title this thread? We're trying to say "collection" instead of "dataverse" these days. Maybe "changing collection PID provider"? Broad enough?

view this post on Zulip Johannes D (Apr 17 2024 at 13:40):

That's nice, we also use "collection" in our project and renamed "dataverse".

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 13:44):

Great, I renamed this topic.

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 13:44):

But do we have the right number of issues? We still need one more, right? "moved datasets, they should pick up the new PID configuration if not already published"

view this post on Zulip Johannes D (Apr 17 2024 at 13:45):

Philip Durbin said:

But do we have the right number of issues? We still need one more, right? "moved datasets, they should pick up the new PID configuration if not already published"

I'm going to create one and most likely implement it. Hopefully this feature can be part of the next release.

view this post on Zulip Johannes D (Apr 17 2024 at 14:13):

Here is the issue https://github.com/IQSS/dataverse/issues/10501

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 14:13):

Looks great. Could you please add a link back to this topic on Zulip?

view this post on Zulip Johannes D (Apr 17 2024 at 14:14):

I wanted to but I cannot find the option to create a link in zulip.

view this post on Zulip Johannes D (Apr 17 2024 at 14:14):

Nevermind, got it now!

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 14:36):

I usually use the sidebar

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 14:37):

Ah, you linked to where you brought this up. Perfect.

view this post on Zulip Vera Clemens (Apr 17 2024 at 14:44):

I have been working on the tests to illustrate this issue. I played around with Cucumber @Oliver Bertuch thank you for the pointer. It has been a while since I used Cucumber for testing.

I think it makes the most sense to implement the tests as API tests, so that is what I started with. I have now run into the following issue: how can we best test the cases involving a DOI provider? I can configure the tested dataverse with a DOI provider, however we don't want the tests to cause actual requests to be sent to the DataCite APIs. We also don't want the tests to fail because we have configured a fake DataCite API URL that doesn't respond in the expected way. Is mocking the DataCite API endpoints the right way to go? How would I go about this?

Or do you envision some other way for the tests to be implemented?

I've pushed my current state here https://github.com/vera/dataverse/tree/moving-datasets-between-pid-providers it's very WIP, happy to receive feedback on it (run tests with mvn test -Dtest=RunCucumberTest)

view this post on Zulip Philip Durbin 🚀 (Apr 17 2024 at 14:50):

Hmm, mocking sounds fine to me.

view this post on Zulip Oliver Bertuch (Apr 18 2024 at 12:12):

We discussed this issue with DataCite some time ago on Slack as well

view this post on Zulip Oliver Bertuch (Apr 18 2024 at 12:13):

One idea to avoid real calls to the DataCite (Test) Fabrica was to use sth like WireMock
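
WireMock itself is a Java tool. As a language-neutral illustration of the same idea, the sketch below starts a local stub HTTP server with canned responses using only the Python standard library, so a test never touches the real service. The request path and the response body are made up for the example and are not meant to mirror the actual DataCite API.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubHandler(BaseHTTPRequestHandler):
    """Answers every PUT with a canned response and records the request."""
    received = []  # (method, path, body) tuples seen by the stub

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        StubHandler.received.append((self.command, self.path, body))
        payload = json.dumps({"stub": True}).encode()
        self.send_response(201)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_stub() -> HTTPServer:
    """Start the stub on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), StubHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_stub()
conn = HTTPConnection("127.0.0.1", server.server_port)
conn.request("PUT", "/dois/10.5072/FK2/EXAMPLE", body=b"{}")  # illustrative path
response = conn.getresponse()
status = response.status
reply = json.loads(response.read())
conn.close()
server.shutdown()
server.server_close()

print(status, reply, StubHandler.received[0][1])
```

In a real test the code under test would be pointed at the stub's base URL instead of the DataCite endpoint, and the recorded requests could then be asserted on.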


Last updated: Nov 01 2025 at 14:11 UTC