Stream: community

Topic: Migration of records w/o moving the data?


Bethany Seeger (Aug 28 2024 at 20:07):

Hello,

We have a few collections we'd like to migrate into Dataverse where the files are already in an S3 bucket and curated by another application. Ideally we wouldn't have to move the files, as they could, in theory, be accessed from where they already are, plus they already have handles pointing to them there (not that we couldn't change those pointers, I think). We'd like to just give Dataverse access to this other bucket, in addition to its other datastores.

I know it'd be straightforward, via the native API, to move the metadata into Dataverse. For the files, if we didn't want to migrate them, would we essentially be following the process for moving a large data set?
0) ensure that the second S3 bucket is configured to be accessed by Dataverse

1) have the metadata migration create placeholder files for the datasets

2) have a script that manipulates the Dataverse database to point to the right S3 bucket and location within it (this would be more than just replacing a placeholder, as the files wouldn't be where the placeholder was set; see the rough sketch below)

Would this work?
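
To make step 2 concrete, here's the kind of script I have in mind. I'm assuming the files end up as rows in the dvobject table with a storageidentifier column shaped roughly like "<storeId>://<bucket>:<key>", and that a store pointing at the second bucket already exists. The store id, bucket name, table/column names, and identifier format are all guesses on my part, not anything I've verified against the schema:

```python
# Rough sketch (not verified against the Dataverse schema): repoint existing
# DataFile records at objects that already live in the curated S3 bucket.
# Assumes a store with id "s3archive" has been configured for that bucket,
# and that "mapping" comes out of our metadata migration, linking each
# DataFile's database id to the S3 key the curation tool already uses.

import psycopg2  # hypothetical direct-to-Postgres approach

STORE_ID = "s3archive"          # made-up store id
BUCKET = "curated-collections"  # made-up bucket name

mapping = {
    # dvobject id -> existing S3 key (examples only)
    101: "collectionA/file1.csv",
    102: "collectionA/file2.csv",
}

conn = psycopg2.connect("dbname=dvndb user=dvnapp")
with conn, conn.cursor() as cur:
    for dvobject_id, s3_key in mapping.items():
        # Identifier format assumed to be "<storeId>://<bucket>:<key>"; check it
        # against a file uploaded normally on the target installation first.
        storage_identifier = f"{STORE_ID}://{BUCKET}:{s3_key}"
        cur.execute(
            "UPDATE dvobject SET storageidentifier = %s WHERE id = %s",
            (storage_identifier, dvobject_id),
        )
conn.close()
```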

There are a few unknowns for us --

  1. Can Dataverse link to multiple S3 buckets?
  2. Is manipulating the database the only way to make the connection from the datasets to the files in S3?

Note: As mentioned, we do have Handles on the files that point directly to the files in the buckets, and one thought we've had is to just use those as links to the data in the Dataverse record.

(I don't think OAI-PMH harvesting would be enough for this collection, because the datasets wouldn't technically be hosted elsewhere to point to. The goal here is to have the data in one place, with both the curation tool and the public access website (Dataverse) accessing it from there.)

I'm still very new to Dataverse, so there might be other options I'm missing. Would love to hear some perspectives on this.

Philip Durbin 🚀 (Aug 28 2024 at 20:29):

My first thought is that Jim Myers knows the most about this so you might want to cross post to https://groups.google.com/g/dataverse-community to get his attention. :grinning:

Philip Durbin 🚀 (Aug 28 2024 at 20:30):

Yes, Dataverse can link to multiple S3 buckets. Each "store" can be configured separately, to use the same bucket or a different one; it's up to you.
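
Roughly, each store gets its own id and its own set of dataverse.files.<id>.* JVM options, so two stores can point at two different buckets. Just to illustrate the pattern (the store ids, labels, and bucket names below are made up; see the File Storage section of the installation guide for the full set of options, credentials, endpoints, etc.):

```python
# Illustration only: the JVM options two S3 stores might need, one per bucket.
# Each printed line would be passed to `asadmin create-jvm-options`.
stores = {
    "s3main":    {"type": "s3", "label": "MainStorage",       "bucket-name": "dataverse-main"},
    "s3archive": {"type": "s3", "label": "CuratedCollection", "bucket-name": "curated-collections"},
}

for store_id, opts in stores.items():
    for key, value in opts.items():
        print(f"-Ddataverse.files.{store_id}.{key}={value}")
```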

Philip Durbin 🚀 (Aug 28 2024 at 20:31):

You might want to look into Trusted Remote Storage: https://guides.dataverse.org/en/6.3/installation/config.html#trusted-remote-storage
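
With a remote store, the files stay where they are and you register them by storage identifier instead of uploading bytes. Here's a rough sketch of what that might look like with the regular add-file API, assuming a remote store with id "trsa" has been configured per that section of the guide; the host, token, DOI, path, checksum, and the exact jsonData fields are placeholders, so double-check them against the API guide:

```python
# Sketch: register a file that stays in remote storage by POSTing its
# storageIdentifier to the dataset's add-file endpoint (no bytes uploaded).
import json
import requests

SERVER = "https://dataverse.example.edu"                  # placeholder
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"        # placeholder
PERSISTENT_ID = "doi:10.5072/FK2/EXAMPLE"                 # placeholder

json_data = {
    "storageIdentifier": "trsa://collectionA/file1.csv",  # <storeId>://<path under the store's base-url>
    "fileName": "file1.csv",
    "mimeType": "text/csv",
    "checksumType": "MD5",
    "md5Hash": "1b1aeb2f9a4f2e2a7f9f64c1a1f2b3c4",         # placeholder checksum
    "description": "File curated outside Dataverse",
}

resp = requests.post(
    f"{SERVER}/api/datasets/:persistentId/add",
    params={"persistentId": PERSISTENT_ID},
    headers={"X-Dataverse-key": API_TOKEN},
    files={"jsonData": (None, json.dumps(json_data))},    # multipart form field, like curl -F
)
resp.raise_for_status()
print(resp.json())
```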

Philip Durbin 🚀 (Aug 28 2024 at 20:32):

Do you want Dataverse to take over management of the files? Or do you want to manage them in S3 separately and simply let Dataverse know where the files live?

Bethany Seeger (Aug 28 2024 at 20:37):

Thanks, Phil. I'll cross-post there.
Good question about file management. For this collection of data, I think it might be the latter option, but I'll have to ask the collection manager. What does it mean to have Dataverse take over management of the files?

Philip Durbin 🚀 (Aug 28 2024 at 20:57):

Here's the cross post. Thanks again.

Philip Durbin 🚀 (Aug 28 2024 at 20:57):

Well, do you want Dataverse to be able to delete files?

Philip Durbin 🚀 (Aug 28 2024 at 20:58):

When a dataset is in draft and files are deleted, they are removed from the S3 bucket. Poof.

Philip Durbin 🚀 (Sep 09 2024 at 17:51):

A new reply from Jim: https://groups.google.com/g/dataverse-community/c/133bNBCtXYc/m/mZkJ-3W_AAAJ

