Stream: community

Topic: Migration of records w/o moving the data?


Bethany Seeger (Aug 28 2024 at 20:07):

Hello,

We have a few collections we'd like to migrate into Dataverse where the files are already in an S3 bucket and curated by another application. Ideally we wouldn't have to move the files, as they could, in theory, be accessed from where they already are, plus they already have handles pointing to them there (not that we couldn't change those pointers, I think). We'd like to just give Dataverse access to this other bucket, in addition to its other datastores.

I know it'd be straightforward, via the native API, to move the metadata into Dataverse. For the files, if we didn't want to migrate them, would we essentially be following the process for moving a large data set?
0) ensure that the second S3 bucket is configured to be accessed by Dataverse

1) have the metadata migration create placeholder files for the datasets

2) have a script that manipulates the Dataverse database to point to the right S3 bucket and location within it (this would be more than just replacing a placeholder, as the files wouldn't be where the placeholder was set; see the rough sketch below)

Would this work?
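
To make step 2 concrete, here's the kind of script I have in mind. I'm assuming the files end up as rows in the dvobject table with a storageidentifier column shaped roughly like "<storeId>://<bucket>:<key>", and that a store pointing at the second bucket already exists. The store id, bucket name, table/column names, and identifier format are all guesses on my part, not anything I've verified against the schema:

```python
# Rough sketch (not verified against the Dataverse schema): repoint existing
# DataFile records at objects that already live in the curated S3 bucket.
# Assumes a store with id "s3archive" has been configured for that bucket,
# and that "mapping" comes out of our metadata migration, linking each
# DataFile's database id to the S3 key the curation tool already uses.

import psycopg2  # hypothetical direct-to-Postgres approach

STORE_ID = "s3archive"          # made-up store id
BUCKET = "curated-collections"  # made-up bucket name

mapping = {
    # dvobject id -> existing S3 key (examples only)
    101: "collectionA/file1.csv",
    102: "collectionA/file2.csv",
}

conn = psycopg2.connect("dbname=dvndb user=dvnapp")
with conn, conn.cursor() as cur:
    for dvobject_id, s3_key in mapping.items():
        # Identifier format assumed to be "<storeId>://<bucket>:<key>"; check it
        # against a file uploaded normally on the target installation first.
        storage_identifier = f"{STORE_ID}://{BUCKET}:{s3_key}"
        cur.execute(
            "UPDATE dvobject SET storageidentifier = %s WHERE id = %s",
            (storage_identifier, dvobject_id),
        )
conn.close()
```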

There are a few unknowns for us --

  1. Can Dataverse link to multiple S3 buckets?
  2. Is manipulating the database the only way to make the connection from the datasets to the files in S3?

Note: As mentioned, we do have Handles on the files that point directly to the files in the buckets, and one thought we've had is to just use those as links to the data in the Dataverse record.

(I don't think OAI-PMH harvesting would be enough for this collection, because the datasets wouldn't technically be hosted elsewhere to point to. The goal here is to have the data in one place, with both the curation tool and the public access website (Dataverse) accessing it from there.)

I'm still very new to Dataverse, so there might be other options I'm missing. Would love to hear some perspectives on this.

Philip Durbin 🚀 (Aug 28 2024 at 20:29):

My first thought is that Jim Myers knows the most about this so you might want to cross post to https://groups.google.com/g/dataverse-community to get his attention. :grinning:

Philip Durbin 🚀 (Aug 28 2024 at 20:30):

Yes, Dataverse can link to multiple S3 buckets. Each "store" can be configured separately, to use the same bucket or a different one; it's up to you.
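
Roughly, each store gets its own id and its own set of dataverse.files.<id>.* JVM options, so two stores can point at two different buckets. Just to illustrate the pattern (the store ids, labels, and bucket names below are made up; see the File Storage section of the installation guide for the full set of options, credentials, endpoints, etc.):

```python
# Illustration only: the JVM options two S3 stores might need, one per bucket.
# Each printed line would be passed to `asadmin create-jvm-options`.
stores = {
    "s3main":    {"type": "s3", "label": "MainStorage",       "bucket-name": "dataverse-main"},
    "s3archive": {"type": "s3", "label": "CuratedCollection", "bucket-name": "curated-collections"},
}

for store_id, opts in stores.items():
    for key, value in opts.items():
        print(f"-Ddataverse.files.{store_id}.{key}={value}")
```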

Philip Durbin 🚀 (Aug 28 2024 at 20:31):

You might want to look into Trusted Remote Storage: https://guides.dataverse.org/en/6.3/installation/config.html#trusted-remote-storage
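
With a remote store, the files stay where they are and you register them by storage identifier instead of uploading bytes. Here's a rough sketch of what that might look like with the regular add-file API, assuming a remote store with id "trsa" has been configured per that section of the guide; the host, token, DOI, path, checksum, and the exact jsonData fields are placeholders, so double-check them against the API guide:

```python
# Sketch: register a file that stays in remote storage by POSTing its
# storageIdentifier to the dataset's add-file endpoint (no bytes uploaded).
import json
import requests

SERVER = "https://dataverse.example.edu"                  # placeholder
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"        # placeholder
PERSISTENT_ID = "doi:10.5072/FK2/EXAMPLE"                 # placeholder

json_data = {
    "storageIdentifier": "trsa://collectionA/file1.csv",  # <storeId>://<path under the store's base-url>
    "fileName": "file1.csv",
    "mimeType": "text/csv",
    "checksumType": "MD5",
    "md5Hash": "1b1aeb2f9a4f2e2a7f9f64c1a1f2b3c4",         # placeholder checksum
    "description": "File curated outside Dataverse",
}

resp = requests.post(
    f"{SERVER}/api/datasets/:persistentId/add",
    params={"persistentId": PERSISTENT_ID},
    headers={"X-Dataverse-key": API_TOKEN},
    files={"jsonData": (None, json.dumps(json_data))},    # multipart form field, like curl -F
)
resp.raise_for_status()
print(resp.json())
```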

Philip Durbin 🚀 (Aug 28 2024 at 20:32):

Do you want Dataverse to take over management of the files? Or do you want to manage them in S3 separately and simply let Dataverse know where the files live?

Bethany Seeger (Aug 28 2024 at 20:37):

Thanks, Phil. I'll cross-post there.
Good question about file management. For this collection of data, I think it might be the latter option, but I'll have to ask the collection manager. What does it mean to have Dataverse take over management of the files?

Philip Durbin 🚀 (Aug 28 2024 at 20:57):

Here's the cross post. Thanks again.

Philip Durbin 🚀 (Aug 28 2024 at 20:57):

Well, do you want Dataverse to be able to delete files?

Philip Durbin 🚀 (Aug 28 2024 at 20:58):

When a dataset is in draft and files are deleted, they are removed from the S3 bucket. Poof.

Philip Durbin 🚀 (Sep 09 2024 at 17:51):

A new reply from Jim: https://groups.google.com/g/dataverse-community/c/133bNBCtXYc/m/mZkJ-3W_AAAJ

