I'm getting excited about the Distribits conference next week! If you check the schedule, @Oliver Bertuch @Jan Range and I are giving a talk entitled "Distributed Metadata and Data with Dataverse".
I invited others in the community to register. We'll see. :grinning:
@Asbjørn Skødt as it turns out, my layover is in Copenhagen. :grinning:
I'm bringing three sizes of Dataverse stickers.
Hey Philip. :grinning: You are very welcome to contact me if you have enough time in Copenhagen to meet up.
Ha, thanks! It's a tight layover, less than an hour, but hopefully I'll be back, maybe even with the family.
Heh. Bumped into Yarik and Chris at the airport.
Where are the pictures? :rolling_on_the_floor_laughing:
I thought about it
They were telling me I should do what they're doing, spend time in the layover city. They're going to spend half a day exploring Amsterdam.
We also chatted about https://aws.amazon.com/opendata/open-data-sponsorship-program/
At least I think that's the right program. Free egress from S3.
Just landed in Düsseldorf
f2eea90f-cbe9-46a2-998d-3802f7d52e1f.jpg
Welcome to timezone CEST
I hope you got some sleep earlier :sleeping:
A little bit. Happy they had a room ready for me at Hotel Favor
I have a view of the Rhine Tower
bb3a77cf-292b-4e84-a732-d5842cd3325c.jpg
When are you guys arriving? Do you want to have dinner?
I will arrive later, at 10 PM, unfortunately. If you are still at the bar though, I am happy to join :grinning_face_with_smiling_eyes:
If I'm up, I'll try to find you :big_smile:
Awesome! How's the hotel? Booked it too :smile:
The hotel is good. Surprisingly, the mini bar is included. The gym is no longer across the street. It's an 8 minute walk. Looking forward to an easy breakfast
finally some pictures!
Philip Durbin said:
The hotel is good. Surprisingly, the mini bar is included. The gym is no longer across the street. It's an 8 minute walk. Looking forward to an easy breakfast
Hehe the minibar is good news :grinning_face_with_smiling_eyes:
Oh and free umbrellas to borrow. It's raining.
Yeah, German weather is not nice at the moment :frown:
Warmer than Boston!
Surely some Kölsch will help to cope with it
Oh okay, didn't expect it to be warmer
I saw lots of people standing at tables under umbrellas drinking. :grinning:
The stereotype holds true :grinning:
Philip Durbin said:
Warmer than Boston!
AFF6DD3F-3F11-43DB-8B2E-0F8DC25F6F7F.jpg
:upside_down:
Phew! Ok, I think I'm done futzing with my slides. Time for dinner. Will practice later.
Jan Range said:
Surely some Kölsch will help to cope with it
Don't order Kölsch in Düsseldorf!!!
They might throw stones or empty mugs
True, I forgot! Alt is the right one :grinning:
It's fun watching but Leverkusen is killing us
IMG_20240403_212103673.jpg
Don't worry, it was all Alt
Livestream of Distribits day 1: https://youtube.com/live/BwRy3z_hQ70?feature=share
55049c93-f9f7-408e-a3d0-b2a3069a5188.jpg
This tool sounded interesting: https://pypi.org/project/tinuous/
git-annex organelles
a85d6fc7-ec62-4f92-a3e4-1d72d1ca20ca.jpg
Is Dataverse a good candidate to become a DataLad / Git-Annex Proxy? Question popping up in my head from the git-annex talk by Joey. (Or should Dataverse be able to talk to such proxies to ingest data / access data on tapes / ...?)
I think the latter case would be very useful
Well, that proxy thing is just an idea for now, but sure, let's pick Joey's brain while we're here. :grinning:
413a45ec-1077-4b6f-846f-3045d4364eae.jpg
Dataverse hosting a git repository using datalad-annex. Fancy! (Though slow - but probably good enough for a publication that keeps history / provenance)
Yep! Here's a pic:
47355de7-a58a-4f14-b5d2-0b1a5769226a.jpg
I will admit I was having trouble with it yesterday: https://github.com/datalad/datalad-dataverse/issues/302
03cc47fe-8fa3-40ee-9538-a115d97c1842.jpg
Schema for a generic data distribution record - https://concepts.datalad.org/s/distribution/unreleased/
73a2e853-a0ec-4914-81b6-8168a2730d96.jpg
196a9708-08a8-4f49-b485-89e94b13875d.jpg
Sounds like Dataverse has some work cut out here: we might need to enable better per-file metadata...
(Which also has been a request independent from DataLad since at least 2020, where I first heard it at Tromso)
Yeah. We have auxiliary files, at least!
Mmm, pull requests for datasets. :yum:
What would that look like, I wonder. I'll have to ask.
Huh, I hadn't heard of Fast Data Transfer - http://monalisa.cern.ch/FDT/ - but the speaker just said rsync is faster. :thinking:
a5d0d056-e2c6-4b58-9ceb-a0d76f2d5f65.jpg
Wondering if the idea behind https://neurobagel.org/ might be implemented with Dataverse repositories as well...
We do have a limited set of rather harmonized metadata, but obviously it gets much harder for file metadata
As said above, maybe one of the next big goals for Dataverse?
Cool, they seem to be doing joins across variables?
IMG_20240404_141316361.jpg
@Oliver Bertuch let's compare notes and maybe corner the speaker :big_smile:
@Jan Range pyDataverse module to add a https://docs.pyfilesystem.org for Dataverse?
(As seen in the OneData presentation)
Isn't this similar to what Stefano (Compute on Data) developed?
I don't know :shrug:
I think so, because he was also basically fetching data files as if these are present on the filesystem.
Would be ace having something like this in pyDataverse - kind of a DvDownloader
I would also :heart: a Nextcloud external storage plugin for Dataverse, so we could have this Windows Sync+Share feeling provided by the Nextcloud client without developing our own client...
(As they seem to do for OneData)
We have a lot of researchers that are very much accustomed to using SMB shares - the Sync+Share experience is probably the closest you can get to that.
Yes, would be nice
https://github.com/libis/rdm-integration/issues/4
Could OneData's (beta) S3 storage driver be used with Dataverse? As a way to have distributed data, maybe?
Or is OneData simply a competitor from the Dataverse perspective?
And I also opened https://github.com/gdcc/pyDataverse/issues/178
And I just opened an issue for our hackathon idea @Jan Range @Philip Durbin https://github.com/distribits/distribits-2024-hackathon/issues/3
Ha! Great!
And I do hope to hack on trying to add OneData as an S3 provider. The speaker seems game
Or as a Globus-like thing?
He said that files stored have a unique identifier, so very similar to Globus
Group pic
1000007932_20240404154448~2.jpg
Yeah, the main thing for me is that he isn't offended if we try to use Onedata "just" as a storage backend. :sweat_smile:
If I have the conversion right, your talk is at 9:35 EST guys?
Ok I think I am right lol
image.png
Our talk is tomorrow
Yeah I was just checking the time :rolling_on_the_floor_laughing: sorry about the confusion
https://time.is/compare/1535_5_Apr_2024_in_D%C3%BCsseldorf/Boston
Fancy!
bc335f8f-9b7b-4bba-8477-da1ebce8d5d4.jpg
You should see the new cluster :sweat_smile:
Having drinks at the same place we had lunch if you'd like to join
Joining soon - Currently in slide flow :grinning_face_with_smiling_eyes:
Julia and I would love some company... The main table was already full :sweat_smile:
On my way :raised_hands:
The xz vulnerability just mentioned: https://arstechnica.com/security/2024/03/backdoor-found-in-widely-used-linux-utility-breaks-encrypted-ssh-connections/
Mention of https://tom.preston-werner.com/2009/05/19/the-git-parable.html
2efc5eb7-cdb9-4e76-aee0-56ff133d6fbf.jpg
4d06f2d2-1633-4961-b068-42abb70491ed.jpg
I am finishing watching your talk! Congratulations @Philip Durbin @Oliver Bertuch && @Jan Range
Yeah? It was ok? :sweat_smile:
The inspiration for the name DataLad
17123283444132688976197007845241.jpg
Philip Durbin said:
Yeah? It was ok? :sweat_smile:
10/10 would watch again :smile:
Philip Durbin said:
The inspiration for the name DataLad
17123283444132688976197007845241.jpg
How does it always come down to the Simpsons :rolling_on_the_floor_laughing:
Please like and subscribe! :crazy:
Juan Pablo Tosca Villanueva said:
Philip Durbin said:
Yeah? It was ok? :sweat_smile:
10/10 would watch again :smile:
Thx JP! :blush:
@Oliver Bertuch @Philip Durbin happy hacking! :raised_hands: Which ideas did you go for?
We're going for the datalad-dataverse integration and talking to Lukasz about OneData
I'm hacking on the action in parallel
Sounds great! Looking forward
I was looking into PyFilesystem on the train yesterday and will implement a proof of concept. Looks straightforward!
Awesome!!!
@Jan Range are you suffering from FOMO? :crazy:
Yeeees :sob:
I'm half done putting our talk on DataverseTV. Here's the link to the timestamp at least: https://www.youtube.com/watch?v=L1MKaUgg1xs&t=24405s
@Jan Range are you good at combining PDFs? If so, maybe you could combine the two slide decks together? If not, please send along a PDF of yours and I'll try to figure it out. I'd like to add it to https://dataverse.org/presentations and then link to it from DataverseTV.
Yes, will do :-)
There you go: Dataverse_for_distributed_data_Distribits_DurbinBertuchRange.pdf
Hacking on the Dataverse Action: @Jan Range instead of mangling a "s3_enabled" option, maybe it would be easier to have a flavor thing already? The idea would be to make people go for a directory with it that has a compose file and a bootstrap.sh...
Alright, that sounds cool! If I understood correctly, we would provide an interface for other users to ingest their own compose flavor?
Absolutely!
Really nice! Shall we host a couple of examples people could just grab and use plus get inspired?
I will use this for the S3 flavor for now... :wink:
That's at least 1 example
3fe468ee-c9e8-4ffa-956e-b8e103184ac1.png
b350cc0e-b6b4-4d75-b633-3c8302754d0d.jpg
Beautiful day!
Is this what you thought about? Listing files and downloading on demand works well!
@Philip Durbin I see you've got the hang of Alt! :grinning_face_with_smiling_eyes:
You can also list particular subdirs
image.png
In terms of editing and uploading data, I was thinking of a context manager:
I would use the DVUploader in the backend to perform the upload and handle S3/native routes
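A minimal sketch of what such a context manager could look like, using only the standard library. The name `staged_upload` and the `upload(local_path, remote_path)` callable are hypothetical stand-ins; the real DVUploader API is not used here, and this is not the actual proof of concept.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def staged_upload(remote_path, upload, mode="w"):
    """Yield a local temp file; on a clean exit, hand it to `upload`.

    `upload(local_path, remote_path)` is a hypothetical callable standing in
    for whatever performs the transfer (for example a wrapper around an
    uploader that picks the S3 or native route).
    """
    fd, local_path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, mode) as handle:
            yield handle
        # Only reached if the body of the `with` block raised no exception,
        # and only after the temp file has been closed and flushed.
        upload(local_path, remote_path)
    finally:
        # Always clean up the staging file, uploaded or not.
        os.remove(local_path)
```

Usage would be along the lines of `with staged_upload("data/results.csv", my_upload) as fh: fh.write(...)`. Nothing is sent if the block raises, so a half-written file never reaches the dataset.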
That looks pretty promising already!
Seems it was easy to get started!
IMHO all the more reason to make S3 direct up/down handling part of pyDataverse
True, completely agree now!
Any other things you think would be nice to have there?
How about making it possible to retrieve a single file from a ZIP on Dataverse?
Dataverse knows about ranged requests
So it might be possible to use a pyfilesystem for ZIPs here
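A rough sketch of the ranged-request idea, assuming the server honors byte ranges: Python's `zipfile` only needs a seekable file object, so a small wrapper that fetches byte ranges on demand lets it read the central directory and one member without downloading the whole archive. `RangeReader`, `read_member`, and the `fetch(start, end)` callable are hypothetical names; in practice `fetch` could issue an HTTP GET with a `Range: bytes=start-(end-1)` header.

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Seekable, read-only file object backed by a byte-range fetcher.

    `fetch(start, end)` is a hypothetical callable returning the bytes in
    [start, end); `size` is the total size of the remote file.
    """

    def __init__(self, fetch, size):
        super().__init__()
        self._fetch = fetch
        self._size = size
        self._pos = 0
        self.bytes_fetched = 0  # running total, to show how little is read

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._pos + offset
        elif whence == io.SEEK_END:
            target = self._size + offset
        else:
            raise ValueError(f"unsupported whence: {whence}")
        self._pos = max(0, target)
        return self._pos

    def read(self, n=-1):
        if n is None or n < 0:
            n = self._size - self._pos
        end = min(self._pos + n, self._size)
        if end <= self._pos:
            return b""
        data = self._fetch(self._pos, end)
        self._pos += len(data)
        self.bytes_fetched += len(data)
        return data


def read_member(fetch, size, member):
    """Return (member bytes, total bytes fetched) for one file in the ZIP."""
    reader = RangeReader(fetch, size)
    with zipfile.ZipFile(reader) as zf:
        return zf.read(member), reader.bytes_fetched
```

The same reader also answers the "what's inside" question: `zipfile.ZipFile(reader).namelist()` only touches the end-of-archive records, which is what a `listdir` over a ZIP would need.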
Sounds good! Is there a way to retrieve info about the contents of a ZIP beforehand? Could add it to the listdir method
@Oliver Bertuch https://guides.dataverse.org/en/6.2/developers/big-data-support.html#features-that-are-disabled-if-s3-direct-upload-is-enabled
I'm not sure how the ZIP file previewer does it, but it seems possible @Jan Range
https://github.com/nedbat/scriv
Ha! nedbat is a friend of mine!
I just listened again to the question at https://www.youtube.com/watch?v=L1MKaUgg1xs&t=27125s by Stephan Heunis about using the API for requesting and granting access to files. It looks like the dataset he's talking about is at https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/R1TNL8, which is running Dataverse 6.0, so the APIs he needs should be available:
Now, like Jan pointed out in his answer, these endpoints may not be available via pyDataverse.
Actually, I see both request_access and grant_file_access at https://pydataverse.readthedocs.io/en/latest/reference.html
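For anyone calling the API directly rather than through pyDataverse, here is a hedged sketch that only builds the requests and does not send them. The endpoint paths (`/api/access/datafile/{id}/requestAccess` and `/api/access/datafile/{id}/grantAccess/{identifier}`) and the `X-Dataverse-key` header are written from memory of the Data Access API guide, so treat them as assumptions to verify against the guide for your Dataverse version. The helper name `access_request` is made up for illustration.

```python
import urllib.request


def access_request(base_url, api_token, file_id, action, identifier=None):
    """Build (but do not send) a PUT request for a file-access action.

    `action` is assumed to be "requestAccess" or "grantAccess"; for
    "grantAccess", `identifier` names the user or group receiving access.
    """
    path = f"/api/access/datafile/{file_id}/{action}"
    if identifier is not None:
        path += f"/{identifier}"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method="PUT",
        headers={"X-Dataverse-key": api_token},
    )
```

Passing the result to `urllib.request.urlopen` would perform the call. The requester would use their own token for `requestAccess`, and someone with permission to manage the file would use theirs for `grantAccess`.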
Let me see if I can get in touch with Stephan.
He should be on Matrix :smile_cat:
Ja. I just DM'ed him.
I just created this issue: Add DataLad to list of integrations #10468
I was able to see the eclipse from the plane! :tada:
Nice, here's just our talk: https://www.youtube.com/watch?v=jSzwAIqjq-o
I'll update DataverseTV
We should also post our slides but we should combine into one PDF, I think
Oh, Jan already posted it above. I sent it off to be added to https://dataverse.org/presentations
I just posted a mini trip report at https://groups.google.com/g/dataverse-community/c/huhI8TyE8a0/m/7xldNlW8AQAJ
Just posted: https://dataverse.org/presentations/distributed-metadata-and-data-dataverse
@Oliver Bertuch @Jan Range and others, this just came in:
"@room We have just reserved the venue for a distribits 2025. It will take place in Düsseldorf again. This time Oct 23-25. More information and an official announcement will follow soon. We are hoping to see you again!"
Is anyone thinking about going? Should we start a new topic for Distribits 2025?
Sounds great! Can't tell if I will be available, but if there's time I am happy to join again :smile:
Personally, I'd be happy to visit a new city but whatever, Düsseldorf was great.
Last updated: Nov 01 2025 at 14:11 UTC