I have been using Julian's list of postgres queries:
https://docs.google.com/document/d/1-Y_iUduSxdDNeK1yiGUxe7t-Md7Fy965jp4o4m1XEoE/edit?tab=t.0
@Julian Gautier do you have a newer, updated set?
I ask because this dataset in our repo, https://doi.org/10.18130/V3/G1NW7F,
with 76 files (74 of them > 5 GB) gets a result of only 16 GB from the "Sum of file sizes in a specific dataset" query, which is not correct.
Here I'm showing only the affected dataset:
dataset_id | byte_size | file_count
------------+---------------+------------
58650 | 16440586164 | 76
So I don't know how accurate this query is. Spot-checking some of my other datasets, it looks pretty accurate; I'm not sure why this one is wonky.
There is an API endpoint for total storage, which shows 431 GB:
curl -H "X-Dataverse-key:$API_TOKEN" "https://dataverse.lib.virginia.edu/api/datasets/58650/storagesize"
{"status":"OK","data":{"message":"Total size of the files stored in this dataset: 431,676,682,164 bytes"}}
but I like that the query combines total file size with total number of files (if it is reliable).
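For context, the "Sum of file sizes in a specific dataset" query being used here presumably has roughly this shape (a sketch only: the aggregate is taken from the fragment Julian quotes later in this thread, and the table and column names follow the diagnostic query below, not the exact text of the Google Doc):
select
  dataset.id as dataset_id,
  sum(distinct coalesce(datafile.filesize, 0)) as byte_size,
  count(distinct filemetadata.datafile_id) as file_count
from filemetadata
join datafile on datafile.id = filemetadata.datafile_id
join datasetversion on datasetversion.id = filemetadata.datasetversion_id
join dataset on dataset.id = datasetversion.dataset_id
where
  dataset.harvestingclient_id is null
  and dataset.id = 58650
group by dataset.id;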
Hey @Sherry Lake. I don't have a different query, and I'm not sure why that one returns the wrong size for your dataset while returning the right number of files. It's a relatively simple dataset, too: only one major version, no ingested tabular files, none of the usual things that I think make these queries more complicated.
Would you be able to adjust the database query, like the select statement and the "group by" part, so you can see what size it reports for each file? Then you could compare those sizes to what's in the Dataverse JSON metadata export to see if there are particular files whose sizes aren't being returned correctly for some reason.
@Julian Gautier Here you go..... I'm taking off for the day, so no need to hurry to get back to me today.
Would you be able to adjust the database query, like the select statement and the "group by" part, so you can see what size it reports for each file?
[Note from Sherry: I added "DISTINCT" because there are two versions, 1.0 and 1.1; both have the same files, so those rows were just being repeated.]
select
DISTINCT dataset.id as dataset_id,
datafile.filesize,
filemetadata.datafile_id
from filemetadata
join datafile on datafile.id = filemetadata.datafile_id
full outer join datatable on datatable.datafile_id = filemetadata.datafile_id
full outer join datasetversion on datasetversion.id = filemetadata.datasetversion_id
join dataset on dataset.id = datasetversion.dataset_id
where
dataset.harvestingclient_id is null and dataset.id = 58650;
dataset_id | filesize | datafile_id
------------+------------+-------------
58650 | 5767168000 | 58710
58650 | 5767168000 | 58654
58650 | 5767168000 | 58715
58650 | 5767168000 | 58684
58650 | 5767168000 | 58656
58650 | 5767168000 | 58720
58650 | 5767168000 | 58692
58650 | 5767168000 | 58723
58650 | 5767168000 | 58655
58650 | 5767168000 | 58680
58650 | 5767168000 | 58665
58650 | 5767168000 | 58724
58650 | 5767168000 | 58683
58650 | 5767168000 | 58676
58650 | 5767168000 | 58722
58650 | 5767168000 | 58717
58650 | 5767168000 | 58660
58650 | 5767168000 | 58659
58650 | 5767168000 | 58704
58650 | 5767168000 | 58706
58650 | 5767168000 | 58661
58650 | 5767168000 | 58718
58650 | 5767168000 | 58653
58650 | 5767168000 | 58694
58650 | 5767168000 | 58669
58650 | 5767168000 | 58664
58650 | 5767168000 | 58716
58650 | 5767168000 | 58695
58650 | 5767168000 | 58675
58650 | 5767168000 | 58693
58650 | 5767168000 | 58700
58650 | 5767168000 | 58658
58650 | 5767168000 | 58701
58650 | 5767168000 | 58699
58650 | 2923070292 | 58719
58650 | 5767168000 | 58703
58650 | 5767168000 | 58671
58650 | 5767168000 | 58697
58650 | 5767168000 | 58707
58650 | 5767168000 | 58698
58650 | 5767168000 | 58681
58650 | 5767168000 | 58663
58650 | 5767168000 | 58651
58650 | 5767168000 | 58673
58650 | 5767168000 | 58721
58650 | 5767168000 | 58662
58650 | 5767168000 | 58685
58650 | 5767168000 | 58689
58650 | 5767168000 | 58674
58650 | 5767168000 | 58667
58650 | 5767168000 | 58702
58650 | 5767168000 | 58682
58650 | 5767168000 | 58668
58650 | 5767168000 | 58714
58650 | 5767168000 | 58726
58650 | 5767168000 | 58690
58650 | 5767168000 | 58711
58650 | 5767168000 | 58712
58650 | 5767168000 | 58691
58650 | 5699988502 | 58679
58650 | 5767168000 | 58696
58650 | 5767168000 | 58709
58650 | 5767168000 | 58713
58650 | 5767168000 | 58666
58650 | 5767168000 | 58686
58650 | 5767168000 | 58725
58650 | 5767168000 | 58705
58650 | 5767168000 | 58677
58650 | 5767168000 | 58687
58650 | 5767168000 | 58678
58650 | 2050359370 | 58672
58650 | 5767168000 | 58652
58650 | 5767168000 | 58708
58650 | 5767168000 | 58688
58650 | 5767168000 | 58670
58650 | 5767168000 | 58657
(76 rows)
Then you could compare those sizes to what's in the Dataverse JSON metadata export
------ File sizes in the export look OK; they match what's showing in the UI.
Here's a snippet from the file metadata export:
curl https://dataverse.lib.virginia.edu/api/datasets/58650 | jq .
"files": [
{
"label": "CAESAR-Dataset-L.z01",
"restricted": false,
"version": 1,
"datasetVersionId": 1109,
"dataFile": {
"id": 58713,
"persistentId": "",
"filename": "CAESAR-Dataset-L.z01",
"contentType": "application/octet-stream",
"friendlyType": "Unknown",
"filesize": 5767168000,
"storageIdentifier": "s3://dataverse-storage-production:183b84ee6a2-4bec26798f2a",
"rootDataFileId": -1,
"md5": "ae8f61ab138ff61a54a2ff6708f738ed",
"checksum": {
"type": "MD5",
"value": "ae8f61ab138ff61a54a2ff6708f738ed"
},
"tabularData": false,
"creationDate": "2022-10-08",
"publicationDate": "2022-10-09",
"fileAccessRequest": true
}
},
{
"label": "CAESAR-Dataset-L.z02",
"restricted": false,
"version": 1,
"datasetVersionId": 1109,
"dataFile": {
"id": 58714,
"persistentId": "",
"filename": "CAESAR-Dataset-L.z02",
"contentType": "application/octet-stream",
"friendlyType": "Unknown",
"filesize": 5767168000,
"storageIdentifier": "s3://dataverse-storage-production:183b8514ef2-f3ded382cbbb",
"rootDataFileId": -1,
"md5": "5b8e6403ffe0788fa43143462addccc4",
"checksum": {
"type": "MD5",
"value": "5b8e6403ffe0788fa43143462addccc4"
},
"tabularData": false,
"creationDate": "2022-10-08",
"publicationDate": "2022-10-09",
"fileAccessRequest": true
}
},
{
"label": "CAESAR-Dataset-L.z03",
"restricted": false,
"version": 1,
"datasetVersionId": 1109,
"dataFile": {
"id": 58681,
"persistentId": "",
"filename": "CAESAR-Dataset-L.z03",
"contentType": "application/octet-stream",
"friendlyType": "Unknown",
"filesize": 5767168000,
"storageIdentifier": "s3://dataverse-storage-production:183b853bc91-2b8b3c5374e0",
"rootDataFileId": -1,
"md5": "4781b0004eba02b871f68e6b9b30cf39",
"checksum": {
"type": "MD5",
"value": "4781b0004eba02b871f68e6b9b30cf39"
},
"tabularData": false,
"creationDate": "2022-10-08",
"publicationDate": "2022-10-09",
"fileAccessRequest": true
}
},
Thanks. Yeah I see that the file sizes for each file match.
When I take the sum of all the byte sizes from the results of your database query, I get 431,676,682,164 bytes. Exactly what your API call reports.
I think the problem is with the part of the database query that gets the sum of non-tabular files. It's "sum(distinct coalesce(datafile.filesize, 0))". That sums the distinct file-size values, and because most of the files in this dataset have the same size, it's getting the sum of just the four distinct file sizes in this dataset, which is exactly the number you reported yesterday: 16,440,586,164.
And maybe the database query works with the other datasets you've spot-checked because those datasets have files that all have different sizes?
So we need another way to get the sum of distinct files :thinking:
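One way the aggregate could be fixed (a sketch only, on the same tables as the diagnostic query above, and covering only the non-tabular part Julian mentions): de-duplicate per file first, then sum, so that files which happen to share a byte size are no longer collapsed.
select
  per_file.dataset_id,
  sum(per_file.filesize) as byte_size,
  count(*) as file_count
from (
  -- one row per file, even when two versions list the same file
  -- or two different files have identical sizes
  select distinct
    dataset.id as dataset_id,
    datafile.id as datafile_id,
    coalesce(datafile.filesize, 0) as filesize
  from filemetadata
  join datafile on datafile.id = filemetadata.datafile_id
  join datasetversion on datasetversion.id = filemetadata.datasetversion_id
  join dataset on dataset.id = datasetversion.dataset_id
  where
    dataset.harvestingclient_id is null
    and dataset.id = 58650
) as per_file
group by per_file.dataset_id;
For this dataset that should give 431,676,682,164 bytes and 76 files, matching the storagesize endpoint.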
Wait.... @Julian Gautier didn't you have a new SQL line for me to try?
I was just reading it, went to open a window.... and you removed it?
Didn't want to miss anything.
And @Julian Gautier great detective work!!!!
Lol, as I was writing it I realized that the way I thought to edit the query wouldn't work, so I removed it.
Was hoping you wouldn't see that. Thought you had today off. Or was that yesterday?
Took off early yesterday. All :eyes: today. :wink:
We/I should probably remove that query until we can figure out how to improve it. Or leave a note saying it doesn't really work :grimacing:
Wonder if there is a way to mimic what the API is doing? I'm looking at that code.... but I'm no Java programmer.
Is the reason you prefer the queries to the API endpoint that you also need to see the number of files?
I think a comment around the query would be helpful. It does work most of the time. I checked, and out of 432 datasets, only 59 had differing file sizes (SQL query vs. the API).
And, yes, I need to see the number of files along with the file sizes.
We are manually bagging our datasets to send to our preservation system (APTrust). There is a great command-line script to bag all "un-bagged" datasets, but I don't want to include the "large" ones (large file sizes, large numbers of files) in the auto script, so I do those separately, from the list that the SQL query provides.
This one dataset is the only one whose size was WAY off.
I created a spreadsheet comparing the SQL query and API:
https://docs.google.com/spreadsheets/d/1Mw5l_pU7NZXJ54fV2R321r80JTx2DGQX/edit?usp=sharing&ouid=100246654920053034549&rtpof=true&sd=true
I was trying to see if there was a pattern to why they were different (draft, multiple versions, zip, tab, etc.); that's what the colors and comments are about.
I added a comment to the query for now. I'll probably figure out later how to get it working for datasets whose files have the same sizes.
Sometimes I use database queries instead of the API because they're faster than the API endpoints, which are usually pretty granular. And I've heard the same from folks running other installations.
Is that true for you here, too? I imagine running one or more endpoints on hundreds or thousands of files would take too long.
I changed the query that's in that Google Doc so that it gets the sum of file sizes from the database's storageuse table, which I think was added in v6.1. UVA Dataverse is on 6.2, right? So I think it should work for you, @Sherry Lake. I tested the query a little and it seemed to work well. Definitely simpler!
I made a correction to your new query on the Google Doc: "=" in place of "in".
It didn't work for me with "in".
After changing it to "=", it worked for me; I got the correct size for the single dataset.
Now can we have it do all datasets? Something to work on later, not needed now.
Oh thanks! Yeah I briefly considered making the query work with a list of dataset IDs and backed out of that idea, but left that "in" in.
About making it work with all datasets, I think we should just be able to remove the where clause in that temporary "datasetfilecount" table or CTE. I left a comment to indicate this.
I wrote "should" 'cause I haven't tested this to make sure. Do you plan to test it?
I tested it a bit since I was already querying for another reason.
Removing that where clause seems to work well. I didn't even have to add a clause to exclude harvested datasets, which was a nice surprise.
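For reference, the storageuse-based version for all datasets might look something like this (a sketch only: it assumes the storageuse table has dvobjectcontainer_id and sizeinbytes columns, as in Dataverse 6.1+, and names the file-count CTE datasetfilecount to match the comment in the Google Doc, but it is not the exact query from the doc):
with datasetfilecount as (
  select
    dataset.id as dataset_id,
    count(distinct filemetadata.datafile_id) as file_count
  from filemetadata
  join datasetversion on datasetversion.id = filemetadata.datasetversion_id
  join dataset on dataset.id = datasetversion.dataset_id
  -- add "where dataset.id = 58650" here to limit it to a single dataset
  group by dataset.id
)
select
  datasetfilecount.dataset_id,
  storageuse.sizeinbytes as byte_size,
  datasetfilecount.file_count
from datasetfilecount
join storageuse on storageuse.dvobjectcontainer_id = datasetfilecount.dataset_id
order by storageuse.sizeinbytes desc;
The inner join on storageuse would also be why harvested datasets drop out without an explicit exclusion clause.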
Thanks, Julian. I was trying to figure out a pyDataverse problem; see my Zulip question in the python channel (not for you, just an FYI; I have been catching up on Dataverse things that have been on my to-do list).