Stream: troubleshooting

Topic: ingest job exhausting resources


view this post on Zulip Jay Sundu (Mar 20 2025 at 14:23):

Hi there. We have an ingest job that is exhausting all our resources. We have run the imqcmd purge command to try to clear the job, but it will not clear for some reason. Are there any other steps we can take to clear the job?
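For reference, the commands we ran were along these lines (Payara path adjusted for our install; yours may differ):

    # list the ingest destination to see how many messages are queued
    /usr/local/payara6/mq/bin/imqcmd -u admin query dst -t q -n DataverseIngest
    # purge the queued (not yet running) ingest messages
    /usr/local/payara6/mq/bin/imqcmd -u admin purge dst -t q -n DataverseIngest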

view this post on Zulip Don Sizemore (Mar 20 2025 at 14:47):

It was my understanding that I could purge the job queue, but not running jobs; I just had to wait.

view this post on Zulip Don Sizemore (Mar 20 2025 at 15:20):

A number of installations preempt this problem by setting https://guides.dataverse.org/en/latest/installation/config.html#tabularingestsizelimit to some fraction of your Payara JVM heap setting. Leonid has said that R formats in particular can consume up to 10x the file size in memory during ingest.
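If it helps, it's a regular database setting, so something along these lines (value is in bytes; pick a fraction of your heap):

    # e.g. skip ingest for tabular files larger than ~2 GB
    curl -X PUT -d 2000000000 http://localhost:8080/api/admin/settings/:TabularIngestSizeLimit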

view this post on Zulip Jay Sundu (Mar 20 2025 at 15:22):

That's very helpful @Don Sizemore I'll try that once this job finishes.

view this post on Zulip Jay Sundu (Mar 20 2025 at 16:29):

Is it possible to separate out the ingest process onto another machine? Has anyone else done that? We're thinking of using a worker on another machine whose only job would be to process ingest jobs.

view this post on Zulip Don Sizemore (Mar 20 2025 at 16:30):

There is a proposal to do exactly that but I don't think the work has been planned / picked up yet.

view this post on Zulip Philip Durbin 🚀 (Mar 20 2025 at 16:36):

Yeah. Here's a related issue: Ingest Modularity/Improvements #7852

view this post on Zulip Jay Sundu (Mar 20 2025 at 17:30):

Is there a way to know if this particular job is actually making progress? How can we monitor it and know when it's done?

view this post on Zulip Jay Sundu (Mar 20 2025 at 17:51):

Is it feasible (and safe) to run two active Dataverse instances on different VMs, but using the same database, filesystem, etc.? We're wondering if, in that setup, we could load balance ingestion requests to one of the DV instances and web requests to the other. If it's possible without risking data corruption, that would eliminate the problem of ingestion interfering with web users.

view this post on Zulip Philip Durbin 🚀 (Mar 20 2025 at 18:16):

That's what https://github.com/IQSS/dataverse.harvard.edu/issues/111 is about, setting up a dedicated ingest server for Harvard Dataverse. We haven't done it though, and that issue is quite old at this point.

view this post on Zulip Don Sizemore (Mar 20 2025 at 18:17):

Harvard runs with a dual-application-node setup and has for some time: https://guides.dataverse.org/en/latest/installation/prep.html though there were, I think, two concurrency problems in the database over the years.

view this post on Zulip Philip Durbin 🚀 (Mar 20 2025 at 18:17):

It is possible and I dare say safe to run multiple app servers pointed at the same database. We do this for Harvard Dataverse (two app servers) but you'll want to keep in mind the caveats at https://guides.dataverse.org/en/6.6/installation/advanced.html#multiple-app-servers
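One of those caveats, if I remember right, is that only one node should act as the timer server, so on the other app server(s) you'd set something like:

    # run scheduled tasks (timers) on only one node; set this on the others
    ./asadmin create-jvm-options "-Ddataverse.timerServer=false"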

view this post on Zulip Jay Sundu (Mar 20 2025 at 18:23):

Thanks! What about the question of monitoring the ingest job? Is there a way to observe its progress? We just want to make sure that it is in fact making progress.

view this post on Zulip Philip Durbin 🚀 (Mar 20 2025 at 18:34):

Hmm, nothing at https://guides.dataverse.org/en/6.6/admin/troubleshooting.html#long-running-ingest-jobs-have-exhausted-system-resources

view this post on Zulip Philip Durbin 🚀 (Mar 20 2025 at 18:34):

I assume that's where you found the imqcmd command.

view this post on Zulip Don Sizemore (Mar 20 2025 at 18:38):

@Jay Sundu if you're running Linux and have strace installed, you can watch the system calls made by the sub-process handling ingest. In my case I could see it reading and seeking, and just let it finish.

view this post on Zulip Don Sizemore (Mar 20 2025 at 18:40):

IIRC you can find the busy subprocess in top by pressing H (to show threads), then attach strace to its PID.
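Something like:

    # show individual threads of the app server process (pressing H inside top toggles the same view)
    top -H
    # attach to the busy thread/process ID reported by top and watch its system calls
    strace -p <pid>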

view this post on Zulip Don Sizemore (Mar 20 2025 at 18:40):

Stopping and starting Payara will only slow things down, as Payara will maintain job state and pick up where it left off once you start it back up.

view this post on Zulip Jay Sundu (Mar 20 2025 at 19:54):

FYI, our long-running job just finished, and I've put the TabularIngestSizeLimit in place, so hopefully that'll give us some safety. We're still looking at perhaps setting up another instance to offload the ingest process. Thanks for all your help!

view this post on Zulip Philip Durbin 🚀 (Mar 20 2025 at 20:11):

Phew! How long did it take?

view this post on Zulip Jay Sundu (Mar 20 2025 at 21:02):

About twenty hours.

view this post on Zulip Philip Durbin 🚀 (Mar 20 2025 at 21:03):

Wow, what kind of file was it?

view this post on Zulip Jay Sundu (Mar 20 2025 at 21:05):

There were six 3-5 GB files, TXT and CSV. I haven't seen them myself yet; I was just told what they were by the person who did the uploading.

view this post on Zulip Don Sizemore (Mar 20 2025 at 21:07):

now THAT's gonna be some variable-level metadata!

view this post on Zulip Philip Durbin 🚀 (Mar 20 2025 at 21:07):

Interesting. Was ingest successful?

view this post on Zulip Jay Sundu (Mar 20 2025 at 22:02):

Apparently the publish is still in progress

view this post on Zulip Don Sizemore (Mar 21 2025 at 11:33):

@Jay Sundu Dataverse will verify checksums on dataset publication; on larger files this can take some time depending on your datastore type. There is a maximum setting for that as well, but I haven't yet implemented it.

view this post on Zulip Philip Durbin 🚀 (Mar 21 2025 at 12:14):

https://guides.dataverse.org/en/6.6/installation/config.html#datasetchecksumvalidationsizelimit
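It's another database setting, so roughly (value in bytes, adjust to taste):

    # skip checksum validation at publish time above this size
    curl -X PUT -d 5000000000 http://localhost:8080/api/admin/settings/:DatasetChecksumValidationSizeLimit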

