Hi there. We have an ingest job that is exhausting all our resources. We have run the imqcmd purge command to try to clear the job, but it will not clear for some reason. Are there any other steps we can take to clear the job?
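For context, the purge we ran was along these lines (the `DataverseIngest` queue name is what the Dataverse troubleshooting guide uses, and the imqcmd path assumes a default Payara layout, so adjust for your setup):

```shell
# Purge the Dataverse ingest message queue ("dst -t q" = destination of type queue).
# Path and queue name may differ per installation.
/usr/local/payara6/mq/bin/imqcmd -u admin purge dst -t q -n DataverseIngest
```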
It was my understanding that I could purge the job queue, but not running jobs - I just had to wait.
A number of installations preempt this problem by setting https://guides.dataverse.org/en/latest/installation/config.html#tabularingestsizelimit to some fraction of your Payara JVM heap setting. Leonid has said that R formats in particular can consume up to 10x the file size in memory during ingest.
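For anyone finding this later: that limit is a database setting you can set via the admin API. A minimal sketch, assuming an 8 GB heap (substitute your own `-Xmx` value) and taking 1/10 of it per the guidance above:

```shell
#!/bin/sh
# Hypothetical helper: derive an ingest size limit as 1/10 of the JVM heap.
# HEAP_BYTES is an example (8 GB); substitute your actual -Xmx setting.
HEAP_BYTES=$((8 * 1024 * 1024 * 1024))
LIMIT=$((HEAP_BYTES / 10))
echo "$LIMIT"

# Then apply it (requires a running Dataverse; endpoint assumed to be local):
#   curl -X PUT -d "$LIMIT" \
#     http://localhost:8080/api/admin/settings/:TabularIngestSizeLimit
```

Files larger than the limit are stored as-is without tabular ingest, so picking a conservative fraction of the heap is the safe direction.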
That's very helpful @Don Sizemore I'll try that once this job finishes.
Is it possible to separate out the ingest process onto another machine? Has anyone else done that? We're thinking of using a worker job on another machine whose only job would be to process ingest jobs.
There is a proposal to do exactly that but I don't think the work has been planned / picked up yet.
Yeah. Here's a related issue: Ingest Modularity/Improvements #7852
Is there a way to know if this particular job is actually making progress? How can we monitor it and know when it's done?
Is it feasible (and safe) to run two active Dataverse instances on different VMs, but using the same database, filesystem, etc.? We're wondering if, in that setup, we could load balance ingestion requests to one of the DV instances and web requests to the other. If it's possible without risking data corruption, that would eliminate the problem of ingestion interfering with web users.
That's what https://github.com/IQSS/dataverse.harvard.edu/issues/111 is about, setting up a dedicated ingest server for Harvard Dataverse. We haven't done it though, and that issue is quite old at this point.
Harvard runs with a dual-application-node setup and has for some time: https://guides.dataverse.org/en/latest/installation/prep.html though I think there were a couple of concurrency problems in the database over the years.
It is possible and I dare say safe to run multiple app servers pointed at the same database. We do this for Harvard Dataverse (two app servers) but you'll want to keep in mind the caveats at https://guides.dataverse.org/en/6.6/installation/advanced.html#multiple-app-servers
Thanks! What about the question of monitoring the ingest job? Is there a way to observe its progress? We just want to make sure that it is in fact making progress.
Hmm, nothing at https://guides.dataverse.org/en/6.6/admin/troubleshooting.html#long-running-ingest-jobs-have-exhausted-system-resources
I assume that's where you found the imqcmd command.
@Jay Sundu if you're running Linux and have strace installed, you can watch the system calls made by the sub-process handling ingest. In my case I could see it reading and seeking, and just let it finish.
IIRC you can find the ingest thread in top by pressing H (to show threads), then run strace -p <pid> on it.
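Roughly, the steps above look like this (the `payara` process-name match is an assumption; adjust to however your app server shows up in the process list):

```shell
# 1. Find the busy ingest thread: run top in thread mode (-H, or press H
#    inside top) against the Payara process, and note the hot thread's TID.
top -H -p "$(pgrep -f payara | head -n1)"

# 2. Attach strace to that thread ID to watch its system calls.
#    Steady read()/lseek() activity on the data file suggests ingest is
#    still making progress rather than being stuck.
sudo strace -p <tid> -e trace=read,lseek
```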
Stopping and starting Payara will only slow things down, as Payara will maintain job state and pick up where it left off once you start it back up.
FYI, our long-running job just finished, and I've put the TabularIngestSizeLimit in place, so hopefully that gives us some safety. We're still looking at perhaps setting up another instance to offload the ingest process, though. Thanks for all your help!
Phew! How long did it take?
About twenty hours.
Wow, what kind of file was it?
There were six 3-5 GB TXT and CSV files. I haven't seen them myself yet; I was just told by the person who did the uploading what they were.
now THAT's gonna be some variable-level metadata!
Interesting. Was ingest successful?
Apparently the publish is still in progress
@Jay Sundu Dataverse will verify checksums on dataset publication; on larger files this can take some time depending on your datastore type. There is a maximum setting for that as well, but I haven't yet implemented it.
https://guides.dataverse.org/en/6.6/installation/config.html#datasetchecksumvalidationsizelimit
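It's the same mechanism as the ingest limit; a sketch of capping publish-time checksum validation at, say, 5 GB (value in bytes, API endpoint assumed to be local):

```shell
# Skip checksum re-validation at publish time for files over ~5 GB.
curl -X PUT -d 5000000000 \
  http://localhost:8080/api/admin/settings/:DatasetChecksumValidationSizeLimit
```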
Last updated: Oct 30 2025 at 06:21 UTC