DAZ Studio 4.21.0.5 new error behavior on GPU OOM
This is almost certainly a GPU OOM issue, but I've never seen it before so I figure I'll document it here. I have a simple scene (HDRI, one character) but I'm running the render at SubD 5 on a (dedicated) Titan Xp with 12GByte of GPU RAM (yeah, I know, buy a 4090...). That puts it right at the limit of the GPU RAM; in this case the OOM fallback pushed 2/3 of the frame buffer into CPU memory:
2022-10-19 09:51:45.916 [WARNING] :: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(369): Iray [WARNING] - IRAY:RENDER :: 1.10 IRAY rend warn : CUDA device 0 (NVIDIA TITAN Xp): Failed to allocate 180.000 MiB for (device) frame buffer, will try allocating smaller (partial) frame buffer
2022-10-19 09:51:45.932 Iray [INFO] - IRAY:RENDER :: 1.10 IRAY rend info : CUDA device 0 (NVIDIA TITAN Xp): Allocated 90.000 MiB for device frame buffer
2022-10-19 09:51:45.947 Iray [INFO] - IRAY:RENDER :: 1.10 IRAY rend info : CUDA device 0 (NVIDIA TITAN Xp): Allocated 180.000 MiB for host-side frame buffer
2022-10-19 09:51:45.947 [WARNING] :: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(369): Iray [WARNING] - IRAY:RENDER :: 1.10 IRAY rend warn : CUDA device 0 (NVIDIA TITAN Xp): Succeeded in allocating partial device frame buffer. Device efficiency will be affected.
This doesn't worry me; it's a 4K render and it happens quite frequently. Despite the ominous warning, my experience is that the performance is hardly affected. Everything went fine for a while:
2022-10-19 09:51:49.531 Iray [INFO] - IRAY:RENDER :: 1.0 IRAY rend progr: Received update to 00001 iterations after 8.519s.
2022-10-19 09:51:52.786 Iray [INFO] - IRAY:RENDER :: 1.0 IRAY rend progr: Received update to 00002 iterations after 11.778s.
2022-10-19 09:51:56.073 Iray [INFO] - IRAY:RENDER :: 1.0 IRAY rend progr: Received update to 00003 iterations after 15.052s.
* * *
2022-10-19 14:38:47.369 Iray [INFO] - IRAY:RENDER :: 1.0 IRAY rend progr: 99.24% of image converged
2022-10-19 14:38:47.389 Iray [INFO] - IRAY:RENDER :: 1.0 IRAY rend progr: Received update to 04791 iterations after 17226.367s.
At which point this happened:
2022-10-19 14:41:59.716 Iray [INFO] - IRAY:RENDER :: 1.0 IRAY rend progr: Received update to 04844 iterations after 17418.696s.
2022-10-19 14:42:35.365 Iray [INFO] - IRAY:RENDER :: 1.14 IRAY rend info : CUDA device 0 (NVIDIA TITAN Xp): Prevent device timeout
2022-10-19 14:42:35.567 Iray [INFO] - IRAY:RENDER :: 1.14 IRAY rend info : CUDA device 0 (NVIDIA TITAN Xp): Execute device timeout
2022-10-19 14:42:35.577 Iray [INFO] - IRAY:RENDER :: 1.4 IRAY rend info : CUDA device 0 (NVIDIA TITAN Xp): Prevented device timeout
2022-10-19 14:42:35.577 Iray [INFO] - IRAY:RENDER :: 1.4 IRAY rend info : Device timeout executed, resume 62 unfinished samples.
2022-10-19 14:42:37.796 Iray [INFO] - IRAY:RENDER :: 1.14 IRAY rend info : CUDA device 0 (NVIDIA TITAN Xp): Prevent device timeout
2022-10-19 14:42:37.999 Iray [INFO] - IRAY:RENDER :: 1.14 IRAY rend info : CUDA device 0 (NVIDIA TITAN Xp): Execute device timeout
2022-10-19 14:42:38.166 Iray [INFO] - IRAY:RENDER :: 1.4 IRAY rend info : CUDA device 0 (NVIDIA TITAN Xp): Prevented device timeout
2022-10-19 14:42:38.166 Iray [INFO] - IRAY:RENDER :: 1.4 IRAY rend info : Device timeout executed, resume 2055 unfinished samples.
So far as I can tell the render is now dead in the water:
2022-10-19 15:08:01.986 Iray [INFO] - IRAY:RENDER :: 1.14 IRAY rend info : CUDA device 0 (NVIDIA TITAN Xp): Prevent device timeout
2022-10-19 15:08:02.189 Iray [INFO] - IRAY:RENDER :: 1.14 IRAY rend info : CUDA device 0 (NVIDIA TITAN Xp): Execute device timeout
2022-10-19 15:08:02.314 Iray [INFO] - IRAY:RENDER :: 1.4 IRAY rend info : CUDA device 0 (NVIDIA TITAN Xp): Prevented device timeout
No advance from 99.3% converged and no more updates until:
2022-10-19 15:18:15.612 Iray [INFO] - IRAY:RENDER :: 1.0 IRAY rend progr: Received update to 04898 iterations after 19594.593s.
That was the last message in the log file at the time, though regular (5 minute) updates have now resumed. (It's difficult to track this in real time because Studio doesn't seem to flush the log regularly.)
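To put a number on the stall, here is a rough sketch that pulls the "Received update" lines out of the log and works out seconds per iteration between consecutive updates. The regular expression and the function names are my own, not anything Studio or Iray provides, and the sample lines are the three progress messages quoted above.

```python
import re

# Matches Iray progress lines such as:
#   "... Received update to 04844 iterations after 17418.696s."
UPDATE_RE = re.compile(r"Received update to (\d+) iterations after ([\d.]+)s")


def parse_updates(lines):
    """Return (iteration, elapsed_seconds) pairs from Iray progress lines."""
    updates = []
    for line in lines:
        m = UPDATE_RE.search(line)
        if m:
            updates.append((int(m.group(1)), float(m.group(2))))
    return updates


def seconds_per_iteration(updates):
    """Average seconds per iteration between consecutive logged updates."""
    rates = []
    for (it0, t0), (it1, t1) in zip(updates, updates[1:]):
        if it1 > it0:
            rates.append((it0, it1, (t1 - t0) / (it1 - it0)))
    return rates


# The three progress lines quoted in this post.
sample = [
    "2022-10-19 14:38:47.389 Iray [INFO] - IRAY:RENDER :: 1.0 IRAY rend progr: "
    "Received update to 04791 iterations after 17226.367s.",
    "2022-10-19 14:41:59.716 Iray [INFO] - IRAY:RENDER :: 1.0 IRAY rend progr: "
    "Received update to 04844 iterations after 17418.696s.",
    "2022-10-19 15:18:15.612 Iray [INFO] - IRAY:RENDER :: 1.0 IRAY rend progr: "
    "Received update to 04898 iterations after 19594.593s.",
]

for it0, it1, rate in seconds_per_iteration(parse_updates(sample)):
    print(f"iterations {it0}-{it1}: {rate:.1f} s/iteration")
# iterations 4791-4844: 3.6 s/iteration
# iterations 4844-4898: 40.3 s/iteration
```

So across the device timeout messages the render dropped from about 3.6 seconds per iteration to about 40, averaged over that gap. Since Studio doesn't flush the log regularly, this is only reliable on a log read after the fact.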
This is no big deal for me. I'm rendering to 100% convergence, so stopping at 99.3% leaves only 58,061 pixels not converged, if I understand what the "converged" percentage means. It seems entirely possible that the problem corresponds to me opening another scene in Studio, which consumes between 100 and 200MByte of GPU RAM (just for starting Studio). I use the Titan Xp as a dedicated compute card; Studio, Photoshop and PTGui use it, but nothing else that I can find out how to disable.
While the render had, apparently, recovered from the problems (i.e. it was producing new iterations), convergence wasn't increasing fast enough for me so I just cancelled it; the missing pixels aren't immediately apparent ;-)
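For anyone checking the pixel count, here is the arithmetic as a small sketch. It assumes "4K" means 3840 x 2160 (I haven't stated the exact render size anywhere) and that the "converged" percentage is simply the fraction of pixels that have met the convergence criterion; the function name is just for illustration.

```python
# Assumption: "4K" means 3840 x 2160; the exact render size is not given.
WIDTH, HEIGHT = 3840, 2160


def unconverged_pixels(width, height, converged_tenths_of_percent):
    """Pixels not yet converged.

    Convergence is passed in tenths of a percent (993 means 99.3%) so the
    inputs stay as integers and only the final division needs rounding.
    """
    total = width * height
    remaining = 1000 - converged_tenths_of_percent
    return round(total * remaining / 1000)


print(unconverged_pixels(WIDTH, HEIGHT, 993))  # 58061
```

That is 0.7% of 8,294,400 pixels, which matches the 58,061 figure under those assumptions.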