RTX2080ti : GPU switch to CPU: an illegal memory access was encountered (while de-allocating memory)

Hello,
I have also found many other articles on this subject, but no solution that has helped me so far.
I have used / tested three different 1080ti cards in the last seven months, which all worked fine in DAZ 4.10. But since I had problems with games (or defective fans), I had to send them all back.
For the last few weeks I went back to my old GTX 970, which ran without any problems.
I have had a new 2080ti for a week now, and finally no problems with video games. But I do have problems with the latest version, DAZ 4.11: after 1-3 hours the rendering switches from GPU to CPU.
I only have a few days left to send the card back and don't know if this is a driver / Iray / DAZ or a hardware problem.
I bought this expensive card especially for DAZ; for games I could have bought a cheaper one! I can't tell whether this is a permanent state or something that can be solved by a patch / driver (soon).
Other threads suggested various fixes. Here is what I have tried:
- - OC => "lowering the stock overclock on my graphics card" => I have tried with 100MHz less
- - Scene too large for VRAM => only 6.4GB of 11GB are occupied in the example scene
- - deactivate Denoiser => is not activated
- - GPU overheated => constant 71°C is not too much
- - try out other scenes => I tried 3-4 different scenes
- - pagefile issue => only about 15GB of 32GB RAM are occupied according to GPUz, so I don't think that's an issue.
- - Driver problem / new installation => I cleaned up with DDU and reinstalled.
Does anyone have any other tips on what I might try or what might be the cause?
Thanks in advance!
-------------------------------------------------------------------------------------------
System:
- DAZ: 4.11.0.366
- GPU: 2080ti 11 GB VRAM
- GPU driver version: 430.64
- System with 32 GB RAM
- CPU i5-6600K CPU @ 3.50GHz
- Windows 7 64bit SP1
GPUz => typical values shortly before switching to the CPU
- - GPU Temperature [°C]: 71°
- - GPU Memory Used [MB]: 6461
- - System Memory Used [MB]: 15252
- - GPU Load [%]: 96-98 (after the change to the CPU 0)
- - [% TDP]: 75 - 80
++ DAZ-Log ++
2019-05-25 01:15:29.874 Iray VERBOSE - module:category(IRAY:RENDER): 1.3 IRAY rend stat : Texture memory consumption: 1.55789 GiB for 246 bitmaps (device 0)
2019-05-25 01:15:30.686 Iray INFO - module:category(IRAY:RENDER): 1.3 IRAY rend info : Light hierarchy initialization took 0.82s
2019-05-25 01:15:30.701 Iray VERBOSE - module:category(IRAY:RENDER): 1.3 IRAY rend stat : Lights memory consumption: 9.93431 MiB (device 0)
2019-05-25 01:15:30.717 Iray VERBOSE - module:category(IRAY:RENDER): 1.3 IRAY rend stat : Material measurement memory consumption: 0 B (GPU)
2019-05-25 01:15:30.826 Iray VERBOSE - module:category(IRAY:RENDER): 1.3 IRAY rend stat : Materials memory consumption: 3.00267 MiB (GPU)
2019-05-25 01:15:30.826 Iray VERBOSE - module:category(IRAY:RENDER): 1.3 IRAY rend stat : PTX code (165 KiB) for sm75 generated in 0.113s
2019-05-25 01:15:31.029 Iray INFO - module:category(IRAY:RENDER): 1.3 IRAY rend info : JIT-linking wavefront kernel in 0.051s
2019-05-25 01:15:31.060 Iray INFO - module:category(IRAY:RENDER): 1.3 IRAY rend info : JIT-linking mega kernel in 0.025s
2019-05-25 01:15:31.060 Iray INFO - module:category(IRAY:RENDER): 1.2 IRAY rend info : CUDA device 0 (GeForce RTX 2080 Ti): Scene processed in 22.037s
2019-05-25 01:15:31.091 Iray INFO - module:category(IRAY:RENDER): 1.2 IRAY rend info : CUDA device 0 (GeForce RTX 2080 Ti): Allocated 284.868 MiB for frame buffer
2019-05-25 01:15:31.216 Iray INFO - module:category(IRAY:RENDER): 1.2 IRAY rend info : CUDA device 0 (GeForce RTX 2080 Ti): Allocated 1.6875 GiB of work space (2048k active samples in 0.138s)
2019-05-25 01:15:31.232 Iray INFO - module:category(IRAY:RENDER): 1.2 IRAY rend info : CUDA device 0 (GeForce RTX 2080 Ti): Used for display, optimizing for interactive usage (performance could be sacrificed)
2019-05-25 01:15:37.830 Iray INFO - module:category(IRAY:RENDER): 1.2 IRAY rend info : Allocating 1-layer frame buffer
2019-05-25 01:16:04.241 Iray INFO - module:category(IRAY:RENDER): 1.0 IRAY rend info : Received update to 00005 iterations after 55.211s.
...
...
...
2019-05-25 03:27:19.917 Iray VERBOSE - module:category(IRAY:RENDER): 1.0 IRAY rend progr: 3.49% of image converged
2019-05-25 03:27:20.478 Iray INFO - module:category(IRAY:RENDER): 1.0 IRAY rend info : Received update to 01318 iterations after 7930.998s.
2019-05-25 03:31:17.832 WARNING: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(302): Iray ERROR - module:category(IRAY:RENDER): 1.13 IRAY rend error: CUDA device 0 (GeForce RTX 2080 Ti): Kernel [18] (MaterialSurface ) failed after 0.019s
2019-05-25 03:31:17.832 WARNING: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(302): Iray ERROR - module:category(IRAY:RENDER): 1.13 IRAY rend error: CUDA device 0 (GeForce RTX 2080 Ti): an illegal memory access was encountered (while launching CUDA renderer)
2019-05-25 03:31:17.832 WARNING: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(302): Iray ERROR - module:category(IRAY:RENDER): 1.13 IRAY rend error: CUDA device 0 (GeForce RTX 2080 Ti): Failed to launch renderer
2019-05-25 03:31:17.848 WARNING: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(302): Iray ERROR - module:category(IRAY:RENDER): 1.2 IRAY rend error: CUDA device 0 (GeForce RTX 2080 Ti): Device failed while rendering
2019-05-25 03:31:17.848 WARNING: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(302): Iray WARNING - module:category(IRAY:RENDER): 1.2 IRAY rend warn : All available GPUs failed.
2019-05-25 03:31:17.848 WARNING: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(302): Iray WARNING - module:category(IRAY:RENDER): 1.2 IRAY rend warn : No devices activated. Enabling CPU fallback.
2019-05-25 03:31:17.848 WARNING: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(302): Iray ERROR - module:category(IRAY:RENDER): 1.2 IRAY rend error: CUDA device 0 (GeForce RTX 2080 Ti): an illegal memory access was encountered (while initializing memory buffer)
2019-05-25 03:31:17.848 WARNING: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(302): Iray ERROR - module:category(IRAY:RENDER): 1.2 IRAY rend error: All workers failed: aborting render
2019-05-25 03:31:17.848 WARNING: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(302): Iray ERROR - module:category(IRAY:RENDER): 1.2 IRAY rend error: CUDA device 0 (GeForce RTX 2080 Ti): an illegal memory access was encountered (while de-allocating memory)
2019-05-25 03:31:17.848 WARNING: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(302): Iray ERROR - module:category(IRAY:RENDER): 1.2 IRAY rend error: CUDA device 0 (GeForce RTX 2080 Ti): an illegal memory access was encountered (while de-allocating memory)
...
...
...
2019-05-25 03:31:17.864 Iray INFO - module:category(IRAY:RENDER): 1.0 IRAY rend info : CPU: using 4 cores for rendering
2019-05-25 03:31:17.864 Iray INFO - module:category(IRAY:RENDER): 1.0 IRAY rend info : Rendering with 1 device(s):
2019-05-25 03:31:17.864 Iray INFO - module:category(IRAY:RENDER): 1.0 IRAY rend info : CPU
2019-05-25 03:31:17.864 Iray INFO - module:category(IRAY:RENDER): 1.0 IRAY rend info : Rendering...
2019-05-25 03:31:17.864 Iray VERBOSE - module:category(IRAY:RENDER): 1.2 IRAY rend progr: CPU: Processing scene...
2019-05-25 03:31:36.147 Iray VERBOSE - module:category(IRAY:RENDER): 1.3 IRAY rend stat : Native CPU code generated in 0.0789s
2019-05-25 03:31:36.147 Iray INFO - module:category(IRAY:RENDER): 1.2 IRAY rend info : CPU: Scene processed in 18.288s
2019-05-25 03:31:36.272 Iray INFO - module:category(IRAY:RENDER): 1.2 IRAY rend info : CPU: Allocated 284.868 MiB for frame buffer
2019-05-25 03:33:57.779 Iray INFO - module:category(IRAY:RENDER): 1.2 IRAY rend info : Allocating 1-layer frame buffer
2019-05-25 03:46:04.271 WARNING: ..\..\..\..\..\src\pluginsource\DzIrayRender\dzneuraymgr.cpp(302): Iray WARNING - module:category(POST:RENDER): 1.0 POST rend warn : renderer iray has no more devices available. Postprocessing falling back to CPU.
2019-05-25 03:52:06.378 Iray INFO - module:category(IRAY:RENDER): 1.0 IRAY rend info : Received update to 01324 iterations after 9416.800s.
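For anyone comparing their own log against this one, the failure point can be pulled out mechanically. Below is a minimal Python sketch, assuming the log keeps the line format shown above (a timestamp at the start of each line, the text "Iray ERROR" on error lines, and "Enabling CPU fallback" when the GPU is dropped). The function names are made up for illustration and are not part of DAZ Studio or Iray.

```python
import re
from datetime import datetime

# Timestamp at the start of a log line, e.g. "2019-05-25 03:31:17.832"
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    m = TS.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")


def summarize(lines):
    """Return (first_error_line, seconds_until_first_error, cpu_fallback_seen).

    The elapsed time is measured from the first timestamped line passed in,
    so feed it only the lines belonging to one render.
    """
    start = None
    first_error = None
    fallback = False
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip "..." and other lines without a timestamp
        if start is None:
            start = ts
        if first_error is None and "Iray ERROR" in line:
            first_error = (ts, line.strip())
        if "Enabling CPU fallback" in line:
            fallback = True
    if first_error is None:
        return None, None, fallback
    elapsed = (first_error[0] - start).total_seconds()
    return first_error[1], elapsed, fallback
```

Called as `summarize(open(path, encoding="utf-8", errors="replace"))`, where `path` is wherever your copy of the log is saved, it gives the first error line and how long the render ran before it. For the excerpt above, the first error is the "Kernel [18] (MaterialSurface ) failed" line, roughly 2 hours 16 minutes after the first line shown.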
Comments
You cannot run multiple renders in Iray sequentially on any GPU. There is a memory leak that will eventually cause the render to fall back to the CPU. You're probably experiencing this sooner on the 2080ti than on the 970 because it renders so much faster. You need to restart Studio periodically or this will happen no matter what GPU you use.
I mentioned the 1080ti and 970 cards only to show that I had no such problems with the 4 cards before.
Currently only the 2080ti is installed, and the failure occurs within a single render, usually after 1-3 hours and without any apparent trigger. In GPUz I don't see any indication that anything is overflowing, getting too hot or behaving unusually.
Hello,
Thanks for your questions.
To explain the 3.5% in the last test: it was a dark scene with the rain from KindredArts. The scene isn't big, but it needs a lot of iterations to get halfway clean (that wouldn't have been possible with my old 970GTX in one night). Render quality is mostly 1.5-2.0 and 98%.
I always run my renders overnight and stop them in the morning after at most 6-7h, no matter how far along they are. With a few exceptions, this was usually good enough.
------------
The GPU clock was constant at 1770 MHz; about 18 minutes before the failure it was 90 MHz lower for one sample (10 s interval).
The memory clock was constant at 1717.5 MHz.
For the CPU, unfortunately only the temperature is listed...
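Scrolling through hours of sensor readings by eye is tedious, so here is a small Python sketch that flags clock dips in a sensor log. It assumes a comma-separated log with a header row and a column named "GPU Clock [MHz]", which is how I understand GPU-Z writes its sensor log. Treat the column name, the 50 MHz tolerance and the function name as assumptions to adjust.

```python
import csv
import io


def find_clock_dips(csv_text, column="GPU Clock [MHz]", tolerance_mhz=50.0):
    """Return (timestamp, value) pairs where the column falls more than
    tolerance_mhz below its median value.

    Assumes the first column holds the timestamp and the first row is a header.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header_row = next(reader, None)
    if header_row is None:
        return []
    header = [h.strip() for h in header_row]  # headers are padded with spaces
    idx = header.index(column)

    samples = []
    for row in reader:
        if len(row) <= idx:
            continue  # incomplete line
        try:
            samples.append((row[0].strip(), float(row[idx])))
        except ValueError:
            continue  # non-numeric cell, e.g. a repeated header
    if not samples:
        return []

    # The median is a stable baseline even if there are a few dips.
    ordered = sorted(value for _, value in samples)
    median = ordered[len(ordered) // 2]
    return [(ts, value) for ts, value in samples if median - value > tolerance_mhz]
```

Running it on both the GPU clock and the memory clock column, and comparing the dip timestamps with the time of the first Iray error, would show whether the 90 MHz dip is related to the failure or just ordinary boost behaviour.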
I got a new Enermax Titan 800 Watt power supply for the card, because my old one (~10 years old) had only 650 Watt. With the old power supply I also had PC crashes (in Hitman 2) with the 2080ti. Only after I installed the new power supply one day later did the games run smoothly, so I assume that it should suffice.
Windows and DAZ are on a Samsung 850 EVO 500GB.
The "My Library" is currently still on a normal HD (Seagate or Western Digital) because of its size.
OS has only the updates declared as "important". (Or what do you mean by platform updates?)
An 800W PSU might not be sufficient for your system. The boost clock of the standard 2080ti is 1545 MHz. An overclock of roughly 225 MHz is clearly sustainable for your cooling, but it could be straining the PSU. I'd start by checking your manufacturer's website for the TDP of the card, assuming it's a factory OC. If you did it yourself, dial it back to base and see if that changes things.
I did some quick checking: the only factory OC at that level seems to be the Gigabyte Aorus Extreme, which strangely doesn't list a TDP but recommends a lower-wattage PSU than the standard card, which clearly draws less power. If that is the card you have, I think your PSU might be overheating, and that could be the source of your problems.
Yes, it's the AORUS Extreme. 750W PSU recommended on this page: https://www.gigabyte.com/de/Graphics-Card/GV-N208TAORUS-X-11GC#sp
Nvidia itself recommends only 650W for the Founders Edition on its homepage.
I tested the GPU one day with my old 650W power supply (Hitman 2) and had 4 PC crashes there. The next day I had the new 800W power supply installed and since then no more reboots while playing.
According to GPUz the power consumption when rendering was mostly around 80-85%, but it went up to 97% when playing. In games it fluctuates constantly, while during rendering it stays relatively constant.
GPUz's "Power Consumption (W)" mostly shows around 250W when rendering. In Assassins Creed O. I even had 295W.
I don't want to rule it out completely, but I would have expected the 800W power supply to be sufficient.
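As a rough sanity check on those numbers, here is a back-of-the-envelope power budget in Python. The 295W GPU peak is the GPUz reading above and 91W is Intel's rated TDP for the i5-6600K. The 75W for the rest of the system and the 80% derating of the PSU rating are assumptions on my part, not measurements, and the function name is made up.

```python
def psu_headroom(psu_watts, component_watts, derate=0.8):
    """Return (total_load, usable_capacity, headroom) in watts.

    derate is the fraction of the PSU rating treated as comfortably usable;
    0.8 is a rule-of-thumb assumption, not a specification.
    """
    total = sum(component_watts.values())
    usable = psu_watts * derate
    return total, usable, usable - total


load = {
    "gpu_peak": 295,       # highest GPUz reading mentioned above (gaming)
    "cpu_tdp": 91,         # Intel's rated TDP for the i5-6600K
    "rest_of_system": 75,  # ASSUMPTION: board, RAM, drives, fans
}

total, usable, headroom = psu_headroom(800, load)
print(total, usable, headroom)
```

On these assumptions the load is 461W against 640W of derated capacity, so the sustained draw alone does not look like the problem. This kind of estimate says nothing about short transient spikes or a PSU that is overheating because of a blocked intake or a dead fan.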
The thing is, a consistent crash after running for a while points to one of a few things. Overheating of the GPU has been ruled out. The most likely next problem is some other component overheating. Since the crash is occurring only on the GPU, and Nvidia GPUs are notoriously sensitive to voltage issues, I think the issue is an overheating PSU. Check that its intake is clear and that the fan does run.
That you don't have issues when gaming does count against my theory, assuming you game for at least as long as the renders run.
If that's not the problem you could be looking at a RAM issue. But I think that is highly unlikely.
Your other possibility is some sort of software fault, such as a memory leak that no one else has. That could happen, but you should be able to see it in a hardware monitor.
I bought 2 Titan RTX's last Wednesday. They were working great until I noticed the drop to CPU during a particularly long render on Saturday (~3 hours into the render). My log showed the identical issue. After this incident it started happening every render. After a lot of troubleshooting, PCIE slot swapping, and plugging in only 1 card at a time I isolated the issue to one of the cards. Further, I observed that underclocking the VRAM by ~500MHz on the problematic card temporarily resolved the issue. After about 5 hours of rendering at the reduced clock rate it once again failed. Now the card will no longer render under any condition.
I found it interesting that 1) the card worked perfectly for the first few days with no changes to my PC, and that 2) after said incident the issue was easily replicated, suggesting that the VRAM is susceptible to permanent damage through normal use. While I cannot 100% confirm this to be true, I do have 2 identical RTX cards to use as a comparison case, and with all settings equal and testing in the same PCIE slot for consistency, said results can only be replicated on the one card. I also have 4 other Pascal GPUs that work in the same PC, with the same Daz build and Nvidia drivers, without issue.
With a ramped up fan profile the card was running in the mid 70c range under load. PSU is EVGA 1600W platinum, 3 weeks old. Intel i9-7900X. Asus Rog Rampage VI Extreme mobo. 64gb ddr4 @3200. Windows 10 Pro version 1809 build 17763.503. Nvidia driver version 430.64. Daz build: 4.11.366 Beta.
As described by kimbleag, the card does still work in a handful of games I tested, as well as with Unity RTX beta. However, it does not initialize in Maya when I try to render with the Arnold beta GPU renderer.
I already RMA'd the card.
Nvidia and its partners claimed to have fixed those issues with VRAM back right after the RTX cards launched. I guess they didn't.
If it works fine for games then it's not the card, the PSU, or any other hardware issue. Games usually draw more power than Iray rendering does. I'd rather bet on a driver issue. Windows 7 probably isn't the best OS for dealing with new cards.
@kenshaw011267
Last night I rendered a simpler scene (2 characters and an HDRI) from 3 perspectives with max. 2000 iterations (batch job script). It took only ~70-80 minutes per image, and over the course of ~4-5h all 3 images were completed on the GPU. But between the images the project was closed, the new one loaded, etc., so the hardware (e.g. the PSU) would have had time to cool down.
The bad thing is that the tests are so time-consuming...
We have a holiday tomorrow, so I'll try a "final" ~3 hour stability test in Assassins Creed.
I would have liked to test the VRAM, but couldn't find a tool that could do it. There is an old one, but it only seems to work up to ~2GB.
All the GPU test tools (like FurMark) occupy only 2-4GB.
@nothingmore
Oh, your story really sounds like it's the GPU itself, even if it works in games and otherwise runs smoothly. Unfortunately, my 2-week return period expired today and I would have to open a warranty case here as well. But I have time for that and will probably do a few more tests.
When the new AMD processors come on the market in July, I want to change CPU and motherboard anyway and switch to Win 10. Maybe I'll wait until then. As I wrote, I have been fighting with the topic "new GPU" for three quarters of a year, and my nerves are just about worn out.
I keep my fingers crossed for you that your RMA goes through and you get a flawless Titan RTX. I'm pretty jealous of your PC.
@Padone
Thank you for your comment, I am currently also leaning toward a software problem. In 2-3 months I will upgrade the rest of the PC and (unfortunately) have to switch to Win 10 anyway. Maybe the problem will be solved by then and if not, I have to try to open a warranty case for the GPU.
That you were able to render images in succession using the render queue strongly supports that either something is overheating or that you have something leaking memory.
If you're not trying to play games or something of that sort while the render is running, I strongly doubt it's a memory leak. I still lean toward something being wrong in the PSU. Have you checked its intake and that its fan is spinning while the machine is on?
I am having this issue with 256x256 thumbnails which used to render just fine without crashing. For larger renders I would restart the program every time, but with thumbnails that'll take forever.
The problem with testing these issues with games is that most people don't play games for as long as renders can take, so any stability issues in the card might not show up during gameplay. Try a benchmark program like Heaven https://benchmark.unigine.com/heaven and let it run for six or more hours.
I don't think it's a power issue, 800 Watts is more than enough, and, as was pointed out, games use more power than renders (they are using parts of the card that IRay isn't).
In case it's any help to you, my crashes stopped when I turned off OptiX acceleration. My errors said I was out of memory, but the line that started it all each time mentioned OptiX. I waited longer than I should have to try it because I didn't want things to slow down, but the difference was not much, and not crashing more than made up for it. Your log is similar: mine mentioned dzneuraymgr.cpp a lot but also OptiX, which yours doesn't seem to...