Trying to fix intermittent fault - question about nvidia drivers...

pyspilf

I am experiencing a frequent issue where the BI server completely freezes. I have narrowed it down to a problem with the video card.

This is a Windows 10 system, and these issues (LiveKernelEvent 141) started happening after Windows decided to update some components despite my configuration to not allow updates...

As you can see below, the system was very stable until some updates were installed...

Using the WinDBG tool, and analyzing the minidump log, this is the issue:

VIDEO_SCHEDULER_INTERNAL_ERROR (119) The video scheduler has detected that fatal violation has occurred. This resulted in a condition that video scheduler can no longer progress.

So I see a few options:

1. Try to roll back these updates as nothing else has changed on the system and it was stable before
2. Restore a backup from the day prior to this install, and then try (again) to prevent Windows from applying any updates
3. Update the NVIDIA drivers to see if this will fix it

For #3 @MikeLud1 I would like to make sure I don't break anything, perhaps you have some tips...

I am currently on the most recent versions of BI 5.9.25 and CP AI 2.9.5

The video card is a NVIDIA GeForce GTX 1060 6GB using Studio Driver 516.94

Running nvcc --version outputs this:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
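
For anyone comparing setups, note that nvcc reports the CUDA toolkit version, not the display driver version. A minimal Python sketch for pulling the release number out of that output, so it can be logged before and after a driver change, could look like this. The function name parse_cuda_release is made up for illustration, and the sample text is simply the output pasted above.

```python
import re

# Sample text copied from the `nvcc --version` output quoted in this post.
NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
"""


def parse_cuda_release(nvcc_text):
    """Return (release, full_version) from `nvcc --version` text, or None.

    Hypothetical helper: it only looks for the "release X.Y, VX.Y.Z" pattern.
    """
    match = re.search(r"release\s+(\d+\.\d+),\s+V(\d+\.\d+\.\d+)", nvcc_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_cuda_release(NVCC_OUTPUT))  # ('11.8', '11.8.89')
```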


1739367417430.png
 
So, clearly going forward with the NVIDIA drivers was not a good idea... I had Windows check for updates, and it updated the NVIDIA driver to 560.94, resulting in LiveKernelEvent 193 instead of 141, only about 1-2 minutes after starting BlueIris.

As long as BlueIris is not running (i.e. CP AI is sitting idle) the system is stable...

Rolled back to 516.94 but I presume this will only get me back to where I started
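
To confirm a rollback like this actually took (and to notice if Windows quietly replaces the driver again), one option is to ask nvidia-smi for the installed driver version. This is only a sketch: it assumes nvidia-smi is on the PATH, and the helper names parse_driver_rows and installed_drivers are invented for this example.

```python
import subprocess


def parse_driver_rows(csv_text):
    """Turn 'name, driver_version' CSV lines into (name, version) tuples."""
    rows = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma, in case a GPU name ever contains one.
        name, _, version = line.rpartition(",")
        rows.append((name.strip(), version.strip()))
    return rows


def installed_drivers():
    """Query nvidia-smi; return a list of (name, version), or None on failure.

    Assumption: nvidia-smi is installed and reachable on the PATH.
    """
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=30,
        )
    except (FileNotFoundError, subprocess.CalledProcessError,
            subprocess.TimeoutExpired):
        return None
    return parse_driver_rows(result.stdout)


def driver_matches(rows, expected_version):
    """True if every detected GPU reports the expected driver version."""
    return bool(rows) and all(version == expected_version for _, version in rows)
```

Calling driver_matches(installed_drivers() or [], "516.94") after a reboot would show whether the system is still on the driver it was rolled back to.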
 
Just in case anyone comes across a similar problem, the final solution for me was to source a second-hand replacement i7-4790K CPU, and now all works fine (or so it seems for the moment). It baffles me why a failing CPU would produce kernel errors clearly identified as being caused by the video card, but in the end it is Windows, so...

This is the third CPU in this system. It has never overheated or anything but I suppose being on 24/7 will make its lifespan in terms of elapsed time a lot shorter
 
What is the brand/model of the computer?
Maybe the heatsink is too small for the Intel processor. For instance, Dell had 3 different heatsinks for 4th generation Intel chips. The one for the 4790/4590 will not cool the 4790K enough.
 
Oh, well, to be honest, I hadn't thought of that, as the CPU fan is always running relatively slowly, so I assumed it wasn't heating enough to require more cooling... CPU usage is usually in the low 20% range or less, and only spikes a bit when it is raining or windy, obviously... and I have never overclocked it

No specific brand as this was a build with a 4U server case, etc. The fan is from Intel and I believe it came as part of the kit... but no harm done in looking for a better cooler, I can sure try.

Thanks so much for the suggestion
 
Check the CPU temperature before bothering with a new cooler. If it is below about 80°C then I doubt there is any reason to upgrade the cooling.

Otherwise I'd just go with something like this
In a 4U case you probably have 155mm of clearance to suit that cooler but it is best to confirm. The other concern will be RAM clearance. Many coolers like this one put a fan directly above the RAM so if the RAM has tall heatsinks it can have clearance issues.
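
As a rough illustration of the "below about 80°C" rule of thumb above, here is a small Python sketch that takes per-core readings (typed in by hand from whatever monitoring tool you use) and flags whether the peak crosses the limit. The function name cooling_check and the sample readings are hypothetical; only the 80°C figure comes from this post.

```python
def cooling_check(core_temps_c, limit_c=80.0):
    """Summarize per-core temperatures against a rule-of-thumb limit.

    core_temps_c: per-core readings in degrees Celsius (entered manually).
    limit_c: the threshold to compare against; 80 is the figure from this post.
    """
    if not core_temps_c:
        raise ValueError("need at least one temperature reading")
    peak = max(core_temps_c)
    return {
        "peak_c": peak,
        "limit_c": limit_c,
        # At or above the limit, the cooling deserves a closer look.
        "needs_attention": peak >= limit_c,
    }


# Hypothetical readings, not taken from the screenshots in this thread.
print(cooling_check([62, 71, 88, 65]))
```

Take the readings under normal load (BI and CP AI running), since idle temperatures say little about whether the cooler is adequate.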
 
Thanks! Yes, precisely that is the one I was looking at for a new build I hope in the next few months with a 14th gen CPU, and it was widely recommended for a 4U case

1741202642704.png

OK, I had no idea the temps were that high... I will turn it off before I cook this one too, and go get a better cooler!!

Thanks @bp2008 and @Flintstone61 for getting me looking into this
 
Yup. It looks like you have a hot spot on your CPU with the wide temperature difference between cores. I am not sure how much of that is normal for Haswell parts (4000 series Intel) but with good cooling you should basically never see the peak temperature turn red like that without overclocking.

Be careful to have the CPU and heatsink surfaces as clean as possible (you don't want dust, hairs, or fingerprints sandwiched in there), and don't skimp on thermal paste. "Too much thermal paste" is a myth. Even if you go way overboard adding paste, the excess will get squirted out from underneath the CPU cooler and not harm anything. The vast majority of thermal pastes are not electrically conductive, so it won't short anything even if it somehow spans two electrical contacts.
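
To put a number on the core-to-core difference mentioned above, a quick Python sketch can report the spread and which core is hottest. The name core_spread and the sample values are made up for illustration, and no "acceptable" spread is assumed here, since what is normal for Haswell is not established in this thread.

```python
def core_spread(core_temps_c):
    """Return the hottest core index, coolest core index, and their gap.

    core_temps_c: per-core readings in degrees Celsius, core 0 first.
    """
    if not core_temps_c:
        raise ValueError("need at least one temperature reading")
    hottest = max(range(len(core_temps_c)), key=lambda i: core_temps_c[i])
    coolest = min(range(len(core_temps_c)), key=lambda i: core_temps_c[i])
    return {
        "hottest_core": hottest,
        "coolest_core": coolest,
        "spread_c": core_temps_c[hottest] - core_temps_c[coolest],
    }


# Hypothetical readings for a 4-core CPU.
print(core_spread([62, 71, 88, 65]))
```

Comparing the spread before and after reseating the cooler gives a simple way to tell whether uneven contact or paste coverage was part of the problem.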
 
I concur! I am very careful with things like this. When replacing the processor I cleaned both the heatsink and the CPU with isopropyl alcohol and made sure to get no oil/fingerprints/dust etc on any surface and applied what I considered to be a decent enough amount of thermal paste... but clearly something didn't go to plan...

I will be taking delivery of the PA120SE v3 cooler this afternoon and installing that, this time with more thermal paste. I will let you know what the core temperatures look like after that.

I will also attach a monitor to the case to see what the temperatures are doing, as I was using Windows RDP and I think that increases CPU usage a fair bit, and I want to see readings with only BI/CP AI using the CPU and not RDP too.

Cheers
 
As promised, update. Installed the PA120SE V3 (was fun to learn the LGA1150 socket requires a plate underneath the motherboard, so it was a complete teardown...) - and things are very tight even in a 4U case, but with all installed and cables inside as neat as possible to not interfere with airflow, server back in the rack.

These readings were taken with BI running on green shield, remote desktop open with BI open, CP AI running, and BI also open on the iPhone app. To me this looks like a significant improvement.

1741362946528.png

Compare that with the stock cooler, BI on red shield, app not in the foreground on the desktop, and no connection from the mobile app:

1741363021356.png

So, again, thanks a lot for all your tips!

Cheers
 