Apologies for the word vomit, but I need some help plz.
Hate to even ask this, but I'm at my wit's end, have no hair left to pull out, and desperately need some direction/guidance. I've been running a 3rd gen i5 desktop with Deepstack and 9 cameras for the better part of three years. Deepstack worked VERY well, except that every five or six weeks it would randomly die and require a reinstall to resume functioning. I discovered CodeProject.AI, read hundreds of pages of forum postings on it, then decided to pull the trigger and migrate. That was... rough, but kind of successful?
Unfortunately I've developed a need for some LPR cameras, and while CodeProject seemed to be working (except that my poor i5 CPU was basically at 100% usage 100% of the time), I decided to perform a few upgrades: a more powerful CPU, more RAM, and, while I was at it, a newer version of BI. Great, now I've got my LPR cams installed and decently tuned in, except that even with an i7-8700 and 16GB of RAM, CodeProject.AI seems to be maxing out this system resource-wise as well. That leads to object analysis sometimes being abandoned, or 20+ second returns to identify an object, if it gets identified at all (on top of identifying everything under the sun as a person/car/truck/van/horse/cat/dog/chair every time ANY motion is detected, but that's a different problem).
More purchases were made, and unfortunately, this is where things get... bad? Complicated. I picked up 2x NVIDIA Tesla P4s and some extra RAM for my main hypervisor server, the Proxmox box that runs 98% of the compute-based stuff I do. From the massive amount of reading I did, this sounded like it'd be a piece of cake! I even found a few guides to follow, none of which have returned anything remotely functional. I'll bullet-point the flustered cluster of what I'm attempting, hopefully to clarify my rambling a bit:
- Blue Iris: i7-8700 desktop, 16GB RAM, Server 2019 running bare metal.
- Hypervisor: 2x E5-2690 v3, 256GB RAM, Proxmox 6.4 (don't judge me!) running bare metal.
- GPU: 2x NVIDIA Tesla P4 headless datacenter cards
- NVIDIA-SMI 460.106.00 Driver Version: 460.106.00 CUDA Version: 11.2
- Proxmox LXC Container: Ubuntu 20.04LTS container, unprivileged = no, nesting = 1
- <ctid>.conf contains the following (note on the device majors right after this block):
lxc.cgroup.devices.allow: c 195:* rwm
lxc.cgroup.devices.allow: c 238:* rwm
lxc.cgroup.devices.allow: c 241:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-caps/nvidia-cap1 none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-caps/nvidia-cap2 none bind,optional,create=file
lxc.apparmor.profile: unconfined
lxc.cgroup.devices.allow: a
lxc.cap.drop:
lxc.mount.auto:
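For what it's worth, my understanding is that the c 195/238/241 allows above are supposed to line up with the device majors the host actually hands out (195 for the /dev/nvidia* nodes is standard, while the nvidia-uvm and nvidia-caps majors can vary by driver version/boot), which should be checkable on the Proxmox host with something like:
ls -l /dev/nvidia* /dev/nvidia-caps/
# illustrative output only:
# crw-rw-rw- 1 root root 195, 0 ... /dev/nvidia0                  -> allow c 195:*
# crw-rw-rw- 1 root root 238, 0 ... /dev/nvidia-uvm               -> allow c 238:*
# crw-rw-rw- 1 root root 241, 1 ... /dev/nvidia-caps/nvidia-cap1  -> allow c 241:*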
- Inside the container, nvidia-smi shows: NVIDIA-SMI 460.106.00 Driver Version: 460.106.00 CUDA Version: 11.2 (driver installed without the kernel module)
- Docker version 24.0.6, build ed223bc (installed using instructions found here)
- Running docker run -it --rm --gpus all ubuntu nvidia-smi returns output identical to what I see on the hypervisor host and in the Docker-host LXC container
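- (I haven't tried this yet, but I assume the same test pinned to a CUDA-versioned base image rather than plain ubuntu would look like the line below, with the caveat that NVIDIA may or may not still publish that exact tag; since the 460 driver tops out at CUDA 11.2, an 11.2-era image seems like the fair test.)
docker run -it --rm --gpus all nvidia/cuda:11.2.2-base-ubuntu20.04 nvidia-smi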
- Running nvidia-container-cli info returns the following:
nvidia-container-cli info
NVRM version: 460.106.00
CUDA version: 11.2
Device Index: 0
Device Minor: 0
Model: Tesla P4
Brand: Tesla
GPU UUID: GPU-83729f44-3fb8-b4ed-2efb-656e152d3d12
Bus Location: 00000000:82:00.0
Architecture: 6.1
Device Index: 1
Device Minor: 1
Model: Tesla P4
Brand: Tesla
GPU UUID: GPU-eb6187de-dff9-83e5-ce33-140d9e466b12
Bus Location: 00000000:83:00.0
Architecture: 6.1
- Attempting to run docker run --name CodeProject.AI -d -p 32168:32168 --gpus all codeproject/ai-server:gpu returns:
- Error response from daemon: Cannot restart container CP.AI: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.7, please update your driver to a newer version, or use an earlier cuda container: unknown
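- If I'm reading that last line right, the NVIDIA container runtime is enforcing a cuda>=11.7 requirement baked into the :gpu image against my 460 driver, which only advertises CUDA 11.2. I'm assuming (haven't verified) the requirement comes from an NVIDIA_REQUIRE_CUDA environment variable in the image, which should be visible with something like:
docker image inspect codeproject/ai-server:gpu | grep -i NVIDIA_REQUIRE
So as far as I can tell the question is whether I need a driver new enough to advertise CUDA 11.7+, an older/different image tag, or something else I'm not seeing.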
- Purpose: From the documentation I've read, you can take a single GPU and share it across multiple LXC containers for distributed GPU compute loads (a rough sketch of what I mean is just below this list). Ideally, I'd like Plex, CP.AI, and a few other things to share a pair of GPUs, though right now I'd be thrilled beyond belief if I could get CP.AI to even START on this damn thing!
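- To be clear, by "share" I don't mean full PCIe passthrough to each guest; as I understand it the kernel driver stays on the Proxmox host and each container just gets the same device nodes bind-mounted in, so (hypothetically) the Plex and CP.AI containers would each carry the same sort of lines in their confs:
# illustrative only - same idea in /etc/pve/lxc/<plex_ctid>.conf and /etc/pve/lxc/<cpai_ctid>.conf
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file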
What in the hell am I missing? At this point I've got no less than fifteen hours invested in this project and have only moderately progressed towards my goal of getting something even functional, much less actually improving identification times. I've uninstalled and reinstalled the driver/CUDA on the bare-metal hypervisor no fewer than ten times, done countless reboots, and deleted and recreated LXC containers probably two dozen times... Plex Media Server runs in a dedicated 20.04 LTS LXC container and has no problems at all utilizing the GPU for transcoding. I'd just LOVE to take one of the damn P4s out of the server and install it in the Windows PC running Blue Iris, but one of these cards won't physically fit in the chassis of my BI PC, nor do they have any active cooling, as they're intended for the kind of datacenter chassis machines I've got them installed in.
Any suggestions would be endlessly appreciated!!