I’m interested, but when it comes down to it, the TPU is best at running a single model. There is significant overhead in swapping the internal cache from one model to another, so ideally we would have one model that handles all of the object detection and run the remaining models on the CPU or GPU. The model is called “You Only Look Once” for a reason, after all, and by default it detects 80 (!) classes.
Some examples:
YOLOv8 small is roughly 12 MB in size and runs at 87 ms per inference on one TPU on my computer. My testing doesn't show much speedup from splitting it into TPU segments, so two TPUs each running the whole model should give roughly 2x the inference throughput.
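For reference, the per-inference numbers here come from a simple timing loop along these lines (the model filename is just a placeholder; pycoral's make_interpreter grabs the first Edge TPU it finds):

```python
import time
import numpy as np
from pycoral.utils.edgetpu import make_interpreter

MODEL = "yolov8s_full_integer_quant_edgetpu.tflite"   # placeholder filename

interpreter = make_interpreter(MODEL)   # binds to the first available Edge TPU
interpreter.allocate_tensors()

detail = interpreter.get_input_details()[0]
dummy = np.zeros(detail["shape"], dtype=detail["dtype"])   # blank frame, only for timing

# One warm-up run so the model is resident in the TPU's cache before measuring.
interpreter.set_tensor(detail["index"], dummy)
interpreter.invoke()

runs = 100
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(detail["index"], dummy)
    interpreter.invoke()
print(f"{(time.perf_counter() - start) / runs * 1000:.1f} ms per inference")
```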
YOLOv8 medium is roughly 24 MB in size and runs at 287 ms per inference on one TPU on my computer. It seems to run optimally when segmented across 2-3 TPUs. Roughly these timings for each additional TPU (see the pipelining sketch after this list):
- 2 TPUs: 113.9 ms per inference
- 3 TPUs: 68.6 ms per inference
- 4 TPUs: 52.7 ms per inference
- 5 TPUs: 42.8 ms per inference
- 6 TPUs: 32.1 ms per inference
- 7 TPUs: 29.0 ms per inference
- 8 TPUs: 26.2 ms per inference
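The multi-TPU numbers come from compiling the model into segments (edgetpu_compiler --num_segments=N) and giving each segment its own TPU via pycoral's pipelined runner. Roughly like this, with hypothetical segment filenames and assuming the pycoral 2.x push/pop semantics:

```python
import numpy as np
from pycoral.pipeline.pipelined_model_runner import PipelinedModelRunner
from pycoral.utils.edgetpu import make_interpreter

# Hypothetical output of: edgetpu_compiler --num_segments=3 yolov8m_full_integer_quant.tflite
SEGMENTS = [
    "yolov8m_full_integer_quant_segment_0_of_3_edgetpu.tflite",
    "yolov8m_full_integer_quant_segment_1_of_3_edgetpu.tflite",
    "yolov8m_full_integer_quant_segment_2_of_3_edgetpu.tflite",
]

# One interpreter per segment, each pinned to its own TPU (device ":0", ":1", ...).
interpreters = []
for i, path in enumerate(SEGMENTS):
    interpreter = make_interpreter(path, device=f":{i}")
    interpreter.allocate_tensors()
    interpreters.append(interpreter)

runner = PipelinedModelRunner(interpreters)

# Feed the first segment's input; the final segment's outputs come back from pop().
detail = interpreters[0].get_input_details()[0]
frame = np.zeros(detail["shape"], dtype=detail["dtype"])   # blank frame, just to exercise it

runner.push({detail["name"]: frame})
outputs = runner.pop()                 # dict of output tensor name -> numpy array
print({name: arr.shape for name, arr in outputs.items()})

runner.push({})                        # empty push signals end of input to the pipeline
```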
YOLOv8 large is roughly 44 MB in size and runs at 1079 ms per inference on one TPU on my computer. It also seems to run optimally when segmented across 2-3 TPUs. Note that this means most of the model is still running on the CPU, but my testing has shown that additional TPUs are best utilized running copies of those same 2-3 segments rather than further segments (see the sketch after this list). Roughly these timings for each additional TPU:
- 2 TPUs: 168.5 ms per inference
- 3 TPUs: 116.8 ms per inference
- 4 TPUs: 89.9 ms per inference
- 5 TPUs: 77.8 ms per inference
- 6 TPUs: 64.4 ms per inference
- 7 TPUs: 57.4 ms per inference
- 8 TPUs: 51.7 ms per inference
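In other words, with, say, 4 TPUs you seem to get more out of two copies of the same 2-segment pipeline than out of one 4-segment pipeline. A rough sketch of that setup (placeholder filenames and device indices, each replica fed from its own thread):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from pycoral.pipeline.pipelined_model_runner import PipelinedModelRunner
from pycoral.utils.edgetpu import make_interpreter

# Hypothetical 2-segment split of the large model.
SEGMENTS = [
    "yolov8l_full_integer_quant_segment_0_of_2_edgetpu.tflite",
    "yolov8l_full_integer_quant_segment_1_of_2_edgetpu.tflite",
]

def build_pipeline(first_tpu):
    """One pipeline whose segments occupy TPUs first_tpu, first_tpu + 1, ..."""
    interpreters = []
    for i, path in enumerate(SEGMENTS):
        interpreter = make_interpreter(path, device=f":{first_tpu + i}")
        interpreter.allocate_tensors()
        interpreters.append(interpreter)
    return PipelinedModelRunner(interpreters), interpreters[0].get_input_details()[0]

def run_frames(runner, detail, count):
    """Pushes `count` blank frames through one pipeline and collects the outputs."""
    frame = np.zeros(detail["shape"], dtype=detail["dtype"])
    outputs = []
    for _ in range(count):
        runner.push({detail["name"]: frame})
        outputs.append(runner.pop())
    runner.push({})    # empty push tells the pipeline there is no more input
    return outputs

# With 4 TPUs: two identical 2-segment pipelines, each driven by its own thread.
pipelines = [build_pipeline(0), build_pipeline(2)]
with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
    futures = [pool.submit(run_frames, runner, detail, 50) for runner, detail in pipelines]
    done = sum(len(f.result()) for f in futures)
print(f"{done} inferences across {len(pipelines)} pipelines")
```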
@MikeLud1's ipcam-general-v8 is 12 MB in size and runs at 238 ms per inference on one TPU. Interestingly, although it's the same size as YOLOv8 small, it prefers to be split into 2-3 segments. Timing with each additional TPU:
- 2 TPUs: 51.3 ms per inference
- 3 TPUs: 24.2 ms per inference
- 4 TPUs: 19.4 ms per inference
- 5 TPUs: 15.8 ms per inference
- 6 TPUs: 15.2 ms per inference
- 7 TPUs: 12.4 ms per inference
- 8 TPUs: 11.1 ms per inference
You can see that the speedup is often non-linear and doesn't always make sense.