Yes, you can, but there is no way it takes only an hour.
You need to collect and label at least 100 images for each class you want to detect from your CCTV feed, and the training pictures have to show people in different positions (crouching, etc.). Your first model will then give you a lot of...
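To keep track of the "at least 100 images per class" target while labeling, a quick count script can help. This is only a rough sketch: it assumes YOLO-style label files (one `.txt` per image, each line starting with an integer class id), which is my assumption and not something stated above. The function names and the `labels/` folder are made up for illustration.

```python
from collections import Counter
from pathlib import Path

MIN_IMAGES_PER_CLASS = 100  # threshold from the answer above


def classes_in_label_text(text):
    """Return the set of class ids found in one label file.

    Assumes YOLO-style lines: "<class_id> <x> <y> <w> <h>".
    """
    ids = set()
    for line in text.splitlines():
        parts = line.split()
        if parts:  # skip blank lines
            ids.add(int(parts[0]))
    return ids


def images_per_class(label_texts):
    """Count how many images contain each class (one count per image)."""
    counts = Counter()
    for text in label_texts:
        counts.update(classes_in_label_text(text))
    return counts


def underrepresented(counts, class_ids, minimum=MIN_IMAGES_PER_CLASS):
    """List the class ids that still have fewer than `minimum` images."""
    return sorted(c for c in class_ids if counts.get(c, 0) < minimum)


def scan_label_dir(label_dir):
    """Read every .txt label file in a (hypothetical) label directory."""
    paths = Path(label_dir).glob("*.txt")
    return images_per_class(p.read_text() for p in paths)
```

For example, `underrepresented(scan_label_dir("labels/"), class_ids=[0, 1])` would list the classes that still need more labeled images. Note that this only checks counts; it says nothing about whether the poses are varied enough.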