Rotobot OpenFX Plugin

Rotobot was used to create the video layer of the cat, enabling the addition of the computer-generated imagery of the skull. This was done by specifying “cat”, with some further refinement of the edge by the machine learning model, at about 5 seconds of computation per frame on a CPU.

Production of video layers is done either by planning in advance with a chroma key (a green or blue screen), or by rotoscoping, which involves tracing around the edge of the feature you want to isolate as a video layer, frame by frame.

For footage at 24 frames per second, it can take more than a week to bring 300 frames to finished quality. This requires a skilled labour force, typically billed at more than USD 100 a day. During this time, downstream departments are blocked waiting for the video layers.

How amazing would it be to just say “I want the people on a video layer!” and that layer would appear after some computation?

Rotobot OpenFX doesn’t take verbal commands yet. But it does the digital content creation equivalent of that request.

Your task is to bring the footage into your digital content creation package, convert it to the correct colour space and an appropriate resolution, and ask for the types of video layers you want by ticking the box next to the label of person, car, cat, and so on. This task could also be automated as part of the media ingest process, as in the sketch below.
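
For instance, in a pipeline built around Nuke, the setup could be scripted in a few lines of Python. This is a minimal sketch, assuming a Nuke-based ingest: the node class name “Rotobot” and the per-class knob names used here are hypothetical, so check the actual labels in your own install.

```python
import nuke

# Read the plate and set the working colour space; Read's "colorspace",
# "first" and "last" knobs are standard Nuke.
read = nuke.nodes.Read(file="/shots/sh010/plate.####.exr", first=1001, last=1100)
read["colorspace"].setValue("linear")

# Attach the Rotobot node and tick the classes we want as video layers.
# "Rotobot" and the knob names "person", "car", "cat" are assumptions.
roto = nuke.createNode("Rotobot")
roto.setInput(0, read)
for label in ("person", "car", "cat"):
    roto[label].setValue(True)
```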

Then let the Rotobot OpenFX plugin process the footage to a quality that is close enough to unblock processes that would otherwise wait for the highest-quality result, or add cost by rushing a placeholder.

The speed of the computation is limited only by the scale of the computational resources applied and the number of concurrent licenses purchased. The software is accelerated by NVIDIA CUDA computation on the GPUs that are routinely used in Media and Entertainment production.

If you have five times as many licenses and computers, you will get the result back in one fifth of the time, using the same shared pool of computation that a company already has in place for the rest of its compositing tasks. The same holds at one hundred times the licenses and computation: you can process in near realtime.
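
As a back-of-the-envelope check, assuming frames are processed independently and throughput scales linearly with the licence and machine count (the per-frame CPU figure is quoted later in this post):

```python
# Rough wall-clock estimate for a 300-frame shot.
frames = 300
seconds_per_frame = 20  # mid-range CPU figure from later in this post

def wall_clock_minutes(workers: int) -> float:
    return frames * seconds_per_frame / workers / 60.0

print(wall_clock_minutes(1))    # 100.0 -- one machine, one licence
print(wall_clock_minutes(5))    # 20.0  -- five machines, one fifth the time
print(wall_clock_minutes(100))  # 1.0   -- one hundred machines, near realtime
```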

None of your data ever leaves your local network; the licensing and the software are installed locally. If you engage your own trusted cloud provider, the same configuration works equally well.

The two components are the plugin for the compositing package and the license server. Many studios have been up and running within an hour of receiving the installer and the software license.

That is great for “person”, but what else can you isolate?

“Person”, “Bicycle”, “Car”, “Motorcycle”, “Airplane”, “Bus”, “Train”, “Truck”, “Boat”, “Traffic Light”, “Fire Hydrant”, “Stop Sign”, “Parking Meter”, “Bench”, “Bird”, “Cat”, “Dog”, “Horse”, “Sheep”, “Cow”, “Elephant”, “Bear”, “Zebra”, “Giraffe”, “Backpack”, “Umbrella”, “Handbag”, “Tie”, “Suitcase”, “Frisbee”, “Skis”, “Snowboard”, “Sports Ball”, “Kite”, “Baseball Bat”, “Baseball Glove”, “Skateboard”, “Surfboard”, “Tennis Racquet”, “Bottle”, “Wine Glass”, “Cup”, “Fork”, “Knife”, “Spoon”, “Bowl”, “Banana”, “Apple”, “Sandwich”, “Orange”, “Broccoli”, “Carrot”, “Hotdog”, “Pizza”, “Donut”, “Cake”, “Chair”, “Couch”, “Potted Plant”, “Bed”, “Dining Table”, “Toilet”, “TV”, “Laptop”, “Mouse”, “Remote”, “Keyboard”, “Cell Phone”, “Microwave”, “Oven”, “Toaster”, “Sink”, “Refrigerator”, “Book”, “Clock”, “Vase”, “Scissors”, “Teddy Bear”, “Hairdryer”, “Toothbrush”, “Banner”, “Blanket”

The list for segmentation is a little more limited, at 20 classes:

“aeroplane”, “bicycle”, “bird”, “boat”, “bottle”, “bus”, “car”, “cat”, “chair”, “cow”, “dining table”, “dog”, “horse”, “motorbike”, “person”, “potted plant”, “sheep”, “sofa”, “train”, “tv”

So how do you manage all of that while keeping the graphical user interface simple for making masks?

The current approach is to provide three panels, one each for the red, green, and blue channels of the output mask.

Simply choose the categories and pixel coordinates that you want in red, green, or blue.

So you could choose truck, car, train, and bus in red, person in green, and bird in blue. Then, by shuffling the channels out, you can use the result to mask an effect, such as desaturating any background that is not a person or one of the vehicle categories.
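
Outside the compositor, the same idea looks like this as a short numpy sketch. This is an illustration of the shuffle-and-mask pattern, not part of the plugin: the function name and the channel assignments simply follow the example above. In practice you would build this with Shuffle and Merge nodes.

```python
import numpy as np

def desaturate_background(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Desaturate pixels not covered by the red (vehicles) or green (person)
    channels of a Rotobot-style RGB mask. Both arrays are float, shape (H, W, 3)."""
    keep = np.clip(mask[..., 0] + mask[..., 1], 0.0, 1.0)  # red + green coverage
    grey = frame.mean(axis=-1, keepdims=True)              # naive luminance
    return frame * keep[..., None] + grey * (1.0 - keep[..., None])
```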

Typically, artists would spend hours creating masks by hand, tracing around footage at 24, 29.97, or 30 frames per second. With Rotobot, a frame of footage can be processed in as little as 5-20 seconds depending on your hardware (a 64-bit processor and 2 GB of RAM are required; no GPU is required), though it may take as long as two to three minutes per frame for more complete accuracy. With GPU acceleration, throughput can be more like 0.3 frames per second.

The accuracy of the masks is limited by the quality of the deep learning model behind Rotobot. The developers at Kognat are working hard to improve both the accuracy and the resolution.

Rotobot has been tested at most of the major visual effects studios around the world and is available for 64-bit operating systems: Linux, Windows, and macOS.