[GSOC-18 Proposal] Automated video segmentation
#1
Goals: Given a video, break it down into a series of component parts. Mark these parts using the chapters/bookmarks system.
Parts could include things like:
  • Prologue
  • Intro
  • Part 1
  • Titlecard
  • Part 2
  • Outro
  • Credits
Here, I'm operating on the assumption that these parts are fairly consistent across all shows, i.e. that 99% of shows follow the prologue -> intro -> part 1 -> titlecard -> part 2 -> outro pattern, or something similar (which seems true from what I've seen, but then again I only really watch anime, so what do I know).

Stretch goal: Build a platform making it easier to perform this type of analysis. This could be useful if someone wants to build a system to perform genre detection, for example.

Benefits: This could help anyone with video files that aren't properly segmented, making it easier to navigate within the video. A few example use cases:
  • Skipping intros and outros
  • Jumping to a song in a musical

What does it touch in Kodi:
  • Reading the audio and video of the file
  • Bookmarks and chapters
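As a rough sketch of the hand-off to the chapters system, suppose the model emits (start, end, label) segments in seconds; these could be converted into chapter marks. The timestamp format below is Matroska-style (HH:MM:SS.mmm) and purely illustrative; the actual Kodi-side interface would need investigation.

```python
def to_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm (Matroska-style chapter timestamp)."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def segments_to_chapters(segments):
    """Turn hypothetical (start_s, end_s, label) model output into
    (timestamp, chapter name) pairs, one chapter per detected segment."""
    return [(to_timestamp(start), label.title()) for start, end, label in segments]

chapters = segments_to_chapters([
    (0.0, 90.0, "prologue"),
    (90.0, 180.0, "intro"),
    (180.0, 1290.0, "part 1"),
])
for ts, name in chapters:
    print(ts, name)  # e.g. 00:01:30.000 Intro
```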

Requirements: Machine learning knowledge, Python and C++, and any computer with a decent Nvidia GPU (or a cloud equivalent).

Possible mentors: Razze

Methods: This project would be implemented through a deep learning model, specifically some combination of convolutional networks and recurrent networks. While the specific libraries and architectures need (much) more thought, here are a few initial suggestions:

Libraries (in order of preference):
  • TensorFlow/Keras ((+) standard and easy to use; (-) slow, requires an extra dependency)
  • PyTorch
  • CNTK
Approach: The primary factors in deciding on an approach are:
  • The ratio of decisions made by the model versus by heuristics (our code).
  • The required granularity and accuracy.
  • The available data (copyright issues and all that).
  • The number of classes.


Here's one basic end-to-end ML approach:

We model the task as classification, with the goal of assigning each moment to one of a set of classes.
The model is a basic recurrent (LSTM) model that, for each time step, takes in a frame of video and a sample of audio and produces a softmax probability distribution over a set of classes, one class per part type.
The output is then smoothed (possibly with an HMM) to get the final breakdown.
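The HMM smoothing step could look something like the following sketch: each part type is an HMM state, per-frame softmax outputs are the emission probabilities, and a hand-tuned "stay" probability penalizes label switches. The transition structure here is an assumption, not a tuned design; Viterbi decoding then recovers the most likely label sequence.

```python
import numpy as np

def viterbi_smooth(probs, stay=0.95):
    """Smooth per-frame class probabilities with a simple HMM:
    staying in the same class has probability `stay`, switching is
    uniform over the other classes. `probs` is a (T, K) array of
    softmax outputs; returns the most likely label sequence."""
    T, K = probs.shape
    trans = np.full((K, K), (1 - stay) / (K - 1))
    np.fill_diagonal(trans, stay)
    log_trans = np.log(trans)
    log_emit = np.log(probs + 1e-12)  # avoid log(0)

    delta = np.zeros((T, K))          # best log-score ending in each state
    back = np.zeros((T, K), dtype=int)  # backpointers for path recovery
    delta[0] = log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # [prev state, cur state]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]

    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Noisy per-frame output: mostly class 0, one spurious frame favoring class 1.
noisy = np.array([[0.9, 0.1]] * 3 + [[0.4, 0.6]] + [[0.9, 0.1]] * 3)
print(viterbi_smooth(noisy))  # → [0 0 0 0 0 0 0], the outlier is smoothed away
```

The point of the smoothing is visible in the example: a raw per-frame argmax would flip to class 1 for the single noisy frame, while the HMM keeps the segmentation contiguous.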

Here's a slightly less ML-heavy approach:

We model the task as edge detection, with the goal of identifying boundaries between segments.
The model is again a recurrent network, but this time it has a single output: the likelihood that the current moment is an edge. The network is run across the video, producing a time series of probabilities. This series is smoothed, segment boundaries are extracted, and each segment is then run through a (convolutional) classifier to determine its type.
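The boundary-extraction step of this second approach could be as simple as peak picking on the probability series: keep local maxima above a threshold, suppressing peaks that fall too close to a stronger one. The threshold and minimum-gap values below are illustrative guesses, not tuned parameters.

```python
import numpy as np

def extract_boundaries(edge_probs, threshold=0.5, min_gap=30):
    """Pick segment boundaries from a per-frame edge-probability series:
    keep local maxima above `threshold` that are at least `min_gap`
    frames away from any stronger accepted peak."""
    candidates = [
        t for t in range(1, len(edge_probs) - 1)
        if edge_probs[t] >= threshold
        and edge_probs[t] >= edge_probs[t - 1]
        and edge_probs[t] >= edge_probs[t + 1]
    ]
    boundaries = []
    # Greedily accept peaks from strongest to weakest.
    for t in sorted(candidates, key=lambda t: -edge_probs[t]):
        if all(abs(t - b) >= min_gap for b in boundaries):
            boundaries.append(t)
    return sorted(boundaries)

probs = np.zeros(300)
probs[100] = 0.9   # strong edge
probs[102] = 0.7   # echo of the same cut, suppressed by min_gap
probs[250] = 0.8   # second edge
print(extract_boundaries(probs))  # → [100, 250]
```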


The first part of this project would mostly focus on identifying a suitable model and approach.
#2
Also, I'm currently working on getting my environment up and running. It's a little complicated since I'm using a non-standard OS (Solus). Honestly, I'm thinking of just giving up and building it in my Ubuntu VM using a shared folder.
#3
Nice to have you here :)

Sounds like a very interesting approach.

(2018-03-10, 19:28)hybrideagle Wrote: Also, I'm currently working on getting my environment up and running. It's a little complicated since I'm using a non-standard OS(Solus). Honestly, I'm thinking of just giving up and building it in my Ubuntu VM using a shared folder.
  
Totally valid approach and shouldn't hurt much :)
