The purpose of this project is to create playlists from the videos in a YouTube channel. I couldn't find a free server that supports YouTube queries, so the app is not online. The code can be found on GitHub.

Here I used a simple bag-of-words (BOW) model for topic modeling. For preprocessing, I removed stop words (the standard English list plus any extra ones set for the channel) and reduced each word to its stem. The relevant code is as follows.

from gensim import corpora
import gensim

from nltk.stem.porter import PorterStemmer 
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
porter = PorterStemmer()

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

import re
from functools import partial

# add any channel-specific stop words to the standard English list
if channel.get('stopwords'):
    extra_w = [x.lower() for x in channel['stopwords'].split()]
    stopw = stop.union(extra_w)
else:
    stopw = stop
new_tokenizer = partial(tokenize_text, stopw)
titles = [new_tokenizer(v['title']) for v in videos]
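
The `tokenize_text` helper is not shown above. Here is a minimal sketch of what such a helper might look like, assuming it lowercases the title, splits on `\w+` (the same pattern given to `RegexpTokenizer`), drops stop words, and stems what is left. The body, the optional `stem` parameter, and the `demo_` names are my assumptions for illustration, not the project's actual code. The sketch uses only the standard library so it runs on its own.

```python
import re
from functools import partial

def tokenize_text(stopwords, text, stem=None):
    """Hypothetical helper: lowercase, tokenize, drop stop words, then stem."""
    # Same word pattern as RegexpTokenizer(r'\w+') in the post.
    tokens = re.findall(r'\w+', text.lower())
    kept = [t for t in tokens if t not in stopwords]
    # `stem` is any callable mapping a word to its stem, e.g. porter.stem.
    return [stem(t) for t in kept] if stem else kept

# Bind the stop-word set first, so the resulting function takes only the text.
demo_tokenizer = partial(tokenize_text, {'the', 'a', 'to'})
demo_tokens = demo_tokenizer('How to Train a Model')
```

With the objects defined earlier, the stemmer could be bound the same way, for example `partial(tokenize_text, stopw, stem=porter.stem)`.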

Gensim makes it very easy to perform latent Dirichlet allocation (LDA). The relevant code is as follows.

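def do_LDA(titles, num_topics):
    # map each token to an integer id, then encode every title as a bag of words
    dictionary = corpora.Dictionary(titles)
    corpus = [dictionary.doc2bow(title) for title in titles]
    # only report topics at least as likely as a uniform guess
    # (1.0 avoids integer division under Python 2)
    threshold = 1.0 / num_topics
    ldamodel = gensim.models.ldamodel.LdaModel(corpus,
                num_topics=num_topics, id2word=dictionary,
                passes=20, minimum_probability=threshold)
    # assign each title its most probable (topic_id, probability) pair
    lda_corpus = [max(x, key=lambda y: y[1])
                        for x in ldamodel[corpus]]
    return ldamodel.show_topics(num_topics=num_topics,
                                num_words=4), lda_corpus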
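
To turn the topic assignments into playlists, the titles can be grouped by their most probable topic. The following is a rough sketch of that step, not the project's actual code: `dominant_topic`, `group_by_topic`, and the demo data are all hypothetical names and made-up values. It works on per-title lists of `(topic_id, probability)` pairs like the ones `ldamodel[corpus]` yields. It also guards against an empty list, which I assume could happen in edge cases when no topic clears `minimum_probability` (for example through floating-point rounding when all topics are nearly equally likely).

```python
from collections import defaultdict

def dominant_topic(topic_probs, default=-1):
    """Return the id of the most probable topic, or `default` if the list is empty."""
    if not topic_probs:
        return default
    return max(topic_probs, key=lambda pair: pair[1])[0]

def group_by_topic(titles, doc_topics):
    """Group titles into playlists keyed by their dominant topic id."""
    playlists = defaultdict(list)
    for title, topic_probs in zip(titles, doc_topics):
        playlists[dominant_topic(topic_probs)].append(title)
    return dict(playlists)

# Made-up data standing in for real titles and LDA output.
demo_titles = ['intro to python', 'pasta recipe', 'python decorators', 'untitled']
demo_doc_topics = [[(0, 0.9), (1, 0.1)], [(1, 0.8)], [(0, 0.6), (1, 0.4)], []]
demo_playlists = group_by_topic(demo_titles, demo_doc_topics)
```

Titles with no usable topic end up under the key `-1`, so they can be reviewed by hand instead of crashing the `max` call.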