The purpose of this project is to create playlists from the videos in a YouTube channel. I could not find a free server that supports YouTube queries, so this app is not online. The code can be found on GitHub.
Here I used a simple bag-of-words (BOW) model for topic modeling. For preprocessing, I removed English stop words (plus any extra stop words specified for the channel) and reduced the remaining words to their stems. The relevant code is as follows.
import re
from functools import partial

import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
porter = PorterStemmer()
stop = set(stopwords.words('english'))

# Add any channel-specific stop words to the standard English list.
if channel.get('stopwords'):
    extra_w = [x.lower() for x in channel['stopwords'].split()]
    stopw = stop.union(extra_w)
else:
    stopw = stop

# tokenize_text is not shown here; the stop-word set is bound as its first argument.
new_tokenizer = partial(tokenize_text, stopw)
titles = [new_tokenizer(v['title']) for v in videos]
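The snippet above calls `tokenize_text` without showing it. As a rough idea of what such a helper might look like, here is a minimal sketch. It is an assumption, not the project's actual code: the argument order is inferred from `partial(tokenize_text, stopw)`, the order of steps (stop-word removal before stemming) is a guess, and `crude_stem` is a deliberately simple stand-in so the example runs without NLTK. In the setting of this post you would pass `stem=porter.stem` instead.

```python
import re
from functools import partial


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a couple of common suffixes."""
    for suffix in ("ing", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize_text(stopw, text, stem=crude_stem):
    """Lowercase the text, split on non-word characters, drop stop words, then stem."""
    tokens = re.findall(r"\w+", text.lower())
    return [stem(t) for t in tokens if t not in stopw]


# Bind the stop-word set first, mirroring partial(tokenize_text, stopw).
example_tokenizer = partial(tokenize_text, {"the", "a", "to"})
print(example_tokenizer("Learning the Python Basics"))
```

Each title becomes a short list of stemmed tokens, which is the input format that the topic-modeling step expects.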
Gensim makes it very easy to perform latent Dirichlet allocation (LDA). The relevant code is as follows.
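def do_LDA(titles, num_topics):
    # Map each token to an integer id, then turn each title into a bag-of-words vector.
    dictionary = corpora.Dictionary(titles)
    corpus = [dictionary.doc2bow(title) for title in titles]
    # Only keep topics at least as probable as a uniform guess.
    threshold = 1.0 / num_topics
    ldamodel = gensim.models.ldamodel.LdaModel(
        corpus, num_topics=num_topics, id2word=dictionary,
        passes=20, minimum_probability=threshold)
    # Assign each title its most probable topic. Each item of ldamodel[corpus]
    # is a list of (topic_id, probability) pairs, so compare on the probability.
    lda_corpus = [max(x, key=lambda y: y[1]) for x in ldamodel[corpus]]
    return ldamodel.show_topics(num_topics=num_topics, num_words=4), lda_corpus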
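Once every title has a topic assignment, turning the result into playlists is mostly a grouping step. The sketch below shows one possible way to do it, assuming `lda_corpus` holds one `(topic_id, probability)` pair per video in the same order as `videos`. The helper name `group_by_topic` and the sample data are made up for illustration and are not part of the project.

```python
from collections import defaultdict


def group_by_topic(videos, lda_corpus):
    """Group video titles by their assigned topic id (hypothetical helper)."""
    playlists = defaultdict(list)
    for video, (topic_id, _prob) in zip(videos, lda_corpus):
        playlists[topic_id].append(video["title"])
    return dict(playlists)


# Made-up sample data in the shape described above.
sample_videos = [
    {"title": "Intro to pandas"},
    {"title": "Sourdough basics"},
    {"title": "pandas groupby"},
]
sample_assignments = [(0, 0.91), (1, 0.88), (0, 0.79)]

print(group_by_topic(sample_videos, sample_assignments))
```

Each key in the returned dictionary is a topic id, and the top words returned by `show_topics` could serve as a rough label for the corresponding playlist.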