How to harness BERT to create a super-targeted topic cluster content strategy for SEO – Operad's Journey & Code
- SEO
- January 26, 2020
That infamous meeting that started it all
This is a story about how our very own SEO R&D team at Operad figured out it was time to revisit the way we build content strategies, and what has happened since that moment.
As SEOs, you may be familiar with a scenario similar to this one:
It is late 2018, a year that will always be remembered for the BERT/embedding/deep learning Google updates. We are sitting on the 28th floor of the tallest building in town, holding the last quarterly meeting of the year with one of our most strategic enterprise clients, a leading company in a very (very) technical and challenging vertical.
The meeting was, well... awkward. The content plan we had been working on all year showed no signs of achieving the annual KPIs, the year was almost over, and we had nothing to show for our work.
We felt like the universe had just pulled the ice-bucket challenge on us (only we weren’t expecting it).
Along came topic clusters
On the way back to our office, it was already clear to us all that it was time for a change.
The first thing we learned in the following weeks is that topic clustering and understanding Google's NLP are becoming more and more important to anyone aiming to create content that will perform well in search.
That led us into a year full of topic cluster creation (inspired by HubSpot's wonderful take). We created topic clusters for many of our clients, and the results were good, to say the least. Topic clusters proved to be a very effective method for building a content strategy, and they helped our clients drive real organic growth.
Our client from this story, for example, has experienced a 300% increase in organic traffic to their website content (yes, they decided to stick with us :) ).
At the same time, we dived deep into everything NLP (thanks to people like Briggsby, AJ Kohn, Bill Slawski, Kevin Indig, Cindy Krum and others who made progress in this field): entities, salience, query syntax/intent, architecture (TIPR).
We studied and incorporated every piece of information we identified as related to NLP.
Using NLP and Machine Learning
Fast forward to early 2019 – diving into the real deal – machine learning and python.
This is the year Operad's SEOs started to roll up their sleeves and understand more about what's going on down the rabbit hole.
Manually building and tailoring topic clusters in fields we are definitely not experts in was super labor intensive. We knew all along that we had to make the process more efficient and data-driven. It was also important for us to better understand what was going on, what exactly NLP is, and how we could use it for our needs.
Text mining for SEO suddenly appeared on our radar; we attended NLP meetups and took some courses.
Following Rory Truesdale's SEJ article about how to mine the SERPs for content insights, we started to use LDA topic modeling, the Google Cloud Natural Language API and scraping, aiming to base our topic cluster method on machine learning.
This resulted in more studying and experimenting and that was the moment we got lucky – again.
Introducing BERT
Reading through a very long article about recent document embedding techniques, we found it: a super clever BERT sentence clustering application, almost as if it was tailor-made for our needs. A bit more playing around, and it became fully operational.
Having been used to clustering that is not based on context (word2vec, for example), it was very refreshing to see how BERT was able to cluster sentences that had similar meanings.
A little over a year after that awkward meeting, we're finally starting to feel we understand what our content creation method should look like, and how text mining, SEO and content strategy all fit together in one happy family.
Today we are using these tools on some of our current projects and working hard on finalizing them for proper use. There is still a long way to go, but at least now we know where we want to go.
Using BERT – the code:
(Warning: this is the part where the article gets a bit more technical)
Setup
Here is the code for the tool we recently shared, “Topic clusters and text mining with BERT”:
First, let’s install some packages:
!pip install cufflinks
!pip install torch==1.3.1+cpu torchvision==0.4.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
!jupyter nbextension enable --py --sys-prefix widgetsnbextension
!pip install -U sentence-transformers
!git clone https://github.com/UKPLab/sentence-transformers.git
!pip install google-cloud-language
!pip install goose3
!pip install plotly==4.4.1

import nltk
nltk.download('punkt')
Now we will import the modules used for plotting, calculating and operating the various parts of the program. At the same time, we'll load the spaCy language model (used for text preprocessing in the keyword extraction) and 'bert-base-nli-stsb-mean-tokens', a model heavily pre-trained and fine-tuned especially for sentence similarity and clustering tasks.
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import re
from goose3 import Goose
import pandas as pd
from nltk.tokenize import sent_tokenize
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import plotly
import plotly.graph_objects as go
from google.cloud import language
from google.oauth2 import service_account
from google.cloud.language import enums
from google.cloud.language import types
import os
import argparse
from collections import OrderedDict
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt

nlp = spacy.load('en_core_web_sm')
embedder = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
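Before moving on, a quick sanity check helps show what these sentence embeddings give us. This is a minimal sketch (the example sentences are made up, and cosine_similarity from scikit-learn is an addition to the imports above):

from sklearn.metrics.pairwise import cosine_similarity

sentences = ['How do I improve my organic traffic?',
             'Ways to get more visitors from search engines',
             'Our office is on the 28th floor']
vectors = embedder.encode(sentences)

# The two search-related sentences should score noticeably higher than the unrelated one
print(cosine_similarity([vectors[0]], vectors[1:]))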
This is where you can plug in any option available in this framework, from BERT base to BERT large, RoBERTa, DistilBERT or even XLNet. You can also further fine-tune it for a specific domain if you want to, but that does not seem necessary, since the UKPLab team has already trained and fine-tuned these models extensively for this exact task.
All of the details can be found in the following repository – https://github.com/UKPLab/sentence-transformers
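Swapping the model is a one-line change. For example (a small sketch; 'roberta-base-nli-stsb-mean-tokens' is one of the pre-trained model names listed in that repository):

# Use a RoBERTa-based sentence embedding model instead of the BERT-based one
embedder = SentenceTransformer('roberta-base-nli-stsb-mean-tokens')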
Summarizing Topics into Words
The next section of the code was added "on the go" to summarize each topic into a few words, so that every topic can be understood at a glance. We think it might be better to eventually replace this section with Named Entity Recognition (NER) and entity knowledge/connection graphs, for example:
https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/ or Open Semantic Search / Neo4j
The section we are referring to is the "class TextRank4Keyword():" section. We have not yet sat down and studied exactly what it does, but what we CAN say is that it is essentially "PageRank for keywords in text" (explained in the article below, with a toy sketch of the idea after it):
https://towardsdatascience.com/textrank-for-keyword-extraction-by-python-c0bae21bcec0
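To give a rough idea of the principle, here is a toy sketch of TextRank-style keyword extraction (not the TextRank4Keyword class itself; it assumes the networkx package is installed): build a co-occurrence graph of words that appear near each other, then rank the words with PageRank.

import networkx as nx

def toy_textrank(tokens, window=4, top_n=5):
    """Rank words by PageRank over a co-occurrence graph (toy version)."""
    graph = nx.Graph()
    for i, word in enumerate(tokens):
        # Connect each word to the words that follow it within the sliding window
        for other in tokens[i + 1:i + window]:
            if word != other:
                graph.add_edge(word, other)
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(toy_textrank("topic clusters help organize content around a core topic".split()))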
Next, we adapted Paul Shapiro's and JR Oakes' work on the Google Cloud Natural Language API, so we can get some perspective on entities and salience, which are key factors in our goal of creating content that is dense, rich and, most importantly, relevant to the topics.
Make sure you get proper credentials and set up the service-account JSON key file as shown in this guide:
https://opensource.com/article/19/7/python-google-natural-language-api
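The functions below assume a `client` object for the Natural Language API has already been created. A minimal sketch of that setup, assuming the key file has been saved locally as service.json (the filename is just an example):

import os
from google.cloud import language

# Point the client library at the downloaded service-account key (example filename)
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'service.json'
client = language.LanguageServiceClient()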
These are the functions that send the text to Google’s NLP cloud and return the relevant information.
def analyze_entities(text, encoding='UTF32'):
    document = language.types.Document(content=text, language='en', type='PLAIN_TEXT')
    response = client.analyze_entities(document=document, encoding_type='UTF32')
    return response

def entity_create_list(text):
    entity_list = []
    entities = analyze_entities(text)
    for x in entities.entities:
        entity_list.append({
            "name": x.name,
            "salience": x.salience,
            "entity_type": str(enums.Entity.Type(x.type)).strip('Type.')})
    return entity_list

def create_score(x, y):
    return x * y
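For context, here is roughly how those functions could be called on a piece of text (a hedged sketch; the exact way the notebook combines salience with counts via create_score is not shown here):

entities = entity_create_list("BERT helps Google understand natural language in search queries.")
entities_df = pd.DataFrame(entities)

# Sort by salience so the most prominent entities appear first
print(entities_df.sort_values('salience', ascending=False).head())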
Input
Here we finally start to work: the program asks whether to scrape a URL or read a text file; the text is then broken down into sentences and lightly cleaned.
Further improvements will include support for multiple URLs (in a loop), for a more "SERP analysis" style of functionality; a rough sketch of that idea follows the code block below.
use_url = input('do you want to use URL ? enter yes/Yes')
just_domain = input('insert project name ')

if use_url == 'yes' or use_url == 'Yes':
    url = input('insert url ')
    g = Goose({'browser_user_agent': 'Mozilla'})
    article = g.extract(url=url)
    df = article.cleaned_text
    df = sent_tokenize(df)
    print('getting text')
else:
    print('paste the text into the text.txt file located in the folder and press enter when ready')
    ready = input('press enter to continue')
    file = open('text.txt', 'r', encoding="utf8")
    df = file.read()
    df = sent_tokenize(df)
    url = just_domain

df1 = df
df_len = ', '.join(df1)
df_len = len(df_len)
print('document length: ' + str(df_len))

# Light cleanup: strip bracketed references, newlines and bullet characters,
# then drop very short sentences
df = [re.sub(r'\[[^()]*\]', '', i) for i in df]
df = [re.sub('\n', '', i) for i in df]
df = [re.sub('• ', '', i) for i in df]
df = [i for i in df if len(i) >= 50]
print('processing text')
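As a rough idea of that future improvement, a loop over several URLs could look something like this (a sketch, assuming all pages' sentences are simply pooled into one corpus; the urls list is hypothetical):

urls = ['https://example.com/page-1', 'https://example.com/page-2']  # hypothetical list

g = Goose({'browser_user_agent': 'Mozilla'})
df = []
for url in urls:
    article = g.extract(url=url)
    # Pool the sentences from every page into one corpus for clustering
    df.extend(sent_tokenize(article.cleaned_text))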
Clustering
And now for some hardcore ML algorithms.
First, we start with the embedder. It takes our sentences and uses the BERT model to give each sentence a vector of 768(!) dimensions, based on its own and its neighbors' context and meaning.
After we have a vector representation of each sentence, we would like to see which sentences are closest to each other. The issue is that visualizing this is only practical in a 2-dimensional space.
This situation is very common in ML, and a very popular algorithm called t-SNE is often used to reduce the dimensions from hundreds down to 2-3.
Using t-SNE gives us the same sentences in just 2 dimensions (an x-axis and a y-axis), while keeping the relative distances between them as close as possible to what they were before.
This allows us to use another popular algorithm, K-means, which does the actual clustering.
K-means, like the algorithms above, uses some very clever statistics involving distances to cluster centroids to work out what the clusters are. It receives the desired number of clusters (K) and in return gives each sentence a label (its cluster number).
There are other algorithms that automatically decide how many clusters is optimal in terms of coherence (or other criteria), but we feel that choosing manually can be useful, since the human eye and intuition are often the best judges.
However, just in case, we also use the elbow method, a heuristic that provides an initial sense of a reasonable number of topics.
corpus_embeddings = embedder.encode(df)
print('creating embeddings')

# Initialize t-SNE and reduce the embeddings to 2 dimensions
tsne = TSNE(n_components=2, init='random', random_state=10, perplexity=100)
tsne_df = tsne.fit_transform(corpus_embeddings)
print('reducing embeddings dimensions')
print('==============================')
print('plotting...')

# Elbow method: run k-means for 1-10 clusters and record the within-cluster sum of squares
x = tsne_df
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

# Plotting the results onto a line graph, allowing us to observe 'the elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')  # within-cluster sum of squares
plt.show()

n_clusters = input('insert number of topics ')
print('')
print('')

# Applying k-means to the dataset / creating the k-means classifier
kmeans = KMeans(n_clusters=int(n_clusters), init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(x)
y_kmeans1 = pd.DataFrame(y_kmeans)
y_kmeans1 = y_kmeans1.rename(columns={0: "label"})

# Append the sentences to the labeled t-SNE coordinates and export to CSV
tsne_df1 = pd.DataFrame(tsne_df)
tsne_df1 = tsne_df1.join(y_kmeans1)
df = pd.DataFrame(df)
df = df.rename(columns={0: 'text'})
tsne_df1 = tsne_df1.join(df)
tsne_df1.to_csv(just_domain + '_sentences.csv')
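Once the labeled DataFrame exists, a quick way to eyeball each topic is to group the sentences by their cluster label (a small sketch using the tsne_df1 DataFrame created above):

# Print a couple of example sentences per topic as a quick sanity check
for label, group in tsne_df1.groupby('label'):
    print('Topic ' + str(label) + ' (' + str(len(group)) + ' sentences)')
    for sentence in group['text'].head(2):
        print('  - ' + str(sentence))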
And Finally, Interactive Plotting
The last part of the code uses the wonderful Plotly module to plot the clusters, their size (cluster length / total length) and the entity/salience results. The plots are interactive and are exported to HTML so they can be sent to colleagues (yay!).
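As a rough idea of what that plotting step looks like (a minimal sketch rather than the exact notebook code; the output filename is an assumption):

# Color the t-SNE points by their k-means label and export an interactive HTML file
fig = go.Figure(go.Scatter(
    x=tsne_df1[0], y=tsne_df1[1],
    mode='markers',
    marker=dict(color=tsne_df1['label']),
    text=tsne_df1['text']))  # hovering shows the underlying sentence
fig.write_html(just_domain + '_clusters.html')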
Some tables are also generated using the amazing ipywidgets, which lets us explore the sentences/topics interactively in the Jupyter notebook.
Lastly, the keyword extraction functions are called on our text.
For the complete notebook click here
Conclusion
We have come a long way, and we have the industry to thank for it. The SEO industry has a wonderful culture of knowledge-sharing and a shared drive to improve; without all the beautiful minds mentioned here (and others) we could never have come this far. We are glad we can now participate and share something of our own with the SEO community, and we hope this was useful, or at least interesting.
Do you have any suggestions for improvements? What are you using BERT for?
We would love to hear your input /collaborate with you to develop this further.
If you have any questions or suggestions please feel free to reach out: