How to Use Python to Automate SEO Keyword Clustering Based on Search Intent

Python can help make SEO keyword research faster, more accurate, and more scalable. Here's what you should know.

There’s a lot to learn about search intent: from using deep learning to infer it by classifying text, to breaking down SERP titles with Natural Language Processing (NLP) techniques, to clustering keywords based on semantic relevance.

Not only do we understand the benefits of deciphering search intent, but we also have a number of techniques for scale and automation at our disposal.

However, this frequently entails creating your own AI. What if you don’t have the time or knowledge to do so?

In this column, you’ll learn how to use Python to automate keyword clustering based on search intent.

SERPs Provide Insights Into Search Intent

Some methods necessitate extracting all of the copy from the titles of the ranking content for a given keyword, then feeding it into a neural network model (which you must then build and test), or perhaps you’re using NLP to cluster keywords.

There is another method that allows you to use Google’s AI to do the work for you instead of scraping all of the SERPs content and building an AI model.

Assume that Google ranks site URLs in descending order based on the likelihood of the content satisfying the user query. As a result, if the intent for the two keywords is the same, the SERPs will most likely be similar.


To stay on top of Core Updates, SEO professionals have for years compared SERP results across keywords to infer shared (or distinct) search intent, so this is nothing new.

The value-add here is the comparison’s automation and scaling, which provides both speed and greater precision.

How To Cluster Keywords At Scale Based On Search Intent With Python (Code Included)

Begin by downloading your SERPs results as a CSV file.

1. Import The List Into Your Python Notebook

import pandas as pd
import numpy as np

serps_input = pd.read_csv('data/sej_serps_input.csv')
serps_input

The SERPs file has now been imported into a Pandas dataframe, as shown below.
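If you don't have a SERP export of your own to follow along with, the remaining steps only assume the dataframe has keyword, rank, url, and search_volume columns. A minimal stand-in (the URLs and search volumes below are made up, purely for illustration) might look like this:

```python
import pandas as pd

# Hypothetical stand-in for a SERP export; the URLs and search volumes
# are invented, but the column names are the ones the later steps rely on.
serps_input = pd.DataFrame({
    'keyword':       ['cash isa', 'cash isa', 'isa rates', 'isa rates'],
    'rank':          [1, 2, 1, 2],
    'url':           ['https://a.example', 'https://b.example',
                      'https://a.example', 'https://c.example'],
    'search_volume': [12100, 12100, 5400, 5400],
})
serps_input
```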

2. Data Filtering for Page 1

We’d like to compare the Page 1 results of each SERP for different keywords.

Because we want to filter at the keyword level, we’ll split the dataframe into mini keyword dataframes to run the filtering function before recombining into a single dataframe:

# Split
serps_grpby_keyword = serps_input.groupby("keyword", group_keys=False)
k_urls = 15

# Apply: keep only rows with a URL that rank within the top k_urls
def filter_k_urls(group_df):
    filtered_df = group_df.loc[group_df['url'].notnull()]
    filtered_df = filtered_df.loc[filtered_df['rank'] <= k_urls]
    return filtered_df

# Combine: flatten the grouped results back into a single dataframe
filtered_serps_df = serps_grpby_keyword.apply(filter_k_urls).reset_index(drop=True)
filtered_serps_df

3. Converting Ranking URLs To Strings

Because there are more SERP result URLs than keywords, we must compress those URLs into a single line to represent the SERP for the keyword.

Here’s how it works:

# Convert results to strings using Split-Apply-Combine
filtserps_grpby_keyword = filtered_serps_df.groupby("keyword", group_keys=False)

# Apply: join each keyword's ranked URLs into one space-separated string
# (the space separator matters: the similarity function later tokenizes
# the SERP string on whitespace)
def string_serps(df):
    df['serp_string'] = ' '.join(df['url'])
    return df

# Combine
strung_serps = filtserps_grpby_keyword.apply(string_serps)

# Keep one row per keyword and clean up
strung_serps = strung_serps[['keyword', 'serp_string']]
strung_serps = strung_serps.drop_duplicates()
strung_serps

The SERP for each keyword is shown below, compressed into a single line.

4. Examine SERP Similarity

To perform the comparison, we now need to pair each keyword's SERP with every other keyword's SERP:

# align serps
def serps_align(k, df):
    prime_df = df.loc[df.keyword == k]
    prime_df = prime_df.rename(columns = {"serp_string" : "serp_string_a", 'keyword': 'keyword_a'})
    comp_df = df.loc[df.keyword != k].reset_index(drop=True)
    prime_df = prime_df.loc[prime_df.index.repeat(len(comp_df.index))].reset_index(drop=True)
    prime_df = pd.concat([prime_df, comp_df], axis=1)
    prime_df = prime_df.rename(columns = {"serp_string" : "serp_string_b", 'keyword': 'keyword_b', "serp_string_a" : "serp_string", 'keyword_a': 'keyword'})
    return prime_df

columns = ['keyword', 'serp_string', 'keyword_b', 'serp_string_b']
matched_serps = pd.DataFrame(columns=columns)
matched_serps = matched_serps.fillna(0)
queries = strung_serps.keyword.to_list()

for q in queries:
    temp_df = serps_align(q, strung_serps)
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    matched_serps = pd.concat([matched_serps, temp_df], ignore_index=True)

matched_serps

The preceding table displays all of the keyword SERP pair combinations, ready for SERP string comparison.

Because there is no open-source library that compares list objects by order, the function below has been written for you.

The serps_similarity function compares both the overlap of sites and the order of those sites across SERPs.

import py_stringmatching as sm

# Only compare the top k_urls results 
def serps_similarity(serps_str1, serps_str2, k=15):
    denom = k+1
    norm = sum([2*(1/i - 1.0/(denom)) for i in range(1, denom)])

    ws_tok = sm.WhitespaceTokenizer()

    serps_1 = ws_tok.tokenize(serps_str1)[:k]
    serps_2 = ws_tok.tokenize(serps_str2)[:k]

    match = lambda a, b: [b.index(x)+1 if x in b else None for x in a]

    pos_intersections = [(i+1,j) for i,j in enumerate(match(serps_1, serps_2)) if j is not None] 
    pos_in1_not_in2 = [i+1 for i,j in enumerate(match(serps_1, serps_2)) if j is None]
    pos_in2_not_in1 = [i+1 for i,j in enumerate(match(serps_2, serps_1)) if j is None]
    a_sum = sum([abs(1/i -1/j) for i,j in pos_intersections])
    b_sum = sum([abs(1/i -1/denom) for i in pos_in1_not_in2])
    c_sum = sum([abs(1/i -1/denom) for i in pos_in2_not_in1])

    intent_prime = a_sum + b_sum + c_sum
    intent_dist = 1 - (intent_prime/norm)
    return intent_dist
# Apply the function
matched_serps['si_simi'] = matched_serps.apply(lambda x: serps_similarity(x.serp_string, x.serp_string_b), axis=1)
serps_compared = matched_serps[['keyword', 'keyword_b', 'si_simi']]
serps_compared
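To get a feel for how the metric behaves, here's a self-contained sanity check of the same scoring logic (str.split stands in for py_stringmatching's WhitespaceTokenizer, which behaves identically on space-separated URL strings; the URLs are made up):

```python
# Sanity check of the SERP similarity scoring: identical SERPs should
# score 1.0, disjoint SERPs should score well below the clustering threshold.
def serps_similarity(serps_str1, serps_str2, k=15):
    denom = k + 1
    norm = sum(2 * (1/i - 1.0/denom) for i in range(1, denom))

    serps_1 = serps_str1.split()[:k]
    serps_2 = serps_str2.split()[:k]

    match = lambda a, b: [b.index(x) + 1 if x in b else None for x in a]

    pos_intersections = [(i+1, j) for i, j in enumerate(match(serps_1, serps_2)) if j is not None]
    pos_in1_not_in2 = [i+1 for i, j in enumerate(match(serps_1, serps_2)) if j is None]
    pos_in2_not_in1 = [i+1 for i, j in enumerate(match(serps_2, serps_1)) if j is None]

    a_sum = sum(abs(1/i - 1/j) for i, j in pos_intersections)
    b_sum = sum(abs(1/i - 1/denom) for i in pos_in1_not_in2)
    c_sum = sum(abs(1/i - 1/denom) for i in pos_in2_not_in1)
    return 1 - (a_sum + b_sum + c_sum) / norm

# Identical SERPs score a perfect 1.0 ...
same = serps_similarity("a.example b.example c.example",
                        "a.example b.example c.example")
# ... while fully disjoint SERPs score low
diff = serps_similarity("a.example b.example c.example",
                        "x.example y.example z.example")
print(same, diff)
```

Because the weighting is positional, an overlap near the top of the SERP counts for more than the same overlap further down.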

We can begin clustering keywords now that the comparisons have been completed.

We will treat any keywords with a weighted similarity of 40% or higher as sharing the same search intent.

# group keywords by search intent
simi_lim = 0.4

# join search volume
keysv_df = serps_input[['keyword', 'search_volume']].drop_duplicates()
keysv_df.head()

# append topic vols
keywords_crossed_vols = serps_compared.merge(keysv_df, on = 'keyword', how = 'left')
keywords_crossed_vols = keywords_crossed_vols.rename(columns = {'keyword': 'topic', 'keyword_b': 'keyword',
                                                                'search_volume': 'topic_volume'})

# sort by topic volume
keywords_crossed_vols = keywords_crossed_vols.sort_values('topic_volume', ascending = False)


# strip NANs
keywords_filtered_nonnan = keywords_crossed_vols.dropna()
keywords_filtered_nonnan

We now have a potential topic name, keywords with SERP similarity, and search volume for each.

The keyword and keyword_b columns have been renamed to topic and keyword, respectively.

Next, we'll iterate over the dataframe's rows using the lambda technique.

Applying a lambda with .apply() is generally faster than the .iterrows() function for iterating over rows in a Pandas dataframe, because it avoids the overhead of constructing an index-and-Series pair for every row.

Here we go:

queries_in_df = list(set(keywords_filtered_nonnan.topic.to_list()))
topic_groups_numbered = {}
topics_added = []

def find_topics(si, keyw, topc):
    if (si >= simi_lim) and (keyw not in topics_added) and (topc not in topics_added):
        # neither term seen before: start a new numbered group
        i = len(topic_groups_numbered) + 1
        topics_added.append(keyw)
        topics_added.append(topc)
        topic_groups_numbered[i] = [keyw, topc]
    elif (si >= simi_lim) and (keyw in topics_added) and (topc not in topics_added):
        # add the topic to the group its keyword already belongs to
        j = [key for key, value in topic_groups_numbered.items() if keyw in value]
        topics_added.append(topc)
        topic_groups_numbered[j[0]].append(topc)
    elif (si >= simi_lim) and (keyw not in topics_added) and (topc in topics_added):
        # add the keyword to the group its topic already belongs to
        j = [key for key, value in topic_groups_numbered.items() if topc in value]
        topics_added.append(keyw)
        topic_groups_numbered[j[0]].append(keyw)

def apply_impl_ft(df):
    return df.apply(
        lambda row: find_topics(row.si_simi, row.keyword, row.topic), axis=1)

apply_impl_ft(keywords_filtered_nonnan)

topic_groups_numbered = {k:list(set(v)) for k, v in topic_groups_numbered.items()}

topic_groups_numbered

A dictionary with all of the keywords clustered by search intent into numbered groups is shown below:

{1: ['fixed rate isa',
  'isa rates',
  'isa interest rates',
  'best isa rates',
  'cash isa',
  'cash isa rates'],
 2: ['child savings account', 'kids savings account'],
 3: ['savings account',
  'savings account interest rate',
  'savings rates',
  'fixed rate savings',
  'easy access savings',
  'fixed rate bonds',
  'online savings account',
  'easy access savings account',
  'savings accounts uk'],
 4: ['isa account', 'isa', 'isa savings']}

Let’s put that in a dataframe:

topic_groups_lst = []

for k, l in topic_groups_numbered.items():
    for v in l:
        topic_groups_lst.append([k, v])

topic_groups_dictdf = pd.DataFrame(topic_groups_lst, columns=['topic_group_no', 'keyword'])
                                
topic_groups_dictdf
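As a possible next step (not part of the walkthrough above), you can merge the cluster labels back onto the per-keyword search volumes from keysv_df and sum volume per topic group, which helps prioritize which intent cluster to target first. A self-contained sketch, with made-up keywords and volumes standing in for the real topic_groups_dictdf and keysv_df:

```python
import pandas as pd

# Made-up inputs mirroring the shapes of topic_groups_dictdf and keysv_df
topic_groups_dictdf = pd.DataFrame({
    'topic_group_no': [1, 1, 2],
    'keyword': ['isa rates', 'cash isa', 'kids savings account'],
})
keysv_df = pd.DataFrame({
    'keyword': ['isa rates', 'cash isa', 'kids savings account'],
    'search_volume': [5400, 12100, 880],
})

# Total search volume per intent cluster, largest first
topic_group_vols = (topic_groups_dictdf
                    .merge(keysv_df, on='keyword', how='left')
                    .groupby('topic_group_no', as_index=False)['search_volume']
                    .sum()
                    .sort_values('search_volume', ascending=False))
print(topic_group_vols)
```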

The search intent groups shown above are a good approximation of what an SEO expert would most likely produce by hand.

Although we only used a small set of keywords, the method clearly scales to thousands (if not more).

Activating the Outputs to Improve Your Search

Of course, the preceding could be taken a step further by using neural networks to process the ranking content for more accurate cluster and cluster group naming, as some commercial products already do.

For the time being, you can use this output to:

  • Incorporate this into your own SEO dashboard systems to improve the relevance of your trends and SEO reporting.
  • Improve your paid search campaigns by organizing your Google Ads accounts by search intent to achieve a higher Quality Score.
  • Merge redundant ecommerce search URLs for facets.
  • Instead of a traditional product catalog, structure the taxonomy of a shopping site based on search intent.

I’m sure there are more applications I haven’t mentioned; feel free to leave a comment if you know of any important ones.

In any case, your SEO keyword research just became a little faster, more accurate, and more scalable!
