Build a Reddit bot in Python that suggests similar posts
Introduction
In this post, we are going to build a Reddit bot that suggests related posts. Think of it as a “related” page for posts in a subreddit. The idea is pretty simple: when someone creates a new post, RSPBot replies with a list of similar posts (if any are available).
Here is an example from r/smallbusiness:
Original post:
Getting my small business bank account today. Should I go with a small credit union or a large bank?
Related posts:
Best small business bank account?
Bank recommendations for small business.
Which bank do you use for you small business and why?
Recommendations for Small Business Bank
This is also a good way to measure how important a topic is to a particular subreddit’s audience, what problems they are trying to solve, and how they go about it.
But before moving on, I suggest taking a look at the etiquette for Reddit bots. It is a set of rules and suggestions on what to do and what not to do with your bot. Don’t ignore those rules, as a banned bot is a dead, sad bot :)
Scraping Reddit
Let’s think about the following workflow: rspbot monitors the subreddit for new submissions, extracts the post title, searches for similar posts in the same subreddit, and then replies with a list of related posts. It sounds like an easy task, but think about active subreddits that receive a large number of new submissions: if the bot runs a search for every new title, sooner or later it will flood Reddit with search API queries.
What if we scrape the subreddit’s titles beforehand, use them as our database of post titles, and build a mechanism to search through them and reply with a list of similar topics? This simplifies the workflow, since all title matching can be performed on our local machine.
Scraping Reddit posts is pretty simple, as the API is well documented, but we will use PRAW, a Python wrapper for the API, as it encapsulates all the API queries and provides an easy-to-use programming interface. More importantly, PRAW splits a large request into multiple API calls, each separated by a delay, so that you don’t break the Reddit API guidelines.
Here is an excerpt from the subreddit scraper source code:
import json
import os

import praw

# Placeholder credentials of your Reddit app; replace them with your own values.
auth = {
    'user_agent': '<your user agent>',
    'client_id': '<your client id>',
    'secret_id': '<your client secret>',
}


def add_comment_tree(root_comment, all_comments):
    # recursively collect a comment together with all of its replies
    comment_prop = {'body': root_comment.body,
                    'ups': root_comment.ups}
    if root_comment.replies:
        comment_prop['comments'] = list()
        for reply in root_comment.replies:
            add_comment_tree(reply, comment_prop['comments'])
    all_comments.append(comment_prop)


def get_submission_output(submission):
    return {
        'permalink': submission.permalink,
        'title': submission.title,
        'created': submission.created,
        'url': submission.url,
        'body': submission.selftext,
        'ups': submission.ups,
        'comments': list()
    }


def save_submission(output, submission_id, output_path):
    # flush to file with submission id as name
    out_file = os.path.join(output_path, submission_id + ".json")
    with open(out_file, "w") as fp:
        json.dump(output, fp)


def parse_subreddit(subreddit, output_path, include_comments=True):
    reddit = praw.Reddit(user_agent=auth['user_agent'], client_id=auth['client_id'],
                         client_secret=auth['secret_id'])
    subreddit = reddit.subreddit(subreddit)
    os.makedirs(output_path, exist_ok=True)
    # Note: subreddit.submissions() exists only in older PRAW releases (it was
    # removed in PRAW 6); subreddit.new(limit=None) is the closest replacement,
    # but Reddit listings return at most about 1000 posts.
    submissions = subreddit.submissions()
    for submission in submissions:
        print("Working on ... ", submission.title)
        output = get_submission_output(submission)
        if include_comments:
            # drop the "load more comments" placeholders, then walk the tree
            submission.comments.replace_more(limit=0)
            for comment in submission.comments:
                add_comment_tree(comment, output['comments'])
        save_submission(output, submission.id, output_path)


if __name__ == '__main__':
    parse_subreddit("smallbusiness", "/tmp/smallbusiness/", include_comments=False)
The parse_subreddit function iterates over all posts in the subreddit, extracts the fields defined in get_submission_output (optionally with the comment tree), and saves each post to a JSON file named after the post ID.
If you don’t need comments, set include_comments to False, as it speeds up the scraper significantly.
Make sure you have created an app on Reddit and received your client ID and client secret.
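One way to keep the client ID and secret out of the source code is to read them from environment variables. This is only a minimal sketch: the variable names REDDIT_USER_AGENT, REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET and the load_auth helper are hypothetical and not part of rspbot. The returned dict uses the same keys as the auth dict in the scraper excerpt.

```python
import os


def load_auth(env=None):
    """Build the auth dict used by the scraper from environment variables.

    The variable names below are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    names = {
        'user_agent': 'REDDIT_USER_AGENT',
        'client_id': 'REDDIT_CLIENT_ID',
        'secret_id': 'REDDIT_CLIENT_SECRET',
    }
    # Fail early with a readable message if something is not configured.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

With this in place, the scraper could start with auth = load_auth() instead of hard-coding the values.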
Content-based recommendation engine
So, we have the subreddit posts as a set of JSON files with all the information we need. Now we need to build a mechanism for searching through those files and extracting similar posts as recommendations for users.
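Before any matching can happen, the titles have to be read back from disk. The sketch below assumes the file layout produced by the scraper (one JSON file per post, with 'title' and 'url' keys); the load_posts name is made up for illustration.

```python
import json
import os


def load_posts(input_path):
    """Read every <post id>.json file written by the scraper.

    Returns a list of (title, url) tuples, sorted by file name so that the
    order stays the same between runs.
    """
    posts = []
    for file_name in sorted(os.listdir(input_path)):
        if not file_name.endswith(".json"):
            continue  # ignore anything that is not a scraped post
        with open(os.path.join(input_path, file_name)) as fp:
            data = json.load(fp)
        posts.append((data['title'], data['url']))
    return posts
```

Keeping the order stable matters because the rows of the similarity matrix are matched to posts by position.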
In rspbot we convert text documents to a matrix of token occurrences and then use a kernel as the measure of similarity. Basically, we use the scikit-learn class HashingVectorizer together with the linear_kernel function to get a similarity matrix.
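To make this more concrete, here is a small sketch of the idea, not the actual rspbot module. It uses HashingVectorizer with its default settings (which L2-normalise each row, so the linear kernel effectively gives cosine similarity); the related_titles helper and the trademark title are made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import linear_kernel


def related_titles(query, titles, top_n=3):
    """Return up to top_n (title, score) pairs, most similar first."""
    # HashingVectorizer is stateless, so there is no fit step.
    vectorizer = HashingVectorizer()
    title_matrix = vectorizer.transform(titles)
    query_vector = vectorizer.transform([query])
    # One row of similarities: the query against every known title.
    scores = linear_kernel(query_vector, title_matrix).flatten()
    ranked = scores.argsort()[::-1][:top_n]
    # Titles that share no tokens with the query score 0 and are dropped.
    return [(titles[i], float(scores[i])) for i in ranked if scores[i] > 0]


titles = [
    "Best small business bank account?",
    "Bank recommendations for small business.",
    "How do I register a trademark?",  # unrelated title, invented for the example
]
query = ("Getting my small business bank account today. "
         "Should I go with a small credit union or a large bank?")
for title, score in related_titles(query, titles):
    print(round(score, 2), title)
```

The two bank-related titles share several tokens with the query and are returned, while the trademark title shares none and is filtered out.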
After transforming the list of post titles into a matrix of numbers, we can save this matrix as a binary file; when rspbot runs, it just loads this file and uses it to find similar posts. This improves performance, as we don’t need to parse the whole list of posts from the JSON files to rebuild the matrix representation; we load it from the file and reuse it from the previous run.
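The caching step could look roughly like this. The post does not say which binary format rspbot uses, so the .npz format written by scipy.sparse.save_npz is an assumption here, and load_or_build_matrix is a hypothetical helper name.

```python
import os

from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer


def load_or_build_matrix(titles, matrix_path):
    """Return the title matrix, building and saving it only on the first run.

    matrix_path should end with ".npz", the extension save_npz writes.
    """
    if os.path.exists(matrix_path):
        # A previous run already did the work: reuse its result.
        return sparse.load_npz(matrix_path)
    matrix = HashingVectorizer().transform(titles)
    sparse.save_npz(matrix_path, matrix)
    return matrix
```

Note that the cached matrix goes stale as soon as new posts are scraped, so the file has to be deleted or rebuilt whenever the JSON database changes.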
Make sure you have a basic understanding of how HashingVectorizer and linear_kernel work (check out the scikit-learn tutorial on basic concepts) before moving on to the module source code.
Monitoring Reddit for new posts
Now we have a set of post titles as JSON files and a mechanism for matching similar posts. The next step is to make the bot listen for new submissions, find related posts if any are available, and reply with a list of suggested topics.
With PRAW, monitoring for new submissions is as easy as writing a for loop:
# reddit is a praw.Reddit instance, created the same way as in the scraper
subreddit = reddit.subreddit('AskReddit')
for submission in subreddit.stream.submissions():
    # do something with submission
    print(submission.title)
Let’s summarize the whole bot “life” in a list of steps:
- Scrape the subreddit into a set of JSON files that serves as the related-post suggestion database.
- Convert the JSON files to a matrix of token occurrences for finding related posts. This step is performed only the first time the bot is started; later we just reuse the existing matrix saved as a binary file.
- Monitor for new submissions; for each new submission, convert its title to a matrix of token occurrences and combine it with the “old” matrix, then search for related posts and, if any are found, reply with a list of titles with URLs.
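The last step can be sketched as follows. This is an illustration under stated assumptions rather than the rspbot source: find_related, format_reply and run_bot are hypothetical names, posts is assumed to be a list of (title, url) tuples whose order matches the matrix rows, the 0.2 similarity threshold is arbitrary, and skip_existing=True assumes a recent PRAW release.

```python
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import linear_kernel

vectorizer = HashingVectorizer()


def find_related(title, posts, matrix, top_n=5, min_score=0.2):
    """Return up to top_n (title, url) pairs similar to the new title."""
    scores = linear_kernel(vectorizer.transform([title]), matrix).flatten()
    ranked = scores.argsort()[::-1][:top_n]
    # min_score is an arbitrary cut-off; tune it for your subreddit.
    return [posts[i] for i in ranked if scores[i] >= min_score]


def format_reply(related):
    """Format the related posts as a Markdown list for a Reddit comment."""
    lines = ["Related posts:", ""]
    lines += ["- [{}]({})".format(title, url) for title, url in related]
    return "\n".join(lines)


def run_bot(subreddit, posts, matrix):
    """subreddit is a PRAW Subreddit; posts and matrix hold the known posts."""
    # skip_existing=True asks the stream to start with posts created from now on.
    for submission in subreddit.stream.submissions(skip_existing=True):
        related = find_related(submission.title, posts, matrix)
        if related:
            submission.reply(format_reply(related))
        # Combine the new title with the "old" matrix so that later
        # submissions can be matched against it as well.
        posts.append((submission.title, submission.url))
        new_row = vectorizer.transform([submission.title])
        matrix = sparse.vstack([matrix, new_row], format='csr')
```

Replying only when something passes the threshold keeps the bot quiet on posts that have no real match, which is also in the spirit of the bot etiquette mentioned earlier.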