Entity Extraction On a Website | AWS Comprehend

By February 22, 2018 No Comments

Use Case

You want to better understand what entities are embedded in a company’s website so you can understand  what that company is focused on.  You can use a tool like this if you are prospecting, thinking about a partnership, etc.  How do you do this in the most efficient way?  There are some tools that have made this a lot easier.

1. Select Your Target

Here are the steps that we used for  They are a simple squarespace site.  You can see this by checking out

2. Get the data

For entity extraction, raw text is the goal. You want as much as you can get without having duplicates.  Here is how you can pull everything that you need.  Here are some command line arguments to run on a Mac.

  1. For the domain you want to search, change directories to a clean directory labeled YYYYMMDD_the_domain.
  2. Run this command: wget -p -k –recursive
  3. cd into the ./blog directory.
  4. Cat all of the blog articles out using this recursive command: find . -type f -exec cat {} >> ../catted_file “;”

3. Prep Query to an Entity Extraction Engine |  Comprehend

In this simple case, we are going to query a AWS’s Comprehend service.  We will need to write some simple Python3 code.

Since we can’t submit more than 5000 bytes, we need to submit a batched job that break’s up our raw text into simplified batch text.   To do that, I wrote some very simple code:

temp = open('./body_output/catted_file', 'r').read()
strings = temp.split(" ");
counter = 0;
aws_submission = "";
submission_counter = 0;
aws_queued_objects = [] for word in strings:
pre_add_submission = aws_submission
aws_submission = aws_submission + " " + word
if len(aws_submission.encode('utf-8')) >5000:
submission_counter = submission_counter+1
print ("Number = " + str(submission_counter) + " with a byte size of "+\n"+
"+ str(len(pre_add_submission.encode('utf-8'))))
aws_submission = ""

Now,  we have to submit the batched job.  This is very simple, assuming that you have your boto3 library properly installed and your AWS configs running correctly.

response = client.batch_detect_entities(


Now…. all you have to do is visualize the results.  Note, you need to visualize this result outside of the Comprehend tool because there is no way to import data into that viewer.  This snapshot is what it looks like.

More importantly, the key work is to analyze.  We will leave that up to you!


Source Code

It was made to be as simple as possible without over complicating things.