
Use Case
You want to understand which entities are embedded in a company’s website so you can see what that company is focused on. A tool like this is useful if you are prospecting, evaluating a partnership, etc. How do you do this efficiently? A few tools have made it much easier.
1. Select Your Target
Here are the steps that we used for http://www.magicinc.org. It is a simple Squarespace site, which you can verify by checking out https://builtwith.com/magicinc.org
2. Get the data
For entity extraction, raw text is the goal: you want as much of it as you can get, without duplicates. Here is how to pull everything you need, using a few commands run from the command line on a Mac.
- For the domain you want to analyze, create and change into a clean directory labeled YYYYMMDD_the_domain.
- Run this command: wget -p -k --recursive http://www.magicinc.org
- cd into the ./blog directory.
- Concatenate all of the blog articles recursively with this command (a Python equivalent of this step is sketched after this list): find . -type f -exec cat {} \; >> ../catted_file
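If you would rather do the concatenation step in Python, here is a minimal sketch of the same operation. The ./blog working directory and the catted_file name come from the steps above; everything else is an assumption about your local layout.

import os

# Walk the current (./blog) directory recursively and append every
# file's contents to ../catted_file, like the find/cat one-liner above.
with open('../catted_file', 'a', encoding='utf-8', errors='ignore') as out:
    for root, _dirs, files in os.walk('.'):
        for name in files:
            with open(os.path.join(root, name), 'r',
                      encoding='utf-8', errors='ignore') as f:
                out.write(f.read())
                out.write(' ')  # separator so words don't run together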
3. Prep Query to an Entity Extraction Engine | Comprehend
In this simple case, we are going to query AWS’s Comprehend service. We will need to write some simple Python 3 code.
Since we can’t submit more than 5,000 bytes per document, we need to break our raw text up into batches and submit a batched job. To do that, I wrote some very simple code:
# Split the raw text into chunks that fit Comprehend's 5,000-byte limit.
temp = open('./body_output/catted_file', 'r').read()
strings = temp.split(" ")

aws_submission = ""
submission_counter = 0
aws_queued_objects = []

for word in strings:
    pre_add_submission = aws_submission
    aws_submission = aws_submission + " " + word
    if len(aws_submission.encode('utf-8')) > 5000:
        submission_counter = submission_counter + 1
        print("Number = " + str(submission_counter) +
              " with a byte size of " +
              str(len(pre_add_submission.encode('utf-8'))))
        aws_queued_objects.append(pre_add_submission)
        # Start the next chunk with the word that overflowed this one.
        aws_submission = word

# Keep the final partial chunk instead of dropping it.
if aws_submission.strip():
    aws_queued_objects.append(aws_submission)
Now, we have to submit the batched job. This is very simple, assuming that you have the boto3 library installed and your AWS credentials configured correctly.
import boto3

client = boto3.client('comprehend')
response = client.batch_detect_entities(
    TextList=aws_queued_objects, LanguageCode='en')
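One caveat: BatchDetectEntities accepts at most 25 documents per request, so a larger queue has to be sliced. Here is a minimal sketch of that loop, reusing the client and aws_queued_objects defined above; the slicing loop and the all_results list are my additions, not part of the original script.

all_results = []
# BatchDetectEntities takes at most 25 documents per call,
# so submit the queue in slices of 25.
for i in range(0, len(aws_queued_objects), 25):
    batch = aws_queued_objects[i:i + 25]
    response = client.batch_detect_entities(
        TextList=batch, LanguageCode='en')
    all_results.extend(response['ResultList'])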
4. Analyze
Now, all you have to do is visualize the results. Note that you need to visualize them outside of the Comprehend tool, because there is no way to import data into that viewer.
More importantly, the key work is the analysis itself. We will leave that up to you!
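As a starting point, here is a minimal sketch that tallies the extracted entities with collections.Counter, assuming the all_results list from the batching sketch above; the 0.8 confidence cutoff is an arbitrary assumption.

from collections import Counter

# Count entities by (text, type), keeping only confident detections.
entity_counts = Counter()
for result in all_results:
    for entity in result['Entities']:
        if entity['Score'] > 0.8:  # arbitrary cutoff; tune to taste
            entity_counts[(entity['Text'], entity['Type'])] += 1

# Print the 20 most frequent entities as a quick text summary.
for (text, etype), count in entity_counts.most_common(20):
    print(f"{count:4d}  {etype:<12} {text}")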
Source Code
It was made to be as simple as possible.
GitHub: https://github.com/Bytelion/aws_comprehend_batched_job