GIS Crawling Made Easy Using Python With PostgreSQL

By AWS, Databases, Development, GIS, Python, RDS


Company “Help-i-copter” allows users to rent a ride in a helicopter for a lift in style. However, Help-i-copter is having trouble creating software that finds the best flat surfaces for its pilots to land on when picking up customers. A good solution would be to crawl data from websites and store it in a PostgreSQL database. What steps should they take?


Crawling, in software terms, is the act of copying data from a website using a computer program. This data can be saved to a file, printed to your screen, or put into a database; it entirely depends on the project and its scale. At larger scales, where the volume of data can be overwhelming, a database is the best option, which is why it’s important to have one. You could have your grocery list stored online, but you don’t want to log in every time to see it. Instead, you could crawl that information off the web with a command like:

soup.find_all('a', attrs={'class': 'list'})

  • Write a crawling program that collects data from multiple sources, to get the most accurate data. This program would most likely be written in Python, which has access to modules like Requests and BeautifulSoup.

import requests
from bs4 import BeautifulSoup

url = ""  # the page listing candidate surfaces
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
surfaces = soup.find_all('a', attrs={'class': 'surface'})
for d in surfaces:
    # collect the surface information from each element
    print(d.get_text())

  • Store the found data in a PostgreSQL database. This allows the data to be stored and sorted; once sorted, the data can be queried quickly and efficiently.

import psycopg2

endpoint = ""  # your RDS endpoint
c = psycopg2.connect(host=endpoint, database="surfaces", user="admin", password="admin")
cursor = c.cursor()
query = "INSERT INTO surfaces (airtraffic, windspeed, date_produced, lat, lon, state) VALUES (%s, %s, %s, %s, %s, %s);"
data = ()  # all info for one surface, as a tuple matching the columns above
cursor.execute(query, data)
c.commit()  # persist the insert
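The INSERT above assumes a surfaces table already exists. Here is a minimal schema sketch; the column types are assumptions inferred from the column names, so adjust them to your real data:

```sql
-- Hypothetical schema for the crawled surface data
CREATE TABLE surfaces (
    id            SERIAL PRIMARY KEY,
    airtraffic    INTEGER,           -- assumed: relative air-traffic level
    windspeed     NUMERIC,           -- assumed: wind speed at the site
    date_produced DATE,
    lat           DOUBLE PRECISION,
    lon           DOUBLE PRECISION,
    state         TEXT
);
```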

  • Create an app that allows pilots to enter a variable they want to query and parameters for that variable, then use Python and the Psycopg2 module to query the data accordingly. This app will allow non-programmers to access the database without having to learn PostgreSQL.

letter = input("Choose a number: ")
# pass the value as a query parameter so psycopg2 escapes it safely
cursor.execute("SELECT * FROM surfaces WHERE airtraffic LIKE %s", (letter,))
for row in cursor:
    print(str(row) + "\n")



So why is it important to store large amounts of data in a database? Simply, it gives you an easy way to find and query data. For example, let’s say you have a list of employees you got from your company’s website and added it to a PostgreSQL database. In such a database, finding data like “all employees with a first name that begins with an L” would be much simpler, as databases are well organized. A simple clause like:

where employee_name LIKE 'L%'

would return “Larry, Liz, and Lucy” quickly and efficiently.

Bounding Box


This data could be used in a lot of ways. Namely, we could use our latitude and longitude coordinates to create a bounding box and get information about the areas within that box. This could help Help-i-copter in a number of ways. Pilots could use a bounding box to find flat areas near them and sort those by air traffic, location, etc. The app would essentially ask the pilot for the maximum and minimum coordinates of the box and then look through the database for any surface that fits the description. Here’s what that might look like in Python:

xmin = input("What is the minimum latitude? ")
xmax = input("What is the maximum latitude? ")
ymin = input("What is the minimum longitude? ")
ymax = input("What is the maximum longitude? ")
# ST_MakePoint takes (lon, lat); ST_MakeEnvelope takes (xmin, ymin, xmax, ymax, srid)
cursor.execute(
    "SELECT * FROM surfaces WHERE ST_SetSRID(ST_MakePoint(lon, lat), 4326) && "
    "ST_MakeEnvelope(%s, %s, %s, %s, 4326)", (ymin, xmin, ymax, xmax))
for row in cursor:
    print(str(row) + "\n")



As you can see, crawling and databases can work very well together, especially when crawling large amounts of data. It would otherwise be inefficient and a lot slower to store the data in a plain document, or to re-crawl the page every time you run the program. Thanks to the power of PostgreSQL and Python, Help-i-copter is able to efficiently crawl a site, upload the data, and query it back to a pilot quickly.

AWS Sagemaker – predicting gasoline monthly output

By Artificial Intelligence, AWS, Development, Python, Sagemaker

AWS continues to wow me with all of the services that they are coming out with. What Amazon is doing is a very smart strategy: they are leveraging their technology stack to build more advanced solutions. In doing so, Amazon Web Services is following the “Profit From The Core” strategy to a T. Aside from following Amazon’s world domination plan, I wanted to see how well their rollout of artificial intelligence tools, like SageMaker, went.


There are many articles about how AI works. In some cases, an application is extraordinarily simple. In other cases, it is endlessly complex. We are going to stick with the simplest model. In this model, we have to do the following steps.

  1. Collect data
  2. Clean Data
  3. Build Model
  4. Train Model
  5. Predict Something
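Before handing these steps to SageMaker, it helps to see them in miniature. Here is a hedged sketch of the five steps using plain NumPy on synthetic data; the dataset and model are stand-ins, not SageMaker code:

```python
import numpy as np

# 1. Collect data: synthetic weekly output with a linear trend plus noise
rng = np.random.default_rng(0)
weeks = np.arange(100, dtype=float)
output = 50 + 0.3 * weeks + rng.normal(0, 2, size=weeks.size)

# 2. Clean data: drop any non-finite values
mask = np.isfinite(output)
weeks, output = weeks[mask], output[mask]

# 3. Build model: ordinary least squares on [1, week]
X = np.column_stack([np.ones_like(weeks), weeks])

# 4. Train model: solve for the coefficients
coef, *_ = np.linalg.lstsq(X, output, rcond=None)

# 5. Predict something: forecast week 101
forecast = coef[0] + coef[1] * 101
```

SageMaker automates steps 3 and 4 at scale, but the shape of the workflow is the same.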

Amazon has tried to automate these steps as best as possible.   From Amazon’s site: “Amazon SageMaker is a fully-managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. Amazon SageMaker removes all the barriers that typically slow down developers who want to use machine learning.”

Let’s see how well they do. Gentle people… let’s start our clocks. The time is 20 May 2018 @ 6:05pm.

Notebook Instances

The first thing that you do as part of your training is build notebooks. According to Project Jupyter, a notebook is an application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.

You follow the simple tutorial and it looks something like this.

AWS Sage simple Jupyter Notebook

Time: 6:11:34 (so far so good)

Example Selection – Time Series Forecast

The first thing that we want to do is go to the “SageMaker Examples” tab, and make a copy of “linear_time_series_forecast_2019-05-20”.  I have had some experience predicting when events would happen and wanted to follow something that I already know. If you aren’t familiar, please check out this coursera video.

Time: 6:20:17

Read Background

Forecasting is potentially the most broadly relevant machine learning topic there is. Whether predicting future sales in retail, housing prices in real estate, traffic in cities, or patient visits in healthcare, almost every industry could benefit from improvements in their forecasts. There are numerous statistical methodologies that have been developed to forecast time-series data. However, the process for developing forecasts tends to be a mix of objective statistics and subjective interpretations.

Properly modeling time-series data takes a great deal of care. What’s the right level of aggregation to model at? Too granular and the signal gets lost in the noise, too aggregate and important variation is missed. Also, what is the right cyclicality? Daily, weekly, monthly? Are there holiday peaks? How should we weight recent versus overall trends?

Linear regression with appropriate controls for trend, seasonality, and recent behavior remains a common method for forecasting stable time-series with reasonable volatility. This notebook will build a linear model to forecast weekly output for US gasoline products from 1991 to 2005. It will focus almost exclusively on the application. For a more in-depth treatment of forecasting in general, see Forecasting: Principles & Practice. In addition, because our dataset is a single time-series, we’ll stick with SageMaker’s Linear Learner algorithm. If we had multiple, related time-series, we would use SageMaker’s DeepAR algorithm, which is specifically designed for forecasting. See the DeepAR Notebook for more detail.
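Stripped of SageMaker, the core idea is ordinary least squares on lagged values of the series. A minimal sketch on synthetic data (the lag count and series are illustrative assumptions, not the notebook’s actual configuration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic weekly series with a trend and yearly (52-week) seasonality
t = np.arange(520, dtype=float)
series = 100 + 0.05 * t + 10 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 1, t.size)

# Turn the series into a supervised problem: predict y[t] from the previous 4 weeks
lags = 4
X = np.column_stack([series[i:len(series) - lags + i] for i in range(lags)])
X = np.column_stack([np.ones(len(X)), X])  # add an intercept column
y = series[lags:]

# Fit by least squares and forecast one step ahead from the last 4 observed weeks
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
next_week = coef @ np.concatenate([[1.0], series[-lags:]])
```

Linear Learner does essentially this fit, just distributed and with many hyperparameter variants in parallel.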

Time: 6:24:13

S3 Setup

Let’s start by specifying:

  • The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.
  • The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with the appropriate full IAM role arn string(s).

I set up a simple s3 bucket like this: 20180520-sage-test-v1-tm

Import the Python libraries.

Got distracted and played with all of the functions.  Time 6:38:07.


Let’s download the data. More information about this dataset can be found here.

You can run some simple plots using Matplotlib and Pandas.

Sage time series gas plots


Transform Data To Predictive Model

Next we’ll transform the dataset to make it look a bit more like a standard prediction model.

This stage isn’t immediately clear. If you just click through the cells, it takes a few seconds; if you read through each stage, it will take a lot longer. In the end, you should have the following files stored on S3.

Note, you can’t review the content from these using a text editor. The data is stored in binary.

Time: 7:02:43

I normally don’t use a lot of notebooks. As a result, this took a little longer because I ran into some problems.


Amazon SageMaker’s Linear Learner actually fits many models in parallel, each with slightly different hyperparameters. The model with the best fit is the one used. This functionality is enabled automatically. We can influence it using parameters like:

  • num_models to increase the total number of models run. The specified parameter values will always be among those models, but the algorithm also tries models with nearby parameter values, in case a nearby solution is more optimal. In this case, we’re going to use the max of 32.
  • loss, which controls how we penalize mistakes in our model’s estimates. For this case, let’s use absolute loss: since we haven’t spent much time cleaning the data, absolute loss will adjust less to accommodate outliers.
  • wd or l1, which control regularization. Regularization helps prevent model overfitting by keeping our estimates from becoming too finely tuned to the training data. (This is also why it is good to make sure your training data is an appropriate sample of the entire data set.) In this case, we’ll leave these parameters at their default, “auto”.

This part of the demo took a lot longer….

And it worked!

Ended at time: 7:21:54 pm.


The Forecast!

This is what we have all been waiting for!

For our example we’ll keep things simple and use Median Absolute Percent Error (MdAPE), but we’ll also compare it to a naive benchmark forecast (that week last year’s demand, scaled by the year-over-year trend: that week last year * that week last year / that week two years ago).
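MdAPE itself is simple to compute. A quick sketch (the function name and sample numbers are illustrative, not from the notebook):

```python
import numpy as np

def mdape(actual, predicted):
    """Median absolute percent error between two series, as a percentage."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.median(np.abs((actual - predicted) / actual))) * 100

# Example: forecasts off by 5%, 10%, and 2% -> the median error is 5%
print(round(mdape([100, 200, 50], [105, 180, 51]), 2))  # → 5.0
```

Because it uses the median, a handful of badly missed weeks won’t dominate the score the way they would with a mean-based error.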

As we can see, our MdAPE is substantially better than the naive benchmark’s. Additionally, we actually swing from a forecast that is too volatile to one that under-represents the noise in our data. However, the overall shape of the statistical forecast does appear to better represent the actual data.

Next, let’s generate a multi-step-ahead forecast. To do this, we’ll need to loop over invoking the endpoint one row at a time and make sure the lags in our model are updated appropriately.



It does appear that, for pre-built scenarios, AWS’s SageMaker worked for linear time-series prediction! While it doesn’t make you a master data scientist, it does give you a simple place to train and practice with data sets. If you wanted to master time series, you could simply plug in other datasets, run the same sort of analysis, and cross-check your work against other people’s results. With SageMaker, you have a complete and working blueprint!

Wrap up time: 8:19:30pm (with some distractions and breaks)


Entity Extraction On a Website | AWS Comprehend

By AWS, Comprehend, Development, Python

Use Case

You want to better understand what entities are embedded in a company’s website so you can understand what that company is focused on. You can use a tool like this if you are prospecting, thinking about a partnership, etc. How do you do this in the most efficient way? There are some tools that make this a lot easier.

1. Select Your Target

Here are the steps that we used. The target is a simple Squarespace site; you can see this by checking out

2. Get the data

For entity extraction, raw text is the goal: you want as much as you can get, without duplicates. Here is how you can pull everything that you need from the command line on a Mac.

  1. For the domain you want to search, change directories to a clean directory labeled YYYYMMDD_the_domain.
  2. Run this command: wget -p -k --recursive
  3. cd into the ./blog directory.
  4. Cat all of the blog articles out using this recursive command: find . -type f -exec cat {} >> ../catted_file ";"

3. Prep Query to an Entity Extraction Engine |  Comprehend

In this simple case, we are going to query AWS’s Comprehend service. We will need to write some simple Python 3 code.

Since we can’t submit more than 5,000 bytes per document, we need a batch job that breaks our raw text up into batch-sized pieces. To do that, I wrote some very simple code:

temp = open('./body_output/catted_file', 'r').read()
strings = temp.split(" ")
submission_counter = 0
aws_submission = ""
aws_queued_objects = []
for word in strings:
    pre_add_submission = aws_submission
    aws_submission = aws_submission + " " + word
    if len(aws_submission.encode('utf-8')) > 5000:
        submission_counter = submission_counter + 1
        print("Number = " + str(submission_counter) +
              " with a byte size of " + str(len(pre_add_submission.encode('utf-8'))))
        aws_queued_objects.append(pre_add_submission)
        aws_submission = ""

Now, we have to submit the batched job. This is very simple, assuming that you have the boto3 library properly installed and your AWS configs set up correctly.

import boto3

client = boto3.client('comprehend')
response = client.batch_detect_entities(TextList=aws_queued_objects[:25], LanguageCode='en')


Now, all you have to do is visualize the results. Note, you need to visualize this result outside of the Comprehend tool, because there is no way to import data into that viewer. This snapshot is what it looks like.

More importantly, the key work is to analyze.  We will leave that up to you!
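As a starting point for that analysis, you can tally entity types across the batches. A hedged sketch, assuming a response shaped like Comprehend’s batch_detect_entities output (the entities below are made up):

```python
from collections import Counter

# Hypothetical response in the shape batch_detect_entities returns
response = {
    "ResultList": [
        {"Index": 0, "Entities": [
            {"Text": "Amazon", "Type": "ORGANIZATION", "Score": 0.99},
            {"Text": "Seattle", "Type": "LOCATION", "Score": 0.97},
        ]},
        {"Index": 1, "Entities": [
            {"Text": "AWS", "Type": "ORGANIZATION", "Score": 0.98},
        ]},
    ],
    "ErrorList": [],
}

# Count how often each entity type appears across all batches
type_counts = Counter(
    entity["Type"]
    for result in response["ResultList"]
    for entity in result["Entities"]
)
print(type_counts.most_common())  # → [('ORGANIZATION', 2), ('LOCATION', 1)]
```

Feed those counts into any charting tool you like, since the Comprehend console can’t import them.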


Source Code

It was made to be as simple as possible, without overcomplicating things.



Enabling PostGIS on PostgreSQL with Amazon Web Services (AWS) | Relational Database Service (RDS)

By AWS, Databases, GIS

Part 1 of our PostGIS series is to help developers enable PostGIS for a PostgreSQL database in AWS.

Amazon’s Relational Database Service (RDS) allows users to launch PostgreSQL databases easily and quickly. PostgreSQL allows developers to extend the functionality of the core database via extensions. As a result, extensions loaded into the database can function just like built-in features.

Because this database is on Amazon’s servers, Amazon limits which extensions it will allow to be hosted on the service. These limitations are mostly for performance reasons.

PostGIS Version

By default, the database has PostGIS installed. Checking which version is running isn’t critical, but it may be helpful in your troubleshooting efforts.

To get the current version, use this:

SELECT PostGIS_full_version();


POSTGIS="2.3.2 r15302" GEOS="3.5.1-CAPI-1.9.1 r4246" PROJ="Rel. 4.9.3, 15 August 2016" GDAL="GDAL 2.1.3, released 2017/20/01" LIBXML="2.9.1" LIBJSON="0.12" RASTER

Normal Installation Guidance

If you follow PostGIS’s guidance here, you will get the following error.

When you run this command:
CREATE EXTENSION postgis_sfcgal;

ERROR: Extension “postgis_sfcgal” is not supported by Amazon RDS
DETAIL: Installing the extension “postgis_sfcgal” failed, because it is not on the list of extensions supported by Amazon RDS.
HINT: Amazon RDS allows users with rds_superuser role to install supported extensions. See: SHOW rds.extensions;

AWS RDS PostGIS Installation Guidance

AWS covers enabling PostGIS via their common database administrator (DBA) tasks here.

In about 3 minutes, you can run through these commands and enable PostGIS tables in your database.
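For reference, the AWS DBA guide boils down to creating the PostGIS extensions as the master (rds_superuser) account; treat this as an outline, since the current AWS docs add schema-ownership steps around it:

```sql
-- Run as the master (rds_superuser) account
CREATE EXTENSION postgis;
CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION postgis_tiger_geocoder;
CREATE EXTENSION postgis_topology;

-- Verify the installation
SELECT PostGIS_full_version();
```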

AWS Supported Extensions

This is a pretty nifty command to see which extensions are supported on the instance.

SHOW rds.extensions;

Here is a list of the PostgreSQL extensions that RDS does support:







Alexa Logo

Alexa Push Notifications | When?

By AWS, Development, Health, Tools

Eventually, Amazon will allow developers to deliver Alexa push notifications to their devices. It is only a matter of time. This change, while seemingly subtle, will unleash an avalanche of innovations. Parents, imagine you are sitting in your kitchen and suddenly you get a report that your child’s bus will be at the drop-off point in 3 minutes. Imagine eliminating time spent standing outside in the cold for a bus, simply by checking a phone interface. What if you could be notified when a storm is coming in your area?

The risk of abuse is real, but the capability is better off in the hands of the willing. All sorts of annoying advertising services will likely pop up. But for entrepreneurs and nerds like us, this is an exciting time.

Alexa Push Notification Work Arounds

Developers from all walks of life have opined on this topic. They have hacked together some incredibly creative solutions. Here are some of them:

From the Amazon Developer Forums

There has been some brain power expended to solve this without a clean solution from Amazon. It is left up to the reader to determine which is best. The approaches are:

Option 1: Developer Forum | Video

Option 2:  Hack |  Video

Timing Rumors

So far, there are only whispers of this capability on the internet.



Official Response

Here is the official response from Amazon developer services as of 15 Nov 2016.

Question: “I am a software developer and have read that Alexa may soon support push notifications to allow Alexa to trigger a conversation with a human instead of the other way around. Is this true and if so, when is it coming?”

Answer: From Amazon, as of 19 Nov 2016

I’m not sure where you read this, but unfortunately, we have nothing to announce at this time regarding this. I’d suggest keeping an eye on our official blog for any official announcements. 

Integrating Basecamp 2 and Slack

By AWS, Tips & Tutorials

In the world of software development, it is hard to keep track of everything and everyone. This is especially true with many teams having remote developers. Two popular tools to help keep teams synced are Slack and Basecamp. Slack allows teams to quickly chat and share files. Basecamp helps with longer conversations, as well as providing a calendar feature. Because Slack is the go-to software, most team members will always have Slack open. While Basecamp provides the ability to email the team, not everyone checks these messages. Additionally, checking the Basecamp calendar every day to see who is out can be annoying. The solution: integrate Slack with Basecamp. This is rather simple to do, and results in a message being posted to Slack containing the absences for today and tomorrow.


  1. Create a Slack Webhook
    1. Go to and install the app.
    2. Once installed go back to the same url if not already there, and select “Add Configuration”.
    3. Under “Choose a channel”, select any channel, and then select “Add incoming webhooks integration”.
    4. Note the url under “webhook url”, you will need this later.
  2. Gather Your Basecamp 2 ID, and Project ID
    1. Go to main page of your Basecamp project: usually
    2. The first set of numbers is your Basecamp 2 ID and the second set is your Project ID. These values will be needed later.
  3. Generate an Auth token for Basecamp
    1. Using Postman, “GET” the following URL: <Basecamp 2 ID>/api/v1/projects.json
    2. Under Authorization select “Type: Basic Auth” and enter your username and password
    3. Run the Query, then under Request Headers copy the value under “Authorization: Basic ValueToCopy”
  4. Create an AWS Lambda function, and paste the code found in
  5. In the code replace the following:
    1. YourEmailHere with Your Email
    2. YourAuthHere with the value copied from step 3
    3. YourBaseCampIDHere with Basecamp 2 Id from step 2
    4. YourBaseCampProjectIdHere with Project ID from step 2
    5. YourSlackHookURLHere with the webhook url found in step 1
    6. @YourNameHere with @ your slack name
    7. #YourChannelHere with #the channel you want the notification posted in
  6. Configure a Cloudwatch scheduled event to trigger your lambda function at the time you want to receive the slack message

Basecamp message format:

  • The post needs the following in the title: the word “out” in any case, and a date in either mm/dd or m/d format.
  • Examples:
    • JOE OUT | 07/07
    • Steve out | 07/07
    • Smith out | 7/7
    • John will be out on 7/7
  • If the post is on the calendar, then no specification is needed in the title.
  • For absences spanning multiple days:
    • Post the message on the calendar.
    • Put each specific day in the message title.
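The title rules above are easy to check programmatically. Here is a hedged sketch of the matching logic; the function is illustrative, not the actual Lambda code from step 4:

```python
import re

# A title counts as an absence if it contains the word "out" (any case)
# and a date in mm/dd or m/d form.
ABSENCE_RE = re.compile(r"\bout\b", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})\b")

def parse_absence(title):
    """Return (month, day) if the title matches the format rules, else None."""
    date = DATE_RE.search(title)
    if ABSENCE_RE.search(title) and date:
        return int(date.group(1)), int(date.group(2))
    return None

print(parse_absence("JOE OUT | 07/07"))         # → (7, 7)
print(parse_absence("John will be out on 7/7"))  # → (7, 7)
print(parse_absence("Team meeting 7/7"))         # → None
```

The Lambda function can run this check over each post title and forward the matches for today and tomorrow to the Slack webhook.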