Problem

Company “Help-i-copter” allows users to rent a ride in a helicopter for a lift in style. However, Help-i-copter is having trouble creating software that finds the best flat surface for its pilots to land on when picking up customers. A good solution would be to crawl data from websites and store it in a PostgreSQL database. What steps should they take?

Crawling

Crawling, in software terms, is the act of copying data from a website using a computer program. This data can be saved to a file, printed to your screen, or put into a database; which one makes sense depends on the project and its scale. At larger scales, where the data can be overwhelming, a database is the best option, which is why it’s important to have one. For a small example, suppose your grocery list is stored online, but you don’t want to log in every time to see it. You could instead crawl that information off the web with a command like:

soup.find_all('a', attrs={'class': 'list'})
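
To see it in context, here is a minimal sketch of a complete grocery-list crawl; the URL and the 'list' class are hypothetical placeholders for wherever your list actually lives:

import requests
from bs4 import BeautifulSoup

# hypothetical page holding the grocery list
url = "http://www.example.com/grocery-list"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# print the text of every link marked with the (assumed) 'list' class
for item in soup.find_all('a', attrs={'class': 'list'}):
    print(item.get_text())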

  • Write a crawling program that collects data from multiple sources to get the most accurate data. This program would most likely be written in Python, as it has access to modules like Requests and BeautifulSoup. A single-source crawl looks like this, with a multi-source sketch just after:

import requests
from bs4 import BeautifulSoup

url = "http://www.placeforyourhelicopter.com"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
surfaces = soup.find_all('a', attrs={'class': 'surface'})
for d in surfaces:
    # pull the surface information out of each matching link
    print(d.get_text())
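
Since the step above calls for multiple sources, the same fetch-and-parse step can simply run over a list of URLs. The second URL here is a hypothetical placeholder, and the sketch assumes every source marks its surfaces with the same 'surface' class:

# hypothetical list of sites publishing landing-surface data
sources = [
    "http://www.placeforyourhelicopter.com",
    "http://www.example.com/flat-surfaces",
]
all_surfaces = []
for url in sources:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    all_surfaces.extend(soup.find_all('a', attrs={'class': 'surface'}))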

  • Store the found data in a PostgreSQL database. This keeps the data organized, which in turn lets it be queried quickly and efficiently.

import psycopg2

endpoint = "surfaces.amazonaws.com"
c = psycopg2.connect(host=endpoint, database="surfaces", user="admin", password="admin")
cursor = c.cursor()
query = "INSERT INTO surfaces (airtraffic,windspeed,date_produced,lat,lon,state) VALUES (%s,%s,%s,%s,%s,%s);"
data = ("low", 12.5, "2019-06-01", 34.05, -118.25, "CA")  # example values for one crawled surface
cursor.execute(query, data)
c.commit()  # make the insert permanent
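
For that INSERT to succeed, the surfaces table has to exist with matching columns. The problem statement doesn’t specify the types, so the ones below are assumptions, but a plausible definition would be:

# create the table the INSERT above expects; column types are assumptions
cursor.execute("""
    CREATE TABLE IF NOT EXISTS surfaces (
        airtraffic    TEXT,
        windspeed     REAL,
        date_produced DATE,
        lat           DOUBLE PRECISION,
        lon           DOUBLE PRECISION,
        state         TEXT
    );
""")
c.commit()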

  • Create an app that allows pilots to enter a variable they want to query and parameters for that variable, then use Python and the Psycopg2 module to query the data accordingly. This app will allow non-programmers to access the database without having to learn PostgreSQL. A minimal version, hard-coded to the airtraffic column, comes first, with a column-choosing sketch after it:

pattern = input("Air-traffic pattern to search for (e.g. low%): ")
# pass the pattern as a parameter so Psycopg2 escapes it safely
cursor.execute("SELECT * FROM surfaces WHERE airtraffic LIKE %s", (pattern,))
for row in cursor:
    print(row)
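
To let the pilot pick the variable too, as the step above describes, the column name can’t be sent as a query parameter, so a whitelist of known columns keeps the query safe. The column names here are the ones assumed in the table sketch earlier:

# text columns the pilot is allowed to query; the whitelist prevents SQL
# injection, since column names can't be parameterized the way values can
allowed = {"airtraffic", "state"}
column = input("Which variable do you want to query? ")
if column not in allowed:
    raise ValueError("unknown column: " + column)
pattern = input("Pattern to match (e.g. low%): ")
cursor.execute("SELECT * FROM surfaces WHERE " + column + " LIKE %s", (pattern,))
for row in cursor:
    print(row)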


Databases

So why is it important to store large amounts of data in a database? Simply put, it gives you an easy way to find and query data. For example, let’s say you have a list of employees you got from your company’s website and added it to a PostgreSQL database. Finding data like “all employees with a first name that begins with an L” becomes much simpler, because the database keeps everything organized and filterable. A simple clause like:

WHERE employee_name LIKE 'L%'

would return “Larry, Liz, and Lucy” quickly and efficiently.
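
From Python, that clause slots into a full Psycopg2 query like the sketch below; the employees table and its employee_name column are hypothetical:

# hypothetical employees table crawled from the company site
cursor.execute(
    "SELECT employee_name FROM employees WHERE employee_name LIKE %s",
    ('L%',))
for (name,) in cursor:
    print(name)  # Larry, Liz, Lucy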

Bounding Box


This data could be used in a lot of ways. Namely, we could use our latitude and longitude coordinates to create a bounding box and get information about the areas within that box. This could help Help-i-copter in a number of ways: pilots could use a bounding box to find flat areas near them and sort those by air traffic, location, etc. The app would ask the pilot for the minimum and maximum coordinates of the box and then look through the database for any surface that fits the description. Here’s what that might look like in Python, using PostGIS spatial functions on the database side:

xmin = float(input("What is the minimum longitude? "))
xmax = float(input("What is the maximum longitude? "))
ymin = float(input("What is the minimum latitude? "))
ymax = float(input("What is the maximum latitude? "))
# build a point from each row's coordinates and test it against the box;
# note that in PostGIS, x is longitude and y is latitude
cursor.execute(
    "SELECT * FROM surfaces "
    "WHERE ST_SetSRID(ST_MakePoint(lon, lat), 4326) && "
    "ST_MakeEnvelope(%s, %s, %s, %s, 4326)",
    (xmin, ymin, xmax, ymax))
for row in cursor:
    print(row)
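
If the surfaces table grows large, a spatial index keeps this search fast. This is an optional sketch, assuming the PostGIS extension is installed; it indexes the same point expression the query uses, so the database can apply it:

# a GiST index on the point expression lets PostGIS answer bounding-box
# searches without scanning every row
cursor.execute(
    "CREATE INDEX IF NOT EXISTS surfaces_geom_idx ON surfaces "
    "USING GIST (ST_SetSRID(ST_MakePoint(lon, lat), 4326));")
c.commit()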

Conclusion


As you can see, crawling and databases work very well together, especially when crawling large amounts of data. It would otherwise be inefficient and a lot slower to store the data in a plain document, or to re-crawl the page every time you run the program. Thanks to the power of PostgreSQL and Python, Help-i-copter is able to efficiently crawl a site, upload the data to a database, and quickly query it back for a pilot.