Wednesday 22 December 2010

miCrawler

Problem: At my work it was decided that we were to do mail drops. Rather than leaving the office and walking around the local area, we hit up the local real estate offices and found all the addresses of houses for sale, our target audience. As it happened the real estate people were not helpful, but their websites were. More helpful than these was the TradeMe website; as it turned out, most of the real estate agents listed the houses for sale on there anyway.
For a while it was my job to go through every listing and put them into a spreadsheet for some processing. After a day of this I thought it would be easier to write a web crawler to do it all for me.

tl;dr We need to get all the addresses off TradeMe that are local and recent.

Solution:
Having recently done Python at university, I decided to use it. As all I wanted was a CLI program that I could implement quickly, it seemed a good fit. The program searches the list of listings for the listing numbers, then searches each listing for its address. If it has one, the address is appended to a txt file, which can then be imported into a spreadsheet.
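(The spreadsheet import step can be skipped entirely if the txt file is turned into a CSV first. The sketch below is my own addition, not part of the crawler; it just splits on the tabs the script writes and saves a file most spreadsheets will open directly.)

import csv

with open('addressFile.txt') as infile, open('addressFile.csv', 'wb') as outfile:
    writer = csv.writer(outfile)
    for line in infile:
        # Each line of addressFile.txt is one address, with parts separated by tabs.
        fields = line.rstrip('\n').split('\t')
        if fields and fields[0]:
            writer.writerow(fields)
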
Credit must go to a couple of places for this:
http://forums.devshed.com/python-programming-11/webcrawler-in-python-73318.html
http://www.daniweb.com/code/snippet216420.html
daniweb is a great source for help, although I can't stand the website itself; I found it tricky to navigate. Google helped with that.

Code:

#!C:/Program Files/Python/Python 2.7
import urllib
print "miCrawler version 0.1"
f = open('addressFile.txt', 'a')
# The URL below is the houses for sale in the Manukau area, in list view, sorted by most recent.
# May need to be updated occasionally.
tmManukau1 = 'http://www.trademe.co.nz/browse/property/regionlistings.aspx?mcat=0350-5748-3399-&v=List&key=100055&page='
tmManukau2 = '&sort_order=expiry_desc'
for i in range(0, 60):
    tmManukau = tmManukau1 + str(i) + tmManukau2

    page = urllib.urlopen(tmManukau).read()
    # Note that line 637 of the page source is the first occurrence of a URL for a particular listing.
    # The 9 digits following this string represent a listing number.

    auctionString = '/Trade-me-property/Residential-property/Houses-for-sale/auction-'

    pos = page.find(auctionString)
    # Loop through this process until there are no more valid links on the page.
    while (pos != -1):
        # Build the URL of the listing that was found.
        listingURL = 'http://www.trademe.co.nz' + page[pos : pos + 77]
        listing = urllib.urlopen(listingURL).read()
        # Get the address from the listing; the 26 skips the '<td>' and the spaces after it.
        addSPos = listing.find('<td>', listing.find('Location:')) + 26
        addEPos = listing.find('</td>', addSPos)
        address = listing[addSPos : addEPos]
        try:
            # Only keep addresses that start with a street number.
            int(address[0])
            address = address.replace('<br />', '\t')
            print (address + '\n')
            f.write(address + '\n')
        except (ValueError, IndexError):
            print (address + ' HAS NO NUMBER\n')
        finally:
            # Each address occurs twice on the page, so skip the second occurrence.
            pos = page.find(auctionString, page.find(auctionString, pos + 78) + 78)
f.close()


Note:
This is very specific: it only searches Manukau and surrounding areas. Small changes to the URLs would be needed to change that, and the range in the loop determines which pages are read. Also, I have no idea what that mcat number is, and it may need to be updated too.
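If the area or page range ever does need to change, something like the sketch below would keep the fiddly URL bits in one place. The function name is my own; the parameters are just the ones from the URL above, mcat mystery included.

def listing_page_url(page, mcat='0350-5748-3399-', key='100055'):
    # Builds the same regionlistings URL as the hard-coded strings above.
    return ('http://www.trademe.co.nz/browse/property/regionlistings.aspx'
            '?mcat=' + mcat + '&v=List&key=' + key +
            '&page=' + str(page) + '&sort_order=expiry_desc')

# Same pages as the range in the loop:
# for i in range(0, 60):
#     page = urllib.urlopen(listing_page_url(i)).read()
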
I suppose the main key here is this line:

 listing = urllib.urlopen(listingURL).read()


'listing' now contains the entire HTML code for the page. Performing some search functions on it for the auction numbers allows the program to find the relevant listing, and from that page, find the address. Simple enough.
The random hard-coded numbers are there due to the page formatting; it was the easiest way I could think of to get around it. So that's that one. We never used all 1,180 addresses, but it kept the boss off my back for a while.
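To make the string searching a bit more concrete, here is the same extraction pulled into a function and run on a made-up fragment of listing HTML. The fragment and the function name are mine; the offsets are the ones from the script.

def extract_address(listing):
    # Same logic as the crawler: find 'Location:', jump into the next <td>,
    # and read up to the closing </td>. The 26 skips '<td>' plus the
    # whitespace the page padded the cell with.
    addSPos = listing.find('<td>', listing.find('Location:')) + 26
    addEPos = listing.find('</td>', addSPos)
    return listing[addSPos : addEPos]

# Made-up fragment, padded the way the real pages were at the time:
sample = '<tr><th>Location:</th><td>' + ' ' * 22 + '12 Example Street<br />Manukau</td></tr>'
print extract_address(sample).replace('<br />', '\t')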
