For a while it was my job to go through every listing and put it into a spreadsheet for some processing. After a day of this, I decided it would be easier to write a web crawler to do it all for me.
tl;dr: We need to get all the local and recent addresses off TradeMe.
Solution:
Having recently done Python at university, I decided to use that. Since all I wanted was a CLI program I could implement quickly, it seemed like a good fit. The program searches the list of listings for listing numbers and then searches each listing for its address. If it finds one, the address is appended to a txt file, which can then be imported into a spreadsheet.
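Since the crawler below writes each address as one tab-separated line, a quick way to sanity-check the txt file before importing it is to read it back with the csv module. This is just a rough sketch; the file name matches the crawler below, everything else is illustrative.
import csv
# Rough check of addressFile.txt before importing it into a spreadsheet.
# Assumes the tab-separated lines the crawler below writes out.
with open('addressFile.txt', 'rb') as addressFile:
    for row in csv.reader(addressFile, delimiter='\t'):
        print '\t'.join(row)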
Credit must go to a couple of places for this:
http://forums.devshed.com/python-programming-11/webcrawler-in-python-73318.html
http://www.daniweb.com/code/snippet216420.html
daniweb is a great source for help, although I can't stand the website itself. I found it tricky to navigate; Google helped with that.
Code:
#!C:/Program Files/Python/Python 2.7
import urllib, time
print "miCrawler version 0.1"
f = open('addressFile.txt', 'a')
# The below URL is the houses for sale in the manukau area, in list view, sorted by most recent, page 1
# May need to be updated occasionally.
tmManukau1 = 'http://www.trademe.co.nz/browse/property/regionlistings.aspx?mcat=0350-5748-3399-&v=List&key=100055&page='
tmManukau2 = '&sort_order=expiry_desc'
for i in range(0, 60):
    tmManukau = tmManukau1 + str(i) + tmManukau2
    page = urllib.urlopen(tmManukau).read()
    # Note that line 637 is the first occurrence of a url for a particular listing.
    # The 9 digits following this string will represent a listing.
    auctionString = '/Trade-me-property/Residential-property/Houses-for-sale/auction-'
    pos = page.find(auctionString)
    # Loops through this process until there are no more valid links on the page.
    while (pos != -1):
        # This will build a string for the URL of the listing found.
        listingURL = 'http://www.trademe.co.nz' + page[pos : pos + 77]
        listing = urllib.urlopen(listingURL).read()
        # Gets the address from the listing, the 26 skips the <td> and spaces.
        addSPos = listing.find('<td>', listing.find('Location:')) + 26
        addEPos = listing.find('</td>', addSPos)
        address = listing[addSPos : addEPos]
        try:
            int(address[0])
            address = address.replace('<br />', '\t')
            print (address + '\n')
            f.write(address + '\n')
        except ValueError:
            print (address + ' HAS NO NUMBER\n')
        finally:
            # Each address occurs twice, skip the second.
            pos = page.find(auctionString, page.find(auctionString, pos + 78) + 78)
f.close()
Note:
This is very specific: it only searches Manukau and the surrounding areas. Small changes to the URLs will be needed to change that. The range in the loop determines which pages are read. Also, I have no idea what that mcat number is, and it may need to be updated.
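If you do need to point it at another region, the URL pieces could be pulled into a little helper so the page number, mcat and key only live in one place. This is just a sketch using the Manukau values from above; I haven't checked what valid values for other regions look like.
# Hypothetical helper: builds the search URL for one page of results.
# The mcat and key defaults are the Manukau ones from the script above.
def search_url(page, mcat='0350-5748-3399-', key='100055'):
    return ('http://www.trademe.co.nz/browse/property/regionlistings.aspx'
            '?mcat=' + mcat + '&v=List&key=' + key +
            '&page=' + str(page) + '&sort_order=expiry_desc')
# The loop then becomes something like:
# for i in range(0, 60):
#     page = urllib.urlopen(search_url(i)).read()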
I suppose the main key here is this line:
listing = urllib.urlopen(listingURL).read()
'listing' now contains the entire HTML source of the page. Running a few search functions over it for the auction numbers lets the program find the relevant listing, and from that page find the address. Simple enough.
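If you wanted to drop the fixed-length slicing, the same search could be done with a regular expression over that HTML. A sketch, assuming the listing links look the way the 77-character slice above implies (the auction path, nine digits, then '.htm'):
import re, urllib
# Sketch: find every listing URL on a results page with a regex instead of
# repeated find() calls and fixed-length slices.
listingPattern = re.compile(
    r'/Trade-me-property/Residential-property/Houses-for-sale/auction-\d{9}\.htm')
page = urllib.urlopen(tmManukau).read()
for path in sorted(set(listingPattern.findall(page))):  # set() drops the duplicated links
    listing = urllib.urlopen('http://www.trademe.co.nz' + path).read()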
The random hard-coded numbers are there because of the page formatting; it was the easiest way I could think of to get around it. So that's that one. Never used all 1,180 addresses. Kept the boss off my back for a while though.
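For what it's worth, the '+ 26' could also be avoided by skipping just the literal '<td>' and stripping the whitespace afterwards, something like the following (untested against the live pages):
# Alternative to the hard-coded offset: only skip the literal '<td>' and
# strip the surrounding whitespace instead of counting it.
addSPos = listing.find('<td>', listing.find('Location:')) + len('<td>')
addEPos = listing.find('</td>', addSPos)
address = listing[addSPos:addEPos].replace('<br />', '\t').strip()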