Hello I am a Mac, and I was a PC. :)
Well, I guess everybody likes Apple's “Get A Mac” ad series, and so do I, but watching them online is painful because the network connection between China and the US is not very good. So I wrote a Python program, “Get The Ads”, to grab the URLs of all the “Get A Mac” .mov files.
1. First version
At the beginning I didn’t know where the Apple guys stored the file info, so my program worked this way:
from sgmllib import SGMLParser

class AdsParser(SGMLParser):
    def reset(self):
        # extend (called from __init__ in ancestor)
        # Reset all data attributes
        SGMLParser.reset(self)
        self.urls = {}

    def start_a(self, attrs):
        # called for every <a> tag in HTML source
        # Find the links
        href = [v for k, v in attrs if k == 'href']
        if href and href[0].rfind('.mov') != -1:
            # the ad name sits between the last '/' and the last '_'
            l = href[0].rfind('/') + 1
            r = href[0].rfind('_')
            name = href[0][l:r]
            self.urls[name] = href[0]
This is the parser class, which extends SGMLParser. The downloaded HTML pages are fed to it for parsing, and whenever an <a> start tag is found, the start_a(attrs) method is called. attrs is a list storing the tag's attributes in this way:
[('href', '/getamac/works.html'), ('id', 'navmoreswap')]
start_a(attrs) filters the attrs list, picks out the links that contain “.mov”, and saves them into a dictionary named urls, keyed by the ad name taken from the file name. A dictionary is used here to avoid duplicate URLs.
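For readers on Python 3, where sgmllib no longer exists, the same filtering idea can be sketched with html.parser.HTMLParser from the standard library. This is only a rough equivalent, not the original program: the class name MovLinkParser and the sample markup (including the example.com links) are made up for illustration.

```python
from html.parser import HTMLParser


class MovLinkParser(HTMLParser):
    """Collect links to .mov files, keyed by ad name (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.urls = {}

    def handle_starttag(self, tag, attrs):
        # Called for every start tag; only <a> tags are of interest here.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and ".mov" in href:
            # The ad name sits between the last '/' and the last '_'.
            left = href.rfind("/") + 1
            right = href.rfind("_")
            self.urls[href[left:right]] = href


# Made-up sample markup; the real page layout is not shown in this post.
sample = (
    '<a href="/getamac/works.html" id="navmoreswap">How it works</a>'
    '<a href="http://example.com/ads/sample-ad_480x376.mov">Watch</a>'
    '<a href="http://example.com/ads/sample-ad_480x376.mov">Watch again</a>'
)

parser = MovLinkParser()
parser.feed(sample)
print(parser.urls)
```

The second link to the same file simply overwrites the first entry, which is how the dictionary keeps duplicates out.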
2. Second version
But eventually I found that the Apple guys store the “.mov” info in a single XML file. Wow, that makes my program much simpler, so here is my second version:
#!/usr/bin/env python
__author__ = "Ben Feng(benplusplus#gmail.com)"
__copyright__ = "Copyright (c) 2007 Ben Feng"

import os
import re
import sys
import urllib
import webbrowser

class AdsParser:
    def __init__(self):
        self.site = ""
        self.urls = []

    def getfile(self):
        # Return the xml source
        try:
            sock = urllib.urlopen(self.site)
            source = sock.read()
            sock.close()
        except IOError:
            print ("Can not connect to Apple.com, "
                   "please check the internet connection.")
            sys.exit(2)
        return source

    def start(self, site):
        # parse the resource from getfile() method
        self.site = site
        source = self.getfile()
        self.urls = re.findall(r'http(?:[^ \n\r"]+)[.]mov', source)

def output(urls):
    lsize = [("HD", "848x496"), ("Large", "640x496"),
             ("Medium", "480x376"), ("Small", "320x256")]
    default = "480x376"
    outfile = "output.html"
    fsock = open(outfile, 'w')
    fsock.write("""
<html>
<head>
<title>Get The Ads</title>
</head>
<body>
<p>Get The Ads<br>-Ben Feng @ 2007<br>-benplusplus#gmail.com</p>
""")
    fsock.write("%d Ads" % len(urls))
    for (k, v) in lsize:
        fsock.write("<p><br><br>%s resolution (%s) :<br></p>" % (k, v))
        for link in urls:
            # swap the resolution part of the file name, e.g. _480x376. -> _848x496.
            link = re.sub(r'_(?:[^ /]+)\.', '_' + v + '.', link)
            fsock.write('<a href="%s" target="_blank">%s</a><br />' % (link, link))
    fsock.write("</body></html>")
    fsock.close()
    # open the generated page in the default browser
    s = "file://" + os.getcwd() + "/" + outfile
    webbrowser.open(s)

def main():
    parser = AdsParser()
    print "Connecting...Just a second"
    parser.start("http://www.apple.com/getamac/ads.xml")
    output(parser.urls)
    print "Finished.\n"

if __name__ == "__main__":
    print '\nGet The Ads \n-Ben Feng @ 2007\n'
    main()
So you can see the AdsParser class has become much slimmer. All I do is use this regular expression:
self.urls = re.findall(r'http(?:[^ \n\r"]+)[.]mov', source)
There is no loop, yet all the links I am looking for get picked out. It just works!
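To see the pattern at work without going online, here is a small Python 3 sketch. The XML fragment is made up (the real structure of ads.xml is not shown in this post), so treat the element and attribute names, and the example.com links, as assumptions.

```python
import re

# Same pattern as in the program: "http", then any run of characters that are
# not a space, newline, carriage return or double quote, then ".mov".
MOV_URL = re.compile(r'http(?:[^ \n\r"]+)[.]mov')

# Made-up XML fragment standing in for the downloaded source.
sample_xml = """
<ads>
  <ad title="First" movie="http://example.com/ads/first-ad_480x376.mov"/>
  <ad title="Second" movie="http://example.com/ads/second-ad_480x376.mov"/>
</ads>
"""

urls = MOV_URL.findall(sample_xml)
for url in urls:
    print(url)
```

Because the group is non-capturing, findall returns the whole matched URL for each hit instead of just the part inside the parentheses.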
As you can see, the output(urls) function now has even more code than the AdsParser class. urls is a list that stores the links returned by AdsParser, but they are only the medium-resolution ones.
output(urls) takes care of the output.
The two for loops generate the links for the other resolutions and write the output to an HTML file. Then, with these lines, the HTML file is displayed in the browser:
import webbrowser
s = "file://"+os.getcwd()+"/"+outfile
webbrowser.open(s)
OK, that's the whole program. I have to say Python is really good at this!
If you want the file links, you can run this program yourself, or use the list I attached to this post, which has all the “Get A Mac” links in every resolution. I like the HD ones. Enjoy! :)
http://www.elesson.com.cn/modules/ipboard/index.php?s=&showtopic=42313
