Unit2-30 get_page function for Python 3.x

In Python 3.2 there is no urllib2 module to build get_page(url) on. Instead, we import the urlopen(url, data, timeout) function from urllib.request. It takes a URL as input and returns an object that we can call read() and readline() on. We then convert the result into a string using str(), and finally call print_all_links to see the result.
Here is the code that I use in Python 3.2, which differs from Python 2.x in some respects:

from urllib.request import urlopen

def get_next_link(content):
    '''Take a string holding the HTML content of a web page and return
    the URL of the first link found in it, together with the index of
    the closing quote of that URL (the last position already parsed).'''
    start_index_link = content.find('<a href=')
    if start_index_link == -1:  # no link found in the page
        return None, 0
    start_index_quote = content.find('"', start_index_link)
    end_ind_quote = content.find('"', start_index_quote + 1)
    url = content[start_index_quote+1:end_ind_quote]
    return url, end_ind_quote

def print_all_links(page):
    while True:
        url, end_pos = get_next_link(page)
        # check if the url is anything but the empty string or None
        if url:
            print(url)
            page = page[end_pos:]
        else:
            break

def test():
    # open the URL
    html = urlopen('http://www.xkcd.com/')
    # read the raw bytes from the opened URL and store them in page
    page = html.read()
    # convert page to a string (assuming the page is UTF-8) and pass it on;
    # a bare str(page) would give the b'...' representation, not the text
    print_all_links(str(page, 'utf-8'))

test()
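
A side note on the string conversion, offered as a sketch rather than a definitive fix: in Python 3, read() returns bytes, and str() called on bytes without an encoding produces the printable representation (a leading b' and escaped newlines) rather than the decoded text. The snippet below uses a made-up byte string in place of a downloaded page and assumes the content is UTF-8.

```python
# Hypothetical sample standing in for the bytes that read() returns.
raw = b'<a href="http://example.com/">home</a>\n'

as_repr = str(raw)            # printable representation of the bytes object
as_text = str(raw, 'utf-8')   # decoded text, same as raw.decode('utf-8')

print(as_repr)
print(as_text)
```

Searching for links usually still works on the first form because the href text is mostly unchanged, but only the second form is the actual text of the page.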

Please correct me if I have misunderstood anything or done something wrong in my code. I know that it works, but I am not sure whether the way I implemented it is the right way.

asked 27 Apr '12, 19:00

Cyberax


7 Answers:

17

This is my implementation of the get_page(page) function. When I saw that function used in Professor Dave's class, I thought it was a standard function implemented in the Standard Library.
Later I realized that it is not built in, so I found the urllib2 library with its urlopen() function and read() method, and implemented it myself. I hope it will help someone. Greetings to all Udacity beginners, and have a nice and easy time learning this very useful stuff.

def get_page(page):
    import urllib2
    source = urllib2.urlopen(page)
    return source.read()

answered 03 May '12, 09:12

Drazen Lazar...

Awesome! I had the same question. It's so useful to practice this in the terminal. BTW, I now see why no one uses the mailto html anymore for e-mail addresses, scanning for them in pages is way too easy!

(22 Jan '13, 12:12)

Bruce Baisch Jr

Thanks for providing the code sample for Python 3.2. Personally I am using Python 2.7.3, since another intro text I am using avoids Python 3 for now, as does Google's free Python class (they just say to avoid Python 3 for now). But the next text I will be working through starts with Python 3, so I hope to be able to see the differences after learning them both. I do know that in Python 3 print is now a function and needs parentheses, and that the way parameters and tuples are handled has changed, so I can't wait to take a look at that sometime. But for now I am plugging along with version 2.7.3 =)

Chad

answered 27 Apr '12, 23:29

Chad W. Sisk

What are you discussing?

Could you possibly include a little preamble to postings that seem to reach way beyond CS101 Unit 2 level, just saying what you will be discussing? It's difficult to see if there's anything here that would help me with the problem I'm working on (missing test expression and 'else command' as required for Unit2-30 quiz).

answered 30 Apr '12, 05:58

Christian Mi...

1

I guess the info I need would be how the problem you are clarifying relates to the problem as stated in the quiz.

(30 Apr '12, 06:01)

Christian Mi...

1

This is not specifically about Unit2-30, but if you do a Google search for "get_page python 3" it is one of the first things that comes up. I am using it in Unit3-34. At some point during the course you might find yourself trying to write code on your own computer. When you do, the get_page function will come in handy.

(12 May '12, 11:42)

Marlen Brunner

I use the following function to implement get_page. The main change I made was converting the source/html to a string (note the use of decode [many thanks to http://groups.google.com/group/comp.lang.python/browse_thread/thread/b88239182f368505 for this fix]). Using the "with" keyword ensures the page is closed after you are done reading from it.

def get_page(page):
    from urllib.request import urlopen
    with urlopen(page) as f:
        html = f.read().decode()
    return html
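
As a possible variation on the function above: decode() with no argument assumes UTF-8, which can fail on pages served in another encoding. Here is a minimal sketch that asks the response headers for the declared charset and falls back to UTF-8. The name get_page_with_charset and the fallback policy are illustrative assumptions, not part of the course code.

```python
from urllib.request import urlopen


def get_page_with_charset(url, default='utf-8'):
    """Return the page at url as text (hypothetical helper)."""
    with urlopen(url) as response:
        # headers behaves like an email.message.Message; get_content_charset()
        # returns the charset named in Content-Type, or None if there is none.
        charset = response.headers.get_content_charset() or default
        # errors='replace' keeps going if a few bytes do not match the charset.
        return response.read().decode(charset, errors='replace')
```

It would be called the same way as get_page, for example get_page_with_charset('http://www.xkcd.com/').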

answered 12 May '12, 11:36

Marlen Brunner

edited 12 May '12, 12:51

I tried Marlen Brunner's function but kept getting this error message:


Traceback (most recent call last):
  File "vm_main.py", line 26, in <module>
    import main
  File "/tmp/vmuser_yylzngfzkx/main.py", line 52, in <module>
    get_page(page)
  File "/tmp/vmuser_yylzngfzkx/main.py", line 45, in get_page
    from urllib.request import urlopen
ImportError: No module named request


I also tried Cyberax's and Drazen Lazar's functions and got similar responses. I think it has to do with the imports. If this method is covered in a later unit, then I'll just learn it then.
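
The message "No module named request" is what Python 2 reports for this import, since urllib.request only exists in Python 3. That suggests the code was run by a Python 2 interpreter. One hedged way to write get_page so that it runs under either version is sketched below. The try/except import and the use of contextlib.closing are assumptions about what fits here, not something taken from the course.

```python
from contextlib import closing

try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2


def get_page(url):
    # closing() calls close() on the response in both versions
    with closing(urlopen(url)) as source:
        return source.read()   # bytes in Python 3, str in Python 2
```

Under Python 3 the result still needs decoding before it can be searched as text.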

answered 15 May '12, 15:33

Daniel Ferguson

@cyberax: would

import urllib.request
from urllib.request import urlopen

be a better idea?
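
For what it's worth, the two import statements bind the same function under different names, so either one alone is enough. A quick check, as a sketch:

```python
import urllib.request                 # use as urllib.request.urlopen(...)
from urllib.request import urlopen    # use as plain urlopen(...)

# Both names refer to the same function object.
same = urllib.request.urlopen is urlopen
print(same)
```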

answered 14 Jan '13, 04:01

Paritosh Tri...

@cyberax I tried your code. It doesn't work :/ I don't know why.

answered 27 Apr '13, 12:58

adhiti-1
