Topic: I need some python help. Is anyone strong in web scraping with Python?
CelesMyUserName
06/03/19 6:16:00 AM
#10:


Just some quick things about the requests usage before I sleep - I originally just did single urllib pulls one after another to open pages on their own, but then GameFAQs didn't really like that and eventually stopped it from even working that way... that's what led me to the importance of making a session for all your coming requests - just like you have a "session" whenever you open your browser or w/e.

My Gfaqs requests handler started out just for that, something to easily initialize a session for me to make my requests through.

You can use requests directly to just open a page - like if I only wanted to pull from my homepage, I could do this

import requests

# one-off GET request, no session
r = requests.get('http://gmun.moe')


... and that's the same r response I pull the text from in my earlier screenshotted example.
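Just to show what you'd actually do with that r - these are the standard requests attributes, nothing custom of mine:

import requests

r = requests.get('http://gmun.moe')
print(r.status_code)  # 200 means the page loaded fine
html = r.text         # the page's HTML as one string, ready to parse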

To make a session though - at least from refreshing my memory looking at my gfaqr code - you should do this

import requests

# a Session keeps state (cookies etc.) across all your requests
s = requests.Session()
r = s.get('http://gmun.moe')


... and then your information (like logins and cookies) is retained across requests through the session much more easily.
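A minimal sketch of what that buys you - httpbin.org here is just a public echo service I'm using for illustration, not anything from my gfaqr code:

import requests

s = requests.Session()
# the first request sets a cookie; the session holds onto it
s.get('https://httpbin.org/cookies/set/example/hello')
# later requests through the same session send the cookie back automatically
r = s.get('https://httpbin.org/cookies')
print(r.text)  # should show {"cookies": {"example": "hello"}}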

Speaking of, you'll also want to initialize your headers and follow the robots rules.

Websites don't like unidentified or misidentified bots, and you may get IP banned if you break the rules like that. The simple fix is to identify yourself in your request headers - specifically the 'User-Agent' value.

Here's Google's for example: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

So before using your session (with the get method), just init once with this

headers = { "User-Agent":'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' }
self.s.headers.update(headers)


... except not with Google's info - just replace "Googlebot" with your bot's name, and if you can, a reference link to information about the bot is nice.
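So yours might end up looking something like this - "Wigsbot" and the info URL are just placeholders, put your own bot's name and page there:

headers = { "User-Agent": 'Mozilla/5.0 (compatible; Wigsbot/1.0; +http://gmun.moe/wigsbot.html)' }
s.headers.update(headers)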

And to be an extra nice bot, you should know about following the rules. GameFAQs, for example, tells bots where they're allowed to go with this

https://gamefaqs.gamespot.com/robots.txt

Sites generally have their own robots.txt page off the root index the same way, so before bots read a URL they're supposed to check it to see if they have permission first. To do that, there's the robotparser in Python's urllib

from urllib import robotparser

host = 'https://gamefaqs.gamespot.com'  # the site's root
url = host + '/boards'                  # example: the page you want to fetch

rp = robotparser.RobotFileParser()
rp.set_url(host + "/robots.txt")
rp.read()
# check both the wildcard rules and the rules for your bot's name
# ("Wigsbot" = your bot's name from the headers, like "Googlebot")
permission = rp.can_fetch("*", url)
permission &= rp.can_fetch("Wigsbot", url)

And that's supposed to be followed up with something like

if permission:
    r = s.get(url)
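
Putting all of that together, a minimal sketch of the whole flow - 'Wigsbot' and the target URL are placeholders like before:

import requests
from urllib import robotparser

host = 'https://gamefaqs.gamespot.com'
url = host + '/boards'  # whatever page you actually want

# session with an honest User-Agent
s = requests.Session()
s.headers.update({
    "User-Agent": 'Mozilla/5.0 (compatible; Wigsbot/1.0; +http://gmun.moe/wigsbot.html)'
})

# ask robots.txt before fetching
rp = robotparser.RobotFileParser()
rp.set_url(host + '/robots.txt')
rp.read()

if rp.can_fetch('Wigsbot', url):
    r = s.get(url)
    print(r.status_code)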


Yada-yada-yada.

whoops this wasn't quick at all it's 6am now
---
https://imgtc.com/i/1LkkaGU.jpg
somethin somethin hung somethin horse somethin