Topic: I need some python help. Is anyone strong in web scraping with Python?
CelesMyUserName
06/03/19 6:16:00 AM
#10:


Just some quick things about the requests usage before I sleep - I originally just did single urllib pulls one after another to open pages on their own, but then GameFAQs didn't really like that and eventually stopped it from even working that way... that's what led me to the importance of making a session for all your coming requests - just like you have a "session" whenever you open your browser or w/e.

My Gfaqs requests handler started out just for that, something to easily initialize a session for me to make my requests through.

You can use requests directly to just open a page - like if I only wanted to pull from my homepage, I could do this

import requests

# one-off GET request, no session
r = requests.get('http://gmun.moe')


... and that's the same r response I pull the text from in my earlier screenshotted example.
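Just to show what you'd actually do with that r - these are the standard requests attributes, nothing custom of mine:

import requests

r = requests.get('http://gmun.moe')
print(r.status_code)  # 200 means the page loaded fine
html = r.text         # the page's HTML as one string, ready to parse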

To make a session though - at least from refreshing my memory looking at my gfaqr code - you should do this

import requests

# a Session keeps state (cookies etc.) across all your requests
s = requests.Session()
r = s.get('http://gmun.moe')


... and then your information (like logins and cookies) is retained across requests through the session much more easily.
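A minimal sketch of what that buys you - httpbin.org here is just a public echo service I'm using for illustration, not anything from my gfaqr code:

import requests

s = requests.Session()
# the first request sets a cookie; the session holds onto it
s.get('https://httpbin.org/cookies/set/example/hello')
# later requests through the same session send the cookie back automatically
r = s.get('https://httpbin.org/cookies')
print(r.text)  # should show {"cookies": {"example": "hello"}}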

Speaking of, you'll also want to initialize your headers and follow the robots rules.

Websites don't like unidentified or misidentified bots, and you may get IP banned if you break the rules like that. The simple fix is to identify yourself in your request headers - specifically the 'User-Agent' value.

Here's Google's for example: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

So before using your session (with the get method), just init once with this

headers = { "User-Agent":'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' }
self.s.headers.update(headers)


... except not with Google's info - just replace "Googlebot" with your bot's name, and if you can, a reference link to information about the bot is nice.
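So yours might end up looking something like this - "Wigsbot" and the info URL are just placeholders, put your own bot's name and page there:

headers = { "User-Agent": 'Mozilla/5.0 (compatible; Wigsbot/1.0; +http://gmun.moe/wigsbot.html)' }
s.headers.update(headers)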

And to be an extra nice bot, you should know about following the rules. GameFAQs, for example, tells bots where they're allowed to go with this

https://gamefaqs.gamespot.com/robots.txt

Sites generally have their own robots.txt page off the root index the same way, so before bots read a URL they're supposed to check it to see if they have permission first. To do that, there's the robotparser in Python's urllib

from urllib import robotparser

host = 'https://gamefaqs.gamespot.com'  # the site's root
url = host + '/boards'                  # example: the page you want to fetch

rp = robotparser.RobotFileParser()
rp.set_url(host + "/robots.txt")
rp.read()
# check both the wildcard rules and the rules for your bot's name
# ("Wigsbot" = your bot's name from the headers, like "Googlebot")
permission = rp.can_fetch("*", url)
permission &= rp.can_fetch("Wigsbot", url)

And that's supposed to be followed up with something like

if permission:
    r = s.get(url)
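
Putting all of that together, a minimal sketch of the whole flow - 'Wigsbot' and the target URL are placeholders like before:

import requests
from urllib import robotparser

host = 'https://gamefaqs.gamespot.com'
url = host + '/boards'  # whatever page you actually want

# session with an honest User-Agent
s = requests.Session()
s.headers.update({
    "User-Agent": 'Mozilla/5.0 (compatible; Wigsbot/1.0; +http://gmun.moe/wigsbot.html)'
})

# ask robots.txt before fetching
rp = robotparser.RobotFileParser()
rp.set_url(host + '/robots.txt')
rp.read()

if rp.can_fetch('Wigsbot', url):
    r = s.get(url)
    print(r.status_code)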


Yada-yada-yada.

whoops this wasn't quick at all it's 6am now
---
https://imgtc.com/i/1LkkaGU.jpg
somethin somethin hung somethin horse somethin