LogFAQs > #922722405


Topic List	Page List: 1
Topic	I need some python help. Is anyone strong in web scraping with Python?
CelesMyUserName 06/03/19 5:39:47 AM #9:	Yeap, the lfaqs bot that archives gamefaqs boards/topics is all python, so I have a few things to say.

You're on the right page with requests and soup. Frankly I've only recently bothered to parse through soup, but it's absolutely a much nicer tool. When I started I only wanted to get things up and running fast, so I stuck with indexing the returned text string of pages directly.

I'll describe the process of scraping GameFAQs posts, which you can think of as standing in for the articles you'd be traversing -- first using only indexing, and then explaining how soup makes it easier.

In a GameFAQs topic, all posts are within a <table></table> element, which I pull out first. With basic python string slicing you can use `stringname[ start_index : end_index ]`, so pulling that table looks like `table_string = fullpage[ fullpage.index('<table ') : fullpage.index('</table') ]`. That outputs the string chunk containing all posts and nothing further (topic pages also use similar tags later on for the "More topics from this board..." thing, so you only want the contents of that first <table>, which ends at the first </table>).

Each post is contained within its own <tr></tr> element, and you can traverse them as an array by splitting the table string on the </tr> tag with python's split method for strings, like `for posttext in table_string.split('</tr>'):` ..... [ in here you scrape the bits of info you want from the individual post you're on ]

Within that for loop you'd use the same idea to pull the post's author info, date posted, msg contents, etc. and collect the data; the next iteration does the next post, yada-yada.

Instead of indexing tho... BeautifulSoup traversal!
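Before getting into soup, here is a minimal sketch of the indexing-only approach described above. The `fullpage` markup, the `class` names, and the `between` helper are all made up for illustration; real GameFAQs pages are laid out differently, so treat the markers as placeholders you'd swap for whatever the actual page uses.

```python
# Hypothetical stand-in for the text of a fetched topic page; the real
# GameFAQs markup is more involved than this.
fullpage = """
<html><body>
<table class="board">
<tr><td class="author">UserA</td><td class="msg">first post</td></tr>
<tr><td class="author">UserB</td><td class="msg">second post</td></tr>
</table>
<table class="more"><tr><td class="author">nope</td><td class="msg">other topic</td></tr></table>
</body></html>
"""


def between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker."""
    start = text.index(start_marker) + len(start_marker)
    end = text.index(end_marker, start)
    return text[start:end]


# Slice out the first <table>, stopping at the first </table so the later
# "more topics" table is left behind.
table_string = fullpage[fullpage.index('<table '):fullpage.index('</table')]

posts = []
for posttext in table_string.split('</tr>'):
    # The chunk after the last </tr> has no post in it, so skip it.
    if '<tr' not in posttext:
        continue
    posts.append({
        'author': between(posttext, '<td class="author">', '</td>'),
        'message': between(posttext, '<td class="msg">', '</td>'),
    })

print(posts)
```

Note that `str.index` raises `ValueError` when a marker is missing, so a page with an unexpected layout fails loudly instead of silently returning junk.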
Soup makes it so you don't have to index around for these tags. Once you convert a page to a soup you can get to its first <table> element with soup.table... and that result is treated basically like a sub-soup you can pull its contained tags from, such as the <tr> tags in the <table>. So instead of searching for indexes and cutting out the string, you can directly pull out the table element, then pull out its <tr>s.

You can also easily pull out attribute info (like the 'href' value of an <a> link, or the tag's 'class' info) with [tag].get(attr_name), and the text contained by the tag with [tag].string... in some cases you might need to use next_element instead to get the inner contents (.string comes back None when the tag has more than one child), or maybe just use that to begin with <_<

Anyhow, I made a quick example run in the python cmd line of me traversing this topic with soup. In it I pull the first occurrence of a tag, like soup.table, pull an array of all occurrences with find_all('tr'), and pull equal-level following siblings of tags with next_sibling... from those methods you can get the gist.

Of course, this is all after you've pulled the text of a page using requests first, which I made a specific GameFAQs handler for because of reasons >_> I probably should've covered that first, but it's late and this is more specific to scraping / parsing the data.

---
https://imgtc.com/i/1LkkaGU.jpg
somethin somethin hung somethin horse somethin
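For reference, the soup traversal described in the post might look roughly like the sketch below. It is not the original command-line run; the `page_text` markup, class names, and dictionary keys are invented for illustration, and it assumes the `beautifulsoup4` package is installed (the built-in "html.parser" backend is used so nothing else is needed).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page text you'd normally get back from requests.
page_text = """
<html><body>
<table class="board">
<tr><td class="author"><a href="/users/a">UserA</a></td><td class="msg">first post</td></tr>
<tr><td class="author"><a href="/users/b">UserB</a></td><td class="msg">second post</td></tr>
</table>
<table class="more"><tr><td>other topic</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(page_text, "html.parser")

table = soup.table            # first <table> on the page
rows = table.find_all("tr")   # every <tr> inside that table only

posts = []
for row in rows:
    link = row.a                          # first <a> inside this row
    author_cell = row.td                  # first <td> inside this row
    # next_sibling is the very next node; here there is no whitespace
    # between the two <td>s, so it is the message cell.
    msg_cell = author_cell.next_sibling
    posts.append({
        "author": str(link.string),
        "profile": link.get("href"),
        "cell_class": author_cell.get("class"),  # class comes back as a list
        "message": str(msg_cell.string),
    })

print(posts)
```

One thing to watch for: when the markup has whitespace or newlines between tags, next_sibling returns that whitespace as a text node rather than the next tag. In that case `find_next_sibling("td")` skips straight to the next matching tag.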