Board 8 > I need some Python help. Is anyone strong in web scraping with Python?

WiggumFan267
06/02/19 10:47:07 PM
#1:


I have a task I have to do, basically it is to scrape a reddit-like page (https://news.ycombinator.com/) and pull together a list of articles with:
-Article name
-URL
-Submitter
-Submission time
-Number of upvote points
-Number of comments

I am a Python semi-newb and completely new to anything like web scraping, so wanted to know if anyone could help?

After researching some articles, I got as far as importing the packages "requests" and "BeautifulSoup", but I'm really not sure at all how to use these packages to get what I need.

I see that, for example:
-Article name always appears between the strings 'class="storylink">' and '</a>'
-URL is between '<a href=' and 'class="storylink">'
-Submitter is between 'class="hnuser">' and '</a>'
-Submission time/upvote points/comments I'm not sure about, since they always seem to be preceded by a unique ID

So I don't know if this idea is the right track or not; if it is, how do I go about it in general, and if not, is there a better way to do it? I guess I would just search for each group of text strings and grab what's between them, but I'm not sure if that's right, or exactly how to do that.

Any help would be appreciated.
---
~Wigs~ 3-Time Consecutive Fantasy B8 Baseball Champion
2015 NATIONAL LEAGUE CHAMPION NEW YORK METS
WiggumFan267
06/02/19 10:48:54 PM
#2:


feel free to shout me on Discord too if that's easier for you
---
~Wigs~ 3-Time Consecutive Fantasy B8 Baseball Champion
2015 NATIONAL LEAGUE CHAMPION NEW YORK METS
PIayer_0
06/03/19 12:23:33 AM
#3:


Not a Python user, but if you want to scrape from major websites, it's always easiest to check if they have a public API first and just use that. You're in luck here, as I found https://github.com/HackerNews/API

For example, I can get the front page from https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty

At the time of posting, the top article was 20079671, so I can go to https://hacker-news.firebaseio.com/v0/item/20079671.json?print=pretty

I believe requests is your standard HTTP client, so that's definitely a good place to start. BeautifulSoup seems to be for cleaning up "ugly" HTML and XML pages, but if you have access to a nice API you might as well use that instead. You'll probably need to import json if you want to go this route.
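
Something like this might be all it takes (a sketch from those endpoints; field names like 'descendants' for the comment count come from the API's README, and requests can actually parse the JSON itself with r.json()):

import requests

# Front page: a JSON array of story IDs
top_ids = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json').json()

# One story: a JSON object with title, by, time (Unix), score, descendants (comments), url
item = requests.get('https://hacker-news.firebaseio.com/v0/item/%d.json' % top_ids[0]).json()
print(item['title'], item.get('url'), item['by'], item['time'], item['score'], item.get('descendants'))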
---
-Abraham Lincoln
Not Dave
06/03/19 12:35:32 AM
#4:


I've used Python and BeautifulSoup to scrape, but have honestly just stumbled my way through it using Stack Overflow answers and trial and error, so I'm not a big help

once you get it working, it works really well, though
---
ND
WiggumFan267
06/03/19 12:36:02 AM
#5:


The assignment here is specifically to scrape using Python.
newbie has helped me a solid way along, so I'm getting closer to getting what I need now, but not 100% there yet
---
~Wigs~ 3-Time Consecutive Fantasy B8 Baseball Champion
2015 NATIONAL LEAGUE CHAMPION NEW YORK METS
banananor
06/03/19 2:11:52 AM
#6:


If you're specifically required to scrape, I'm surprised your instructor didn't teach you how to do that
---
You did indeed stab me in the back. However, you are only level one, whilst I am level 50. That means I should remain uninjured.
Ngamer64
06/03/19 2:59:34 AM
#7:


GMUN scrapes B8, could be a good resource.


---
Advo bested even The Show hosts in 2018, yowza!
board8.wikia.com | thengamer.com/xstats
Steiner
06/03/19 3:43:09 AM
#8:


banananor posted...
If you're specifically required to scrape, I'm surprised your instructor didn't teach you how to do that


this isn't college, it's work - you just get given stuff to do that seems vaguely in your wheelhouse and have to act like a pro!
---
Advokaiser makes me feel eternal. All this pain is an illusion.
CelesMyUserName
06/03/19 5:39:47 AM
#9:


Yeap, the lfaqs bot that archives GameFAQs boards/topics is all Python, so I have a few things to say.

You're on the right page with requests and soup. Frankly I've only recently been bothering to parse through soup, but it's absolutely a much nicer tool; it's just that when I started I only wanted to get things up and running fast, so I stuck with indexing the returned text string of pages directly.

I'll just describe the process of scraping GameFAQs posts, which you can understand as the articles you'd be traversing instead of posts -- first describing it all using only indexing, and then explaining how soup makes it easier.

In a GameFAQs topic, all posts are within a <table></table> element, which I pull out first.

So with just basic Python string indexing, stringname[start_index : end_index], pulling that table looks like: table_string = fullpage[fullpage.index('<table ') : fullpage.index('</table')]. That outputs the string chunk containing all the posts and nothing further (topic pages also use similar tags later on for the "More topics from this board..." thing, so you only want the contents of that first <table>, which ends at the first </table>).

Each post is then contained within its own <tr></tr> element, and you can traverse the posts as an array of occurrences by splitting the table string on the </tr> tag with Python's split method for strings.

so like

for posttext in table_string.split('</tr>'):
    # in here you scrape the bits of info you want from the individual post you're traversing

And within that for loop traversing the individual posts, you'd use the same idea to pull each post's author info, date posted, msg contents, etc. and collect the data; the next loop iteration does the next post, yada-yada.
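
To make that concrete, here's a rough sketch of the same indexing idea pointed at your HN page instead (using the marker strings from your first post, and assuming the raw markup keeps href before class the way you observed - brittle since it's tied to that exact markup, but it shows the technique):

import requests

fullpage = requests.get('https://news.ycombinator.com/').text

# Each chunk after the split starts just inside one '<a href="...' tag
for chunk in fullpage.split('<a href="')[1:]:
    tag_end = chunk.index('>')
    if 'class="storylink"' not in chunk[:tag_end]:
        continue                                      # some other link, skip it
    url = chunk[:chunk.index('"')]                    # up to the closing quote of the href
    title = chunk[tag_end + 1 : chunk.index('</a>')]  # between the tag close and </a>
    print(title, '->', url)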

Instead of indexing tho... BeautifulSoup traversal!

Soup makes it so you don't have to index around for these tags. Once you convert a page to a soup, for example, you can just get to its first <table> element with soup.table... and that result is treated basically like a sub-soup you can pull its contained tags from... such as the <tr> tags in the <table>.

So instead of searching for and cutting out string indexes, you can just directly pull out the table element, then pull out its <tr>s.

You can also easily pull out attribute info (like the 'href' value of an <a> link, or the tag's 'class' info) with [tag].get(attr_name), or the text contained by the tag with the tag's .string field... in some cases you might need to use next_element instead to get the inner contents, or maybe just use that to begin with <_<

Anyhow, I made a quick example run in the Python command line of me just traversing this topic with soup:

[screenshot: cB2dKuE]

So in that I pull the first occurrences of tags, like soup.table, pull an array of all occurrences with find_all('tr'), and pull equal-level following siblings of tags with next_sibling... from those methods you can get the gist
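
And rough equivalents for your HN page (a sketch; note that plain next_sibling can land on whitespace text nodes, so find_next_sibling('tr') is the safer spelling):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://news.ycombinator.com/').text, 'html.parser')

table = soup.table                         # first <table> on the page
rows = table.find_all('tr')                # every <tr> under it
first = table.tr                           # first <tr>, same idea as soup.table
after = first.find_next_sibling('tr')      # the row right after it
link = rows[2].find('a')                   # hypothetical: grab an <a> out of some row
if link is not None:
    print(link.get('href'), link.string)   # attribute value, contained text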

Of course, this is all after you've pulled the text of a page using requests, which I made a specific GameFAQs handler for because of reasons >_> I probably should've covered that first, but it's late and this is more specific to scraping / parsing the data.
---
https://imgtc.com/i/1LkkaGU.jpg
somethin somethin hung somethin horse somethin
CelesMyUserName
06/03/19 6:16:00 AM
#10:


Just some quick things about the requests usage before I sleep - I originally just did single urllib pulls one after another to open pages on their own, but then GameFAQs didn't really like that and eventually it stopped even working that way... that's what led me to the importance of making a session for all your coming requests - just like you have a "session" whenever you open your browser or w/e.

My Gfaqs requests handler started out just for that, something to easily initialize a session for me to make my requests through.

You can use requests directly to just open a page; like, if I only wanted to pull from my homepage, I could do this

import requests
r= requests.get('http://gmun.moe')


... and that's the same r response I pull the text from in my earlier screenshotted example.

To make a session though (at least, refreshing my memory by looking at my gfaqr code), you should do this

import requests
s= requests.session()
r= s.get('http://gmun.moe')


... and then more of your information (like logins and cookies) is retained across the session's requests.

Speaking of, you'll also want to initialize your headers and follow robot rules.

Websites don't like unidentified or misidentified bots, and you may get IP banned if you flout the rules. The simple fix: identify yourself in your requests headers, specifically the 'User-Agent' value.

Here's Google's for example: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

So before using your session (with the get method), just init once with this

headers = { "User-Agent":'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' }
s.headers.update(headers)


... except not with Google's info; just replace "Googlebot" with your bot's name, and if you can, a reference link to information about the bot is nice.

And to be an extra nice bot, you should know about following the rules. GameFAQs, for example, tells bots where they're allowed to go with this:

https://gamefaqs.gamespot.com/robots.txt

Sites generally have their own robots.txt page at the root index the same way, so before bots read a URL they're supposed to check it to see if they have permission first. The tool for that is the robotparser in Python's urllib:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url(host + "/robots.txt")
rp.read()
permission = rp.can_fetch("*", url)
permission &= rp.can_fetch("Wigsbot", url)

(where "Wigsbot" is your bot's name from the headers, like "Googlebot", and 'host' is something like 'https://gamefaqs.gamespot.com')

And that's supposed to be followed up with something like

if permission:
    r = s.get(url)


Yada-yada-yada.

whoops, this wasn't quick at all, it's 6am now
---
https://imgtc.com/i/1LkkaGU.jpg
somethin somethin hung somethin horse somethin
CelesMyUserName
06/03/19 6:34:36 AM
#11:


Eh

Basically, my gfaqr just initializes that basic requests stuff for me, and I just ask it to open whatever URL; it returns the response just as the most simplistic requests.get(url) did.

In my Gfaqr() class's init function, then (well, showing the general, relevant code):

import requests
from urllib import robotparser

class Gfaqr():
    def __init__(self):
        self.s = requests.session()
        headers = {"User-Agent": 'Mozilla/5.0 (compatible; Wigsbot/5.0; +https://penisland.net/bot.html)'}
        self.s.headers.update(headers)
        self._rp = robotparser.RobotFileParser()

    def permission(self, host, stem):
        url = host + stem
        self._rp.set_url(host + "/robots.txt")
        self._rp.read()
        permission = self._rp.can_fetch("*", url)
        permission &= self._rp.can_fetch("Wigsbot", url)
        return permission

    def sget(self, host, stem):
        if not self.permission(host, stem):
            return 1
        r = self.s.get(host + stem)
        return r


And then all you need is to import your class and use its own get method instead of requests.get or even s.get. That's how I use Gfaqr in that earlier screencap, assigning it to something like fq = Gfaqr() and using it with fq.sget(url).

... Of course, mine has other stuff as well, plus it defaults the host to GameFAQs since that's all it's for, and it has some specific functions for browsing GameFAQs pages built in, but yeah.
---
https://imgtc.com/i/1LkkaGU.jpg
somethin somethin hung somethin horse somethin
WiggumFan267
06/03/19 11:28:43 AM
#12:


That is helpful, thanks. I wound up mostly finishing the code last night in BeautifulSoup; my main issue came when there wasn't a good way to distinguish some bits of the HTML from others, like when the same tags are used. There usually happened to be a workaround, but not always.
---
~Wigs~ 3-Time Consecutive Fantasy B8 Baseball Champion
2015 NATIONAL LEAGUE CHAMPION NEW YORK METS
foolm0r0n
06/03/19 12:04:05 PM
#13:


Use an HTML parser to build a DOM tree from the HTML text
Query the DOM tree for your data

That's it
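
In this thread's terms that's roughly (soup.select takes CSS selectors):

import requests
from bs4 import BeautifulSoup

html_text = requests.get('https://news.ycombinator.com/').text
soup = BeautifulSoup(html_text, 'html.parser')                # build the DOM tree
titles = [a.get_text() for a in soup.select('a.storylink')]   # query it for your data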
---
_foolmo_
2 + 2 = 4
foolm0r0n
06/03/19 12:09:32 PM
#14:


WiggumFan267 posted...
That is helpful, thanks. I wound up mostly finishing the code last night in BeautifulSoup; my main issue came when there wasn't a good way to distinguish some bits of the HTML from others, like when the same tags are used. There usually happened to be a workaround, but not always.

There's always some way to distinguish elements. At the least, there's the element's position in the tree (first child, second child, etc). If that position isn't useful for whatever reason, then you can look at the contents. Maybe you want the element that has 1 text node and 2 span node children.

If the elements are all structured exactly the same and that's not possible, then you can go even further by checking the content. So if you have 2 text elements that are in randomized order, one for name and one for phone number, then you check the element for text that looks like a phone number vs looks like a name.

If that's not enough to distinguish the elements, then they are logically equivalent so it doesn't make sense to distinguish them anyways.
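
Roughly, both ideas in soup (toy markup; the phone-number regex is just an illustration):

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div>note<span>Alice</span><span>555-1234</span></div>', 'html.parser')
spans = soup.div.find_all('span')

by_position = spans[1]                            # second <span> child
by_content = next(s for s in spans                # the one whose text looks like a phone number
                  if re.fullmatch(r'[\d-]+', s.get_text()))
print(by_position.get_text(), by_content.get_text())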
---
_foolmo_
2 + 2 = 4
WiggumFan267
06/03/19 2:53:05 PM
#15:


So I had a block that was like:

https://cdn.discordapp.com/attachments/280814718069899285/585175733739978753/2019-06-03_1439.png
(also posted the code plaintext, but gfaqs doesn't keep tabbing apparently)


<tr class="athing" id="20083793">
<td align="right" class="title" valign="top">
<span class="rank">
30.
</span>
</td>
<td class="votelinks" valign="top">
<center>
<a href="vote?id=20083793&amp;how=up&amp;goto=news" id="up_20083793">
<div class="votearrow" title="upvote">
</div>
</a>
</center>
</td>
<td class="title">
<a class="storylink" href="item?id=20083793">
Ask HN: Who wants to be hired? (June 2019)
</a>
</td>
</tr>
<tr>
<td colspan="2">
</td>
<td class="subtext">
<span class="score" id="score_20083793">
60 points
</span>
by
<a class="hnuser" href="user?id=whoishiring">
whoishiring
</a>
<span class="age">
<a href="item?id=20083793">
3 hours ago
</a>
</span>
<span id="unv_20083793">
</span>
|
<a href="hide?id=20083793&amp;goto=news">
hide
</a>
|
<a href="item?id=20083793">
84 comments
</a>
</td>
</tr>


I wasn't able to extract the "84 comments" here because the same href appears in 3 spots in <a>. Assume this was properly set to match anything that starts with that "item?" string, not the particular example I listed, but let's just say I care about this single block for now.

For comments string, I had something like:
for row in soup.findAll("a", {"href" : "item?id=20083793"}):
print(row.get_text())


By comparison, for submission_time I had:
for row in soup.findAll("span", {"class" , "age"}):
print(row.get_text())


I had to have the first one there with a colon, and the 2nd one with a comma - so part of my issue is I'm not sure why that is.

Either way, once I had a colon on that first one, it returned all 3 lines - the Article Title, "3 hours ago", and "84 comments". I tried adding to the first one:

if not row["class"] in ["age", "storylink"]:

but this didn't work (I don't remember why... I think it threw an error, but I didn't save that).

So what I wound up doing was just grabbing the whole subtext line and parsing out by the "|"

for row in soup.findAll("td", {"class" , "subtext"}):
if row != "":
comments = row.get_text()
print(comments.split("| hide | ",1)[1])


and this worked just fine, if not the greatest way to do it.

But I am curious: what would be the way to isolate it the "normal" way?
---
~Wigs~ 3-Time Consecutive Fantasy B8 Baseball Champion
2015 NATIONAL LEAGUE CHAMPION NEW YORK METS
CelesMyUserName
06/03/19 3:52:54 PM
#17:


*reposts since I made too many edits and missed 1 last copy/paste fail >_>*

WiggumFan267 posted...
I had to have the first one there with a colon, and the 2nd one with a comma - so part of my issue is I'm not sure why that is.

That's because of the way classes are specifically handled in HTML: elements can be given multiple "classes" so they can follow a combination of different rules provided by CSS for those classes.

Like for example, looking at this page on GameFAQs again, the div containing post message text has two classes:

<div class="msg_body newbeta" data-msgnum="2" name="922713781">
    feel free to shout me on Discord too if that's easier for you
    [other div stuff for sigs]
</div>


The class is an array of two values - ["msg_body", "newbeta"] - rather than just the one string value you'd expect, "msg_body newbeta", which is the way all other fields are generally handled ("href" is just the string "item?id=20083793").

With that in mind, you can look for the class using a colon just like with the a href; you just need to put the "age" value within an array.

so instead of

for row in soup.findAll("span", {"class" : "age" }):
or
for row in soup.findAll("span", {"class" , "age" }):
it's
for row in soup.findAll("span", {"class" : ["age"] }):

*edit*... except I think it's actually "find_all()"?

Because the class value is an array, which would also be what's throwing off your attempt later

if not row["class"] in ["age", "storylink"]:

So what's going on here? Well testing an element with the class "age"...

You're expecting:
    "age" in ["age", "storylink"]
    True - the string "age" is in the array of strings ["age", "storylink"]

What's happening:
    ["age"] in ["age", "storylink"]
    False - the array ["age"] is not in the array of strings ["age", "storylink"]


Instead of comparing the class to an array of string values you're filtering out, you'd have to compare it to an array of array values you're filtering out:

row['class'] in [ ["age"], ["storylink"] ]

and when provided the element with a class "age"... True, the array ["age"] is in [ ["age"], ["storylink"] ]
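
You can check that straight from the interpreter (a quick sketch; the one-tag soup just stands in for any row you pulled):

from bs4 import BeautifulSoup

row = BeautifulSoup('<span class="age">3 hours ago</span>', 'html.parser').span
print(row['class'])                              # ['age'] - a list, not a string
print(row['class'] in ["age", "storylink"])      # False: a list isn't in a list of strings
print(row['class'] in [["age"], ["storylink"]])  # True: it is in a list of lists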
---
https://imgtc.com/i/1LkkaGU.jpg
somethin somethin hung somethin horse somethin
foolm0r0n
06/03/19 4:27:16 PM
#18:


WiggumFan267 posted...
I wasn't able to extract the "84 comments" here because the same href appears in 3 spots in <a>

Pick the last <a> inside <td class="subtext">. It should always be the comments link.
---
_foolmo_
2 + 2 = 4
foolm0r0n
06/03/19 4:34:19 PM
#19:


soup.find('td', {'class': 'subtext'}).find_all('a')[-1].string

This library's syntax sucks though
---
_foolmo_
2 + 2 = 4
Xiahou Shake
06/03/19 5:00:04 PM
#20:


Man, I'm so fascinated by programming and would love to learn it but I have no idea how to start or even why I'd be doing it. Like, most of the appeal to me is in cracking logical puzzles and there are a shitload of programming games on Steam so maybe I should just play those.

I have nothing to contribute here but just wanted to express wonder at the talent and knowledge we have here on the board.~
---
Let the voice of love take you higher,
With this gathering power, go beyond even time!
foolm0r0n
06/03/19 5:26:19 PM
#21:


The latest Zachtronics games are exactly programming. It's not even a metaphor anymore. Baba Is You is also really great if you do want a more metaphorical intro.
---
_foolmo_
2 + 2 = 4
WiggumFan267
06/03/19 5:27:39 PM
#22:


Ahh, that makes more sense now, I think. Thanks @foolmo and GMUN.
---
~Wigs~ 3-Time Consecutive Fantasy B8 Baseball Champion
2015 NATIONAL LEAGUE CHAMPION NEW YORK METS
Ngamer64
06/03/19 7:44:07 PM
#23:


Yeah, XShake, try out some of those programming games; they're good for getting you in the right headspace, and once it starts to click for you, you'll have the option of branching out into different languages.


---
Advo bested even The Show hosts in 2018, yowza!
board8.wikia.com | thengamer.com/xstats