Topic List |
Page List:
1 |
---|---|
AwesomeTurtwig 11/25/18 2:58:46 AM #1: |
For the love of god, how the hell do I scrape this website?
https://www.streetinsider.com/dividend_history.php?q=DNKN I've tried many different methods, but every time the site knows I'm a damn bot. I tried changing the user agent. Nah, didn't work. Tried using a headless browser. NOPE. Can someone crack the code and figure out how to snag the content off this site? I might have a reward for you if you can figure it out. Robots.txt for reference: https://www.streetinsider.com/robots.txt --- http://potders.com/sigs/1.gif My flash games: http://bit.ly/kO1BDd Rule free PotD: http://www.potders.com/forum/
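Since the robots.txt is linked for reference, here is a minimal sketch of checking a URL against robots.txt rules with Python's standard library. The rules in `SAMPLE_ROBOTS_TXT` and the user agent name `my-homework-bot` are made up so the example runs offline; the site's real file may say something different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
# The real rules at https://www.streetinsider.com/robots.txt may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.streetinsider.com/dividend_history.php?q=DNKN"
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-homework-bot", url))
```

To check against the live file instead, `RobotFileParser` also offers `set_url()` and `read()`, which download the rules before `can_fetch()` is called.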
|
Yellow 11/25/18 3:38:24 AM #2: |
Are you doing a neural network sort of deal?
Do they allow proxies? Why not write a program that actually hijacks your own "trusted" browser, then open multiple instances of that hijacked browser, each connected directly to its own proxy? If you find a good proxy you should be able to scrape at a good rate. If you still fail at that point, you should be able to spoof certain inputs to make your robot seem more "human" anyway. That's my "I don't use Python so I don't know what you're doing wrong" approach. The only scraper I wrote was in VB, and I didn't fight sites that didn't want me there. :P
|
AwesomeTurtwig 11/25/18 3:55:09 AM #3: |
It looks like it sets a key in the user's cookies. This is not trivial to parse. Finding another database might be easier.
|
Yellow 11/25/18 4:06:27 AM #4: |
AwesomeTurtwig posted...
It looks like it sets a key in the user's cookies. This is not trivial to parse. Finding another database might be easier. South Africa https://www.kaggle.com/geoffnoble/sadividends 1.4 MB https://www.kaggle.com/jonnylangefeld/explore-dividends Can I ask why you need these dividends? I'm assuming you're trying to use the data to help predict stock price changes?
|
AwesomeTurtwig 11/25/18 4:07:02 AM #5: |
It's easier to scrape this.
https://www.nasdaq.com/symbol/dnkn/dividend-history No div yield, or dollar amount. But you can calculate it with the data in that table at the very least.
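A rough sketch of that calculation, assuming the table gives an ex-dividend date and a per-share cash amount for each payment, and that a share price is available from somewhere else. The rows and the price below are made-up numbers, not real DNKN data, and trailing-twelve-month yield is only one of several common definitions.

```python
from datetime import date, timedelta

# Hypothetical rows: (ex-dividend date, cash amount per share in USD).
# These are made-up numbers, not real DNKN data.
SAMPLE_DIVIDENDS = [
    (date(2017, 11, 1), 0.20),
    (date(2018, 2, 1), 0.25),
    (date(2018, 5, 1), 0.25),
    (date(2018, 8, 1), 0.25),
    (date(2018, 11, 1), 0.25),
]


def trailing_annual_dividend(rows, as_of):
    """Sum the per-share payments whose ex-date falls in the 365 days up to as_of."""
    cutoff = as_of - timedelta(days=365)
    return sum(amount for ex_date, amount in rows if cutoff < ex_date <= as_of)


def dividend_yield(rows, as_of, share_price):
    """Trailing dividend yield as a fraction (0.02 means 2%)."""
    if share_price <= 0:
        raise ValueError("share_price must be positive")
    return trailing_annual_dividend(rows, as_of) / share_price


if __name__ == "__main__":
    # 50.0 is a hypothetical share price.
    print(dividend_yield(SAMPLE_DIVIDENDS, date(2018, 11, 25), 50.0))
```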
|
Yellow 11/25/18 4:08:27 AM #6: |
|
AwesomeTurtwig 11/25/18 4:10:09 AM #7: |
Yellow posted...
Can I ask why you need these dividends? I'm assuming you're trying to use the data to help predict stock price changes? Homework for my data mining course.
|
Yellow 11/25/18 4:21:54 AM #8: |
My inner evil mastermind cringes at the idea of giving your mined financial data set to your professor for free.
|
Sahuagin 11/25/18 4:47:49 AM #9: |
hmmm, yes, if I do an HTTP GET request from code, I get back a "we think you're a bot" page.
a suggestion here is to use "selenium web driver" (hmmm, though that's probably what you mean by "headless browser") https://security.stackexchange.com/questions/71869/bot-detection-via-browser-fingerprinting maybe you could do it with a userscript? for example, the div that holds the main table has id "content", so you can pull out its html with $("#content").html(); you could then use ajax to write the html to a server, and then process it however you like (or you could do your processing in javascript, pull data out of the html and send that back to the server)
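For the "process it however you like" step, here is a minimal sketch in Python that pulls table rows out of an HTML fragment using only the standard library. `TableRowParser`, `extract_rows`, and the sample markup are all hypothetical; the real contents of the "content" div may be structured differently.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags, for example) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical fragment standing in for the html of the "content" div.
SAMPLE_HTML = """
<table>
  <tr><th>Ex-Date</th><th>Amount</th></tr>
  <tr><td>11/01/2018</td><td>0.25</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```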
|
Yellow 11/25/18 4:56:44 AM #10: |
Fight bots with bots. Make a neural network to generate human input. Simple enough...?
Sahuagin posted... "selenium web driver" I was wondering if something like that existed. This is why you're my favorite poster.
|
Judgmenl 11/25/18 7:17:31 AM #11: |
Use a headless browser:
https://duo.com/decipher/driving-headless-chrome-with-python --- Judge, Nostalgia is a hell of a drug. You're a regular Jack Kerouac
|
AwesomeTurtwig 11/25/18 9:29:12 AM #12: |
Sahuagin posted...
a suggestion here is to use "selenium web driver" (hmmm, though that's probably what you mean by "headless browser") Yeah, I've tried Selenium. But it doesn't work.
|
Sahuagin 11/25/18 12:14:12 PM #13: |
some info here about why selenium can still be detected:
https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver for the userscript solution, you should be able to run a local webserver and use HTTP POST to send any data you've extracted from the page. not something I've done before, so I don't know for sure it'd work.
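The receiving end of that idea could look roughly like the sketch below: a tiny local server that stores whatever gets POSTed to it. `CollectHandler`, `start_server`, and the `received` list are hypothetical names, and the CORS header is an assumption about what a browser would need for a cross-origin request. Whether the site's page will actually let a userscript reach a local server is untested here.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # bodies posted by the userscript, in arrival order


class CollectHandler(BaseHTTPRequestHandler):
    """Store the body of every POST request and reply with 204 No Content."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8", errors="replace")
        received.append(body)
        self.send_response(204)
        # Assumption: the page is on another origin, so allow cross-origin posts.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_server(port=0):
    """Start the collector on localhost in a background thread (port 0 = any free port)."""
    server = HTTPServer(("127.0.0.1", port), CollectHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_server()
    # Stand-in for the userscript: post some extracted text to the collector.
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    conn.request("POST", "/", body=b"11/01/2018,0.25")
    conn.getresponse().read()
    conn.close()
    print(received)
    server.shutdown()
```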
|