
Crawling a Site

Master Summoner
Joined
Sep 9, 2008
Messages
582
Reaction score
12
Hello RZ,

I want to get the source code of a site's pages as a logged-in account.
Any examples (PHP)? I think this works by logging in with headers and then fetching the source code.

Next, I will crawl all pages of the site and fetch every file of a specific type.

Please advise. Thank you.

Regards,
M. Siddiqui
 
Master Summoner
Joined
Sep 9, 2008
Messages
582
Reaction score
12
This isn't really a spider or crawler, I just want to run it on one site. I'm just trying to farm specific filetypes from a specific site. :)

And I wish to code it in PHP.

...There is this site that has links, "links" that can only be viewed by members.
So I need the script to be logged in as a user with a predefined username and password, and get all the links.
 
Newbie Spellweaver
Joined
Jun 27, 2012
Messages
66
Reaction score
14

Login and "farm" the links?
 
Master Summoner
Joined
Sep 9, 2008
Messages
582
Reaction score
12
There are too many to do manually, and they are spread across thousands of different pages.

I have found that the login part can be done with cURL.
Now I need to get all the links from a lot of pages; any advice?

I have managed to get a link from one specified page, but I want it done automatically: get all links, from all pages.
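For reference, a minimal sketch of the cURL login step. The login URL, form field names, credentials, and cookie path below are all placeholders; inspect the site's actual login form for the real ones:

```php
<?php
// Build the cURL options for a hypothetical login form (field names assumed).
function buildLoginOptions(string $loginUrl, string $user, string $pass, string $cookieJar): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query(['username' => $user, 'password' => $pass]),
        CURLOPT_COOKIEJAR      => $cookieJar,   // session cookie is written here...
        CURLOPT_COOKIEFILE     => $cookieJar,   // ...and sent back on later requests
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 10,
    ];
}

$ch = curl_init();
curl_setopt_array($ch, buildLoginOptions('http://example.com/login.php', 'myuser', 'mypass', '/tmp/cookies.txt'));
$html = curl_exec($ch);   // HTML of the post-login page, or false on failure
curl_close($ch);
```

Reuse the same cookie jar on every subsequent cURL request so the session survives across page fetches.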
 
Ginger by design.
Loyal Member
Joined
Feb 15, 2007
Messages
2,340
Reaction score
653

That's exactly what spiders do.
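For what it's worth, the spider part can be sketched as a breadth-first loop over a queue of unvisited URLs. The `$fetch` callable here is an assumption: in practice it would be a cURL wrapper that carries the login cookie from the step above and restricts itself to the target host.

```php
<?php
// Minimal breadth-first spider sketch. $fetch is any callable that returns a
// page's HTML as a string (or false on failure), e.g. a logged-in cURL wrapper.
function spider(string $start, callable $fetch, int $maxPages = 50): array
{
    $queue = [$start];
    $seen  = [$start => true];
    $found = [];

    while ($queue && count($seen) <= $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if ($html === false) {
            continue;
        }

        // Collect every href on the page.
        preg_match_all('#href=["\']([^"\']+)["\']#i', $html, $m);
        foreach ($m[1] as $link) {
            $found[$link] = true;
            // Only follow links we have not visited yet.
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[]     = $link;
            }
        }
    }
    return array_keys($found);
}
```

Before queueing, a real spider would also resolve relative links to absolute URLs and skip anything outside the target site.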
 
Master Summoner
Joined
Sep 9, 2008
Messages
582
Reaction score
12

Can you point me in the right direction? I searched Google for a few PHP spiders, but they don't have much documentation.
Sorry, I'm not that good at PHP.

And I'll only need the links in video formats.

Thanks.

PS: I am also willing to pay for someone to create this simple spider/crawler.
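Since only video links are wanted, a harvested list can be narrowed with `preg_grep`. The extension list here is a guess; adjust it to whatever the site actually hosts:

```php
<?php
// Keep only links whose path ends in a known video extension (hypothetical list).
function filterVideoLinks(array $links, array $exts = ['mp4', 'flv', 'avi', 'wmv']): array
{
    // Match ".ext" at the end of the URL, optionally followed by a query string.
    $pattern = '#\.(' . implode('|', $exts) . ')(\?.*)?$#i';
    return array_values(preg_grep($pattern, $links));
}

$links = ['a.mp4', 'b.html', 'c.FLV?id=3', 'd.jpg'];
print_r(filterVideoLinks($links));  // keeps a.mp4 and c.FLV?id=3
```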
 
• ♠️​ ♦️ ♣️ ​♥️ •
Joined
Mar 25, 2012
Messages
909
Reaction score
464
To get the source of a page you can use file_get_contents:
PHP:
$html = file_get_contents('http://forum.ragezone.com/');
echo '<pre>'.$html.'</pre>';
The problem is that you must log in first, from your WAMP/XAMPP server (localhost) or wherever it runs. I've never done it, but once you get the source of the site, you can track URLs via regular expressions and save them anywhere.


PHP:
preg_match_all('#\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))#i', $html, $links, PREG_PATTERN_ORDER);
print_r($links);

Not tested, though.
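An alternative to the regex that tends to cope better with messy real-world HTML is PHP's built-in DOM parser:

```php
<?php
// Extract all href values from an HTML string using DOMDocument.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate malformed real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

$html = '<p><a href="video.mp4">clip</a> <a href="page.html">page</a></p>';
print_r(extractLinks($html));  // prints video.mp4 and page.html
```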
 
Elite Diviner
Joined
May 30, 2011
Messages
443
Reaction score
95
DownloadThemAll may be what you're looking for, if the task isn't too complex.
 
• ♠️​ ♦️ ♣️ ​♥️ •
Joined
Mar 25, 2012
Messages
909
Reaction score
464

He wants to do it in PHP, and I'm not sure you can tell DownloadThemAll to log in automatically for you by pasting login data into the inputs and submitting the form on another specific page before scanning the site.

As far as I know, DownloadThemAll is only useful for downloading content, not for just tracking and storing the links.
 