Web Crawler

Junior Spellweaver
Joined
Jan 14, 2009
Messages
131
Reaction score
1
Hi guys,

I'm working on a school project where I need to extract data from existing websites and put it onto my own offline website (which is my project), presenting it as a compiled list.

I downloaded a web crawler written by someone else and am having trouble implementing it.

The first problem is the values I need to input; the next is the implementation.

Sorry guys, I'm not very good at this, but I need to get my project done. I can be reached through Skype if you need a copy of the web crawler to understand more of the code. There are quite a few .cs and .csproj files in there, which I don't really understand either...

Thanks in advance people...



Code:
<configuration>
  <appSettings>
    <add key="url" value="www.dbs.com.sg"/> <!-- What site do you want to crawl? -->
    <add key="authority" value="23.74.187.241"/> <!-- The authority is the server dns hostname or ip address. -->
    <add key="logTextFileName" value="D:\asd.htm"/> <!-- Point this to some location that exists on your machine -->
  </appSettings>
</configuration>
 
Joined
Oct 11, 2007
Messages
1,706
Reaction score
517
I'd say you're looking to scrape these websites rather than crawl them.
It's quite easy to do with C# (you could do it yourself in a couple of hours).
 
Put Community First
Loyal Member
Joined
Oct 2, 2014
Messages
1,115
Reaction score
833
So, you basically need to rip sites, or am I misunderstanding? If so, you could download HTTrack, a website copier that makes full websites available for offline viewing, and grab any data from there.
 
Junior Spellweaver
Joined
Jan 14, 2009
Messages
131
Reaction score
1
Thanks for replying! Is it possible to add you guys on Skype or something?

I forgot to mention that the data cannot be static.
Justei, I believe scraping may work!
fallenfate, HTTrack is quite useful, but the data it downloads will be static.

I actually need to extract bank loan information from banks' websites and then compile it into a list for comparison on my own.

E.g.:

[Attached image: ratesNfees]

So as you can see above, the information in the list is all supposed to be extracted from existing bank websites.
 

Joined
Oct 11, 2007
Messages
1,706
Reaction score
517
Right, so you would need to code some kind of scraping tool, parse the markup (most likely divs, or tables if it's an older website) in whatever way suits you, and insert the result into an Excel document (or whatever you want to save it in).

If it can't be static, then my guess is that the tool will be re-run repeatedly to keep your data up to date? Otherwise I'd just do it manually (unless the tables are super massive).

The reason I say manually is that you seem to have trouble understanding the code necessary to do this (not trying to be rude), so I think it will most likely take you more time to make this work than to just do it manually :).

PS: it's especially easy to do all this if you don't even need to log in anywhere. You can just use simple PHP and this library:
 
Junior Spellweaver
Joined
Jan 14, 2009
Messages
131
Reaction score
1
Yup, I'm really not very good at this.

The reason the data cannot be static: I only need to prove to my lecturer that once the data on the site I copied changes, the same data on my website changes automatically.

Anyway, I have not touched PHP before, so please guide me along!

I'll take a look at the link you gave now.
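
The requirement described above (the local page should change when the source page changes) can be sketched as "remember the last value shown, and re-render only when a newly fetched value differs". This is only an illustration: `createUpdater` is a hypothetical helper, the render callback stands in for whatever writes to the page, and the rate strings are made-up sample values.

```javascript
// Hypothetical helper: wraps a render callback so it only runs when the value changes.
function createUpdater(render) {
  var lastValue; // undefined until the first value arrives
  return function update(newValue) {
    if (newValue === lastValue) {
      return false; // nothing changed, leave the page alone
    }
    lastValue = newValue;
    render(newValue); // in a browser this would write to the page
    return true;
  };
}

// Stand-in for the page: collect everything that would have been rendered.
var shown = [];
var update = createUpdater(function (value) {
  shown.push(value);
});

update('8.76%'); // first value: rendered
update('8.76%'); // same value again: skipped
update('8.50%'); // made-up changed value: rendered again
```

In a real page, `update` would be called each time a periodic request (for example one scheduled with `setTimeout`) returns a freshly scraped value.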
 
Joined
Oct 11, 2007
Messages
1,706
Reaction score
517
Well, it's really not that hard to do, but yeah, you will have to learn to script a bit.
I get that you just want this done, but unfortunately I don't have the time to code this or hold your hand through the process of learning and coding it. So you're just going to have to study and learn how to do it.

Since the library does all the heavy lifting, you should be able to figure it out if you put in the effort and time :).
 
Joined
Dec 15, 2009
Messages
1,387
Reaction score
236
This could be useful for certain things. It scrapes the entire website and sorts the data into JSON accordingly. You might find it useful.



PHP does the job as well.
 
Newbie Spellweaver
Joined
Jun 7, 2009
Messages
12
Reaction score
1
If you were on Linux, wget would be the best option... The Windows version is vwget, available for free.

Found a link:

Tutorial:
[Attached image: nO4h1kf]


VirusTotal Scan:
 

Junior Spellweaver
Joined
Jan 14, 2009
Messages
131
Reaction score
1
Well... I believe this is just a content retriever that lets me download the HTML resources, right?

I need to retrieve the data and then display it on my own website, where the data cannot be static.

Hey, thanks for suggesting this, but I've looked through YQL and have no idea how to implement it...

To be honest, I've never done any of this before.

Could you make an example that scrapes data from an existing website and shows how it can be displayed on an offline website?

Sorry for being a burden...
 

Joined
Dec 15, 2009
Messages
1,387
Reaction score
236
YQL Query:
select * from html where url='http://www.timeanddate.com/time/zones/bst' and xpath='//span[@class="ctm-hrmn"] | //span[@class="ctm-sec"] | //p[@class="ctm-date"]'

A very simple example using JavaScript and the jQuery library.
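
As a rough sketch of how a YQL query like the one above turns into a request URL: the query text is URL-encoded and passed as the `q` parameter. The endpoint below is the public YQL address that appears in the `refresh()` code later in this thread; `buildYqlUrl` is a hypothetical helper name, and the query is shortened to the date element only.

```javascript
// Public YQL endpoint as it appears later in this thread.
var YQL_ENDPOINT = 'https://query.yahooapis.com/v1/public/yql';

// Hypothetical helper: encodes a YQL statement into a request URL that asks for JSON.
function buildYqlUrl(query) {
  return YQL_ENDPOINT + '?q=' + encodeURIComponent(query) + '&format=json';
}

// Shortened version of the query above: only the <p class="ctm-date"> element.
var query = "select * from html where " +
  "url='http://www.timeanddate.com/time/zones/bst' " +
  "and xpath='//p[@class=\"ctm-date\"]'";

var url = buildYqlUrl(query);
```

The resulting `url` is what would be handed to something like `$.ajax({ url: url, dataType: 'json' })`.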
 
Junior Spellweaver
Joined
Jan 14, 2009
Messages
131
Reaction score
1
Ohh! I get it!

One question: is there a way to pinpoint the exact location of the data I want to retrieve on the webpage?

In your example, the date is nested in a <p></p> with the class "ctm-date", so you can retrieve the date knowing the class name.


Edit: Okay, I found a way to pinpoint the locations.

Another question: what if the data I want to retrieve is in a <div></div> and isn't just text?

I've seen from your example that you put the date and time text into variables and display them. If the data retrieved is a whole <div></div>, do I also store it in a variable, and if so, how do I display it?
 
Joined
Dec 15, 2009
Messages
1,387
Reaction score
236
Pure javascript:
document.getElementById('target').innerHTML = "<div></div>";
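
To expand on the one-liner above: a retrieved block of markup is just a string, so it can be kept in an ordinary variable and assigned to `innerHTML` later. The sketch below assumes two hypothetical helpers, `showMarkup` and `escapeText`, and uses a plain object in place of a real page element so that it runs outside a browser; the markup string is a made-up sample.

```javascript
// Hypothetical helper: puts an HTML string into an element (or any object with innerHTML).
function showMarkup(element, markup) {
  element.innerHTML = markup; // a browser parses this string into real nodes
  return element;
}

// Hypothetical helper: escapes plain text so it is displayed literally instead of parsed as HTML.
function escapeText(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// In a browser this would be: document.getElementById('target')
var fakeElement = { innerHTML: '' };

// Made-up sample of retrieved markup, stored in a variable like any other string.
var retrieved = '<div class="rates"><b>8.76%</b></div>';
showMarkup(fakeElement, retrieved);
```

Markup that should be rendered goes in as-is; text that should only be displayed is safer passed through `escapeText` first.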
 
Junior Spellweaver
Joined
Jan 14, 2009
Messages
131
Reaction score
1
YQL Query:

select * from html where url='https://www.citibank.com.sg/gcb/landing_page/clickforcash/clickcash/click_for_cash.htm' and xpath='//div[@class="faqShowHideArea"]'

Javascript:

function refresh() {
  $.ajax({
    url: "https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D'https%3A%2F%2Fwww.citibank.com.sg%2Fgcb%2Flanding_page%2Fclickforcash%2Fclickcash%2Fclick_for_cash.htm'%20and%20xpath%3D'%2F%2Fdiv%5B%40class%3D%22faqShowHideArea%22%5D'%20&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys",
    dataType: 'json',
    success: function(data) {
      var query = data.query.results;

      var data = query.div.content;

      document.getElementById('data').innerHTML = "<div></div>";
    }
  }).then(function() {
    setTimeout(refresh, 1000);
  });
}

refresh();

I tried something like this, but nothing is shown. Please tell me what's wrong.
 
Joined
Dec 15, 2009
Messages
1,387
Reaction score
236
What is not showing?

 
Junior Spellweaver
Joined
Jan 14, 2009
Messages
131
Reaction score
1
[Attached image: asdasdasd]

Here!

I tried to retrieve the data in a <div></div> from the Citibank website and then query it.

But I have no idea how to display it.
 

Junior Spellweaver
Joined
Jan 14, 2009
Messages
131
Reaction score
1
Google JSON and arrays.
Examine the example given thoroughly.

Thanks man, I managed to retrieve the data in the table with arrays. But I ran into another problem.

[Attached image: citibankTable]

^ I want to retrieve this table.

I managed to retrieve the 1st row. Below is the code I used.



But when I tried to retrieve the second row, it didn't seem to work.




Here is the HTML code for the table:


Basically the data is within "tr". Nested in the "tr" there is 1 "th" and 2 "td"s. I need to retrieve the data from the 1st "td", but within the array I wasn't able to, as there are 2 "td"s.

Ultimately I need to retrieve all 3 rows and append them into a single table.
"div": {
  "class": "tableWrap tableCitiPre TablePadd tableWrapSmallFnt last",
  "table": {
    "tbody": {
      "tr": [
        {
          "class": "whiteBg",
          "th": [
            {
              "class": "tblSixColumn tenuValuone tenuValu",
              "content": "Tenure (months)"
            },
            {
              "class": "txtCenter tenuValu ",
              "content": "12"
            },
            {
              "class": "txtCenter tenuValu",
              "content": "24"
            },
            {
              "class": "txtCenter tenuValu",
              "content": "36"
            },
            {
              "class": "txtCenter tenuValu",
              "content": "48"
            },
            {
              "class": "txtCenter tenuValu",
              "content": "60"
            }
          ]
        },
        {
          "class": "whiteBg",
          "td": [
            "Nominal Interest Rates",
            {
              "class": "txtCenter",
              "content": "8.76%"
            },
            {
              "class": "txtCenter",
              "content": "7.76%"
            },
            {
              "class": "txtCenter",
              "content": "7.39%"
            },
            {
              "class": "txtCenter",
              "content": "7.79%"
            },
            {
              "class": "txtCenter",
              "content": "7.92%"
            }
          ]
        },
        {
          "class": "whiteBg",
          "td": [
            "Effective Interest Rates",
            {
              "class": "txtCenter",
              "content": "15.80%"
            },
            {
              "class": "txtCenter",
              "content": "14.25%"
            },
            {
              "class": "txtCenter",
              "content": "13.50%"
            },
            {
              "class": "txtCenter",
              "content": "14.00%"
            },
            {
              "class": "txtCenter",
              "content": "14.00%"
            }
          ]
        }
      ]
    }
  }
}
 

Joined
Dec 15, 2009
Messages
1,387
Reaction score
236
for (var j = 0; j < rows.th.length; j++)
for (var j = 0; j < rows.td[0].length; j++)

Read the console and recheck your JSON. I figured out your mistake.

--
[Attached image: zYzb1cQ]


I had to rely on a few if and else statements.
Also, to make the JSON easier to read, use
Or you could use the console.log() command.
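
One way the "few if and else statements" could look, as a sketch only (this is not the code from the attached screenshot): the first row keeps its cells under "th", the other rows under "td", and a cell is either a bare string or an object with a "content" field, so each case is checked before reading the text. The sample data is an abbreviated copy of the JSON posted earlier in the thread (first two value columns only), and all function names are hypothetical.

```javascript
// Abbreviated copy of the "tr" array posted earlier (only the first two value columns).
var sampleRows = [
  { "class": "whiteBg", "th": [
    { "class": "tblSixColumn tenuValuone tenuValu", "content": "Tenure (months)" },
    { "class": "txtCenter tenuValu ", "content": "12" },
    { "class": "txtCenter tenuValu", "content": "24" }
  ] },
  { "class": "whiteBg", "td": [
    "Nominal Interest Rates",
    { "class": "txtCenter", "content": "8.76%" },
    { "class": "txtCenter", "content": "7.76%" }
  ] },
  { "class": "whiteBg", "td": [
    "Effective Interest Rates",
    { "class": "txtCenter", "content": "15.80%" },
    { "class": "txtCenter", "content": "14.25%" }
  ] }
];

// A cell is either a plain string or an object whose text sits in "content".
function cellText(cell) {
  if (typeof cell === 'string') { return cell; }
  if (cell && typeof cell.content === 'string') { return cell.content; }
  return '';
}

// Collects a row's cells whether they are stored under "th", "td", or both.
function rowCells(row) {
  return [].concat(row.th || [], row.td || []);
}

// Turns the "tr" value (one row or an array of rows) into an array of string arrays.
function extractRows(trList) {
  return [].concat(trList || []).map(function (row) {
    return rowCells(row).map(cellText);
  });
}

// Escapes text before it is placed inside markup.
function escapeHtml(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Builds a single HTML table string from the extracted rows.
function rowsToTableHtml(rows) {
  var body = rows.map(function (cells) {
    return '<tr>' + cells.map(function (text) {
      return '<td>' + escapeHtml(text) + '</td>';
    }).join('') + '</tr>';
  }).join('');
  return '<table>' + body + '</table>';
}

var rows = extractRows(sampleRows);
var tableHtml = rowsToTableHtml(rows);
console.log(rows); // inspect the extracted rows in the console
```

In the page itself, `tableHtml` could then be assigned to an element's `innerHTML`, as in the earlier reply.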
 

Junior Spellweaver
Joined
Jan 14, 2009
Messages
131
Reaction score
1
Sorry, I don't get it.

I used the 2 "for" lines you gave, and they didn't seem to work either.

How did you manage to retrieve all 3 rows? Could you show me the code?


Sorry, I'm not very good with JSON... or programming...
 
