Need assistance with simple script (hopefully)

  • CPT
  • Born
  • Posts: 4

Post 3+ Months Ago

Hello all,

I run a website with a few hundred directory-type listings, all with their own sub-webpage. The sub-pages are individual html pages.

And now my central database (MS Access) with all the members' details has been corrupted. So now I need to collect the contact details from the static HTML pages.

But with hundreds of members, I can't do it by hand. So I am looking for a web crawler that can go through all my HTML pages, retrieve certain data, and post it to a MySQL DB, or perhaps just print it into tables.

I do not want to specify my website, as the emails are not protected against spambots. But to give you an idea of what I am looking for, I need something to visit each page and collect the data between specified tags, e.g.

Collect "business title" and "email" from the html source: <div class="title">BUSINESS TITLE</div><br/><div class="email">EMAIL ADDRESS</div>

So in this instance, the crawler would print:

BUSINESS TITLE | EMAIL ADDRESS

And it would do this automatically for all the pages (I can specify the pages, as they follow a logical order, e.g. /001/index.html /002/index.html)

Please could someone help me with a simple script, as the only things I can find on Google are dirty Microsoft Windows programs that are built for spammers and cannot collect anything other than emails.

Kind regards,
Stephen P

  • Bigwebmaster
  • Site Admin
  • Posts: 9090
  • Loc: Seattle, WA & Phoenix, AZ

Post 3+ Months Ago

To me it looks like you need to write a script to automatically go through each page and extract the info.
  • CPT
  • Born
  • Posts: 4

Post 3+ Months Ago

Indeed. Could someone point me in the direction of a template of some sort... Then I could work off that...
  • CPT
  • Born
  • Posts: 4

Post 3+ Months Ago

bump, please

Could someone please let me know what I should be googling, as I have not found anything close to what I am looking for.
  • SpooF
  • ٩๏̯͡๏۶
  • Bronze Member
  • User avatar
  • Posts: 3422
  • Loc: Richland, WA

Post 3+ Months Ago

Well, here's a bit of code to play with:

PHP Code:
// $contents is the HTML source of the page to search.
preg_match_all('/<div class="title">\s?(.*?)\s?<\/div>\s?.*?\s?<div class="email">\s?(.*?)\s?<\/div>/si',$contents,$tmp);
$business_name = $tmp[1]; // everything captured inside the title divs
$email_address = $tmp[2]; // everything captured inside the email divs
foreach($business_name as $name)
{
   echo trim($name)."\n";
}
foreach($email_address as $email)
{
   echo trim($email)."\n";
}


This will match and extract anything that fits the markup you supplied above. Basically, it returns whatever sits between <div class="title">[Get this stuff]</div> and <div class="email">[Get this stuff]</div>; it doesn't care what's on either side.

You can replace $contents with something like file_get_contents('http://somewebaddress.com/page.html') and it will read the page and find all matches. I'll post an example in a bit. You could also split that expression into two, because right now it only matches a title div that is followed by an email div.
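
A rough sketch of what splitting the expression could look like, in case it helps. The helper name first_div_text and the sample markup are made up for illustration, and it assumes each div holds plain text with no nested divs:

```php
<?php
// Hypothetical sample page; the real pages may be laid out differently.
$contents = '<div class="title"> Acme Ltd </div><br/><div class="email">info@example.com</div>';

// Returns the trimmed text of the first <div> with the given class,
// or null when the page has no such div.
// Assumption: the div contains plain text and no nested <div> elements.
function first_div_text(string $html, string $class): ?string
{
    $pattern = '/<div class="' . preg_quote($class, '/') . '">(.*?)<\/div>/si';
    if (preg_match($pattern, $html, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

// Each field is looked up on its own, so a page with a title but no
// email (or the other way round) still yields the part that exists.
$title = first_div_text($contents, 'title');
$email = first_div_text($contents, 'email');
echo ($title ?? '(no title)') . ' | ' . ($email ?? '(no email)') . "\n";
```

Looking the two fields up separately means a page that is missing one of them is not skipped entirely, which the combined pattern would do.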
  • SpooF
  • ٩๏̯͡๏۶
  • Bronze Member
  • Posts: 3422
  • Loc: Richland, WA

Post 3+ Months Ago

PHP Code:
$pattern = '/<div class="headlinelink">.*?<a href="(.*?)">(.*?)<\/a>.*?<em>"(.*?)"<\/em>.*?Started by(.*?)in/si';
 
preg_match_all($pattern,file_get_contents('http://ozzu.com'),$tmp);
 
$url = $tmp[1];
$title = $tmp[2];
$desc = $tmp[3];
$starter = $tmp[4];
 
for($i=0;$i<count($tmp[1]);$i++)
{
   echo "URL: ".$url[$i]."<br>TITLE: ".$title[$i]."<br>DESC: ".$desc[$i]."<br>STARTER: ".$starter[$i]."<br><br>";
}


If you run that code, it will extract the headline information from the front page of Ozzu. It's basically the same thing the code I posted above does for the snippet you provided; this one is just a working example on a live webpage.
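
To get the "BUSINESS TITLE | EMAIL ADDRESS" output asked for in the first post, the two captures need to be printed side by side. One possible way, sketched here with a made-up function name (listing_lines) and the sample markup from the question, is to pass PREG_SET_ORDER so each match arrives as its own title/email pair:

```php
<?php
// Returns one "TITLE | EMAIL" line per title/email pair found in $html.
// Assumption: every title div is followed by its email div, as in the
// snippet from the first post.
function listing_lines(string $html): array
{
    $pattern = '/<div class="title">(.*?)<\/div>.*?<div class="email">(.*?)<\/div>/si';
    // PREG_SET_ORDER groups the captures per match instead of per group,
    // so $m[1] and $m[2] always belong to the same listing.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    $lines = [];
    foreach ($matches as $m) {
        $lines[] = trim($m[1]) . ' | ' . trim($m[2]);
    }
    return $lines;
}

// Sample markup in the shape given in the first post.
$sample = '<div class="title">BUSINESS TITLE</div><br/><div class="email">EMAIL ADDRESS</div>';
echo implode("\n", listing_lines($sample)) . "\n";
```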
  • CPT
  • Born
  • Posts: 4

Post 3+ Months Ago

@SpooF

Thank you so much. That is exactly what I was looking for.

I've now integrated a loop, and I am now getting it to work on my site.

Thanks again,
CPT
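
For anyone landing here later: the loop itself was not posted, so the following is only a guess at what it might look like. The function names, the three-digit zero padding (taken from the /001/index.html example in the first post) and the example.com address are all assumptions:

```php
<?php
// Builds the path for page number $n, e.g. 7 -> "/007/index.html".
// The three-digit padding is assumed from the examples in the first post.
function page_path(int $n): string
{
    return sprintf('/%03d/index.html', $n);
}

// Reads pages 1..$last under $base and returns "TITLE | EMAIL" lines.
// Pages that cannot be read, or that do not match, are skipped.
function collect_listings(string $base, int $last): array
{
    $pattern = '/<div class="title">(.*?)<\/div>.*?<div class="email">(.*?)<\/div>/si';
    $lines = [];
    for ($n = 1; $n <= $last; $n++) {
        // @ suppresses the warning for a missing page; the false
        // return value is checked explicitly instead.
        $html = @file_get_contents($base . page_path($n));
        if ($html === false) {
            continue;
        }
        if (preg_match($pattern, $html, $m) === 1) {
            $lines[] = trim($m[1]) . ' | ' . trim($m[2]);
        }
    }
    return $lines;
}

// Placeholder base address and page count; point $base at the real site
// root or at a local copy of the pages before uncommenting.
// echo implode("\n", collect_listings('http://www.example.com', 300)), "\n";
```

$base can also be a local directory, which avoids fetching a few hundred pages over HTTP if a copy of the site is at hand.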

© 1998-2014. Ozzu® is a registered trademark of Unmelted, LLC.