How do I read document headers with PHP/cURL?

  • RedBMedia
  • Proficient
  • Proficient
  • User avatar
  • Joined: May 01, 2007
  • Posts: 315
  • Status: Offline

Post June 11th, 2009, 8:57 am

I am using cURL to grab data from several different sites. However, the pages that I am crawling are selected dynamically based on user input. Because of this, I need a way to check whether a page exists before extracting data from the markup. Is there a way to get the HTTP status code from the response headers with cURL so I can check for 404 errors? Here's the cURL implementation that I am executing:

Code:
 
  // User-Agent header sent with the request
  $userAgent = "IE 7 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)";
  $target_url = "http://www.somewebsite.com/";
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
  curl_setopt($ch, CURLOPT_URL, $target_url);
  curl_setopt($ch, CURLOPT_FAILONERROR, true);    // fail on HTTP status codes >= 400
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
  curl_setopt($ch, CURLOPT_AUTOREFERER, true);    // set Referer automatically on redirects
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
  curl_setopt($ch, CURLOPT_TIMEOUT, 180);         // give up after 180 seconds
  $html = curl_exec($ch);                         // false on failure
  curl_close($ch);
 
Joe Hall
  • joebert
  • Sledgehammer
  • Genius
  • No Avatar
  • Joined: Feb 10, 2004
  • Posts: 13455
  • Loc: Florida
  • Status: Offline

Post June 12th, 2009, 1:39 am

http://www.php.net/curl_getinfo

CURLINFO_HTTP_CODE
Quote:
Last received HTTP code
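
To make the CURLINFO_HTTP_CODE pointer concrete, here is a minimal sketch of how the status code could be read after the transfer. The helper names fetch_with_status() and page_exists() are made up for illustration, and the sketch assumes that only 2xx responses count as "the page exists". It leaves out CURLOPT_FAILONERROR so the body is still returned for error responses.

```php
<?php
// Sketch only: fetch_with_status() and page_exists() are hypothetical helpers.

/**
 * Fetch a URL and return the body together with the last HTTP status code.
 */
function fetch_with_status(string $url, string $userAgent): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 180);

    $body = curl_exec($ch);                              // false on transport errors
    // Read the status code before closing the handle.
    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no HTTP response arrived
    curl_close($ch);

    return ['code' => $code, 'body' => $body];
}

/**
 * Assumption: treat any 2xx status as "the page exists".
 */
function page_exists(int $httpCode): bool
{
    return $httpCode >= 200 && $httpCode < 300;
}
```

With something like this, the crawl could call fetch_with_status($target_url, $userAgent) and skip the markup parsing whenever page_exists() returns false for the returned 'code'.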


http://www.php.net/curl_setopt

CURLOPT_HEADERFUNCTION
Quote:
The name of a callback function where the callback function takes two parameters. The first is the cURL resource, the second is a string with the header data to be written. The header data must be written when using this callback function. Return the number of bytes written.
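
For the CURLOPT_HEADERFUNCTION route, a rough sketch might look like the following. The names collect_header_line() and fetch_headers() are hypothetical, and keeping only the last value of a repeated header is a simplification chosen for brevity.

```php
<?php
// Sketch only: collect_header_line() and fetch_headers() are hypothetical helpers.

/**
 * Store one raw header line in $headers and return its length in bytes,
 * which is what cURL expects back from a header callback.
 */
function collect_header_line(array &$headers, string $line): int
{
    $parts = explode(':', $line, 2);
    if (count($parts) === 2) {
        // Header names are case-insensitive, so normalise them to lowercase.
        // A repeated header (e.g. Set-Cookie) overwrites the earlier value here.
        $headers[strtolower(trim($parts[0]))] = trim($parts[1]);
    }
    // Lines without a colon (status line, blank separator) are skipped,
    // but their length must still be reported.
    return strlen($line);
}

/**
 * Run a request and return the response headers as name => value pairs.
 */
function fetch_headers(string $url): array
{
    $headers = [];
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION,
        function ($ch, $line) use (&$headers) {
            return collect_header_line($headers, $line);
        });
    curl_exec($ch);
    curl_close($ch);
    return $headers;
}
```

The callback is invoked once per header line, so returning anything other than the byte count of that line would make cURL abort the transfer.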
Strong with this one, the sudo is.
  • RedBMedia

Post June 12th, 2009, 7:19 am

Thanks man! I love you like a fat kid loves cake!
  • joebert

Post June 12th, 2009, 8:03 am

If you're going to spoof the user-agent, you might as well give it a pool of agents to select from randomly.

That is, unless you like contributing to artificially inflating the popularity of one browser to the point that nobody really knows what people are using. :)
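
A pool like that could be sketched along these lines. pick_user_agent() is a hypothetical helper name, and the pool contents are whatever strings you choose to put in it.

```php
<?php
// Sketch only: pick_user_agent() is a hypothetical helper.

/**
 * Return one randomly chosen entry from a pool of User-Agent strings.
 */
function pick_user_agent(array $pool): string
{
    if (count($pool) === 0) {
        throw new InvalidArgumentException('User-agent pool is empty');
    }
    // array_rand() returns a random key of the array.
    return $pool[array_rand($pool)];
}
```

The result would then be passed to curl_setopt($ch, CURLOPT_USERAGENT, ...) in place of the fixed string.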
  • RedBMedia

Post June 19th, 2009, 9:15 pm

Do you have a list of user agents that I can add to a pool?
  • joebert

Post June 20th, 2009, 7:59 am

http://www.useragentstring.com/pages/Browserlist/

You know, I can't help but wonder how much of the audience that appears to be browsing on things like IE6 is actually old tools with spoofed User-Agents that nobody has updated because "it ain't broke". :scratchhead:
  • RedBMedia

Post June 26th, 2009, 7:10 am

Ha, I had never thought of that. It might be funny to only use IE4 on everything you do for that very reason! BTW, thanks for the list!
  • Rabid Dog
  • Web Master
  • Web Master
  • User avatar
  • Joined: May 21, 2004
  • Posts: 3229
  • Loc: South Africa
  • Status: Offline

Post June 30th, 2009, 4:05 pm

Why not create your own agent :) I am gonna make one called "turbo charged monkey"
Watch me grow
  • joebert

Post June 30th, 2009, 4:47 pm

In case anyone's wondering, there's
http://www.ietf.org/rfc/rfc2068.txt

Section 3.8 wrote:
Product tokens are used to allow communicating applications to identify themselves by software name and version. Most fields using product tokens also allow sub-products which form a significant part of the application to be listed, separated by whitespace. By convention, the products are listed in order of their significance for identifying the application.

product = token ["/" product-version]
product-version = token

Examples:

User-Agent: CERN-LineMode/2.15 libwww/2.17b3
Server: Apache/0.8.4

Product tokens should be short and to the point -- use of them for advertising or other non-essential information is explicitly forbidden. Although any token character may appear in a product-version, this token SHOULD only be used for a version identifier (i.e., successive versions of the same product SHOULD only differ in the product-version portion of the product value).
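
For anyone rolling their own agent string, the grammar quoted above can be checked with a few lines of PHP. This is only a sketch: is_token() and build_product() are hypothetical helper names, and the character class is my reading of the RFC's "token" rule (any ASCII character except controls, space, and the separator characters).

```php
<?php
// Sketch only: is_token() and build_product() are hypothetical helpers.

/**
 * True if $s is a non-empty run of token characters, i.e. printable ASCII
 * without spaces or separators such as ( ) < > @ , ; : \ " / [ ] ? = { }.
 */
function is_token(string $s): bool
{
    return preg_match('/\A[!#$%&\'*+\-.^_`|~0-9A-Za-z]+\z/', $s) === 1;
}

/**
 * Build "name" or "name/version" following
 *   product = token ["/" product-version]
 */
function build_product(string $name, ?string $version = null): string
{
    if (!is_token($name) || ($version !== null && !is_token($version))) {
        throw new InvalidArgumentException('Not a valid product token');
    }
    return $version === null ? $name : $name . '/' . $version;
}
```

Joining build_product('CERN-LineMode', '2.15') and build_product('libwww', '2.17b3') with a single space reproduces the User-Agent example from the RFC. A name containing spaces, such as "turbo charged monkey", fails is_token(), so it would need hyphens or similar to form one product token.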
  • Rabid Dog

Post June 30th, 2009, 4:50 pm

OH MY! Someone referred to an RFC for clarification! Joebert, you are close to becoming my favourite person!

Surely this info is available on any HTTP response? Or do you need cURL to read that response? I've never used it.
  • joebert

Post June 30th, 2009, 4:57 pm

I'm confused about what you're asking, RD. :scratchhead:
  • Rabid Dog

Post June 30th, 2009, 5:09 pm

Just asking whether cURL handles the HTTP response, or whether you could just use straight PHP to retrieve the header values.

Oh, and congratulating you on the RFC link. Nice to see :)
  • joebert

Post June 30th, 2009, 5:21 pm

The manual page for fsockopen seems to suggest that cURL isn't required if you want to deal with headers of a response.

And yes, RFC documents are nice. :D
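
On the fsockopen point, a bare-bones sketch of reading the status line without cURL could look like this. parse_status_code() and head_status() are hypothetical names, and the sketch assumes plain HTTP on port 80 with a HEAD request, which not every server answers the same way as a GET.

```php
<?php
// Sketch only: parse_status_code() and head_status() are hypothetical helpers.

/**
 * Extract the numeric code from a status line such as "HTTP/1.1 404 Not Found".
 * Returns null if the line does not look like a status line.
 */
function parse_status_code(string $statusLine): ?int
{
    if (preg_match('#\AHTTP/\d+\.\d+\s+(\d{3})\b#', $statusLine, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

/**
 * Send a HEAD request over a raw socket and return the status code,
 * or null if the connection or the read fails.
 */
function head_status(string $host, string $path = '/'): ?int
{
    $fp = fsockopen($host, 80, $errno, $errstr, 10); // 10 second connect timeout
    if ($fp === false) {
        return null;
    }
    // A HEAD request asks the server for the headers only.
    fwrite($fp, "HEAD $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $statusLine = fgets($fp); // the first line of the response is the status line
    fclose($fp);
    return $statusLine === false ? null : parse_status_code($statusLine);
}
```

PHP also has a built-in get_headers() function that returns the response headers for a URL as an array, which may be simpler than a raw socket when the stream wrappers are enabled.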
  • Rabid Dog

Post June 30th, 2009, 5:31 pm

Yeah, I figured you wouldn't need additional libraries. After all, it is the Hypertext Preprocessor, isn't it? LOL
Watch me grow

© 2011 Unmelted, LLC. Ozzu® is a registered trademark of Unmelted, LLC.