How do I read document headers with PHP/cURL?

  • RedBMedia
  • Proficient
  • Proficient
  • Posts: 315

Post 3+ Months Ago

I am using cURL to grab data from several different sites. However, the pages that I am crawling are selected dynamically based on user input. Because of this, I need a way to check whether a page exists before extracting data from its markup. Is there a way to get the HTTP status code from the response headers with cURL so I can check for 404 errors? Here's the cURL implementation that I am executing:

Code:
 
  $userAgent = "IE 7 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)";
  $target_url = "http://www.somewebsite.com/";
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
  curl_setopt($ch, CURLOPT_URL, $target_url);
  curl_setopt($ch, CURLOPT_FAILONERROR, true);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt($ch, CURLOPT_AUTOREFERER, true);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_TIMEOUT, 180);
  $html = curl_exec($ch);
  curl_close($ch);
 
  • joebert
  • Fart Bubbles
  • Genius
  • Posts: 13503
  • Loc: Florida

Post 3+ Months Ago

http://www.php.net/curl_getinfo

CURLINFO_HTTP_CODE
Quote:
Last received HTTP code


http://www.php.net/curl_setopt

CURLOPT_HEADERFUNCTION
Quote:
The name of a callback function where the callback function takes two parameters. The first is the cURL resource, the second is a string with the header data to be written. The header data must be written when using this callback function. Return the number of bytes written.
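
To make the two pointers above concrete, here is a minimal sketch, assuming the cURL extension is loaded. The function names make_header_collector and fetch_page are made up for illustration; only curl_getinfo, CURLINFO_HTTP_CODE and CURLOPT_HEADERFUNCTION come from the manual pages linked above.

```php
<?php
// Builds a callback for CURLOPT_HEADERFUNCTION that stores each
// "Name: value" header line in $headers (keys lowercased).
// Hypothetical helper, not part of cURL itself.
function make_header_collector(array &$headers): callable
{
    return function ($ch, string $line) use (&$headers): int {
        $parts = explode(':', $line, 2);
        // The status line and the blank separator line have no colon and are skipped.
        if (count($parts) === 2) {
            $headers[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
        // cURL expects the callback to return the number of bytes it handled.
        return strlen($line);
    };
}

// Fetches $url and returns the status code, collected headers and body.
// Hypothetical helper; when redirects are followed, headers from later
// responses overwrite earlier ones with the same name.
function fetch_page(string $url): array
{
    $headers = [];
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, make_header_collector($headers));
    $body = curl_exec($ch); // string on success, false on failure
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ['status' => $status, 'headers' => $headers, 'body' => $body];
}

// Possible usage (performs a real request, so it is left commented out):
// $page = fetch_page($target_url);
// if ($page['status'] === 404) { /* skip this page */ }
```

The status check happens after curl_exec, so the page is still downloaded; this only tells you whether the markup is worth parsing.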
  • RedBMedia
  • Proficient
  • Proficient
  • Posts: 315

Post 3+ Months Ago

Thanks, man! I love you like a fat kid loves cake!
  • joebert
  • Fart Bubbles
  • Genius
  • Posts: 13503
  • Loc: Florida

Post 3+ Months Ago

If you're going to spoof the user-agent, you might as well give it a pool of agents to select from randomly.

That is, unless you like contributing to artificially inflating the popularity of one browser to the point that nobody really knows what people are using. :)
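
A rough sketch of the pool idea, in case it helps. The function name pick_user_agent and the placeholder strings in the pool are assumptions for illustration; you would fill the pool with real strings of your choosing.

```php
<?php
// Returns one entry chosen at random from a pool of User-Agent strings.
// Hypothetical helper name.
function pick_user_agent(array $pool): string
{
    if ($pool === []) {
        throw new InvalidArgumentException('User-agent pool is empty.');
    }
    // array_rand() returns a random key of the array.
    return $pool[array_rand($pool)];
}

// Example pool. The first entry is a shortened form of the string from the
// first post; the other two are placeholders, not real browser strings.
$agents = [
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'ExampleBrowser/1.0 (placeholder)',
    'AnotherExample/2.0 (placeholder)',
];

// Possible usage with an existing cURL handle $ch:
// curl_setopt($ch, CURLOPT_USERAGENT, pick_user_agent($agents));
```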
  • RedBMedia
  • Proficient
  • Proficient
  • Posts: 315

Post 3+ Months Ago

Do you have a list of user agents that I can add to a pool?
  • joebert
  • Fart Bubbles
  • Genius
  • Posts: 13503
  • Loc: Florida

Post 3+ Months Ago

http://www.useragentstring.com/pages/Browserlist/

You know, I can't help but wonder how much of the audience that appears to be browsing on things like IE6 is actually old tools with spoofed User-Agents that were never updated because "it ain't broke". :scratchhead:
  • RedBMedia
  • Proficient
  • Proficient
  • Posts: 315

Post 3+ Months Ago

Ha, I had never thought of that. It might be funny to use only IE4 for everything you do, for that very reason! By the way, thanks for the list!
  • Rabid Dog
  • Web Master
  • Web Master
  • Posts: 3245
  • Loc: South Africa

Post 3+ Months Ago

Why not create your own agent? :) I am gonna make one called "turbo charged monkey"
  • joebert
  • Fart Bubbles
  • Genius
  • Posts: 13503
  • Loc: Florida

Post 3+ Months Ago

In case anyone's wondering, there's
http://www.ietf.org/rfc/rfc2068.txt

Section 3.8 wrote:
Product tokens are used to allow communicating applications to
identify themselves by software name and version. Most fields using
product tokens also allow sub-products which form a significant part
of the application to be listed, separated by whitespace. By
convention, the products are listed in order of their significance
for identifying the application.

product = token ["/" product-version]
product-version = token

Examples:

User-Agent: CERN-LineMode/2.15 libwww/2.17b3
Server: Apache/0.8.4

Product tokens should be short and to the point -- use of them for
advertising or other non-essential information is explicitly
forbidden. Although any token character may appear in a product-
version, this token SHOULD only be used for a version identifier
(i.e., successive versions of the same product SHOULD only differ in
the product-version portion of the product value).
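
For anyone rolling their own agent, here is a small sketch of the grammar quoted above. The function names is_token, format_product and build_user_agent are made up for illustration, and the character class is my reading of the RFC 2068 "token" rule (US-ASCII minus control characters, whitespace and the tspecials). It does not handle the parenthesized comments that real browser strings contain.

```php
<?php
// True if $s is one or more characters allowed in an RFC 2068 token,
// i.e. no control characters, no whitespace, and none of ()<>@,;:\"/[]?={}
function is_token(string $s): bool
{
    return preg_match('/^[!#$%&\'*+\-.^_`|~0-9A-Za-z]+\z/', $s) === 1;
}

// Formats one product as "name" or "name/version".
function format_product(string $name, ?string $version = null): string
{
    if (!is_token($name) || ($version !== null && !is_token($version))) {
        throw new InvalidArgumentException('Name and version must be RFC 2068 tokens.');
    }
    return $version === null ? $name : $name . '/' . $version;
}

// Joins products with single spaces, most significant product first.
// Each product is an array: [name] or [name, version].
function build_user_agent(array $products): string
{
    $parts = [];
    foreach ($products as $product) {
        $parts[] = format_product($product[0], $product[1] ?? null);
    }
    return implode(' ', $parts);
}

// Possible usage:
// build_user_agent([['CERN-LineMode', '2.15'], ['libwww', '2.17b3']]);
```

Note that under this reading a name containing spaces is not a single token, so it would have to be written without them (for example with hyphens).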
  • Rabid Dog
  • Web Master
  • Web Master
  • Posts: 3245
  • Loc: South Africa

Post 3+ Months Ago

OH MY! Someone referred to an RFC for clarification! Joebert, you are close to becoming my favourite person!

Surely this info is available on any HTTP response? Or do you need cURL to read that response? I've never used it.
  • joebert
  • Fart Bubbles
  • Genius
  • Posts: 13503
  • Loc: Florida

Post 3+ Months Ago

I'm confused about what you're asking, RD. :scratchhead:
  • Rabid Dog
  • Web Master
  • Web Master
  • Posts: 3245
  • Loc: South Africa

Post 3+ Months Ago

Just asking whether cURL handles the HTTP response, or whether you could use plain PHP to retrieve the header values?

Oh, and congratulating you on the RFC link. Nice to see :)
  • joebert
  • Fart Bubbles
  • Genius
  • Posts: 13503
  • Loc: Florida

Post 3+ Months Ago

The manual page for fsockopen seems to suggest that cURL isn't required if you want to deal with headers of a response.

And yes, RFC documents are nice. :D
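
Here is roughly what the fsockopen route might look like, as an untested sketch for plain HTTP on port 80. The function names parse_status_code and head_status are made up for illustration, and some servers answer HEAD requests differently from GET, so treat the result as a hint.

```php
<?php
// Extracts the three-digit status code from a status line such as
// "HTTP/1.1 404 Not Found". Returns null if the line does not match.
function parse_status_code(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $statusLine, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

// Sends a HEAD request over a raw socket and returns the status code,
// or null if the connection or the read fails. No HTTPS, no redirects.
function head_status(string $host, string $path = '/', int $port = 80, float $timeout = 10.0): ?int
{
    $fp = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($fp === false) {
        return null;
    }
    fwrite($fp, "HEAD $path HTTP/1.1\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $statusLine = fgets($fp); // the first line of the response is the status line
    fclose($fp);
    return $statusLine === false ? null : parse_status_code($statusLine);
}

// Possible usage (performs a real request, so it is left commented out):
// if (head_status('www.somewebsite.com', '/') === 404) { /* page is missing */ }
```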
  • Rabid Dog
  • Web Master
  • Web Master
  • Posts: 3245
  • Loc: South Africa

Post 3+ Months Ago

Yeah, I figured you wouldn't need additional libraries. After all, it is the Hypertext Preprocessor, isn't it? LOL


© 1998-2014. Ozzu® is a registered trademark of Unmelted, LLC.