Duplication of page content question

  • rtm223
  • Mastermind
  • Posts: 1855
  • Loc: Uk

Post 3+ Months Ago

Well, as I understand it, the two URLs
http://sub.domain.tld/path/to/file.ext
http://sub.domain.tld/path/to/file.ext?var=value

are counted as two *different* pages by Google. My site uses GET variables to swap stylesheets and then refreshes back into the same page. The content is *exactly the same*, even though the URL is different.

Every page has links to the stylesheet switcher on it, so in theory Google is going to start seeing multiple duplicates of every single page on my site. So two questions:
1. Will I get penalised?
2. What is the best method of preventing Google from reaching these pages?
I was thinking of including
Code: [ Select ]
<meta name="robots" content="noindex" />

ONLY if there are GET variables detected. Would this be the best way?
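Something like this is what I had in mind. Just a rough PHP sketch (assuming the pages are served through PHP and the style value arrives as a GET variable), not my actual code:
Code: [ Select ]
<?php
// Rough sketch: emit the noindex meta only when the request carries
// any GET variables, so the restyled duplicates stay out of the index
// while the plain URL remains indexable.
if (!empty($_GET)) {
    echo '<meta name="robots" content="noindex" />' . "\n";
}
?>

That would sit in the <head> of every page, so only the ?s= versions ever carry the tag.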
  • rtchar
  • Expert
  • Posts: 606
  • Loc: Canada

Post 3+ Months Ago

From my understanding, Google will only index one of the pages if it sees duplicates.

Since the content is the same ... and only the formatting has changed ... is this a bad thing?
  • nuclei
  • Graduate
  • Posts: 147
  • Loc: On a mountain

Post 3+ Months Ago

Lately Google has been penalising sites for duplicate content. I would suggest using robots.txt to disallow the URLs with the ?stuff on the end. That should alleviate any issues you may have down the road.
  • rtm223
  • Mastermind
  • Posts: 1855
  • Loc: Uk

Post 3+ Months Ago

Thanks for the replies guys. I implemented the robots META tag last night, although I've got a little room to move as Google has only just started indexing me, and I've got no chance of getting into the SERPs yet lol.

rtchar, I did think along those lines, but I figure it is better to be safe than sorry TBH. Plus Google could change its mind about duplicates in the future.

nuclei wrote:
Lately Google has been penalising sites for duplicate content. I would suggest using robots.txt to disallow the URLs with the ?stuff on the end. That should alleviate any issues you may have down the road.

Can you even do that? I just looked through the info on robots.txt and there is no wildcard for the Disallow field. Therefore there is no way to do it except to list *every* file with *every* possible ?s= suffix. This is one hell of a pain in the ass, as I intend to get a lot of files in there and increase the number of different styles.

Would it not be enough to use the meta tag?
  • nuclei
  • Graduate
  • Posts: 147
  • Loc: On a mountain

Post 3+ Months Ago

From: http://www.searchengineworld.com/robots ... torial.htm

Code: [ Select ]
Disallow:
The second part of a record consists of Disallow: directive lines. These lines specify files and/or directories. For example, the following line instructs spiders that it can not download email.htm:

Disallow: email.htm

You may also specify directories:
Disallow: /cgi-bin/

Which would block spiders from your cgi-bin directory.
There is a wildcard nature to the Disallow directive. The standard dictates that /bob would disallow /bob.html and /bob/index.html (both the file bob and files in the bob directory will not be indexed).

So it would seem you are able to disallow ?s= and it will match anything after the s= by default.
  • rtm223
  • Mastermind
  • Posts: 1855
  • Loc: Uk

Post 3+ Months Ago

nuclei wrote:
Code: [ Select ]
There is a wildcard nature to the Disallow directive. The standard dictates that /bob would disallow /bob.html and /bob/index.html (both the file bob and files in the bob directory will not be indexed).
So it would seem you are able to disallow ?s= and it will match anything after the s= by default.

No, because there is no wildcard character, you can't control the wildcard nature. As I understand it, that would disallow:

/?s=thenSomeStuffHere

but NOT:

/file.php?s=thenSomeStuffHere

It would only disallow URLs that *started* with ?s=. Basically I would still have to write a Disallow rule for every single page, although not for every style:

Code: [ Select ]
Disallow: /?s=
Disallow: /onefile.html?s=
Disallow: /anotherfile.html?s=
......

Which is still a nightmare. I still don't get why the meta won't work though; is it just less effective?
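If it really came to that, I suppose the robots.txt could be generated by a script rather than written by hand. A rough PHP sketch (assuming PHP 5 and a flat directory of .html/.php pages, which is not exactly my setup):
Code: [ Select ]
<?php
// Rough sketch: write a robots.txt that blocks the ?s= version of the
// root URL and of every top-level page. Subdirectories would need the
// same treatment.
$rules = array("User-agent: *", "Disallow: /?s=");
foreach (glob("*.{html,php}", GLOB_BRACE) as $file) {
    $rules[] = "Disallow: /" . $file . "?s=";
}
file_put_contents("robots.txt", implode("\n", $rules) . "\n");
?>

Still more hassle than just dropping a meta tag into the template, though.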
  • nuclei
  • Graduate
  • Posts: 147
  • Loc: On a mountain

Post 3+ Months Ago

I don't actually know if it is less effective or not, to be honest. I DO know all the engines follow robots.txt, but I have never had to use the meta equivalent. Some engines may ignore that.
  • rtm223
  • Mastermind
  • Posts: 1855
  • Loc: Uk

Post 3+ Months Ago

Quote:
I don't actually know if it is less effective or not, to be honest. I DO know all the engines follow robots.txt, but I have never had to use the meta equivalent. Some engines may ignore that.


Well, Google claims to respect it; I don't know about the others, I'll have to have a look. I think I will keep the meta tag (it does no harm!) and add a Disallow rule for the first page with GET variables. As the site grows I will probably add *key* pages to the robots.txt rather than everything.

Thank you again for your input, nuclei.
  • rtchar
  • Expert
  • Posts: 606
  • Loc: Canada

Post 3+ Months Ago

Another option might be to bury the offending links in JavaScript, which will not be followed by search engines ...

document.write('<a href="?s=0">');

Might be a little more exact than banning robots from your site entirely.
  • nuclei
  • Graduate
  • Posts: 147
  • Loc: On a mountain

Post 3+ Months Ago

rtchar wrote:
Another option might be to bury the offending links in JavaScript, which will not be followed by search engines ...

document.write('<a href="?s=0">');

Might be a little more exact than banning robots from your site entirely.


This is what I would have done a few months ago. However, in case you have not noticed the Googlebot/Test entries in your logfiles, Google is now playing with a bot that can read JavaScript.
  • rtm223
  • Mastermind
  • Posts: 1855
  • Loc: Uk

Post 3+ Months Ago

rtchar wrote:
Another option might be to bury the offending links in JavaScript, which will not be followed by search engines ...

document.write('<a href="?s=0">');

Might be a little more exact than banning robots from your site entirely.


Well, on top of what nuclei said, the whole site is supposed to be promoting website design best practices, and writing content with JavaScript is not something I want to promote. I'd consider it under other circumstances, but not for this site.

In fact my intention is *not* to ban robots from the site entirely; I will be using meta tags to ban them from the *duplicate* pages ONLY. To see what I mean, have a quick look at the source for:

http://www.caffeinefuelled.net
http://www.caffeinefuelled.net/?s=0

That is nicely handled server-side, and then I have just added:

Code: [ Select ]
Disallow: /?s=

in the robots.txt, for any robots that do not respect the meta tags. This will at least mean my front page will always be considered unique by *every* bot that respects robots.txt.
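(For completeness, that Disallow line sits under a User-agent record in the actual file, so the whole robots.txt is currently just something like:)
Code: [ Select ]
User-agent: *
Disallow: /?s=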
  • rtchar
  • Expert
  • Posts: 606
  • Loc: Canada

Post 3+ Months Ago

Quote:
the whole site is supposed to be promoting website design best practices


I just have to worry about getting the job done. :)

I am not sure what scripts are being followed as yet. I have some JavaScript and some Perl scripts on my site ... with mixed results.

Good luck with your site; it sounds like you have some direction now!
