.PDF files and Google

  • phaugh
  • Professor
  • Posts: 796

Post 3+ Months Ago

Is there any SEO benefit to offering copies of your pages in PDF format? I know it's convenient for printing purposes and for sending to people who may not have internet access but want to take the info with them... say, like on a plane.

Will these get picked up as dupes? Does Google crawl the content inside the PDF file... will it follow any links?
  • madmonk
  • Mastermind
  • Posts: 2115
  • Loc: australia

Post 3+ Months Ago

Google crawls the contents of PDFs, and as far as I know the links will be followed.

As for dupes, I was wondering about the same question a few days ago. What you're trying to get is twice the number of pages/content being seen by Google?

Let's hear what the others have to say about this.
  • phaugh
  • Professor
  • Posts: 796

Post 3+ Months Ago

"what you trying to get is: getting twice the number of pages/contents and being seen by google ? " Not my main goal but that would be a nice benefit ;)

I do high-end real estate sites and want to let surfers take the content with them to places that don't have an internet connection... these clients tend to be jet setters and don't sit at their PCs for hours on end. Most of them don't even live in these homes; they buy them as investments, hold on to them for a few years, and sell them to the next investor.

My big fear is that the pages will be caught in the dupe filter and hurt the ranking of my sites.
  • seobook
  • Beginner
  • Posts: 38
  • Loc: we are Penn State!!!

Post 3+ Months Ago

Many people who fear the duplicate problem offer the full PDF version and then a shorter, more keyword-dense version as an abstract on their site.
  • rtchar
  • Expert
  • Posts: 606
  • Loc: Canada

Post 3+ Months Ago

Sound advice, seobook ...

Seems I am always learning something new around here ... keep it up. :)
  • phaugh
  • Professor
  • Posts: 796

Post 3+ Months Ago

"many people who fear the duplicate problem offer the full pdf version and then a shorter more keyword dense version as an abstract on their site"...but I want to just make copies of the page. Is there a way to stop google from indexing the pdf files?
  • rtchar
  • Expert
  • Posts: 606
  • Loc: Canada

Post 3+ Months Ago

Couldn't you just gather them in a directory and use your robots.txt file to disallow indexing?
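
For example, a minimal sketch, assuming the PDFs are gathered in a hypothetical /pdf/ directory:

User-agent: *
Disallow: /pdf/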
  • phaugh
  • Professor
  • Posts: 796

Post 3+ Months Ago

Good idea... but would that still allow me to link to them from the actual pages? Something like "download this page in PDF format"?
  • phaugh
  • Professor
  • Posts: 796

Post 3+ Months Ago

Here's some info on the formats supported by Google: http://www.google.com/help/faq_filetypes.html

It doesn't address the issues with linking and dupe content, though.
  • madmonk
  • Mastermind
  • Posts: 2115
  • Loc: australia

Post 3+ Months Ago

It would be nice to generate more pages this way :-)

But I think offering the PDFs as straight copies of the pages won't work out when it comes to Google, even if it sounds right to offer users a PDF format...

The spiders will still see them as duplicates.

It would be interesting and user-friendly to pull out the key points of your HTML content and make a summary (in PDF format, of course).

Maybe this will work :wink:
  • phaugh
  • Professor
  • Posts: 796

Post 3+ Months Ago

I found this in the Google webmaster guidelines:
"I don't want Google to index non-HTML file types on my site.

To disallow a specific file type, simply modify the Disallow command in your robots.txt file. This works for all of the types of files Googlebot crawls, including HTML, GIFs and .docs. For example, to disallow Microsoft Word files with the ".doc" extension, you would add the following lines to your robots.txt file:"

User-agent: Googlebot
Disallow: /*.doc$

So I guess the same could be used for /*.pdf$.
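Presumably that would look like:

User-agent: Googlebot
Disallow: /*.pdf$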

Then I could do the PDF thing and not worry about dupes... but I won't get any extra pages :(
  • madmonk
  • Mastermind
  • Posts: 2115
  • Loc: australia

Post 3+ Months Ago

Yeah man, you won't get extra pages if you use disallow. lol.

Say I want to offer PDF files on the site, and these PDFs are user manuals... I found the same user manuals offered on another site too.

Will the PDF files be recognised by Google as dupes?

They are user manuals produced by the companies/manufacturers themselves.
  • madmonk
  • Mastermind
  • Posts: 2115
  • Loc: australia

Post 3+ Months Ago

Anybody? :?:
  • madmonk
  • Mastermind
  • Posts: 2115
  • Loc: australia

Post 3+ Months Ago

Quote:
Say I want to offer PDF files on the site, and these PDFs are user manuals... I found the same user manuals offered on another site too. Will the PDF files be recognised by Google as dupes?

These are offered to consumers with the purchase of products.

My first time with PDFs & manuals, so...
*bump* :wink:
  • phaugh
  • Professor
  • Posts: 796

Post 3+ Months Ago

Quote:
I found the same user manuals offered on another site too. Will the PDF files be recognised by Google as dupes?

My guess is yes... but does it matter? I don't think PDF files get PR or backlink credit. Think of it this way: MS releases a patch, and I post that patch on my site for people to download from my server. Does that mean my page will get banned? I don't think so... although SEs can't read inside a patch (.exe), they can read inside a PDF.

I'm dealing with the same issue: if I add PDF files that are exactly the same as my ASP pages' content, they would be identical and may be viewed as dupes... I'm using the noindex code to disallow them.
  • phaugh
  • Professor
  • Posts: 796

Post 3+ Months Ago

Would the manufacturer allow you to link to their manuals on their server? That would not expose you to any penalty, and they would get a boost from your site.

Anyone know if linking to a file vs. a page will leak PR?
  • rtchar
  • Expert
  • Posts: 606
  • Loc: Canada

Post 3+ Months Ago

Generally, links to files are considered rank sinks... IF there are no links in the file.

With a rank sink, the PR passed to the link is lost from the system (rank leak).

Since the link is to an external site, it is a rank leak anyway.
  • madmonk
  • Mastermind
  • Posts: 2115
  • Loc: australia

Post 3+ Months Ago

The PDFs are not there for Google or indexing purposes, but it will be more convenient for users to have them all compiled on one site, ready for download! :-)

1) I am more concerned about them being flagged as dupes by Google.

2) On top of that, they can be "viewed" as "the site's resources" if I offer them without linking to the manufacturers' sites.

I guess I will have to use noindex for those files then...
  • phaugh
  • Professor
  • Posts: 796

Post 3+ Months Ago

"rank sinks " - I've never heard of this...is this a term you made up or is it an official SEO term....is it the same as leaking PR?
  • madmonk
  • Mastermind
  • Posts: 2115
  • Loc: australia

Post 3+ Months Ago

Rank sink is what rtchar calls it - PR lost in pages that are "isolated" and have no links out, right?
I think it is a good term :-)
  • rtchar
  • Expert
  • Posts: 606
  • Loc: Canada

Post 3+ Months Ago

madmonk - you have the concept correct ... sounds like you have a handle on PR calculation. :o

I wish I could take credit for the term 'rank sink', but I lifted it from The PageRank Citation Ranking: Bringing Order to the Web.


Quote:
There is a small problem with this simplified ranking function. Consider two web pages that point to each other but to no other page.

And suppose there is some web page which points to one of them.

Then, during iteration, this loop will accumulate rank but never distribute any rank (since there are no outedges). The loop forms a sort of trap which we call a rank sink.
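
To make that concrete, here is a toy iteration (an illustrative Python sketch with the usual damping factor, not anything Google actually runs) in which pages C and D link only to each other and trap the rank that flows in:

damping = 0.85
# hypothetical four-page web: A -> B, B -> C, C -> D, D -> C
links = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}
pr = {page: 1.0 for page in links}  # start every page at PR 1.0

for _ in range(50):  # iterate until the scores settle
    new_pr = {}
    for page in links:
        # sum the PR contributions from every page linking to this one
        incoming = sum(pr[src] / len(outs)
                       for src, outs in links.items() if page in outs)
        new_pr[page] = (1 - damping) + damping * incoming
    pr = new_pr

print(pr)  # C and D end up holding most of the rank; A and B fall toward the 0.15 floor

Even with damping keeping the numbers finite, the C/D loop soaks up most of the rank; under the paper's simplified function (no damping) it would accumulate rank and never distribute it, which is exactly the trap described above.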
  • madmonk
  • Mastermind
  • Posts: 2115
  • Loc: australia

Post 3+ Months Ago

Quote:
madmonk - you have the concept correct ... sounds like you have a handle on PR calculation.


Not really. lol. I think I am guessing most of the time.

Also, don't you think that Google would have changed its PR algo by now? I have the feeling that it is more complicated than just that...

:roll:
  • rtchar
  • Expert
  • Posts: 606
  • Loc: Canada

Post 3+ Months Ago

I don't think the original formula has changed much ...

Google can change the weight of PageRank when determining search engine result position, but the basic PR formula pretty much has to stay the same. I have read plenty of rumors and myths about linking, and many people confuse PR and SERP.

Filters used by Google have more to do with positioning than PR.

Like the original topic of this thread. :)

Duplicate page filters remove pages before PR is ever calculated. This is done with a fingerprint algorithm.

:idea: Personally, I don't think a PDF file will generate the same fingerprint as an HTML file ... even if they contain duplicate content.
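
For illustration, here is a toy shingle-style fingerprint (my assumption about the general technique; Google's actual filter is not public). Whether an HTML page and a PDF collide would come down to whether the raw file bytes or the extracted text get fingerprinted:

import hashlib

def fingerprint(text, shingle_size=5):
    # hash overlapping runs of words ("shingles") so near-identical
    # documents produce heavily overlapping fingerprint sets
    words = text.lower().split()
    runs = [" ".join(words[i:i + shingle_size])
            for i in range(max(1, len(words) - shingle_size + 1))]
    return {hashlib.md5(run.encode()).hexdigest()[:8] for run in runs}

# hypothetical listing text, as extracted from each format
html_text = "Stunning oceanfront villa with five bedrooms and a private dock"
pdf_text = "Stunning oceanfront villa with five bedrooms and a private dock"

a, b = fingerprint(html_text), fingerprint(pdf_text)
print(len(a & b) / len(a | b))  # Jaccard similarity: 1.0, since the extracted text matches

If the filter works on extracted text like this sketch does, identical wording collides regardless of file format; if it works on raw bytes, the two files would fingerprint differently.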
  • madmonk
  • Mastermind
  • Posts: 2115
  • Loc: australia

Post 3+ Months Ago

Not gonna debate what you have just said. :-)

Except, from my interpretation, the term "algo" means a recursive formula that computes and produces a result; in this case, the result will be the PR calculation.

Hence, to influence the PR calculation, factors have to be fed into the algo. Am I right to say that?

I am not basing this idea, that Google has changed its PR algo, on any myth, coz I haven't come across any. :wink:

Am I making sense here, or is sleep overcoming me? heheh..
  • rtchar
  • Expert
  • Posts: 606
  • Loc: Canada

Post 3+ Months Ago

I think you need some sleep. :lol:

PR is a simple mathematical formula that produces a number representing the number and quality of incoming links.
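
For reference, the formula from the original paper, where d is a damping factor (usually 0.85), T1...Tn are the pages linking to A, and C(T) is the number of links out of T:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))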

Algorithms perform tasks, solve problems, and make decisions.
  • phaugh
  • Professor
  • Posts: 796

Post 3+ Months Ago

So "rank sink" is like a dead end on a site...noway out
  • madmonk
  • Mastermind
  • Posts: 2115
  • Loc: australia

Post 3+ Months Ago

Yeah, I needed some sleep :D
  • rtchar
  • Expert
  • Posts: 606
  • Loc: Canada

Post 3+ Months Ago

phaugh

Rank sinks are pages with no outbound links ... they do not pass PR to other pages.

You did not respond to an earlier statement:

:idea: Personally, I don't think a PDF file will generate the same fingerprint as an HTML file ... even if they contain duplicate content.

Any thoughts?
