I wrote a fun little post on Wednesday that brought in a lot of traffic and seeded a remarkable discussion about the editorial experience in WordPress.
The interesting thing about the traffic was where it came from. The largest chunk, as always, came from ambiguous “encrypted search terms.” The next largest chunk came from seoslid.es, where many of my presentations are syndicated. Then there were referrals from Twitter and Post Status.
What came next surprised me: referrals from a private blog.
I have no idea what’s on the blog, who wrote it, or in what context my site comes up. I just know a private site is sending traffic to me – and, because that site is leaking information, I can see which of its articles the traffic comes from.
Web Requests and Data Leakage
I’ve already written on the dangers of data leakage on mixed-content websites.[ref]Websites that load over HTTPS but load some resources over HTTP.[/ref] The data leakage on this site stems from a misunderstanding of how web requests work.
When you click a link on a page, your browser takes that page’s URL and sends it along with you as the “Referer” header. The site you end up on can then track exactly which pages are sending traffic. If your site is private (e.g. an internal company blog) and you link to an external site, you are advertising to that external resource the URL of the page you linked from.
Since post titles are converted to URL slugs in WordPress, this often means you’re leaking two pieces of data: the URL and the post title.
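To make the leak concrete, here is a minimal sketch of what a receiving site could recover from a single Referer value. It is written in Python purely for illustration; the intranet hostname and post slug are invented, and the slug-to-title guess assumes WordPress-style pretty permalinks.

```python
from urllib.parse import urlsplit


def describe_referer(referer):
    """Return the host, path, and a guessed post title from a Referer value."""
    parts = urlsplit(referer)
    # The last non-empty path segment is usually the post slug.
    segments = [s for s in parts.path.split("/") if s]
    slug = segments[-1] if segments else ""
    # Slugs are lowercased, hyphen-separated titles, so undoing that gives
    # a rough (not exact) reconstruction of the original title.
    guessed_title = slug.replace("-", " ").title()
    return {"host": parts.netloc, "path": parts.path, "guessed_title": guessed_title}


# Hypothetical Referer header sent by a private company blog.
leak = describe_referer(
    "https://intranet.example.com/2014/02/project-falcon-launch-delayed/"
)
print(leak["host"])           # intranet.example.com
print(leak["guessed_title"])  # Project Falcon Launch Delayed
```

Nothing here requires any cooperation from the private site; the receiving server only has to read a header it is sent on every click.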
Masking the Referrer
On most sites, leaking your post URL (and perhaps an article title) isn’t an issue. If the site is private, you might accidentally leak information about embargoed product releases, new company initiatives, or inappropriate internal commentary. Understanding how to plug these leaks is important.
Fortunately, there is an established fix: route outbound links through a redirect on your own domain. Both Twitter and Google are already using this approach.
When you click a link on Twitter, you’re really clicking a link to their internal URL shortener: t.co. As a result, any incoming traffic appears as if referred by t.co rather than specific tweets.
Likewise, when you click a link on a Google results page, you’re not clicking straight through to the result. A search for “Eric Mann” returns this site among the top results. The actual URL used, however, looks less like [cci]http://eamann.com[/cci] and more like [cci]https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=8&cad=rja&ved=0CDkQFjAH&url=http%3A%2F%2Feamann.com%2F&ei=E-78Usu9OMT7oAS3xYGYCA&usg=AFQjCNEBbSsjwgkvcv8FIFsERBFfFILuxw&sig2=oU_wrlU2UG5OgsMNVEVSXg[/cci].[ref]If Google really just passed traffic along to my site, the “encrypted search terms” problem wouldn’t exist, since the search results page includes the original search terms in the query string. Google is masking referrals in this way specifically to upsell content creators who want to peek at these “encrypted” search terms that aren’t really encrypted, just intentionally obfuscated.[/ref]
Rather than linking directly to external resources, both Twitter and Google link to an internal property and pass along some identifier for the target link. The internal property logs the click and redirects the browser to the target link, so the Referer value the target sees no longer matches the original source page.
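The pattern is easy to see by pulling the destination back out of a wrapped link. A minimal Python sketch, using a shortened version of the Google URL quoted above (only the [cci]url[/cci] parameter matters here; the function name is illustrative):

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_redirect(link, param="url"):
    """Return the real destination carried in a redirector link's query string."""
    query = parse_qs(urlsplit(link).query)
    # parse_qs percent-decodes values and returns a list per key.
    values = query.get(param)
    return values[0] if values else None


wrapped = (
    "https://www.google.com/url?sa=t&rct=j&source=web&cd=8"
    "&url=http%3A%2F%2Feamann.com%2F"
)
print(unwrap_redirect(wrapped))  # http://eamann.com/
```

The link the visitor actually clicks belongs to the redirector; the real target travels along as ordinary query-string data.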
How to Mask the Referrer
Masking the referrer on a public site isn’t a good idea. Typically, you want people to know you’ve linked to their content. Masking on a private site, however, is fairly straightforward.
First, parse the rendered page’s links and, if a link’s href doesn’t match the current domain, replace it with something like [cci]http://site.com/link.php?url=http%3A%2F%2Ftarget-site.com[/cci]. You will, of course, want to urlencode the URL parameter and, if you don’t have a link.php set up, point this at the appropriate resource – but you get the general idea.
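The rewriting step might look something like the following. This is a rough Python sketch rather than a drop-in implementation: [cci]SITE_HOST[/cci] and [cci]REDIRECTOR[/cci] are placeholder values taken from the example above, and the regular expression assumes double-quoted href attributes. A real site would more likely run the markup through a proper HTML parser.

```python
import re
from urllib.parse import quote, urlsplit

SITE_HOST = "site.com"                        # assumed: the private site's own host
REDIRECTOR = "http://site.com/link.php?url="  # assumed: the handler endpoint

HREF = re.compile(r'href="([^"]+)"')


def mask_external_links(html):
    """Point every off-site href at the redirect handler instead."""
    def rewrite(match):
        target = match.group(1)
        host = urlsplit(target).netloc
        # Relative links (no host) and same-site links are left untouched.
        if not host or host == SITE_HOST:
            return match.group(0)
        # safe="" also encodes ":" and "/", so the target survives as one parameter.
        return 'href="' + REDIRECTOR + quote(target, safe="") + '"'
    return HREF.sub(rewrite, html)


page = '<a href="/about/">About</a> <a href="http://target-site.com">Elsewhere</a>'
print(mask_external_links(page))
```

The internal link comes out unchanged, while the external one becomes [cci]http://site.com/link.php?url=http%3A%2F%2Ftarget-site.com[/cci].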
Inside the handler function, you can log the outgoing click. Then use PHP to set a Location header for the target site and the visitor will be redirected on their merry way.
It’s quick and transparent. One caveat: browsers generally carry the original Referer across a plain Location redirect, so have the handler send a [cci]Referrer-Policy: no-referrer[/cci] header along with it. That way the redirect won’t expose the URL (or any information, really) of the original page containing the link.
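As a rough illustration of the handler, here is the same idea sketched as a small Python WSGI app rather than a link.php script; the function and logger names are made up for the example. It logs the click, refuses anything that is not an absolute http(s) URL, and sends the Location header. It also sends [cci]Referrer-Policy: no-referrer[/cci] so the browser drops the Referer on the redirected request.

```python
import logging
from urllib.parse import parse_qs, urlsplit

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("outbound")


def link_handler(environ, start_response):
    """WSGI app: log the outbound click, then redirect to the ?url= target."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    target = params.get("url", [""])[0]
    # Accept only absolute http(s) URLs with no line breaks in them.
    valid = urlsplit(target).scheme in ("http", "https")
    if not valid or "\r" in target or "\n" in target:
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"Missing or invalid url parameter"]
    # The internal source page is only recorded locally, never passed on.
    log.info("outbound click: %s (from %s)", target, environ.get("HTTP_REFERER", "-"))
    start_response("302 Found", [
        ("Location", target),
        # Ask the browser to omit the Referer on the redirected request.
        ("Referrer-Policy", "no-referrer"),
    ])
    return [b""]


# Exercise the handler directly with a fake request instead of a live server.
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


link_handler({"QUERY_STRING": "url=http%3A%2F%2Ftarget-site.com"}, fake_start_response)
print(captured["status"], captured["headers"]["Location"])
```

Note that a handler like this will redirect anyone to anywhere, so on a real site you would probably want to restrict it to logged-in users or to links that actually appear in your content.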
Do you run a private site? Is it unintentionally leaking information to the outside world?