Content theft is significantly less damaging than a data breach, but it is far more common and still costly. Duplicate content can hurt a website’s search ranking, reduce traffic and, for businesses that rely on their website for ecommerce or lead generation, hurt revenue. Plagiarism on the Web can be so damaging that website administrators also watch for their own inadvertent copying, which could likewise result in duplicate content.
But policing duplicate content, both incoming and outgoing, has been a time-consuming, laborious process. Now some website administrators are finding a shortcut in the cloud called Copyscape. Copyscape gives cloud app developers a way to check that content uploaded to the cloud does not already exist elsewhere on the Web.
Copyscape is an API that lets site administrators integrate a verification system into their CMS that validates the content uploaded to a site as original and not already indexed by Google and Yahoo. Copyscape queries both search engines for duplicate content, so an IT manager can verify content submissions and nab any scrapers caught stealing.
Copyscape API Basics
Copyscape performs two types of searches: a search on text entered into a web form or taken from an uploaded document, and a more thorough review that compares a search query to the content at a provided URL. Because Copyscape uses Google and Yahoo, it can find scraped or copied content on any website those search engines index (which covers most of the Web).
Copyscape requires users to deposit a payment in their account before the API is enabled. Each query for duplicate content costs $0.05. No sandbox for the Copyscape API exists, so each query (even during testing) costs $0.05.
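Because even test queries are billed, it can help to estimate the spend before wiring the API into an upload workflow. The following is a minimal sketch that uses the per-query price quoted above; the constant and function names are hypothetical, and the price should be checked against the current Copyscape rates.

```php
<?php
// Price per query in cents, taken from the figure quoted in this article.
const COST_PER_QUERY_CENTS = 5;

// Hypothetical helper: returns the total cost of $queries queries as a
// dollar string. Working in whole cents avoids floating-point rounding.
function copyscape_cost_dollars(int $queries): string
{
    $cents = $queries * COST_PER_QUERY_CENTS;
    return number_format($cents / 100, 2, '.', '');
}

// Estimated cost of checking 250 uploads.
echo copyscape_cost_dollars(250), "\n";
```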
After sign-up for the premium service, Copyscape provides the cloud app with a generated API key. Use this key with the login created during signup to query the API.
Build a URL for the Search
All searches use the same base URL, but the system must add specific parameters based on the type of search needed. The following is the Copyscape base URL:
http://www.copyscape.com/api/
After the base URL, Copyscape requires a “search type” parameter and an “operation” parameter. The search type tells the API to search either from a URL (“q” parameter) or from text (“t” parameter). The operation parameter (“o”) tells the API where to search: “csearch” queries the Google and Yahoo indexes, “psearch” queries the private index, and “cpsearch” queries both the private and search engine indexes. The account username (“u”) and API key (“k”) are passed the same way. For instance, the following URL, shown before encoding, queries the search engines for content matching the page at “myurl.com”:
http://www.copyscape.com/api/?u=<username>&k=<apiKey>&o=csearch&q=http://myurl.com
Any URL passed as a parameter must be URL-encoded, which means characters such as “:” and “/” are translated into their percent-encoded form. In PHP, the system can use the “urlencode” function. For instance, the following PHP code encodes the “http://myurl.com” parameter:
$sitesearch = urlencode('http://myurl.com');
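The URL-building steps described in this section can be sketched as one small helper. This is only an illustration under the parameter names given in this article (“u”, “k”, “o”, and “q” or “t”); the function name copyscape_build_url and the sample credentials are hypothetical and not part of the Copyscape API.

```php
<?php
// Hypothetical helper: builds a Copyscape query URL from the parameters
// described in this article. $searchType is 'q' (URL) or 't' (text).
function copyscape_build_url(
    string $username,
    string $apiKey,
    string $operation,
    string $searchType,
    string $value
): string {
    $allowedOperations = ['csearch', 'psearch', 'cpsearch'];
    if (!in_array($operation, $allowedOperations, true)) {
        throw new InvalidArgumentException("Unknown operation: $operation");
    }
    if ($searchType !== 'q' && $searchType !== 't') {
        throw new InvalidArgumentException("Search type must be 'q' or 't'");
    }

    // http_build_query() URL-encodes every value, so a separate
    // urlencode() call is not needed here.
    $query = http_build_query([
        'u' => $username,
        'k' => $apiKey,
        'o' => $operation,
        $searchType => $value,
    ], '', '&');

    return 'http://www.copyscape.com/api/?' . $query;
}

// Placeholder credentials, for illustration only.
echo copyscape_build_url('myuser', 'mykey', 'csearch', 'q', 'http://myurl.com'), "\n";
```

Validating the operation and search type up front means a typo is caught locally instead of being sent to the API as a billed query.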
Querying the API Using cURL
cURL support comes from a PHP extension (a binding to the libcurl library) that is bundled with PHP 5 and later, although it may need to be enabled on the server. PHP’s cURL functions manage the call to the Copyscape API and retrieve the XML data returned by the API. The following PHP code builds the query URL described above and sends it to the Copyscape API:
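// Replace <username> and <apiKey> with the account's own values.
$sitesearch = 'http://www.copyscape.com/api/?u=<username>&k=<apiKey>&o=csearch&q=' . urlencode('http://myurl.com');
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $sitesearch);
curl_setopt($curl, CURLOPT_TIMEOUT, 60);           // give up after 60 seconds
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
$response = curl_exec($curl);
curl_close($curl);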
The XML returned by Copyscape is stored in the “$response” variable. If the request fails, the XML contains an “error” element. If the search is successful and duplicate content is found, the XML contains one “result” element for each site on which a match was found.
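Before looping over results, it is worth checking the response for the “error” element mentioned above. The sketch below shows one way to do that with SimpleXML; the function name copyscape_check_response and the sample payload are made up for illustration, and the exact shape of a real error response should be confirmed in the Copyscape API documentation.

```php
<?php
// Hypothetical helper: parses the response and throws if the XML is
// malformed or contains an "error" element.
function copyscape_check_response(string $response): SimpleXMLElement
{
    // Collect XML parse errors internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($response);
    if ($xml === false) {
        throw new RuntimeException('Response is not valid XML');
    }
    if (isset($xml->error)) {
        throw new RuntimeException('Copyscape error: ' . (string) $xml->error);
    }
    return $xml;
}

// Made-up sample payload, for illustration only.
$sample = '<response><error>Example failure message</error></response>';
try {
    copyscape_check_response($sample);
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```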
Parsing Through the XML to Find Duplicate Content
With the XML returned, the code can loop through each result and display the duplicates. PHP developers can use the SimpleXML extension included with PHP, which makes parsing XML a breeze. The following PHP code parses the XML response from Copyscape:
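$sites = simplexml_load_string($response);
$html_results = '';
foreach ($sites->result as $xml):
    // The element names on the right are assumptions; confirm them in the Copyscape API documentation.
    list($title, $foundurl, $textmatched, $wordsmatched) = array($xml->title, $xml->url, $xml->textsnippet, $xml->minwordsmatched);
    $html_results = $html_results.$title.'<br>'.$foundurl.'<br>'.$textmatched.'<br>';
endforeach;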
The code above loops through each “result” element and builds an HTML string from it. The “$foundurl” variable holds the URL on which duplicate content was found, “$title” holds the title of that page, and “$textmatched” holds the matched text found on the page. “$wordsmatched” is the number of words on that page that duplicate the submitted content.
Place the Copyscape code above in any cloud application to run automatically when content is uploaded, or on demand using a “Search” form on the business website. The Copyscape API returns as many results as needed, but only the first three include a full text result. For a list of every Copyscape parameter and option, see the Copyscape API documentation.
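For an application that needs to store or filter matches rather than print them immediately, the parsing step can be separated from the display step. The following is a rough sketch of that split. The function names are hypothetical, and the XML element names (url, title, textsnippet, minwordsmatched) are assumptions that should be verified against the Copyscape API documentation. It also escapes each value with htmlspecialchars, since titles and snippets come from third-party pages.

```php
<?php
// Hypothetical helper: converts each "result" element into a plain array.
// Element names are assumed, not confirmed by this article.
function copyscape_results(SimpleXMLElement $xml): array
{
    $results = [];
    foreach ($xml->result as $result) {
        $results[] = [
            'url'          => (string) $result->url,
            'title'        => (string) $result->title,
            'textmatched'  => (string) $result->textsnippet,
            'wordsmatched' => (int) $result->minwordsmatched,
        ];
    }
    return $results;
}

// Hypothetical helper: renders the results as HTML, escaping every value.
function copyscape_results_html(array $results): string
{
    $html = '';
    foreach ($results as $r) {
        $html .= htmlspecialchars($r['title']) . '<br>'
            . htmlspecialchars($r['url']) . '<br>'
            . htmlspecialchars($r['textmatched']) . '<br>';
    }
    return $html;
}

// Made-up sample response, for illustration only.
$sample = '<response><result>'
    . '<url>http://example.com/copy</url>'
    . '<title>A &amp; B</title>'
    . '<textsnippet>some text</textsnippet>'
    . '<minwordsmatched>12</minwordsmatched>'
    . '</result></response>';

echo copyscape_results_html(copyscape_results(simplexml_load_string($sample))), "\n";
```

Keeping the results in an array makes it straightforward to, for example, ignore matches below a minimum word count before showing anything to an editor.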
Article by Jennifer Marsh
Jennifer Marsh is a software developer, programmer and technology writer who occasionally blogs for Rackspace Hosting.