As artificial intelligence continues to be developed and incorporated into our everyday lives, the number of skeptics continues to grow. AI can do a lot of things that help us save time: it can send out thousands of emails at once, write and edit content for us, answer customers’ questions, and so on. But one persistent problem with computers is their lack of common sense.
This shows up in many areas of artificial intelligence, such as chatbots and algorithms. Many chatbots you find online can only answer questions that match, word for word, what they were taught to answer. Say anything else and they get confused and have no answer. Algorithms are another problem. The algorithms that search engines use often cannot distinguish unique content from duplicate content. They mistake similar things for duplicates, and this can have a negative effect on websites.
Take a look at this article from MarTech Today for more information.
Human vs machine intelligence: how to win when ‘duplicate’ content is unique
Sometimes humans and machines disagree about what content is duplicate content. Here’s why, and how to beat the system when it happens.
Sponsored Content: OnCrawl on December 11, 2018 at 7:30 am
As impressive as machine learning and algorithm-based intelligence can be, they often lack something that comes naturally to humans: common sense.
It’s common knowledge that putting the same content on multiple pages produces duplicate content. But what if you create pages about similar things, with differences that matter? Algorithms flag them as duplicates, though humans have no problem telling pages like these apart:
- E-commerce: similar products with multiple variants or critical differences
- Travel: hotel branches, destination packages with similar content
- Classifieds: exhaustive listings for identical items
- Business: pages for local branches offering the same services in different regions
How does this happen? How can you spot issues? What can you do about it?
The danger of duplicate content
Duplicate content interferes with your ability to make your site visible to search users through:
- Loss of ranking for unique pages that unintentionally compete for the same keywords
- Inability to rank pages in a cluster because Google chose one page as a canonical
- Loss of site authority for large quantities of thin content
How machines identify duplicate content
Google uses algorithms to determine whether two pages or parts of pages are duplicate content, which Google defines as content that is “appreciably similar”.
Google’s similarity detection is based on their patented Simhash algorithm, which analyzes blocks of content on a web page. It then calculates a unique identifier for each block, and composes a hash, or “fingerprint”, for each page.
Because the number of webpages is colossal, scalability is key. Currently, Simhash is the only feasible method for finding duplicate content at scale.
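To make the idea of a page fingerprint concrete, here is a minimal sketch of a Simhash-style calculation. It is an illustration only, not Google's implementation: the three-word blocks, the use of MD5 as the per-block hash, the 64-bit length, and the function names are all assumptions chosen for this example.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def shingles(tokens, size=3):
    """Return overlapping word n-grams, used here as the page's "blocks"."""
    if len(tokens) < size:
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def hash64(feature):
    """Map a block to a stable 64-bit integer (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text, bits=64):
    """Combine per-block hashes into one fixed-length fingerprint."""
    counts = [0] * bits
    for feature in shingles(tokenize(text)):
        h = hash64(feature)
        for i in range(bits):
            # Each block votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:  # the majority vote decides the final bit
            fingerprint |= 1 << i
    return fingerprint
```

Because every block only nudges the per-bit vote, changing a few blocks on a page tends to flip only a few bits of the fingerprint, which is what makes near-duplicate detection possible.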
Simhash fingerprints are:
- Inexpensive to calculate. They are established in a single crawl of the page.
- Easy to compare, thanks to their fixed length.
- Able to find near-duplicates. They equate minor changes on a page with minor changes in the hash, unlike many other algorithms.
This last property means that the difference between any two fingerprints can be measured algorithmically and expressed as a percentage. To reduce the cost of evaluating every single pair of pages, Google employs techniques such as:
- Clustering: by grouping sets of sufficiently similar pages together, only fingerprints within a cluster need to be compared, since everything else is already classified as different.
- Estimations: for exceptionally large clusters, an average similarity is applied after a certain number of fingerprint pairs are calculated.
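The comparison and grouping steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not a description of Google's actual pipeline: fingerprints are plain integers, similarity is the share of matching bits, the greedy grouping strategy and the threshold are hypothetical, and the estimate simply averages the first `max_pairs` pairs rather than drawing a proper sample.

```python
from itertools import combinations, islice


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b, bits=64):
    """Share of matching bits between two fingerprints, from 0.0 to 1.0."""
    return 1 - hamming_distance(a, b) / bits


def cluster(fingerprints, threshold=0.9, bits=64):
    """Greedy grouping: a page joins the first cluster whose first member
    is at least `threshold` similar, otherwise it starts a new cluster."""
    clusters = []
    for url, fp in fingerprints.items():
        for group in clusters:
            if similarity(fingerprints[group[0]], fp, bits) >= threshold:
                group.append(url)
                break
        else:
            clusters.append([url])
    return clusters


def estimated_cluster_similarity(urls, fingerprints, max_pairs=1000, bits=64):
    """Average similarity over at most `max_pairs` pairs in a cluster."""
    pairs = list(islice(combinations(urls, 2), max_pairs))
    if not pairs:
        return 1.0
    total = sum(similarity(fingerprints[a], fingerprints[b], bits) for a, b in pairs)
    return total / len(pairs)


# Toy 8-bit fingerprints (made-up values) to keep the arithmetic visible.
pages = {"/shoes-red": 0b11110000, "/shoes-blue": 0b11110001, "/contact": 0b00001111}
groups = cluster(pages, threshold=0.85, bits=8)
# groups == [["/shoes-red", "/shoes-blue"], ["/contact"]]
```

In the toy data, the two shoe pages differ in one bit out of eight (87.5% similar) and land in the same cluster, while the contact page shares no bits with them and starts its own.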
Finally, Google uses a weighted similarity rate that excludes certain blocks of identical content (boilerplate: header, navigation, sidebars, footer; disclaimers…). It takes into account the subject of the page using n-gram analysis to determine which words on the page occur most frequently, and – in the context of the site – are most important.
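As a rough illustration of those two ideas, the sketch below drops blocks labelled as boilerplate and then counts n-grams in what remains. The block labels, the `(kind, text)` page representation, and the function names are assumptions made for this example; how Google actually segments pages and weights terms across a site is not described in the article, and site-wide weighting is left out here.

```python
import re
from collections import Counter

# Hypothetical labels for blocks that should not count toward similarity.
BOILERPLATE = {"header", "navigation", "sidebar", "footer", "disclaimer"}


def content_text(blocks):
    """Join the text of every block that is not labelled as boilerplate.

    `blocks` is a list of (kind, text) tuples.
    """
    return " ".join(text for kind, text in blocks if kind not in BOILERPLATE)


def ngram_counts(text, n=2):
    """Count how often each word n-gram occurs in the text."""
    tokens = re.findall(r"\w+", text.lower())
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


page = [
    ("navigation", "Home Shop Contact"),
    ("main", "Red running shoes. Lightweight running shoes for trail running."),
    ("footer", "Copyright Example Shop"),
]
counts = ngram_counts(content_text(page))
# counts.most_common(1) == [("running shoes", 2)]
```

Here the navigation and footer text never reach the n-gram count, so the most frequent phrase reflects the subject of the page rather than the site template.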
Analyzing duplicate content with Simhash
We’ll be looking at a map of content clusters flagged as similar using Simhash. This chart from OnCrawl overlays an analysis of your duplicate content strategy on clusters of duplicate content.
OnCrawl’s content analysis also includes similarity ratios, content clusters, and n-gram analysis. OnCrawl is also working on an experimental heatmap indicating similarity per content block that can be overlaid on a webpage.
Validating clusters with canonicals
Using canonical URLs to indicate the main page in a group of similar pages is a way of intentionally clustering pages. Ideally, the clusters created by canonicals and those established by Simhash should be identical.
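One way to check whether the two sets of clusters agree is sketched below. It assumes you already have, from a crawl, a mapping of each URL to its declared canonical and a list of clusters produced by a similarity analysis; the data shapes, function names, and example URLs are hypothetical and do not reflect OnCrawl's actual export format.

```python
from collections import defaultdict


def canonical_clusters(canonicals):
    """Group URLs by the canonical URL they declare."""
    groups = defaultdict(set)
    for url, canonical in canonicals.items():
        groups[canonical].add(url)
    return {frozenset(group) for group in groups.values()}


def compare_clusters(canonicals, similarity_clusters):
    """Report which clusters agree and which exist in only one view."""
    declared = canonical_clusters(canonicals)
    detected = {frozenset(group) for group in similarity_clusters}
    return {
        "matching": declared & detected,
        "canonical_only": declared - detected,
        "similarity_only": detected - declared,
    }


canonicals = {
    "/shoes": "/shoes",
    "/shoes?color=red": "/shoes",
    "/hotel-paris": "/hotel-paris",
    "/hotel-lyon": "/hotel-lyon",
}
similarity_clusters = [["/shoes", "/shoes?color=red"], ["/hotel-paris", "/hotel-lyon"]]
report = compare_clusters(canonicals, similarity_clusters)
```

In this example the shoe pages match in both views, while the two hotel pages declare themselves as separate canonicals but are grouped together by similarity. That second case is the kind of disagreement between humans and machines the article is about.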
Based in Rochester, New York, Netsville is an Internet Property Management company specializing in managing the Digital Marketing, Technical, and Business Solutions for our customers since 1994.