As artificial intelligence continues to be developed and incorporated into our everyday lives, the number of skeptics continues to grow. AI can do a lot of things that help us save time. It can send out thousands of emails at once, write and edit content for us, answer questions for customers, and more. But one thing that continues to be a problem with computers is their lack of common sense.

This shows up in many areas of artificial intelligence, such as chatbots and search algorithms. Many chatbots you find online can only answer questions that match, word for word, what they’ve been taught to answer. Say anything else and they are confused and have no answer. Another problem is the algorithms that search engines use, which often cannot distinguish unique content from duplicate content. They mistake similar things for duplicates, and this can have a negative effect on websites.

Take a look at this article from MarTech Today for more information.

If you’re looking for help with content or anything in the digital realm, Netsville can help! Contact us today for more info on how to get started for FREE.

Human vs machine intelligence: how to win when ‘duplicate’ content is unique

Sometimes humans and machines disagree about what content is duplicate content. Here’s why–and how to beat the system when it happens.

Sponsored Content: OnCrawl on December 11, 2018 at 7:30 am

As impressive as machine learning and algorithm-based intelligence can be, they often lack something that comes naturally to humans: common sense.

It’s common knowledge that putting the same content on multiple pages produces duplicate content. But what if you create pages about similar things, with differences that matter? Algorithms flag them as duplicates, though humans have no problem telling pages like these apart:

E-commerce: similar products with multiple variants or critical differences
Travel: hotel branches, destination packages with similar content
Classifieds: exhaustive listings for identical items
Business: pages for local branches offering the same services in different regions

How does this happen? How can you spot issues? What can you do about it?

The danger of duplicate content

Duplicate content interferes with your ability to make your site visible to search users through:

Loss of ranking for unique pages that unintentionally compete for the same keywords
Inability to rank pages in a cluster because Google chose one page as a canonical
Loss of site authority for large quantities of thin content

How machines identify duplicate content

Google uses algorithms to determine whether two pages or parts of pages are duplicate content, which Google defines as content that is “appreciably similar”.

Google’s similarity detection is based on its patented Simhash algorithm, which analyzes blocks of content on a web page. It calculates an identifier for each block, then combines these identifiers into a hash, or “fingerprint”, for each page.

Because the number of webpages is colossal, scalability is key. Currently, Simhash is the only feasible method for finding duplicate content at scale.

Simhash fingerprints are:

Inexpensive to calculate. They are established in a single crawl of the page.
Easy to compare, thanks to their fixed length.
Able to find near-duplicates. They equate minor changes on a page with minor changes in the hash, unlike many other algorithms.
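
To make the fingerprint idea concrete, here is a minimal Python sketch of the general Simhash technique. It is an illustration only, not Google’s implementation: the choice of overlapping three-word phrases as features, the 64-bit fingerprint length, and the use of MD5 to hash each feature are all assumptions made for this example.

```python
import hashlib
import re


def features(text, n=3):
    """Split text into overlapping word n-grams (an assumed feature choice)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def simhash(text, bits=64):
    """Return a fixed-length integer fingerprint for the text.

    Assumes bits is a multiple of 8 and at most 128 (the MD5 digest size).
    """
    counts = [0] * bits
    for feature in features(text):
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because every feature only nudges the vote for each bit, changing a few words on a page flips only some of the bits, which is what lets near-duplicates end up with similar fingerprints.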

This last property means that the difference between any two fingerprints can be measured algorithmically and expressed as a percentage. To reduce the cost of evaluating every single pair of pages, Google employs techniques such as:

Clustering: by grouping sets of sufficiently similar pages together, only fingerprints within a cluster need to be compared, since everything else is already classified as different.
Estimations: for exceptionally large clusters, an average similarity is applied after a certain number of fingerprint pairs are calculated.
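
As a rough sketch of how fingerprints might be compared and grouped, the Python below counts the bits that differ between two fingerprints and turns that count into a percentage. The formula, the 90% threshold, and the simple greedy grouping are assumptions for illustration; the article does not describe Google’s exact method.

```python
def hamming_distance(a, b):
    """Count the bit positions where two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b, bits=64):
    """Express fingerprint similarity as a percentage of matching bits."""
    return 100.0 * (bits - hamming_distance(a, b)) / bits


def cluster(fingerprints, threshold=90.0, bits=64):
    """Greedily group URLs by fingerprint similarity.

    fingerprints maps URL -> integer fingerprint. Each page joins the first
    cluster whose representative (its first member) is similar enough;
    otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative fingerprint, [urls])
    for url, fp in fingerprints.items():
        for representative, members in clusters:
            if similarity(representative, fp, bits) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((fp, [url]))
    return [members for _, members in clusters]
```

For example, `cluster({"/a": 0, "/b": 1, "/c": 2**64 - 1})` groups `/a` with `/b`, which differ by a single bit, and leaves `/c` in a cluster of its own.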

Comparing page fingerprints. Source: Near-duplicate document detection for web crawling (Google patent)

Finally, Google uses a weighted similarity rate that excludes certain blocks of identical content (boilerplate: header, navigation, sidebars, footer; disclaimers…). It takes into account the subject of the page using n-gram analysis to determine which words on the page occur most frequently, and – in the context of the site – are most important.
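
The sketch below shows one way to leave boilerplate out of a comparison and to find a page’s most frequent phrases. It is a simplified stand-in, not Google’s weighted similarity: the block labels, the `BOILERPLATE` set, the use of two-word n-grams, and the Jaccard overlap used as the similarity score are all assumptions chosen for this example.

```python
import re
from collections import Counter

# Hypothetical labels for blocks that should not count toward similarity.
BOILERPLATE = {"header", "navigation", "sidebar", "footer", "disclaimer"}


def ngrams(text, n=2):
    """Return the overlapping word n-grams in a piece of text."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def content_ngrams(blocks, n=2):
    """Collect n-grams from (label, text) blocks, skipping boilerplate."""
    grams = []
    for label, text in blocks:
        if label not in BOILERPLATE:
            grams.extend(ngrams(text, n))
    return grams


def top_ngrams(blocks, n=2, k=5):
    """Return the k most frequent n-grams in the non-boilerplate blocks."""
    return Counter(content_ngrams(blocks, n)).most_common(k)


def content_similarity(page_a, page_b, n=2):
    """Percentage of shared n-grams (Jaccard overlap), boilerplate excluded."""
    a = set(content_ngrams(page_a, n))
    b = set(content_ngrams(page_b, n))
    if not a and not b:
        return 100.0  # two pages with no main content are treated as identical
    return 100.0 * len(a & b) / len(a | b)
```

Two product pages that share a header but differ by one word in the main block (“size small” versus “size large”) score 60% here, because the identical header is ignored and only the product text is compared.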

Analyzing duplicate content with Simhash

We’ll be looking at a map of content clusters flagged as similar using Simhash. This chart from OnCrawl overlays an analysis of your duplicate content strategy on clusters of duplicate content.

OnCrawl’s content analysis also includes similarity ratios, content clusters, and n-gram analysis. OnCrawl is also working on an experimental heatmap indicating similarity per content block that can be overlaid on a webpage.

Mapping a website by content similarity. Each block represents a cluster of similar content. Colors indicate the coherence of the canonicalization strategy for each cluster. Source: OnCrawl.

Validating clusters with canonicals

Using canonical URLs to indicate the main page in a group of similar pages is a way of intentionally clustering pages. Ideally, the clusters created by canonicals and those established by Simhash should be identical.
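
A quick way to check this on your own data is to group pages both ways and compare the groups. The Python below is a minimal sketch that assumes you already have two mappings exported from a crawl: one from each URL to its declared canonical, and one from each URL to a similarity cluster ID. The function and key names are hypothetical.

```python
from collections import defaultdict


def group_by(mapping):
    """Turn a URL -> key mapping into a set of URL groups."""
    groups = defaultdict(set)
    for url, key in mapping.items():
        groups[key].add(url)
    return {frozenset(urls) for urls in groups.values()}


def compare_clusters(canonical_of, similarity_cluster_of):
    """Compare groups defined by canonicals with groups found by similarity."""
    canonical_groups = group_by(canonical_of)
    similarity_groups = group_by(similarity_cluster_of)
    return {
        "matching": canonical_groups & similarity_groups,
        "canonical_only": canonical_groups - similarity_groups,
        "similarity_only": similarity_groups - canonical_groups,
    }
```

Groups listed under `matching` are the ideal case. Anything under `canonical_only` or `similarity_only` marks a place where your canonical strategy and the similarity analysis disagree and is worth a closer look.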