What is Web Spam? Complete Guide
Search engines such as Google, Bing, Yandex, Yahoo and others have become a window onto the Internet, because most online activity begins with finding the right sites. Cybercriminals take advantage of this by pushing websites with low-quality content to the top of search results, so users spend longer looking for the information they need. Search companies fight back with sophisticated tools, and even draw on psychology to better understand the motives of attackers.
What is web spam
At first glance, the definition of web spam is quite simple: it is a web page whose owner uses black-hat search engine optimization (SEO) techniques. With them, the owner tries to outwit the search engine’s algorithms and reach the top positions in its rankings. This brings attackers a large flow of visitors, who then click on ads or end up with malware on their PCs.
In reality, web spam is harder to identify, because there is a fine line between legitimate and black-hat promotion techniques. In addition, even if the owner of a web page is abusing SEO tools, it is often difficult to tell whether this is deliberate or accidental. Sometimes even obvious spam pages are valuable to users and therefore cannot simply be blocked like the rest.
The need to filter out spam
Clearing web spam out of the search rankings is necessary to improve the quality of the search engine. Spam pages often have low-quality content yet occupy top places in the rankings. As a result, high-quality and useful pages are lost at the bottom of the list, where users rarely reach them.
Another reason to filter web spam is that such sites often contain malware that infects visitors’ computers. In addition, reducing the number of such sites improves the Internet ecosystem: it reduces the volume of traffic and makes this activity less attractive to cybercriminals.
It should also not be forgotten that a search index takes up hardware resources that cost money, and its storage space is limited. Removing web spam from it optimizes the system and frees up space for useful web pages.
At Google, anti-spam protection consists of two parts: an automated system and a team of experts who manually clean up the sites that the system missed. The size of that team is a secret. However, Kaspar Szymanski and Fili Wiese, specialists from the search quality department (also known as the anti-spam team), said in an interview that their department is spread across several parts of the world and that at any given minute one of their colleagues is cleaning garbage off the Internet.
Moreover, not only computer specialists work in this department. Wiese says his peers also include kite surfers, marathon runners, scuba divers, skippers, sommeliers, combat pilots and even submarine captains. They continually send their comments and opinions to the team responsible for automated filtering, and in this way Google improves its search results.
Number of sites checked by Google workers, depending on the type of spam
Bing, for its part, relies more on automated filters. Before starting to filter spammers out of search, its developers try to understand what motivates them. Knowing the motivation makes it easier to determine whether a web page is spam.
How spam is defined
The search engine’s fight against web spam is like a sword-and-shield rivalry. Cybercriminals constantly disguise and protect their sites, while search engines develop methods to identify them and make it harder to game the search algorithms. That is why search engines keep the exact details of how they work, and how their anti-spam filters function, secret.
Ultimately, cybercriminals create spam pages because it is a business. There are exceptions to this rule: some cybercriminals act for political or other reasons. However, most of them are trying to make money. The most popular way to earn from spam sites is by displaying ads. The more ads the visitors of such a web resource see, the higher the profit, because some percentage of users will click on an ad, and every click brings money to the attacker.
The average length of time a site stays at the top of Yandex results depends on the volume of ads on the site (advertising aggressiveness)
Knowing why cybercriminals create spam sites makes it easier to judge the usefulness of a web resource. Search engines analyze the following parameters:
quality of content. Since the spammer wants to make money from advertising, he needs only as much page content as it takes to achieve this goal. Spammers therefore do not create high-quality texts, but try to satisfy the requirements of search engines and raise the site’s ranking. In most cases, this means that visitors to such spam resources will not find what they need there. The usefulness of a web page is judged on hundreds of parameters, including the number of words on the page, the uniqueness of the content, etc.;
the presence of advertising. Almost every web page on the Internet today has advertisements, but that doesn’t automatically make it spam. The indicators are the number of ads on the screen, their type (banners, windows, pop-ups, etc.), and their intrusiveness;
layout. The placement of content and ads on a web page can also say a lot about a site. For example, ads can take up most of the main screen space or be neatly separated from the content;
social signals. When content is of high quality, readers discuss it on their social networks. This tells search engines that the site is not spam;
personal photos. Search engines place more trust in web pages that carry information about the author of the content: a photo, profiles in social networks, etc.
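As a rough illustration, the parameters listed above could be combined into a single score. The sketch below is only a toy: search engines keep their real filters secret, so the `PageSignals` fields, the weights and the thresholds are all assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Hypothetical per-page measurements (real engines use hundreds)."""
    word_count: int
    unique_text_ratio: float  # 0.0 = fully copied, 1.0 = fully unique
    ad_count: int             # ads visible on the screen
    ad_screen_share: float    # fraction of the first screen taken by ads
    social_mentions: int      # discussions of the page on social networks
    has_author_info: bool     # photo, social profiles, etc.


def spam_score(p: PageSignals) -> float:
    """Return a value in [0, 1]; higher means more spam-like.

    Every weight and threshold here is an illustrative assumption.
    """
    score = 0.0
    if p.word_count < 200:                      # thin content
        score += 0.2
    score += 0.3 * (1.0 - p.unique_text_ratio)  # copied text
    if p.ad_count > 5:                          # many ads at once
        score += 0.1
    score += 0.3 * p.ad_screen_share            # ads crowd out content
    if p.social_mentions == 0:                  # nobody discusses the page
        score += 0.05
    if not p.has_author_info:                   # anonymous content
        score += 0.05
    return min(score, 1.0)


useful_page = PageSignals(1200, 1.0, 2, 0.1, 40, True)
spam_page = PageSignals(80, 0.1, 9, 0.7, 0, False)
print(spam_score(useful_page) < spam_score(spam_page))
```

A linear sum like this is easy to read but also easy for a spammer to game, which is one reason the real parameters are not published.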
The next goal of spammers is to make more money. Once cybercriminals have several lucrative web pages, they want to maximize their earnings. To do this, they often resort to black-hat methods of website promotion.
To maximize their online presence, attackers use various approaches that let them create a large number of their own web pages quickly and cheaply. For example, they can copy someone else’s content completely or with minor changes, use automatic text generation programs, and promote pages with non-unique content.
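Copied content with minor changes can be caught by comparing overlapping word sequences. The sketch below uses word shingles and the Jaccard index, a common textbook technique for near-duplicate detection; it is not a description of any particular search engine’s filter, and the shingle size `k=3` is an arbitrary assumption.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of all k-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard index of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0  # texts shorter than k words cannot be compared
    return len(sa & sb) / len(sa | sb)


original = "the quick brown fox jumps over the lazy dog"
rewritten = "the quick brown fox leaps over the lazy dog"
print(similarity(original, rewritten))
```

Changing a single word breaks only the shingles that contain it, so a lightly edited copy still shares a large part of its shingles with the original, while unrelated texts share almost none.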
There are also dozens of methods for raising a page’s position in search rankings. These include stuffing the site with keywords, manipulating links through dedicated resources, joining sites into networks, abusing forums, and adding content that is invisible to users.
Search engines fight these methods by changing their algorithms, which you can easily notice yourself. When a search engine returns a different result for the same query, it often means that its ranking principles have changed. Yandex, for example, introduced new ranking rules this summer. Google has been rolling out updates for about a year, and Bing was updated this spring.
Spammers try to protect themselves from these measures, because being classified as spam by a search engine means a drop in profits. To do this, they use redirects, hide content, disguise it as legitimate and use dynamic texts.
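Keyword stuffing, the first method in that list, is the easiest to illustrate. The sketch below measures what share of a page’s words is taken up by one keyword. The function names and the 15% threshold are assumptions made for this example, not a published rule of any search engine.

```python
import re
from collections import Counter


def keyword_density(text: str, keyword: str) -> float:
    """Share of words in the text equal to a single-word keyword."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return Counter(words)[keyword.lower()] / len(words)


def looks_stuffed(text: str, keyword: str, threshold: float = 0.15) -> bool:
    # The threshold is an arbitrary assumption for illustration only.
    return keyword_density(text, keyword) > threshold


stuffed = "cheap phones cheap phones buy cheap phones cheap"
natural = "We compared several phones and found one cheap model worth buying"
print(looks_stuffed(stuffed, "cheap"), looks_stuffed(natural, "cheap"))
```

The tokenizer only handles Latin letters and digits and the check covers one keyword at a time, so this is a starting point rather than a working filter.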
The number of sites with aggressive advertising in the Russian segment of the Internet has halved in 2 years (according to Yandex)
Where does web spam occur?
Spam pages can appear anywhere, although they are more common in some segments of the Internet (downloading programs, music, etc.). Spam can be found on familiar sites, forums, social networks, personal blogs, and even in the advertisements shown by search engines.
Google, for example, reported its largest drop in the number of malicious links at the top of its SERPs in 2011. That year their number fell by 50%, after the search giant spent millions of dollars refining its system. In absolute terms, this decrease meant the removal of 130 million malicious sites from search results.
Search engines cannot get rid of harmful sites completely, although they are working on it. The safest search engine today is Google: it shows only 272 harmful sites per 10 million web resources. For comparison, Bing shows users 1285 dangerous pages per 10 million, and Yandex 3330.
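To put these figures in perspective, the short sketch below converts them into percentages and compares each engine with Google. The numbers come from the paragraph above; the function names are made up for this example.

```python
PER_RESOURCES = 10_000_000  # the figures are quoted per 10 million resources

harmful_sites = {"Google": 272, "Bing": 1285, "Yandex": 3330}


def rate_percent(engine: str) -> float:
    """Harmful sites as a percentage of all resources shown."""
    return harmful_sites[engine] / PER_RESOURCES * 100


def times_google(engine: str) -> float:
    """How many times more harmful sites an engine shows than Google."""
    return harmful_sites[engine] / harmful_sites["Google"]


for name in harmful_sites:
    print(f"{name}: {rate_percent(name):.4f}% ({times_google(name):.1f}x Google)")
```

Even the highest figure is a small fraction of one percent, but the relative gap is large: on these numbers Bing shows nearly five times as many harmful sites as Google, and Yandex about twelve times as many.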