Web Crawling

Web crawlers, also known as internet bots or spiders, help search engines keep their index of the web up to date by visiting thousands of websites; this process is known as web crawling.

How do they do that? Crawlers identify themselves to a web server with the User-Agent request header of an HTTP request, and each crawler has its own unique identifier. They start from a list of URLs (also called seeds), visit those pages, and copy all of the hyperlinks they find there.
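
As a concrete illustration, here is a minimal sketch of this fetch-and-extract step in Python, using the requests and BeautifulSoup libraries; the ExampleBot name and the seed URL are placeholders, not a real crawler:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# The crawler identifies itself to the server via the User-Agent header.
HEADERS = {"User-Agent": "ExampleBot/1.0 (+https://example.com/bot)"}

def fetch_links(url):
    """Fetch one page and return the absolute URLs of all hyperlinks on it."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

# Start from the seed list and collect the links found on each page.
seeds = ["https://example.com/"]
for seed in seeds:
    print(fetch_links(seed))

A real crawler would feed the discovered links back into a queue and remember which pages it has already visited.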

Because there are millions of pages, websites, and web apps, crawlers read only parts of each site. They mainly scan the most popular spots on a site: pages that are relevant and have good internal and external links. Some spiders also normalize the URLs they collect and store them in a predefined format to avoid duplicate content.
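
A toy version of such URL normalization, assuming we canonicalize case, default ports, and trailing slashes (real crawlers apply many more rules):

from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Reduce equivalent URL spellings to one canonical form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Drop default ports and trailing slashes; fragments never reach the server.
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[: -len(":80")]
    if scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[: -len(":443")]
    path = path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, query, ""))

# Both spellings normalize to the same stored key.
print(normalize_url("HTTP://Example.com:80/docs/"))  # http://example.com/docs
print(normalize_url("http://example.com/docs"))      # http://example.com/docs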

Because search engine optimization prioritizes content and articles that are new and up to date, some crawlers revisit pages whose content is updated on a regular basis.

There are also crawlers that crawl websites regardless of whether anything on the site has been updated or changed.

It all depends on the crawler's algorithm and the purpose it was made for. For instance, crawlers that archive websites and content save the pages they visit as cached copies.

How does web crawling work? Every time a crawler arrives at a website, it first communicates with the web server. It identifies itself as a crawler and looks for the robots.txt file. That file contains the permissions and restrictions that the site's administrators have written beforehand: for example, which pages should be crawled and which should not, or which part of the site holds sensitive information such as payment transactions.
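
Python's standard library ships a parser for exactly this check; here is a minimal sketch, assuming a hypothetical crawler named ExampleBot and placeholder URLs:

from urllib.robotparser import RobotFileParser

# A well-behaved crawler reads robots.txt before fetching anything else.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

user_agent = "ExampleBot"  # hypothetical crawler name
for url in ("https://example.com/blog/post", "https://example.com/cgi-bin/tool"):
    verdict = "allowed" if robots.can_fetch(user_agent, url) else "disallowed"
    print(verdict, url)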

Here is an example of a robots.txt file showing what is and is not allowed for crawlers.

User-Agent: *
Disallow: /cgi-bin
Disallow: /wp-
Disallow: /?s=
Disallow: *&s=
Disallow: /search
Disallow: /author/
Disallow: *?attachment_id=
Disallow: */feed
Disallow: */rss
Disallow: */embed
Allow: /wp-content/uploads/
Allow: /wp-content/themes/
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-*.svg
Allow: /wp-*.pdf

Sometimes servers are crawled continuously, so admins can limit the crawl rate and decide which pages should be crawled for SEO purposes.
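
One common way to express such a limit is the Crawl-delay directive, a non-standard robots.txt extension that some crawlers (Bingbot and Yandex, for example) honor and others (notably Googlebot) ignore:

User-Agent: *
Crawl-delay: 10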

There are thousands of web crawlers and bots; here is a list of the most famous:

  • GoogleBot
  • Bingbot
  • DuckDuckBot
  • Yandex Bot
  • Sogou Spider
  • Exabot
  • Baiduspider
  • Slurp Bot
  • Facebook external hit
  • Alexa crawler

Googlebot is the name for two different bots: one for the desktop version of a site and one for the mobile version. Websites are usually crawled by both types. The bot runs on thousands of machines for better performance, and these machines are spread all over the world, so Googlebot will crawl your site from servers close to where it is hosted. If you want to reduce Googlebot's crawl rate or stop it from crawling certain pages, you can do that from robots.txt. Read more about Googlebot.
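
For example, to keep Googlebot specifically out of one directory, robots.txt can address it by name (the /private/ path here is just a placeholder):

User-Agent: Googlebot
Disallow: /private/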

Bing uses BingBot, AdIdxBot, and BingPreview. Bingbot is the standard crawler that handles most of Bing's everyday crawling jobs, and it uses a couple of different user-agent strings.
AdIdxBot is dedicated to Bing Ads: its job is to crawl ads and follow through to the websites behind those ads for quality-control purposes. BingPreview generates page snapshots. All of them have desktop and mobile versions. Read more about Bing's bots.
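
Since these bots announce themselves in the User-Agent header, a server can roughly recognize them; here is a sketch of such a check (user agents can be spoofed, so real verification also uses reverse DNS lookups):

KNOWN_BOTS = ("googlebot", "bingbot", "adidxbot", "bingpreview")

def looks_like_bot(user_agent: str) -> bool:
    """Rough check of whether a request's User-Agent names a known crawler."""
    ua = user_agent.lower()
    return any(bot in ua for bot in KNOWN_BOTS)

# Bingbot's published user-agent string is recognized.
print(looks_like_bot("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))  # True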

Indexing

Indexing is done by search-engine companies. There are many indexing methodologies; the algorithms that search engines use are similar but never the same. Every search-engine company has its own algorithm, and it is kept secret.

While indexing, search engines look at websites and collect, store, and parse their data so that visitors can get information in a short time. When we search for something in the browser, the information is already stored in the search engine's index, which is why we get results so quickly.
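
The core data structure behind this speed is an inverted index, which maps each word to the pages that contain it, so a query becomes a dictionary lookup rather than a scan of the whole web. A toy sketch (real engines add ranking, stemming, and much more; the URLs are placeholders):

from collections import defaultdict

# Map each word to the set of pages that contain it.
index = defaultdict(set)

pages = {
    "https://example.com/a": "web crawlers feed the search index",
    "https://example.com/b": "the index answers search queries quickly",
}

for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

print(index["search"])    # both pages
print(index["crawlers"])  # only the first page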
