Search engines generally go through two main stages to make content available to users in search results: crawling and indexing. Crawling is when search engine crawlers access publicly available web pages. In general, this involves looking at the pages and following the links on those pages, just as a human user would. Indexing involves gathering information about a page so that it can be made available (“served”) through search results.
Before getting started on this tutorial, read my previous posts on App Indexing to connect your Android app, Part 1 and Part 2.
The methods described in this tutorial help you control aspects of both crawling and indexing, so you can decide how you would prefer your content to be accessed by crawlers, as well as how you would like your content to be presented to users in search results.
In some situations, you may not want to allow crawlers to access areas of a server. This could be the case if accessing those pages uses limited server resources, or if problems with the URL and linking structure would create an infinite number of URLs if all of them were followed.
The robots.txt file is a text file that lets you specify how you would like your site to be crawled. Before crawling a site, crawlers generally request the robots.txt file from the server. Within the robots.txt file, you can include sections for specific (or all) crawlers with instructions (“directives”) that tell them which parts can or cannot be crawled.
Location of the robots.txt file
The robots.txt file must be located at the root of the website host that it should apply to. For instance, to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt. A robots.txt file can be placed on subdomains or on non-standard ports, but it cannot be placed in a subdirectory. There are more details regarding the location in the specifications.
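The location rule can be sketched in a few lines of Python. This is only an illustration, not part of any crawler: the helper name robots_txt_url is hypothetical, and it simply keeps the scheme, host and port of a page URL and replaces everything else with /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL.

    Hypothetical helper for illustration: the scheme, host and port are
    kept, while the path, query string and fragment are discarded.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    for url in (
        "http://www.example.com/folder/page.html?x=1",
        "http://example.com:8181/a/b",
        "https://shop.example.com/",
    ):
        print(url, "->", robots_txt_url(url))
```

Note that a subdomain or a non-standard port yields a different robots.txt URL, while a subdirectory never does.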
Content of the robots.txt file
You can use almost any text editor to create a robots.txt file. Just make sure it can create standard ASCII or UTF-8 text files; do not use a word processor. A general robots.txt file might look like this:
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Disallow: /onlygooglebot/

Sitemap: http://www.example.com/sitemap.xml
Assuming this file is located at http://example.com/robots.txt, it specifies the following directives:
No Googlebot crawlers should crawl http://example.com/nogooglebot/ or any URLs it contains. The line “User-agent: Googlebot” starts the section with directives for Googlebots.
No other crawlers should crawl http://example.com/onlygooglebot/ or any URLs it contains. The line “User-agent: *” starts the section for all crawlers not otherwise specified.
The site’s Sitemap file is located at http://www.example.com/sitemap.xml
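To check how such a file is interpreted, you can feed it to a parser. The sketch below uses Python's standard urllib.robotparser module; the helper name build_parser and the crawler name OtherBot are made up for the example, and Python's matching rules are an approximation that may differ in details from those of a real search engine crawler.

```python
from urllib.robotparser import RobotFileParser

# The general example from this section, one directive per line.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Disallow: /onlygooglebot/

Sitemap: http://www.example.com/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # "OtherBot" is a hypothetical crawler name used only for illustration.
    for agent in ("Googlebot", "OtherBot"):
        for path in ("/nogooglebot/page.html", "/onlygooglebot/page.html"):
            url = "http://example.com" + path
            print(agent, path, parser.can_fetch(agent, url))
    # site_maps() is available from Python 3.8 onwards.
    print(parser.site_maps())
```

With this file, Googlebot is refused only under /nogooglebot/, while every other crawler is refused only under /onlygooglebot/.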
Some sample robots.txt files
These are some simple samples to help you get started with robots.txt.
Allow crawling of all content
User-agent: *
Disallow:

or

User-agent: *
Allow: /
Either form above is valid, but in fact, if you want all your content to be crawled, you don’t need a robots.txt file at all (and we recommend that you don’t use one). If you don’t have a robots.txt file, verify that your hoster returns a 404 “Not found” HTTP result code when the URL is requested.
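One way to verify the result code is a small script. This is a minimal sketch using only the Python standard library; the function name robots_txt_status is hypothetical, and it assumes the server answers HEAD requests the same way it answers GET requests, which is not guaranteed for every host.

```python
import urllib.error
import urllib.request


def robots_txt_status(base_url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the host sends for /robots.txt.

    Hypothetical helper: sends a HEAD request and reports the status,
    including error statuses such as 404.
    """
    url = base_url.rstrip("/") + "/robots.txt"
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx and 5xx responses; the code is on the exception.
        return err.code


if __name__ == "__main__":
    # http://www.example.com/ is a placeholder; substitute your own site.
    print(robots_txt_status("http://www.example.com/"))
```

If you intentionally have no robots.txt file, the value to look for is 404.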
Disallow crawling of the whole website
User-agent: *
Disallow: /

Keep in mind that in some situations URLs from the site may still be indexed, even if they haven’t been crawled.
Disallow crawling of certain parts of the website
User-agent: *
Disallow: /calendar/
Disallow: /junk/

Remember that you shouldn’t use robots.txt to block access to private content: use proper authentication instead. URLs disallowed by the robots.txt file may still be indexed without being crawled, and the robots.txt file can be viewed by anyone, potentially disclosing the location of your private content.
Allowing access to a single crawler
User-agent: Googlebot-news
Disallow:

User-agent: *
Disallow: /
Allowing access to all but a single crawler
User-agent: Unnecessarybot
Disallow: /

User-agent: *
Disallow:
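The sample files in this section can be compared side by side with a short script. This is a sketch only: the SAMPLES dictionary, the allowed helper and the crawler name OtherBot are invented for the example, and Python's urllib.robotparser is used as a stand-in for a real crawler, whose matching rules may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The sample policies from this section, keyed by a short label.
SAMPLES = {
    "allow all": "User-agent: *\nDisallow:\n",
    "block all": "User-agent: *\nDisallow: /\n",
    "block parts": "User-agent: *\nDisallow: /calendar/\nDisallow: /junk/\n",
    "only Googlebot-news": (
        "User-agent: Googlebot-news\nDisallow:\n\nUser-agent: *\nDisallow: /\n"
    ),
    "all but Unnecessarybot": (
        "User-agent: Unnecessarybot\nDisallow: /\n\nUser-agent: *\nDisallow:\n"
    ),
}


def allowed(sample_name: str, agent: str, path: str) -> bool:
    """Report whether `agent` may crawl `path` under the named sample policy."""
    parser = RobotFileParser()
    parser.parse(SAMPLES[sample_name].splitlines())
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    # "OtherBot" is a hypothetical crawler name used only for illustration.
    for name in SAMPLES:
        for agent in ("Googlebot-news", "Unnecessarybot", "OtherBot"):
            print(name, "|", agent, "|", allowed(name, agent, "/calendar/2020"))
```

An empty Disallow line places no restriction on the crawler it applies to, which is why it acts as "allow everything" in the last two samples.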
Controlling indexing and serving
Indexing can be controlled on a page-by-page basis using simple information that is sent with each page as it is crawled. For indexing control, you can use either:
1.) a special meta tag that can be embedded at the top of HTML pages
2.) a special HTTP header element that can be sent with all content served by the site
Using the robots meta tag
The robots meta tag can be added to the top of an HTML page, in the <head> section, for instance:
<!DOCTYPE html>
<html><head>
<meta name="robots" content="noindex" />
...
In this case, the robots meta tag specifies that no search engines should index this particular page (noindex). The name robots applies to all search engines. If you want to block or allow a specific search engine, you can specify a user-agent name in place of robots.
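To confirm that a page carries the directive you intended, the meta tag can be read back with a small parser. The following is a minimal sketch built on Python's standard html.parser module; the class name RobotsMetaParser and the function is_noindex are hypothetical, and the sketch assumes the directive is carried in the tag's content attribute, as in standard HTML meta tags.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from robots meta tags (illustrative sketch)."""

    def __init__(self, agent: str = "robots"):
        super().__init__()
        self.agent = agent.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        # Accept the generic "robots" name and the specific agent name.
        if name in ("robots", self.agent):
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def is_noindex(html: str, agent: str = "robots") -> bool:
    """Return True if the page asks `agent` not to index it."""
    parser = RobotsMetaParser(agent)
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<!DOCTYPE html><html><head><meta name="robots" content="noindex" /></head></html>'
    print(is_noindex(page))
```

Passing a crawler name as the agent argument also picks up tags addressed to that crawler alone.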
Using the X-Robots-Tag HTTP header
In some situations, non-HTML content (such as document files) can also be crawled and indexed by search engines. In these cases, it is not possible to add a meta tag to the individual pages. Instead, an HTTP header element can be sent with the response. This header element is not directly visible to users, as it is not part of the content itself.
The X-Robots-Tag is included with the other HTTP headers. You can see these by checking the HTTP headers, for instance using “curl”:
$ curl -I "http://www.google.com/support/forum/p/Webmasters/search?hl=en&q=test"
HTTP/1.1 200 OK
X-Robots-Tag: noindex
Content-Type: text/html; charset=UTF-8
(...)
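The same check can be done on captured header text. The sketch below is an assumption-laden illustration: the function name x_robots_directives is hypothetical, it works on text such as the curl output above rather than making a request, and it handles only plain comma-separated directives, not values prefixed with a crawler name.

```python
from email.parser import Parser


def x_robots_directives(raw_response_head: str) -> set:
    """Extract X-Robots-Tag directives from a raw HTTP response head.

    Hypothetical helper: the first line (the status line) is skipped and
    the remaining lines are parsed as headers.
    """
    _status_line, _, header_block = raw_response_head.partition("\n")
    headers = Parser().parsestr(header_block, headersonly=True)
    values = headers.get_all("X-Robots-Tag") or []
    return {
        part.strip().lower()
        for value in values
        for part in value.split(",")
        if part.strip()
    }


if __name__ == "__main__":
    head = (
        "HTTP/1.1 200 OK\n"
        "X-Robots-Tag: noindex\n"
        "Content-Type: text/html; charset=UTF-8\n"
    )
    print(x_robots_directives(head))
```

Header names are matched case-insensitively, so x-robots-tag and X-Robots-Tag are treated alike.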
Most sites won’t need to set up restrictions for crawling, indexing or serving, so getting started is simple: you don’t have to do anything. There’s no need to modify your pages if you would like to have them indexed. There’s no need to create a robots.txt file if all URLs may be crawled by search engines.
Reference: Code samples linked to developer.google.com webmasters.