Recently, while setting up robots.txt on AEM as a Cloud Service, our team encountered several issues that required extensive research to resolve, as AEM does not provide robots.txt out of the box. Since there was little to no information available on the topic, we decided to put it all in one place. Let's dive in.
What is Robots.txt
Wikipedia says: “robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit. The “robots.txt” file can be used in conjunction with sitemaps, another robot inclusion standard for websites.”
AEM CaaS and Robots.txt
AEM CaaS does not have an out-of-the-box setup for robots.txt and, in my opinion, it does not need one. There are several approaches to implementing robots.txt in AEM CaaS. Some of them are listed below:
1.) Adding a hardcoded robots.txt via DAM
2.) Serving it via a path servlet
3.) Keeping it as a page
We opted for the first option, "Adding a hardcoded robots.txt via DAM": since robots.txt would not change often, it lets a Content Manager (SEO) simply upload a new robots.txt to override the existing one.
In this article, we are going to explore the first option only; options 2 and 3 will be covered in future posts.
Robots.txt via AEM DAM
Here is a step-by-step guide to implementing robots.txt via AEM DAM. Follow these steps to get up and running.
1.) Upload a Robots.txt file to DAM
First, we are going to create a basic robots.txt file for our example with the following content:
User-agent: *
Allow: /
This robots.txt allows all user agents to crawl every page of our website from the root. Next, upload this file to AEM CaaS: log into AEM, and on the landing page click the "Assets" icon, then "Files", then your project folder.
Now click the "Create" button in the top-right corner, select "Files" from the dropdown, then select robots.txt and upload it. Once uploaded, select the file to edit and fill in the mandatory properties (i.e. name).
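For reference, a slightly fuller robots.txt can also point crawlers at a sitemap and fence off sections you do not want indexed. The Disallow path and Sitemap URL below are placeholders; adjust them for your site:

```
User-agent: *
Allow: /
Disallow: /search/
Sitemap: https://www.example.com/sitemap.xml
```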
2.) Create a filter rule for dispatcher
Now that robots.txt is in place and accessible, we need to create a dispatcher filter rule to allow access to the robots.txt file. To do this, navigate to dispatcher/src/conf.dispatcher.d/filters and look for filters.any (create one if it does not exist). Once done, add the following line to filters.any:
/0111 { /type "allow" /extension "txt" /path "/content/dam/<projectfolder>/*" }
Note: Replace <projectfolder> with your project name and 0111 with the next available rule number.
Make sure the uploaded robots.txt is accessible via /content/dam/<projectfolder>/robots.txt.
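For context, dispatcher filters are evaluated top to bottom, so this allow rule must come after the broad deny rules. A minimal filters.any sketch (the rule numbers are illustrative) could look like this:

```
# Deny everything by default
/0001 { /type "deny" /url "*" }
# ... project-specific allow rules ...
# Allow robots.txt stored under the project's DAM folder
/0111 { /type "allow" /extension "txt" /path "/content/dam/<projectfolder>/*" }
```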
3.) Create a dispatcher rewrite rule to serve /robots.txt
Google and other search engines look for robots.txt at the root level, but ours is accessible at /content/dam/<projectfolder>/robots.txt, which does not serve that purpose. To make it available at /robots.txt, we need to add a rewrite rule to the dispatcher so that whenever /robots.txt is requested, the content of /content/dam/<projectfolder>/robots.txt is served. To do this, add the following rewrite rule to dispatcher/src/conf.d/rewrites/rewrite.rules:
# Rewrite for robots.txt
RewriteRule ^/robots\.txt$ /content/dam/<projectfolder>/robots.txt [PT,L]
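Before deploying, the rewrite pattern can be sanity-checked outside Apache. This minimal Python sketch mimics what mod_rewrite does with the rule above; "wknd" stands in for <projectfolder> and is purely illustrative. Note the escaped dot, so the pattern matches only a literal period:

```python
import re

# Mirror of: RewriteRule ^/robots\.txt$ /content/dam/wknd/robots.txt [PT,L]
pattern = re.compile(r"^/robots\.txt$")
target = "/content/dam/wknd/robots.txt"

def rewrite(path: str) -> str:
    """Return the internally rewritten path, or the original if no match."""
    return target if pattern.match(path) else path

print(rewrite("/robots.txt"))   # rewritten to the DAM path
print(rewrite("/robotsxtxt"))   # unrelated path passes through unchanged
```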
Once done and deployed, try accessing /robots.txt. Does it display inline, or download? If the browser downloads robots.txt instead of displaying it, we need to fix the Content-Disposition header.
4.) Add Content-Disposition header
To add the Content-Disposition header, open dispatcher/src/conf.d/includes/<projectfolder>/commonconfigs.conf. If you cannot locate this file, you can add the rule under the vhost files in dispatcher/src/conf.d/available_vhosts instead. Add the following lines:
<LocationMatch "^\/content\/dam.*\.(txt).*">
Header unset "Content-Disposition"
Header set Content-Disposition inline
</LocationMatch>
This rule, when added to the vhost file, will unset the existing Content-Disposition header and set it to inline, ensuring that the robots.txt file is displayed in-browser rather than being downloaded.
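To see which requests the LocationMatch pattern actually catches, the same regex can be exercised in Python; the "wknd" folder name below is a placeholder, and the pattern is copied from the block above (Apache does not require the backslashes before "/"):

```python
import re

# Mirror of the LocationMatch pattern: "^\/content\/dam.*\.(txt).*"
location = re.compile(r"^/content/dam.*\.(txt).*")

for path in [
    "/content/dam/wknd/robots.txt",  # matches: header forced to inline
    "/content/dam/wknd/report.pdf",  # no match: other file types untouched
    "/content/wknd/en.html",         # no match: outside /content/dam
]:
    print(path, "->", bool(location.search(path)))
```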
Conclusion
We have finally reached the end of this article. We hope you found it useful and learnt something new today. There's lots of exciting work going on in the AEM space, so stay tuned. See you later!
About The Author
I am Pankaj Baagwan, a System Design Architect: a computer scientist at heart, a process enthusiast, and an open source author, contributor, and writer. I advocate karma and love working with cutting-edge, fascinating, open source technologies.
To consult Pankaj Bagwan on System Design, Cyber Security and Application Development, SEO and SMO, please reach out at me[at]bagwanpankaj[dot]com
For promotion/advertisement of your services and products on this blog, please reach out at me[at]bagwanpankaj[dot]com
Stay tuned <3. Signing off for RAAM