In this article, we will learn how to configure sitemaps in AEM CaaS, step by step.
What is a Sitemap
According to Google's developer documentation:
A sitemap is a file where you provide information about the pages, videos, and other files on your site, and the relationships between them. Search engines like Google read this file to crawl your site more efficiently. A sitemap tells search engines which pages and files you think are important in your site, and also provides valuable information about these files. For example, when the page was last updated and any alternate language versions of the page.
A sitemap helps search engines identify the list of URLs eligible for crawling. Search engines crawl these pages to index them, so that other people can find your content through search.
Search engines conventionally look for sitemap.xml at the site root path before crawling the site (e.g. https://blog.bagwanpankaj.com/sitemap.xml).
Below is an example sitemap snippet that lists each page's URL, last modified date, change frequency, and priority.
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://blog.bagwanpankaj.com/aem/aem-caas-how-to-configure-robots-txt</loc>
<lastmod>2025-02-16T16:18:36+05:30</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://blog.bagwanpankaj.com/rust/introduction-to-webassembly-using-rust</loc>
<lastmod>2025-02-16T16:18:36+05:30</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://blog.bagwanpankaj.com/rust/best-top-6-rust-framework-to-watch-out-in-2023</loc>
<lastmod>2025-02-16T16:18:36+05:30</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://blog.bagwanpankaj.com/architecture/12-design-principles-you-can-implement-in-rust</loc>
<lastmod>2025-02-16T16:18:36+05:30</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://blog.bagwanpankaj.com/javascript/bun-sh-an-introduction-another-js-runtime</loc>
<lastmod>2025-02-16T16:18:36+05:30</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>
</urlset>
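To see how a consumer might read such a file, here is a minimal Python sketch that parses a sitemap and lists each URL with its last modified date. The sample XML is a shortened copy of the snippet above, and the function name parse_sitemap is an illustrative assumption, not part of any AEM or Google API.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol (taken from the snippet above).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Shortened copy of the example sitemap, with a single <url> entry.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://blog.bagwanpankaj.com/aem/aem-caas-how-to-configure-robots-txt</loc>
    <lastmod>2025-02-16T16:18:36+05:30</lastmod>
  </url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return a list of (loc, lastmod) tuples; lastmod is None if absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries


if __name__ == "__main__":
    for loc, lastmod in parse_sitemap(SAMPLE):
        print(loc, lastmod)
```

Every element in a sitemap lives in the sitemaps.org namespace, so the lookups need the namespace prefix; a plain `findall("url")` would return nothing.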
As per the AEM documentation, there are different ways to generate a sitemap on author and on publish.
Author Sitemap Implementation
Setting allOnDemand to true in the OSGi configuration below allows us to generate a sitemap on request. Place the configuration as part of the author configuration (config.author):
org.apache.sling.sitemap.impl.SitemapGeneratorManagerImpl~practice.cfg.json
{
"allOnDemand": true
}
Note: The allOnDemand configuration has a drawback: it processes the data and regenerates the sitemap every time the sitemap URL is requested.
Common Configuration for both author and publish
Below is the common configuration for both author and publish, which allows us to include the last modified date in the generated XML.
Place the configuration below in the common config so that it applies to both author and publish:
com.adobe.aem.wcm.seo.impl.sitemap.PageTreeSitemapGeneratorImpl.cfg.json
{
"enableLastModified": true,
"lastModifiedSource": "cq:lastModified",
"enableLanguageAlternates": false
}
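As a rough illustration of what these two settings control, the Python sketch below builds a small urlset and only emits lastmod when the flag is on and the source property exists. It does not reproduce Adobe's generator; the page dictionary shape, the example.com URL, and the function name build_urlset are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(pages, enable_last_modified=True,
                 last_modified_source="cq:lastModified"):
    """Build a <urlset> string from a list of page dicts (hypothetical shape)."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        # Only emit <lastmod> when the flag is on and the property exists.
        if enable_last_modified and last_modified_source in page:
            ET.SubElement(url, "lastmod").text = page[last_modified_source]
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical page data; the key mirrors the lastModifiedSource value above.
PAGES = [
    {"loc": "https://example.com/content/practice/en",
     "cq:lastModified": "2025-02-16T16:18:36+05:30"},
]

if __name__ == "__main__":
    print(build_urlset(PAGES))
    print(build_urlset(PAGES, enable_last_modified=False))
```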
Page Properties Update
To generate a sitemap, it is mandatory to enable the sitemap option in the root page's properties, under the Advanced tab.
Exclude Page and apply some other properties
We can apply the noindex property as part of the robots tags if we don't want a page to be indexed.
We can also apply other robots tags specific to that page, such as follow, nofollow, noarchive, etc.
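The effect of noindex on the sitemap can be sketched as a simple filter. This assumes, as the heading suggests, that pages tagged noindex are left out of the generated sitemap; the robotsTags key and the page paths are hypothetical names used only for this example.

```python
def is_indexable(page):
    """A page is treated as indexable unless its robots tags include 'noindex'."""
    robots_tags = {tag.strip().lower() for tag in page.get("robotsTags", [])}
    return "noindex" not in robots_tags


def sitemap_candidates(pages):
    """Return paths of pages that would remain in the sitemap under this rule."""
    return [page["path"] for page in pages if is_indexable(page)]


# Hypothetical pages with their robots tags.
PAGES = [
    {"path": "/content/practice/en", "robotsTags": []},
    {"path": "/content/practice/en/internal", "robotsTags": ["noindex", "nofollow"]},
    {"path": "/content/practice/en/blog", "robotsTags": ["noarchive"]},
]

if __name__ == "__main__":
    print(sitemap_candidates(PAGES))
```

Tags such as nofollow or noarchive alone do not remove a page in this sketch; only noindex does.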
Generate Sitemap on author
Hit the URL below to generate a sitemap on author:
http://localhost:4502/content/practice/en.sitemap.xml
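The sitemap URL is just the page path followed by a selector and an extension. A small Python sketch of that convention follows; the helper name sitemap_url is hypothetical, and the host and the page path /content/practice/en are taken from this article's local setup.

```python
def sitemap_url(host, page_path, selector="sitemap", extension="xml"):
    """Join host, page path, selector and extension into a Sling-style URL."""
    return f"{host.rstrip('/')}/{page_path.strip('/')}.{selector}.{extension}"


if __name__ == "__main__":
    # Author instance, default "sitemap" selector.
    print(sitemap_url("http://localhost:4502", "/content/practice/en"))
    # Same page with the "sitemap-index" selector.
    print(sitemap_url("http://localhost:4502", "/content/practice/en",
                      selector="sitemap-index"))
```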
Publish Sitemap Implementation
The allOnDemand option above should not be used on publish, because it processes the content and regenerates the sitemap on every request.
- Create the configuration below as part of the publish config. It runs a scheduler at the given time or interval to generate the sitemap for the configured search path.
org.apache.sling.sitemap.impl.SitemapScheduler~practice.cfg.json
{
"scheduler.name": "Practice Daily Sitemap Scheduler",
"scheduler.expression": "0 0 2 1/1 * ? *",
"searchPath": "/content/practice/us"
}
Note: Each time the scheduler runs, it generates the sitemap under the /var/sitemaps folder, mirroring the content hierarchy, for example /var/sitemaps/content/practice/us/sitemap.xml.
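The scheduler.expression value reads as a Quartz-style cron expression, where each position has a fixed meaning. The Python sketch below only labels the fields so the schedule is easier to read; it does not evaluate the expression, and the helper name describe_quartz is an assumption for this example.

```python
# Field order of a Quartz cron expression; the trailing year is optional.
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day-of-month",
                 "month", "day-of-week", "year"]


def describe_quartz(expression):
    """Map each field of a 6- or 7-field Quartz expression to its name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("expected 6 or 7 space-separated fields")
    return dict(zip(QUARTZ_FIELDS, parts))


if __name__ == "__main__":
    for name, value in describe_quartz("0 0 2 1/1 * ? *").items():
        print(f"{name:>13}: {value}")
```

Read this way, the expression fires at second 0, minute 0, hour 2, with a day-of-month of 1/1 (every day, starting from day 1), which matches the "Daily" in the scheduler name.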
- Generating a sitemap also requires the publish configuration below, which defines the extension, resource types, and selectors for which the sitemap is served.
org.apache.sling.sitemap.impl.SitemapServlet~practice.cfg.json
{
"sling.servlet.extensions": "xml",
"sling.servlet.resourceTypes": [
"practice/components/structure/homepage",
"practice/components/structure/profile",
"practice/components/ea/structure/search"
],
"sling.servlet.selectors": [
"sitemap",
"sitemap-index"
]
}
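To make the role of the selectors and extension concrete, here is a simplified Python sketch that splits a request path and checks it against the values configured above. It is not how Sling resolves requests internally and it ignores the resource type check entirely; the function names are hypothetical.

```python
# Extension and selectors copied from the configuration above.
CONFIG = {
    "sling.servlet.extensions": "xml",
    "sling.servlet.selectors": ["sitemap", "sitemap-index"],
}


def split_request(path):
    """Split '/a/b.sel1.sel2.ext' into (resource_path, selectors, extension)."""
    head, _, last = path.rpartition("/")
    name, *rest = last.split(".")
    if not rest:
        # No dot in the last segment: no selectors and no extension.
        return path, [], None
    return f"{head}/{name}", rest[:-1], rest[-1]


def matches_sitemap_servlet(path, config=CONFIG):
    """True when the extension and at least one selector match the config."""
    _, selectors, extension = split_request(path)
    return (extension == config["sling.servlet.extensions"]
            and any(s in config["sling.servlet.selectors"] for s in selectors))


if __name__ == "__main__":
    for candidate in ("/content/practice/en.sitemap.xml",
                      "/content/practice/en.sitemap-index.xml",
                      "/content/practice/en.html"):
        print(candidate, matches_sitemap_servlet(candidate))
```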
Generate Sitemap on publish
Enable the sitemap under the Advanced tab of the page properties and publish the page.
Hit the URL below to generate a sitemap on publish:
http://localhost:4503/content/practice/en.sitemap.xml
Dispatcher Update
The dispatcher needs a few small updates so that the sitemap can be accessed and rendered:
- Add the entry below to the dispatcher/src/conf.dispatcher.d/filters/filters.any file.
/0200 { /type "allow" /path "/content/*" /selectors '(sitemap-index|sitemap)' /extension "xml" }
- Allow the .xml extension in the rewrite rules in the dispatcher/src/conf.d/rewrites/rewrite.rules file.
RewriteCond %{REQUEST_URI} (.html|.jpe?g|.png|.svg|.xml)$
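A quick way to sanity-check the rewrite condition is to try the same pattern against a few request URIs. The Python sketch below does that with the re module, on the assumption that the pattern behaves the same way there as in Apache for these simple cases; the function name and the sample URIs are made up for illustration.

```python
import re

# Pattern copied from the RewriteCond above; the dots are unescaped,
# so each one matches any single character, not only a literal dot.
PATTERN = re.compile(r"(.html|.jpe?g|.png|.svg|.xml)$")


def passes_rewrite_cond(request_uri):
    """True when the URI ends with one of the listed extensions."""
    return PATTERN.search(request_uri) is not None


if __name__ == "__main__":
    for uri in ("/content/practice/en.sitemap.xml",
                "/content/practice/en.html",
                "/content/practice/en.json"):
        print(uri, passes_rewrite_cond(uri))
```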
About The Author
I am Pankaj Bagwan, a System Design Architect. A computer scientist at heart, process enthusiast, and open source author/contributor/writer. I advocate Karma and love working with cutting-edge, fascinating, open source technologies.
To consult Pankaj Bagwan on System Design, Cyber Security and Application Development, SEO and SMO, please reach out at me[at]bagwanpankaj[dot]com
Stay tuned <3. Signing off for RAAM