The web crawler lets you crawl and index external content for use in help center search and generative search, so you can implement federated search without developer resources. You can set up multiple crawlers to crawl and index different content on the same or different websites.
When users perform a search in your help center, relevant external content discovered by the crawler is ranked and presented on the search results page, where users can filter the results and click a link to view the external content in another browser tab.
About the web crawler
You can set up one or more web crawlers to crawl and index external content in the same or different websites that you want to make available to help center search and generative search. External sites that you want to crawl must have a sitemap that lists the pages for the web crawler. In addition, the pages you want to crawl must be public (non-authenticated).
Once configured, a crawler is scheduled to run every 30 minutes, visiting the pages in the sitemap you specified during setup and ingesting content from those sources into the help center search indexes. Web crawlers index content that is in the page source on the initial page load, even if it's hidden by a UI element, such as an accordion. However, because crawlers do not run JavaScript, they do not crawl content that is rendered by JavaScript or otherwise rendered dynamically after the initial page load.
Web crawlers also don't crawl links on the pages they visit; they only visit the pages in the sitemap that they are configured to use. If the crawler fails to collect information from a website during a regularly scheduled crawl (for example, if the website is down or if there are network issues), the help center will retain the results from the previous crawl, which will continue to be searchable in the help center.
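Because the crawler only sees what is in the initial page source, it can help to spot-check a page before you add it to a sitemap. The sketch below is illustrative only: the URL, the expected phrase, and the use of Python's requests library are assumptions, not part of the product.

```python
# Illustrative check: is a page publicly reachable, and is the text you want
# indexed already present in the raw HTML (not rendered later by JavaScript)?
import requests

PAGE_URL = "https://support.example.com/articles/getting-started"  # placeholder
EXPECTED_TEXT = "Getting started"                                   # placeholder

response = requests.get(PAGE_URL, timeout=10)

# Pages behind authentication typically return 401/403 or redirect to a login page.
print("Status code:", response.status_code)

# If the phrase is missing here but visible in a browser, it is probably rendered
# by JavaScript after the initial page load, so the crawler will not index it.
print("Phrase present in page source:", EXPECTED_TEXT in response.text)
```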
Setting up a web crawler
Keep the following in mind when setting up a web crawler:
- The web crawler does not work with websites that use gzip file compression encoding. You will not see search results from these sites. (One way to check how a site serves its pages is sketched after this list.)
- The web crawler does not respect a crawl-delay directive set in an external site's robots.txt file.
- The changefreq tag in a sitemap has no effect on the web crawler.
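If you suspect the gzip limitation applies to a site, one rough check is shown below. It assumes the limitation refers to responses served with a Content-Encoding: gzip header (the article does not spell this out), and the URL is a placeholder.

```python
# Rough, assumption-laden check: request a page while advertising gzip support
# and inspect how the server says it encoded the response.
import requests

SITE_URL = "https://www.example.com/"  # placeholder

resp = requests.get(SITE_URL, headers={"Accept-Encoding": "gzip"}, timeout=10)
print("Content-Encoding:", resp.headers.get("Content-Encoding", "none"))
```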
To set up the web crawler
- In Knowledge admin, click Settings in the sidebar.
- Click Search settings.
- Under Crawlers, click Manage.

- Click Add Crawler.

- In Name this crawler, enter the following:
- Name - The name that you want to assign to the crawler. This is an internal name that identifies your web crawler on the crawler management list.
- Owner - The Knowledge admin responsible for crawler maintenance and troubleshooting. By default, the crawler owner is the user creating the crawler. However, you can change this to any Knowledge admin. Crawler owners receive email notifications both when the crawler runs successfully and when errors occur, such as problems with domain verification, processing the sitemap, or crawling pages.

- In Add the website you want to crawl, configure the following:
- Website URL - Enter the URL of the website that you want to crawl.
- I confirm that I have permission to crawl this website - Read the information under this checkbox, and then select it to confirm that you have permission to crawl this website.

- In Add a sitemap, in Sitemap URL, enter the URL for the sitemap you want the crawler to use when crawling your site.
The sitemap must follow the sitemaps XML protocol and contain a list of all the pages within the site that you want to crawl. The sitemap can be the site's standard sitemap containing all of its pages, or it can be a dedicated sitemap that lists only the pages that you want to crawl. All sitemaps must be hosted on the domain that the crawler is configured to crawl. The web crawler does not support sitemap indexes. (A quick way to check a sitemap against these requirements is sketched after this step.)
You can set up multiple crawlers on the same site that each use different sitemaps defining the pages you want the web crawler to crawl.
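As a rough pre-flight check before pointing a crawler at a sitemap, you can parse the sitemap and verify the two requirements above: it is a page-level sitemap rather than a sitemap index, and every listed URL is hosted on the crawl domain. This is only a sketch under assumptions; the sitemap URL and domain are placeholders, and the standard-library XML parser and the requests library are choices for illustration, not part of the crawler setup.

```python
# Illustrative sitemap check: reject sitemap indexes and flag URLs that are not
# hosted on the domain the crawler is configured to crawl.
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

import requests

SITEMAP_URL = "https://support.example.com/sitemap.xml"  # placeholder
CRAWL_DOMAIN = "support.example.com"                     # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(requests.get(SITEMAP_URL, timeout=10).content)

# The web crawler does not support sitemap indexes, only sitemaps that list pages.
if root.tag.endswith("sitemapindex"):
    raise SystemExit("This is a sitemap index; use a sitemap that lists pages directly.")

pages = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
offsite = [url for url in pages if urlparse(url).netloc != CRAWL_DOMAIN]

print(f"Pages listed: {len(pages)}")
print(f"Pages outside {CRAWL_DOMAIN}: {len(offsite)}")
```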

- In Add filters to help people find this content, configure the source and type filters that your end users can use to filter search results. Source refers to the origin of the external content, such as a forum, issue tracker, or learning management system. Type refers to the kind of content, such as a blog post, tech note, or bug report.
- Source - Click the arrow and select a source from the list, or select + Create new source to add a name that describes where this content lives.
- Type - Click the arrow and select a type from the list, or select + Create new type to add a name that describes what kind of content this is.

- Click Finish. The web crawler is created and pending. Within 24 hours, the crawler verifies ownership of the domain and then fetches and parses the specified sitemap. Once the sitemap is processed successfully, the crawler begins to crawl the pages and index their content. If the crawler fails either during domain verification or while processing the sitemap, the crawler owner receives an email notification with troubleshooting tips to help resolve the issue, and the crawler tries again in 24 hours.
Note: Zendesk/External-Content is the user agent for the web crawler. To prevent the crawler from failing due to a firewall blocking its requests, whitelist (or allowlist) Zendesk/External-Content. (One way to confirm that the crawler's requests are reaching your server is sketched at the end of this section.)
If you're using the external content in:
- Help center search, then you need to select the content that you want to include and exclude in help center search results. See Including external content in your help center search results.
- The Knowledge section of the context panel for agents, see Configuring Knowledge in the context panel.
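The Zendesk/External-Content user agent mentioned in the note above also gives you a way to confirm that the crawler's requests are reaching your web server rather than being blocked. The sketch below is an illustration only: it assumes you can read a server access log in the common combined log format, and the log path is a placeholder.

```python
# Illustrative log check: count the HTTP status codes served to the web crawler's
# user agent. Assumes the combined log format; adjust the path and parsing for
# your own server.
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # placeholder path
USER_AGENT = "Zendesk/External-Content"

statuses = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if USER_AGENT in line:
            # In the combined log format, the status code follows the quoted
            # request line, e.g. ... "GET /page HTTP/1.1" 200 1234 ...
            try:
                statuses[line.split('"')[2].split()[0]] += 1
            except IndexError:
                continue

print("Responses served to the crawler:", dict(statuses) or "none found")
```

A run that shows only 403 responses, or no matching lines at all, suggests a firewall or access rule is blocking the crawler.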