Zendesk federated search enables your end users to see content in your help center search results that lives externally from your help center, such as external knowledge bases, learning management software, blogs, and pages of your website. You can implement federated search using either the Zendesk Federated Search API or the web crawler.
The web crawler is available in the search settings of your help center and lets you implement federated search without developer resources (see Setting up the web crawler). Use this article to troubleshoot the crawler setup, record, and robots.txt errors that you may encounter while setting up and running the web crawler.
This article contains the following topics:
- Crawler setup errors
- Record errors
- Robots.txt errors
Crawler setup errors
Crawler setup errors are generated when the web crawler cannot run successfully due to errors in domain ownership verification or sitemap processing. Crawler setup errors generate an email notification that is sent to the crawler owner configured during web crawler setup.
Domain ownership could not be verified
The web crawler attempts to verify domain ownership each time it runs, which can take up to 24 hours. If the domain verification fails, the crawler owner is notified by email and the Crawlers page will show a Crawl status of "Domain verification failed."
To troubleshoot domain verification errors, verify the following:
- The homepage of your website (otherwise known as the index or root page) is up and publicly available. The page should not have any user login, password, IP restrictions, or other authentication requirements.
- You have confirmed your domain ownership. Click the options menu icon on the Crawlers page, then select Edit to return to the crawler setup, where you can confirm your domain ownership. See Setting up federated search in your help center using a web crawler or the API.
Sitemap could not be processed
The web crawler uses the sitemap defined at crawler setup each time it runs. If the sitemap cannot be processed, the crawler owner receives an email notification and the crawler will not run. If this happens, verify the following:
- The web crawler is pointing to the correct sitemap URL and can locate it successfully. You can edit the crawler to view the current sitemap URL. See Managing web crawlers.
- The sitemap is served and publicly available. The page should not be restricted by any user login, password, IP restrictions, or other authentication.
- The sitemap is an XML URL sitemap that follows the Sitemaps XML protocol (a minimal example follows this list).
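For reference, here is a minimal sketch of an XML URL sitemap that follows the Sitemaps XML protocol. The domain and page URLs are placeholders; substitute the publicly accessible pages you want the crawler to index.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/docs/getting-started</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/docs/installation</loc>
  </url>
</urlset>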
Record errors
Record errors occur when there are no setup errors, but the web crawler cannot successfully scrape and index the pages defined in the crawler sitemap (see Setting up the web crawler). When a record error occurs, an email notification is sent to the crawler owner with a link to a CSV file that lists the affected pages and their associated errors.
Locale not detected
The error "Locale not detected" indicates that the web crawler could not detect any locale or the detected locale does not match any current help center locales.
To determine the locale of a record, the crawler tries the following approaches in order. The first successful strategy determines the locale of the record.
- Extract the locale from the lang attribute in the <html> tag.
- Extract the locale from the Content-Language header.
- Extract the locale from the <meta> tag.
- Perform textual analysis of the content (CLD - Compact Language Detection).
The "Locale not detected" error results from of one of the following issues:
- The locale or language identified does not match a locale or language configured in any help center in your account. To see which languages are configured in each help center in your account, see Configuring your help center to support multiple languages. Find the locale codes for your configured languages in Zendesk language support by product.
- The web crawler could not determine a locale or language.
To resolve this issue, verify the following (a combined example follows this list):
- The lang attribute in the <html> tag matches a locale from the help center.
- The HTTP Content-Language header matches a locale from the help center.
- The <meta> element with Content-Language set in the http-equiv attribute matches a locale from the help center.
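For example, a page that declares its locale in its markup might look like the following sketch, assuming en-us is a locale configured in your help center (the page content is a placeholder). In addition, your web server can send a matching Content-Language: en-us HTTP response header.

<!DOCTYPE html>
<html lang="en-us">
  <head>
    <!-- Both declarations should match a locale configured in your help center -->
    <meta http-equiv="Content-Language" content="en-us">
    <title>Getting started</title>
  </head>
  <body>
    <p>Welcome to the product documentation.</p>
  </body>
</html>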
Title not detected
The error "Title not detected" indicates that the web crawler could not detect the title of a record. The web crawler uses the following approaches to determine the title of a record:
- Extract the content of the <title> tag.
- Extract the content of the <h1> tag.
- Extract the textual content from the <body> tag.
The first successful strategy determines the title of the record. The crawler indexes only the first 255 characters of the extracted content. If none of these strategies yields any content, the record is not indexed.
To resolve this issue, make sure the affected page has one of the tags listed above.
Body not found
The error "Body not found" indicates that the web crawler could not detect the body of a page. To resolve this error, make sure the affected page is properly marked with the <body> tag.
HTTP [status code]
If the error code field in the CSV for a record contains HTTP and a status code, the page could not be indexed because it could not be accessed. If the page was successfully indexed (HTTP 2xx), you will not receive an HTTP status code error.
The most common error codes are:
- 404 - Page not found - The page either does not exist or was moved to another URL. To resolve this issue, make sure the sitemap that the crawler is using is current and that all URLs in the sitemap point to existing pages.
- 403 - Forbidden - The crawler is restricted from accessing the page due to an access control mechanism, such as a login or IP address restriction. To resolve this issue, verify the following:
- You have added Zendesk/External-Content, the web crawler user agent, to your allowlist.
- The pages you want to index are publicly accessible, as the crawler cannot crawl pages with restricted access. If the pages you want to crawl and index cannot be made publicly accessible, consider indexing them using the Federated Search (External Content) API instead. See Setting up the Zendesk Federated Search API.
- 5xx - Server error - The page could not be crawled due to a server error. The site may be temporarily unavailable. To resolve this issue, visit one or more of the pages with this error to make sure the site is up. If the site is down, contact the site administrator. When the error is fixed, wait for the crawler to run again within its regular cadence (every 12-24 hours).
Invalid URL domain
The error "Invalid URL domain" indicates the URL of the page listed in the sitemap is not on the domain you configured during crawler setup.
To resolve this issue, verify that the page that triggered the error is hosted on the same domain as the one defined for your web crawler. If the page linked in your sitemap is hosted on a different domain from the one configured during crawler setup, you can do one of the following (see the example after this list):
- Set up a new web crawler for the affected page.
- Move the page from the external domain to the domain configured for the web crawler.
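For example, if your web crawler is configured for the domain www.example.com, a sitemap entry like the following (the URLs are placeholders) triggers this error because the page is hosted on a different domain:

<url>
  <loc>https://docs.another-domain.com/guide/setup</loc>
</url>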
Undetermined
The error "Undetermined" may be caused by one or more of the following:
- You have exceeded the external records limit for your instance - If you've exceeded the external records limit, the latest external records in excess of the limit will not be indexed or updated. To resolve this issue, you can do one or more of the following:
- Delete some of your crawlers. This deletes the external records of those crawlers' pages from your instance, so the pages that were previously not indexed because of the limit can then be indexed. See Managing web crawlers.
- Delete individual records via the Federated Search API. However, if the crawler indexing this page is still active or if a custom API integration that adds this page is active, the page will reappear next time the crawler runs or the integration syncs.
- Remove pages that one or more crawlers are using from the sitemap. The next time the crawler runs it will re-index the remaining pages and delete the ones removed from the sitemap.
- Point one or more crawlers to a sitemap with fewer pages. The next time the crawler runs it will re-index the remaining pages and delete the ones removed from the sitemap.
- The page is using JavaScript location redirects - The web crawler does not observe JavaScript location redirects. If the page uses JavaScript location redirects, the crawler cannot reach the content of the page.
To resolve this issue, do one of the following:
- Make sure the sitemap points directly to the page you want to index.
- Implement HTTP redirects, as shown in the sketch below.
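For example, rather than redirecting with JavaScript in the page itself, your server can return an HTTP redirect response like the following sketch (the URL is a placeholder), which points the crawler to the page you want indexed:

HTTP/1.1 301 Moved Permanently
Location: https://www.example.com/docs/new-page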
Robots.txt errors
A robots.txt file tells the crawler which parts of a website it's allowed to access. Its primary purpose is to prevent overloading a website with excessive crawl requests.
Rather than being a configuration step, robots.txt acts as a set of guidelines that informs the crawler whether it can crawl the entire site or specific sections. The only time customers need to engage with robots.txt is when the crawler is blocked or the robots.txt file is invalid. In these cases, the system will generate one of the following errors that must be addressed before the site can be successfully crawled and synced.
Crawl blocked by website
This error occurs when the robots.txt file is configured to prevent all user agents, including the crawler, from accessing the site.
To give the Zendesk crawler permission to access the site while optionally blocking other crawlers, you can add an override rule to the robots.txt file that allows the Zendesk/External-Content user agent. For example:
User-agent: Zendesk/External-Content
Allow: /
User-agent: Googlebot
Disallow: /
Invalid robots.txt file
This error occurs when the robots.txt file exists but contains syntax errors or invalid rules, making it unreadable by crawlers and causing them to ignore or cancel the crawl.
To resolve this issue, review and correct your robots.txt file to ensure it adheres to the proper syntax and accurately specifies crawler permissions. Use online tools, such as Google’s Robots Testing Tool, to validate your robots.txt file.