Zendesk Federated Search enables your end users to see content in your help center search results that lives externally from your help center, such as external knowledge bases, learning management software, blogs, and pages of your website. You can implement Federated search using either the Zendesk Federated Search API or the search crawler.
The search crawler is available in the search settings of your help center and lets you implement Federated search in your help center without developer resources (see Setting up the search crawler). You can use this article to troubleshoot crawler setup and page errors that you may encounter while setting up the search crawler in your application.
Crawler setup errors
Crawler setup errors are generated when the search crawler cannot run successfully due to errors in domain ownership verification or sitemap processing. Crawler setup errors generate an email notification that is sent to the crawler owner configured during search crawler setup.
Domain ownership could not be verified
The search crawler attempts to verify domain ownership each time it runs, which can take up to 24 hours. Although the crawler owner is notified by email if the domain verification fails, you can test verification instantly on the edit search crawler page. See Managing search crawlers.
To troubleshoot domain verification errors, verify the following:
- The homepage of your website (otherwise known as the index or root page) is up and publicly available. The page should not have any user login, password, IP restrictions, or other authentication requirements.
- You've implemented the correct tag from your crawler. It is free of typos and is implemented in the <head> section of the homepage of the website you want to crawl. The domain verification tag should always be placed on the homepage of the site, even if your crawler is configured to crawl a subset of pages. You can edit the crawler to view the current domain verification information. See Managing search crawlers.
Note: You can have several verification tags for different crawlers on the same domain.
<head>
<meta name="zd-site-verification" content="crawler-verification-token">
<!-- style info here -->
</head>
<!-- body of the page here -->
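Before the crawler's next run, you can confirm locally that the verification tag sits inside the <head> section of your homepage. The following is a minimal sketch using only the Python standard library. The names VerificationTagFinder and find_verification_tokens and the sample HTML are illustrative assumptions, not part of any Zendesk tooling.

```python
from html.parser import HTMLParser


class VerificationTagFinder(HTMLParser):
    """Collects zd-site-verification tokens that appear inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = dict(attrs)
            if attr.get("name") == "zd-site-verification" and attr.get("content"):
                self.tokens.append(attr["content"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_verification_tokens(html_text):
    """Return every verification token found in the <head> of the page source."""
    finder = VerificationTagFinder()
    finder.feed(html_text)
    return finder.tokens


# Hypothetical homepage source: two crawlers verified in <head>, one tag misplaced in <body>.
sample = """<html><head>
<meta name="zd-site-verification" content="crawler-verification-token">
<meta name="zd-site-verification" content="second-crawler-token">
</head><body><meta name="zd-site-verification" content="misplaced"></body></html>"""

print(find_verification_tokens(sample))
# ['crawler-verification-token', 'second-crawler-token']
```

Compare the returned tokens with the token shown on the edit search crawler page. A tag placed outside <head>, like the misplaced one in the sample, is ignored by this check.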
Sitemap could not be processed
The search crawler uses the sitemap defined at crawler setup each time it runs. If the sitemap cannot be processed, the crawler owner receives an email notification and the crawler will not run. If this happens, verify the following:
- The search crawler is pointing to the correct sitemap URL and can locate it successfully. You can edit the crawler to view the current sitemap URL. See Managing search crawlers.
- The sitemap is served and publicly available. The page should not be restricted by any user login, password, IP restrictions, or other authentication.
- The sitemap is an XML URL sitemap that follows the Sitemaps XML protocol.
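The last check in the list can be approximated with a short script that parses the sitemap and confirms it is a <urlset> document in the Sitemaps namespace. This is a sketch under stated assumptions: the function name sitemap_urls is hypothetical, and because this article only mentions XML URL sitemaps, the sketch rejects any other root element, including sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps XML protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_urls(xml_text):
    """Return the <loc> values of a URL sitemap, or raise ValueError."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        raise ValueError(f"unexpected root element {root.tag!r}")
    locs = [el.text.strip() for el in root.iter(f"{{{SITEMAP_NS}}}loc") if el.text]
    if not locs:
        raise ValueError("sitemap contains no <loc> entries")
    return locs


# Hypothetical sitemap content used only for illustration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/docs/setup</loc></url>
</urlset>"""

print(sitemap_urls(sample))
# ['https://www.example.com/', 'https://www.example.com/docs/setup']
```

A ValueError from this function points to a sitemap that is malformed, empty, or not a URL sitemap. It does not test whether the sitemap is publicly reachable.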
Record errors
Record errors occur when there are no setup errors, but the search crawler cannot successfully scrape and index the pages defined in the crawler sitemap (see Setting up the search crawler). When a record error occurs, an email notification is sent to the crawler owner with a link to a CSV file that lists the affected pages and their associated errors.
Locale not detected
The error "Locale not detected" indicates that the search crawler could not detect any locale or the detected locale does not match any current help center locales.
To determine the locale of a record, the crawler tries the following approaches. The first successful strategy determines the locale of the records.
- Extract the locale from the lang attribute in the <html> tag
- Extract the locale from the Content-Language header
- Extract the locale from the <meta> tag
- Perform textual analysis of the content (CLD - Compact Language Detection)
The "Locale not detected" error results from one of the following issues:
- The locale or language identified does not match a locale or language configured in any help center in your account. To see which languages are configured in each help center in your account, see Configuring your help center to support multiple languages. Find the locale codes for your configured languages in Zendesk language support by product.
- The search crawler could not determine a locale or language.
To resolve this issue, verify the following:
- The lang attribute in the html tag matches a locale from the help center.
- The HTTP Content-Language header matches a locale from the help center.
- The meta element with the Content-Language set in the http-equiv attribute matches a locale from the help center.
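To see which locale hint a page exposes first, you can mimic the first three detection approaches described in this section. The sketch below is an approximation, not the crawler's actual implementation. It skips the textual analysis step, the names LocaleHints and detect_locale are hypothetical, and it assumes a locale matches only when it equals a help center locale exactly (ignoring case).

```python
from html.parser import HTMLParser


class LocaleHints(HTMLParser):
    """Records the <html lang> value and any Content-Language <meta> value."""

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self.meta_lang = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "html" and self.html_lang is None:
            self.html_lang = attr.get("lang")
        elif tag == "meta" and (attr.get("http-equiv") or "").lower() == "content-language":
            if self.meta_lang is None:
                self.meta_lang = attr.get("content")


def detect_locale(html_text, headers, help_center_locales):
    """Return (locale, source, matches) for the first locale hint that is present."""
    hints = LocaleHints()
    hints.feed(html_text)
    candidates = [
        ("html lang attribute", hints.html_lang),
        ("Content-Language header", headers.get("Content-Language")),
        ("meta http-equiv tag", hints.meta_lang),
    ]
    allowed = {loc.lower() for loc in help_center_locales}
    for source, value in candidates:
        if value:
            locale = value.strip()
            return locale, source, locale.lower() in allowed
    return None, None, False


# Hypothetical page: the lang attribute says "en", the header and meta tag say "en-US".
page = ('<html lang="en"><head><meta http-equiv="Content-Language" content="en-US">'
        "</head><body>Hi</body></html>")

print(detect_locale(page, {"Content-Language": "en-US"}, ["en-US", "de"]))
# ('en', 'html lang attribute', False)
```

In this sample the lang attribute is found first, so its value is used even though the later hints would have matched a help center locale. Under the exact-match assumption, the page would need lang="en-US" to pass.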
Title not detected
The error "Title not detected" indicates that the search crawler could not detect the title of a record. The search crawler uses the following approaches to determine the title of a record:
- Extract the content of the <title> tag
- Extract the content of the <h1> tag
- Extract the textual content from the <body> tag
The first successful strategy determines the title of the record. The crawler indexes the first 255 characters of the extracted content as the record title if one of the first two approaches is successful. If these strategies do not determine a title, the record is not indexed.
To resolve this issue, make sure the affected page has one of the tags listed above.
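The fallback order above can be sketched as follows to preview which tag would supply a page's title. This is an illustration, not the crawler's code. The names TitleParts and detect_title are hypothetical, text inside <script> and <style> is skipped by assumption, and the 255-character cut is applied to every strategy for simplicity.

```python
from html.parser import HTMLParser


class TitleParts(HTMLParser):
    """Collects the text of the first <title>, the first <h1>, and <body>."""

    def __init__(self):
        super().__init__()
        self.text = {"title": [], "h1": [], "body": []}
        self.open_tags = []    # tracked tags that are currently open
        self.finished = set()  # tracked tags that have already been closed once
        self.skip_depth = 0    # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in self.text and tag not in self.finished and tag not in self.open_tags:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.open_tags:
            self.open_tags.remove(tag)
            self.finished.add(tag)

    def handle_data(self, data):
        if self.skip_depth:
            return
        for tag in self.open_tags:
            self.text[tag].append(data)


def detect_title(html_text, limit=255):
    """Return (title, source_tag) from the first strategy that yields text."""
    parts = TitleParts()
    parts.feed(html_text)
    for tag in ("title", "h1", "body"):
        text = " ".join("".join(parts.text[tag]).split())
        if text:
            return text[:limit], tag
    return None, None


# Hypothetical page with an empty <title>, so the <h1> is used instead.
page = "<html><head><title>  </title></head><body><h1>Install the agent</h1><p>Steps</p></body></html>"

print(detect_title(page))
# ('Install the agent', 'h1')
```

A result of (None, None) corresponds to a page on which none of the three tags contains text.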
Body not found
The error "Body not found" indicates that the search crawler could not detect the body of a page. To resolve this error, make sure the affected page is properly marked with the <body> tag.
HTTP [status code]
If the error code field in the CSV for a record contains HTTP and a status code, it means that the page could not be indexed because the page could not be accessed. If the page could be successfully indexed (HTTP 2xx) you will not receive an HTTP status code error.
The most common error codes are:
- 404 - Page not found - The page either does not exist or was moved to another URL. To resolve this issue, make sure the sitemap that the crawler is using is current and that all URLs in the sitemap point to existing pages.
- 403 - Forbidden - The crawler is restricted from accessing the page due to some access control mechanism, such as it being behind a login or IP address restriction. To resolve this issue, verify the following:
- You have added Zendesk/External-Content, the search crawler user agent, to your allowlist.
- The pages you want to index are publicly accessible, as the crawler cannot crawl pages with restricted access. If the pages you want to crawl and index cannot be made publicly accessible then you should explore indexing them using the Federated Search (External Content) API. See Setting up the Zendesk Federated Search API.
- 5xx - Server error - The page could not be crawled due to a server error. The site may be temporarily unavailable. To resolve this issue, visit one or more of the pages with this error to make sure the site is up. If the site is down, contact the site administrator. When the error is fixed, wait for the crawler to run again within its regular cadence (every 12-24 hours).
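When the CSV lists many HTTP errors, a small helper can map each status code to the guidance above and, optionally, re-request a page with the crawler's user agent string. This is a sketch using the Python standard library. The function names explain_status and fetch_status are hypothetical. Sending the same user agent from your own machine does not reproduce IP-based restrictions that the crawler may encounter.

```python
import urllib.error
import urllib.request

# User agent named in this article for the search crawler.
CRAWLER_USER_AGENT = "Zendesk/External-Content"


def explain_status(code):
    """Map an HTTP status code to the troubleshooting hint from the list above."""
    if 200 <= code < 300:
        return "ok: no HTTP error is reported for this page"
    if code == 404:
        return "not found: update the sitemap so every URL points to an existing page"
    if code == 403:
        return "forbidden: allowlist the crawler user agent and make the page public"
    if 500 <= code < 600:
        return "server error: confirm the site is up, then wait for the next crawl"
    return "other: inspect the response manually"


def fetch_status(url, timeout=10):
    """Request the URL with the crawler's user agent string and return the status code."""
    request = urllib.request.Request(url, headers={"User-Agent": CRAWLER_USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx and 5xx responses; the code is still available.
        return error.code


print(explain_status(403))
# forbidden: allowlist the crawler user agent and make the page public
```

To test a live page, call explain_status(fetch_status(url)) with a URL from the error CSV. Note that urlopen follows HTTP redirects, so the code returned is that of the final response.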
Invalid URL domain
The error "Invalid URL domain" indicates the URL of the page listed in the sitemap is not on the domain you configured during crawler setup.
To resolve this issue, verify that the domain of the page that triggered the error is on the same domain as is defined for your search crawler. If the page linked in your sitemap is pointing to a page that is hosted on a different domain from the one configured during crawler setup, you can do one of the following:
- Set up a new search crawler for the affected page.
- Move the page from the external domain to the domain configured for the search crawler.
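To find the offending entries before the crawler runs, you can compare the host of every sitemap URL with the crawler's configured domain. The sketch below assumes that the host must match the configured domain exactly, which may be stricter or looser than the crawler's own rule. The function name off_domain_urls and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def off_domain_urls(urls, crawler_domain):
    """Return the URLs whose host differs from the crawler's configured domain."""
    expected = crawler_domain.lower()
    return [u for u in urls if (urlsplit(u).hostname or "").lower() != expected]


# Hypothetical sitemap entries; the second one is hosted on a different subdomain.
urls = [
    "https://www.example.com/docs/a",
    "https://blog.example.com/post",
    "https://www.example.com/docs/b",
]

print(off_domain_urls(urls, "www.example.com"))
# ['https://blog.example.com/post']
```

Each URL returned is a candidate for its own crawler or for a move to the configured domain.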
Undetermined
The error "Undetermined" may be caused by one or more of the following:
You have exceeded the external records limit for your instance - The search crawler has a limit of 50,000 external records. If you've exceeded this limit, the latest external records in excess of the limit will not be indexed or updated. To view the number of external records your crawler has used, review the search crawler information. See Managing search crawlers.
To resolve this issue, do one or more of the following:
- Delete some of your crawlers. Deleting a crawler deletes the external records for its pages from your instance, so pages that were previously skipped because of the limit can then be indexed. See Managing search crawlers.
- Delete individual records via the Federated Search API. However, if the crawler indexing this page is still active or if a custom API integration that adds this page is active, the page will reappear next time the crawler runs or the integration syncs.
- Remove pages that one or more crawlers are using from the sitemap. The next time the crawler runs it will re-index the remaining pages and delete the ones removed from the sitemap.
- Point one or more crawlers to a sitemap with fewer pages. The next time the crawler runs it will re-index the remaining pages and delete the ones removed from the sitemap.
The page is using JavaScript location redirects - The search crawler does not observe JavaScript location redirects. If the page uses JavaScript location redirects, the crawler cannot reach the content of the page.
To resolve this issue, do one of the following:
- Make sure the sitemap points directly to the page you want to index.
- Implement HTTP redirects.
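To spot pages that may rely on JavaScript location redirects, you can scan the page source for common redirect patterns. The following is a rough heuristic, not a JavaScript parser: it can miss redirects written in other ways and can flag ordinary text that happens to contain the same pattern. The name has_js_location_redirect is hypothetical.

```python
import re

# Common ways a script assigns or replaces the current location.
# The (?!=) lookahead avoids matching comparisons such as location.href == "...".
JS_REDIRECT = re.compile(
    r"(?:window\.|document\.|self\.|top\.)?location"
    r"(?:\.href\s*=(?!=)|\s*=(?!=)|\.(?:replace|assign)\s*\()"
)


def has_js_location_redirect(html_text):
    """Heuristically report whether the page source contains a JS location redirect."""
    return bool(JS_REDIRECT.search(html_text))


# Hypothetical page sources used only for illustration.
redirecting = '<script>window.location.href = "https://www.example.com/new";</script>'
plain = '<a href="/new">Moved here</a>'

print(has_js_location_redirect(redirecting), has_js_location_redirect(plain))
# True False
```

Pages flagged by this check are candidates for a direct sitemap entry or an HTTP redirect.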
Susan R.
What do we do if we have the Locale not Detected error, but our pages have clear headers with lang="en" in them?
Global Support: Rein
Susan R. we experienced a similar issue, and it is because "en" isn't an officially supported language code. In Zendesk, the language can be either en-US or en-GB. See Zendesk language support by product. After we changed this in our documentation from "en" to "en-US", it could index the content and did not encounter the "Locale not detected" error anymore. Hope this helps resolve your issue too!
Jon Bolden
We are experiencing an issue where the domain name is correct and set up correctly (matches the URLs being crawled) but we are still getting an Invalid url domain error. Has anyone else run into this problem?
I noticed that one of your admins has submitted a ticket similar to this concern. Please keep track of that ticket for the resolution.
Lars Schweikardt
We have set up a crawler, but the sitemap also contains URLs that lead to images. Is there a way to exclude certain pages/types from the results?
Viktor Osetrov
What do you mean - do you want to filter search results or Google search results?
Regarding excluding certain pages/types from the Google search results, you can use the following string:
<meta name="robots" content="noindex" />
I believe an alternative solution is generating and uploading sitemaps directly inside Google Search Central - using this instruction. All steps you can find here.
Hope it helps
Lars Schweikardt
@... I have a crawler which is connected to the page of our company. The sitemap.xml file also contains URLs which direct you to an image. Those images then occur in the Guide Search, but they are irrelevant and therefore I do not want to include them in the search results. We could set those to noindex, but we want them to appear in Google Search results. This is why I am asking if it is possible to filter the results of the crawler somehow.
Viktor Osetrov
Thanks for your clarifications. Zendesk currently does not provide any functionality to selectively exclude parts of the content from its search results.
The only possible solutions are:
1. Add a robots.txt file to your server:
You could add a robots.txt file to your website that tells web crawlers to ignore certain pages. The robots.txt file gives instructions to web robots about which pages on your site to crawl. You could set this up to prevent the Zendesk crawler from crawling the URLs of the images.
For example, a rule such as Disallow: /images/ disallows crawling of any URL whose path begins with "/images/". Replace "/images/" with the appropriate path according to your website's structure.
Please note that this solution would also prevent other search engines from indexing those image URLs.
2. Create Separate Sitemaps:
Another way would be to separate your sitemap.xml into two: one for Zendesk that does not include the image URLs and another for Google that includes the image URLs. This way, you can control what URLs each search engine gets to see and index.
For Zendesk, use a sitemap without the URLs directing to the images and for Google use the sitemap with the URLs to the images.
Apologies for the limitations. Hope it helps.
Alex Duffey
I keep getting "Indexing failed" errors that say "Invalid url domain" in the error report. But the link works correctly, the site map works, and the site verification works. I don't understand what is failing and this article doesn't have anything about the errors above.
Viktor Osetrov
Regarding the "Indexing failed" errors that say "Invalid url domain":
Could you please check the following points:
1. Ensure that your domain is correctly set up
2. Check 'robots.txt'. For example, "https://www.google.com/robots.txt"
3. Please make sure that you are using links in the form "http://www.example.com"
4. Check DNS settings
5. Please note that sandboxes have their own limitations
Hope it helps
Stacie Loving
Within the past week or so, we've started receiving "Body can't be blank" errors. These are for two typedoc pages where the content of the body is supplied dynamically. The crawler was previously able to index these two pages and is successfully indexing other similar pages. We haven't made any changes to these pages recently. How can I correct this issue?
Hiedi Kysther
Hey Stacie Loving
Thanks for bringing this to our attention. I see you already created a ticket with Support regarding this issue. This is the right move, as this may need further review and investigation. Please keep an eye out for our team's update on your ticket.
Thanks! And, have a great day!
Peter Rittau
Does the <meta name="zd-site-verification" content="crawler-verification-token"> always have to be first after <head> or can it be anywhere between <head> </head>?
Dainne Kiara Lucena-Laxamana
Hi Peter Rittau
It can be anywhere between <head> </head>