What is the search crawler?



  • Jacob Christensen
    Community Moderator

    Sounds very interesting, Gorka Cardona-Lauridsen! 🙌

    Will searching external sources be limited to queries made from within the Guide product? Or will it also be something that could be available for agents from within the Support product?

    1
  • Patrick Morgan

    Very cool! 

    Curious how the 10,000 character body length limit works. Does content that exceeds that limit simply not appear at all in searches? Or does the crawler crawl only up to 10,000 characters of a piece of content and then stop—but the content still appears in search based on what the crawler indexed in that 10,000 characters?

    1
  • Julien SERVILLAT

    Hey, that sounds very interesting. Is that content available for Answer Bot? (This was in your roadmap for Federated Search...)

    2
  • 이사원

    Will this also be available for other areas where Zendesk recommends articles, like the web form (subject) & mobile SDK?

    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Jacob Christensen, Julien SERVILLAT and Lee, thank you for your interest in the EAP! I'll try to answer all of you at once.

    Right now users can search external content through Help Center native search and the Unified Search API. Soon it will also be possible within the Knowledge search in the context panel in Zendesk Support.

    As for other search interfaces, such as Article suggestions, Answer Bot (which powers messaging and email suggestions), Web Widget, Mobile SDK and Instant search, we are at different stages of development with all of them, but it is too early to give an ETA. We intend to eventually deliver all of them, and you will see releases in this area in 2022 if nothing unforeseen prevents us.
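
    For illustration, a query against the Unified Search API might look like the sketch below. The endpoint path and filter parameters here are assumptions based on the public API docs rather than something confirmed in this thread, so check the current API reference before relying on them:

        GET https://yoursubdomain.zendesk.com/api/v2/guide/search?query=onboarding&filter[locales]=en-us
        # Assumed path and parameters; once federated search is set up, results
        # should include both help center articles and crawled external records.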

    1
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Patrick Morgan Thank you for your question and interest in the EAP!

    As it is now, the crawler will not index the page. You will get an email with a CSV of all pages the crawler attempted to index, where you can see the error for each page that failed, so you can identify which pages are longer than 10,000 characters and split or shorten them.

    I'm curious which of the two behaviours you would prefer, and why?

    If anyone else has an opinion on the matter, I would also love to hear from you here in the comments.

    0
  • Patrick Morgan

    Gorka Cardona-Lauridsen, I suppose I'd say I'd prefer there were no length limits :-), but having some content crawled and indexed is better than none, as long as that piece of content appears in search. (And if a piece of content is 10k characters long, it's likely the important keywords and context for your search engine appear in that first 10k characters.)

    Having some but not all content findable through search defeats the purpose of federated search. If users search in our Zendesk-based Guide for a training that lives in our LMS and they don't find it because it's too long, they'll assume it just doesn't exist. I want federated search to make it so our users can know for certain what content is available to them, regardless of the platform it's hosted on.

    Here's some background on our use cases:

    Product Training Courses from a 3rd-party LMS

    In one of our use cases, we have a number of product training courses in our LMS that I'd love to be indexed for search in our Zendesk-based Guide. These courses include mostly videos, but we transcribe them for accessibility reasons. I'm sure most of our trainings exceed the 10,000 character limit.

    (Note: I suppose this depends on what records are being indexed from our LMS. If it's the Course-level descriptions, 10k characters is probably fine. If it's lessons within those courses, 10k characters is probably too few)

    Industry Content from Our Corporate Site

    Our other use case is to include some of the industry-related educational content from our corporate site in our Zendesk-based Guide. Much of this content is quite long-form, and it is created and managed by another department. They won't be splitting or shortening this content just to have it be found through federated search in our Guide. It's already been optimized to be useful for our audience and findable on public search engines.


    I'm happy to answer any other questions about our use cases~ Thanks!

    1
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Patrick Morgan Thank you for the very thorough explanation and description of your use cases, I understand why it is important.

    For now I think I have the information I need, but I'll reach out if I have follow-up questions.

    0
  • Gina Guerra

    Hi, 

    You note that there is a limitation based on whether or not the information is public. 

    Can you expand on what you mean by public? Does this mean anything that requires a password to access cannot be used? Is it everything that's set to public within a shared environment?

    For example, can we crawl information on our company's Confluence, which all of our agents have access to, but the general public does not? 

    0
  • Jordan Brown

    Will the crawler search metafields?

    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Gina Guerra good question. It means that it needs to be on a website with no restrictions, like user logins, password protection, IP restrictions or similar, because the crawler does not have the capability to bypass these barriers. With IP restrictions you could whitelist our IP address on your side and it should work, but that's a workaround you would have to implement in your system.
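
    As a rough sketch of that workaround, assuming the restricted site sits behind nginx and using 203.0.113.0/24 as a stand-in for the crawler's (unconfirmed, see below) IP range:

        # Hypothetical allowlist: keep the site closed except to the Zendesk crawler.
        # 203.0.113.0/24 is a placeholder; the real crawler IPs are not confirmed here.
        location / {
            allow 203.0.113.0/24;
            deny  all;
        }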

    I'm actually not sure which IP addresses the crawlers use, but I can figure it out if needed; here would be a place to start.

    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Jordan Brown Could you expand on which meta fields you mean?

    We do, for example, try to determine the language and locale from, among other things, the lang attribute in the <html> tag and the <meta> tag.
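
    For example, language signals of that kind look like this (the content-language meta tag shown here is one common form; the crawler's exact precedence rules aren't spelled out in this thread):

        <html lang="en-us">
          <head>
            <meta http-equiv="content-language" content="en-us">
          </head>
        </html>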

    0
  • Jordan Brown

    Gorka Cardona-Lauridsen Metafields are extra, hidden data on each object or in your shopfront that tell you more about the object itself without revealing it. These look like drop-downs or accordion-type content that is hidden unless it's expanded.

    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    We have heard that several EAP participants have had problems adding sources and types when setting up the crawler. We are aware of a UX issue; until we roll out a better experience, I have added a post on how to work around it.

    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Jordan Brown

    Content that is in the page source on initial load is crawled even if it's hidden by something like an accordion, but the crawler does not crawl content that is dynamically rendered after the initial page load or rendered by JavaScript, since the crawler does not execute JavaScript (added to the post).
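
    A minimal illustration of the difference:

        <!-- Indexed: present in the initial page source, merely hidden by an accordion -->
        <div class="accordion-panel" hidden>
          This text is in the HTML on first load, so the crawler sees it.
        </div>

        <!-- Not indexed: injected after load, because the crawler does not execute JavaScript -->
        <script>
          document.body.insertAdjacentHTML("beforeend", "<p>Rendered at runtime; invisible to the crawler.</p>");
        </script>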

    I'm not super familiar with metafields, though I can see that at least sometimes they are rendered on the initial page load and would thus be indexed, but whether that is always the case I can't say.

    0
  • Korak Purkayastha

    Hi,

    The crawler is failing to index some of our pages. The error description in the CSV report says "Locale not detected". However, when I visit these pages, I do see that <html> tag has a valid lang attribute. What could be the issue here?

    Thanks in advance for your feedback.

    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Hi Korak Purkayastha

    This may be because the lang tag that is there does not match a locale in your help center(s). I will publish a post with more details ASAP, but in short, the crawler tries to detect the locale of the pages it crawls and match it with the locales enabled in the Guide account the crawler belongs to. If there is no match, the page will not be indexed.
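
    For example (illustrative value): a page served as below will only be indexed if Danish (da) is among the locales enabled in the Guide account the crawler belongs to; otherwise it is skipped with "Locale not detected":

        <!-- Matched against the Guide account's enabled locales -->
        <html lang="da">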

    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Hey all,

    I wanted to let you know that we have added the ability to edit a crawler that is already created.

    You can now edit:

    • Sitemap URL
    • Source
    • Type
    • Owner

    You access it by clicking the overflow menu icon on the crawler and then "Details" in the "Crawler" tab.

    Full navigation:

    Settings --> Search settings --> Federated search --> Crawler --> Overflow menu icon --> Details


    0
  • Jordan Brown

    Are there plans to remove the 3,000 record limit at some point? When will this flow through to messaging as well?

    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Hi Jordan,

    1. We are looking at the 3,000 record limit. How many pages do you need to index?
    2. I can't give you an ETA at this point for when external content can be served via the Answer Bot that powers messaging answer suggestions, but it is part of our roadmap. We expect to be able to serve external content in the Web Widget within 6 months, but as always this is not a promise, just a target. Did I understand your question right?
    0
  • Malcolm Walker

    We have been struggling with the search crawler, which is not able to completely scan our shared documentation site.  Each product in our product line has its own subsite on the documentation site, e.g. https://docs.ourbrandname.com/productname/

    There is a top-level sitemap.xml, on https://docs.ourbrandname.com/sitemap.xml, which contains <loc> tags referencing subsite sitemaps, like https://docs.ourbrandname.com/productname/sitemap.xml
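
    For reference, a root sitemap of that kind is a standard sitemaps.org sitemap index, roughly:

        <?xml version="1.0" encoding="UTF-8"?>
        <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
          <sitemap>
            <loc>https://docs.ourbrandname.com/productname/sitemap.xml</loc>
          </sitemap>
          <!-- ...one <sitemap> entry per product subsite... -->
        </sitemapindex>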

    When I configure it as I would like, with a single crawler pointing at the root level sitemap.xml, the crawl fails with simply

    Sitemap setup:  Failed

    I think this is happening because there are no HTML pages in this sitemap, only pointers to other sitemaps, but it is impossible to tell with this error message.  Is there a way to know for sure this is the problem?

    When I configure the crawler to use the product-specific sitemap, I am able to successfully crawl most of the content. Some pages return a 302 redirect, but this does not seem to be a major issue (those pages are for embedding the documentation in another site, and don't need to be indexed).

    Since I can configure a crawler for each product, I considered working around the first issue by adding one crawler per product. There are 17 products on this site.

    However, if I configure multiple crawlers for the same site, each crawler generates a unique zd-site-verification value, rather than re-using the one for the site that was already defined.
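
    Concretely, assuming the verification is done via a meta tag named zd-site-verification (as the value name suggests), the head of every page ends up needing one tag per crawler, something like (hypothetical tokens):

        <head>
          <meta name="zd-site-verification" content="token-for-crawler-1">
          <meta name="zd-site-verification" content="token-for-crawler-2">
          <!-- ...one token per configured crawler... -->
        </head>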

    Is there a way to configure this?  Will Zendesk add support for referenced sitemap files?  I would love to give our users the ability to find content in our product documentation from our help center!

    1
  • Eli Towle

    I also think the crawler would benefit from more detailed troubleshooting information. The sitemap I tried to use initially failed. When I viewed the crawler at Guide Admin>Settings>Search settings>Federated Search>Crawlers, I saw the error message:

    Sitemap fetch failed:
    The crawler has not been able to access or correctly parse the sitemap on <link to sitemap>. Take a look at the content crawler troubleshooting article to solve the issue. <link>

    It took us some time to figure out why the sitemap failed. It turns out the dates in the <lastmod> tags were not in proper W3C datetime format.
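
    For anyone hitting the same issue, valid W3C datetime values in a sitemap look like this (either a plain date or a full timestamp with timezone offset):

        <url>
          <loc>https://example.com/page</loc>
          <!-- a date alone, e.g. 2022-03-15, is also valid W3C datetime -->
          <lastmod>2022-03-15T09:30:00+00:00</lastmod>
        </url>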

    My comments:

    1. As noted in the post above, the email notification for a failed sitemap configuration simply states Sitemap setup: Failed. It would be useful if the more detailed error message displayed in Guide Admin were included in this email.
    2. The error message in Guide Admin is followed by a link that I expected to take me to a content crawler troubleshooting article. However, the link takes me to the same crawler details page I am already on.
    3. It would be very helpful if the error messages were more specific and clear (line numbers for parsing failures, the response status code for access failures, etc.).
    0
  • Cedric Duffy

    Hi. I'm also getting an error "The sitemap for (my source) couldn't be processed. Review your sitemap setup to make sure the latest content appears in your help center search results."

    The sitemap I'm trying to add is being generated by a third party tool (Document360). It's formatted as such, starting with: <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

    And then within that <urlset> tag, has individual URLs contained within <url> tags, with the URL itself in a <loc> tag, then <lastmod> and <changefreq> tags.
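
    Reconstructed from that description, the Document360 sitemap is a plain urlset, roughly (illustrative URL and values):

        <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
          <url>
            <loc>https://docs.example.com/article-slug</loc>
            <lastmod>2022-03-10</lastmod>
            <changefreq>weekly</changefreq>
          </url>
        </urlset>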
     
    I'm not really an expert on sitemap formatting or these standards, so I might just be missing something obvious. Is this just not going to work? The error message and the documentation provided haven't really gotten me anywhere. Do I need to convert this sitemap somehow, or are there plans to expand the crawler to cover this format of sitemap?
    0
  • Dan Cooper
    Community Moderator

    It appears that the number of records we have access to has increased from 3,000 to 50,000 in a few screens, but some places in the documentation still seem to indicate 3,000 is the norm. Can we get confirmation that, from a usage standpoint, 50,000 is supported? We are looking to coordinate with some internal teams that own the external content and might need to take a different path in working with them if we need to account for running into the 3,000 limit.

    Understanding that this is EAP, these are the areas where we are still seeing the 3000 record number:

    • On the Configure Federated Search screen we see 3,000 in the tooltip (the tag next to records remaining reads as 50,000 of 50,000)
    • On the Guide Product limits for your help center article, we are still seeing 3,000 listed
    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Malcolm Walker first of all, apologies for the very long wait!

    When I configure it as I would like, with a single crawler pointing at the root level sitemap.xml, the crawl fails with simply

    Sitemap setup:  Failed

    I think this is happening because there are no HTML pages in this sitemap, only pointers to other sitemaps, but it is impossible to tell with this error message.  Is there a way to know for sure this is the problem?

    Will Zendesk add support for referenced sitemap files?  I would love to give our users the ability to find content in our product documentation from our help center!

    You are correct in your assumption: the crawler does not currently support sitemap indexes, so the workaround right now is as you describe.

    We are aware of this limitation and will prioritise the issue in relation to other missing functionality and bugs. I cannot promise this particular issue will be prioritised before we make the crawler generally available, but if not, it is not because we will not add support for sitemap indexes; we will just do it after making the feature generally available.

    However, if I configure multiple crawlers for the same site, each crawler generates a unique zd-site-verification value, rather than re-using the one for the site that was already defined.

    Is there a way to configure this? 

    Not right now. Currently each crawler has its own verification code. I completely understand that it is more cumbersome to have to add a verification code for each crawler instead of for each site. This decision was made based on a tradeoff between the benefit and the cost to implement. We decided to initially launch with one tag per crawler, but we have this issue in our backlog for further improvement of the crawler.

    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Eli Towle apologies for the very long wait!

    Thank you for your very detailed feedback. I will lift that straight to a backlog item.

    We are working on better troubleshooting documentation, and apologies for it taking so long. The link in the product points to this article as a second-best option, since we don't have the troubleshooting documentation in place yet, but it will point to a dedicated troubleshooting article once that is ready.

    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Cedric Duffy Again apologies for the very long wait!

    We are looking into your issue and will provide an answer soon.

    0
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Dan Cooper The limit has been increased to 50,000 records. The update to the documentation seems to have gotten stuck somewhere in the update workflow. I will follow up on it, but 50,000 is the new limit.

    1
  • Gorka Cardona-Lauridsen
    Zendesk Product Manager

    Cedric Duffy We are expecting to deploy a fix today or tomorrow at the latest. Could you let me know in a couple of days whether it works?

    0
  • Korak Purkayastha

    Is it possible to set up search crawlers in different Zendesk accounts that would use the same sitemap from the external site? We wanted to replicate the federated search in a different Zendesk subdomain, and wanted to confirm whether we would need a separate sitemap for that.

    0

