Tallan's Technology Blog


Crawling RSS Feeds in FAST Search for SharePoint

Overview

The most effective way to crawl RSS content in a SharePoint / FAST farm is to use the FAST Web Crawler. The FAST Web Crawler is a component supplied with FAST that is administered entirely outside of SharePoint, on the FAST server itself.

Configuration

To configure an RSS crawl, you must first set up some XML configuration files.

First, copy CrawlerCollectionDefaults.xml.generic.template from \FASTSearch\META\config\profiles\default\templates\installer to your \FASTSearch\etc folder.

Next, make a copy of \FASTSearch\etc\CrawlerConfigTemplate-RSS.xml and save it in \FASTSearch\etc under a unique name. (I use \FASTSearch\etc\rss.xml.)
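If you would rather script the two copy steps than do them by hand, here is a minimal Python sketch. The function name copy_crawler_templates and its parameters are my own for illustration; the file names and folders are the ones given above. It assumes you pass in the root of your FAST installation (for example C:\FASTSearch).

```python
import shutil
from pathlib import Path


def copy_crawler_templates(fast_root, config_name="rss.xml"):
    """Copy the two template files into <fast_root>/etc and return the new paths.

    fast_root and config_name are illustrative parameters, not FAST settings.
    """
    root = Path(fast_root)
    etc = root / "etc"

    # Step 1: collection defaults template -> etc folder (file name unchanged).
    defaults_src = (root / "META" / "config" / "profiles" / "default"
                    / "templates" / "installer"
                    / "CrawlerCollectionDefaults.xml.generic.template")
    defaults_dst = etc / defaults_src.name
    shutil.copyfile(defaults_src, defaults_dst)

    # Step 2: RSS crawl template -> a uniquely named working copy.
    rss_dst = etc / config_name
    if rss_dst.exists():
        # Refuse to overwrite an existing configuration file.
        raise FileExistsError(f"{rss_dst} already exists; pick another name")
    shutil.copyfile(etc / "CrawlerConfigTemplate-RSS.xml", rss_dst)
    return defaults_dst, rss_dst
```

The sketch refuses to overwrite an existing working copy, so running it twice will not clobber a configuration you have already edited.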

Now, open the file in a text editor.

Find the DomainSpecification line. Ensure the name attribute matches the content collection you crawl into (“sp” by default for most people), or you will be unable to query your results.

<DomainSpecification name="sp">
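A quick way to double-check this value without opening the file is to read it with Python's standard XML parser. This is only a sketch: the helper name domain_specification_names is mine, and the wrapping CrawlerConfig element in the usage example is an assumption about the template's layout, not something the helper relies on.

```python
import xml.etree.ElementTree as ET


def domain_specification_names(config_xml):
    """Return the name attribute of every DomainSpecification element."""
    root = ET.fromstring(config_xml)
    # iter() also visits the root element, so this works whether
    # DomainSpecification is the root or nested inside another element.
    return [el.get("name") for el in root.iter("DomainSpecification")]


# Usage with a hypothetical, heavily trimmed configuration:
sample = '<CrawlerConfig><DomainSpecification name="sp"/></CrawlerConfig>'
print(domain_specification_names(sample))
```

If the list does not contain the name of your content collection, fix the name attribute before adding the configuration to the crawler.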

 

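Next, set the RSS URL(s) to crawl. In the ‘rss’ section, add a <member> line for each RSS feed to crawl. Note that the sample <member> line shown below is commented out; your own <member> lines must sit outside the comment markers.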

<section name="rss">
    <!-- List of start (seed) URIs pointing to RSS feeds. -->
    <attrib name="start_uris" type="list-string">
        <!-- <member> http://www.contoso.com/feed.rss </member> -->
    </attrib>
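If you have a long list of feeds, the <member> lines can be generated rather than typed. The sketch below uses Python's standard xml.etree.ElementTree module; the function name add_feed_uris is my own, and it assumes the rss section is nested somewhere below the root element of the configuration file.

```python
import xml.etree.ElementTree as ET


def add_feed_uris(config_xml, feed_uris):
    """Return config_xml with one <member> per feed URI added to start_uris."""
    root = ET.fromstring(config_xml)
    start_uris = root.find(".//section[@name='rss']/attrib[@name='start_uris']")
    if start_uris is None:
        raise ValueError("no start_uris attrib found in the rss section")

    # Skip feeds that are already listed so the function can be re-run safely.
    existing = {(m.text or "").strip() for m in start_uris.findall("member")}
    for uri in feed_uris:
        if uri not in existing:
            member = ET.SubElement(start_uris, "member")
            member.text = f" {uri} "
            existing.add(uri)
    return ET.tostring(root, encoding="unicode")
```

Be aware that ElementTree discards XML comments when it parses a document, so the output will no longer contain the explanatory comments from the template. If you want to keep those, edit the file by hand as described above.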

 

If any of your feeds require authentication, you’ll need to add the following markup OUTSIDE of the rss section:

 

<section name="passwd">
    <attrib name="http://www.contoso.com/confidential1/" type="string">
      <username>:<password>:<realm>:<authScheme>
    </attrib>
</section>

For each URL requiring authentication, add an attrib line:

<attrib name="%URL%" type="string">%Credentials%</attrib>

The credentials can use the following formats:

 

  • <username>:<password>
  • <username>:<password>:<realm>:<authScheme>

When the first format is used, Basic authentication is applied automatically. To use any other authentication scheme, use the second format.

 

For <realm>, enter the domain.

For <authScheme>, you can use any of the following:

 

  • basic
  • digest
  • ntlmv1
  • ntlmv2
  • auto (the web crawler determines the scheme automatically)
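The two credential formats and the list of schemes above can be captured in a small helper that builds the attrib line for you and rejects a misspelled scheme. This is a sketch only: the names credential_string and passwd_attrib are hypothetical, and it does not try to handle colons inside a user name or password.

```python
from xml.sax.saxutils import escape, quoteattr

# The authentication schemes listed above.
AUTH_SCHEMES = ("basic", "digest", "ntlmv1", "ntlmv2", "auto")


def credential_string(username, password, realm=None, auth_scheme=None):
    """Build either the short or the long credential format."""
    if realm is None and auth_scheme is None:
        # Short format: Basic authentication is applied.
        return f"{username}:{password}"
    if realm is None or auth_scheme is None:
        raise ValueError("realm and auth_scheme must be given together")
    if auth_scheme not in AUTH_SCHEMES:
        raise ValueError(f"unknown auth scheme: {auth_scheme!r}")
    return f"{username}:{password}:{realm}:{auth_scheme}"


def passwd_attrib(url, credentials):
    """Return one attrib line for the passwd section, XML-escaped."""
    return (f"<attrib name={quoteattr(url)} type=\"string\">"
            f"{escape(credentials)}</attrib>")


# Usage with made-up credentials:
print(passwd_attrib("http://www.contoso.com/confidential1/",
                    credential_string("user", "secret", "CONTOSO", "ntlmv2")))
```

Escaping matters here because a password containing characters such as & or < would otherwise make the configuration file invalid XML.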

 

Save the configuration file.

 

Many other configuration options are available, including options for including or excluding domains from the crawl, setting crawl activity limits, and more.

See the Web Crawler XML configuration reference on MSDN for more information.

Setting up the Crawl

Open a FAST Search PowerShell command window as Administrator.

Execute the following command:

crawleradmin.exe -f \fastsearch\etc\rss.xml  (or whatever you named your configuration file)

You should see the following:

[Screenshot: crawleradmin output after adding the configuration]

Next, execute the following command to review the status of the crawl:

crawleradmin --status

You should see output showing all FAST Web crawler collections and their statuses:

[Screenshot: crawleradmin status output]

For more detailed statistics on your FAST Web Crawler collection, execute the following command:

crawleradmin -q sp (or the name of your content collection)

You will see output showing detailed statistics:

[Screenshot: crawleradmin statistics output]
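The three crawleradmin calls in this section can also be driven from a script, which is handy if you re-add configurations often. The sketch below only wraps the commands shown above; the function names are my own, and it assumes crawleradmin.exe is on the PATH, as it is in the FAST Search PowerShell window. Note that the options use plain ASCII hyphens; if you copy a command from a web page, check that the hyphens have not been converted to dashes.

```python
import subprocess


def add_config_args(config_path):
    """Arguments for adding a crawl configuration file."""
    return ["crawleradmin.exe", "-f", config_path]


def status_args():
    """Arguments for listing all crawl collections and their statuses."""
    return ["crawleradmin.exe", "--status"]


def statistics_args(collection="sp"):
    """Arguments for detailed statistics on one crawl collection."""
    return ["crawleradmin.exe", "-q", collection]


def run(args):
    """Run one command and return its text output; raises on a non-zero exit."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout
```

For example, run(add_config_args(r"\fastsearch\etc\rss.xml")) followed by run(status_args()) mirrors the first two steps above.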

 

Verifying the Crawl

Fetch Logs

To ensure the crawl is running successfully, open the most recent log in the following folder:

C:\FASTSearch\var\log\crawler\node\fetch\sp (or the name of your content collection)

If the crawl has been successful, you should see rows showing 200 as the HTTP status code, as below.

[Screenshot: fetch log rows with status 200]

If unsuccessful, you will likely see 401 errors or other HTTP error codes indicating issues with the crawl.
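For a large fetch log, a quick tally of status codes is easier than scanning by eye. The sketch below is a rough aid, not a parser for the official log format: it assumes each log row is whitespace-separated and that the first standalone three-digit token between 100 and 599 is the HTTP status code. Check that assumption against your own log before trusting the numbers. The function name http_status_counts is mine.

```python
import re
from collections import Counter

# A token that looks like an HTTP status code (100 to 599).
STATUS_TOKEN = re.compile(r"[1-5][0-9]{2}")


def http_status_counts(lines):
    """Count the first status-like token found on each log row."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if STATUS_TOKEN.fullmatch(token):
                counts[int(token)] += 1
                break  # only the first matching token on a row is counted
    return counts
```

You can pass an open file object directly, for example the most recent file in the fetch log folder mentioned above. A result dominated by 200 suggests a healthy crawl, while a large 401 count points at the passwd section of your configuration.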

 

QRServer

On your FAST server, navigate to http://localhost:13280.

[Screenshot: QRServer search page]

In the FQL Query box, enter a query to search for content that would be in the RSS articles you crawled.

If successful, you will see results from your crawled RSS content.

Please feel free to ask any questions in the comments below.
