Crawling RSS Feeds in FAST Search for SharePoint
The most effective way to crawl RSS content in a SharePoint/FAST farm is to use the FAST Web Crawler, a component supplied with FAST that is administered entirely outside of SharePoint, on the FAST server itself.
To configure an RSS crawl, you must first set up some XML configuration files.
First, copy CrawlerCollectionDefaults.xml.generic.template from \FASTSearch\META\config\profiles\default\templates\installer to \FASTSearch\etc.
Next, make a copy of \FASTSearch\etc\CrawlerConfigTemplate-RSS.xml and save it in \FASTSearch\etc with some unique name. (I use \FASTSearch\etc\rss.xml.)
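In a FAST Search PowerShell window (opened as Administrator, as described later in this article), those two copy steps might look like the following sketch. It assumes FAST is installed at C:\FASTSearch and uses rss.xml as the example target name:

```powershell
# Copy the generic collection defaults template into \etc (path assumes a C:\FASTSearch install)
Copy-Item "C:\FASTSearch\META\config\profiles\default\templates\installer\CrawlerCollectionDefaults.xml.generic.template" "C:\FASTSearch\etc\"

# Make a uniquely named working copy of the RSS template (rss.xml is just the example name)
Copy-Item "C:\FASTSearch\etc\CrawlerConfigTemplate-RSS.xml" "C:\FASTSearch\etc\rss.xml"
```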
Now, open the file in a text editor.
Find the Domain Specification line. Ensure its name attribute matches the content collection you crawl into (“sp” by default for most installations); otherwise you will be unable to query your results.
Next, set the RSS URL(s) to crawl. In the ‘rss’ section, add a <member> line for each RSS feed you want to crawl.
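A hedged sketch of the relevant parts of the file is shown below. The collection name sp and the feed URLs are placeholders, and the attrib name start_uris is an assumption; verify the element names against your copy of the template:

```xml
<CrawlerConfig>
  <!-- The name must match the content collection you crawl into -->
  <DomainSpecification name="sp">
    <section name="rss">
      <!-- One <member> per RSS feed to crawl; these URLs are examples -->
      <attrib name="start_uris" type="list-string">
        <member>http://www.example.com/feed.rss</member>
        <member>http://blogs.example.com/news/rss.xml</member>
      </attrib>
    </section>
  </DomainSpecification>
</CrawlerConfig>
```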
If any of your feeds require authentication, you’ll need to add the following markup OUTSIDE of the rss section:
For each URL requiring authentication, add an attrib line:
<attrib name="%URL%" type="string">%Credentials%</attrib>
The credentials can use either of the following formats:
- username:password
- username:password:realm:authScheme
When the first format is used, Basic authentication is automatically applied. To use any other authentication scheme, use the second format.
For the realm field, enter the domain.
For the authScheme field, you can use any of the following:
- auto (the Web Crawler will automatically determine the scheme)
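Putting the pieces together, a sketch of the authentication markup might look like the following. The passwd section name is an assumption and should be verified against the configuration reference; the URLs, account name, and domain are placeholders:

```xml
<!-- Placed outside the rss section, still inside the DomainSpecification -->
<section name="passwd">
  <!-- Short format: Basic authentication is applied automatically -->
  <attrib name="http://www.example.com/secure/feed.rss" type="string">svc_crawl:P@ssw0rd</attrib>
  <!-- Long format: realm (domain) and authScheme are given explicitly -->
  <attrib name="http://intranet.example.com/feed.rss" type="string">svc_crawl:P@ssw0rd:CONTOSO:auto</attrib>
</section>
```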
Save the configuration file.
There are many other configuration options available, such as including or excluding domains from the crawl, setting crawl activity limits, and more.
See the Web Crawler XML configuration reference on MSDN for more information.
Setting up the Crawl
Open a FAST Search PowerShell command window as Administrator.
Execute the following command:
crawleradmin.exe -f \FASTSearch\etc\rss.xml (or whatever you named your configuration file)
You should see output confirming that the collection configuration was added.
Next, execute the following command to review the status of the crawl:
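Assuming the standard crawleradmin option set, the status check is:

```shell
crawleradmin --status
```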
You should see output showing all FAST Web crawler collections and their statuses:
For more detailed statistics on your FAST Web Crawler collection, execute the following command:
crawleradmin -q sp (or the name of your content collection)
The output shows detailed statistics for the collection.
Verifying the Crawl
To ensure the crawl is running successfully, open the most recent log in the following folder:
C:\FASTSearch\var\log\crawler\node\fetch\sp (or the name of your content collection)
If the crawl has been successful, you should see rows indicating 200 as the HTTP status code.
If unsuccessful, you will likely see 401 errors or other HTTP error codes indicating issues with the crawl.
On your FAST server, navigate to http://localhost:13280.
In the FQL Query box, enter a query to search for content that would be in the RSS articles you crawled.
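For example, if one of the crawled feeds contained articles about SharePoint, a minimal FQL query might be (the search term is a placeholder):

```
string("sharepoint")
```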
If successful, you will see results from your crawled RSS content.